TSQL Tuesday 95: Big Data

This month’s party brought to you by Mr. Hammer (b|t).

mc-hammer
No, not THAT one…

I apologize in advance for all the hammertime memes.  It was just too good to pass up.  Surely he must be used to this.  Or at least not surprised by it.  =D

So, Big Data.  What is it?  Well, in simple terms, it’s the realization and acceptance of the fact that data is multi-model, multi-faceted, multi-sourced, and constantly growing.  It’s the fact that the traditional RDBMS is no longer the be-all end-all source of truth and valuable information.  It’s part of a larger ecosystem involving JSON document stores, CSV files, streaming volatile bits of data coming from random devices and user activity that loses its meaning and potential impact almost as quickly as it can be gathered and sifted and stored.

But what do we actually get out of it?  As a small-medium enterprise NOT in the software business, I have to say, not as much as the hype would have us believe.  And look, I’m not so jaded and crusty that I refuse to adapt new tech.  I Just haven’t seen a meaningful transformative business use-case for it.  Sure, we have Google Analytics telling us how our websites are doing, and someone in marketing knows something about trending our social media traffic.  Does it really help us make more money?  Heck if I know.

cease thy actions, my timepiece has indicated the necessity of mallets
Old-timey colonials can even dig it…

Here’s what I’d like to see from the thought leaders.  Give me something I can chew on — a real-world, non-hypothetical, non-frivolous, impactful use-case for adopting and implementing something like Hadoop/Spark or Azure Data Lake.  Show me how my business can realistically journey down the path of predictive analytics and what it’s going to take from our Devs, IT staff, and management to actually get there.

Because they don’t get it yet.  I have managers still worrying about how much we’re spending on a dinky little flash storage array to support the growing needs of our on-prem converged infrastructure stack.  Meanwhile the AWS bill continues to baffle, and Devs want to play with Docker and Lambda.  But we can’t seem to convince the higher-ups that they’re short-staffed on the internal-apps team, even after a minor version upgrade takes 4 hours of Ops time and half a dozen end-users doing post-mortem testing just to be sure we didn’t break anything unexpected.

I’m not here to complain.  Really.  I do want to see something amazing, something inspiring, something that shows me what Big Data truly brings to the table.  And sure, I’ve see the vendor demos; they’re all just a bit outlandish, no?  I mean, they look really cool, sure — who doesn’t want to see a chord diagram of who’s killed who is GoT? — but does that really help my business improve sales and productivity?

My point is, there’s a gap.  A chasm of misunderstanding and mis-matched expectations between what management thinks Big Data is/means, and what it takes to actually implement.  They see the pretty pictures and the fancy demos, but they don’t see the toil and sweat (or at least, in the cloud, gobs of cash) that go into building & operating the underpinnings and pipelines that drive those nice graphics.  Not to mention the fundamental issues of data quality and governance.

continue not, time for hammer it is
OK OK, last one, I swear…

So do us a favor, Big Data pundits.  Show us something real, something that “the little guy” can use to up his/her game in the market.  Something that makes a positive impact on small non-startup non-software businesses with understaffed IT & Dev teams.  But more importantly, stop glossing over the effort and resources that it takes to “do Big Data right“.  Managers and executives need to understand that it’s not magic.  And IT practitioners need to understand that it’s actually worth-while.  Because I believe you — really — that the payoff in the end is there, and is good.  But you need to convince the whole stack.


PS: I know this is a fully day late for T-SQL Tuesday, and as such, I wasn’t going to post a ping-back in the comments of the invite, but then I saw there were only 8 others, so I felt it would benefit the event if I did add my late contribution.  I’ll tweet with a modified hash-tag instead of the standard #tsql2sday, to reflect my late-ness.  Hopefully that’s a fair compromise to the community & the event’s intentions.  =)

Advertisements

Dirty Laundry

It’s time for a more thought-y, less tech-y post.  Which is mostly my excuse for not wanting to write a bunch of code at the moment.  But that’s how I started this blog, with mostly opinion pieces, trying to offer some critical thinking on how DBAs and Developers work together.  So y’all better like it!

Today’s title is brought to you by Don Henley’s tune of the same name, which is now stuck in my head, thankyouverymuch.

dirty laundry goes in a basket not in a database
Paint.net is my friend… =D

This is about data quality.  When you have “dirty data”, just like dirty laundry, and you let it sit unattended, it starts to smell.  In software, this means the “badness” seeps into other areas of the environment, affecting systems and business processes that should otherwise function smoothly.

code smell is a surface indication that usually corresponds to a deeper problem in the system.

-Martin Fowler

And, more aptly:

Data quality is corporate America’s dirty little secret.

-Paul Gillen

But what is dirty data?  Generally, it’s anything that doesn’t quite fit the ideal data model — that perfect vision of how all the bits of information in the system fit together, the shape of each data entity and how they relate to each other.  Mostly, dirty data is what happens when you allow users to type things into text-boxes, and you write those text-box contents straight into the database without any layers of validation or cleansing.  (Coincidentally, that’s also how SQL injection happens, but most of us have been scared-straight by enough years of security bloggers hammering at our thick skulls — and our favorite XKCD — that we at least sanitize our inputs before dumping them to an INSERT statement.)

Let me take a recent example from my experience.  We have an ERP system that doubles as our CRM system (which is already a pair of bad idea jeans).  How do you think customer information gets into the database?  Customer Service Reps, typing stuff.  Usually by copying from a paper form.  Or the customers themselves, using an online form.  But guess what doesn’t happen in either case?  If you said “USPS address validation“, give yourself a hand!

joker give yourself a clap
Oh goooood for youuuuuu…. </Christian Bale>

Now, being that this system is our “source of truth” for customer info, it stands to reason that lots of other business functions & processes depend on it.  For example, let’s say we send a promotional calendar to our customers of a certain “subscription level” on a yearly basis.  We’re not in the publishing business, so we contract this out to another company.  But guess what they need from us in order to complete the job and mail out those calendars?  Addresses!  So what happens when there’s a bad address in our database?  A calendar gets returned, wasted cost and materials.  Multiply that by a couple thousand and you start to turn a few heads in the C-suite.

Later, around the Marketing table, someone has a brilliant idea that they need to run a mail-merge to send out a gift-package to the top 100 customers.  So they ask the DBA for a list of said customers.  “Sure!  Here ya go, here’s a report.”  And then the complaints start coming in.

“These customers aren’t active anymore.”

Then tell your CS reps to mark them as inactive in the system.  But no, we don’t do that, we just write “inactive” in the FirstName field.

“These ones are employees.”

Fine, figure out a special indicator to add for that, so I can exclude them from the report.  But no, of course, we can’t do that either; we just put “deactivated” in the FirstName field.

“This guys is dead.”

Yeah, not even kidding.  Apparently the powers-that-be decided to keep his info in the system, but type in “deceased” to the “Address 2” line (in the US, this is customarily the apartment/suite/unit number).

he's dead jim
Let’s beam him back up but write “deceased” on his badge, that’ll be sufficient.

But mostly, the biggest complaint is that we’re getting un-deliverable/return-to-sender when we try shipping out to some of these addresses.  And why?  Because they’re not subject to any external validation and quality-control.

So what’s the data professional’s responsibility in this?  In my opinion, it’s to advocate for data quality.  There are obviously big vendors out there like Melissa Data who will sell you a service to help get you there.  APIs abound, from USPS and other official sources, so building it isn’t out of the question.

One potential roadblock is, as usual, conservatism.  The business’s ERP system is its life-blood, highly sensitive to change and very guarded by over-protective management and finicky executives.  But the smelly dirty data-laundry continues to cause problems and has real-money impacts on corp. efficiency and profit.  Unfortunately, many people tend to take the ostrich approach.

if you bury your head in the sand your ass will get burnt
No idea who this Bennett person is, but they sound smart.

So, my good people, start “doing your laundry”.  Have those conversations with your teams and managers about the current state of your data quality, and what it’s going to look like moving forward.  Make some plans, have a road-map, and understand that it’s going to involve a lot of collaboration between key players.  And good luck!

Hacking RedGate DLM Dashboard

This is a work-in-progress, but I thought I’d put it out there because it may be interesting to others who use the tool.

This is a “teaser” in that I’m not done writing the content yet, but I wanted to get it out there as “something I’m working on”.  The problem at-hand is that we have a table trigger which is constantly being disabled/enabled by a stored-proc , which in turn controls the only acceptable method of updating the columns that would otherwise be forbidden from update by said trigger.  Clear as mud?  K.  Basically there’s a schema change (“drift”) of a certain specific type that I want DLM Dashboard to always ignore on this DB.

Outline:

  1. Basic architecture
    1. RavenDB back-end
    2. SQL db, tables, procs, & triggers (“installation”)
  2. How it does event tracking
    1. Jessica = awesome support agent
  3. Modifying the event-reader to exclude app-names and other things
    1. XML event data
    2. XML indexes to help filter queries
    3. https://www.simple-talk.com/sql/database-administration/getting-started-with-xml-indexes/
  4. Schema-compare files to filter results
  5. Wholesale ignoring of a change-set (if possible)

Some code:

USE RedGate;
EXEC dbo.RG_SQLLighthouse_ReadEvents;

SELECT TOP 100 * FROM RedGate.SQLLighthouse.DDL_Events de
ORDER BY de.PostTime

SELECT TOP 100 * FROM RedGate.SQLLighthouse.DDL_Events de
WHERE de.appname NOT IN ('SQLServerCEIP', 'Spotlight Diagnostic Server (Monitoring)')
AND de.appname NOT LIKE 'SQLAgent%'
ORDER BY de.PostTime DESC

ALTER TABLE SQLLighthouse.DDL_Events DROP CONSTRAINT PK__DDL_Even__3213E83FB62EF9A3;

ALTER TABLE SQLLighthouse.DDL_Events ADD CONSTRAINT PK_DDL_Events PRIMARY KEY NONCLUSTERED (id);

CREATE CLUSTERED INDEX CX_DDL_Events_PostTime ON SQLLighthouse.DDL_Events
 (PostTime);

CREATE INDEX IX_DDL_Events_SPID ON RedGate.SQLLighthouse.DDL_Events
 (spid)
 INCLUDE (transaction_id, options, nestlevel);
CREATE INDEX IX_DDL_Events_AppName ON RedGate.SQLLighthouse.DDL_Events
 (appname)
 INCLUDE (client_net_address);

CREATE PRIMARY XML INDEX XI_DDL_Events_EventData ON SQLLighthouse.DDL_Events
 ([eventdata]);

CREATE XML INDEX XI_DDL_Events_EvenData_PATH ON SQLLighthouse.DDL_Events
 ([eventdata])
USING XML INDEX XI_DDL_Events_EventData
 FOR PATH;

CREATE XML INDEX XI_DDL_Events_EvenData_PROPERTY ON SQLLighthouse.DDL_Events
 ([eventdata])
USING XML INDEX XI_DDL_Events_EventData
 FOR PROPERTY;

TBD, stay tuned!

Stupid Defaults

Every big enterprise product has them.  I just happen to pick on SQL because it’s my area of expertise.

This week’s topic was triggered by an amazing quote from Adam Machanic (b|t) in the sqlcommunity Slack group, where he was trying to help someone troubleshoot a slow linked-server query:

That default, I consider to have been checked into the SQL Server codebase directly by Satan.

satan-costume-guy-laughing
I imagine he looked something like this… Satan, not Adam.  =D

He’s referring, in this case, to the default option of remote proc transaction promotion, or “Enable Promotion of Distributed Transactions for RPC” in the GUI, which is set to True by default.  Admittedly, linked-servers are a dicey area of the technology and not everybody needs them, but when you do need to deal with them, there are some caveats to consider and a few potholes to avoid.

I won’t go into gory details, but the mile-high gist of it is that you should probably change a couple of the default linked-server settings when you’re going from MSSQL to MSSQL (which most of the time, you are): rpc and rpc out both to true, and the above dude to false.  The reasons behind that are subtle, but it boils down to usage patterns:  If you’re purely running basic SELECT statements against the remote server, you can leave this stuff alone; but if you want to do cool stuff like exec MyLinkedServer.RemoteDB.sys.sp_executesql '--some awesome dynamic-sql'​, you’ll want to change these.  (That last bit comes in especially handy if you’re building a dynamic query string, then running it against the remote server to bring in the results to a #temptable to further massage/mangle/munge said data.)

Even though you probably really shouldn’t be doing that in the database (that’s what web server farms are for!).

stack of colorful rings with arrow pointing up
push it up the stack!

So, what are some other “stupid defaults” in SQL Server?

Every big enterprise product has them.  I just happen to pick on SQL because it’s my area of expertise.  And it’s not even just “defaults”; there are some options which shouldn’t even be a thing — they should be completely and irrevocably in love with Edward removed from the product.  Yet, because the RDBMS tech space is infamously conservative and slow-to-change (the accepted euphemism is “mature“), these options and defaults have remained in the product despite our best attempts to convince MS that they’re heinous.

1. Parallelism settings (server/instance level)

Your servers have many-core CPUs, right?  And you want SQL to utilize those cores to the best of its ability, distributing the many users’ workloads fairly amongst them, yes?  Damn right, you paid $3k or more per core in freaking licensing costs!  “OK”, says SQL Server, “I’ll use all available CPUs for any query with a ‘cost’ over ‘5’“.  (To give context here, in case you’re not aware, ‘5’ is a LOW number; most OLTP workload queries are in the double to triple digits).  “But wait!”, you protest, “I have more than 1 user, obviously, and I don’t want their horrible queries bringing all CPUs to their knees and forcing the 50 other user queries to wait their turn!”

well... tough shit
-SQL Server

Hardly seems fair, no?  But those are the defaults, have been for over a decade, despite how much hardware has evolved since.  Just for completeness, I am talking about the following 2 options, which you can view by right-clicking on the instance in SSMS, or by running sys.sp_configure.  They are max degree of parallelism and cost threshold for parallelism, respectively (# cores to use, and how expensive the query should be to “go parallel” — as opposed to serial/single-threaded, which is perfectly fine for those itty-bitty-teeny-tiny queries that are so quick that 1 CPU core is more than enough to handle them).  We commonly abbreviate the former as maxDOP; the latter is less commonly abbreviated, but because I’m a sucker for acronyms, I like to call it CTFP, or maybe costFP if I’m feeling verbose.

Now obviously you can, and should, change these settings.  “To what?” you ask.  Well, the answer, as always my friend, is “It Depends ©®™“… a perennial favorite of DBAs and consultants alike.  But don’t ask me — there are plenty of people much smarter than I with blog posts on the topic.  If you put a gun to my head, for CTFP, I’d say “pick your favorite number between 50 and 100, start there, and test to see if it makes your server CPUs happy or sad”.  And for maxDOP I’d say “divide your # of CPU cores by 2 (if you have 8 or fewer) or 4 (if you have 16 or more)”.

And if you have somewhere between 9 and 15 CPU cores, don’t ask me, because you’re running some.. interesting hardware.  Figure it out yourself, and get that gun out of my face!

OK, I know 12 cores is probably not an unreasonable config, so.. extrapolating my logic above, divide by.. 3?  Yeah let’s go with that.

squirrel with "ship it" flag
ship it good

2. Auto Close (database option)

It’s at the top of the list in the GUI under Database Properties -> Options.  Yet nobody knows why you would ever enable it.  I once heard a community member theorize that it might have been put into the product back when it was considered a viable “local persistence option for Windows CE/Mobile apps” and it would help w/ resource management on said mobile device by freeing up resources when the DB wasn’t in-use.  Well, we all know how well that product line did in the market (hint: poorly).  There are so many better options for localized data stores in mobile dev, MS isn’t even a blip in the conversation.  (Again, talking local data persistence.)

If we’re talking cloud, MS is a big part of that conversation — Azure is amazing, and a solid competitor to AWS & GCP.

Anyway, if you ever find a SQL DB with the auto_close option enabled, find the person responsible, and slap them with a trout.

stick figure chasing another with purple trout
I will find you, and I will trout-slap you…

3. Server Max Memory

Last one for today.  This is something that’s not completely heinous, but could use a LOT more built-in intelligence during the installation process so that DBAs & SysAdmins didn’t need to think about it so much.  SQL Server will, by default, sets its max-memory to some-odd-billion-MBs (technically it’s the max value of a 32-bit int, which the more geeky among you have probably memorized), which is of course some-odd-million-GBs, which is more than even the most bleeding-edge servers have to date.  Which is fine in theory — you paid a crap-ton of money for this system, it might as well use up all the RAM that it can to perform to its potential, right?

all your ram are belong to us
in AD 2101…

Right.  Until you realize that “Oh wait, it’s running inside an OS” (whether that’s Windows or Linux, thanks to 2016 & 2017 product versions) — that that OS needs some RAM too, to keep itself running!  (Cue the Linux zealots with their “it’s so much more lightweight than Windoze, haha!!1” — yeah, well you still need some memory space, don’a ya?)

Here’s what I’d like to see, in my ideal world.  During SQL Server installation, it would detect how much RAM is on the system, subtract about 10% or 4-8 GB, leave that for the OS, and use the resulting number as its limit.  Boom, done, nobody has to think about configuring it and checking off another checkbox on their setup checklist.

But noooo… The vaunted MSSQL engineers can built all sorts of amazing things into the product like QueryStore, Adaptive Query Processing, and The Artist Formerly Known as Hekaton, but heaven forbid we get a little more intelligence in the installer.  It got a lot  better with 2016 when it let you configure tempDB reasonably correctly (multiple files, different locations, etc), but there’s still a LOT that could use some lurv.

two kitties cuddling
because KITTEHS!!

Do you have a favorite “stupid default” or “horrible setting” related to SQL Server or any other technology that you work with?  Share in the comments!

Reports for Auditors

This is how code quality dies…

Look, I get it, audits and auditors are a necessary evil.  But the sheer volume of crappy code, hack-y scripts, man-hours, and alcohol, dedicated to meeting the often arbitrary, sometimes ridiculous, and possibly even downright obscene requirements, of said auditors, is staggering.  There are entire product suites dedicate to help you “report on” or “meet/pass” audits of various kinds.  And thank god for those.  Because without them, we’d all be slightly more insane than we already are.

I’m not crazy; my mother had me tested.  And my rubber duck agrees.

Here’s an example.

In one of our main applications, at the database level (SQL), we’ve implemented poor-man’s data-change-tracking via table triggers.  In a past audit, there was a “flag” on an unauthorized data change.  As a result, an “exception report” was created, which basically produces a daily CSV dump of those changes and emails it to a few managers.  Now, because the tracking table includes “who” as well as “when” & “what”, and there are certain privileged accounts that can be ignored for auditing, and because there are dozens of application users whose app-controlled changes are also logged in said tracking table (and need to be ignored), the “report” writer at the time decided it was a wonderful (read: terrible) idea to hard-code a list of usernames into the query that produces this data-dump.

That is horrible.  Never do that!

using hard-coded values is bad
Mr. Garrison knows…

This query, the job that scheduled it, the data that it produced, and the emails that it sent, were subsequently filed away and silently forgotten about.  Until we did some environment cleanup and discovered “Oh look, here’s a thing that’s not doing anything” (because the hard-coded list of users was now woefully out of date), and the same managers who received the reports for years past, gave the OK to disable the job!

What could possibly go wrong?

Then the next audit season comes around.  And guess who comes running back to the DBA having changed his mind?  Why yes, that very same manager!

i dont always say i told you so
Stay shortsighted, my friends…

So I tell him what we need to do to fix it — remove the hard-coded list, replace it with an AD group or at least a look-up table that can be periodically updated with “current” users in the desired departments/teams.  We do that, and the CSV dump goes from zero to huge.  As in, too huge for email.  So we try to pare it down.  We get it to a manageable size.  Then we get ready to deploy the changes and confirm the desired email recipients.

  • Boss: “Oh it doesn’t really matter, we just ignore the emails anyway… it’s just to satisfy an audit requirement.”
  • Me: “So why did we even bother fixing it?”
  • Boss: “Well, it was broken!”
  • Me: “…”

I’d like the last hour of my life back please.

stress-reduction-kit-bang-head-here
It works, I sw…

Here’s another fun one.

The ERP system has a rickety tack-on audit-trail system involving some SQL Agent Jobs and an obscenely large data repository for storing all the changes that the ERP system users make every hour of every day.  Thankfully, somebody along the way was prescient enough to partition those storage tables using a partitioned view scheme.  But apparently there were NOT smart enough to realize that the clustered index needs to be the datetime column.  Yes, it was in the primary key, but that PK was not clustered (nor should it have been, since the actual unique keys were GUIDs, but this is a prime example of making the clustering key different and separate from the PK.

Obviously enough, in a date-driven partitioned view scheme, that date does need to be involved in the PK also.  But again, the best clustered index is just that date, because A) it’s the partition divider, and B) it’s always your queries’ main filter-predicate (which is a fancy way of saying “the thing in your JOIN/WHERE clauses”).

Once again, audit season comes around, and someone needs those reports driven by these audit-trail repos.  And guess what?  They’re slow as molasses.

you don't say
Nicholas Cage is apparently quite the flexible meme fodder..

So we get to spend another few hours applying the correct clustered indexes, waiting for them to rebuild, testing the queries for the report, and finally running the report itself.  Oh and don’t forget the down-time of the jobs, and having to “catch up” with all the changes that were made in the ERP system while we were working on all this.

Is it 5 o’clock yet?

In conclusion.

Audits suck.  But please, stop producing crappy code in response.  It just makes the next guy/gal that comes after you even more frustrated than he/she already is with the whole process.  Practice the “boy scout principle” — leave things at least a little better than you found them.  And if you’re forced to produce a ridiculous “exception report” that nobody will ever read, spend as little time on it as possible while still making it less heinous than the last terribly designed monstrosity.

oh my look its beer-o-clock
That clock might be slightly ahead… but we’re gonna go with it.