Welcome to the first post of the new year. I’ll be keeping things a little on the lighter side for now. I’m still very into my work and learning lots of share-worthy things in the data world. But for now, movies!
I also want to take a moment to appreciate those who reached out to us after our devastating loss. Thank you for your thoughts and prayers. Please continue to remember our family as we struggle to find a sense of normalcy.
So, some of my elder moviegoers asked me the question that many people have been asking over the last year or two: “What Marvel movies do I really need to watch before Infinity War?”, or more recently, “before End Game?”. More generally, which ones are worthwhile viewing for a casual non-geek, someone who doesn’t need to obsess over every little detail, someone who is not by nature a “comic book movie lover”? It’s a completely fair question, and honestly it deserves more… less nerdy answers.
Data isn’t literally everything. I mean it is, technically, but it’s not all super happy fun times, so we need to take a break once in a while and do something less neuron-intensive. Thus, my new segment: movie reviews! Because, despite what you may have read, all work and no play makes Nate a dull boy. And yes, I promised you this blog would be professional. Mostly. I remember specifically using that word. So don’t wag your naggy finger at me. If you don’t like it, you can simply avoid the tags like #offtopic and #movies.
And we’re back! Hi folks, thanks for being patient with my December hiatus. The holiday season is always a little hectic but this year it felt especially sudden. And hey, you all have better things to do than read a blog in between the home-cooked meals and the family gatherings… like sleep, shop, and go see all the new movies!
Thanks to both Pitch Perfect 3 and the latest New Year’s Rockin’ Eve, Britney’s “Toxic” is now stuck in my head, so that’s fun.
Some of you may not know this, but I’m a big movie nerd. Not like the weird “knows a bunch of obscure factoids about all the Tarantino movies” type or whatever. But I do quite enjoy the behind-the-scenes / making-of stuff — what used to be called “bonus features” on DVDs (remember those things??) — whenever the wife will tolerate sitting through them with me.
Our genre of choice is generally horror. Now, I’m gonna get nerdy on you for a bit, because there are several sub-types of horror, and I enjoy almost all of them. Campy, creepy, fun, found-footage, gory, spooky, slasher, supernatural, tense, psychological, revenge, deconstruction, possession. For the uninitiated, “deconstruction” is like 2012’s Cabin in the Woods — it pokes fun at the tropes while building on them in unique ways. Deconstructions are among my favorite kinds; that one in particular is definitely in my top 10 all-time.
So to kick off this year, before diving back into the technical stuff, I’d like to give you a couple of lightning reviews of some horror movies we’ve watched that are perhaps underrated, or that you may have missed.
The Babysitter (2017) – comedy/deconstruction. A young preteen boy, whose parents are gone a lot, has a great friendship with his older teen babysitter, but one night decides to spy on what she and her friends do after he goes to bed. And well, crap hits the fan. Lots of fun, eye candy, and slapstick violence. 👍👍
Patchwork (2015) – campy/revenge. 3 girls are Frankenstein’d together and have to overcome their mental differences and physical struggles to piece together the perpetrator and hopefully exact some revenge. Superbly acted by the lead lady, plenty of violence and just enough funny bits to keep it going. 👍
Happy Death Day (2017) – slasher/deconstruction. Think Groundhog Day but with a college chick being killed by a masked marauder repeatedly. She must try to find out who it is before it’s too late! Somewhat predictable but still entertaining and engaging. 👍
Incarnate (2016) – possession/supernatural. A somewhat unique twist on the genre, a brain doc frees people from possession by mind-sharing & getting the person back in control of their own consciousness. Think Inception meets Exorcist. Very well-acted, convincingly scary demon, and nicely twisted ending. 👍👍
Demonic (2015) – creepy/found-footage. Bit of a misnomer, as it has nothing to do with demons; it’s about a ghost-summoning gone horribly wrong resulting in the deaths of all but 1 (ish?) member of the group that originally attempted said ritual. Frank Grillo is always on-point. Very engaging. 👍
Last Shift (2014) – gory/creepy/demon-y. Rookie cop gets stuck with the last watch in a soon-to-be-shut-down police station, chaos ensues. Literally, this is some crazy crap; scary and bloody. Original & vastly under-hyped, has an indie vibe but looks & feels professional-grade. 👍👍
Most of these should be stream-able. So check ’em out!
Now on to the SQL stuff.
Collations are Hard
If you ever have to integrate a vendor database into your existing environment, and the vendor ‘mandates’ their DB use a certain collation (which differs from the rest of your SQL instances / databases), run away screaming. Srsly.
Or convince your managers that you know better, and force it into the same collation as everything else you have to integrate with. Good luck & godspeed.
Let me give you an example. The ERP system is being upgraded, which of course means a new (upgraded) DB as well. Part of this upgrade seems to involve supporting case-sensitive searching/matching against name fields. To this end, the vendor insists that the DB should use a case-sensitive collation, namely Latin1_General_100_CS_AS. Problem is, the rest of your DB environment, in which a lot of stuff touches the ERP database (via joins, linked-server queries, etc.), uses the SQL default collation of SQL_Latin1_General_CP1_CI_AS.
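To make that concrete, here’s a sketch of what the conflict looks like. All database, table, and column names below are made up for illustration:

```sql
-- CoreDB uses SQL_Latin1_General_CP1_CI_AS; ERPDB uses Latin1_General_100_CS_AS.
-- Joining their varchar columns directly throws:
--   "Cannot resolve the collation conflict ... in the equal to operation."
SELECT c.CustomerID, o.OrderID
FROM CoreDB.dbo.Customers AS c
JOIN ERPDB.dbo.Orders AS o
    ON o.CustomerCode = c.CustomerCode;

-- The quick "fix" -- slapping a COLLATE clause on the predicate -- compiles,
-- but the conversion can prevent index seeks on the converted side:
SELECT c.CustomerID, o.OrderID
FROM CoreDB.dbo.Customers AS c
JOIN ERPDB.dbo.Orders AS o
    ON o.CustomerCode COLLATE SQL_Latin1_General_CP1_CI_AS = c.CustomerCode;
```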
If you follow the vendor’s mandate, guess what’s going to happen to your queries/views/stored-procedures that touch this & other DBs? Horrible things. Terrible performance degradation. Wailing and gnashing of teeth from the developers, business users, and customers.
Okay, I exaggerate. Slightly.
But it really does hurt performance, and I don’t feel like it’s talked about enough in the data professional community. In the next post, I’ll take this problem apart a little more and try to impart some of what I’ve learned from going through the pain of dealing with the aforementioned example.
It’s time for a more thought-y, less tech-y post. Which is mostly my excuse for not wanting to write a bunch of code at the moment. But that’s how I started this blog, with mostly opinion pieces, trying to offer some critical thinking on how DBAs and Developers work together. So y’all better like it!
Today’s title is brought to you by Don Henley’s tune of the same name, which is now stuck in my head, thankyouverymuch.
This is about data quality. “Dirty data” is like dirty laundry: let it sit unattended and it starts to smell. In software, this means the “badness” seeps into other areas of the environment, affecting systems and business processes that should otherwise function smoothly.
A code smell is a surface indication that usually corresponds to a deeper problem in the system.
And, more aptly:
Data quality is corporate America’s dirty little secret.
But what is dirty data? Generally, it’s anything that doesn’t quite fit the ideal data model — that perfect vision of how all the bits of information in the system fit together, the shape of each data entity and how they relate to each other. Mostly, dirty data is what happens when you allow users to type things into text-boxes, and you write those text-box contents straight into the database without any layers of validation or cleansing. (Coincidentally, that’s also how SQL injection happens, but most of us have been scared-straight by enough years of security bloggers hammering at our thick skulls — and our favorite XKCD — that we at least sanitize our inputs before dumping them to an INSERT statement.)
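For the record, “sanitizing” in SQL Server land mostly means parameterizing. A minimal sketch, with hypothetical table and variable names:

```sql
-- Hypothetical input containing a classic injection attempt.
DECLARE @Name NVARCHAR(100) = N'Robert''); DROP TABLE Students;--';

-- Bad: concatenating user input straight into an EXEC string.
-- EXEC (N'INSERT INTO dbo.Customers (Name) VALUES (''' + @Name + N''');');

-- Better: the value travels as a typed parameter, never as executable SQL.
EXEC sys.sp_executesql
    N'INSERT INTO dbo.Customers (Name) VALUES (@p1);',
    N'@p1 NVARCHAR(100)',
    @p1 = @Name;
```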
Let me take a recent example from my experience. We have an ERP system that doubles as our CRM system (which is already a pair of bad idea jeans). How do you think customer information gets into the database? Customer Service Reps, typing stuff. Usually by copying from a paper form. Or the customers themselves, using an online form. But guess what doesn’t happen in either case? If you said “USPS address validation”, give yourself a hand!
Now, being that this system is our “source of truth” for customer info, it stands to reason that lots of other business functions & processes depend on it. For example, let’s say we send a promotional calendar to our customers of a certain “subscription level” on a yearly basis. We’re not in the publishing business, so we contract this out to another company. But guess what they need from us in order to complete the job and mail out those calendars? Addresses! So what happens when there’s a bad address in our database? A calendar gets returned, wasted cost and materials. Multiply that by a couple thousand and you start to turn a few heads in the C-suite.
Later, around the Marketing table, someone has a brilliant idea that they need to run a mail-merge to send out a gift-package to the top 100 customers. So they ask the DBA for a list of said customers. “Sure! Here ya go, here’s a report.” And then the complaints start coming in.
“These customers aren’t active anymore.”
Then tell your CS reps to mark them as inactive in the system. But no, we don’t do that, we just write “inactive” in the FirstName field.
“These ones are employees.”
Fine, figure out a special indicator to add for that, so I can exclude them from the report. But no, of course, we can’t do that either; we just put “deactivated” in the FirstName field.
“This guy is dead.”
Yeah, not even kidding. Apparently the powers-that-be decided to keep his info in the system, but type in “deceased” to the “Address 2” line (in the US, this is customarily the apartment/suite/unit number).
But mostly, the biggest complaint is that we’re getting un-deliverable/return-to-sender when we try shipping out to some of these addresses. And why? Because they’re not subject to any external validation and quality-control.
So what’s the data professional’s responsibility in this? In my opinion, it’s to advocate for data quality. There are obviously big vendors out there like Melissa Data who will sell you a service to help get you there. APIs abound, from USPS and other official sources, so building it isn’t out of the question.
One potential roadblock is, as usual, conservatism. The business’s ERP system is its life-blood, highly sensitive to change and closely guarded by over-protective management and finicky executives. But the smelly dirty data-laundry continues to cause problems and has real-money impacts on corporate efficiency and profit. Unfortunately, many people tend to take the ostrich approach.
So, my good people, start “doing your laundry”. Have those conversations with your teams and managers about the current state of your data quality, and what it’s going to look like moving forward. Make some plans, have a road-map, and understand that it’s going to involve a lot of collaboration between key players. And good luck!
This week’s topic was triggered by an amazing quote from Adam Machanic (b|t) in the sqlcommunity Slack group, where he was trying to help someone troubleshoot a slow linked-server query:
That default, I consider to have been checked into the SQL Server codebase directly by Satan.
He’s referring, in this case, to the default option of remote proc transaction promotion, or “Enable Promotion of Distributed Transactions for RPC” in the GUI, which is set to True by default. Admittedly, linked-servers are a dicey area of the technology and not everybody needs them, but when you do need to deal with them, there are some caveats to consider and a few potholes to avoid.
I won’t go into gory details, but the mile-high gist of it is that you should probably change a couple of the default linked-server settings when you’re going from MSSQL to MSSQL (which most of the time, you are): rpc and rpc out both to true, and the above dude to false. The reasons behind that are subtle, but it boils down to usage patterns: If you’re purely running basic SELECT statements against the remote server, you can leave this stuff alone; but if you want to do cool stuff like exec MyLinkedServer.RemoteDB.sys.sp_executesql '--some awesome dynamic-sql', you’ll want to change these. (That last bit comes in especially handy if you’re building a dynamic query string, then running it against the remote server to bring in the results to a #temptable to further massage/mangle/munge said data.)
Even though you probably really shouldn’t be doing that in the database (that’s what web server farms are for!).
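For reference, here’s roughly what flipping those three options looks like in T-SQL. The linked-server name below is a placeholder; you can also do this in SSMS under Server Objects -> Linked Servers:

```sql
-- Enable remote procedure calls in both directions, and turn off
-- the "checked in by Satan" distributed-transaction promotion default.
EXEC sys.sp_serveroption @server = N'MyLinkedServer',
    @optname = N'rpc', @optvalue = N'true';
EXEC sys.sp_serveroption @server = N'MyLinkedServer',
    @optname = N'rpc out', @optvalue = N'true';
EXEC sys.sp_serveroption @server = N'MyLinkedServer',
    @optname = N'remote proc transaction promotion', @optvalue = N'false';
```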
So, what are some other “stupid defaults” in SQL Server?
Every big enterprise product has them. I just happen to pick on SQL because it’s my area of expertise. And it’s not even just “defaults”; there are some options which shouldn’t even be a thing — they should be completely and irrevocably removed from the product. Yet, because the RDBMS tech space is infamously conservative and slow-to-change (the accepted euphemism is “mature”), these options and defaults have remained in the product despite our best attempts to convince MS that they’re heinous.
1. Parallelism settings (server/instance level)
Your servers have many-core CPUs, right? And you want SQL to utilize those cores to the best of its ability, distributing the many users’ workloads fairly amongst them, yes? Damn right, you paid $3k or more per core in freaking licensing costs! “OK”, says SQL Server, “I’ll use all available CPUs for any query with a ‘cost’ over ‘5’.” (To give context here, in case you’re not aware, ‘5’ is a LOW number; most OLTP workload queries are in the double to triple digits.) “But wait!”, you protest, “I have more than 1 user, obviously, and I don’t want their horrible queries bringing all CPUs to their knees and forcing the 50 other user queries to wait their turn!”
Hardly seems fair, no? But those are the defaults, have been for over a decade, despite how much hardware has evolved since. Just for completeness, I am talking about the following 2 options, which you can view by right-clicking on the instance in SSMS, or by running sys.sp_configure. They are max degree of parallelism and cost threshold for parallelism, respectively (# cores to use, and how expensive the query should be to “go parallel” — as opposed to serial/single-threaded, which is perfectly fine for those itty-bitty-teeny-tiny queries that are so quick that 1 CPU core is more than enough to handle them). We commonly abbreviate the former as maxDOP; the latter is less commonly abbreviated, but because I’m a sucker for acronyms, I like to call it CTFP, or maybe costFP if I’m feeling verbose.
And if you have somewhere between 9 and 15 CPU cores, don’t ask me, because you’re running some… interesting hardware. Figure it out yourself, and get that gun out of my face!
OK, I know 12 cores is probably not an unreasonable config, so.. extrapolating my logic above, divide by.. 3? Yeah let’s go with that.
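If you want to adjust them, it’s a couple of sp_configure calls. The values below are illustrative for a hypothetical 16-core box, not gospel; test against your own workload:

```sql
-- Both settings are "advanced" options, so surface them first.
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;

-- e.g. cap any single query at half the cores, and raise the bar
-- a query's estimated cost must clear before it "goes parallel".
EXEC sys.sp_configure N'max degree of parallelism', 8;
EXEC sys.sp_configure N'cost threshold for parallelism', 50;
RECONFIGURE;
```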
2. Auto Close (database option)
It’s at the top of the list in the GUI under Database Properties -> Options. Yet nobody knows why you would ever enable it. I once heard a community member theorize that it might have been put into the product back when it was considered a viable “local persistence option for Windows CE/Mobile apps” and it would help w/ resource management on said mobile device by freeing up resources when the DB wasn’t in use. Well, we all know how well that product line did in the market (hint: poorly). There are so many better options for localized data stores in mobile dev, MS isn’t even a blip in the conversation. (Again, talking local data persistence.)
If we’re talking cloud, MS is a big part of that conversation — Azure is amazing, and a solid competitor to AWS & GCP.
Anyway, if you ever find a SQL DB with the auto_close option enabled, find the person responsible, and slap them with a trout.
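Before you go trout-slapping, you can find the offenders with a quick query; the database name in the fix is a placeholder:

```sql
-- Any rows returned here are candidates for a stern conversation.
SELECT name
FROM sys.databases
WHERE is_auto_close_on = 1;

-- Then turn the option off for each offender.
ALTER DATABASE [SomeDB] SET AUTO_CLOSE OFF;
```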
3. Server Max Memory
Last one for today. This is something that’s not completely heinous, but could use a LOT more built-in intelligence during the installation process so that DBAs & SysAdmins didn’t need to think about it so much. SQL Server, by default, sets its max-memory to some-odd-billion MBs (technically it’s the max value of a 32-bit int, which the more geeky among you have probably memorized), which is of course some-odd-million GBs, more than even the most bleeding-edge servers have to date. Which is fine in theory — you paid a crap-ton of money for this system, it might as well use up all the RAM that it can to perform to its potential, right?
Right. Until you realize that “Oh wait, it’s running inside an OS” (whether that’s Windows or Linux, thanks to the 2016 & 2017 product versions) — and the OS needs some RAM too, to keep itself running! (Cue the Linux zealots with their “it’s so much more lightweight than Windoze, haha!!1” — yeah, well you still need some memory space, don’cha?)
Here’s what I’d like to see, in my ideal world. During SQL Server installation, it would detect how much RAM is on the system, subtract about 10% or 4-8 GB, leave that for the OS, and use the resulting number as its limit. Boom, done, nobody has to think about configuring it and checking off another checkbox on their setup checklist.
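Until the installer does that for us, the manual version of the arithmetic looks something like this. The numbers are purely illustrative, assuming a server with 64 GB of RAM:

```sql
-- 64 GB total - ~8 GB reserved for the OS = 56 GB = 57344 MB for SQL Server.
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'max server memory (MB)', 57344;
RECONFIGURE;
```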
But noooo… The vaunted MSSQL engineers can build all sorts of amazing things into the product like QueryStore, Adaptive Query Processing, and The Artist Formerly Known as Hekaton, but heaven forbid we get a little more intelligence in the installer. It got a lot better with 2016 when it let you configure tempDB reasonably correctly (multiple files, different locations, etc), but there’s still a LOT that could use some lurv.
Do you have a favorite “stupid default” or “horrible setting” related to SQL Server or any other technology that you work with? Share in the comments!