The Horror of Conflicting Database Collations

All these queries with all these JOINs on columns with mis-matched collation, lead to very sad pandas: RBAR operations (row-by-agonizing-row, unable to use index-seeks), and high CPU.

Advertisements

It’s not even close to Halloween, but I promised I would get to this someday, and that someday is now. Strap in, grab the popcorn, and turn up the volume.

Oh wait, this is just a textual medium. Forget the volume bit.

If you’re not sure what this collation thing is, you should go back and read my teaser post, and as always there’s the docs. The one-liner recap on why it matters and how it became a target of my ire is this: legacy DBs use the old default SQL latin1 collation, but an upgraded ERP system’s DB now uses the new default Windows latin1 collation; these DBs all talk to each other, including linked-server JOINs, and performance of those queries has gone to shit.

Pardon the French. “Gone to hell in a hand-basket.”

where exactly are we going in this handbasket?
It can’t be any hotter than a California summer!

So why did this happen? Let’s try to find out. But first; let’s get specific about how all this wailing and gnashing of teeth actually manifests in real technical terms.

The Problem

Essentially what happens here is an implicit conversion problem. There are several blog posts from our distinguished community leaders on this topic and its woes. It causes heavy CPU load as the SQL engine tries desperately to match values of different data types. Even though both columns may be, say, nvarchar(20), the fact that one uses collation SQL_Latin1_General_CP1_CI_AS and the other uses Latin1_General_100_CI_AS, makes them somewhat strangers — they might as well be an INT and a varchar!

Now again, this is my example. We have dozens of critical application queries using their own little sandbox-y databases, joining to the central ERP system DB to fetch Customer or Sales data. This is already a recipe for sadness.

“Use a middle tier or service layer, dammit!”, you’d say.

“You’re preaching to the choir,” I’d reply.

Hell, you’re preaching to the preacher, at that point. But it’s not that simple, as I’ll touch on later.

There’s a subtle difference here, vs. those many community blog posts, which I’ll repeat.  The columns are of the same type.  Just different collations.

And when the collation on the join predicates is different, bad things happen. Let’s take CustomerNumber for example. On the ERP side, it’s a nvarchar(20) collate Latin1_General_100_CI_AS. On the internal & web apps side, it’s a varchar(20) collateSQL_Latin1_General_CP1_CI_AS. As you might imagine, this is a prime field for joining because it’s the main customer identified throughout all the systems.

Let’s be clear here. This is a numeric value only. Did it need to support Unicode? Absolutely not. Should it have been an int or bigint? Probably. But did The ERP designers choose to make it Unicode string anyway? Yep.

Premature optimization may be a root of evil in software development, but inappropriate data typing is at least as evil in database development.

Anyway. The point of this post is not to rant and whine about the design of that particular system. I’ll save that for another day. That won’t stop me from complaining about the improper ways it’s used.

this is why we can't have nice things
Exactly.

As stated above, all these queries with all these JOINs on columns with mis-matched collation, lead to very sad pandas: RBAR operations (row-by-agonizing-row, unable to use index-seeks), and high CPU.  Under duress, my 32 core servers ground to a halt with blocked sessions, ASYNC_NETWORK_IO waits, and 99% CPU utilization metrics.  Needless to say, these were not good days.

Side-note: I really enjoyed Bert’s post because it helped put things into very easy-to-understand terms, and is targeted at a wider audience than just the DBA.  Read it, and watch the video too!

Attempt to Treat the Symptoms

Because the ERP system has been upgraded a couple times in the last decade, the team came up with a [very thin] abstraction layer manifested as a series of “integration views” that UNION similar core entities from the older and newer incarnations of the database. Like Sales records. These have permeated many many areas of the ecosystem, to the point that almost all apps use them instead of the “raw” source tables. Which sounds great, right? Riiiight.

Until you throw in that monkey wrench of conflicting collations. Remember, Devs are lazy (in a good way). Their apps and queries need to not care about such a low level detail as database collation. So to support that “not caring”, we set up these integration views to apply the older default collation (the one that matches everything else in the environment) to the output columns. That way, no extra work is required to consume them in the same way they’ve been consumed for the last 10+ years.

Basically, we can add the keywords COLLATE DATABASE_DEFAULT after each column declaration of the view, like so (in the form “alias = sourceColumn”): CustomerNo = erp.CustomerNo COLLATE DATABASE_DEFAULT.

This was a terrible idea.

It made sense at the time. But as the months passed and the complaints of performance degradation mounted, the signs continued to point back at these views which now used this collation-conversion mechanism (prior to the latest upgrade, they did not).

The typical work-arounds involved temp tables and/or going straight to the “raw” source. Neither of these are ideal — the latter breaks that abstraction (however thin it was), while the former risks over-burdening tempdb (sometimes referred to as the “communal toilet” of SQL server). Generally this was acceptable, and often resulted in orders of magnitude improvement to performance. But it continued to rack up that technical debt.

One thing I tried was to remove those collation conversions from all columns in the view except the join predicates, because the consumers fed those values straight into the object or ORM layer, at which point they all became C# strings anyway, so it didn’t matter what collation they’d used or whether they were ANSI or Unicode at that point. But alas, because the core pitfall of these queries was still very present — that implicit conversion — performance still suffered.

Treating the Root Cause

Here I re-link the two scary articles that warn of the dangers and gotchas of changing your database and server (instance-level) default collations: StackOverflow answer, and blog post.  Given all that, it’s a daunting task.  But if we’re going to get our performance back up to snuff, it’ll probably have to happen at some point.  As the great Mike Rowe says…

it's a dirty job but somebody's gotta do it

And unlike the previous blogger, I will do my very best to follow up here and post about my journey as we undertake this harrowing trek.

The way I see it, there are three major attack vectors.

  1. We can try converting the ERP database to the old SQL collation.

Pros: smaller effort than #2, less integration/regression testing overall.
Cons: unsupported by ERP vendor, downtime for ERP system, high amount of risk.

  1. We can try converting all other DBs in the environment (across all SQL instances) to the new Windows collation to match that of the ERP DB.

Pros: supported by ERP vendor, future-proof, less tech-debt.
Cons: largest effort, heaviest testing burden, high risk.

  1. We could utilize some kind of data-replication to maintain copies of the required data on the other SQL instances in their default (matching) collation.

Pros: support not an issue, lowest effort & testing burden, lowest risk.
Cons: replication maintenance burden & overhead, loss of “real-time” (added data latency), and some tech-debt.

As the lone DBA, most if not all effort falls to me, so I’m quite inclined toward #3. And we were somewhat already going in this direction with the temp-table workarounds, i.e. pulling in the ERP data (usually via a linked-server) to the target app DB & using that temp-table for joins — essentially, that’s “lightweight replication”.

The technical debt we’re incurring here is that we’re leaving all of our legacy DBs (and servers) in the older SQL collation.  At some point, likely the far-future, these will be unsupported, or at least obsolete, in the sense that all new applications & DBs will prefer the newer Windows collation.  Perhaps during the next big “hardware refresh” cycle, i.e. when we have to plan and execute a big SQL server upgrade/migration, we can consider integrating the collation-change into that project.

But wait, you argue, what about a 4th option?

Oh sure, you mean say, “Screw it, not my problem!”…

  1. Force all ERP DB data access up the stack to the application layers, i.e. the apps pull the data into memory and join/merge it there (or relate to & interact with it however the developers want to, in that layer).

But this has several downsides that the business and the development teams would [justifiably] push back on: dev time & effort, the drastic-ness & unprecedented-ness of the change, the fear of not catching every single place & usage of the ERP in the mysterious myriad of apps that it seems nobody can ever quite get a unified visibility handle on (some of which they can’t even build or deploy anymore without diving down some seriously arcane rabbit-holes of ancient tech).  The risk is just too high.

More than that, and as much as I would love to say “ain’t my problem” and pass it off to a larger group of smart people, the fact is that over 50% of dependencies on the ERP DB are from my own BI/reporting queries.  There’s no shortage of reports that need to examine and relate legacy LOB app data with ERP customer & sales data.  And it’s not just to build a paginated SSRS report — where I could, arguably, do what I said above in faux-option 4: pull the data from 2 separate data-sets, then merge it in the report layer.  It’s heavily customized, painstakingly crafted mini-data-warehouses and data-marts that massage and tailor the data for an assortment of different purposes and reporting needs.  Thus, even with this, most of the burden still falls to me.

some guy with glasses posing under a globe like he's carrying the weight of the world
To be clear, this is NOT me; it’s the 4th pic I found on Google image search for “weight of the world” filtered for “noncommercial reuse” images. 😉  Kinda looks like Landry from Friday Night Lights, no?

 

So, to borrow a closing line from one of my favorite podcasts

May your server lights blink, your database collations be identical, and your cables be cleanly managed.

Movie Reviews and the Killer Database Collation

If you have a core database using a different collation than the rest of the DBs around it, BAD THINGS HAPPEN.

And we’re back!  Hi folks, thanks for being patient with my December hiatus.  The holiday season is always a little hectic but this year it felt especially sudden.  And hey, you all have better things to do than read a blog in between the home cooked meals and the family gatherings.. like sleep, shop, and go see all the new movies!

Thanks to both Pitch Perfect 3 and the latest New Year’s Rockin’ Eve, Britney’s “Toxic” is now stuck in my head, so that’s fun.

britney-stewardess-toxic
I think I’m ready now… for 2018.

Some of you may not know this, but I’m a big movie nerd.  Not like the weird “knows a bunch of obscure factoids about all the Tarantino movies” or whatever.  But I do quite enjoy the behind-the-scenes / making-of stuff — what used to be called “bonus features” on DVDs (remember those things??) — whenever the wife will tolerate sitting thru them with me.

Our genre of choice is generally horror.  Now, I’m gonna get nerdy on you for a bit; because there are several sub-types or horror, and I enjoy almost almost all of them.  Campy, creepy, fun, found-footage, gory, spooky, slasher, supernatural, tense, psychological, revenge, deconstruction, possession.  For the uninitiated, “deconstruction” is like 2012’s Cabin in the Woods — it pokes fun at the tropes while building on them in unique ways.  Those are one of my favorite kind; that one in particular is definitely in my top 10 all-time.

So to kick off this year, before diving back into the technical stuff, I’d like to give you a coupe lightning reviews of some horror movies that we’ve watched that are perhaps underrated or you may have missed.

  • The Babysitter (2017) – comedy/deconstruction. A young preteen boy, whose parents are gone a lot, has a great friendship with his older teen babysitter, but one night decides to spy on what she and her friends do after he goes to bed. And well, crap hits the fan.  Lots of fun, eye candy, and slapstick violence. 👍👍
  • Patchwork (2015) – campy/revenge. 3 girls are Frankenstein’d together and have to overcome their mental differences and physical struggles to piece together the perpetrator and hopefully exact some revenge. Superbly acted by the lead lady, plenty of violence and just enough funny bits to keep it going. 👍
  • Happy Death Day (2017) – slasher/deconstruction. Think Groundhog Day but with a college chick being killed by a masked marauder repeatedly.  She must try to find out who it is before it’s too late!  Somewhat predictable but still entertaining and engaging. 👍
  • Incarnate (2016) – possession/supernatural. A somewhat unique twist on the genre, a brain doc frees people from possession by mind-sharing & getting the person back in control of their own consciousness.  Think Inception meets Exorcist.  Very well-acted, convincingly scary demon, and nicely twisted ending. 👍👍
  • Demonic (2015) – creepy/found-footage. Bit of a misnomer, as it has nothing to do with demons; it’s about a ghost-summoning gone horribly wrong resulting in the deaths of all but 1 (ish?) member of the group that originally attempted said ritual.  Frank Grillo is always on-point.  Very engaging. 👍
  • Last Shift (2014) – gory/creepy/demon-y. Rookie cop gets stuck with the last watch in a soon-to-be-shut-down police station, chaos ensues.  Literally, this is some crazy crap; scary and bloody.  Original & vastly under-hyped, has an indie vibe but looks & feels professional-grade. 👍👍

Most of these should be stream-able.  So check ’em out!

Now on to the SQL stuff.

A not equal a
borrowed from the man himself, Pinal Dave =)

Collations are Hard

If you ever have to integrate a vendor database into your existing environment, and the vendor ‘mandates’ their DB use a certain collation (which differs from the rest of your SQL instances / databases), run away screamingSrsly.

Or convince your managers that you know better, and force it into the same collation as everything else you have to integrate with.  Good luck & godspeed.

Let me give you an example.  The ERP system is being upgraded, which of course means a new (upgraded) DB as well.  Part of this upgrade seems to involve supporting case-sensitive searching/matching against name fields.  To this end, the vendor insists that the DB should use a case-sensitive collation, namely ​Latin1_General_100_CS_AS.  Problem is, the rest of your DB environment, in which a lot of stuff touches the ERP database (via joins, linked-server queries, etc.), uses the SQL default collation of SQL_Latin1_General_CP1_CI_AS.

If you follow the vendor’s mandate recommendation, guess what’s going to happen to your queries/views/stored-procedures that touch this & other DBs?  Horrible things.  Terrible performance degradation.  Wailing a gnashing of teeth from the developers, business users, and customers.

Okay, I exaggerate.  Slightly.

But it really does hurt performance, and I don’t feel like it’s talked about enough in the data professional community.  In the next post, I’ll take this problem apart a little more and try to impart some of what I’ve learned from going through the pain of dealing with the aforementioned example.

Happy 2018!

PS: Apparently this is my 50th post!!  Go me!  :o)

50-cent-face-on-50-dollar-bill
fiddy. fiddy posts.