It’s not even close to Halloween, but I promised I would get to this someday, and that someday is now. Strap in, grab the popcorn, and turn up the volume.
Oh wait, this is just a textual medium. Forget the volume bit.
If you’re not sure what this collation thing is, you should go back and read my teaser post, and as always there’s the docs. The one-liner recap on why it matters and how it became a target of my ire is this: legacy DBs use the old default SQL latin1 collation, but an upgraded ERP system’s DB now uses the new default Windows latin1 collation; these DBs all talk to each other, including linked-server
JOINs, and performance of those queries has gone to shit.
Pardon the French. “Gone to hell in a hand-basket.”
So why did this happen? Let’s try to find out. But first; let’s get specific about how all this wailing and gnashing of teeth actually manifests in real technical terms.
Essentially what happens here is an implicit conversion problem. There are several blog posts from our distinguished community leaders on this topic and its woes. It causes heavy CPU load as the SQL engine tries desperately to match values of different data types. Even though both columns may be, say,
nvarchar(20), the fact that one uses collation
SQL_Latin1_General_CP1_CI_AS and the other uses
Latin1_General_100_CI_AS, makes them somewhat strangers — they might as well be an
INT and a
Now again, this is my example. We have dozens of critical application queries using their own little sandbox-y databases, joining to the central ERP system DB to fetch Customer or Sales data. This is already a recipe for sadness.
“Use a middle tier or service layer, dammit!”, you’d say.
“You’re preaching to the choir,” I’d reply.
Hell, you’re preaching to the preacher, at that point. But it’s not that simple, as I’ll touch on later.
There’s a subtle difference here, vs. those many community blog posts, which I’ll repeat. The columns are of the same type. Just different collations.
And when the collation on the join predicates is different, bad things happen. Let’s take
CustomerNumber for example. On the ERP side, it’s a
Latin1_General_100_CI_AS. On the internal & web apps side, it’s a
SQL_Latin1_General_CP1_CI_AS. As you might imagine, this is a prime field for joining because it’s the main customer identified throughout all the systems.
Let’s be clear here. This is a numeric value only. Did it need to support Unicode? Absolutely not. Should it have been an
bigint? Probably. But did The ERP designers choose to make it Unicode string anyway? Yep.
Premature optimization may be a root of evil in software development, but inappropriate data typing is at least as evil in database development.
Anyway. The point of this post is not to rant and whine about the design of that particular system. I’ll save that for another day. That won’t stop me from complaining about the improper ways it’s used.
As stated above, all these queries with all these
JOINs on columns with mis-matched collation, lead to very sad pandas:
RBAR operations (row-by-agonizing-row, unable to use index-seeks), and high CPU. Under duress, my 32 core servers ground to a halt with blocked sessions,
ASYNC_NETWORK_IO waits, and 99% CPU utilization metrics. Needless to say, these were not good days.
Attempt to Treat the Symptoms
Because the ERP system has been upgraded a couple times in the last decade, the team came up with a [very thin] abstraction layer manifested as a series of “integration views” that
UNION similar core entities from the older and newer incarnations of the database. Like Sales records. These have permeated many many areas of the ecosystem, to the point that almost all apps use them instead of the “raw” source tables. Which sounds great, right? Riiiight.
Until you throw in that monkey wrench of conflicting collations. Remember, Devs are lazy (in a good way). Their apps and queries need to not care about such a low level detail as database collation. So to support that “not caring”, we set up these integration views to apply the older default collation (the one that matches everything else in the environment) to the output columns. That way, no extra work is required to consume them in the same way they’ve been consumed for the last 10+ years.
Basically, we can add the keywords
COLLATE DATABASE_DEFAULT after each column declaration of the view, like so (in the form “alias = sourceColumn”):
CustomerNo = erp.CustomerNo COLLATE DATABASE_DEFAULT.
This was a terrible idea.
It made sense at the time. But as the months passed and the complaints of performance degradation mounted, the signs continued to point back at these views which now used this collation-conversion mechanism (prior to the latest upgrade, they did not).
The typical work-arounds involved temp tables and/or going straight to the “raw” source. Neither of these are ideal — the latter breaks that abstraction (however thin it was), while the former risks over-burdening tempdb (sometimes referred to as the “communal toilet” of SQL server). Generally this was acceptable, and often resulted in orders of magnitude improvement to performance. But it continued to rack up that technical debt.
One thing I tried was to remove those collation conversions from all columns in the view except the join predicates, because the consumers fed those values straight into the object or ORM layer, at which point they all became C# strings anyway, so it didn’t matter what collation they’d used or whether they were ANSI or Unicode at that point. But alas, because the core pitfall of these queries was still very present — that implicit conversion — performance still suffered.
Treating the Root Cause
Here I re-link the two scary articles that warn of the dangers and gotchas of changing your database and server (instance-level) default collations: StackOverflow answer, and blog post. Given all that, it’s a daunting task. But if we’re going to get our performance back up to snuff, it’ll probably have to happen at some point. As the great Mike Rowe says…
And unlike the previous blogger, I will do my very best to follow up here and post about my journey as we undertake this harrowing trek.
The way I see it, there are three major attack vectors.
- We can try converting the ERP database to the old SQL collation.
Pros: smaller effort than #2, less integration/regression testing overall.
Cons: unsupported by ERP vendor, downtime for ERP system, high amount of risk.
- We can try converting all other DBs in the environment (across all SQL instances) to the new Windows collation to match that of the ERP DB.
Pros: supported by ERP vendor, future-proof, less tech-debt.
Cons: largest effort, heaviest testing burden, high risk.
- We could utilize some kind of data-replication to maintain copies of the required data on the other SQL instances in their default (matching) collation.
Pros: support not an issue, lowest effort & testing burden, lowest risk.
Cons: replication maintenance burden & overhead, loss of “real-time” (added data latency), and some tech-debt.
As the lone DBA, most if not all effort falls to me, so I’m quite inclined toward #3. And we were somewhat already going in this direction with the temp-table workarounds, i.e. pulling in the ERP data (usually via a
linked-server) to the target app DB & using that temp-table for
joins — essentially, that’s “lightweight replication”.
The technical debt we’re incurring here is that we’re leaving all of our legacy DBs (and servers) in the older SQL collation. At some point, likely the far-future, these will be unsupported, or at least obsolete, in the sense that all new applications & DBs will prefer the newer Windows collation. Perhaps during the next big “hardware refresh” cycle, i.e. when we have to plan and execute a big SQL server upgrade/migration, we can consider integrating the collation-change into that project.
But wait, you argue, what about a 4th option?
Oh sure, you mean say, “Screw it, not my problem!”…
- Force all ERP DB data access up the stack to the application layers, i.e. the apps pull the data into memory and
join/merge it there (or relate to & interact with it however the developers want to, in that layer).
But this has several downsides that the business and the development teams would [justifiably] push back on: dev time & effort, the drastic-ness & unprecedented-ness of the change, the fear of not catching every single place & usage of the ERP in the mysterious myriad of apps that it seems nobody can ever quite get a unified visibility handle on (some of which they can’t even build or deploy anymore without diving down some seriously arcane rabbit-holes of ancient tech). The risk is just too high.
More than that, and as much as I would love to say “ain’t my problem” and pass it off to a larger group of smart people, the fact is that over 50% of dependencies on the ERP DB are from my own BI/reporting queries. There’s no shortage of reports that need to examine and relate legacy LOB app data with ERP customer & sales data. And it’s not just to build a paginated SSRS report — where I could, arguably, do what I said above in faux-option 4: pull the data from 2 separate data-sets, then merge it in the report layer. It’s heavily customized, painstakingly crafted mini-data-warehouses and data-marts that massage and tailor the data for an assortment of different purposes and reporting needs. Thus, even with this, most of the burden still falls to me.
May your server lights blink, your database collations be identical, and your cables be cleanly managed.