The Horror of Conflicting Database Collations

All these queries with all these JOINs on columns with mismatched collations lead to very sad pandas: RBAR operations (row-by-agonizing-row, unable to use index seeks) and high CPU.

It’s not even close to Halloween, but I promised I would get to this someday, and that someday is now. Strap in, grab the popcorn, and turn up the volume.

Oh wait, this is just a textual medium. Forget the volume bit.

If you’re not sure what this collation thing is, you should go back and read my teaser post, and as always there’s the docs. The one-liner recap on why it matters and how it became a target of my ire is this: legacy DBs use the old default SQL Latin1 collation, but the upgraded ERP system’s DB now uses the new default Windows Latin1 collation; these DBs all talk to each other, including linked-server JOINs, and performance of those queries has gone to shit.

Pardon the French. “Gone to hell in a hand-basket.”

where exactly are we going in this handbasket?
It can’t be any hotter than a California summer!

So why did this happen? Let’s try to find out. But first, let’s get specific about how all this wailing and gnashing of teeth actually manifests in real technical terms.

The Problem

Essentially what happens here is an implicit conversion problem. There are several blog posts from our distinguished community leaders on this topic and its woes. It causes heavy CPU load as the SQL engine tries desperately to match values of different data types. Even though both columns may be, say, nvarchar(20), the fact that one uses collation SQL_Latin1_General_CP1_CI_AS and the other uses Latin1_General_100_CI_AS makes them somewhat strangers — they might as well be an INT and a varchar!

Now again, this is my example. We have dozens of critical application queries using their own little sandbox-y databases, joining to the central ERP system DB to fetch Customer or Sales data. This is already a recipe for sadness.

“Use a middle tier or service layer, dammit!”, you’d say.

“You’re preaching to the choir,” I’d reply.

Hell, you’re preaching to the preacher, at that point. But it’s not that simple, as I’ll touch on later.

There’s a subtle difference here vs. those many community blog posts, and it bears repeating: the columns are of the same type.  Just different collations.

And when the collation on the join predicates is different, bad things happen. Let’s take CustomerNumber for example. On the ERP side, it’s an nvarchar(20) collate Latin1_General_100_CI_AS. On the internal & web apps side, it’s a varchar(20) collate SQL_Latin1_General_CP1_CI_AS. As you might imagine, this is a prime field for joining because it’s the main customer identifier throughout all the systems.

Let’s be clear here. This is a numeric value only. Did it need to support Unicode? Absolutely not. Should it have been an int or bigint? Probably. But did the ERP designers choose to make it a Unicode string anyway? Yep.
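
To make that concrete, here’s a minimal, hypothetical repro of the shape of the problem (table, index, and column names are made up for illustration; the real tables are obviously much wider):

    -- App side: legacy SQL collation, non-Unicode (hypothetical table).
    CREATE TABLE dbo.AppOrder
    (
        OrderID    int IDENTITY(1,1) PRIMARY KEY,
        CustomerNo varchar(20) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
    );
    CREATE INDEX IX_AppOrder_CustomerNo ON dbo.AppOrder (CustomerNo);

    -- ERP side: newer Windows collation, Unicode (hypothetical table).
    CREATE TABLE dbo.ErpCustomer
    (
        CustomerNo nvarchar(20) COLLATE Latin1_General_100_CI_AS NOT NULL PRIMARY KEY
    );

    -- The "natural" join. Left completely alone, the mismatched collations can throw the
    -- classic "Cannot resolve the collation conflict" error; once one side gets converted
    -- (explicitly here, or by a view or linked server), the conversion on the join column
    -- tends to show up as CONVERT_IMPLICIT and a scan instead of a seek in the plan,
    -- and that's where the CPU goes.
    SELECT o.OrderID
    FROM dbo.AppOrder AS o
    JOIN dbo.ErpCustomer AS c
        ON o.CustomerNo = c.CustomerNo COLLATE SQL_Latin1_General_CP1_CI_AS;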

Premature optimization may be a root of evil in software development, but inappropriate data typing is at least as evil in database development.

Anyway. The point of this post is not to rant and whine about the design of that particular system. I’ll save that for another day. That won’t stop me from complaining about the improper ways it’s used.

this is why we can't have nice things
Exactly.

As stated above, all these queries with all these JOINs on columns with mismatched collations lead to very sad pandas: RBAR operations (row-by-agonizing-row, unable to use index seeks) and high CPU.  Under duress, my 32-core servers ground to a halt with blocked sessions, ASYNC_NETWORK_IO waits, and 99% CPU utilization metrics.  Needless to say, these were not good days.

Side-note: I really enjoyed Bert’s post because it helped put things into very easy-to-understand terms, and is targeted at a wider audience than just the DBA.  Read it, and watch the video too!

Attempt to Treat the Symptoms

Because the ERP system has been upgraded a couple times in the last decade, the team came up with a [very thin] abstraction layer manifested as a series of “integration views” that UNION similar core entities from the older and newer incarnations of the database. Like Sales records. These have permeated many many areas of the ecosystem, to the point that almost all apps use them instead of the “raw” source tables. Which sounds great, right? Riiiight.

Until you throw in that monkey wrench of conflicting collations. Remember, Devs are lazy (in a good way). Their apps and queries need to not care about such a low-level detail as database collation. So to support that “not caring”, we set up these integration views to apply the older default collation (the one that matches everything else in the environment) to the output columns. That way, no extra work is required to consume them in the same way they’ve been consumed for the last 10+ years.

Basically, we can add the clause COLLATE DATABASE_DEFAULT after each column declaration of the view, like so (in the form “alias = sourceColumn”): CustomerNo = erp.CustomerNo COLLATE DATABASE_DEFAULT.
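
In view form, a stripped-down sketch looks something like this (reusing the hypothetical table from the earlier repro; the real integration views also UNION ALL the pre-upgrade ERP schema and carry far more columns):

    CREATE VIEW dbo.vwErpCustomer
    AS
    SELECT
        -- Re-collate each output column to the local (older, SQL-collated) database default,
        -- so every existing consumer keeps working unchanged.
        CustomerNo = erp.CustomerNo COLLATE DATABASE_DEFAULT
    FROM dbo.ErpCustomer AS erp;
    -- (The real views also UNION ALL the equivalent entity from the pre-upgrade ERP database.)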

This was a terrible idea.

It made sense at the time. But as the months passed and the complaints of performance degradation mounted, the signs continued to point back at these views which now used this collation-conversion mechanism (prior to the latest upgrade, they did not).

The typical work-arounds involved temp tables and/or going straight to the “raw” source. Neither of these is ideal — the latter breaks that abstraction (however thin it was), while the former risks over-burdening tempdb (sometimes referred to as the “communal toilet” of SQL Server). Generally this was acceptable, and often resulted in orders-of-magnitude improvements to performance. But it continued to rack up that technical debt.
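
The temp-table flavor looks roughly like this (again just a sketch; the linked-server and object names are hypothetical, and the CAST/COLLATE gets the copy into the local type and collation up front):

    -- Pull the needed ERP rows across the linked server once, landing them
    -- locally in the app DB's own type and collation...
    SELECT CustomerNo = CAST(c.CustomerNo AS varchar(20)) COLLATE DATABASE_DEFAULT
    INTO #ErpCustomer
    FROM [ERPSERVER].[ErpDB].dbo.Customer AS c;   -- hypothetical linked server & DB

    CREATE CLUSTERED INDEX IX_ErpCustomer_CustomerNo ON #ErpCustomer (CustomerNo);

    -- ...then join locally, where types and collations match and index seeks work again.
    SELECT o.OrderID
    FROM dbo.AppOrder AS o
    JOIN #ErpCustomer AS ec
        ON o.CustomerNo = ec.CustomerNo;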

One thing I tried was removing those collation conversions from all columns in the view except the join predicates. The consumers fed those values straight into the object/ORM layer, at which point they all became C# strings anyway, so it didn’t matter what collation they’d used or whether they were ANSI or Unicode. But alas, the core pitfall of these queries — that implicit conversion — was still very present, so performance still suffered.

Treating the Root Cause

Here I re-link the two scary articles that warn of the dangers and gotchas of changing your database and server (instance-level) default collations: StackOverflow answer, and blog post.  Given all that, it’s a daunting task.  But if we’re going to get our performance back up to snuff, it’ll probably have to happen at some point.  As the great Mike Rowe says…

it's a dirty job but somebody's gotta do it

And unlike the previous blogger, I will do my very best to follow up here and post about my journey as we undertake this harrowing trek.

The way I see it, there are three major attack vectors.

  1. We can try converting the ERP database to the old SQL collation.

Pros: smaller effort than #2, less integration/regression testing overall.
Cons: unsupported by ERP vendor, downtime for ERP system, high amount of risk.

  2. We can try converting all other DBs in the environment (across all SQL instances) to the new Windows collation to match that of the ERP DB.

Pros: supported by ERP vendor, future-proof, less tech-debt.
Cons: largest effort, heaviest testing burden, high risk.

  3. We could utilize some kind of data-replication to maintain copies of the required data on the other SQL instances in their default (matching) collation.

Pros: support not an issue, lowest effort & testing burden, lowest risk.
Cons: replication maintenance burden & overhead, loss of “real-time” (added data latency), and some tech-debt.

As the lone DBA, most if not all effort falls to me, so I’m quite inclined toward #3. And we were somewhat already going in this direction with the temp-table workarounds, i.e. pulling in the ERP data (usually via a linked-server) to the target app DB & using that temp-table for joins — essentially, that’s “lightweight replication”.

The technical debt we’re incurring here is that we’re leaving all of our legacy DBs (and servers) in the older SQL collation.  At some point, likely the far-future, these will be unsupported, or at least obsolete, in the sense that all new applications & DBs will prefer the newer Windows collation.  Perhaps during the next big “hardware refresh” cycle, i.e. when we have to plan and execute a big SQL Server upgrade/migration, we can consider integrating the collation-change into that project.
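
Scoping that out, at least, is cheap. Here’s a minimal inventory sketch using nothing beyond the standard catalog views and metadata functions (SERVERPROPERTY, sys.databases, sys.columns):

    -- Instance-level default collation.
    SELECT ServerCollation = SERVERPROPERTY('Collation');

    -- Per-database collations: who already matches the ERP DB, and who doesn't.
    SELECT name, collation_name
    FROM sys.databases
    ORDER BY name;

    -- Within the current database: columns whose collation differs from the DB default
    -- (these would still need attention even after a database-level collation change).
    SELECT OBJECT_SCHEMA_NAME(c.object_id) AS schema_name,
           OBJECT_NAME(c.object_id)        AS table_name,
           c.name                          AS column_name,
           c.collation_name
    FROM sys.columns AS c
    WHERE c.collation_name IS NOT NULL
      AND c.collation_name <> CONVERT(nvarchar(128), DATABASEPROPERTYEX(DB_NAME(), N'Collation'));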

But wait, you argue, what about a 4th option?

Oh sure, you mean say, “Screw it, not my problem!”…

  4. Force all ERP DB data access up the stack to the application layers, i.e. the apps pull the data into memory and join/merge it there (or relate to & interact with it however the developers want to, in that layer).

But this has several downsides that the business and the development teams would [justifiably] push back on: dev time & effort, the drastic-ness & unprecedented-ness of the change, the fear of not catching every single place & usage of the ERP in the mysterious myriad of apps that it seems nobody can ever quite get a unified visibility handle on (some of which they can’t even build or deploy anymore without diving down some seriously arcane rabbit-holes of ancient tech).  The risk is just too high.

More than that, and as much as I would love to say “ain’t my problem” and pass it off to a larger group of smart people, the fact is that over 50% of dependencies on the ERP DB are from my own BI/reporting queries.  There’s no shortage of reports that need to examine and relate legacy LOB app data with ERP customer & sales data.  And it’s not just to build a paginated SSRS report — where I could, arguably, do what I said above in faux-option 4: pull the data from 2 separate data-sets, then merge it in the report layer.  It’s heavily customized, painstakingly crafted mini-data-warehouses and data-marts that massage and tailor the data for an assortment of different purposes and reporting needs.  Thus, even with this, most of the burden still falls to me.

some guy with glasses posing under a globe like he's carrying the weight of the world
To be clear, this is NOT me; it’s the 4th pic I found on Google image search for “weight of the world” filtered for “noncommercial reuse” images. 😉  Kinda looks like Landry from Friday Night Lights, no?

 

So, to borrow a closing line from one of my favorite podcasts…

May your server lights blink, your database collations be identical, and your cables be cleanly managed.

How is Database Formed?

Data is digital information. A database is a collection of data. And a DBA manages it all.

Borrowing from an ‘old meme’ a bit.  My wife recently said I should “write something about ‘how to do databases’.”  As amusingly odd as her phrasing was, I figured she was right.

well like duh
Obviously. GAWD.

What is it?

I like to start at the beginning.  As Julie Andrews said, it’s a very good place to start.  What is a database?  That’s a pretty good question.  Here’s the prerequisite question: What is data?  Well, as I’ve said before, data is everything.  But that’s a bit of a cop-out, isn’t it?  That’s my career’s bias showing through.

Data is digital information.  Anything that can be quantified, specified, categorized, searched, sorted, produced, consumed, read, written, measured, and stored digitally in some fashion.  Data is the digital currency of the 21st century.  Data is the very reason that most technology exists — to house and transport data among producers and consumers of information.  It’s the evolutionary culmination of the stone tablet, the papyrus scroll, the bound book, the printing press, the newspaper, the library, the vinyl record, the magnetic tape, the compact disc, the pocket organizer, and the telephone.

So then, what is a database?  Simply put, it’s a collection of data.  The simplest analogy, depending on your age, is either a phone book or your cell phone’s contacts list (which is really just a phone book, in digital form).  Of course, with the latter, it’s not so much an analogy as an example — your phone’s contact list IS a database.

Fun side-note: the phone book also makes a decent discussion prop for some DBA topics, like index fragmentation.

Expanding on that example.  You can search and sort your contacts by several data points: first name, last name, phone #, email, notes.  Different database systems have various names for these: fields, columns, properties, criteria, values.  The point is, it’s all data.  Or if you want to get pedantic, each one is a datum, and together they are data.

Pedantic, me?  Never.

This is what a database, or DB for short, is all about: storing data in an organized fashion so that it can be sorted, searched, sliced and diced.  Building on that, a database management system is a set of technology tools, processes, and programs that are used to gather, store, manipulate, copy, move, read, maintain, back up, link together, and operate one or many databases.  This DBMS can come in many flavors.  I happen to specialize in one called SQL Server, a Microsoft product/platform of the ‘relational’ flavor — so if you’re following along with the abbreviation game, that’s an RDBMS.

If you’re hungry for more acronyms, the Wiki article on ‘databases’ has a decent breakdown of the types and history behind them.

But Why?

all your data are belong to us
Non-geeks might not understand this, so Google “all your base” if you’re confused.

The more data you have, the more you can do with it.  Why do you think Facebook, Google, Microsoft, and Amazon are such powerful technological forces?  They purposefully, systematically gather as much data as they can from every possible source, and they have become very good at organizing and managing that data to maximize its value.  Amazon product recommendations are a prime (see what I did there?) example — they are generally appropriate and effective because they have “learned” from your past purchases, i.e. your historical data.  This “learning” – Machine Learning, aka Data Science – is the hot new marketing buzzword of recent years, but it all still comes back to data at the core.

This is not a “bad thing” or a “scary thing” as the old media and tin-foil-hat-wearers would have you believe.  Yes, it can be a little disconcerting, and yes, people and companies can abuse data in malicious ways.  But the vast majority of our digital data stewards actually want to do good.  They want to connect you with more people that you may know and become friends with; they want you to watch movies that you’ll really enjoy; they want you to easily navigate to your destination without being stuck in traffic; they even want to help stop global warming!

As a general business rule, we crave data because it helps us make decisions.  Every time a customer buys a product, we want to measure “the 5 W’s”: who, what, when, where, and how (ok, that’s not a ‘W’, but there’s a reason for it).  Notice I didn’t list “why” there — only the customer knows why, and that information, that data, is stored inside their brain.  And we can’t (yet) access that data.  So it’s a guessing game now — we feed the other 5 data-points into our DBMS and eventually, given some time and analysis, we can guess the Why.  And pretty accurately, at that.  Then, we can make a decision to “Market more aggressively to Customer Type X”, or “Have a flash-sale on Product Y”, or “Move on this hot emerging market demographic.”

So what does that make you?

Well, I’m a Database Administrator – a DBA.  Which means I “administrate databases”.

‘Administrate’, less common form of ‘administer’: manage and be responsible for the running of.

Thanks, dictionary.  So in a nutshell, a DBA manages data.  Deceptively simple sounding, no?  I mean, what can data possibly do; it’s not alive, right?  Actually, if you hang around a DBA for any length of time, you’ll commonly hear the phrase “Where does that data live?” or “That set of data lives over here.”  So clearly we anthropomorphize our data.  Most tech professionals do that to whatever technology they work closely with — it’s human nature.  Software “behaves badly”, machines “throw a fit”, etc.

But anyway, why do databases need to be managed?  What can happen to them?

Developers.  Developers happen.  =D

I joke, as you know, dear reader; I love developers.  Users ‘happen’, too — often more catastrophically.  So it’s fair to say that “people happen”.  But besides that, here are some common reasons that databases, and data, need to be managed.

  • Data can be “wrong”.

Data can either be human-generated or machine-generated.  Fingers on a keyboard, or sensors on a circuit board.  You wouldn’t think the latter could possibly ever be “wrong”, but both kinds are subject to error.  It’s just that the level of “wrongness” is subjective and depends on who’s asking and what’s expected of the system as a whole.

  • Data gets lost.

Humans interact with and manipulate data, and humans make mistakes.  Why do you think the Undo button became such a staple of so many computer applications?

  • Data gets corrupted.

Storage media (magnetic disks, silicon chips, etc.) are not perfect — they have a documented level of fault tolerance and failure rate — so we need to ensure that our data is preserved (by moving it to another area that’s not ‘faulty’, usually) past those failures.  Why?  Because our data is essentially “more valuable” than the hardware on which it’s stored.

  • Data needs to be organized.

This is slightly more subjective than the above; how and why we organize data is highly dependent on the overall intent of the systems that will interact with it.  But fundamentally, if there’s not some form of organization, the data is effectively garbage.  If you ripped out every individual page in the phone book and scattered them all on the floor, it would no longer be an effective tool for finding someone’s phone number; it would just be a mess of papers.

  • Data needs to be useful.

If we can’t do something with the data, what’s the point of having it?  The temperature at the North Pole on January 1st 1989 is, by itself, inconsequential.  But a history of temperatures at the same and similar locations, over a long period of time, gives us some great value — we can see trends, look for anomalies, and even predict the future of what those temperatures might be.

  • Databases need to be available.

Similarly, if we can’t access the data, what good is it?  Databases are a technology, and like most technologies, they occasionally break.  Again, most of that comes back to humans, because humans are the ones writing the code that creates the software that houses the data and runs the database, or that interacts with it.  But of course we still have power failures, network losses, disk failures, and even solar flares.  (Ask your favorite superstitious engineer; they’ll have at least one good story about a system outage that could only be blamed on solar flares or gremlins or the full moon.)

  • Databases need to be maintained.

Every DBMS has some kind of assumed ongoing maintenance requirements to keep it “running smoothly”.  Just like your car needs an oil change every 3 to 8 thousand miles, your databases need periodic attention to retain all of those important qualities discussed above.

And the latest big topic, underscored by the GDPR:

  • Data needs to be governed.

This is a big topic for another conversation, but the gist of it is, data is generally “owned” by someone, and deciding who owns what, where it’s allowed to live, and how it’s allowed to be used, constitutes an entire sub-industry of rules, regulations, policies, tools, security practices, and consequences, much of which we’re only just beginning to shape and understand.

TL;DR: What do you actually do?

who is your DBA and what does he do
Umm… stuff? And thangs?

I currently work at a “small enterprise”, a business that has been around for some decades (as opposed to a Silicon Valley start-up that counts its anniversaries in months, like an infatuated teenager), managing their database systems.  Some of that is financial/accounting, some is customer info, some is internal/operational, and all of it is important to at least one person in the business in their daily decision-making efforts.

Thus, I help ensure that the data is always ready, when it’s needed, in whatever form & shape it’s needed in.  I model, massage, correct, enhance, and move it around.  I help developers write faster queries (that’s a fancy word for “questions” that we ask of our data); I aid analysts with understanding and gleaning more value from the data; I maintain the underlying systems that house the databases and ensure that they perform well and get upgraded when necessary; and I work with business drivers (VP’s, CxO’s) to build reporting solutions that leverage the data to enable better, smarter decisions, and ultimately (hopefully!) increase profit.  (This last part is actually crossing into the BI – Business Intelligence – job role, which tends to happen to most small-shop DBAs, because they’re usually in the best position to make that transition.)

If some of that sounds like a blurb from a résumé, it kinda is.  This job has existed since the 80’s.  But it’s always evolving, like the tech industry in general; so just because we’ve been around a while doesn’t mean we’re all old crusty bearded dudes.  (Although we do have some prolific beards among us!)

So there you have it.  Now you can tell your friends and family what a DBA does.  Or at least, hopefully, I’ve helped my own friends & family understand a bit about what I do.

T-SQL Tuesday #102: Giving Back

This month hosted by Mr. Major (scribnasium / @RileyMajor), whose past #tsql2sday post I may have missed but I love it all the same (BeyondCompare FTW!).  It’s about giving back to your community — mostly technical, i.e. the #SQLFamily / MS Data Platform community, but it can be about any community you’re a part of.

For starters, that’s what we blog for, at least partly.  I mostly blog about things I’ve learned or found interesting in my work, so that I don’t forget them later!  Of course, I’m always happy to learn that it helps someone else out there with a similar problem.  But there’s so much more to a community ecosystem than that.

SQL Saturday

SQL Saturdays are the perfect example of this.  I’ve been to 3 so far – 2 at Orange County, and 1 at San Diego.  I have to call out OC’s organizers, Ted Stathakis & .. the other guy.  Crap, I forgot his name.  He’s awesome though.  Srsly.  I talked with Ted a bit after a session at the last one and he truly is as nice as he is busy – 3 boys, all kinds of volunteer work, AND a full time job as the BI Director at Del Taco!  Wow!

I mean, it’s no Taco Bell, but still, Del Taco!  (Sorry, I had to..)  =P

I really want to volunteer to help at one of these.  I’m not sure what I’d do, but I know when the time comes, it’ll be worth it.

Lunch & Learns

To get my feet wet, I hosted a couple “lunch & learns” at my company with my Business Analyst and Development teams.  We talked about some basics, ranted about formatting a bit, tried to say goodbye to old habits from the SQL 2005 days (hello, date type!), and even dived into a few juicy things like performance-testing with IO stats and why you should never use scalar or multi-statement functions.  We also had a couple heart-to-hearts about DevOps and what it means in our environment.  (Hint: we’ve got a LOOONG way to go.)
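
(For the curious: the IO-stats demo was nothing fancier than the pattern below. Flip the session settings on, run the query under test, and read the logical reads and CPU/elapsed time from the Messages tab. The query here is just a harmless stand-in.)

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    -- Stand-in for the "before" and "after" versions of whatever query you're comparing.
    SELECT COUNT(*) AS object_count
    FROM sys.objects;

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;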

At some point I’m going to scrub my slides and post them on SlideShare or something.  I just have to remove any company-specific info first.  ;o)

As with most teaching efforts, it helped me learn (and re-learn) some things myself!  I had to read some documentation to get wording exactly right, and I built a few playground DBs on my local machine to try out different concepts & examples.  Plus, it forced me to justify WHY I held certain opinions or notions on things, and in doing so, realize my own mistakes of the past.  Thus, I became a better DBA just by reinforcing some good practices and updating my own assumptions.

the-more-you-know
Yay learning!

Generosity

If there’s one thing I’ve learned from the many tributes to our late #SQLFamily member SQLSoldier, it’s the importance of being generous with your time.  Whether that means answering to the #sqlhelp tag on Twitter, participating in the SQL Community Slack, answering StackOverflow / DBA.StackExchange questions, or just taking a few moments to help someone at your workplace with a tech problem — it all makes a difference.  I need to be better about this, as far as my “online presence” goes.  In-person, I’m always eager to help, but because I work remotely most of the time (yay!), it’s a bit of a moving-target — the in-office days get packed with meetings and critical face-time (not FaceTime), so the peer-to-peer stuff is often postponed or relegated to Slack.

However, there’s a flip-side to this.  In being generous, we can’t forget to prioritize ourselves and our families.  I’m always being asked “Why are you working so much!?” — well, I could try to explain the difference between ‘work work’ and ‘tech community involvement’ and ‘self-betterment / career planning’ but… yeah.

Sorry, honey, be done in a sec!  =)

Anyway..

I encourage you, dear reader, to give back to your own communities, in whatever form that takes — tech/online, real-life, etc.  I promise that it will benefit you in some new & unexpected way.  Plus, whenever you make a real solid connection with someone, that’s worth its weight in gold.

it's not what you know, it's who you know
ORLY? Yes, RLY!