In which we discuss more about the EAV model and some of its merits and pitfalls.
continued from last week…
The Ugly (aka the “Wow really?!?”)
You’ll see this ‘creep’ even in product-catalog systems as mature as Amazon’s. If you search for (geeky as we are) graphics cards, and use the product attribute filters in the left pane to narrow it down, you’ll find that some correctly have their memory type (GDDR5, etc.) listed, while others may not. If you’re really unfortunate, there will be two semi-redundant attribute-sets that you’ll have to “juggle” between to really get at what you want. TVs, for example, may see both an “HDR support” (yes/no) and an “HDR type” (standard, ultra, etc.) — I’m kinda pulling those out of my arse for example’s sake, but you get the point.
Why does this happen? Because at some level, humans are still responsible for ‘tagging’ those products (for lack of better word). And as much encouragement and guidance as the ‘admin software’ may give them, they can (and do) still decide at times to side-step that guidance and say “Nope, I know better; make me this new thing!”
But isn’t that a problem with nearly all data-driven systems? Sure, of course it is. Yet with a model as flexible as EAV, the problem is intensely magnified by the fact that it’s made so easy to do — to ‘extend’.
And unfortunately, the biggest contributor to this problem is the lack of development-time and forethought given to the administration, or management, of the data. You see, this problem could be largely assuaged if the admin-toolset were the FIRST thought and priority in the roadmap. But so often, that thought comes LAST, if at all. So sure, your product feature tagging system looks great, it’s flexible and your customers love it. But you’re throwing tickets over the wall to your data team every time a requirement or use-case changes, or when you need to fix a data-quality problem caused by the users not knowing there was already a “Widget Type” before creating their new “Widget Kind” tag, or misspelling “Is Wierd” because English is weird and has more exceptions to the “I before E” rule than not.
Does this problem go away with a dedicated search-index or NoSQL technology like Elasticsearch or (shudder) MongoDB? Of course not! If anything, it may be worse. Maybe. But wait, those systems make it easier to de-dupe and manage redundancy & data quality, don’t they? Probably. I can’t speak from experience myself, but I’ve heard good things. Once again, it all comes down to the effort you’re willing to invest in the system. If you make data quality a priority, you’ll be happier with the experience. If you don’t, well you’re just another amateur data scientist complaining about dirty non-standardized/non-validated address fields, aren’t ya? =P
I joke with the data scientists, of course. What they do is awesome. They just tend to often re-invent the wheel of data-cleansing/data-wrangling that we DBAs have been doing for a few decades, because they didn’t know the right questions to ask or the right place to look. We need to get better at working together WITH them, not ‘for’ or ‘against’ them.
The Why or When (aka “Is it a decent model for this?”)
The long-story-short version is, consider your business and your data. Try to plan for the future, and anticipate potential changes and growth. It’s not easy, and we never “get it right the first time”. But we can try.
When your attributes are fairly static, and you know that you can tightly control them, you might consider a more rigid model. Something with a handful of lookup tables referenced by the main product entity. This is advantageous for performance and management, at the expense of scalability and extensibility.
When you literally need to support on-the-fly extension, and you’re absolutely married to SQL (i.e. not ready to venture out into NoSQL land just yet), the EAV model may fit the bill. Aaron’s article, and the comments therein, present some fairly valid and reasonable implementation suggestions to make it a little more palatable. Just beware the date — that was written back in 2009. Before we had such things as Elasticsearch and its ilk. I’d heavily encourage the consideration of purpose-built data-stores for this sort of thing, if you have any hope of scaling-out.
Other tools in your toolbox can help with this, too. For example, consider an in-memory data-grid for super-fast reads. The vast majority of data-access to these attributes & values is going to be reading, using it to filter & slice & dice a data-set. You can pay the small performance cost (e.g. write to the underlying SQL database) on the rare occasion when a write/change needs to occur.
Proving the age-old rule of “Just because you CAN, doesn’t mean you SHOULD”, the EAV model is sometimes okay and sometimes not. You need to understand your business and your data to make that call. And you need to consider the magnitude of effort that may be involved in pivoting from one model to another. Unfortunately, in many cases, that part overshadows the rest, and the show business must go on.
Still, I encourage you to at least think about it, and be ready with that knowledge of pros/cons when the time is right to discuss it with stakeholders.
In which we discuss the EAV model and some of its merits and pitfalls.
EAV, or Entity-Attribute-Value, is an data model that’s been around the block. It’s typically injected into a relational database at some point during the overall application/architecture life-cycle, somewhere between when the team realizes that they’ve got way too many lookup tables for a “main business entity” thing, and when they finally make the shift into polyglot data stores.
Wake me up when that actually happens, successfully.
I’m not going to rehash aging internet arguments here, nor bore you with replicated diagrams that you could just as easily look up on Google Images. No, instead, I’m going to tell you why this model is not great, why it’s not bad, and how you should go about deciding if it’s not wrong for you.
Yes, I did negate all those adjectives on purpose. Mostly because nobody really enjoys working with these structures, regardless of how they’re implemented; customers and business stakeholders are ALWAYS CHANGING the requirements and the attributes in question. But, we press on.
Proponents tell us that this model is easily searchable and easy to administer. The “searchable” bit is true; especially when you have the attribute-values pre-defined and don’t rely on end-user text-entry. But that’s true of basically any data model. The key here is that all attribute-values are effectively in one “search index”. But wait, don’t we have purpose-built search indexes nowadays? (Hint: see Elasticsearch.) This will come up again later.
Administerable? Administrable? Administratable? Damn you English! Anyway. Yes, again, if you’re fairly confident in your business users’ ability to effectively track and de-dupe (de-duplicate) incoming requirements/requests using their own brains/eyeballs and the admin tool-set that you build for them.
Oh, right, did I mention that? You have to build the admin app. Because you do NOT want to be writing ad-hoc SQL queries every time a new attribute requirement comes in. (Still, it’s better than making schema changes for new req’s, as I’ll discuss in a bit.)
Mainly, though, the biggest ‘pro’ of this model is that your business requirements, i.e. your attributes and the values they’re allowed to contain, can be flexible. The model allows a theoretically infinite amount of customization to suit your needs; though in practice, as Allen writes in the CodingBlocks article, you do run up against some pretty hard scalability hurdles right-quick. So in practice, you might want to consider more horizontally-scalable data stores, or (God help you) try scaling-out your SQL databases. (Spoiler-alert: big money big money!)
The Bad (aka the “Not Great”)
Which brings me to the first ‘con’. Performance. If you’re modeling this in a relational DB, and you expect it to scale well, you’re probably overly optimistic. Or very small. (If the latter, great! But you don’t always want to be small, right? Right!)
Don’t get me wrong; you can make it work half-decent with good indexing and sufficient layers of abstraction (i.e. don’t write a “kitchen-sink view” that’s responsible for pivoting/piecing-together all the attributes for a product straight outta SQL). But again, is it really the right tool for the job?
Momentary digression. SQL Server, or more generally, the relational database, has long been touted as the “Swiss army knife” of IT; we’ve thrown it at so many problems of different size and shape, that we’ve almost lost track of what it’s actually very GOOD at. Hint: it’s about relationships and normalization.
Another argument against it seems to be data integrity and enforcement. I can understand that, but again, with some clever software overlay and user-guidance, this can become almost a non-issue. But remember, your developers are the ones building said software. So that effort needs to be considered.
The Ugly (to be continued…)
The biggest problem, and quite a legit one, is ‘creep’ — both scope and feature. See, the inherent flexibility in the model will almost encourage data managers to be less careful and considerate when deciding when to add an attribute or value-set, and how to govern the data-set as a whole.
Build a solid CS/IT foundation, build your soft-skills, and then learn the latest & greatest tech stack. Or the mature enterprise stuff. There’s plenty of room for both!
Our benevolentoverlord prompted us (months ago; I know, I’m always late to the party) to write about what we do in our day-to-day jobs as SQL Server professionals, aka #SQLCareer. The actual idea is to do 4 posts in a row, for 4 working days (either adjacently or the same day each week for 4 weeks).
Sticklers, us? Nope, never.
Without any kind of future guarantees, here is my first.
Emails, tickets, and alerts, oh my! Today was actually pretty quiet on the monitoring front — no major performance issues, no interruptions, and no job failures. That one that occasionally takes 8+ hours did okay last night. More often than not, I have to kill it if I find it’s been running into the work-day. That should be automated. /backlogged
Re-wrote some analyst’s data change script to be more “friendly” and set-based, instead relying on identity values — which is important when they can differ between Dev & Production environments. Deployed a few change requests, including that one.
On the side of less tedium, I fleshed-out more of the plan to virtualize our remaining bare-metal SQL servers. The usual ‘gotchas’ have to be discussed with the Server Admins — reserving compute resources, preventing the ‘noisy neighbor’ problem, and having enough storage to do the migration.
Yes, that means about 2x the storage as they’re currently using. Got a problem with that? No? Good. 😉
Finally, I worked on some code for querying the SSRS ReportServer database to obtain report metadata (including stored-proc usage, parameters, and user activity). The core concepts came from Ted Stathakis‘s SQLSaturdaysession on SSRS 2016. Hopefully, some of that code will be coming to GitHub soon.
“Delight” being sarcastic. No really, I love my job, but sometimes there are just some things that make you go ‘grr arg’.
First up, developing and finalizing some “data cleanup effort” reports. These are things that show the business users “what’s wrong with the data” and give them some guidance on how to “fix it”. Now, because there is no easy “management GUI” for this data, and because actual changes need to go through change-control, it’s not going to be a cake-walk. But given the right inputs and some domain knowledge, they can make the decisions as to what those changes should be, and then hand them off to the BA’s (business analysts). Usually in Excel form.
Next we have the ‘grr’ part. This is where I spent about 40 minutes trying to explain to the 3rd-party software vendor exactly where and how to integrate with our core data systems. Most of the info should not be ‘new’ to them, but since I never know which support tech will be assigned to the issue, it’s best to err on the side of verbosity. To make things more confusing, the folks on my end, who have been interfacing with the vendor thus far, aren’t intimately familiar with the underlying technology bits, so the conversation gets a little dicey before we all arrive on the same page.
Fortunately, since I had their attention, I was able to sneak in a suggestion to upgrade the back-end database for the software (which we self-host, but is essentially managed by them & their software) to something from this decade. MySQL 5.0 is old, guys, mmkay?
A developer needs a new table & some initial data-load to production; it’s passed QA. Great! They wrote the query a little weird, but nothing ol’ SQLPrompt can’t fix.
Commit outstanding stored-proc code & RDL files (SSRS reports) to source control. Even if I’m the only one working on them, I’d rather they be safe & track-able.
Ruh-roh. A developer set off some proc in a Dev db, it’s been running for over 30 minutes… and then they cancelled it! But wait, it’s stuck in rollback. Yes, kids, rolling-back a transaction can take even longer than the transaction itself, due to rollbacks being single-threaded.
Now, since the user’s hitting of the ‘Cancel’ button (aka “Stop” in SSMS) does not wholly and entirely cause a kill SPID command, I do that for them (since only sysadmin or processadmin can kill sessions). Then I run ye oldekill 64 with statusonly (64 happened to be the SPID) to check on its rollback status. Good news, everyone! It’s at… 0%.
And it stays at 0%. For fifteen minutes.
I move on to other tasks of course, since there’s nothing more I can do (short of pulling the plug on the server, which is never a good idea (unless you’re testing your Disaster Recovery Hope plan!). We also create a ticket to fix this stored-proc so that it doesn’t “take forever”, even if it has to churn through millions of records.
Finally, the statusonly check starts moving. It’s about 45% rolled back, with about “300 seconds remaining”. This, as anybody who’s been a DBA will tell you, is in “Microsoft time”. You know those progress-bars you typically get in Windows while something is happening? They’ll start off fast, make it to around 90-some percent, tell you there’s about 1 minute remaining, then take at least five minutes to “finish that last little bit”? Yeah, that’s a “Microsoft minute”. Patent-pending.
But fortunately for us, it does finish in just about the time it promised, maybe a bit longer. Just in time for the Dev to test what they wanted to test before going home for the day. And just in time for me to have my evening not ruined by a runaway transaction.
PS: Anybody who’s ever worked with SQL, including yours truly, has done this (set off a long transaction and/or tried to rollback a long transaction) at least once in their career. If you ask a seasoned DBA or DB-Dev, and they say “No, I’ve never done that!”, they’re a dirty rotten liar. =D
Are You Not Entertained?
If the life of a DBA sounds like fun to you, get in touch with your local IT recruiter and find out what the job market is like! You’ll hear all sorts of doom & gloom about how “the Cloud is taking over everything” and “the DBA is dead; long live DevSecDataOpsUnicorns!”.
No, I joke. Yes, some of the fundamental responsibilities of the DBA are shifting to the cloud vendors, to IaaS/PaaS services, but at the end of the day, being in IT is about learningtech and working with people. Build a solid CS/IT foundation, build your soft-skills, and then learn the latest & greatest tech stack. Or the mature enterprise stuff. There’s plenty of room for both!
Over-engineering may be foolish, but under-engineering is just as perilous.
This month’sT-SQL Tuesday topic, thanks to Wayne Sheffield, is a doozy. It’s difficult to admit failures, even if we eventually overcame them. Mostly, it’s tough to fess up to those hours upon hours of what feels like wasted time preceding the hopefully inevitable breakthrough.
Let me tell you a tale of square pegs and round holes.
Get your mind out of the gutter.
And strap in. This puppy went over 2k words. I didn’t intend it that way, but, c’est la vie.
A Short History Lesson
I used to work for a software company that made K-12 educational assessment & reporting products. A large part of K12ED is dealing with learning standards, these historically fraught named concepts that students are supposed to have “mastered” and learned during their time in the US public school system. You ever hear a reference in pop-culture or media referring to “teaching to the test” or “state standardized testing” or similar? That’s what they’re talking about.
In the late 90’s, a bunch of states, including California in 1997, finalized and centralized a “list” of sorts, almost a database (but not formally, because let’s face it, this was still the early days of what we know as modern IT), of said Standards, which came to be unofficially known as the “CA ’97 Standards”. For over a decade, these learning goals dictated what was taught when in our schools. Government funding incentives, based on state standardized test scores, drove instruction to the point that teachers had little if any time to dynamically mold their students’ minds into critically thinking, multi-faceted, creative individuals.
But this article isn’t about my bias. This is about the broad, sweeping shift in the K12ED landscape called “Common Core”. As the technology sector grew more mature and started permeating more “traditional” industries, a vocal minority of thought-leaders had what they deemed an epiphany.
Hey, what if we took the old-guard educational bureaucracy, and all those disparate state standards, and turned it into a unified, technology-driven learning ecosystem?
Sounds great in theory, right? I mean, surely their intentions were good? Well, you know what they say about the road to Hell and how it’s paved…
Gosh, there goes my bias again. Sorry, let me just tuck that back in its pocket.
Anyway. These new Core Standards introduced some new ways of thinking. For example, that some learning concepts are inter-related and “cross-cutting” (a fancy way of saying that sometimes a Math concept requires a fundamental Reading-Literacy knowledge-point to fully grasp). This had some very interesting impacts on the underlying technology systems which relied on, and housed, said Standards. System which, I might add, had existed for over a decade at this point, in many cases.
Bringing it Back to Data
Our company’s system was one such. Our key partner’s system, from which we quite literally inherited the traditional, relational database structure to house the CA ’97 Standards, was another. You see, back before RESTful APIs ran the world, software systems relied heavily on what we call “local data stores”. In order to show the teachers and administrators, who primarily used our system, the Standards, in which their students were performing well (or poorly), we had to relate those Standards to the test questions that said students were tested on month after month, year after year. And, like so many other small businesses of the late 90’s / early 00’s, we had a trusty ol’ SQL Server lying around, just waiting to be loaded with all our precious data.
This was fine, for the most part. The legacy Standards conformed reasonably well to a relational data model, even though we had to throw in a bit of hierarchy (using the good ol’ adjacency list scheme). There wasn’t a complicated set of relationships from Standards in different subjects (Math, Science, History) to each other, and people didn’t care to do much in-depth analysis beyond “are my students getting well-prepared for the state tests?”.
Enter Common Core
You parents thought these things were complicated and convoluted — just Google “common core math problem” and you’ll find no shortage of critical satire. Well, the underlying data structures required to store these new Standards were going to be significantly more complex as well.
One of my main jobs, for about a year or two, was to oversee the importation of said Core Standards into our SQL database system. On the surface, it seemed reasonable — we had a hierarchy concept already, and we had a roll-up & drill-down mechanism for the handful of different “levels” of said hierarchy. But it was all very static. What that means, for us tech-heads, is that it was not easy to change or extend; not adaptable to new and expanded requirements. The older Standards adhered to a fairly strict hierarchy, and each level of roll-up had a distinct name. With Common Core, they broke out of that old mold, while simultaneously keeping some of the names only to change their meaning depending on context.
Think of it this way. A group of cattle is called a herd. A group of sheep is also called a herd. And a group of seagulls is called a flock.
And I ra-a-an.. I ran so far a-wa-a-ay…
Sorry, where was I? Right, group names. So what if the government suddenly decided for us that a group of sheep will from now on be called a ‘gaggle’. But only if they’re all female. If the group contains any male sheep, it’s called a ‘herd’ still. And groups of cattle will be still be called herds, unless it’s purely a group of bulls being raised for beef, in which case we call it a ‘meatsock’.
Have I lost you yet? Of course I have! This is pure nonsense, right? Language does not work this way. Moreover, hierarchies of groups of things do not work this way.
But I digress. There was, despite my jest, a method to the madness of the new Common Core hierarchy groupings. And I did learn it and understand it, for the most part. The problem was that it threw our existing legacy data model to the wind.
Enter Academic Benchmarks
As usual with a legacy software system in a small company, the resources and buy-in for a data-layer overhaul were nil. So it fell to us intrepid DB-Devs to shove that snowflake-shaped peg into the very square hole of the old relational model. We sketched out plenty of ERDs, brainstormed how to tack-on to our existing structures, but nothing short of a miracle would make this new data conform to these old static norms.
Thankfully, the “geniuses” (and yes, that is used sarcastically) over at Academic Benchmarks, or AB for short (at least for the purposes of this post), had already done this. And we paid them, thousands of dollars per year, for the convenience of using their GUIDs to identify Standards across different systems and vendors. Never mind that they were just perpetuating the bad model of yesteryear; never mind that they provided zero support for data quality feedback loops. We could happily take their Excel files or CSVs and load them nearly straight into our database.
Enter, and Exit, ASN
While I was searching for the words to express how insufficient our data model was, I came across this little gem from the Gates Foundation: Achievement Standards Network, or ASN. (ASN stands for many other things, so as with all acronyms, it’s all about context; just fair warning for the Google-happy.) The architects here had understood that learning standards needed a better and more flexible model, not only in terms of storage, but also in terms of data interchange format. This new kid on the block called JSON had been making waves for a while, and was subsequently widely adopted by the tech industry in general, so it stood to reason that this would be the preferred format for publishing and serving the Standards from ASN.
Bonus: it was FREE. Yes, really. What a wonderful thought, I said to my team, to imagine never having to pay those crooks at AB another red cent! Eventually. After years of transition. But alas, it was not to be. See, AB had been around the industry for a longtime. They had their hooks in almost every learning content publisher and assessment vendor. So as tempting as this shiny new source of academic truth may have been, sadly, it was not practical.
Enter the Contractor
Somewhere around the same time, we took on a promising new developer who, not only had a very strong background in Math and CS fundamentals, but who had also proven his worth with real world applications with actual “in the wild” deployments and users. He was a bit arrogant, actually. One could argue that he’d earned it, perhaps, but we didn’t appreciate always being told everything we were doing wrong, constantly, to the point that it was hard to hear the more important bits of wisdom and insight into how we could improve.
Sadly, that last part came in too little, too late.
Side-note, on the crest of the micro-services wave, he diagrammed and proposed an entire system re-write for our core software product, using micro-services architecture and the concept of small purpose-dedicated data stores. It looked real purty, but was ultimately and completely impractical for a company of our size. If we were a “true startup” with millions in VC funding coming out the ears, and could afford to throw a brand new, young & hungry “2 pizza team” at the product, then sure. But that was not reality.
The Brick Wall
No two bones about it: we had to support these new Standards. So we soldiered on with our relational database tables. And we tacked-on additional entities and relationships, logic and code, to make it “almost” do what the Common Core Standards wanted to do. Effectively, what we were trying to do, was to shove that pretty round sparkly peg of graph data, into the dingy old square hole of a 15-year-old SQL table structure.
Somewhere along the way, AB caught up with the times and started offering other, less “flat” formats, for their Standards data. So even though ASN was off the table, not all of our work toward JSON ingestion/conversion went to waste. Consuming the data exports from the vendors wasn’t a problem — we’d already been doing this. That was not the issue.
The issue, the brick wall against which we continually banged our heads, was the fact that we just plain couldn’t support the advanced & complex relationships and groupings (categorizations) of the new Standards. Which turned out, in retrospect, not to be the end of the world, because honestly it would take years, if not decades, for the educational system’s old-guard mentality to even comprehend such relationships and categorizations, let alone how they could help shape classroom instruction toward better outcomes for their students.
Good lord, that sounded like a bunch of jargon.
What I mean, in plainer English, is that we spent a lot of time worrying and arguing about something that did not matter as much as we thought it did. The consumers of our system, the teachers and principals and such, honestly didn’t care about all that. They just wanted to know if their kids were on-track to be “Proficient” on the state tests so their funding would remain steady. (I don’t give them enough credit, I know; teachers themselves also needed to know, on a regular basis, which kids needed special attention in what areas of learning, but that’s beyond the scope of most generalized reporting tools.)
Hindsight is Always 20/20
So what was the lesson here? I don’t want to overlook the fact that we were still using the wrong data storage technology for the problem, fundamentally. Or at least, the wrong data model. But, we live in a real world where software needs to be delivered on-time, in-budget, with less than perfect teams who have less experience and expertise than they feel they should. So instead of butting our heads against that brick wall, let’s try to remember to be pragmatic. To be realistic about what’s feasible for who & when; and to adapt the parts and segments of our application infrastructure, appropriately and efficiently, to the business problems and use-cases at-hand. Over-engineering may be foolish, but under-engineering is just as perilous.
Go check out your AWS RDS SQL Error Logs! See what tickles your curiosity.
Today’s post is brought to you by an unplanned AWS RDS outage, the desire to search its SQL Error Logs for events during the target time-frame, and the horrendously inefficient interface which AWS gives you in the GUI console for reading said logs.
Even the dedicated ‘admin’ user that you create for your instance, doesn’t have permission to read the error logs via the SSMS GUI nor with xp_readerrorlog. You can, however, use sp_readerrorlog. That’s with an ‘S‘.
The parameters here are quite arcane, namely@1, @2, @3, @4. Unfortunately, none of them allow you to filter on a time-span (those are the 5th and 6th parameters of the underlying xp, which we don’t have access to, as per #1).
My solution involves a #temptable, a loop of insert / exec commands, and then whatever queries you desire to search/filter/analyze the results. I also throw in a conversion to local time (from UTC, which is what the AWS servers use).
Details and The Why
You can check out the script; it’s short enough to embed, IMHO, so I’ll just leave it right here.
Line 25-26 is particularly interesting to me, and only works with SQL 2016 and up. I got the idea from this StackOverflow answer. You can chain two AT TIME ZONE commands together to convert a given datetime value from one zone to another. The reason you still need the CONVERT is because the output of the AT TIME ZONE command is always a datetimeoffset type, which, while quite useful in its own right, has its quirks, and doesn’t serve our purposes for ease-of-readability.
If you’re not running 2016, at least in RDS, you’ve got nearly no excuse. The upgrade process is vastly simpler with RDS than with traditional on-prem servers. Although, I did run into my share of snags with it recently, which I’ll blog about later.
You should plug in whatever values suit your need & environment — the @NumLogFiles and @StartDate & @EndDate. I used 2-2:30am, because… well, that’s always when those damn outages seem to happen, ain’t it?
As I mentioned, “the Why” is basically because AWS RDS limits your permissions (even as an admin) in some key ways, and one of those limitations prevents you from reading the error logs in the more “normal” ways — SSMS GUI, xp_readerrorlog, etc. And the interface given to read the logs in the AWS console GUI is quite a sad-panda. They offer a wrapper proc rdsadmin.dbo.rds_read_error_log @index = 0, @type = 1 , which really only serves the purpose of naming the parameters more nicely; under the hood it’s doing the exact same thing.
Of course we can’t prove that, because it’s encrypted, but the results are identical.
So there you have it. Go check out your AWS RDS SQL Error Logs! See what tickles your curiosity. =)
All these queries with all these JOINs on columns with mis-matched collation, lead to very sad pandas: RBAR operations (row-by-agonizing-row, unable to use index-seeks), and high CPU.
It’s not even close to Halloween, but I promised I would get to this someday, and that someday is now. Strap in, grab the popcorn, and turn up the volume.
Oh wait, this is just a textual medium. Forget the volume bit.
If you’re not sure what this collation thing is, you should go back and read my teaser post, and as always there’s the docs. The one-liner recap on why it matters and how it became a target of my ire is this: legacy DBs use the old default SQL latin1 collation, but an upgraded ERP system’s DB now uses the new default Windows latin1 collation; these DBs all talk to each other, including linked-server JOINs, and performance of those queries has gone to shit.
Pardon the French. “Gone to hell in a hand-basket.”
So why did this happen? Let’s try to find out. But first; let’s get specific about how all this wailing and gnashing of teeth actually manifests in real technical terms.
Essentially what happens here is an implicit conversion problem. There are several blog posts from our distinguishedcommunityleaders on this topic and its woes. It causes heavy CPU load as the SQL engine tries desperately to match values of different data types. Even though both columns may be, say, nvarchar(20), the fact that one uses collation SQL_Latin1_General_CP1_CI_AS and the other uses Latin1_General_100_CI_AS, makes them somewhat strangers — they might as well be an INT and a varchar!
Now again, this is my example. We have dozens of critical application queries using their own little sandbox-y databases, joining to the central ERP system DB to fetch Customer or Sales data. This is already a recipe for sadness.
“Use a middle tier or service layer, dammit!”, you’d say.
“You’re preaching to the choir,” I’d reply.
Hell, you’re preaching to the preacher, at that point. But it’s not that simple, as I’ll touch on later.
There’s a subtle difference here, vs. those many community blog posts, which I’ll repeat. The columns are of the same type. Just different collations.
And when the collation on the join predicates is different, bad things happen. Let’s take CustomerNumber for example. On the ERP side, it’s a nvarchar(20)collateLatin1_General_100_CI_AS. On the internal & web apps side, it’s a varchar(20)collateSQL_Latin1_General_CP1_CI_AS. As you might imagine, this is a prime field for joining because it’s the main customer identified throughout all the systems.
Let’s be clear here. This is a numeric value only. Did it need to support Unicode? Absolutely not. Should it have been an int or bigint? Probably. But did The ERP designers choose to make it Unicode string anyway? Yep.
Premature optimization may be a root of evil in software development, but inappropriate data typing is at least as evil in database development.
Anyway. The point of this post is not to rant and whine about the design of that particular system. I’ll save that for another day. That won’t stop me from complaining about the improper ways it’s used.
As stated above, all these queries with all these JOINs on columns with mis-matched collation, lead to very sad pandas: RBAR operations (row-by-agonizing-row, unable to use index-seeks), and high CPU. Under duress, my 32 core servers ground to a halt with blocked sessions, ASYNC_NETWORK_IO waits, and 99% CPU utilization metrics. Needless to say, these were not gooddays.
Side-note: I really enjoyed Bert’s post because it helped put things into very easy-to-understand terms, and is targeted at a wider audience than just the DBA. Read it, and watch the video too!
Attempt to Treat the Symptoms
Because the ERP system has been upgraded a couple times in the last decade, the team came up with a [very thin] abstraction layer manifested as a series of “integration views” that UNION similar core entities from the older and newer incarnations of the database. Like Sales records. These have permeated many many areas of the ecosystem, to the point that almost all apps use them instead of the “raw” source tables. Which sounds great, right? Riiiight.
Until you throw in that monkey wrench of conflicting collations. Remember, Devs are lazy (in a good way). Their apps and queries need to not care about such a low level detail as database collation. So to support that “not caring”, we set up these integration views to apply the older default collation (the one that matches everything else in the environment) to the output columns. That way, no extra work is required to consume them in the same way they’ve been consumed for the last 10+ years.
Basically, we can add the keywords COLLATE DATABASE_DEFAULT after each column declaration of the view, like so (in the form “alias = sourceColumn”): CustomerNo = erp.CustomerNo COLLATE DATABASE_DEFAULT.
This was a terrible idea.
It made sense at the time. But as the months passed and the complaints of performance degradation mounted, the signs continued to point back at these views which now used this collation-conversion mechanism (prior to the latest upgrade, they did not).
The typical work-arounds involved temp tables and/or going straight to the “raw” source. Neither of these are ideal — the latter breaks that abstraction (however thin it was), while the former risks over-burdening tempdb (sometimes referred to as the “communal toilet” of SQL server). Generally this was acceptable, and often resulted in orders of magnitude improvement to performance. But it continued to rack up that technical debt.
One thing I tried was to remove those collation conversions from all columns in the view except the join predicates, because the consumers fed those values straight into the object or ORM layer, at which point they all became C# strings anyway, so it didn’t matter what collation they’d used or whether they were ANSI or Unicode at that point. But alas, because the core pitfall of these queries was still very present — that implicit conversion — performance still suffered.
Treating the Root Cause
Here I re-link the two scary articles that warn of the dangers and gotchas of changing your database and server (instance-level) default collations: StackOverflow answer, and blog post. Given all that, it’s a daunting task. But if we’re going to get our performance back up to snuff, it’ll probably have to happen at some point. As the great Mike Rowe says…
And unlike the previous blogger, I will do my very best to follow up here and post about my journey as we undertake this harrowing trek.
The way I see it, there are three major attack vectors.
We can try converting the ERP database to the old SQL collation.
Pros: smaller effort than #2, less integration/regression testing overall. Cons: unsupported by ERP vendor, downtime for ERP system, high amount of risk.
We can try converting all other DBs in the environment (across all SQL instances) to the new Windows collation to match that of the ERP DB.
Pros: supported by ERP vendor, future-proof, less tech-debt. Cons: largest effort, heaviest testing burden, high risk.
We could utilize some kind of data-replication to maintain copies of the required data on the other SQL instances in their default (matching) collation.
Pros: support not an issue, lowest effort & testing burden, lowest risk. Cons: replication maintenance burden & overhead, loss of “real-time” (added data latency), and some tech-debt.
As the lone DBA, most if not all effort falls to me, so I’m quite inclined toward #3. And we were somewhat already going in this direction with the temp-table workarounds, i.e. pulling in the ERP data (usually via a linked-server) to the target app DB & using that temp-table for joins — essentially, that’s “lightweight replication”.
The technical debt we’re incurring here is that we’re leaving all of our legacy DBs (and servers) in the older SQL collation. At some point, likely the far-future, these will be unsupported, or at least obsolete, in the sense that all new applications & DBs will prefer the newer Windows collation. Perhaps during the next big “hardware refresh” cycle, i.e. when we have to plan and execute a big SQL server upgrade/migration, we can consider integrating the collation-change into that project.
But wait, you argue, what about a 4th option?
Oh sure, you mean say, “Screw it, not my problem!”…
Force all ERP DB data access up the stack to the application layers, i.e. the apps pull the data into memory and join/merge it there (or relate to & interact with it however the developers want to, in that layer).
But this has several downsides that the business and the development teams would [justifiably] push back on: dev time & effort, the drastic-ness & unprecedented-ness of the change, the fear of not catching every single place & usage of the ERP in the mysterious myriad of apps that it seems nobody can ever quite get a unified visibility handle on (some of which they can’t even build or deploy anymore without diving down some seriously arcane rabbit-holes of ancient tech). The risk is just too high.
More than that, and as much as I would love to say “ain’t my problem” and pass it off to a larger group of smart people, the fact is that over 50% of dependencies on the ERP DB are from my own BI/reporting queries. There’s no shortage of reports that need to examine and relate legacy LOB app data with ERP customer & sales data. And it’s not just to build a paginated SSRS report — where I could, arguably, do what I said above in faux-option 4: pull the data from 2 separate data-sets, then merge it in the report layer. It’s heavily customized, painstakingly crafted mini-data-warehouses and data-marts that massage and tailor the data for an assortment of different purposes and reporting needs. Thus, even with this, most of the burden still falls to me.
Data is digital information. A database is a collection of data. And a DBA manages it all.
Borrowing from an ‘old meme’ a bit. My wife recently said I should “write something about ‘how to do databases’.” As amusingly odd as her phrasing was, I figured she was right.
What is it?
I like to start at beginning. As Julie Andrews said, it’s a very good place to start. What is a database? That’s a pretty good question. Here’s the prerequisite question: What is data? Well, as I’ve said before, data is everything. But that’s a bit of a cop-out, isn’t it? That’s my career’s bias showing through.
Data is digital information. Anything that can be quantified, specified, categorized, searched, sorted, produced, consumed, read, written, measured, and stored digitally in some fashion. Data is the digital currency of the 21st century. Data is the very reason that most technology exists — to house and transport data among producers and consumers of information. It’s the evolutionary culmination of the stone tablet, the papyrus scroll, the bound book, the printing press, the newspaper, the library, the vinyl record, the magnetic tape, the compact disc, the pocket organizer, and the telephone.
So then, what is a database? Simply put, it’s a collection of data. The simplest analogy, depending on your age, is either a phone book or your cell phone’s contacts list (which is really just a phone book, in digital form). Of course, with the latter, it’s not so much an analogy as an example — you phone’s contact list IS a database.
Fun side-note, the phone book also makes a decent discussion prop for some DBA topics like index fragmentation.
Expanding on that example. You can search and sort your contacts by several data points: first name, last name, phone #, email, notes. Different database systems have various names for these: fields, columns, properties, criteria, values. The point is, it’s all data. Or if you want to get pedantic, each one is a datum, and together they aredata.
Pedantic, me? Never.
This is what a database, or DB for short, is all about: storing data in an organized fashion so that it can be sorted, searched, sliced and diced. Building on that, a database management system is a set of technology tools, processes and programs, that are used to gather, store, manipulate, copy, move, read, maintain, back up, link together, and operate one or many databases. This DBMS can come in many flavors. I happen to specialize in one called SQL Server, a Microsoft product/platform of the ‘relational‘ flavor — so if you’re following along with the abbreviation game, that’s an RDBMS.
If you’re hungry for more acronyms, the Wiki article on ‘databases‘ has a decent breakdown of the types and history behind them.
The more data you have, the more you can do with it. Why do you think Facebook, Google, Microsoft, and Amazon are such powerful technological forces? They purposefully, systematically gather as much data as they can from every possible source, and they have become very good at organizing and managing that data to maximize its value. Amazon product recommendations are a prime (see what I did there?) example — they are generally appropriate and effective because they have “learned” from your past purchases, i.e. your historical data. This “learning” – Machine Learning, aka Data Science – is the hot new marketing buzzword of recent years, but it all still comes back to data at the core.
This is not a “bad thing” or a “scary thing” as the old media and tin-foil-hat-wearers would have you believe. Yes, it can be a little disconcerting, and yes, people and companies can abuse data in malicious ways. But the vast majority of our digital data stewards actually want to do good. They want to connect you with more people that you may know and become friends with; they want you to watch movies that you’ll really enjoy; they want you to easily navigate to your destination without being stuck in traffic; they even want to help stop global warming!
As a general business rule, we crave data because it helps us make decisions. Every time a customer buys a product, we want to measure “the 5 W’s”: who what when where and how (ok, that’s not a ‘W’, but there’s a reason for it). Notice I didn’t list “why” there — only the customer knows why, and that information, that data, is stored inside their brain. And we can’t (yet) access that data. So it’s a guessing game now — we feed the other 5 data-points into our DBMS and eventually, given some time and analysis, we can guess the Why. And pretty accurately, at that. Then, we can make a decision to “Market more aggressively to Customer Type X”, or “Have a flash-sale on Product Y”, or “Move on this hot emerging market demographic.”
So what does that make you?
Well, I’m a Database Administrator – a DBA. Which means I “administrate databases”.
‘Administrate’, less common form of ‘administer’: manage and be responsible for the running of.
Thanks, dictionary. So in a nutshell, a DBA manages data. Deceptively simple sounding, no? I mean, what can data possibly do; it’s not alive, right? Actually, if you hang around a DBA for any length of time, you’ll commonly hear the phrase “Where does that data live?” or “That set of data lives over here.” So clearly we anthropomorphize our data. Most tech professionals do that to whatever technology they work closely with — it’s human nature. Software “behaves badly”, machines “throw a fit”, etc.
But anyway, why do databases need to be managed? What can happen to them?
Developers. Developers happen. =D
I joke, as you know, dear reader; I love developers. Users ‘happen’, too — often more catastrophically. So it’s fair to say that “people happen”. But besides that, here are some common reasons that databases, and data, need to be managed.
Data can be “wrong”.
Data can either be human-generated or machine-generated. Fingers on a keyboard, or sensors on a circuit board. You wouldn’t think the latter could possibly ever be “wrong”, but both kinds are subject to error. It’s just that the level of “wrongness” is subjective and depends on who’s asking and what’s expected of the system as a whole.
Data gets lost.
Humans interact with and manipulate data, and humans make mistakes. Why do you think the Undo button became such a staple of so many computer applications?
Data gets corrupted.
Storage media (magnetic disks, silicon chips, etc.) are not perfect — they have a documented level of fault tolerance and failure rate — so we need to ensure that our data is preserved (by moving it to another area that’s not ‘faulty’, usually) past those failures. Why? Because our data is essentially “more valuable” than the hardware on which it’s stored.
Data needs to be organized.
This is slightly more subjective than the above; how and why we organize data is highly dependent on the overall intent of the systems that will interact with it. But fundamentally, if there’s not some form of organization, the data is effectively garbage. If you ripped out every individual page in the phonebook and scattered them all on the floor, it’s no longer an effective tool to find someone’s phone number; it’s just a mess of papers.
Data needs to be useful.
If we can’t do something with the data, what’s the point of having it? The temperature at the North Pole on January 1st 1989 is, by itself, inconsequential. But a history of temperatures at the same and similar locations, over a long period of time, gives us some great value — we can see trends, look for anomalies, and even predict the future of what those temperatures might be.
Databases need to be available.
Similarly, if we can’t access the data, what good is it? Databases are a technology, and like most technologies, they occasionally break. Again, most of that comes back to humans, because humans are the ones writing the code that creates the software that houses the data and runs the database, or that interacts with it. But of course we still have power failures, network losses, disk failures, and even solar flares. (Ask your favorite superstitious engineer; they’ll have at least one good story about a system outage that could only be blamed on solar flares or gremlins or the full moon.)
Databases need to be maintained.
Every DBMS has some kind of assumed ongoing maintenance requirements to keep it “running smoothly”. Just like your car needs an oil change every 3 to 8 thousand miles, your databases need periodic attention to retain all of those important qualities discussed above.
And the latest big topic, underscored by the GDPR:
Data needs to be governed.
This is a big topic for another conversation, but the gist of it is, data is generally “owned” by someone, and deciding who owns what, where it’s allowed to live, and how it’s allowed to be used, constitutes an entire sub-industry of rules, regulations, policies, tools, security practices, and consequences, much of which we’re only just beginning to shape and understand.
TL;DR: What do you actually do?
I currently work at a “small enterprise”, a business that has been around for some decades (as opposed to a Silicon Valley start-up who counts their anniversaries in months, like an infatuated teenager), managing their database systems. Some of that is financial/accounting, some is customer info, some is internal/operational, and all of it is important to at least one person in the business in their daily decision-making efforts.
Thus, I help ensure that the data is always ready, when it’s needed, in whatever form & shape it’s needed in. I model, massage, correct, enhance, and move it around. I help developers write faster queries (that’s a fancy word for “questions” that we ask of our data); I aide analysts with understanding and gleaning more value from the data; I maintain the underlying systems that house the databases and ensure that they perform well and get upgraded when necessary; and I work with business drivers (VP’s, CxO’s) to build reporting solutions that leverage the data to enable better, smarter decisions, and ultimately (hopefully!) increase profit. (This last part is actually crossing into the BI – Business Intelligence – job role, which tends to happen to most small-shop DBAs, because they’re usually in the best position to make that transition.)
If some of that sounds like a blurb from a résumé, it kinda is. This job has existed since the 80’s. But it’s always evolving, like the tech industry in general; so just because we’ve been around a while doesn’t mean we’re all old crusty bearded dudes. (Although we do have some prolific beards among us!)
So there you have it. Now you can tell your friends and family what a DBA does. Or at least, hopefully, I’ve helped my own friends & family understand a bit about what I do.