T-SQL Tuesday #112: Cookies!!

..this analogy of “dipping into the cookie jar”. What events or accomplishments can I take sustenance from, draw strength from, during these times?

Advertisements

tsql2sday150x150

Hi folks. It’s been a minute. Frankly it’s been a rough 5 months. From losing my wife, to dealing with the holidays and her birthday, to moving houses again. On the career front, I’m faced with the challenge of virtualizing our core business-critical SQL instances with minimal downtime. And obviously, because of all that personal/life stuff, it’s been difficult to stay focused and productive.

So this #tsql2sday‘s topic is poignant, I suppose — this analogy of “dipping into the cookie jar”. What events or accomplishments can I take sustenance from, draw strength from, during these times?

As usual, we must thank our host, Shane O’Neill (b|t), and remind readers to check out tsqltuesday.com.

Encouragement

Back when I first took this current job, I was worried sick about doing the commute every day. One and a half to two hours in traffic each way. Even if I used toll-roads, it might save 10-15 minutes, but it was still super stressful. But my wife never stopped encouraging me, telling me it would pay off. She put up with the crazy hours, she checked on me periodically, she stayed on the phone to keep me awake sometimes. She reminded me often, when the time was right, to have the telecommute/remote-work conversation with management.

And of course, to nobody’s surprise, she was right. I now work from home 4 days a week, take a vanpool the 5th day, and am much happier and more productive (in general!), much less stressed, and a markedly better driver. More importantly, because of her unwavering support, I can still look back and draw renewed energy from those memories and from her still-present spirit that stays with me in my heart.

the wife is always right
GOTO 1

Accomplishment

One of the first big projects on my plate was upgrading & migrating the SQL 2005 instances to new hardware and SQL 2016. We didn’t use anything super fancy for the actual migration, just backup shipping, essentially. DbaTools came in very handy for copying over the logins/users without having to hash/unhash passwords. The main hiccups were around Agent Jobs and Replication. Jobs were turned off at the old server before they were enabled on the new, but due to lack of documentation and understanding of some, it became difficult to see which ones were critically needed “now” vs. which could wait a while, and which may have dependencies on other SQL instances (via linked-servers) that may or may not have moved.

And replication is just a PITA in general. Fortunately, I took this as a ‘growth opportunity’ to more fully document and understand the replication environment we were dealing with. So at the end of the project, we had a full list of replication pub-subs with their articles, a sense of how long they each took to re-initialize, and a decision-process to consult when adding-to or updating articles within them.

Continuous Learning

A similar upgrade-migration project came to fruition in 2017: the ERP system upgrade. This was a delicious combo meal of both the database instance upgrade (SQL 2008R2 to 2016) and the application & related services/middleware (Dynamics NAV 2009 to 2017). And as I blogged about, it came with a super sticky horrible side-effect that we’re still dealing with over a year later: a different collation than the rest of our DB environment.

Which reminds me, I need to do some follow-up posts to that. Briefly, the “best” band-aids so far have been thus:

  1. If only dealing with the ERP entities, without joins to other DBs, just leave collations alone. The presentation layer won’t care (whether that SSRS or a .NET app).
  2. If relating (joining) ERP entities to other DB entities, I’ll load the ERP data into temp-tables that use the other DB’s collation, then join to those temp-tables. The “conversion” is essentially ‘free on write’, meaning when we load the weird-collation data into a temp-table that’s using the normal-collation, there’s no penalty for it.

As I said, I’ll try to dive into this more in a future post. It’s been a life-saver for performance problems that cropped up as a result of the upgrade & the different collations.

But the point here is that, even though this project didn’t end up as wildly successful as we’d hoped, it’s still a success, because we learned some key lessons and were able to pivot effectively to address the problems in a reasonable way. And frankly, there was no going back anyway; it’s not like the business would have said “Oh, never mind, we’ll just stick with the old versions of everything” (even though some reticent managers would probably have enjoyed that!). So even when things seem bleak, there’s always a way around.

when the going gets tough, the tough take a coffee break
Wait a minute…

Conclusion

I’m still trying to figure out what this new chapter of my life looks like, without her. I’m still trying to be the very best DBA & BI-Dev I can, supporting the dozens of requests & projects that the business throws at us. Fortunately, with the incredible SQL Community, a wonderfully supportive family, and a fantastic team of colleagues, I remember how far I’ve come, on whose shoulders I stand, and how much promise the future holds.

Even though the one person I was meant to share it all with is gone; she still smiles down and encourages me in subtle ways.

…with love & light ❤

What Marvel Movies Do I Need to Watch?

Welcome to the first post of the new year. I’ll be keeping things a little on the lighter side for now. I’m still very into my work and learning lots of share-worthy things in the data world. But for now, movies!

I also want to take a moment to appreciate those who reached out to us after our devastating loss. Thank you for your thoughts and prayers. Please continue to remember our family as we struggle to find a sense of normalcy.

So, some of my elder moviegoers asked me the question that many people have been asking over the last year or two: “What Marvel movies do I really need to watch before Infinity War?”, or more recently, “before End Game?”. More generally, which ones are worthwhile viewing to a casual non-geek, to someone who doesn’t need to obsess over every little minutiae, someone who is not by nature a “comic book movie lover”. It’s a completely fair question, and honestly it needs more.. less nerdy answers.

Hence, this post! Go read it on the new blog!

Just Another Brick (in the) Wall

Over-engineering may be foolish, but under-engineering is just as perilous.

This month’s T-SQL Tuesday topic, thanks to Wayne Sheffield, is a doozy.  It’s difficult to admit failures, even if we eventually overcame them.  Mostly, it’s tough to fess up to those hours upon hours of what feels like wasted time preceding the hopefully inevitable breakthrough.

Let me tell you a tale of square pegs and round holes.

Get your mind out of the gutter.

And strap in.  This puppy went over 2k words.  I didn’t intend it that way, but, c’est la vie.

A Short History Lesson

I used to work for a software company that made K-12 educational assessment & reporting products.  A large part of K12ED is dealing with learning standards, these historically fraught named concepts that students are supposed to have “mastered” and learned during their time in the US public school system.  You ever hear a reference in pop-culture or media referring to “teaching to the test” or “state standardized testing” or similar?  That’s what they’re talking about.

In the late 90’s, a bunch of states, including California in 1997, finalized and centralized a “list” of sorts, almost a database (but not formally, because let’s face it, this was still the early days of what we know as modern IT), of said Standards, which came to be unofficially known as the “CA ’97 Standards”.  For over a decade, these learning goals dictated what was taught when in our schools.  Government funding incentives, based on state standardized test scores, drove instruction to the point that teachers had little if any time to dynamically mold their students’ minds into critically thinking, multi-faceted, creative individuals.

But this article isn’t about my bias.  This is about the broad, sweeping shift in the K12ED landscape called “Common Core”.  As the technology sector grew more mature and started permeating more “traditional” industries, a vocal minority of thought-leaders had what they deemed an epiphany.

Hey, what if we took the old-guard educational bureaucracy, and all those disparate state standards, and turned it into a unified, technology-driven learning ecosystem?

Sounds great in theory, right?  I mean, surely their intentions were good?  Well, you know what they say about the road to Hell and how it’s paved…

Gosh, there goes my bias again.  Sorry, let me just tuck that back in its pocket.

Anyway.  These new Core Standards introduced some new ways of thinking.  For example, that some learning concepts are inter-related and “cross-cutting” (a fancy way of saying that sometimes a Math concept requires a fundamental Reading-Literacy knowledge-point to fully grasp).  This had some very interesting impacts on the underlying technology systems which relied on, and housed, said Standards.  System which, I might add, had existed for over a decade at this point, in many cases.

Bringing it Back to Data

Our company’s system was one such.  Our key partner’s system, from which we quite literally inherited the traditional, relational database structure to house the CA ’97 Standards, was another.  You see, back before RESTful APIs ran the world, software systems relied heavily on what we call “local data stores”.  In order to show the teachers and administrators, who primarily used our system, the Standards, in which their students were performing well (or poorly), we had to relate those Standards to the test questions that said students were tested on month after month, year after year.  And, like so many other small businesses of the late 90’s / early 00’s, we had a trusty ol’ SQL Server lying around, just waiting to be loaded with all our precious data.

This was fine, for the most part.  The legacy Standards conformed reasonably well to a relational data model, even though we had to throw in a bit of hierarchy (using the good ol’ adjacency list scheme).  There wasn’t a complicated set of relationships from Standards in different subjects (Math, Science, History) to each other, and people didn’t care to do much in-depth analysis beyond “are my students getting well-prepared for the state tests?”.

Enter Common Core

You parents thought these things were complicated and convoluted — just Google “common core math problem” and you’ll find no shortage of critical satire.  Well, the underlying data structures required to store these new Standards were going to be significantly more complex as well.

One of my main jobs, for about a year or two, was to oversee the importation of said Core Standards into our SQL database system.  On the surface, it seemed reasonable — we had a hierarchy concept already, and we had a roll-up & drill-down mechanism for the handful of different “levels” of said hierarchy.  But it was all very static.  What that means, for us tech-heads, is that it was not easy to change or extend; not adaptable to new and expanded requirements.  The older Standards adhered to a fairly strict hierarchy, and each level of roll-up had a distinct name.  With Common Core, they broke out of that old mold, while simultaneously keeping some of the names only to change their meaning depending on context.

Think of it this way.  A group of cattle is called a herd.  A group of sheep is also called a herd.  And a group of seagulls is called a flock.

And I ra-a-an.. I ran so far a-wa-a-ay…

Sorry, where was I?  Right, group names.  So what if the government suddenly decided for us that a group of sheep will from now on be called a ‘gaggle’.  But only if they’re all female.  If the group contains any male sheep, it’s called a ‘herd’ still.  And groups of cattle will be still be called herds, unless it’s purely a group of bulls being raised for beef, in which case we call it a ‘meatsock’.

Have I lost you yet?  Of course I have!  This is pure nonsense, right?  Language does not work this way.  Moreover, hierarchies of groups of things do not work this way.

But I digress.  There was, despite my jest, a method to the madness of the new Common Core hierarchy groupings.  And I did learn it and understand it, for the most part.  The problem was that it threw our existing legacy data model to the wind.

Enter Academic Benchmarks

As usual with a legacy software system in a small company, the resources and buy-in for a data-layer overhaul were nil.  So it fell to us intrepid DB-Devs to shove that snowflake-shaped peg into the very square hole of the old relational model.  We sketched out plenty of ERDs, brainstormed how to tack-on to our existing structures, but nothing short of a miracle would make this new data conform to these old static norms.

Thankfully, the “geniuses” (and yes, that is used sarcastically) over at Academic Benchmarks, or AB for short (at least for the purposes of this post), had already done this.  And we paid them, thousands of dollars per year, for the convenience of using their GUIDs to identify Standards across different systems and vendors.  Never mind that they were just perpetuating the bad model of yesteryear; never mind that they provided zero support for data quality feedback loops.  We could happily take their Excel files or CSVs and load them nearly straight into our database.

Enter, and Exit, ASN

While I was searching for the words to express how insufficient our data model was, I came across this little gem from the Gates Foundation: Achievement Standards Network, or ASN.  (ASN stands for many other things, so as with all acronyms, it’s all about context; just fair warning for the Google-happy.)  The architects here had understood that learning standards needed a better and more flexible model, not only in terms of storage, but also in terms of data interchange format.  This new kid on the block called JSON had been making waves for a while, and was subsequently widely adopted by the tech industry in general, so it stood to reason that this would be the preferred format for publishing and serving the Standards from ASN.

Bonus: it was FREE.  Yes, really.  What a wonderful thought, I said to my team, to imagine never having to pay those crooks at AB another red cent!  Eventually.  After years of transition.  But alas, it was not to be.  See, AB had been around the industry for a long time.  They had their hooks in almost every learning content publisher and assessment vendor.  So as tempting as this shiny new source of academic truth may have been, sadly, it was not practical.

Enter the Contractor

Somewhere around the same time, we took on a promising new developer who, not only had a very strong background in Math and CS fundamentals, but who had also proven his worth with real world applications with actual “in the wild” deployments and users.  He was a bit arrogant, actually.  One could argue that he’d earned it, perhaps, but we didn’t appreciate always being told everything we were doing wrong, constantly, to the point that it was hard to hear the more important bits of wisdom and insight into how we could improve.

Despite that ego, one of his biggest contributions to the team was a fresh impetus to learn new things and branch out to new technologies.  To start looking at new Javascript frameworks.  To revisit OO fundamentals like Dependency Injection, and stop writing such procedural code.  To consider NoSQL data stores.  In particular, graph data stores.

Sadly, that last part came in too little, too late.

Side-note, on the crest of the micro-services wave, he diagrammed and proposed an entire system re-write for our core software product, using micro-services architecture and the concept of small purpose-dedicated data stores.  It looked real purty, but was ultimately and completely impractical for a company of our size.  If we were a “true startup” with millions in VC funding coming out the ears, and could afford to throw a brand new, young & hungry “2 pizza team” at the product, then sure.  But that was not reality.

The Brick Wall

No two bones about it: we had to support these new Standards.  So we soldiered on with our relational database tables.  And we tacked-on additional entities and relationships, logic and code, to make it “almost” do what the Common Core Standards wanted to do.  Effectively, what we were trying to do, was to shove that pretty round sparkly peg of graph data, into the dingy old square hole of a 15-year-old SQL table structure.

Somewhere along the way, AB caught up with the times and started offering other, less “flat” formats, for their Standards data.  So even though ASN was off the table, not all of our work toward JSON ingestion/conversion went to waste.  Consuming the data exports from the vendors wasn’t a problem — we’d already been doing this.  That was not the issue.

The issue, the brick wall against which we continually banged our heads, was the fact that we just plain couldn’t support the advanced & complex relationships and groupings (categorizations) of the new Standards.  Which turned out, in retrospect, not to be the end of the world, because honestly it would take years, if not decades, for the educational system’s old-guard mentality to even comprehend such relationships and categorizations, let alone how they could help shape classroom instruction toward better outcomes for their students.

Good lord, that sounded like a bunch of jargon.

What I mean, in plainer English, is that we spent a lot of time worrying and arguing about something that did not matter as much as we thought it did.  The consumers of our system, the teachers and principals and such, honestly didn’t care about all that.  They just wanted to know if their kids were on-track to be “Proficient” on the state tests so their funding would remain steady.  (I don’t give them enough credit, I know; teachers themselves also needed to know, on a regular basis, which kids needed special attention in what areas of learning, but that’s beyond the scope of most generalized reporting tools.)

Hindsight is Always 20/20

So what was the lesson here?  I don’t want to overlook the fact that we were still using the wrong data storage technology for the problem, fundamentally.  Or at least, the wrong data model.  But, we live in a real world where software needs to be delivered on-time, in-budget, with less than perfect teams who have less experience and expertise than they feel they should.  So instead of butting our heads against that brick wall, let’s try to remember to be pragmatic.  To be realistic about what’s feasible for who & when; and to adapt the parts and segments of our application infrastructure, appropriately and efficiently, to the business problems and use-cases at-hand.  Over-engineering may be foolish, but under-engineering is just as perilous.

Know your Role

DBA does not mean “database archaeologist”, even though sometimes that’s what you end up doing…

Aka “Dammit Jim, I’m a DBA, not a data researcher!”  Or, as I stated on Twitter:

DBA != Database Archaeologist

penguin archaeologist
because… SQL ❤ Linux!

Today I spent more hours than I care to admit, tracking down some obscure data from two disparate systems in an attempt to reconcile what were supposed to be matching records.  Part of that is my own fault — I’m a sucker for interesting problems and edge cases, as I’ve blogged about before…

mostly just for the smug satisfaction of proving to the business that “your assumptions about how your data works are invalid“.

But mostly it’s because, the further back in time you go, the less reliable the data becomes.  Especially (exponentially) when that data originates from human free-form text input.

mr garrison manual inputs are bad
but GUI’s are OK.. and CLI’s are great!

Let’s contrive an example.  We have our core business product system, WidgetMaster, which tracks Widgets we ship out by WidgetNumber.  Our partner associate business runs an online widget exchange where people can buy and sell their Widgets in a sort of second-hand/after-market fashion.  PartnerExchange listings are supposed to include the WidgetNumber for ease of tracking and associating data between the two systems, even though they’re officially run by different companies (or in my case, different departments of the same company — yeah, think about that for a second).

Now, ideally, theoretically, those WidgetNumbers should always match up. But unfortunately, up until late 2014, our WidgetMaster system didn’t have an API that PartnerExchange could call to obtain a widget by its number; and even if we did, they have to at some level rely on the customer (or a worker) to read and enter that WidgetNumber into the exchange listing.  But wait, we started doing bar-codes back in 2010, so “most” of them are actually scanned from the bar-code, but not every customer has that capability, so there’s still a lot of hand entered data.

So we have some dirty data.  Let’s complicate things a bit. Over time, those widgets can come back to WidgetMaster for update/upgrade and then ship back out.  Again, ​WidgetNumber should remain consistent throughout that process.  Now, when PartnerExchange sells certain particular widgets, sometimes they’re part of a SuperSpecialCollection.  This collection spans many years, maybe even a decade or more. WidgetMaster got wind of this SuperSpecialCollection, being bought-up by Mr. HighRollerCustomer, so we started marking the incoming/outgoing records with a new property.

But it’s text.

It’s entered by the receiver, based on looking at the Widget’s buy/sell history in PartnerExchange.  And yes, the HighRollerCustomer who last bought the widget is aware that it’s part of their SuperSpecialCollection, but they aren’t guaranteed to specify that when they send the widget back in to WidgetMaster for upgrade.

Do we see the problem yet?

oh it gets better (again)
Yes, yes it does!

See, about 5 years ago, there was a reorg, and the dev team for WidgetMaster completely revamped the way in which “collection membership” for incoming widgets is designated/tracked. So now it’s over in some property table. To make matters worse, PartnerExchange renamed SuperSpecialCollection to AwesomeCltn a few years ago because they were tired of typing so many letters (and apparently fans of cryptic abbreviations).

Fortunately, PartnerExchange has done a decent job of at least storing the correct WidgetType and WidgetQuality in their listings, despite WidgetNumbers being fairly sparse.  But again, because over in WidgetMaster, we’re supposed to associate each WidgetNumber with the AwesomeCollection, we now have this secondary task of mapping unmatched WidgetNumbers across systems, by using Type and Quality from one side (partner) combined with Collection-membership from the other side (master), by assuming that the partner’s designation of SuperSpecial/AwesomeCollection is correct.

If your head’s not spinning yet, give yourself a round of applause. While rubbing your tummy and tapping your foot.

Needless to say, this is hard.  We’ll probably get the majority of records matched (mapped?) eventually by using a couple string LIKE predicates and some clever try/pass/retry flow, but it’s tedious at best.  Another bit of frustration will come up when we do a couple ad-hoc searches thru each system to attempt to apply reason and logic, i.e. find a pattern; and because we’ve already done the work, we might as well put that info into our results, even if it doesn’t show us a useful pattern by itself.

So how do we approach this?  We’ll as I said, I spent what I felt was too much time on it, but essentially I did an initial “majority rules” mapping attempt (first pass), followed by a few reconciliation attempts for the remainders.  Those consisted of fairly identifiable patterns with a couple outliers. With those outliers, as with the rest of the unmapped records at the end of the day, I had to tell the business, basically, “Here’s the majority of the results. You can assign a research specialist or analyst to the rest, if you feel it’s that important.”

I may have done this with slightly more attitude than necessary.

How could we improve this?  The bigger geeks in the room may pipe up with “machine learning!”  Ok sparky, do you have that infrastructure ready to go?  No?  How long for implementation?  mumble mumble something about Azure mumble…  Okay, sure, how about training the model so you can feed it your data?  Cool, well enjoy the incredulous laugh you’ll get when you tell the manager that.

How about other tool sets? Sure,  we could check out Python or R, write a C# app maybe?  Well guess what, we still need to look at the data to understand what patterns (or lack thereof) there are to work with.  And again, lead time.  Unfamiliar tools means longer development cycles.  And they’re really not guaranteed to produce any better results (more matches) at the end of the day, are they?

Because your data models and your analyses are only as good as the data itself.

data is the iceberg
Or is it “data are“?

And with that, I’ll call it a day.  Thanks for reading!