Aka “Dammit Jim, I’m a DBA, not a data researcher!” Or, as I stated on Twitter:
DBA != Database Archaeologist
Today I spent more hours than I care to admit, tracking down some obscure data from two disparate systems in an attempt to reconcile what were supposed to be matching records. Part of that is my own fault — I’m a sucker for interesting problems and edge cases, as I’ve blogged about before…
mostly just for the smug satisfaction of proving to the business that “your assumptions about how your data works are invalid“.
But mostly it’s because, the further back in time you go, the less reliable the data becomes. Especially (exponentially) when that data originates from human free-form text input.
Let’s contrive an example. We have our core business product system, WidgetMaster, which tracks Widgets we ship out by
WidgetNumber. Our partner associate business runs an online widget exchange where people can buy and sell their Widgets in a sort of second-hand/after-market fashion. PartnerExchange listings are supposed to include the
WidgetNumber for ease of tracking and associating data between the two systems, even though they’re officially run by different companies (or in my case, different departments of the same company — yeah, think about that for a second).
Now, ideally, theoretically, those
WidgetNumbers should always match up. But unfortunately, up until late 2014, our WidgetMaster system didn’t have an API that PartnerExchange could call to obtain a widget by its number; and even if we did, they have to at some level rely on the customer (or a worker) to read and enter that
WidgetNumber into the exchange listing. But wait, we started doing bar-codes back in 2010, so “most” of them are actually scanned from the bar-code, but not every customer has that capability, so there’s still a lot of hand entered data.
So we have some dirty data. Let’s complicate things a bit. Over time, those widgets can come back to WidgetMaster for update/upgrade and then ship back out. Again,
WidgetNumber should remain consistent throughout that process. Now, when PartnerExchange sells certain particular widgets, sometimes they’re part of a SuperSpecialCollection. This collection spans many years, maybe even a decade or more. WidgetMaster got wind of this SuperSpecialCollection, being bought-up by Mr. HighRollerCustomer, so we started marking the incoming/outgoing records with a new property.
But it’s text.
It’s entered by the receiver, based on looking at the Widget’s buy/sell history in PartnerExchange. And yes, the HighRollerCustomer who last bought the widget is aware that it’s part of their SuperSpecialCollection, but they aren’t guaranteed to specify that when they send the widget back in to WidgetMaster for upgrade.
Do we see the problem yet?
See, about 5 years ago, there was a reorg, and the dev team for WidgetMaster completely revamped the way in which “collection membership” for incoming widgets is designated/tracked. So now it’s over in some property table. To make matters worse, PartnerExchange renamed SuperSpecialCollection to AwesomeCltn a few years ago because they were tired of typing so many letters (and apparently fans of cryptic abbreviations).
Fortunately, PartnerExchange has done a decent job of at least storing the correct
WidgetQuality in their listings, despite
WidgetNumbers being fairly sparse. But again, because over in WidgetMaster, we’re supposed to associate each
WidgetNumber with the AwesomeCollection, we now have this secondary task of mapping unmatched
WidgetNumbers across systems, by using Type and Quality from one side (partner) combined with Collection-membership from the other side (master), by assuming that the partner’s designation of SuperSpecial/AwesomeCollection is correct.
If your head’s not spinning yet, give yourself a round of applause. While rubbing your tummy and tapping your foot.
Needless to say, this is hard. We’ll probably get the majority of records matched (mapped?) eventually by using a couple string
LIKE predicates and some clever try/pass/retry flow, but it’s tedious at best. Another bit of frustration will come up when we do a couple ad-hoc searches thru each system to attempt to apply reason and logic, i.e. find a pattern; and because we’ve already done the work, we might as well put that info into our results, even if it doesn’t show us a useful pattern by itself.
So how do we approach this? We’ll as I said, I spent what I felt was too much time on it, but essentially I did an initial “majority rules” mapping attempt (first pass), followed by a few reconciliation attempts for the remainders. Those consisted of fairly identifiable patterns with a couple outliers. With those outliers, as with the rest of the unmapped records at the end of the day, I had to tell the business, basically, “Here’s the majority of the results. You can assign a research specialist or analyst to the rest, if you feel it’s that important.”
I may have done this with slightly more attitude than necessary.
How could we improve this? The bigger geeks in the room may pipe up with “machine learning!” Ok sparky, do you have that infrastructure ready to go? No? How long for implementation?
mumble mumble something about Azure
mumble… Okay, sure, how about training the model so you can feed it your data? Cool, well enjoy the incredulous laugh you’ll get when you tell the manager that.
How about other tool sets? Sure, we could check out
R, write a
C# app maybe? Well guess what, we still need to look at the data to understand what patterns (or lack thereof) there are to work with. And again, lead time. Unfamiliar tools means longer development cycles. And they’re really not guaranteed to produce any better results (more matches) at the end of the day, are they?
Because your data models and your analyses are only as good as the data itself.
And with that, I’ll call it a day. Thanks for reading!