TSQL Tuesday 95: Big Data

This month’s party brought to you by Mr. Hammer (b|t).

mc-hammer
No, not THAT one…

I apologize in advance for all the hammertime memes.  It was just too good to pass up.  Surely he must be used to this.  Or at least not surprised by it.  =D

So, Big Data.  What is it?  Well, in simple terms, it’s the realization and acceptance of the fact that data is multi-model, multi-faceted, multi-sourced, and constantly growing.  It’s the fact that the traditional RDBMS is no longer the be-all end-all source of truth and valuable information.  It’s part of a larger ecosystem involving JSON document stores, CSV files, streaming volatile bits of data coming from random devices and user activity that loses its meaning and potential impact almost as quickly as it can be gathered and sifted and stored.

But what do we actually get out of it?  As a small-medium enterprise NOT in the software business, I have to say, not as much as the hype would have us believe.  And look, I’m not so jaded and crusty that I refuse to adapt new tech.  I Just haven’t seen a meaningful transformative business use-case for it.  Sure, we have Google Analytics telling us how our websites are doing, and someone in marketing knows something about trending our social media traffic.  Does it really help us make more money?  Heck if I know.

cease thy actions, my timepiece has indicated the necessity of mallets
Old-timey colonials can even dig it…

Here’s what I’d like to see from the thought leaders.  Give me something I can chew on — a real-world, non-hypothetical, non-frivolous, impactful use-case for adopting and implementing something like Hadoop/Spark or Azure Data Lake.  Show me how my business can realistically journey down the path of predictive analytics and what it’s going to take from our Devs, IT staff, and management to actually get there.

Because they don’t get it yet.  I have managers still worrying about how much we’re spending on a dinky little flash storage array to support the growing needs of our on-prem converged infrastructure stack.  Meanwhile the AWS bill continues to baffle, and Devs want to play with Docker and Lambda.  But we can’t seem to convince the higher-ups that they’re short-staffed on the internal-apps team, even after a minor version upgrade takes 4 hours of Ops time and half a dozen end-users doing post-mortem testing just to be sure we didn’t break anything unexpected.

I’m not here to complain.  Really.  I do want to see something amazing, something inspiring, something that shows me what Big Data truly brings to the table.  And sure, I’ve see the vendor demos; they’re all just a bit outlandish, no?  I mean, they look really cool, sure — who doesn’t want to see a chord diagram of who’s killed who is GoT? — but does that really help my business improve sales and productivity?

My point is, there’s a gap.  A chasm of misunderstanding and mis-matched expectations between what management thinks Big Data is/means, and what it takes to actually implement.  They see the pretty pictures and the fancy demos, but they don’t see the toil and sweat (or at least, in the cloud, gobs of cash) that go into building & operating the underpinnings and pipelines that drive those nice graphics.  Not to mention the fundamental issues of data quality and governance.

continue not, time for hammer it is
OK OK, last one, I swear…

So do us a favor, Big Data pundits.  Show us something real, something that “the little guy” can use to up his/her game in the market.  Something that makes a positive impact on small non-startup non-software businesses with understaffed IT & Dev teams.  But more importantly, stop glossing over the effort and resources that it takes to “do Big Data right“.  Managers and executives need to understand that it’s not magic.  And IT practitioners need to understand that it’s actually worth-while.  Because I believe you — really — that the payoff in the end is there, and is good.  But you need to convince the whole stack.


PS: I know this is a fully day late for T-SQL Tuesday, and as such, I wasn’t going to post a ping-back in the comments of the invite, but then I saw there were only 8 others, so I felt it would benefit the event if I did add my late contribution.  I’ll tweet with a modified hash-tag instead of the standard #tsql2sday, to reflect my late-ness.  Hopefully that’s a fair compromise to the community & the event’s intentions.  =)

Advertisements

SQL Server Performance Troubleshooting Free Scripts and Tools List

Originally posted on David Peter Hansen:
Back in the days, I used to collect a lot of different scripts, tools, and other goodies for troubleshooting SQL Server performance issues. These days, however, I tend to use what is publicly and freely available (as well as some internal stuff), and keep a list of those in my…

I don’t normally reblog.  But when I do, it’s something awesome. =D

David Peter Hansen

Back in the days, I used to collect a lot of different scripts, tools, and other goodies for troubleshooting SQL Server performance issues. These days, however, I tend to use what is publicly and freely available (as well as some internal stuff), and keep a list of those in my head.

I’ve meant to write that list down for a while, and today Chrissy asked:

So here it is…

Disclaimer: While I do work as a Premier Field Engineer for Microsoft, this is my list – this is not an official list from my employer. Others likely have a list that differs from mine.

Free scripts and tools from Microsoft
These are scripts and tools provided…

View original post 892 more words

Dirty Laundry

It’s time for a more thought-y, less tech-y post.  Which is mostly my excuse for not wanting to write a bunch of code at the moment.  But that’s how I started this blog, with mostly opinion pieces, trying to offer some critical thinking on how DBAs and Developers work together.  So y’all better like it!

Today’s title is brought to you by Don Henley’s tune of the same name, which is now stuck in my head, thankyouverymuch.

dirty laundry goes in a basket not in a database
Paint.net is my friend… =D

This is about data quality.  When you have “dirty data”, just like dirty laundry, and you let it sit unattended, it starts to smell.  In software, this means the “badness” seeps into other areas of the environment, affecting systems and business processes that should otherwise function smoothly.

code smell is a surface indication that usually corresponds to a deeper problem in the system.

-Martin Fowler

And, more aptly:

Data quality is corporate America’s dirty little secret.

-Paul Gillen

But what is dirty data?  Generally, it’s anything that doesn’t quite fit the ideal data model — that perfect vision of how all the bits of information in the system fit together, the shape of each data entity and how they relate to each other.  Mostly, dirty data is what happens when you allow users to type things into text-boxes, and you write those text-box contents straight into the database without any layers of validation or cleansing.  (Coincidentally, that’s also how SQL injection happens, but most of us have been scared-straight by enough years of security bloggers hammering at our thick skulls — and our favorite XKCD — that we at least sanitize our inputs before dumping them to an INSERT statement.)

Let me take a recent example from my experience.  We have an ERP system that doubles as our CRM system (which is already a pair of bad idea jeans).  How do you think customer information gets into the database?  Customer Service Reps, typing stuff.  Usually by copying from a paper form.  Or the customers themselves, using an online form.  But guess what doesn’t happen in either case?  If you said “USPS address validation“, give yourself a hand!

joker give yourself a clap
Oh goooood for youuuuuu…. </Christian Bale>

Now, being that this system is our “source of truth” for customer info, it stands to reason that lots of other business functions & processes depend on it.  For example, let’s say we send a promotional calendar to our customers of a certain “subscription level” on a yearly basis.  We’re not in the publishing business, so we contract this out to another company.  But guess what they need from us in order to complete the job and mail out those calendars?  Addresses!  So what happens when there’s a bad address in our database?  A calendar gets returned, wasted cost and materials.  Multiply that by a couple thousand and you start to turn a few heads in the C-suite.

Later, around the Marketing table, someone has a brilliant idea that they need to run a mail-merge to send out a gift-package to the top 100 customers.  So they ask the DBA for a list of said customers.  “Sure!  Here ya go, here’s a report.”  And then the complaints start coming in.

“These customers aren’t active anymore.”

Then tell your CS reps to mark them as inactive in the system.  But no, we don’t do that, we just write “inactive” in the FirstName field.

“These ones are employees.”

Fine, figure out a special indicator to add for that, so I can exclude them from the report.  But no, of course, we can’t do that either; we just put “deactivated” in the FirstName field.

“This guys is dead.”

Yeah, not even kidding.  Apparently the powers-that-be decided to keep his info in the system, but type in “deceased” to the “Address 2” line (in the US, this is customarily the apartment/suite/unit number).

he's dead jim
Let’s beam him back up but write “deceased” on his badge, that’ll be sufficient.

But mostly, the biggest complaint is that we’re getting un-deliverable/return-to-sender when we try shipping out to some of these addresses.  And why?  Because they’re not subject to any external validation and quality-control.

So what’s the data professional’s responsibility in this?  In my opinion, it’s to advocate for data quality.  There are obviously big vendors out there like Melissa Data who will sell you a service to help get you there.  APIs abound, from USPS and other official sources, so building it isn’t out of the question.

One potential roadblock is, as usual, conservatism.  The business’s ERP system is its life-blood, highly sensitive to change and very guarded by over-protective management and finicky executives.  But the smelly dirty data-laundry continues to cause problems and has real-money impacts on corp. efficiency and profit.  Unfortunately, many people tend to take the ostrich approach.

if you bury your head in the sand your ass will get burnt
No idea who this Bennett person is, but they sound smart.

So, my good people, start “doing your laundry”.  Have those conversations with your teams and managers about the current state of your data quality, and what it’s going to look like moving forward.  Make some plans, have a road-map, and understand that it’s going to involve a lot of collaboration between key players.  And good luck!

Quickie: TempDB on local SSD

What could possibly go wrong? As it turns out, plenty.

Faithful reader(s), it’s been a while!  I’ve been busy preparing for some big transitions.  I’m also getting better at MDX queries, tweaking SSAS-based reports to more accurately reflect the business rules.  But enough about that, on with the post!

In which we doubt the SAN

A storage area network (SAN) is a management & administration solution, not a performance solution.

-someone wiser than me

SANs are wonderful technology.  They inspire all kinds of geekery and are purported to solve all your storage woes.  But there’s a catch: they’re expensive.  Not just as a capital expense, but in maintenance and licensing costs.  And if you ever want to upgrade it, like add some more drives to a particular tier/pool — fuhgeddaboudit.

So what do we do with SQL on a SAN?  Well, it has tiers, right?  Slower storage with huge capacity, faster storage with less, etc.  We put the data files (heavy random read workload, typically) on the pool optimized for that kind of I/O pattern.  We put the TLog files (heavy sequential write workload) on the pool best suited for that.  And what about good ol’ TempDB?  Its access pattern is fairly unique — random writes and reads, and frequent overwrites, which means it could potentially wear out your typical prosumer SSD relatively quickly.  But we’re not complete cheapskates, we’ll buy enterprise class SSDs, no?

So we go read some stuff and figure, hey, sounds like a great idea, right?  Put TempDB on a local SSD, or better yet, a pair of SSDs in RAID-0 for pure performance (because this is a cluster, we’ve got HA already).  We’ll reduce the load on the SAN I/O channels and make our overworked TempDB happier with lower latency and better throughput.  Right?

what could possibly go wrong
ooh, sparkly!

 

In which we discover what could possibly go wrong.

Once the new drive(s) is(are) installed and “presented” to Windows (that’s my SysAdmin’s term), it’s fairly trivial to do the SQL configuration change — it does of course require a SQL service restart (or cluster failover).  Code example, assuming your new drive is ‘T’:

use master;
alter database tempdb
modify file (name=tempdev, filename='T:\tempdb.mdf')
alter database tempdb
modify file (name=tempdb2, filename='T:\tempdb2.ndf')
--etc...

You do of course have multiple TempDB data files, yes?  Good.

Side-debate

Should we put templog (TempDB’s transaction log) on the same drive as the TempDB data files, or put it on the same storage pool as the regular DBs’ TLogs?  As usual, “it depends” — ask your favorite SQL gurus and do some testing.

Back on topic

We’ve made the change, we’ve done the cluster failover.  TempDB is now running on our spankin’ new SSD.  So we start monitoring performance metrics.  Things like file I/O stats (from SQL DMV sys.dm_io_virtual_file_stats), latency and waits (from our monitoring tools), and good ol’ PerfMon.

But wait, what’s this?  I/O stalls are higher?  Write latency is higher?!?  Perfmon agrees?

nooooooooooooo-vader
why god why?!?

Write latency on the TempDB files was over 10x higher than it was when they were on the SAN (the performance tier, to be clear).  The file_stats DMV showed large increases in I/O stalls.  Sad-trombone.

 

In which we have several theories

Then ensued various conversations and brainstorms among my colleagues.

Someone check the firmware/drivers!

It’s up-to-date.

Maybe it’s got the wrong block-size.

Nope, 64k.

Well, it’s only 6Gbps SAS… maybe we should’ve sprung for the 12Gbps.

The write latencies went up by a factor of 10.  I don’t think an improvement by a factor of 2 is going to win you any trophies.

Why didn’t we get an NVMe or M.2 one?

Because the damn blades don’t have those slots, goober.

Another interesting observation, and potentially the silver lining.  Overall instance waits (wait stats), according to our monitoring tool, went down.  That’s good news, right?  Maybe.  Does application performance & user experience corroborate it?  Possibly!  We’ll be observing the patient for another week or so.

Let’s turn to the community again to see what others have experience.

In which we eat some crow

And by “we” I mean “me”.  Being the DBA and the primary proponent of the SSD addition, because I knew our workloads were very TempDB-heavy, I had to hang-tail and admit that the SAN gods won this round.

Maybe.

But wait, what about the fact that our wait stats are down?  What about app/user experience?  Valid arguments, I agree.  That’s why we’re still observing.  But I’m not optimistic, given the follow-up links above.  We may utilize local SSDs for something else (index filegroups?) — but if those write latencies don’t improve, I’m concerned that it won’t help anybody.

keep calm because only time will tell
And beer. Lots of beer.

In which I ask for your help

Yes, you!  If you have ideas on what we did wrong, what we’re missing, or any other advice about getting the most “bang for the buck” out of a direct attached SSD on a converged-infrastructure Cisco UCS blade server platform with a VNX SAN, by all means, drop me a line.  I’m all ears.

TSQL Tuesday 93: Interviews

This month‘s event is hosted by the fabulous DBA/SQL-consultant (& part time cartoonist) Kendra Little! Go check out her blog, podcast, and training at sqlworkbooks.com – really great stuff.

DBA interviews are tricky.

The problem isn’t so much that the role is vaguely defined. Although, depending on the size of the IT org and the tech stack, it can vary widely from a jack-of-all (DB Dev, report writer, production ops, the works) to a highly specialized performance tuner who works with 7 other teammates, each of whom has a unique surgical specialty in the data platform. But that’s not the problem — well, not the main problem. It is a problem in the sense that the business folks, especially HR, are notoriously and astonishingly ignorant of what a DBA or related role actually involves. But you get past that once you start talking to the tech leads and IT directors.

No, the problem is that, like most higher level technical roles, you don’t really know how a candidate is going to function in it (the role) without actually seeing him or her.. IN it. Do they keep a cool head when production goes down? Do they have a solid plan of attack for the dreaded “everything is slow!” complaint-storm? Do they have a good handle on HA & DR architecture & implementation? Can you rely on them to actually practice and follow thru with those strategies? Can they be a continuous learner and keep abreast of new developments while still tempering that with wisdom & maturity, applying the correct tools to the proper problems? Do try add value to the team and organization by both teaching and learning from others?

These are truly difficult, complex questions that are nearly impossible to deeply assess and fully answer during an interview process. Largely because the only real evidence of their answers lies in actual experience. Sure, a cert here or an MVP there definitely helps your case. But at any rate, we try our best to chip away at the boulder.

aerosmith chip away at the stone record label
it was a record. from the 70s. it’s the best i could come up with. =P

Pivoting to a more positive note, I’ll share some of the better questions that I’ve experienced during my career so far.

Some good examples.

How would you design and build a data copy/sync process across/between tiered environments, say DEV-QA-PROD?

Really great question.  This is a common problem is small-to-medium enterprises with legacy systems where DevOps hasn’t quite reached down to the depths of the internal application stacks and people are still dealing with “refresh cycles” on the order of months, quarters, or even years.  You can approach it purely from a tooling perspective, but that’s not the whole picture.  Thus, it calls for some thought and team-culture ideas as well as “knowing the nerd-knobs”.

We have a complex process flow that involves a lot of stored procedures, say 50 total.  Some of these are non-sequential, meaning they can be executed in arbitrary order, while others need to be sequenced with each other in “blocks”.  This is a vendor product, so ultimately, the customer gets to decide the schedule and order of execution of this flow.  The solution needs to be maintainable by field engineers.  How would you handle this?

Woah.  Talk about diving down a rabbit-hole.  This is interesting in the sense that it exposes a bit of the architecture and some of the potential pain-points that the team is hoping to solve, while leaving enough room for improvement and experimentation by the hopeful candidate.  More to the point, it’s just an example of a more general technique, which to me is very effective: taking an architectural problem that actually comes from the “real world” (the company/team that’s interviewing) and asking for the candidate’s ideas on how to solve it.  You don’t need to get in-the-weeds super-detailed about it, but outlining your ideas helps indicate how you think about complex challenges and shows what kind of value-add you would bring to the team.

And finally, a perennial favorite:

Tell me about a time you broke production, and more importantly, how you addressed and resolved it.

So many stories from the trenches involve downtime and mistakes, it’s good to ‘bond’ over them.  It helps bring the egos back down to earth, and reminds us that we’re all just meatbags, making technology to do our bidding, occasionally to our own regret.  It shows the candidate’s “pressure cooker” mentality, or at least, what they tell you about it.

server did what you instructed it to...you don't say?
Don’t ya hate it when…

In conclusion.

If you’re a DBA, Dev, or IT pro, help your managers better understand your team’s needs when it comes to hiring.  Get involved in the job description write-ups and screening process questionnaires.  Barge your way into those ivory towers, if you have to — or you’ll regret the time you waste on candidates who really belong in a different role than the one you’re after.

If you’re a manager, PLEASE LISTEN to your reports and tech leads.  They know what makes a good team member, they’ve been doing it for a long time.  Don’t dismiss their advice or block them from being part of the hiring process — yes, they are busy, and yes, they can be crotchety, but their input is highly valuable toward bringing in effective & productive talent.

That’s all folks!

PS: I know I missed the “deadline” by about an hour..ish.  I blame DST.  Heck, it’s still Tuesday for the majority of the Western hemisphere.  I’m not biased, but I write in English, so… ya know.  Take it as you will.  Now excuse me while I go hide from the blog-police in my ASCII-bunker.

Automating SQL Installation

..while there are likely even better ways to do this in the long-run, this quick & easy approach was sufficient to save me time and effort..

At this point in my career, I’m not managing massive environments, so I don’t feel the need (nor have the expertise) to use a large scale solution like DSC or SCCM.  But I’ve had to install SQL Server a few times, so I figured it’s worth at least scripting out a standard pre-configured installation, so that A) I don’t need click through a GUI ‘wizard’ hearkening back to the ’90s, and B) the SysAdmins can “fire and forget” (read: stop bugging me about it).

keep it simple stupid
the patented one-eyebrow-raise..

The Disclaimer

Thus, I’m attempting to K.I.S.S., while making it configurable & repeatable.  There are some limitations of this approach, as alluded above.  It’s not “massively scalable” (scaleable? scale-able?) because:

  1. The PoSh script still needs to be deployed locally to the server in question
  2. The installer config (.ini) also lives locally (though it probably could be a UNC path, it’s just a file after all)
  3. The script prompts you for the service account (SQL engine, Agent) credentials and the sa password using the Read-Host -AsSecureString method cmdlet, so some meatbag still has to type those in.  This is because we don’t have an enterprise pwd/secret-management system where I could, say, ask it for a service account credential set and tell it to embed that securely in a script without it actually being visible to me.  So, while yes, they’re kept in a “vault”, it’s not query-able by anything else, so an admin still needs to copy & paste them into whatever configuration screen he’s working with at the time.  Not ideal, I know, but we work with what we’ve got.

PS:  Yeah, yeah, “don’t use sa, rename it or disable it; or use Windows Auth only!”.  Rage, howl, fire & brimstone.  I’m not going to argue about it; we can save that for another post.  This environment dictates that its used during setup and then disabled later, so that’s beyond the scope of the installer config.

So yes, while there are likely even better ways to do this in the long-run, this quick & easy approach was sufficient to save me time and effort for the occasions when a new SQL box/VM needs to be spun-up.

Useful links

  1. A primer on SQL cmd-prompt installation & its arguments
  2. A couple community articles on the subject (the latter about slipstreaming updates)
  3. A technet article & couple Q&A threads (technet, stackoverflow) that helped me figure out how to securely get & put the credentials
  4. An example for mounting an ISO in PowerShell
  5. And finally, two things that I attempted to understand but ultimately failed to implement, because (apparently, at least to me), PowerShell remote-ing is a P.I.T.A.
config.ini, to command prompt, to PowerShell
3 steps toward a better workflow

The Outline

First we need an .ini file to work with.  You could either create it from scratch, or take it from an existing SQL box’s “Setup Bootstrap” folder.  Example path C:\Program Files\Microsoft SQL Server\130\Setup Bootstrap\Log\20170801_073414\ConfigurationFile.ini​  — indicating this was an install done on 8/1/2017 at 7:34am.  Right above that, at simply C:\Program Files\Microsoft SQL Server\130\Setup Bootstrap\Log\, you’ll see a Summary.txt file, which can actually come in handy while you’re testing these unattended installs and wanting to see why it failed.

The first link above, from MSFT Docs, does a pretty nice job of telling you all the things that make up this config file.  You get to bypass the TOS prompt, enter service account details, specify drive letters (paths) for default data/log file locations & tempdb, slipstream update packages (UpdateSource​), and even more advanced stuff like AG settings and whatnot.  My example will be a simple standalone instance using the default name, so I’ll be sticking with the basics.

We can then use this file in the ConfigurationFile argument of setup.exe from the SQL Server install media.  To put a little more color on that: the .ini file is really just a collection of command-line arguments to setup.exe​; you could also list them all out in-line, but that would be tedious and silly.  Here’s a couple major selling points of creating your own config file:

  1. Slipstream updates (SP’s, CU’s), instead of having it go out to MSFT update servers (or *aghast* sticking with the original RTM bits, you heathen you!)
  2. Specify drive letters / default file locations: sure, this may be considered old-hat if you’re running super slick storage, but I still find it makes management a bit easier if I know where my MDFs, LDFs, TempDB, & backups will always be.
  3. Take advantage of 2016’s better TempDB setup options (# files, size & growth)

We will, however, keep a couple arguments out of the .ini file and instead throw them into the ArgumentList from the calling PowerShell script.  Speaking of, here’s what the PowerShell script needs to do:

  1. Prompt the operator (SysAdmin or DBA) for the SQL & Agent service account credentials, and (optionally) the sa pwd (if using it).
  2. Fetch our install media from the central network share where we store such things (server & office ISO​s, for example).
  3. Mount said ISO to our virtual disc drive.
  4. Run its setup.exe with the following arguments:
    1. The config .ini file
    2. The service & sa accounts
  5. After it’s done, un-mount (dismount) the ISO.

Then the DBA can connect to the brand-spankin’-new running SQL instance and do other post-setup configurations as desired (i.e. set max-memory, maxDOP/CTFP, etc).  And sure, those could also be done in PowerShell (thanks in no small part to the awesome team at DbaTools), I chose not to do so in this case.

As the bloggers say, “that’s left as an exercise to the reader”.

Plus, they’re never quite as deterministic as we’d like them to be — they depend on the server’s compute resources, i.e. memory size & CPU cores, as well as estimated workload & environment tier, so it’s often a gamble in “how correct” your initial settings will be anyway.  Still, anything is better than the defaults, so configure-away!

husky puppies sharing
because sharing is caring!

The Code

Here are the Gists I’ve created to go along with this post.  If I’ve made a mistake, or if you, dear reader, have a suggestion, we can incorporate them into the gist without me having to go back and edit the blog post!

Yay technology!

I’d love to get feedback on how you would improve this, what you might do differently, etc.  Drop me a comment or a tweet!

Config/INI file:

PowerShell install script:

Know your Role

DBA does not mean “database archaeologist”, even though sometimes that’s what you end up doing…

Aka “Dammit Jim, I’m a DBA, not a data researcher!”  Or, as I stated on Twitter:

DBA != Database Archaeologist

penguin archaeologist
because… SQL ❤ Linux!

Today I spent more hours than I care to admit, tracking down some obscure data from two disparate systems in an attempt to reconcile what were supposed to be matching records.  Part of that is my own fault — I’m a sucker for interesting problems and edge cases, as I’ve blogged about before…

mostly just for the smug satisfaction of proving to the business that “your assumptions about how your data works are invalid“.

But mostly it’s because, the further back in time you go, the less reliable the data becomes.  Especially (exponentially) when that data originates from human free-form text input.

mr garrison manual inputs are bad
but GUI’s are OK.. and CLI’s are great!

Let’s contrive an example.  We have our core business product system, WidgetMaster, which tracks Widgets we ship out by WidgetNumber.  Our partner associate business runs an online widget exchange where people can buy and sell their Widgets in a sort of second-hand/after-market fashion.  PartnerExchange listings are supposed to include the WidgetNumber for ease of tracking and associating data between the two systems, even though they’re officially run by different companies (or in my case, different departments of the same company — yeah, think about that for a second).

Now, ideally, theoretically, those WidgetNumbers should always match up. But unfortunately, up until late 2014, our WidgetMaster system didn’t have an API that PartnerExchange could call to obtain a widget by its number; and even if we did, they have to at some level rely on the customer (or a worker) to read and enter that WidgetNumber into the exchange listing.  But wait, we started doing bar-codes back in 2010, so “most” of them are actually scanned from the bar-code, but not every customer has that capability, so there’s still a lot of hand entered data.

So we have some dirty data.  Let’s complicate things a bit. Over time, those widgets can come back to WidgetMaster for update/upgrade and then ship back out.  Again, ​WidgetNumber should remain consistent throughout that process.  Now, when PartnerExchange sells certain particular widgets, sometimes they’re part of a SuperSpecialCollection.  This collection spans many years, maybe even a decade or more. WidgetMaster got wind of this SuperSpecialCollection, being bought-up by Mr. HighRollerCustomer, so we started marking the incoming/outgoing records with a new property.

But it’s text.

It’s entered by the receiver, based on looking at the Widget’s buy/sell history in PartnerExchange.  And yes, the HighRollerCustomer who last bought the widget is aware that it’s part of their SuperSpecialCollection, but they aren’t guaranteed to specify that when they send the widget back in to WidgetMaster for upgrade.

Do we see the problem yet?

oh it gets better (again)
Yes, yes it does!

See, about 5 years ago, there was a reorg, and the dev team for WidgetMaster completely revamped the way in which “collection membership” for incoming widgets is designated/tracked. So now it’s over in some property table. To make matters worse, PartnerExchange renamed SuperSpecialCollection to AwesomeCltn a few years ago because they were tired of typing so many letters (and apparently fans of cryptic abbreviations).

Fortunately, PartnerExchange has done a decent job of at least storing the correct WidgetType and WidgetQuality in their listings, despite WidgetNumbers being fairly sparse.  But again, because over in WidgetMaster, we’re supposed to associate each WidgetNumber with the AwesomeCollection, we now have this secondary task of mapping unmatched WidgetNumbers across systems, by using Type and Quality from one side (partner) combined with Collection-membership from the other side (master), by assuming that the partner’s designation of SuperSpecial/AwesomeCollection is correct.

If your head’s not spinning yet, give yourself a round of applause. While rubbing your tummy and tapping your foot.

Needless to say, this is hard.  We’ll probably get the majority of records matched (mapped?) eventually by using a couple string LIKE predicates and some clever try/pass/retry flow, but it’s tedious at best.  Another bit of frustration will come up when we do a couple ad-hoc searches thru each system to attempt to apply reason and logic, i.e. find a pattern; and because we’ve already done the work, we might as well put that info into our results, even if it doesn’t show us a useful pattern by itself.

So how do we approach this?  We’ll as I said, I spent what I felt was too much time on it, but essentially I did an initial “majority rules” mapping attempt (first pass), followed by a few reconciliation attempts for the remainders.  Those consisted of fairly identifiable patterns with a couple outliers. With those outliers, as with the rest of the unmapped records at the end of the day, I had to tell the business, basically, “Here’s the majority of the results. You can assign a research specialist or analyst to the rest, if you feel it’s that important.”

I may have done this with slightly more attitude than necessary.

How could we improve this?  The bigger geeks in the room may pipe up with “machine learning!”  Ok sparky, do you have that infrastructure ready to go?  No?  How long for implementation?  mumble mumble something about Azure mumble…  Okay, sure, how about training the model so you can feed it your data?  Cool, well enjoy the incredulous laugh you’ll get when you tell the manager that.

How about other tool sets? Sure,  we could check out Python or R, write a C# app maybe?  Well guess what, we still need to look at the data to understand what patterns (or lack thereof) there are to work with.  And again, lead time.  Unfamiliar tools means longer development cycles.  And they’re really not guaranteed to produce any better results (more matches) at the end of the day, are they?

Because your data models and your analyses are only as good as the data itself.

data is the iceberg
Or is it “data are“?

And with that, I’ll call it a day.  Thanks for reading!