Dirty Laundry

It’s time for a more thought-y, less tech-y post.  Which is mostly my excuse for not wanting to write a bunch of code at the moment.  But that’s how I started this blog, with mostly opinion pieces, trying to offer some critical thinking on how DBAs and Developers work together.  So y’all better like it!

Today’s title is brought to you by Don Henley’s tune of the same name, which is now stuck in my head, thankyouverymuch.

dirty laundry goes in a basket not in a database
Paint.net is my friend… =D

This is about data quality.  When you have “dirty data”, just like dirty laundry, and you let it sit unattended, it starts to smell.  In software, this means the “badness” seeps into other areas of the environment, affecting systems and business processes that should otherwise function smoothly.

code smell is a surface indication that usually corresponds to a deeper problem in the system.

-Martin Fowler

And, more aptly:

Data quality is corporate America’s dirty little secret.

-Paul Gillen

But what is dirty data?  Generally, it’s anything that doesn’t quite fit the ideal data model — that perfect vision of how all the bits of information in the system fit together, the shape of each data entity and how they relate to each other.  Mostly, dirty data is what happens when you allow users to type things into text-boxes, and you write those text-box contents straight into the database without any layers of validation or cleansing.  (Coincidentally, that’s also how SQL injection happens, but most of us have been scared-straight by enough years of security bloggers hammering at our thick skulls — and our favorite XKCD — that we at least sanitize our inputs before dumping them to an INSERT statement.)

Let me take a recent example from my experience.  We have an ERP system that doubles as our CRM system (which is already a pair of bad idea jeans).  How do you think customer information gets into the database?  Customer Service Reps, typing stuff.  Usually by copying from a paper form.  Or the customers themselves, using an online form.  But guess what doesn’t happen in either case?  If you said “USPS address validation“, give yourself a hand!

joker give yourself a clap
Oh goooood for youuuuuu…. </Christian Bale>

Now, being that this system is our “source of truth” for customer info, it stands to reason that lots of other business functions & processes depend on it.  For example, let’s say we send a promotional calendar to our customers of a certain “subscription level” on a yearly basis.  We’re not in the publishing business, so we contract this out to another company.  But guess what they need from us in order to complete the job and mail out those calendars?  Addresses!  So what happens when there’s a bad address in our database?  A calendar gets returned, wasted cost and materials.  Multiply that by a couple thousand and you start to turn a few heads in the C-suite.

Later, around the Marketing table, someone has a brilliant idea that they need to run a mail-merge to send out a gift-package to the top 100 customers.  So they ask the DBA for a list of said customers.  “Sure!  Here ya go, here’s a report.”  And then the complaints start coming in.

“These customers aren’t active anymore.”

Then tell your CS reps to mark them as inactive in the system.  But no, we don’t do that, we just write “inactive” in the FirstName field.

“These ones are employees.”

Fine, figure out a special indicator to add for that, so I can exclude them from the report.  But no, of course, we can’t do that either; we just put “deactivated” in the FirstName field.

“This guys is dead.”

Yeah, not even kidding.  Apparently the powers-that-be decided to keep his info in the system, but type in “deceased” to the “Address 2” line (in the US, this is customarily the apartment/suite/unit number).

he's dead jim
Let’s beam him back up but write “deceased” on his badge, that’ll be sufficient.

But mostly, the biggest complaint is that we’re getting un-deliverable/return-to-sender when we try shipping out to some of these addresses.  And why?  Because they’re not subject to any external validation and quality-control.

So what’s the data professional’s responsibility in this?  In my opinion, it’s to advocate for data quality.  There are obviously big vendors out there like Melissa Data who will sell you a service to help get you there.  APIs abound, from USPS and other official sources, so building it isn’t out of the question.

One potential roadblock is, as usual, conservatism.  The business’s ERP system is its life-blood, highly sensitive to change and very guarded by over-protective management and finicky executives.  But the smelly dirty data-laundry continues to cause problems and has real-money impacts on corp. efficiency and profit.  Unfortunately, many people tend to take the ostrich approach.

if you bury your head in the sand your ass will get burnt
No idea who this Bennett person is, but they sound smart.

So, my good people, start “doing your laundry”.  Have those conversations with your teams and managers about the current state of your data quality, and what it’s going to look like moving forward.  Make some plans, have a road-map, and understand that it’s going to involve a lot of collaboration between key players.  And good luck!

Advertisements

Author: natethedba

I'm a SQL Server DBA, family man, and all-around computer geek.

One thought on “Dirty Laundry”

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s