Dirty Laundry

It’s time for a more thought-y, less tech-y post.  Which is mostly my excuse for not wanting to write a bunch of code at the moment.  But that’s how I started this blog, with mostly opinion pieces, trying to offer some critical thinking on how DBAs and Developers work together.  So y’all better like it!

Today’s title is brought to you by Don Henley’s tune of the same name, which is now stuck in my head, thankyouverymuch.

dirty laundry goes in a basket not in a database
Paint.net is my friend… =D

This is about data quality.  When you have “dirty data”, just like dirty laundry, and you let it sit unattended, it starts to smell.  In software, this means the “badness” seeps into other areas of the environment, affecting systems and business processes that should otherwise function smoothly.

A code smell is a surface indication that usually corresponds to a deeper problem in the system.

-Martin Fowler

And, more aptly:

Data quality is corporate America’s dirty little secret.

-Paul Gillen

But what is dirty data?  Generally, it’s anything that doesn’t quite fit the ideal data model — that perfect vision of how all the bits of information in the system fit together, the shape of each data entity and how they relate to each other.  Mostly, dirty data is what happens when you allow users to type things into text-boxes, and you write those text-box contents straight into the database without any layers of validation or cleansing.  (Coincidentally, that’s also how SQL injection happens, but most of us have been scared-straight by enough years of security bloggers hammering at our thick skulls — and our favorite XKCD — that we at least sanitize our inputs before dumping them to an INSERT statement.)
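(For the record, "sanitize" in T-SQL land mostly means parameterize.  Here's a minimal sketch with hypothetical table and column names — the user-typed value travels as a parameter to sp_executesql instead of being glued into the statement text:)

-- Hypothetical example: the text-box value rides along as a parameter,
-- never as text pasted into the SQL string itself.
DECLARE @customerName nvarchar(100) = N'Robert''); DROP TABLE Students; --';  -- hi, Bobby Tables

EXEC sys.sp_executesql
    N'INSERT INTO dbo.Customers (CustomerName) VALUES (@name);',
    N'@name nvarchar(100)',
    @name = @customerName;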

Let me take a recent example from my experience.  We have an ERP system that doubles as our CRM system (which is already a pair of bad idea jeans).  How do you think customer information gets into the database?  Customer Service Reps, typing stuff.  Usually by copying from a paper form.  Or the customers themselves, using an online form.  But guess what doesn’t happen in either case?  If you said “USPS address validation”, give yourself a hand!

joker give yourself a clap
Oh goooood for youuuuuu…. </Christian Bale>

Now, being that this system is our “source of truth” for customer info, it stands to reason that lots of other business functions & processes depend on it.  For example, let’s say we send a promotional calendar to our customers of a certain “subscription level” on a yearly basis.  We’re not in the publishing business, so we contract this out to another company.  But guess what they need from us in order to complete the job and mail out those calendars?  Addresses!  So what happens when there’s a bad address in our database?  A calendar gets returned, wasted cost and materials.  Multiply that by a couple thousand and you start to turn a few heads in the C-suite.

Later, around the Marketing table, someone has a brilliant idea that they need to run a mail-merge to send out a gift-package to the top 100 customers.  So they ask the DBA for a list of said customers.  “Sure!  Here ya go, here’s a report.”  And then the complaints start coming in.

“These customers aren’t active anymore.”

Then tell your CS reps to mark them as inactive in the system.  But no, we don’t do that, we just write “inactive” in the FirstName field.

“These ones are employees.”

Fine, figure out a special indicator to add for that, so I can exclude them from the report.  But no, of course, we can’t do that either; we just put “deactivated” in the FirstName field.

“This guy is dead.”

Yeah, not even kidding.  Apparently the powers-that-be decided to keep his info in the system, but type “deceased” into the “Address 2” line (in the US, this is customarily the apartment/suite/unit number).

he's dead jim
Let’s beam him back up but write “deceased” on his badge, that’ll be sufficient.
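What I’d rather see — sketched here with hypothetical column names, so don’t take this as our actual schema — is a real status flag the reports can filter on, instead of repurposing FirstName and Address 2:

-- Hypothetical sketch: give status a proper home instead of overloading name/address fields.
ALTER TABLE dbo.Customers
    ADD CustomerStatus varchar(20) NOT NULL
        CONSTRAINT DF_Customers_CustomerStatus DEFAULT ('Active');  -- e.g. Active / Inactive / Employee / Deceased

-- The "top 100 customers" report can then exclude the noise explicitly.
SELECT TOP (100) CustomerId, FirstName, LastName
FROM dbo.Customers
WHERE CustomerStatus = 'Active'
ORDER BY LifetimeRevenue DESC;   -- LifetimeRevenue is a stand-in for however you rank "top" customers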

But mostly, the biggest complaint is that we’re getting un-deliverable/return-to-sender when we try shipping out to some of these addresses.  And why?  Because they’re not subject to any external validation and quality-control.

So what’s the data professional’s responsibility in this?  In my opinion, it’s to advocate for data quality.  There are obviously big vendors out there like Melissa Data who will sell you a service to help get you there.  APIs abound, from USPS and other official sources, so building it isn’t out of the question.

One potential roadblock is, as usual, conservatism.  The business’s ERP system is its life-blood, highly sensitive to change and closely guarded by over-protective management and finicky executives.  But the smelly dirty data-laundry continues to cause problems and has real-money impacts on corporate efficiency and profit.  Unfortunately, many people tend to take the ostrich approach.

if you bury your head in the sand your ass will get burnt
No idea who this Bennett person is, but they sound smart.

So, my good people, start “doing your laundry”.  Have those conversations with your teams and managers about the current state of your data quality, and what it’s going to look like moving forward.  Make some plans, have a road-map, and understand that it’s going to involve a lot of collaboration between key players.  And good luck!

Stupid Defaults

Every big enterprise product has them.  I just happen to pick on SQL because it’s my area of expertise.

This week’s topic was triggered by an amazing quote from Adam Machanic (b|t) in the sqlcommunity Slack group, where he was trying to help someone troubleshoot a slow linked-server query:

That default, I consider to have been checked into the SQL Server codebase directly by Satan.

satan-costume-guy-laughing
I imagine he looked something like this… Satan, not Adam.  =D

He’s referring, in this case, to the default option of remote proc transaction promotion, or “Enable Promotion of Distributed Transactions for RPC” in the GUI, which is set to True by default.  Admittedly, linked-servers are a dicey area of the technology and not everybody needs them, but when you do need to deal with them, there are some caveats to consider and a few potholes to avoid.

I won’t go into gory details, but the mile-high gist of it is that you should probably change a couple of the default linked-server settings when you’re going from MSSQL to MSSQL (which most of the time, you are): rpc and rpc out both to true, and the above dude to false.  The reasons behind that are subtle, but it boils down to usage patterns:  If you’re purely running basic SELECT statements against the remote server, you can leave this stuff alone; but if you want to do cool stuff like exec MyLinkedServer.RemoteDB.sys.sp_executesql '--some awesome dynamic-sql', you’ll want to change these.  (That last bit comes in especially handy if you’re building a dynamic query string, then running it against the remote server to bring the results into a #temptable to further massage/mangle/munge said data.)

Even though you probably really shouldn’t be doing that in the database (that’s what web server farms are for!).

stack of colorful rings with arrow pointing up
push it up the stack!
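For reference, flipping those three linked-server options is a one-time bit of setup via sp_serveroption — a quick sketch, assuming a linked server named MyLinkedServer as in the example above (and test it somewhere non-prod first):

-- Recommended tweaks for MSSQL-to-MSSQL linked servers.
EXEC sys.sp_serveroption @server = N'MyLinkedServer', @optname = N'rpc',     @optvalue = N'true';
EXEC sys.sp_serveroption @server = N'MyLinkedServer', @optname = N'rpc out', @optvalue = N'true';
EXEC sys.sp_serveroption @server = N'MyLinkedServer', @optname = N'remote proc transaction promotion', @optvalue = N'false';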

So, what are some other “stupid defaults” in SQL Server?

It’s not even just “defaults”, either; there are some options which shouldn’t even be a thing — they should be completely and irrevocably removed from the product.  Yet, because the RDBMS tech space is infamously conservative and slow-to-change (the accepted euphemism is “mature”), these options and defaults have remained in the product despite our best attempts to convince MS that they’re heinous.

1. Parallelism settings (server/instance level)

Your servers have many-core CPUs, right?  And you want SQL to utilize those cores to the best of its ability, distributing the many users’ workloads fairly amongst them, yes?  Damn right, you paid $3k or more per core in freaking licensing costs!  “OK”, says SQL Server, “I’ll use all available CPUs for any query with a ‘cost’ over ‘5’“.  (To give context here, in case you’re not aware, ‘5’ is a LOW number; most OLTP workload queries are in the double to triple digits).  “But wait!”, you protest, “I have more than 1 user, obviously, and I don’t want their horrible queries bringing all CPUs to their knees and forcing the 50 other user queries to wait their turn!”

well... tough shit
-SQL Server

Hardly seems fair, no?  But those are the defaults, have been for over a decade, despite how much hardware has evolved since.  Just for completeness, I am talking about the following 2 options, which you can view by right-clicking on the instance in SSMS, or by running sys.sp_configure.  They are max degree of parallelism and cost threshold for parallelism, respectively (# cores to use, and how expensive the query should be to “go parallel” — as opposed to serial/single-threaded, which is perfectly fine for those itty-bitty-teeny-tiny queries that are so quick that 1 CPU core is more than enough to handle them).  We commonly abbreviate the former as maxDOP; the latter is less commonly abbreviated, but because I’m a sucker for acronyms, I like to call it CTFP, or maybe costFP if I’m feeling verbose.

Now obviously you can, and should, change these settings.  “To what?” you ask.  Well, the answer, as always, my friend, is “It Depends ©®™”… a perennial favorite of DBAs and consultants alike.  But don’t ask me — there are plenty of people much smarter than I with blog posts on the topic.  If you put a gun to my head, for CTFP, I’d say “pick your favorite number between 50 and 100, start there, and test to see if it makes your server CPUs happy or sad”.  And for maxDOP I’d say “divide your # of CPU cores by 2 (if you have 8 or fewer) or 4 (if you have 16 or more)”.

And if you have somewhere between 9 and 15 CPU cores, don’t ask me, because you’re running some.. interesting hardware.  Figure it out yourself, and get that gun out of my face!

OK, I know 12 cores is probably not an unreasonable config, so.. extrapolating my logic above, divide by.. 3?  Yeah let’s go with that.
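If you want to see that in T-SQL terms, here’s a rough sketch — the specific numbers assume the 12-core box from above and are strictly “start here and test” values, not gospel:

-- Both are advanced options, hence the 'show advanced options' dance; RECONFIGURE applies them.
EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sys.sp_configure 'cost threshold for parallelism', 50;  -- CTFP: somewhere in the 50-100 range, then test
EXEC sys.sp_configure 'max degree of parallelism', 4;        -- maxDOP: 12 cores / 3, per the hand-wavy math above
RECONFIGURE;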

squirrel with "ship it" flag
ship it good

2. Auto Close (database option)

It’s at the top of the list in the GUI under Database Properties -> Options.  Yet nobody knows why you would ever enable it.  I once heard a community member theorize that it might have been put into the product back when it was considered a viable “local persistence option for Windows CE/Mobile apps” and it would help w/ resource management on said mobile device by freeing up resources when the DB wasn’t in-use.  Well, we all know how well that product line did in the market (hint: poorly).  There are so many better options for localized data stores in mobile dev, MS isn’t even a blip in the conversation.  (Again, talking local data persistence.)

If we’re talking cloud, MS is a big part of that conversation — Azure is amazing, and a solid competitor to AWS & GCP.

Anyway, if you ever find a SQL DB with the auto_close option enabled, find the person responsible, and slap them with a trout.

stick figure chasing another with purple trout
I will find you, and I will trout-slap you…
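(Before the trout-slapping starts, it’s worth a quick check for offenders — a little sketch, where MyDB is obviously a placeholder name:)

-- Find any databases with auto_close enabled...
SELECT name
FROM sys.databases
WHERE is_auto_close_on = 1;

-- ...and put a stop to it (placeholder database name).
ALTER DATABASE [MyDB] SET AUTO_CLOSE OFF;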

3. Server Max Memory

Last one for today.  This is something that’s not completely heinous, but could use a LOT more built-in intelligence during the installation process so that DBAs & SysAdmins didn’t need to think about it so much.  SQL Server will, by default, set its max-memory to some-odd-billion-MBs (technically it’s the max value of a 32-bit int, which the more geeky among you have probably memorized), which is of course some-odd-million-GBs, which is more than even the most bleeding-edge servers have to date.  Which is fine in theory — you paid a crap-ton of money for this system, it might as well use up all the RAM that it can to perform to its potential, right?

all your ram are belong to us
in AD 2101…

Right.  Until you realize that “Oh wait, it’s running inside an OS” (whether that’s Windows or Linux, thanks to the 2016 & 2017 product versions) — and that OS needs some RAM too, to keep itself running!  (Cue the Linux zealots with their “it’s so much more lightweight than Windoze, haha!!1” — yeah, well you still need some memory space, don’t ya?)

Here’s what I’d like to see, in my ideal world.  During SQL Server installation, it would detect how much RAM is on the system, subtract about 10% or 4-8 GB, leave that for the OS, and use the resulting number as its limit.  Boom, done, nobody has to think about configuring it and checking off another checkbox on their setup checklist.
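Until that ideal installer shows up, here’s a rough sketch of doing the math yourself — the “10% or 4 GB, whichever is larger” reserve is just my rule of thumb from above, so season to taste:

-- Leave the OS whichever is larger: ~10% of total RAM or 4 GB; give SQL the rest.
DECLARE @totalMB int, @reserveMB int, @maxMB int;

SELECT @totalMB = total_physical_memory_kb / 1024
FROM sys.dm_os_sys_memory;

SET @reserveMB = CASE WHEN @totalMB / 10 > 4096 THEN @totalMB / 10 ELSE 4096 END;
SET @maxMB = @totalMB - @reserveMB;

EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure 'max server memory (MB)', @maxMB;
RECONFIGURE;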

But noooo… The vaunted MSSQL engineers can build all sorts of amazing things into the product like QueryStore, Adaptive Query Processing, and The Artist Formerly Known as Hekaton, but heaven forbid we get a little more intelligence in the installer.  It got a lot better with 2016, when it let you configure tempDB reasonably correctly (multiple files, different locations, etc.), but there’s still a LOT that could use some lurv.

two kitties cuddling
because KITTEHS!!

Do you have a favorite “stupid default” or “horrible setting” related to SQL Server or any other technology that you work with?  Share in the comments!

Reports for Auditors

This is how code quality dies…

Look, I get it, audits and auditors are a necessary evil.  But the sheer volume of crappy code, hack-y scripts, man-hours, and alcohol dedicated to meeting the often arbitrary, sometimes ridiculous, and possibly even downright obscene requirements of said auditors is staggering.  There are entire product suites dedicated to helping you “report on” or “meet/pass” audits of various kinds.  And thank god for those.  Because without them, we’d all be slightly more insane than we already are.

I’m not crazy; my mother had me tested.  And my rubber duck agrees.

Here’s an example.

In one of our main applications, at the database level (SQL), we’ve implemented poor-man’s data-change-tracking via table triggers.  In a past audit, there was a “flag” on an unauthorized data change.  As a result, an “exception report” was created, which basically produces a daily CSV dump of those changes and emails it to a few managers.  Now, because the tracking table includes “who” as well as “when” & “what”, and there are certain privileged accounts that can be ignored for auditing, and because there are dozens of application users whose app-controlled changes are also logged in said tracking table (and need to be ignored), the “report” writer at the time decided it was a wonderful (read: terrible) idea to hard-code a list of usernames into the query that produces this data-dump.

That is horrible.  Never do that!

using hard-coded values is bad
Mr. Garrison knows…
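Instead of baking usernames into the query, even a simple look-up table would have kept the thing maintainable — a minimal sketch, with hypothetical table and column names standing in for ours:

-- Hypothetical names; the point is that exclusions live as data, not hard-coded in the query.
CREATE TABLE dbo.AuditExcludedUsers (
    UserName  sysname      NOT NULL PRIMARY KEY,
    Reason    varchar(100) NOT NULL,
    AddedDate datetime2    NOT NULL CONSTRAINT DF_AuditExcludedUsers_AddedDate DEFAULT (SYSDATETIME())
);

-- The exception report then filters against the table instead of a hard-coded IN ('bob','sally',...) list.
SELECT ct.*
FROM dbo.ChangeTracking AS ct   -- stand-in for the trigger-populated tracking table
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.AuditExcludedUsers AS ex
    WHERE ex.UserName = ct.ModifiedBy
);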

This query, the job that scheduled it, the data that it produced, and the emails that it sent were subsequently filed away and silently forgotten about.  Until we did some environment cleanup and discovered “Oh look, here’s a thing that’s not doing anything” (because the hard-coded list of users was now woefully out of date), and the same managers who had received the reports for years past gave the OK to disable the job!

What could possibly go wrong?

Then the next audit season comes around.  And guess who comes running back to the DBA having changed his mind?  Why yes, that very same manager!

i dont always say i told you so
Stay shortsighted, my friends…

So I tell him what we need to do to fix it — remove the hard-coded list, replace it with an AD group or at least a look-up table that can be periodically updated with “current” users in the desired departments/teams.  We do that, and the CSV dump goes from zero to huge.  As in, too huge for email.  So we try to pare it down.  We get it to a manageable size.  Then we get ready to deploy the changes and confirm the desired email recipients.

  • Boss: “Oh it doesn’t really matter, we just ignore the emails anyway… it’s just to satisfy an audit requirement.”
  • Me: “So why did we even bother fixing it?”
  • Boss: “Well, it was broken!”
  • Me: “…”

I’d like the last hour of my life back please.

stress-reduction-kit-bang-head-here
It works, I sw…

Here’s another fun one.

The ERP system has a rickety tack-on audit-trail system involving some SQL Agent Jobs and an obscenely large data repository for storing all the changes that the ERP system users make every hour of every day.  Thankfully, somebody along the way was prescient enough to partition those storage tables using a partitioned view scheme.  But apparently they were NOT smart enough to realize that the clustered index needs to be the datetime column.  Yes, it was in the primary key, but that PK was not clustered (nor should it have been, since the actual unique keys were GUIDs — but this is a prime example of making the clustering key different and separate from the PK).

Obviously enough, in a date-driven partitioned view scheme, that date does need to be involved in the PK also.  But again, the best clustered index is just that date, because A) it’s the partition divider, and B) it’s always your queries’ main filter-predicate (which is a fancy way of saying “the thing in your JOIN/WHERE clauses”).
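In rough T-SQL terms, the shape of it looks something like this (made-up table and column names; the real thing has one member table per date range):

-- Hypothetical member table of the partitioned view: the GUID stays unique (nonclustered PK),
-- while the date column drives both the clustering and the partition-range CHECK.
CREATE TABLE dbo.AuditTrail_2018 (
    AuditId    uniqueidentifier NOT NULL,
    AuditDate  datetime         NOT NULL,
    ChangeData nvarchar(max)    NULL,   -- stand-in for the actual change-tracking columns
    CONSTRAINT PK_AuditTrail_2018 PRIMARY KEY NONCLUSTERED (AuditId, AuditDate),
    CONSTRAINT CK_AuditTrail_2018_AuditDate CHECK (AuditDate >= '20180101' AND AuditDate < '20190101')
);

CREATE CLUSTERED INDEX CIX_AuditTrail_2018_AuditDate
    ON dbo.AuditTrail_2018 (AuditDate);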

Once again, audit season comes around, and someone needs those reports driven by these audit-trail repos.  And guess what?  They’re slow as molasses.

you don't say
Nicolas Cage is apparently quite the flexible meme fodder..

So we get to spend another few hours applying the correct clustered indexes, waiting for them to rebuild, testing the queries for the report, and finally running the report itself.  Oh and don’t forget the down-time of the jobs, and having to “catch up” with all the changes that were made in the ERP system while we were working on all this.

Is it 5 o’clock yet?

In conclusion.

Audits suck.  But please, stop producing crappy code in response.  It just makes the next guy/gal that comes after you even more frustrated than he/she already is with the whole process.  Practice the “boy scout principle” — leave things at least a little better than you found them.  And if you’re forced to produce a ridiculous “exception report” that nobody will ever read, spend as little time on it as possible while still making it less heinous than the last terribly designed monstrosity.

oh my look its beer-o-clock
That clock might be slightly ahead… but we’re gonna go with it.

An Open Letter To Management

We need to talk about this stuff – candidly, openly, broadly, deeply.

More accurately, to legacy enterprise management.

Let’s say the following directive comes down from on-high: “Hey, our CEO wants us to provide better financial metrics reports and a dashboard that management can see to show real-time stats about the company.”

*groan*

I mean… Sure!  Yay, digital transformation, modernization, mobile friendly, all that good stuff!!

So, I have some thoughts on this, because I’ve seen the current state of things in small-medium enterprise, and am anxious to help improve that state to provide better value to the business.  To misquote Dennis Miller, I don’t mean to go off on a rant here, but…

First topic: Reality Check

It starts at the top, with a couple realizations:

  1. Data is ever-growing.
    1. We need to get smarter about managing its growth, including archiving/retention schemes, data warehousing, etc.
    2. This involves compliance regulations and operational resources.
      1. We need to ensure compliance with biz standards and data shelf-life.
      2. We need to automate as much as possible to avoid over-burdening our human resources (and to some extent our servers too).

For example, you can’t expect the same response-time for a query into 10-year-old financial data as you do for 1-year-old data.

  2. Traditional SSRS (SQL Server Reporting Services) is an operational time-sink.
    1. We spend way too much time assigning access, creating redundant “on demand” reports, and making seldom-used email subscriptions.
    2. We’re probably running on an old version, say 2008R2.
      1. Vast improvements have come to the MS Data/BI platforms in the last decade and we need to take advantage of them.
      2. It’s not mobile-friendly at all; it’s not even modern-browser friendly, as some of its UX elements still only work properly in Internet Explorer.
    3. We tacked-on some 3rd-party application to attempt to bring some data-warehouse functionality into the environment, but only 1 person “knows it intimately” and is comfortable developing new reports with it.
  3. Our ERP system, in its current state/version, is a tangled mess, to the eyes of a DBA & query-writer/report-writer.
    1. We’ve bolted-on so much customization and special-configuration that it’s not suitable for stock/canned reports from the vendor, even if we upgraded to a version of the app that had a decent reporting engine.
    2. We can’t even decide on very basic things like “What is a ‘unit‘ of production?”, or “What are the different areas/groupings we break-out for revenue metrics?”

Ok sure, maybe we can agree on what those groupings are, but we can’t even get a consensus on what we call them!

Second topic: Single Source of Truth

We need to agree on a standard, documented, official set of business rules that answer such questions as “how do we measure revenue?”, “what are our different sub-orgs/departments/groupings for how we report on revenue?”, “what is ‘production output’ and how do we measure it?”, “how do we calculate bonuses for this group of employees?”, etc. More than that, we need to agree on naming things – we need a common, consistent nomenclature and understanding of what it means when someone says “N# Units”, “Department X” or “Order Aging” or “Membership Level” or “Bonus Type Y”.

And even more than that, we need to map those concepts to concrete, documented rule-sets that are manifested in the data somehow (from the simplest example, a “look-up table” or “reference table”, to the complex examples like a “data mart” or “analysis cube” or “ETL process”). This concept is sometimes called a “data dictionary”, which kinda belies its complexity, because it’s really more of a “data encyclopedia” – it needs to document what, how, why, & when.

What our concepts/terms/data-points mean, how they’re used, why they’re useful, & when they should be used.
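(At the simple end of that spectrum, a reference table really is nothing fancy — a purely hypothetical example, just to make it concrete:)

-- Hypothetical reference table: one agreed-upon list of revenue groupings,
-- defined once, documented, and joined-to everywhere instead of re-invented per report.
CREATE TABLE dbo.RevenueGrouping (
    GroupingCode varchar(20)  NOT NULL PRIMARY KEY,
    GroupingName varchar(100) NOT NULL,
    Description  varchar(400) NULL    -- the "why & when" lives in the data dictionary entry
);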

Third topic: KISS and KPI’s

Management reports need to be simple. Yes, there are power-users who want the detail, and there are auditors who in fact require the detail. But your average C-level (or even P/VP-level) exec doesn’t care about that stuff – they want very simple answers to deceptively simple (i.e. can be very complex under-the-hood) questions, like “How much money did we make this quarter for department X?”, or “What kind of productivity bonus do I give to group Y?”. But that’s just the beginning – that’s descriptive analytics. What they really want, but are sometimes too afraid to ask, are more powerful questions, like “How much money can I expect to make in market Z or state XX?”, “What are our expected new loyalty program memberships, and how much will they profit us?” — predictive analytics.  (And we’re not even going to touch prescriptive analytics yet, because you’re not ready for that.)

KISS means we need to try our best to hide the nitty-gritty details and “under the hood” logic/calculations from the end-user or report audience. But, that means fully knowing and understanding those details and rules and logic flows so that we can implement them!

KPI stands for Key Performance Indicator. That’s the golden nugget, the one piece of information that the manager/report-viewer ultimately is after, the thing that makes them go “Got it! That’s the answer I was looking for!”, so they can make their business-decision and move on with their day. These aren’t necessarily just single numbers (like an overall revenue figure); they can be pie-charts, bar-graphs, a clear & concise grid, or whatever makes the most sense for the business-problem/business-decision at hand.

This all sounds fantastic, right? So what’s the catch?

Fourth topic: Time & Effort

Time is money, which is resources, which is people, learning, training, developing, implementing, testing, validating… rinse, repeat.  You don’t put that all on the shoulders of a lone DBA; that life-cycle touches many different disciplines and team members – managers, business users, accounting folks, marketing people, analysts, developers, testers, operational leads, and yes, of course, all of IT infrastructure (helpdesk, engineering, DBA).  And you don’t just buy a box off the shelf at your local software retailer and say “look, we’re gonna implement Tableau!”, wave a magical IT wand, and call it a day.

Now we, as technologists, are more than willing to learn and educate ourselves, but…

There needs to be a matching dedication from the business to that effort, and to the platform(s) that is/are chosen.

That means, in concrete terms, a few things:

  1. Training budget & resources
    • Conferences, courses thru online training providers, cross-team collaboration.
  2. Product & technology investment
    • Upgrades, net-new products, whatever is needed.
  3. Time allowances & agreements
    • Dedicated scheduling where the “daily grind” operations take a back-seat and we can focus on the new stuff.
  4. Support from SME’s
    • The ability to call-out to a qualified expert when critical questions or roadblocks arise.
    • Can be contractors, consultants, service-providers, or platform-providers. The point is, you only use them if you need them, so you keep the cost relatively low.

That’s if you’re dedicated to in-house team/ability build-up. If you want to outsource, you have a different set of challenges:

  1. Contractors are expensive!
    1. Their requirements are exceedingly rigid.
    2. They’re likely to scoff (yes, even outright laugh) at the quagmire of data & logic & rules that we’ve created and/or want to build into our “magical reporting stack”.
  2. They’ll still require that same product/tech investment.
    1. No contractor is going to accept your old legacy SSRS instance as a baseline for building a modern, responsive, effective reporting system. The first thing they’ll say is “upgrade that, & come back to us.”
    2. Likewise for your legacy ERP system – sure, it’s a little less obsolete, and there are probably plenty of shops running it & developing on it, but good luck getting new-hire contractors to embrace it; at best, they’ll begrudge it; at worst, they’ll charge exorbitant fees for having to work on such an old platform.
  3. Technical debt is their worst enemy.
    1. Like it or not, like most decades-old enterprises, we have technical debt up the wazoo.
    2. Contractors won’t work in a debt-heavy environment; they’ll insist you “fix the debt” and come back to them in a few months/years when it’s all happy & pretty & green.

Technical debt is our enemy, too, but at least we “own” it – i.e. we’re aware of it and we have ideas on how to fix it, if/when we ever get the time.

It’s like our city roads: at least we know where the potholes are, and how to avoid them.

Executive Summary

My point, from this rambling and probably way too lengthy post, is this: We need to talk about this stuff. Yes, Mr. Manager, I know you already said that. Let me embellish:

We need to talk about this stuff, candidly, openly, broadly, deeply, cross-functionally (made-up phrase #2), even company-wide.

Because, while the end-goal is deceptively simple (“We want report dashboards!”), the underlying systems are complex, with lots of moving parts, requiring lots of knowledge (both domain/biz and tech), and lots of management (compliance, governance, automation, visibility/monitoring).

It’s not just a technology challenge. It’s a people challenge. It’s a cultural challenge. It’s an organizational challenge.

It’s a challenge that, when faced, met, and overcome, can lead to spectacular growth and success for all involved!

(And that’s my attempt to end this rant on a positive note. Enjoy!)


PS: No, I’m not happy about WordPress’s inability to understand the ‘style’ attribute of a simple <ol> tag, but I tried… so apologies if the outlines are not intuitive because each level is just another set of numbers, instead of Word-style outlining like 1.. a.. i.. etc.  Grr arg!

Edge Cases

“Oh, that’ll never happen!”

I have a confession. I enjoy edge cases. More than I probably should. I take an unhealthy amount of pleasure in finding those unexpected oddities and poking holes in the assumptions and “soft rules” that business users and even developers make about their processes, applications, and most importantly, their data.

But let’s back up a minute. What exactly is an edge case? Well, I’ll quote Wikipedia’s definition at you, then I’ll put my own spin on it.

An edge case is a problem or situation that occurs only at an extreme (maximum or minimum) operating parameter.

edge-cases-99-to-1
it’s in a pie chart, so it must be true…

In my mind, there are really two types of edge cases. The results and treatment are largely similar, so it’s not a terribly important distinction, but for conversation’s sake, here they are.

  1. Known: while we can (and usually do) identify these cases, they are typically thought of as ultra-rare and/or “too difficult to reproduce”; thus, the plan for handling them involves tedious one-off or ad-hoc procedures, which are often completely undocumented.
  2. Unknown: the saying goes, “You don’t know what you don’t know” – these cases don’t even cross our minds as a possibility, either due to ignorance or because they are [supposedly] contrary to our business rules or our [assumed/remembered] application logic.

Again, the end result is the same: panic, discord, technical debt, and wasted hours of remediation. So why do we do this to ourselves? Well, one common justification tends to be “Oh, that’ll never happen!”, and we sweep it under the rug. Then there’s the laziness, busy-ness / lack of time, pressure to deliver, gap in abilities or tool-sets, passing the buck, etc. We’re all guilty of at least two of these in any given week.

basic-technical-debt-legos

So let’s move on to the important part: What can we do about it? Largely, we can take the excuses & reasons and simply turn them on their heads. Take ownership, commit to learning, communicate with management, make time for planning and documentation, and once a path is laid out for remediation, actually do the work. It’s often not easy or pretty, but a little pain now beats a lot of pain later.

tech-debt-cute-line-graph

I know, easier said than done, right? :o)

Example time.

Let’s say our sales offices are closed on Sunday – this is our “operating assumption”. Therefore, no orders will be processed on Sundays – this is our “business rule”. So because of this, we’ve decided to do some ETL processing and produce a revenue report for the previous week. Now, we’re using some antiquated tooling, so our first batch of ETL, which takes the orders from the sales system and loads them into the bookkeeping system, runs from, say, 2am to about 5am. Then we have a second batch, which moves that bookkeeping data into the report-staging area, from 6am to about 7am. We need those hours of “buffer zones” because the ETL times are unpredictable. And finally, our reporting engine churns & burns at 8am. But along comes Overachieving Oliver, on a Sunday at 5:30am, and he’s processed a couple orders from the other day (so perhaps he’s really Underachieving Oliver, but he’s trying to make up for it, because he enjoys alliteration almost as much as I do).

lumberg-work-on-sunday

Woah nelly! What happened? Oliver’s sales didn’t make it into the report! Not only that, but they don’t even exist according to bookkeeping. But come Monday, if he tries to re-process those orders, he’s probably going to get an error because they’re already in the sales system. So now he’s gotta get IT involved – probably an analyst, a developer, and maybe even a DBA. Damn, that’s a lot of resources expended because an assumption and a rule were broken!

warning-assumptions-ahead

Here’s another one. An order consists of 1 or more order-lines, which each contain 1 or more of a given item (product). Let’s say we store these in our database in a table called OrderLines, and each line has a LineNumber. Now, we have an assumption that those LineNumbers are always sequential. It’s even a rule in our applications – maybe not all parts or all modules, but at least some, and enough to cause a fuss if there’s a gap in that sequence or if a line is moved around somehow without proper data dependency updates (which, by the way, are application-controlled, not database-controlled). Plus there’s some manager who depends on this old reporting metric that also breaks when those line numbers are out-of-whack. But this should never happen, right?

assumptions-blind-spots-blanchard

The operative word there being “should”. But apparently there was a bug in an “update order” routine that left a gap in the sequence. Or maybe the DBA was asked to delete something from an order post-mortem, and there’s no way within the app’s ecosystem to do it, so he had to write some queries to make it work. And guess what? Because the Dev team is super busy working on the hot new feature that everybody wants, it will be 2 weeks before they can circle back around to address that update-bug or add that utility function for line-deletion. So now the DBA’s writing a stored-proc to wrap in a scheduled job to “fix” those order-line sequences every night, to prevent that one app module from breaking and keep those management reports accurate. And, to quote my very first post ever, the DBA waxe[s] wroth.
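That band-aid proc is about as glamorous as it sounds — something like this sketch (OrderId is a hypothetical key column, and the real fix is still the app-side bug):

-- Nightly "fix the gaps" patch: re-sequence line numbers within each order.
;WITH Renumbered AS (
    SELECT LineNumber,
           ROW_NUMBER() OVER (PARTITION BY OrderId ORDER BY LineNumber) AS NewLineNumber
    FROM dbo.OrderLines
)
UPDATE Renumbered
SET LineNumber = NewLineNumber
WHERE LineNumber <> NewLineNumber;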

picard-why-we-cant-have-nice-things
Picard knows best..

So, prevention? Well, I’d probably start by locking down the order entry system during that off-limits window. We should also wire-up those big 3 processes so that there’s no need for indeterminate buffer-zones and inconsistent timing. And yeah, it could be a big change needing lots of team buy-in and approvals, but it’s worth the investment. Right? Right!

Hope you enjoyed reading! Until next time…

Drafted with StackEdit, finished with WordPress

No Regrets

Continuing from my previous post…

I loved my first job, no doubt. But slowly over the years, I started to feel pigeon-holed. I was always on-deck for problems that happened in the system due to what I’ll call “bad data”. If you’ve been in the industry, you know what that means – something in your datastore doesn’t “follow the rules”, or “breaks something that normally doesn’t break”, or “is just a little off-kilter”. (Otherwise known as “edge cases”, which I’ll talk about in a later post!) And often, it’s because the application allowed that data, which breaks the rules, into the system – Devs hate to hear this. But just as often, it’s because the data-guy, the DB-dev or analyst, ingested or transformed the data in a way that breaks the rules – so there’s plenty of blame to go around. And we knew that, and we respected that. We didn’t pass the blame, we just buckled down and tried to get it fixed.

But, too often, I was the one called upon to fix it, because I was the closest to the data – and yes, I knew it intimately, no doubt. Because of that, I started to feel left behind the curve. The other devs were pushing new tech stacks, learning new things, while I kept pluggin’ away at good ol’ TSQL (with the occasional smattering of C#). And yes, of course that’s partially my fault; I could have self-asserted, pushed for more learning opportunities, etc. I don’t regret my first job. I learned a ton – basically everything I know about databases, technology, and IT. But it was time for a change.

And that’s how I ended up where I am now!

Written with StackEdit.