SQL Style Guide – a Rebuttal

Both the original article, and my responses, are guidelines, meant to help you start a conversation with your own team about how you can get better at writing code.

In what will surely be a controversial post, I give my take on some of the major points of this “SQL style guide” that made the rounds on Hacker News / Reddit a few years ago.  Hey, I never claimed to be a source of breaking news.

[Image: stamp of controversy — “Feels so empty, without me…”]

Now remember kids, these are opinions — everyone’s entitled to theirs, and it doesn’t mean you’re right or wrong.  As the author repeatedly points out, consistency among a team / project / environment is more important than anything else.  Both the original article, and my responses, are guidelines, written by people with some experience (guided by others with more experience) to help you start a conversation with your own team about how you can get better at writing code.  Because that’s what we’re paid to do, among other things.

Basics

I agree with most of the points here:

  • Consistent descriptive names (I would add “concise” too — autocomplete/intellisense has come a long way, but we’re still typing stuff sometimes)
  • White space & indentation
  • Datetime literals – hail Saint Bertrand
  • Comments – you, dear reader, know my thoughts already
  • Some OOP principles should be avoided because they usually lead to poor performance
  • Hungarian notation is best left in the ’80s where it belongs

Something I don’t abide by:

  • Sticking to the ‘standard’ and avoiding vendor-specific functions/features

Some brief justification for my rejection:

Database code portability is largely a myth — it sounds great in theory, but most teams/projects never actually do it.  If your data persistence layer is ever slated for migration to a new tech-stack, it’s going to require a massive overhaul anyway, in which those vendor-specific functions/code-bits will be the least of your worries.  More likely, you’ll be swapping out pieces of the data layer in small chunks to move into other storage tech, like a NoSQL store or a DocumentDB or something; and eventually, over time, the whole original data layer will have been moved, and the concern over SQL code portability will be moot.

Furthermore, database vendors give you these features/functions for a reason — they’ve found that it greatly enhances their product and the productivity of developers who work with it.  So why not take advantage?

Finally, if your application is ‘cutting-edge’ enough that ALL db access is done via ORM or some kind of repository layer in the code-base… guess what?  You don’t have this problem in the first place!  Because the database is a dumb state storage mechanism, containing little to no actual code whatsoever (stored procs, functions, etc.).  So, port away!

[Image: now that’s what I call edgy — “I can almost feel the edgy-ness…”]

Other basic issues:

  • CamelCase (actually TitleCase) is pretty standard in a lot of DB schemas, and I see nothing wrong with it.  Despite my love of underscores (snake_case), it does make for more awkward typing.
  • Plural or singular entity names should match the convention preferred by your overlaying ORM, if you’re at the point of DB design; most of the time, though, you’re working with a DB that you’ve inherited and you have no control over entity naming anyway, so stop whining about it and get on with life.
  • Leading vs. trailing commas: I prefer leading, but the arguments against it can sound convincing (not just his, but in the tech community generally).  My experience is that I’m more likely to futz with the middle-to-end of a list than the beginning (1st item), which makes leading commas more likely to help, but that’s just me.  Also, thanks to an awesome member of the Coding Blocks Slack, a point in my favor: when you add to the end of the column list, source code comparison (diff) tools will show only the one changed line instead of two.  (A small sketch follows this list.)
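
To make that last point concrete, here’s a minimal leading-comma sketch (the table and columns are made up for illustration):

```sql
SELECT
      c.CustomerID
    , c.FirstName
    , c.LastName
    , c.Email
--  , c.CreatedDate   -- appending a column later touches only this one new line in a diff
FROM dbo.Customer AS c;
```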

Naming Conventions

Yes, please avoid reserved keywords, replace spaces with underscores (or use TitleCase to avoid having spaces), and use concise yet meaningful table aliases when you’re writing queries.  I still remember, when I first started working at my current company, literally gasping and cursing under my breath when I found that some databases actually had a space in the name.

Beyond that, the article goes a bit too deep in the weeds for me, especially the whole “known suffixes” thing — because isn’t that just Hungarian notation on the other end?  How about names that make the intent intuitive, such as IsActive for a bit flag, LineNumber or RecordSequence for a sequential integer that’s not auto-generated (not an identity value), or @NumMonths as a parameter for a stored proc that indicates how many months of reporting to fetch?  Common sense should rule the day, not arcane prefix/suffix conventions that will never be documented or enforced.
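
For example, a quick sketch of what I mean, using hypothetical proc/table/column names; the point is that the names tell you what things are without any prefix/suffix decoder ring:

```sql
CREATE PROCEDURE dbo.GetMonthlyWidgetReport
    @NumMonths INT    -- how many months of reporting to fetch
AS
BEGIN
    SELECT
          w.WidgetID
        , w.LineNumber    -- sequential integer, not an identity value
        , w.IsActive      -- bit flag
    FROM dbo.Widget AS w
    WHERE w.CreatedDate >= DATEADD(MONTH, -@NumMonths, GETDATE());
END;
```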

White Space

This whole notion of a “river” feels strange and awkward.  It’s made worse by the fact that some clauses’ keywords are “too large” for the “standard” river width (which is roughly the width of the SELECT keyword), such as GROUP BY and LEFT JOIN.  Plus, I’ve yet to see truly excellent tooling support for this kind of style (be it VSCode, Visual Studio, SSMS, SQL Prompt, or other styling tools / auto-formatters).  Given that I still largely write my code without continuous re-style-on-the-fly assistance, I find this hard to digest.
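
For reference, here’s my rough approximation of the “river” layout (made-up table and columns, not a quote from the guide), so you can see what I’m reacting to:

```sql
-- Keywords right-align against a "river" of whitespace running down the middle;
-- wider keywords like GROUP BY don't fit that river cleanly, which is my gripe.
SELECT first_name
  FROM staff
 WHERE pay_rate > 20
 GROUP BY first_name
HAVING COUNT(*) > 1;
```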

Side-bar: big kudos to the author for pointing me at this typography article, which challenged my long-ingrained writing preference of double-spacing between sentences.  Even now, I do it while I write this post, knowing it’s wrong — I can’t help myself!

For similar reasons, JOINs and sub-queries don’t need to be “on the other side of the river” — since I’m actually saying “there is no river”, what I mean is: don’t indent those JOINs or sub-queries excessively.  In the FROM clause, the JOINed tables are just as important as the first one, so I don’t see the need to reduce their importance by putting them so far to the right.  And please, for the love of all things holy, stop putting the JOIN predicates in-line (on the same line) after the joined table — put the ON conditions on their own line and indent them!
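
A sketch of the layout I’m arguing for, with hypothetical tables: the JOINs sit at the same level as the first table, and each ON predicate gets its own indented line.

```sql
SELECT
      o.OrderID
    , c.CustomerName
    , s.ShipmentDate
FROM dbo.CustomerOrder AS o
JOIN dbo.Customer AS c
    ON c.CustomerID = o.CustomerID
LEFT JOIN dbo.Shipment AS s
    ON s.OrderID = o.OrderID
    AND s.IsCancelled = 0
WHERE o.OrderDate >= '2018-01-01';
```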

Sub-queries are a strange animal, and I won’t dive deep on them here.  Basically, I try to style them the same as I would normally, just indented to the context of their scope, and within JavaScript-style parentheses — meaning, open-paren on the preceding line (or its own line), then the body, then close-paren on its own line to finish.
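
Roughly like this (hypothetical tables, and a sketch rather than gospel):

```sql
SELECT
      c.CustomerID
    , recentOrders.OrderCount
FROM dbo.Customer AS c
JOIN (
    -- the sub-query is styled like any other query, just indented one level
    SELECT
          o.CustomerID
        , COUNT(*) AS OrderCount
    FROM dbo.CustomerOrder AS o
    WHERE o.OrderDate >= '2018-01-01'
    GROUP BY o.CustomerID
) AS recentOrders
    ON recentOrders.CustomerID = c.CustomerID;
```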

Special note about the “Preferred formalities” section

BETWEEN is mostly evil.  I’m not saying you should never use it; just be very clear about why you’re using it, and only use it with discrete valued types (DATE, INT), NOT with continuous (or conceptually/nearly-continuous) value types (DATETIME, REAL/NUMERIC/DECIMAL).
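
The classic illustration of why: BETWEEN is inclusive on both ends, so with DATETIME values the boundary row can land in two adjacent ranges (or get silently dropped, depending on which end-date you pick).  A quick sketch against a hypothetical table:

```sql
-- Risky: BETWEEN is inclusive on both ends, so a row stamped exactly
-- '2018-02-01 00:00:00' is counted in BOTH January's and February's query.
SELECT COUNT(*)
FROM dbo.WidgetOrder
WHERE OrderDateTime BETWEEN '2018-01-01' AND '2018-02-01';

-- Safer for DATETIME: a half-open range, inclusive start and exclusive end.
SELECT COUNT(*)
FROM dbo.WidgetOrder
WHERE OrderDateTime >= '2018-01-01'
  AND OrderDateTime <  '2018-02-01';
```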

The UNION operator is often misused, usually because UNION ALL is preferred (and is what you really meant anyway), but a blanket statement to “avoid it” misses the point of why it exists in the first place.  Likewise, temporary tables (#temptables) are wonderful tools when used properly and in moderation, but flagrant overuse can lead to what I call #tempocalypse (which means you’re hammering your TempDB so hard that its underlying storage system screams for mercy).
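
For the UNION point, the distinction in a nutshell: UNION de-duplicates the combined result (at a sort/hash cost), while UNION ALL just concatenates, which is usually what you actually meant.  A hypothetical example:

```sql
-- UNION ALL keeps every row from both sources; no implicit DISTINCT, no extra sort.
SELECT WidgetID, 'Current' AS SourceSystem FROM dbo.Widget
UNION ALL
SELECT WidgetID, 'Archive' AS SourceSystem FROM archive.Widget;
```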

[Image: cry for help, Cheeto Puffs — “Like Cheeto Puffs, your TempDB has become bloated and full of cheeze.”]

Misnamed section “Create syntax”

What he really does here is expound upon table and database design principles.  This does not belong in a “Style Guide”; it probably belongs in a “Design Guide”, but because the relational database as a technology is so mature (yes, that means ‘old’) by this point, most of this feels completely unnecessary and redundant.  And again, you’re working with inherited designs over 90% of the time, where you don’t get to make these decisions, because they were made for you by your predecessors.  If you are so lucky as to be designing a relational model from scratch, look for advice from the tried-and-true architects of the trade.

I do echo and reiterate his advice to look at other data storage tech for things that are not ideally served by an RDBMS, such as EAV models, document storage, key-value storage, text search, etc.  There is no shortage of NoSQL tech to meet those needs, and they will do a much better job of it than you or I could in designing a hacked-up SQL model that somewhat does the job but falls over flat when faced with scaling up to a larger size or heavier workload.

In Conclusion

As I said at the beginning, these are opinions.  I applaud the author’s effort, and the fact that he actually got something together, made it open-source on GitHub, asked for feedback, and actually got traction with several communities discussing it.  That’s more than I can say for myself at this point!

I hope that this spurs some discussion on your database development team, and perhaps helps you work toward making a consistent style guide for your own environment.  Even if it’s just water-cooler talk, it’s always good to remember why we write code: so that others can read it.  (Sure, it’s also to make stuff happen, i.e. “make the computer do things”, but that’s a side-effect — your main goal should always be readability and clarity.)

Do you have comments or thoughts, or do you disagree?  Leave me a comment!  I’d love to hear from you.  🙂

[Image: needs-moar-drama — “Drama-Cat is bored…”]

Just Another Brick (in the) Wall

Over-engineering may be foolish, but under-engineering is just as perilous.

This month’s T-SQL Tuesday topic, thanks to Wayne Sheffield, is a doozy.  It’s difficult to admit failures, even if we eventually overcame them.  Mostly, it’s tough to fess up to those hours upon hours of what feels like wasted time preceding the hopefully inevitable breakthrough.

Let me tell you a tale of square pegs and round holes.

Get your mind out of the gutter.

And strap in.  This puppy went over 2k words.  I didn’t intend it that way, but, c’est la vie.

A Short History Lesson

I used to work for a software company that made K-12 educational assessment & reporting products.  A large part of K12ED is dealing with learning standards, these historically fraught named concepts that students are supposed to have “mastered” and learned during their time in the US public school system.  You ever hear a reference in pop-culture or media referring to “teaching to the test” or “state standardized testing” or similar?  That’s what they’re talking about.

In the late 90’s, a bunch of states, including California in 1997, finalized and centralized a “list” of sorts, almost a database (but not formally, because let’s face it, this was still the early days of what we know as modern IT), of said Standards, which came to be unofficially known as the “CA ’97 Standards”.  For over a decade, these learning goals dictated what was taught when in our schools.  Government funding incentives, based on state standardized test scores, drove instruction to the point that teachers had little if any time to dynamically mold their students’ minds into critically thinking, multi-faceted, creative individuals.

But this article isn’t about my bias.  This is about the broad, sweeping shift in the K12ED landscape called “Common Core”.  As the technology sector grew more mature and started permeating more “traditional” industries, a vocal minority of thought-leaders had what they deemed an epiphany.

Hey, what if we took the old-guard educational bureaucracy, and all those disparate state standards, and turned it into a unified, technology-driven learning ecosystem?

Sounds great in theory, right?  I mean, surely their intentions were good?  Well, you know what they say about the road to Hell and how it’s paved…

Gosh, there goes my bias again.  Sorry, let me just tuck that back in its pocket.

Anyway.  These new Core Standards introduced some new ways of thinking; for example, that some learning concepts are inter-related and “cross-cutting” (a fancy way of saying that sometimes a Math concept requires a fundamental Reading-Literacy knowledge-point to fully grasp).  This had some very interesting impacts on the underlying technology systems which relied on, and housed, said Standards.  Systems which, I might add, had in many cases existed for over a decade at this point.

Bringing it Back to Data

Our company’s system was one such.  Our key partner’s system, from which we quite literally inherited the traditional, relational database structure to house the CA ’97 Standards, was another.  You see, back before RESTful APIs ran the world, software systems relied heavily on what we call “local data stores”.  In order to show the teachers and administrators (our primary users) which Standards their students were performing well or poorly in, we had to relate those Standards to the test questions that said students were tested on, month after month, year after year.  And, like so many other small businesses of the late 90’s / early 00’s, we had a trusty ol’ SQL Server lying around, just waiting to be loaded with all our precious data.

This was fine, for the most part.  The legacy Standards conformed reasonably well to a relational data model, even though we had to throw in a bit of hierarchy (using the good ol’ adjacency list scheme).  There wasn’t a complicated set of relationships between Standards in different subjects (Math, Science, History), and people didn’t care to do much in-depth analysis beyond “are my students getting well-prepared for the state tests?”.
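
For the uninitiated, the adjacency list scheme just means each row carries a pointer to its parent row.  A simplified, hypothetical sketch of the idea (not our actual schema):

```sql
-- Each Standard points at its parent; a NULL parent means it's a top-level node.
CREATE TABLE dbo.LearningStandard (
      StandardID        INT            NOT NULL PRIMARY KEY
    , ParentStandardID  INT            NULL
          REFERENCES dbo.LearningStandard (StandardID)
    , SubjectCode       VARCHAR(20)    NOT NULL   -- e.g. 'MATH', 'ELA'
    , StandardCode      VARCHAR(50)    NOT NULL
    , Description       NVARCHAR(1000) NOT NULL
);
```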

Enter Common Core

You parents thought these things were complicated and convoluted — just Google “common core math problem” and you’ll find no shortage of critical satire.  Well, the underlying data structures required to store these new Standards were going to be significantly more complex as well.

One of my main jobs, for about a year or two, was to oversee the importation of said Core Standards into our SQL database system.  On the surface, it seemed reasonable — we had a hierarchy concept already, and we had a roll-up & drill-down mechanism for the handful of different “levels” of said hierarchy.  But it was all very static.  What that means, for us tech-heads, is that it was not easy to change or extend; not adaptable to new and expanded requirements.  The older Standards adhered to a fairly strict hierarchy, and each level of roll-up had a distinct name.  With Common Core, they broke out of that old mold, while simultaneously keeping some of the names only to change their meaning depending on context.

Think of it this way.  A group of cattle is called a herd.  A group of sheep is also called a herd.  And a group of seagulls is called a flock.

And I ra-a-an.. I ran so far a-wa-a-ay…

Sorry, where was I?  Right, group names.  So what if the government suddenly decided for us that a group of sheep will from now on be called a ‘gaggle’.  But only if they’re all female.  If the group contains any male sheep, it’s called a ‘herd’ still.  And groups of cattle will still be called herds, unless it’s purely a group of bulls being raised for beef, in which case we call it a ‘meatsock’.

Have I lost you yet?  Of course I have!  This is pure nonsense, right?  Language does not work this way.  Moreover, hierarchies of groups of things do not work this way.

But I digress.  There was, despite my jest, a method to the madness of the new Common Core hierarchy groupings.  And I did learn it and understand it, for the most part.  The problem was that it threw our existing legacy data model to the wind.

Enter Academic Benchmarks

As usual with a legacy software system in a small company, the resources and buy-in for a data-layer overhaul were nil.  So it fell to us intrepid DB-Devs to shove that snowflake-shaped peg into the very square hole of the old relational model.  We sketched out plenty of ERDs, brainstormed how to tack-on to our existing structures, but nothing short of a miracle would make this new data conform to these old static norms.

Thankfully, the “geniuses” (and yes, that is used sarcastically) over at Academic Benchmarks, or AB for short (at least for the purposes of this post), had already done this.  And we paid them, thousands of dollars per year, for the convenience of using their GUIDs to identify Standards across different systems and vendors.  Never mind that they were just perpetuating the bad model of yesteryear; never mind that they provided zero support for data quality feedback loops.  We could happily take their Excel files or CSVs and load them nearly straight into our database.

Enter, and Exit, ASN

While I was searching for the words to express how insufficient our data model was, I came across this little gem from the Gates Foundation: Achievement Standards Network, or ASN.  (ASN stands for many other things, so as with all acronyms, it’s all about context; just fair warning for the Google-happy.)  The architects here had understood that learning standards needed a better and more flexible model, not only in terms of storage, but also in terms of data interchange format.  This new kid on the block called JSON had been making waves for a while, and was subsequently widely adopted by the tech industry in general, so it stood to reason that this would be the preferred format for publishing and serving the Standards from ASN.

Bonus: it was FREE.  Yes, really.  What a wonderful thought, I said to my team, to imagine never having to pay those crooks at AB another red cent!  Eventually.  After years of transition.  But alas, it was not to be.  See, AB had been around the industry for a long time.  They had their hooks in almost every learning content publisher and assessment vendor.  So as tempting as this shiny new source of academic truth may have been, sadly, it was not practical.

Enter the Contractor

Somewhere around the same time, we took on a promising new developer who not only had a very strong background in Math and CS fundamentals, but had also proven his worth with real-world applications and actual “in the wild” deployments and users.  He was a bit arrogant, actually.  One could argue that he’d earned it, perhaps, but we didn’t appreciate always being told everything we were doing wrong, constantly, to the point that it was hard to hear the more important bits of wisdom and insight into how we could improve.

Despite that ego, one of his biggest contributions to the team was a fresh impetus to learn new things and branch out to new technologies.  To start looking at new Javascript frameworks.  To revisit OO fundamentals like Dependency Injection, and stop writing such procedural code.  To consider NoSQL data stores.  In particular, graph data stores.

Sadly, that last part came in too little, too late.

Side-note: on the crest of the micro-services wave, he diagrammed and proposed an entire system re-write for our core software product, using micro-services architecture and the concept of small, purpose-dedicated data stores.  It looked real purty, but was ultimately and completely impractical for a company of our size.  If we were a “true startup” with millions in VC funding coming out the ears, and could afford to throw a brand new, young & hungry “2 pizza team” at the product, then sure.  But that was not reality.

The Brick Wall

No two ways about it: we had to support these new Standards.  So we soldiered on with our relational database tables.  And we tacked-on additional entities and relationships, logic and code, to make it “almost” do what the Common Core Standards wanted to do.  Effectively, what we were trying to do was shove that pretty, round, sparkly peg of graph data into the dingy old square hole of a 15-year-old SQL table structure.

Somewhere along the way, AB caught up with the times and started offering other, less “flat” formats, for their Standards data.  So even though ASN was off the table, not all of our work toward JSON ingestion/conversion went to waste.  Consuming the data exports from the vendors wasn’t a problem — we’d already been doing this.  That was not the issue.

The issue, the brick wall against which we continually banged our heads, was the fact that we just plain couldn’t support the advanced & complex relationships and groupings (categorizations) of the new Standards.  Which turned out, in retrospect, not to be the end of the world, because honestly it would take years, if not decades, for the educational system’s old-guard mentality to even comprehend such relationships and categorizations, let alone how they could help shape classroom instruction toward better outcomes for their students.

Good lord, that sounded like a bunch of jargon.

What I mean, in plainer English, is that we spent a lot of time worrying and arguing about something that did not matter as much as we thought it did.  The consumers of our system, the teachers and principals and such, honestly didn’t care about all that.  They just wanted to know if their kids were on-track to be “Proficient” on the state tests so their funding would remain steady.  (I don’t give them enough credit, I know; teachers themselves also needed to know, on a regular basis, which kids needed special attention in what areas of learning, but that’s beyond the scope of most generalized reporting tools.)

Hindsight is Always 20/20

So what was the lesson here?  I don’t want to overlook the fact that we were still using the wrong data storage technology for the problem, fundamentally.  Or at least, the wrong data model.  But, we live in a real world where software needs to be delivered on-time, in-budget, with less than perfect teams who have less experience and expertise than they feel they should.  So instead of butting our heads against that brick wall, let’s try to remember to be pragmatic.  To be realistic about what’s feasible for who & when; and to adapt the parts and segments of our application infrastructure, appropriately and efficiently, to the business problems and use-cases at-hand.  Over-engineering may be foolish, but under-engineering is just as perilous.

Know your Role

DBA does not mean “database archaeologist”, even though sometimes that’s what you end up doing…

Aka “Dammit Jim, I’m a DBA, not a data researcher!”  Or, as I stated on Twitter:

DBA != Database Archaeologist

[Image: penguin archaeologist — “because… SQL ❤ Linux!”]

Today I spent more hours than I care to admit, tracking down some obscure data from two disparate systems in an attempt to reconcile what were supposed to be matching records.  Part of that is my own fault — I’m a sucker for interesting problems and edge cases, as I’ve blogged about before…

mostly just for the smug satisfaction of proving to the business that “your assumptions about how your data works are invalid”.

But mostly it’s because, the further back in time you go, the less reliable the data becomes.  Especially (exponentially) when that data originates from human free-form text input.

[Image: Mr. Garrison, “manual inputs are bad… but GUI’s are OK.. and CLI’s are great!”]

Let’s contrive an example.  We have our core business product system, WidgetMaster, which tracks Widgets we ship out by WidgetNumber.  Our partner associate business runs an online widget exchange where people can buy and sell their Widgets in a sort of second-hand/after-market fashion.  PartnerExchange listings are supposed to include the WidgetNumber for ease of tracking and associating data between the two systems, even though they’re officially run by different companies (or in my case, different departments of the same company — yeah, think about that for a second).

Now, ideally, theoretically, those WidgetNumbers should always match up.  But unfortunately, up until late 2014, our WidgetMaster system didn’t have an API that PartnerExchange could call to obtain a widget by its number; and even if it had, they’d still have to rely, at some level, on the customer (or a worker) to read and enter that WidgetNumber into the exchange listing.  But wait, we started doing bar-codes back in 2010, so “most” of them are actually scanned from the bar-code; but not every customer has that capability, so there’s still a lot of hand-entered data.

So we have some dirty data.  Let’s complicate things a bit.  Over time, those widgets can come back to WidgetMaster for update/upgrade and then ship back out.  Again, WidgetNumber should remain consistent throughout that process.  Now, when PartnerExchange sells certain widgets, sometimes they’re part of a SuperSpecialCollection.  This collection spans many years, maybe even a decade or more.  WidgetMaster got wind of this SuperSpecialCollection being bought up by Mr. HighRollerCustomer, so we started marking the incoming/outgoing records with a new property.

But it’s text.

It’s entered by the receiver, based on looking at the Widget’s buy/sell history in PartnerExchange.  And yes, the HighRollerCustomer who last bought the widget is aware that it’s part of their SuperSpecialCollection, but they aren’t guaranteed to specify that when they send the widget back in to WidgetMaster for upgrade.

Do we see the problem yet?

[Image: oh it gets better (again) — “Yes, yes it does!”]

See, about 5 years ago, there was a reorg, and the dev team for WidgetMaster completely revamped the way in which “collection membership” for incoming widgets is designated/tracked. So now it’s over in some property table. To make matters worse, PartnerExchange renamed SuperSpecialCollection to AwesomeCltn a few years ago because they were tired of typing so many letters (and apparently fans of cryptic abbreviations).

Fortunately, PartnerExchange has done a decent job of at least storing the correct WidgetType and WidgetQuality in their listings, despite WidgetNumbers being fairly sparse.  But again, because over in WidgetMaster, we’re supposed to associate each WidgetNumber with the AwesomeCollection, we now have this secondary task of mapping unmatched WidgetNumbers across systems, by using Type and Quality from one side (partner) combined with Collection-membership from the other side (master), by assuming that the partner’s designation of SuperSpecial/AwesomeCollection is correct.

If your head’s not spinning yet, give yourself a round of applause. While rubbing your tummy and tapping your foot.

Needless to say, this is hard.  We’ll probably get the majority of records matched (mapped?) eventually by using a couple of string LIKE predicates and some clever try/pass/retry flow, but it’s tedious at best.  Another bit of frustration comes when we do a couple of ad-hoc searches through each system to try to apply reason and logic, i.e. find a pattern; and, since we’ve already done the work, we might as well put that info into our results, even if it doesn’t show us a useful pattern by itself.
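
To give a flavor of that first pass (every object name below is hypothetical), the general shape of the matching is: join on Type and Quality where the WidgetNumber is missing, and lean on LIKE to catch the collection name in both its old and renamed forms:

```sql
-- Partner listings that claim collection membership (free-text, renamed over the years),
-- matched to master widgets flagged in the collection, joined on Type + Quality.
SELECT
      pe.ListingID
    , wm.WidgetNumber
FROM PartnerExchange.dbo.Listing AS pe
JOIN WidgetMaster.dbo.Widget AS wm
    ON wm.WidgetType = pe.WidgetType
    AND wm.WidgetQuality = pe.WidgetQuality
JOIN WidgetMaster.dbo.WidgetProperty AS wp
    ON wp.WidgetID = wm.WidgetID
    AND wp.PropertyName = 'CollectionMembership'
WHERE pe.WidgetNumber IS NULL
    AND (pe.CollectionName LIKE 'SuperSpecial%'
         OR pe.CollectionName LIKE 'Awesome%');
```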

So how do we approach this?  Well, as I said, I spent what I felt was too much time on it, but essentially I did an initial “majority rules” mapping attempt (first pass), followed by a few reconciliation attempts for the remainders.  Those consisted of fairly identifiable patterns with a couple of outliers.  With those outliers, as with the rest of the unmapped records at the end of the day, I had to tell the business, basically, “Here’s the majority of the results.  You can assign a research specialist or analyst to the rest, if you feel it’s that important.”

I may have done this with slightly more attitude than necessary.

How could we improve this?  The bigger geeks in the room may pipe up with “machine learning!”  Ok sparky, do you have that infrastructure ready to go?  No?  How long for implementation?  mumble mumble something about Azure mumble…  Okay, sure, how about training the model so you can feed it your data?  Cool, well enjoy the incredulous laugh you’ll get when you tell the manager that.

How about other tool sets?  Sure, we could check out Python or R, or write a C# app maybe?  Well, guess what: we still need to look at the data to understand what patterns (or lack thereof) there are to work with.  And again, lead time.  Unfamiliar tools mean longer development cycles.  And they’re really not guaranteed to produce any better results (more matches) at the end of the day, are they?

Because your data models and your analyses are only as good as the data itself.

[Image: data is the iceberg — “Or is it ‘data are’?”]

And with that, I’ll call it a day.  Thanks for reading!