Dates, Date-pickers, and the Devil

When a date range, or time period, is specified in SQL, it’s easiest, clearest, and most concise to use a “greater-than-or-equal-to Period-Start, and less-than Next-Period-Start” logic. Mathematically speaking, we are defining the range as closed on the left, open on the right.

This is a bit rant-y, but… indulge me.  I’ve been writing/refactoring a lot of old reporting queries.  And, like most reports, they deal with dates and datetimes — as parameters, boundaries, or where/join predicates.  I also got way too intense with a recent SSC (SQLServerCentral) post, which fueled the fire even more.

[Image: fluffy-angry-puppy]
I’m so cute and ANGRY!

SQL Server is very good at handling temporal datatypes and calculations against them.  We’ve got functions like dateadd, datediff, and datepart; datatypes like datetime, datetime2, and datetimeoffset; and so on.  It supports all sorts of format conversions if you need to display them in various ways.

..even though that should be left to the presentation layer!

Here’s the issue.  Well, there are several issues, but we only have time for a few.

Here’s the first problem

Report users don’t understand the “end of a time period” problem.  I don’t have a good name for it; others might call it the “Day plus one” problem or the “Less than date” problem.  What do I mean by this?  Well, let’s back up a bit, to DBA Commandment #6, “Thou shalt not use between with datetimes.”  In order to first understand the issue, we have to understand why this is a commandment.

When a date range, or time period, is specified in SQL, it’s easiest, clearest, and most concise to specify it like so: @TheDate >= @StartOfPeriod and @TheDate < @StartOfNextPeriod.  Mathematically speaking, we’re defining the range as “closed on the left, open on the right”.  In other words, Min <= X < Max.

The reason we do this with datetimes is found right there in the name of the datatype — it has (or can have) a time component!

[Image: stone-tablets-with-roman-numerals-to-10]
There are probably more than 10, but it’s a good starting point…

Let’s talk examples

Say you’d like to report on the month of March 2017.  How do you determine if your data-points (stored as datetime or, hopefully, datetime2) are within that period, that month?  Well sure, you could write where month(MyDateColumn) = 3 and year(MyDateColumn) = 2017

NO.  That is horrible, don’t do that.

It’s not SARGable and renders your index on that column useless.  (You do have an index on it, don’t you? No? Make one!)  Okay, let’s stick with something SARGable.  How about MyDateColumn between '20170301' and '2017-03-31T23:59:59.999'?  (You did read this post about using culture-neutral datetime literals, right?)  But wait!  If your data is a datetime, it’s not actually that precise — your literal gets rounded up to 20170401 and you’re now including dates from April 1st (at midnight)!

Oh that’ll never happen… until it does.

Second problem

Many developers and report-writers assume that the values in their data will never be within the typical “1 second before midnight” or “1/300th of a second before midnight” escape window of your “3/31/2017 23:59:59.997” bounding value.  But can you guarantee that?  Didn’t think so.  Worse, if you use the .999 fraction as given in the 2nd example, you’re either “more” or “less” correct, and nobody can actually tell you which way that pendulum swings because it depends on the statistical likelihood of your data having actual literal “midnight” values vs. realistic (millisecond-y, aka “continuous”) values.  Sure, if you’re storing just a date, these things become a lot less complicated and more predictable.

But then why aren’t you storing it as an actual date, not a datetime!?

So what’s the right answer?

As I said, “greater than or equal to ‘Start’, and less than ‘End’”, where ‘End’ is the day after the end of the period, at midnight (no later!).  Hence, MyDateColumn >= '20170301' and MyDateColumn < '20170401'.  Simple, yes?
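
In query form, that pattern looks something like the sketch below.  (Table and column names here are placeholders; the DATEADD variant just generalizes it to any month you throw at it.)

-- Closed-open range: >= start of the period, < start of the next period.
-- dbo.MyTable / MyDateColumn are hypothetical names for illustration.
DECLARE @PeriodStart date = '20170301';
DECLARE @NextPeriodStart date = DATEADD(MONTH, 1, @PeriodStart);  -- 2017-04-01

SELECT *
FROM dbo.MyTable
WHERE MyDateColumn >= @PeriodStart
  AND MyDateColumn <  @NextPeriodStart;   -- never BETWEEN, never 23:59:59.xxx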

[Image: keep calm and keep it simple]
KCKS

But wait, there’s more!

I mentioned “date-pickers” in the title.  When it comes to UX, date-pickers are a sore subject, and rightly so — it’s difficult to truly “get it right”.  On a “desktop-ish” device (i.e. something with a keyboard), it may be easiest on the user to give them a simple text-box which can handle various formats and interpret them intelligently — this is what SSRS does.  But on mobile devices, you often see those “spinner” controls, which is a pain in the arse when you have to select, say, your birth date and the “Year” spinner starts at 2017.  #StopIt

I mean, I’m not that old, but spinning thru a few decades is still slower than just typing 4 digits on my keyboard — especially if your input-box is smart enough to flip my keyboard into “numeric only” mode.

Another seemingly popular date-picker UX is the “calendar control”.  Oh gawd.  It’s horrible!  Clicking thru pages and pages of months to find and click (tap?) on an itty bitty day box, only to realize “Oh crap, that was the wrong year… ok let me go back.. click, click, tap..” ad nauseam.

[Image: stop-it-sign]
#StopIt again

The point here is, use the type of date-picker that’s right for the context.  If it’s meant to be a date within a few days/weeks of today, past/future — OK, spinner or calendar is probably fine.  If it’s a birth date or something that could reasonably be several years in the past or future, just give me a damn box.  (Heck, I’ll take a series of 3 boxes, M/D/Y or Y/M/D, as long as they’re labeled and don’t break when I omit the leading-zero from a single-digit month #!)  If there’s extra pre-validation logic that “blocks out” certain dates (think bill-payer calendars or Disneyland annual-pass blackout-days), that probably needs to be a calendar too.

..just make sure it’s responsive on a mobile device.

And in all cases, pass that “ending date” to your SQL queries in a consistent, logical, sensible manner.  For reporting, where the smallest increment of a period is 1 day, that probably means automagically “adding 1 day” to their given end-date, because the end-user tends to think in those terms.  I.e. if I say “show me my bank activity from 1/1/2017 to 1/31/2017”, I really mean “through the end of the month“, i.e. the end of the day of 1/31.  So your query is going to end up wanting the end-date parameter to be 2/1/2017, because it’s using the correct & consistent “greater than or equal to start, and less than start-of-next” logic.
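
As a rough sketch (table and parameter names invented for the example):

-- The user enters "1/1/2017 to 1/31/2017"; we quietly add a day to the end-date
-- so that "less than" still captures everything through the end of 1/31.
DECLARE @StartDate date = '20170101', @EndDate date = '20170131';

SELECT *
FROM dbo.BankActivity               -- hypothetical table
WHERE ActivityDate >= @StartDate
  AND ActivityDate <  DATEADD(DAY, 1, @EndDate);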

[Image: context-consistency-clarity]
The 3 C’s

Final thoughts

I know it’s not easy to explain to business folks, and it’s not easy to implement correctly.  But it’s important.  The >= & < logic is clear, concise, and can be used consistently regardless of underlying datatype.  You just need to adjust your presentation layer (whether that’s SSRS parameters or a .NET date-picker) to bridge the gap between the user’s intent and the query’s logic, whether that’s “show/enter the last day of the month, but translate to the next day to feed to the query/proc.”, or “make them enter the next-day (day after the end of the month/period) and understand the ‘less than’ logic.”  I’m more inclined to the first, but it depends on your audience.

Thanks for reading, and happy date-ing!

DBA Holy Wars Part 2

Battle 4: GUIDs vs Identities

This is an oldie but goody.  A) Developers want their apps to manage the record identifiers, but DBAs want the database to do it.  B) Developers prefer abstracting the identity values out of sight/mind, while DBAs know that occasionally (despite your best efforts to avoid it) your eyeballs will have to look at those values and visually connect them with their foreign key relationships while troubleshooting some obscure bug.

[Image: but-wait-theres-more-billy-mays]
there’s ALWAYS more…

But there’s more to it than that.  See, none of those arguments really matter, because there are easy answers to those problems.  The real core issue lies with the lazy acceptance of GUI/designer defaults, instead of using a bit of brainpower to make a purposeful decision about your Primary Key and your Clustered Index.

Now wait a minute Mr. DBA, aren’t those the same thing?

NO!  That’s where this problem comes from!

A good Clustered Index is: narrow (fewer bytes), unique (or at least, highly selective), static (not subject to updates), and ever-increasing (or decreasing, if you really want).  NUSE, as some writers have acronym’d it.  A GUID fails criteria ‘N’ and ‘E’.  However, that’s not to say a GUID isn’t a fine Primary Key!  See, your PK really only needs to be ‘U’; and to a lesser extent, ‘S’.  See how the PK’s requirements and the CX’s requirements aren’t one and the same?  So sure, use those GUIDs, make them your PK.  Just don’t let your tool automagically also make that your CX (Clustered indeX).  Spend a few minutes making a conscious effort to pick a different column (or couple of columns) that meets more of these requirements.

For example, a datetime column that indicates the age of each record.  Chances are, you’re using this column in most of your queries on this table anyway, so clustering on it will speed those up.
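
In DDL terms, that separation might look something like this sketch (names are made up; the point is simply that the PK and the Clustered Index are declared independently):

-- GUID primary key, but NONCLUSTERED; the clustered index goes on a narrow,
-- ever-increasing datetime column instead (plus the PK to keep it unique).
CREATE TABLE dbo.Orders
(
    OrderID    uniqueidentifier NOT NULL DEFAULT NEWID()
  , CreatedAt  datetime2(3)     NOT NULL DEFAULT SYSUTCDATETIME()
  , CustomerID int              NOT NULL
  , CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (OrderID)
);

CREATE UNIQUE CLUSTERED INDEX CX_Orders_CreatedAt
    ON dbo.Orders (CreatedAt, OrderID);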

Most of the time, though, if your data model is reasonably normalized and you’re indexing your foreign keys (because you should!), your PKs & CX’s will be the same.  There’s nothing wrong with that.  Just be mindful of the trade-offs.

Battle 5: CSV vs TAB

[Image: bluray-vs-hddvd-fight]
Who doesn’t love a good format-war?

Often, we have to deal with data from outside sources that gets exchanged via “flat files”, i.e. text files that represent a single monolithic table of data.  Each line is a row, and within each line, each string between each delimiting character is a column value.  So the question is, which is easier to deal with as that delimiter: comma, or tab?

String data values often have commas in them, so usually, the file also needs a “quoting character”, i.e. something that surrounds the string values so that the reader/interpreter of the file knows that anything found inside those quotes is all one value, regardless of any commas found within it.

But tabs are bigger.. aren’t they?  No, they’re still just 1 byte (or 2, in Unicode).  So that’s a non-argument.  Compatibility?  Every program that can read and automatically parse a .csv can just as easily do so with a .tab, even if Windows Explorer’s file icon & default-program handler would lead you to believe otherwise.

I recently encountered an issue with BCP (a SQL command-line utility for bulk copying data into / out of SQL server), where the csv was just being a pain in the arse. I tried a tab and all was well! I’m sure it was partially my fault but regardless, it was the path of least resistance.

Battle 6: designers vs scripting

[Image: no-wizard-allowed]
Wizards are usually good, but in this case, they’re lazy and bad for you…

This should be a no-brainer. There is absolutely no excuse for using the table designer or any other wizardy GUIs for database design and maintenance, unless you’re just learning the ropes. And even then, instead of pressing ‘OK’, use the ‘Script’ option to let SSMS generate a T-SQL script to perform whatever actions you just clicked-thru.  Now yes, admittedly those generated scripts are rarely a shining example of clean code, but they get the job done, even with some unnecessary filler and fluff.  Learn the critical bits and try to write the script yourself next time; and sure, use the GUI-to-script to double check your work, if you still need to.

Confession: I still use the GUI to create new SQL Agent Jobs. It’s not that I don’t know how to script it, it’s just that there are so many non-intuitive parameters to those msdb system-sp’s that I usually have to look them up, thereby spending the time I would have otherwise saved.
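
For the record, the scripted version looks roughly like this.  (This is from memory, so treat it as a sketch and double-check the parameter names against the docs; job names and commands are placeholders.)

-- Bare-bones Agent job via the msdb procs.
EXEC msdb.dbo.sp_add_job
    @job_name = N'Nightly Cleanup';

EXEC msdb.dbo.sp_add_jobstep
    @job_name      = N'Nightly Cleanup'
  , @step_name     = N'Run cleanup proc'
  , @subsystem     = N'TSQL'
  , @database_name = N'MyDb'
  , @command       = N'EXEC dbo.DoCleanup;';

-- Easy to forget: without this, the job never gets tied to the server instance.
EXEC msdb.dbo.sp_add_jobserver
    @job_name    = N'Nightly Cleanup'
  , @server_name = N'(LOCAL)';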

Bonus round: the pronunciation of “Data”

[Image: its-data-not-data]
Call me “big Data” one more time…

Dah-tuh, or Day-tuh?  Or, for the 3 people in the world who can actually read those ridiculous pronunciation glyphs, /ˈdætə/ or /ˈdeɪtə/?  It’s a question as old as the industry itself… or maybe not.  Anecdotally, it seems like most data professionals, and people in related industries, tend to say “day-tuh”; while those in the media and generally less technical communities tend to say “dah-tuh”.  (Where the first syllable is the same vowel-sound as in “dad” or “cat”.)  This likely means that the latter is more popular, but the former is more industrially accepted.

In either case, it doesn’t really matter, because at the end of the day, we’re talking about the same thing.  So if some dogmatic DBA or pedantic PHB tries to correct your pronunciation, tell ’em to stop being so persnickety and get on with the task at hand!

Until next time…

Little Gotchas

If the caller of our stored-procedure literally passes NULL as the parameter value, we might have a problem!

A large part of most DBA/DBD’s daily job is writing & maintaining stored-procedures.  In SQL Server or other RDBMSs, stored-procs (“SP’s”, “procs”, however you like to abbreviate) serve as one of the building-blocks of your overlaying applications and day-to-day operations, including maintenance and automation.

[Image: sprocket]
This is a sprocket, not to be confused with a sproc, which is really just a proc.

Today, something struck me, and I was both shocked and comforted by the fact that this hadn’t really “come back to bite me in the arse“, as the proverbial saying goes.  But first, some context.

When we declare our proc signature with our parameters, we of course give them datatypes, and often default values — the parameter value that is assumed & used upon execution when the caller (operator, application, agent job, etc.) calls said proc without passing a value to that parameter.  So we create our proc like so:

CREATE PROCEDURE dbo.MyProc
    @MyParam BIT = 0
AS
BEGIN
    SET NOCOUNT ON;
END

So that users are allowed to call it like so, and assume some correct default behavior:

EXEC dbo.MyProc;

Coincidentally, that CREATE line is part of a typical “boilerplate” snippet or template which I use to create procs with “create if not exists, else alter” logic and a nice header-comment-block, which I’ll publish on my GitHub or Gist shortly, so I can show it here.  I know that MS recently added DROP IF EXISTS support to the language, but frankly, I like to keep procs intact if they exist because it’s easier not to have to remember/re-apply their metadata, such as security (grants/deny’s, certificate signatures, etc.) and extended properties.  Wake me up when they add true CREATE OR ALTER syntax!  Oh snap, they did… in 2016 SP1.  Anyway.

Now for the “catch”, the gotcha.

[Image: gotcha-programming-wikipedia-def]
In programming/software-dev/IT-systems, “gotcha” has a specific meaning.  Thanks Wikipedia!

If the caller says exec dbo.MyProc, that’s great — they didn’t pass a parameter value, so the execution uses the default value (0) and off we go.  However, if the caller is so malicious as to literally pass NULL, we might have a problem!  Because let’s say that @MyParam value is used in a JOIN predicate or an IN (SELECT...) block, or even a CASE expression.  We won’t get an actual error; SQL Server is smart enough to jump over the syntactical variations required for equivalence checking (i.e. Column1 = 0 vs. Column1 is NULL) when it interprets/compiles the stored-procedure.  But, what we’re probably going to get is unexpected or unknown behavior.

[Image: warning-assumptions-ahead]

It seemed worth re-using a classic…

And really, it all comes back to those nasty things called assumptions.  See, as the proc author, we’re assuming that our @MyParam will always be a 0 or 1, because it’s a BIT, and we gave it a default value, right?  Sure, maybe in another language, but this is T-SQL!  NULL is a separate and distinct thing, a valid value for any datatype, and must be accounted for and treated as such.  It can get especially dicey when you have a NOT IN (SELECT...) block that ends up as an empty-set, which suddenly morphs the outer query into a “without a WHERE clause” beast, and.. well, you can guess the rest.

So what do we do about it?  Well, we can add a “check parameter values” block to the top of our procedure where we either throw an error, or set the NULL value back to a default.

Examples:

IF (@MyParam IS NULL) RAISERROR ('@MyParam cannot be NULL; try again.', 15, 1);
IF (@MyParam IS NULL) SET @MyParam = 0;

We could also work on the internal proc logic to account for NULL values and “eliminate the guesswork” (i.e. prevent unexpected behavior) by actually having logical branches/conditions which “do something” if the parameter is NULL.  Then, at least we know what our proc will do if that infamous caller does exec MyProc @MyParam = NULL.  Yay!  But that sounds like a lot of work.  Maybe.

Or maybe it’s worthwhile because you actually want NULL to be treated differently than all other parameter values, and then, hey, you’ve already spent the time on that logic, so you’re done!
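
For illustration, a minimal sketch of what “accounting for NULL” in the query logic might look like (table and column names here are hypothetical):

-- NULL gets its own deliberate branch instead of silently matching nothing.
SELECT t.*
FROM dbo.MyTable t
WHERE (@MyParam IS NULL AND t.SomeFlag IS NULL)
   OR (t.SomeFlag = @MyParam);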

[Image: what-if-i-told-you-null-does-not-equal-null]
But NULL does not NOT equal NULL, either!  Crap, somebody give me the red pill…

I hope this helps somebody else avoid the same assumptions.

Nested Set++ Wrap-Up

So we’ve built our Cat Tree. But how do we know it’s all correct?

One more time, with feeling!  Not that this dead horse needs another beating, but I did promise…

So we’ve built our Cat Tree.  We’ve written our CrUD ops, our “move” op, and even some readers.  But how do we know it’s all correct?  We can select from our Cats view, of course.  But we want to be really sure.  Plus there’s that pesky SwapCatNode method.

Easy one first.  SwapCatNode can mean swapping sibling order, or switching a parent with a child or grandchild, or toggling nodes that are in completely different places in the tree & not related at all!  This is the least logical operation, if you think about a proper hierarchy, but it turns out to be necessary sometimes.  We’re just swapping the nodes’ position values & ParentIDs with each other, and updating ParentIDs on their children to each others’ IDs.

I really don’t even need to draw this one… but because I needed a header image, I did.  Anyway, just get the rows with the given target IDs, swap the PLeft, PRight, Depth, and ParentID values, and call it a day.
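
For the curious, here’s a simplified sketch of the sibling-swap case.  (The real routine is in the GitHub repo; the parameter names below are invented, and a parent-child swap needs a bit more care than this.)

-- Swap position values & ParentIDs between two nodes via a self-join...
DECLARE @CatID1 int = 5, @CatID2 int = 9;   -- hypothetical targets

UPDATE c
SET c.PLeft    = o.PLeft
  , c.PRight   = o.PRight
  , c.Depth    = o.Depth
  , c.ParentID = o.ParentID
FROM nsm.Cat c
JOIN nsm.Cat o
    ON (c.CatID = @CatID1 AND o.CatID = @CatID2)
    OR (c.CatID = @CatID2 AND o.CatID = @CatID1);

-- ...then re-point each node's children at the other node's ID.
UPDATE child
SET child.ParentID = CASE child.ParentID WHEN @CatID1 THEN @CatID2 ELSE @CatID1 END
FROM nsm.Cat child
WHERE child.ParentID IN (@CatID1, @CatID2);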

Now the complex.  To validate that our tree is properly structured, the following statements need to be true:

  1. Each node’s Right value is greater than its Left.
  2. More to the point, each node’s Right value is greater than all of its ancestors’ Left values.
  3. Similarly, each node’s Left value is less than all of its descendants’ Left values (and Right values, obviously!)
  4. Leaf nodes have no gaps between Left & Right: Right = Left + 1
  5. Depth is easy to verify because we already wrote the rCTE to calculate it!
  6. And of course, no orphans – all ParentIDs lead to an actual parent node, except of course if they’re NULL (root nodes).

We can either go thru lots of logical checks in different queries, or we can try building a mock tree out of the base adjacency-list structure (ParentIDs) and compare values.  The latter will only help us with #1-5; the orphans problem is a different animal, but it’s also not part of the model per-se, so it’s actually good to separate that check from the rest.  (And it’s really simple – use a not exists query on ParentIDs and presto, orphans checked!)
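
That orphan check really is as simple as promised; something along these lines:

-- Any non-root node whose ParentID doesn't point at a real row is an orphan.
SELECT c.CatID, c.ParentID
FROM nsm.Cat c
WHERE c.ParentID IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM nsm.Cat p WHERE p.CatID = c.ParentID);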

Building a mock-tree, or a “position re-builder”, will come in handy for another reason:  Let’s say we need to completely revamp a subtree, i.e. insert & update a bunch of nodes at once because somebody royally screwed up that branch.  And we’ve got our shiny fixed data, hundreds of rows, ready to go, if only those damn triggers weren’t there, preventing us from doing bulk operations!  What we’d really like to do is, knowing the starting ParentID, just insert all our new nodes with PLeft values in sequence to each other (and not care about the rest of the tree); and/or, update a few sets or families of nodes to massively re-order them, without having to call the Swap routine one-at-a-time ad nauseam.  We also don’t want to care about figuring out correct PRight & Depth values.  After that’s all done, our new subtree will have “bad” position values, so we need to rely on some other routine to fix them for us, so that the tree can again be well-formed and things can go back to normal.

[Image: nsm-cat-bulk-reorder-insert-rebuild]
Simon & Tigger got re-ordered. Then we bulk-added 3 new Cats under Mittens and didn’t know what their position values would be, so we let the rebuilder take care of it.

In our RebuildCatTree routine, we actually need to re-number all nodes to the right & above our “bulk-inserted” subtree, just in case we’ve caused things to move.  And since we’ve re-ordered some siblings elsewhere, it turns out to be easiest, in practice, to re-number the whole tree.  This is where our fair-weather friend recursion comes in — and not just another rCTE, but real stored-procedure recursion.  This can get dicey; SQL Server only allows 32 levels of nested procedure calls, and it can really eat up those CPU cycles & RAM buffers.  So this should be done rarely, and preferably during a time where the tree is not under heavy usage.

The code samples are now available on my GitHub page.  Comments abound!

I hope you’ve enjoyed this little mini-series.  And now, I promise to move on to new topics & rantings of various nature!  Thanks for reading.

~Fin~

Update:

I’d like to point future readers at two very informative articles for those interested in deep-diving down the hierarchical rabbit-hole: Aaron Bertrand, and Jeff Moden.  There are many more tweaks and enhancements that can be made to the “classical” Nested Set model, which those lucky Devs/DBAs who are in a position to actually [re]implement their hierarchies will want to read about and take advantage of.

The Nested Set Model++

This time we talk about adding a Depth field, and good ol’ CrUD ops – Create, Update, Delete.

Since my first post on this topic got a lot of attention and traction, I felt it appropriate to expand on the topic a bit, even if it’s been largely covered by other bloggers in the past.  I’ve also found it very useful to have a “depth” field, which isn’t canonically part of the model (hence the “++” in the title!), but is quite handy not only for display purposes (while you’re querying & testing the thing), but also for making certain “get” ops easier.  Sure, it adds a wee bit more to structural maintenance, but since that’s already the most complicated part of the model anyway, it’s hardly worth a second thought.  So let’s dive in!

The big topic last time was this operation of “move a subtree” — of course, sometimes you’re just moving one node, but only if it’s a leaf; otherwise you’re moving a node and all its descendants, so I’ve kept the procedure name MoveCatSubtree intact.  This time we’ll talk about good ol’ CrUD ops – Create, Update, Delete.  In my implementation, I chose to handle these with table triggers.  Some would argue in favor of stored-procs, and while that would seem “more consistent” with the precedent set, I’d counter with 2 points:

  1. To be really fool-proof, you’ll need to prevent ungoverned inserts/updates/deletes anyway; you could either do this with GRANT/DENY permissions, or triggers.  Permissions would be more complex because you’d still need your users to be able to exec the CrUD procs, so you’d end up using some convoluted security mechanisms that can be tricky to maintain over time.
  2. With triggers, we can allow the consumers of the data (apps, users) to continue to use “plain-ol’-TSQL” to access and manipulate the data, instead of having to remember stored-proc names and hunt for documentation on them.  (The exception being, of course, MoveCatSubtree, which, honestly, could be integrated into the insert trigger, but I’ll leave that as an exercise to the reader!)

Again, yes, we could easily do the same implementations in stored-proc form, and you’re welcome to fork my GitHub repo if you feel like exploring that.

Let’s outline the steps and draw some pictures.

1. Insert: Make a hole!

When we INSERT a node, we want to specify its parent and a name, and let the triggers do the rest!  We place it at the right of its siblings-to-be, and update the position values of all nodes to the right so that everything stays kosher.  This should sound familiar — it’s essentially that “make a gap” part of the subtree-move op.  In terms of depth, we just +1 to the parent’s.
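
As a rough sketch of that gap-making logic (the real triggers live in the GitHub repo; this one assumes a single-row insert and skips error handling entirely):

CREATE TRIGGER nsm.trCat_Insert ON nsm.Cat
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @NewCatID int, @ParentID int, @ParentRight int, @ParentDepth int;
    SELECT @NewCatID = CatID, @ParentID = ParentID FROM inserted;  -- single-row assumption

    SELECT @ParentRight = PRight, @ParentDepth = Depth
    FROM nsm.Cat
    WHERE CatID = @ParentID;

    -- "Make a gap": shift everything at/after the parent's right edge over by 2.
    UPDATE nsm.Cat
    SET PLeft  = CASE WHEN PLeft  >  @ParentRight THEN PLeft  + 2 ELSE PLeft  END
      , PRight = CASE WHEN PRight >= @ParentRight THEN PRight + 2 ELSE PRight END
    WHERE PRight >= @ParentRight
      AND CatID <> @NewCatID;

    -- Drop the new node into the gap, as the rightmost child of its parent.
    UPDATE nsm.Cat
    SET PLeft  = @ParentRight
      , PRight = @ParentRight + 1
      , Depth  = @ParentDepth + 1
    WHERE CatID = @NewCatID;
END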

[Image: nsm-cats-insert-gadget-under-stripes]
Stripes is a breeder; Gadget comes in and makes Fluffy & children move over.

Also, for some reason, our cats reproduce asexually…

 

2. Delete: Think of the children!

Similarly, to DELETE a node, we want to “close the gap” left by said deleted node.  But what of the children?  We don’t want to leave any orphans behind!  So we “promote” the children of our deleted node to the level (depth) of their parent, sandwiching them in between the deleted node’s siblings (aka their former aunts/uncles!).  This is easier than it sounds.
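
Sketched out procedurally (in the repo it’s trigger-based, but the math is the same; @CatID below is a hypothetical target node):

DECLARE @CatID int = 4, @Left int, @Right int, @ParentID int;

SELECT @Left = PLeft, @Right = PRight, @ParentID = ParentID
FROM nsm.Cat
WHERE CatID = @CatID;

-- Promote the subtree: everything inside moves up one level and left one slot,
-- and the immediate children adopt their grandparent.
UPDATE nsm.Cat
SET PLeft    = PLeft  - 1
  , PRight   = PRight - 1
  , Depth    = Depth  - 1
  , ParentID = CASE WHEN ParentID = @CatID THEN @ParentID ELSE ParentID END
WHERE PLeft > @Left AND PRight < @Right;

-- Close the 2-unit gap left by the deleted node's own Left/Right values.
UPDATE nsm.Cat
SET PLeft  = CASE WHEN PLeft  > @Right THEN PLeft  - 2 ELSE PLeft  END
  , PRight = CASE WHEN PRight > @Right THEN PRight - 2 ELSE PRight END
WHERE PRight > @Right;

DELETE nsm.Cat
WHERE CatID = @CatID;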

[Image: nsm-cast-delete-fluffy]
He killed Fluffy!

Fluffy is survived by his children, who are now for some reason his siblings, and are very confused by their sudden increase in age & status.

3. Update: Rename; everything else is encapsulated.

Finally, we only allow UPDATEs on the Name, because everything else (position values, depth, parent) is structural, and encapsulated by our tree maintenance logic.  Moving a node or subtree?  MoveCatSubtree.  Swapping positions with another node?  SwapCatNode (TBD!).

4. Depth: Set it once, & encapsulate it!

Depth is pretty simple to add if you’ve already got a tree full of data.  We can use a recursive common table expression, or “rCTE”.  While normally these are frown-worthy (remember, recursion is not SQL’s strong suit), we’re only using it one time to populate an existing data-set, so we can keep on smiling.

;WITH CatTree AS
(
    SELECT CatID, ParentID, Name, PLeft, PRight, Depth = 0
    FROM nsm.Cat
    WHERE ParentID IS NULL
  UNION ALL
    SELECT cat.CatID, cat.ParentID, cat.Name
        , cat.PLeft, cat.PRight, Depth = tree.Depth + 1
    FROM CatTree tree
    JOIN nsm.Cat cat
        ON cat.ParentID = tree.CatID
)
UPDATE cat SET cat.Depth = CatTree.Depth
FROM CatTree
JOIN nsm.Cat cat
    ON cat.CatID = CatTree.CatID

The last order of business (for now) is to add Depth support to our MoveCatSubtree method.  As illustrated below, we have to move the subtree “up” or “down” in Depth depending on its new parent’s position relative to its old position.  The details are, of course, in the GitHub repo, but here’s a quick snippet of what that looks like: NodeNewDepth = /*NodeCurrent*/Depth + (@NewParentDepth - @SubtreeOldDepth) + 1  (where @SubtreeOldDepth is the depth of the top node of the moving subtree.)
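
In update form, that depth shift boils down to something like the sketch below.  (The subtree-bound variable names are invented here; in the real proc they’d be whatever values were captured around the move.)

-- Shift the depth of every node in the moved subtree by the same delta.
UPDATE nsm.Cat
SET Depth = Depth + (@NewParentDepth - @SubtreeOldDepth) + 1
WHERE PLeft >= @SubtreeNewLeft
  AND PRight <= @SubtreeNewRight;   -- the subtree's bounds after the move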

[Image: nsm-cast-move-jack-to-mittens]
Move Jack to under Mittens; I won’t repeat the Left/Right logic, just note the Depth logic.

 

In a future little addendum, I’ll briefly go over the “get” queries and that TBD SwapCatNode method.  For now, enjoy the cats (again)!  Thanks for sticking around, I know it’s been a few more weeks than normal.

PS: A big thank-you to the dudes in the CodingBlocks #blogging Slack channel for their encouragement and motivation to get this done!  You guys rock.  Check out their blogs for some terrific content: http://dotnetcore.gaprogman.com/ , http://www.codeshare.co.uk/ , http://thereactionary.net/ .

Update:

I’d like to point future readers at two very informative articles for those interested in deep-diving down the hierarchical rabbit-hole: Aaron Bertrand, and Jeff Moden.  There are many more tweaks and enhancements that can be made to the “classical” Nested Set model, which those lucky Devs/DBAs who are in a position to actually [re]implement their hierarchies will want to read about and take advantage of.

The Nested Set Model

The #1 rule of the Nested Set Model is: FAST READs. The #2 rule of the Nested Set Model is: see #1

There are probably definitely several articles out there which cover the SQL implementation of the Nested Set Model, aka “modified preorder tree traversal” (which is more the name of the algorithm by which you traverse the tree, rather than the structure itself).  But I found it interesting enough, and more importantly, applicable enough to my job experience, that I feel it deserves some treatment.  Not the basic “how to”, but more an example of a particular operation and a specific pitfall to avoid. (Jump straight to the example diagrams.)

Now, we’re not going to debate about whether this model is “the best” representation of hierarchical data in an RDBMS (some argue that Closure Tables, aka “Ancestor Tables“, or some kind of hybrid approach is better, and I’d probably agree).  The fact is, sometimes (read: almost always) as a DBA/DBDev, you’re “stuck with” an existing database in a legacy application environment that you pretty much can’t change — or if you can, changes need to be small, incremental, and non-disruptive.

Okay, with that disclaimer out of the way, let’s dive in.  First things first:

The #1 rule of implementing the Nested Set Model is: FAST READs.

I can’t stress that enough.  Fast SELECTs.  Everything else pales in comparison.  In other words, we don’t care how long and painful and slow write operations are against this table (updates, inserts, deletes), as long as our SELECTs remain super speedy.  If that is not your use-case, consider a different model.

The #2 rule of the Nested Set Model is: see #1

Moving on…

The #3 rule is: encapsulate tree operations to maintain its integrity & structure.

Put another way, the #3 rule is that you should always operate on the tree (CrUD ops) using stored-procedures and/or triggers that encapsulate all the nitty-gritty details of maintaining the correct position values during said insert/update/delete operations.  Of course, somebody is responsible for writing those stored-procs.  Any volunteers?  Easy now, don’t raise your hands all at once!  Generally, this responsibility falls to the DBA(s) or DBDev(s).

The problem at-hand, in my current situation, was that of “moving a sub-tree”, i.e. taking a node and all its descendants, and moving it to place it under another “parent” node.  In some models, and/or in some languages, this is a simple recursive operation.  However, SQL is not spectacular at recursion — after all, we’re working in a relational engine — so let’s try to play to its strengths:

namely, SET-BASED operations!

A previous DBDev had written a stored-proc for just such an operation.  However, as (somewhat) expected, it was horribly slow, to the tune of hours of run-time.  This is not acceptable, even given the #1 rule stated above.

Well it turns out that most of it was pretty efficient, but the last step, in which they attempted to “fix” the left/right values in the entire table “just to make sure we didn’t leave any gaps“, was, frankly, quite silly.  Because the only “gaps” you create are created by the previous steps in the proc, and you know exactly how big that gap is (the width of the subtree you’re moving), and where it is, so you should be able to target that specific area of the tree and close the gap more intelligently, using some simple math. (addition and subtraction — the simplest math there is!)

Doing that improved the performance of the whole proc by a factor of 10.  That’s huge.  Or, “yuuuuge“.

So let’s get specific.  As you’ll see from my diagrams, the model actually is a hybrid, combining an Adjacency List (each record knows its “parent”) with a Nested Set (each record has a “left” & “right” position value).  We do this for two big reasons.  First, having the parent relationship along with the position values makes all that nasty book-keeping (rule #3) a bit easier to manage (and to check our work).  And second, because, conveniently, we can store the data from both models in one table.
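
For reference, the hybrid table itself looks something like this sketch.  (The real setup script is on GitHub; column names mirror the ones used in the snippets below, and the Depth column only shows up later, in the “++” post.)

-- Adjacency list (ParentID) plus nested-set positions (PLeft/PRight) in one table.
CREATE TABLE nsm.Cat
(
    CatID    int          IDENTITY(1,1) NOT NULL CONSTRAINT PK_Cat PRIMARY KEY
  , ParentID int          NULL          CONSTRAINT FK_Cat_Parent REFERENCES nsm.Cat (CatID)
  , Name     varchar(100) NOT NULL
  , PLeft    int          NOT NULL      -- nested-set "left" position
  , PRight   int          NOT NULL      -- nested-set "right" position
);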

On to the examples!

First, we have our tree of Cats.

[Image: cat-tree-1]
Or, as a coincidentally cute table alias, CatTree

Now, we want to move Jack & his children to become descendants of Mittens (Jack being the child, Smush & Smash being grandchildren).  So we start by “making a gap” of the subtree’s “width” (6, the distance between Jack’s PLeft and PRight inclusive of end-points).  We add that amount to all PRight values >= Mittens’ original PRight,  and add it to all PLeft values > Mittens’ PRight — see the blue #s in diagram below, and code here:

UPDATE Cats
SET PLeft = (CASE WHEN PLeft > @NewParentRight
             THEN PLeft + @SubtreeSize
             ELSE PLeft END)
  , PRight = (CASE WHEN PRight >= @NewParentRight
             THEN PRight + @SubtreeSize
             ELSE PRight END)
WHERE PRight >= @NewParentRight

The red values haven’t changed (yet) but are now wrong, so we’ll have to fix them next.  And of course the green values are the moved subtree’s new positions based on the new parent’s (Mittens) PLeft.

[Image: cat-tree-2]
Jack is now Mittens’ child.

Finally, now that we’ve moved Jack & his children under Mittens, we need to “close the gaps” that we created at first, to make sure that the tree’s position values remain contiguous.  This isn’t as difficult as it sounds: if we’ve stored Jack’s original PRight value (10), we can use that as a cutoff to subtract the subtree width from higher position values and intelligently (and quickly) close the gaps we created before.  Again, code & diagram:

--Notice this looks very similar to the previous
--code snippet! (We're basically doing the reverse)
UPDATE Cats
SET PLeft = (CASE WHEN PLeft > @SubtreeOldRight
             THEN PLeft - @SubtreeSize
             ELSE PLeft END)
  , PRight = (CASE WHEN PRight >= @SubtreeOldRight
             THEN PRight - @SubtreeSize
             ELSE PRight END)
WHERE PRight >= @SubtreeOldRight
[Image: cat-tree-3]
Red values indicate “closing the gap” that was created by removing the subtree of Jack. Blue values indicate the incidental gap closures for the rest of the tree (above and right). Green values, you’ll notice, are “reverted” (i.e. same as they were originally).

SQL-wise, this should translate pretty well.  I’ve posted the setup and stored-proc scripts to GitHub, so the distinguishing reader can review and offer feedback.  In theory, there’s probably a way to exclude the green reverted values from the first pass operation (gap-making) so that we don’t have to revert them (at gap-closing), but again, since we’re doing SQL set-based operations, it seems hardly worth the effort — i.e. the potential speed gain would be outweighed by the logical/maintenance complexity.

 

So what’s the lesson here?  Well hopefully, if you’re “stuck with” a SQL DB with a Nested Set Model table containing a hierarchical tree of data, you don’t have to completely re-invent the wheel and write your CrUD ops from scratch.  But if your predecessors didn’t plan for certain kinds of operations, and this “move a subtree to a new parent” happens to be one of those, this should help you (re)implement it efficiently.

I’d love to get some feedback on this.  Let me know if I’ve missed anything conceptually, if there are better ways or methods to doing any of this, or any other tips & tricks that folks might have for dealing with such data.  Leave me a comment!

[footnote 1]
The root of the problem, in this case, was simply taking the code from a slideshare presentation and copy-pasting it into the routine without analyzing its effectiveness and efficiency.  It proposed re-calculating the position values after a move, across the entire tree, by using a triple-cartesian-product (or cross-join) to “get the count of nodes to the left/right of each node” for every node, which should sound dirty even as you say it silently in your head, let alone attempt to write it in query form!

[footnote 2]
There’s a 3rd model that we could consider storing in the same table, called “Enumerated Path” or “Materialized Path” or “Breadcrumbs”, which may look good on paper and to your human eyeballs, but breaks down spectacularly when you start talking performance and scale — but to be fair, so do most of these models, eventually, in one way or another, which is why we’ve invented fantastic alternative technologies to address these problems… and frankly, if you’re using all 3 models at once, you’re #DoingItWrong, creating a veritable maintenance nightmare for yourself and everyone around you.  Note that the elusive 4th model, the Ancestor Table, requires (as the name would imply) another table — not an argument for or against anything, just an observation.

PS: Happy 2017!

Dates, Times, and Datetimes, Oh My!

There’s a tool for every job. Just stop abusing the tool!

This MSDN page,  CAST and CONVERT (T-SQL), specifically the section on DATETIME conversion formats, is easily one of my most frequently visited links.

It really shouldn’t be.

SQL Server is very good at storing and manipulating Date/Time values.  There are dedicated data-types for all flavors — DATETIME, DATE, TIME, the newer DATETIME2, and the less common SMALLDATETIME and DATETIMEOFFSET.  Then there are the functions & operators that let you do all sorts of fun stuff with them — DATEDIFF, DATEADD, DATEPART, GETDATE, ISDATE, and even some newer ones like EOMONTH and DATEFROMPARTS.  These are really powerful tools in the hands of a DB-Developer or DBA.

[Image: i just want to use dates]
Is that so much to ask?  Courtesy of this guy’s blog, which sounds like a great place to learn iOS programming if one was interested in such things…

But you know what SQL Server is not so great at?  Reading your mind.  Oh, wait, that goes for most applications & systems.  Let me rephrase.  SQL is not the best platform for knowing how end-users will want their Dates/Times displayed in a contextually/culturally sensitive manner, and executing said preferences.

That’s what we have UX/UI layers for!

While it’s true that the underlying data store (SQL, in this case) needs to be aware of localization & globalization requirements, it shouldn’t be asked to serve-up, say, a Sales-Order-Date in 5 different flavors just because Report X wants it in typical USA fashion (mm/dd/yyyy), User B wants it in “long-form” because they’re reading it like prose (“Jan 13 2016 08:32pm”) in an email, and SSIS Package FooBar needs it in “ISO” format (yyyymmdd) because it’s using the date in a filename!  Actually, of those 3 examples, the latter is the most “legit” — or at least, the most justifiable use-case.  The other two should have been handled by the overlaying application or middleware — SSRS in the first case, or whatever automation app produced User B’s email in the second.
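
To be concrete, those three flavors map to CONVERT styles roughly like so (style numbers quoted from memory; the full table lives on that oft-visited CAST/CONVERT page):

SELECT CONVERT(varchar(10), GETDATE(), 101) AS UsaStyle     -- mm/dd/yyyy
     , CONVERT(varchar(20), GETDATE(), 100) AS LongStyle    -- mon dd yyyy hh:miAM
     , CONVERT(varchar(8),  GETDATE(), 112) AS IsoStyle;    -- yyyymmdd, the filename-friendly one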

[Image: i can has string to store dates]
Because once wasn’t enough…

But surely there’s a good reason that the T-SQL gods included the CAST/CONVERT functionality with all those special date-format arguments, right?  Obviously.  There are always valid use-cases; or, more proverbially, “There’s a tool for every job.”  Just stop abusing the tool!

[Image: broken-hammer-pulling-nail]
Abused tools can fail you…

A typical DBA or DB-Dev is often asked to write ad-hoc queries or build one-off reports to meet some business request, and he/she often doesn’t have the time or the resources to offload the nitty-gritty formatting details to the appropriate layer of abstraction.  So yes, that’s why these conversion options exist (among other reasons).  And there’s nothing wrong with that, in and of itself.  But like anything, if it becomes a bad habit and a hindrance to overall productivity, it’s time to take a step back and re-examine the situation.  Ask the hard questions, like “Why am I being asked to create these one-off reports all the time, which sound so similar to each other, yet inevitably are always a bit different?”, or “Have my business users developed unrealistic expectations about what can/should be done by me vs. by other teams/contributors in the organization?”

This isn’t about passing the buck — I’ve already established that’s not my style.  It’s about working smarter, bringing more value to the organization by leveraging better technologies and techniques than obsolete habits and old-guard mentality would otherwise allow.

[Image: calvin-and-susie-arguing]
“Why are you making me write these horrible queries!?” .. “Because you’re the DBA!” .. “Fine, but give me the resources to automate this for the future.”

So, dear reader, take the time to learn about SQL’s Date/Time types & functions, including the myriad formatting options of CONVERT.  But do yourself a favor and consider, when you find yourself using & abusing them, whether the task at-hand is truly best suited for the database layer, or if it really belongs somewhere else.

Thanks for reading!