Know your Role

DBA does not mean “database archaeologist”, even though sometimes that’s what you end up doing…

Aka “Dammit Jim, I’m a DBA, not a data researcher!”  Or, as I stated on Twitter:

DBA != Database Archaeologist

penguin archaeologist
because… SQL ❤ Linux!

Today I spent more hours than I care to admit, tracking down some obscure data from two disparate systems in an attempt to reconcile what were supposed to be matching records.  Part of that is my own fault — I’m a sucker for interesting problems and edge cases, as I’ve blogged about before…

mostly just for the smug satisfaction of proving to the business that “your assumptions about how your data works are invalid”.

But mostly it’s because, the further back in time you go, the less reliable the data becomes.  Especially (exponentially) when that data originates from human free-form text input.

mr garrison manual inputs are bad
but GUI’s are OK.. and CLI’s are great!

Let’s contrive an example.  We have our core business product system, WidgetMaster, which tracks Widgets we ship out by WidgetNumber.  A partner business of ours runs an online widget exchange where people can buy and sell their Widgets in a sort of second-hand/after-market fashion.  PartnerExchange listings are supposed to include the WidgetNumber for ease of tracking and associating data between the two systems, even though they’re officially run by different companies (or in my case, different departments of the same company — yeah, think about that for a second).

Now, ideally, theoretically, those WidgetNumbers should always match up. But unfortunately, up until late 2014, our WidgetMaster system didn’t have an API that PartnerExchange could call to obtain a widget by its number; and even if it had, they’d still have to rely, at some level, on the customer (or a worker) to read and enter that WidgetNumber into the exchange listing.  But wait, we started doing bar-codes back in 2010, so “most” of them are actually scanned from the bar-code, but not every customer has that capability, so there’s still a lot of hand-entered data.

So we have some dirty data.  Let’s complicate things a bit. Over time, those widgets can come back to WidgetMaster for update/upgrade and then ship back out.  Again, WidgetNumber should remain consistent throughout that process.  Now, when PartnerExchange sells certain widgets, sometimes they’re part of a SuperSpecialCollection.  This collection spans many years, maybe even a decade or more. WidgetMaster got wind of this SuperSpecialCollection being bought up by Mr. HighRollerCustomer, so we started marking the incoming/outgoing records with a new property.

But it’s text.

It’s entered by the receiver, based on looking at the Widget’s buy/sell history in PartnerExchange.  And yes, the HighRollerCustomer who last bought the widget is aware that it’s part of their SuperSpecialCollection, but they aren’t guaranteed to specify that when they send the widget back in to WidgetMaster for upgrade.

Do we see the problem yet?

oh it gets better (again)
Yes, yes it does!

See, about 5 years ago, there was a reorg, and the dev team for WidgetMaster completely revamped the way in which “collection membership” for incoming widgets is designated/tracked. So now it’s over in some property table. To make matters worse, PartnerExchange renamed SuperSpecialCollection to AwesomeCltn a few years ago because they were tired of typing so many letters (and apparently fans of cryptic abbreviations).

Fortunately, PartnerExchange has done a decent job of at least storing the correct WidgetType and WidgetQuality in their listings, despite WidgetNumbers being fairly sparse.  But because, over in WidgetMaster, we’re supposed to associate each WidgetNumber with the AwesomeCollection, we now have a secondary task: mapping unmatched WidgetNumbers across systems, using Type and Quality from one side (partner) combined with Collection-membership from the other side (master), and assuming that the partner’s designation of SuperSpecial/AwesomeCollection is correct.

If your head’s not spinning yet, give yourself a round of applause. While rubbing your tummy and tapping your foot.

Needless to say, this is hard.  We’ll probably get the majority of records matched (mapped?) eventually by using a couple string LIKE predicates and some clever try/pass/retry flow, but it’s tedious at best.  More frustration comes from the handful of ad-hoc searches through each system, trying to apply reason and logic (i.e. find a pattern); and since we’ve already done that work, we might as well fold that info into our results, even if it doesn’t reveal a useful pattern by itself.
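
To make that concrete: here’s a rough sketch of a first pass plus one retry pass.  Every object and column name below is invented for illustration, and your actual matching criteria will differ.

    create table #WidgetMap (MasterWidgetID int, ExchangeListingID int, MatchMethod varchar(30));

    -- first pass: exact WidgetNumber match ("majority rules")
    insert into #WidgetMap (MasterWidgetID, ExchangeListingID, MatchMethod)
    select m.WidgetID, p.ListingID, 'ExactNumber'
    from WidgetMaster.dbo.Widget as m
    join PartnerExchange.dbo.Listing as p
        on p.WidgetNumber = m.WidgetNumber;

    -- retry pass: no usable WidgetNumber, so fall back on Type + Quality,
    -- plus the partner's collection designation (old name or new)
    insert into #WidgetMap (MasterWidgetID, ExchangeListingID, MatchMethod)
    select m.WidgetID, p.ListingID, 'TypeQualityCollection'
    from WidgetMaster.dbo.Widget as m
    join PartnerExchange.dbo.Listing as p
        on p.WidgetType = m.WidgetType
       and p.WidgetQuality = m.WidgetQuality
    where (p.CollectionName like 'SuperSpecial%' or p.CollectionName like 'Awesome%')
      and not exists (select 1 from #WidgetMap as x where x.ExchangeListingID = p.ListingID);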

So how do we approach this?  Well, as I said, I spent what I felt was too much time on it, but essentially I did an initial “majority rules” mapping attempt (first pass), followed by a few reconciliation attempts for the remainders.  Those consisted of fairly identifiable patterns with a couple outliers. With those outliers, as with the rest of the unmapped records at the end of the day, I had to tell the business, basically, “Here’s the majority of the results. You can assign a research specialist or analyst to the rest, if you feel it’s that important.”

I may have done this with slightly more attitude than necessary.

How could we improve this?  The bigger geeks in the room may pipe up with “machine learning!”  Ok sparky, do you have that infrastructure ready to go?  No?  How long for implementation?  mumble mumble something about Azure mumble…  Okay, sure, how about training the model so you can feed it your data?  Cool, well enjoy the incredulous laugh you’ll get when you tell the manager that.

How about other tool sets? Sure,  we could check out Python or R, write a C# app maybe?  Well guess what, we still need to look at the data to understand what patterns (or lack thereof) there are to work with.  And again, lead time.  Unfamiliar tools means longer development cycles.  And they’re really not guaranteed to produce any better results (more matches) at the end of the day, are they?

Because your data models and your analyses are only as good as the data itself.

data is the iceberg
Or is it “data are“?

And with that, I’ll call it a day.  Thanks for reading!

Indexing a HUGE Table

How it lost its original clustering key is a perfect topic for this month’s T-SQL Tuesday!


As previously referenced, we had a half-billion row table (which we’ll “round up” to a billion, just for the sake of argument) that needed an index.  A clustering key, to be exact.  How it lost its original clustering key isn’t that interesting, but it is a perfect topic for this month’s T-SQL Tuesday (we planned to replace it, before realizing how long it would take, and had to cancel the job after the old index was removed but before the new one could finish being added).  Anybody can drop an index; it takes mere milliseconds.  But to create (or rebuild) an index, the SQL engine needs to touch every single row that index will include, and of course for a clustered index, that’s errverybody.

So the “Lessons learned the hard way” are:

When planning an index replacement for a billion-row table, don’t just schedule the job as “drop old index, then create new index“, and expect nothing to go wrong.

..and..

Don’t try to index a billion-row table all at once.  Use a smarter methodology.  Unless you can literally afford to have that table be offline for several hours (depending on your hardware, and assuming you’re on Standard Edition).

Of course, some of you crazy kids with Expensive Edition are scoffing and thinking “Oh that’s nothin’!”, with your billions of rows and online index rebuilds.  Well you can go back to your fancy Always Encrypted data and your terabytes of RAM and just pretend I said “trillion” instead of “billion” — maybe that’ll fit your scale a bit better.  But for the rest of us mere mortals…

one billion records dr evil
it’s yuuge, I’m telling you…

Anyway, since we’re peons, we can’t just go create the billion-row index without taking the table offline.  So we need to come up with a way to minimize that downtime for the table, and avoid causing excessive blocking or I/O overload on the instance.  As I said before, SSIS to the rescue!

The Plan

i love it when a plan comes together
classic… plain and simple.

I call this the “setup, dump, & swap”.  Essentially we need to create an empty copy of the table, with the desired index(es), dump all the data into it, and then swap it in.  There are a couple ways you can do this, but it boils down to the same basic premise: it’s “better” (probably not in terms of speed, but definitely in terms of efficiency and overhead) to fill this new copy of the table & its indexes than it is to build the desired index on the existing table.

So here’s an outline:

  1. Script out the table (SSMS, right-click, script table, create to… or, if you use one of these cool extensions, F12 or similar “get definition” shortcut) — say, if the original is MyTable, script-create & replace MyTable with MyTableCopy
  2. Here’s a little room for choice.  You could create the table in the same schema with a new name; or, you could create the table in a different schema, with the same name or another name.  This will determine how you do the “swap” toward the end.
    • In the first case, we’d use sp_rename
    • In the 2nd, we’d use alter schema transfer
    • Both are essentially meta-data changes, but the latter could potentially take a hair longer just because it pulls more strings; whereas the former requires taking care of dependencies ahead of time so you don’t break a schema-bound view or orphan a foreign key.
  3. Add the desired index(es), e.g. create clustered index CX_MyTableCopy_Datestamp_ThingName on dbo.MyTableCopy (Datestamp, ThingName)
  4. Build the SSIS task to copy the data from MyTable to MyTableCopy
  5. Schedule said task via SQL Agent.
  6. Use the previous tip to monitor its progress and estimate time to completion!
  7. Once done, prepare to swap!
    • Again, use sp_rename, and if needed, alter schema transfer.
    • Likely, this will involve several renames – the constraints and other indexes can’t share names either, so get all that stuff renamed with an _old suffix first, then you can swap the actual tables (there’s a sketch of this just after the outline).
  8. Clean up after yourself and drop the old table once you verify everything’s working well and all dependencies are accounted for.
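
Here’s a rough sketch of what the step-7 swap can look like with the sp_rename flavor.  Every object name below is hypothetical, and your list of constraints/indexes to rename will differ:

    -- step 7, sp_rename flavor: all object names below are hypothetical
    begin tran;

    -- suffix the old table's surviving indexes & constraints so the names won't collide
    exec sp_rename 'dbo.MyTable.IX_MyTable_ThingName', 'IX_MyTable_ThingName_old', 'index';
    exec sp_rename 'dbo.PK_MyTable', 'PK_MyTable_old', 'object';

    -- swap the tables themselves
    exec sp_rename 'dbo.MyTable', 'MyTable_old';
    exec sp_rename 'dbo.MyTableCopy', 'MyTable';

    -- optionally, give the new table's indexes the "standard" names
    exec sp_rename 'dbo.MyTable.CX_MyTableCopy_Datestamp_ThingName', 'CX_MyTable_Datestamp_ThingName', 'index';

    commit tran;

    -- (the alter schema transfer flavor is much the same idea, assuming the copy was built as staging.MyTable:
    --  alter schema staging transfer dbo.MyTable;  then  alter schema dbo transfer staging.MyTable;)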

But Why?

can-you-tell-me-why
Fry shares your curiosity…

Let’s back up a bit.  Why are we doing it this way?  More importantly, what are the general use-cases for this kind of thing?  Well as I said, it’s “better” in certain ways than simply creating the index on the “live” T.O.U.S.  It avoids locking said live table, and it has the potential to put less stress on the database and less synchronous I/O against its primary files.  Finally, and most importantly, as a bulk insert operation with configurable batch sizing, the SSIS task will put far less load on the transaction log; whereas the regular inline index creation could easily fill up the Tlog and will definitely cause performance issues.

Furthermore, swapping tables, or in more advanced cases, partitions, is a fantastic way to move massive amounts of data around with minimal impact.  It also happens to be quite helpful when we have to deal with indexing such massive amounts of data.

In my case, this was actually an archive of 2016 audit-trail data, so if I’d chosen to, I could have taken it out of the partitioned view it was part of and dealt with it being offline for several hours.  But because I’m a sucker for a good problem, and because I wanted to be able to semi-accurately monitor the progress to let the boss-man know when it might be done, I chose to take this route instead.  It’ll come in handy again soon, I’m sure — we have several other jumbo-tron tables lying about that may need some index tuning and/or partitioning.

So, happy swapping, and don’t let those T.O.U.S.‘s get the best of you!

rodent of unusual size
Meh, I don’t think they even exist!

 

PS: thanks to @SQLDoubleG for hosting, and thanks in advance to the rest of the community being lenient of my “submission” being a re-purposed existing post with a few extra blurbs thrown in to bring it on-topic!  =D

Quickie: Timing a HUGE Data Copy Operation

We think the total operation will take several hours, possibly days.  So we want to be able to check up on it.

While I try to compose my resources and samples for the previous two teasers, I found this to be a particularly interesting problem and thought I’d share my solution.

Scenario

trump it's gonna be yuge
yuuuuge

We have a very large table, half a billion rows.  It needs to be copied (well actually, indexed, but that’s another part of the discussion).  We know that standard TSQL would suck for this, even if we wrote a batch-loop proc for it (sure, it might be less blocking, but it’ll take forever).  Plus, we might be going between two different servers (well, not in my case, but you very well could be, and in fact that’s probably the more likely case, given the implications of such a large operation).  SSIS to the rescue!

Now, again, we’re lazy, so we use the Import/Export Data wizard (a component of SSIS) to build the .dtsx package.  Then we can schedule it via a SQL Agent Job.

Sure, we could build it with Visual Studio and/or BIML and be super-awesome, but where’s the fun (read: lazy) in that?

Based on some preliminary estimates of the throughput using SSIS for a small sub-set of the data, we think the total operation will take several hours, possibly days.  So we want to be able to check up on it — to see its status & elapsed time, and get an estimate of time remaining.  Sound good?

Assumptions

assumptions this way
over thar!

Two assumptions going into this, to keep the example simple.

  1. We know the Agent Job’s scheduled start time, and it does start on-schedule.
  2. We’re only dealing with 2 tables — 1 source, 1 destination.  Furthermore, they’re both on the same server instance; or if not, we can query a linked-server connection once to get the “source” table size-stats, because they won’t change.

(Okay that was slightly more than 2, but again, simple.  One can imagine expanding this to multiple tables & multiple instances, but then at some point you’re going to need a “witness” that can talk to all the disparate sources of data and conglomerate those bits & pieces together, and then you have to ask yourself “is it really worth it, or can I just give a SWAG & move on with my day?”)

Before I move on: helpful SO answer that reminded me how surprisingly-not-that-difficult (and-still-pretty-efficient) it is to convert a time interval (seconds, in this case) into a “human-friendly format” like Days.hh:mm:ss.  Yay.

I’m sure I’ve done it before, and subsequently forgotten about it.  Because, again, it’s one of those things that should be done in the presentation layer.
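
Something along these lines (not necessarily the exact approach from that answer) does the trick:

    -- a minimal sketch: turn an elapsed-seconds value into a Days.hh:mm:ss string
    declare @ElapsedSeconds int = 93785;  -- e.g. 1 day, 2 hours, 3 minutes, 5 seconds
    select cast(@ElapsedSeconds / 86400 as varchar(10)) + '.'
         + convert(varchar(8), dateadd(second, @ElapsedSeconds % 86400, 0), 108) as [Days.hh:mm:ss];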

The Outline

One key component of any kind of “check-up” or general monitoring solution is that it needs to be lightweight, i.e. have low overhead.  We don’t want our method of monitoring the process to add some non-trivial extra load to that process (or the servers doing said processing).  So let’s avoid COUNT(*), shall we?  Unfortunately, the top Google results don’t readily point to this, but MS actually gave us a handy-dandy built-in way of measuring this: sys.sp_spaceused.  Yay again.

Granted, it doesn’t get you “up to the millisecond” accuracy, but it’s truly close enough for what we’re doing; and frankly, it’s your only sane option when asking the question “how many rows?” of a TOUS.

So we’re going to use the output of that system proc, specifically the rows column, to measure how much data has been bulk-copied into our destination table.  We’ll also measure the source table, once, as stated, because (we’re assuming!) that won’t change.

Finally, we’ll use those row-counts, with some simple tricks & nonsense — basic math, dateadd/datediff — to calculate a percent-complete, elapsed time, estimated total time, and ETA (estimated time of arrival, er, completion).
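
The real script is in the Gist linked below, but a stripped-down sketch of the idea, with made-up names and values, looks like this:

    -- rough sketch only; table name, start time, and row-count here are made up
    declare @JobStart datetime2 = '2017-06-01T02:00:00';   -- the Agent job's known scheduled start
    declare @SourceRows bigint = 500000000;                -- captured once from the source table

    -- grab the destination's current row-count via sp_spaceused (cheap, no COUNT(*))
    create table #SpaceUsed (
        name sysname, [rows] varchar(20), reserved varchar(20),
        data varchar(20), index_size varchar(20), unused varchar(20));
    insert into #SpaceUsed
        exec sys.sp_spaceused @objname = N'dbo.MyTableCopy';

    declare @DestRows bigint = (select cast([rows] as bigint) from #SpaceUsed);
    declare @ElapsedSec int = datediff(second, @JobStart, sysdatetime());
    declare @EstTotalSec int = @ElapsedSec * @SourceRows / nullif(@DestRows, 0);

    select 100.0 * @DestRows / @SourceRows          as PercentComplete,
           @ElapsedSec                              as ElapsedSeconds,
           @EstTotalSec                             as EstimatedTotalSeconds,
           dateadd(second, @EstTotalSec, @JobStart) as ETA;

    drop table #SpaceUsed;

(Run that on a schedule, or just re-execute it whenever the boss asks.)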

it's all a lot of simple tricks and nonsense
The Force? Hah!

The Code

See Gist.

In the next post, I’ll zoom out a bit and explain why I needed to do this, and in what situations it should come in handy.  Stay tuned!

Hacking DLM Dashboard, Part 2

Another teaser, just to keep you waiting… I promise I’ll finish it soon!

This hack is about change buildup/bloat.  When you ignore a particular DB’s schema drifts for a while, the amount of changes that the “Review” page needs to show gets larger and larger.  Until, eventually, there’s simply too much content for it to handle, and it dies.  This is a bummer, because you want to ack those changes, but you don’t necessarily need to see them all in detail, as long as you have a general idea of what they were about.

So, this will be a tale of how I purged a set of un-acknowledged changes from DLM Dashboard for a specific Server/Database.  RG support, bless their heart, wanted me to go to the document repo (which, if you’re not aware, uses a RavenDB back-end stored on your local file system) and purge everything.  But I said no, I think we can do better than that.

Plus, if you really really care about those changes, you can use the RavenDB Studio (GUI) to dump the documents to CSV — i.e. if this tool is part of your audit/compliance toolbox and you don’t want to lose those changes, but you need to be able to “move on with life” by acking the changes to the DB so that you can resume checking it on a regular basis without having a page-hang/crash.

This will be screenshot-heavy, censorship-heavy (to avoid divulging company secrets!), and meme-free (I know, I know, some of you will thank me and others will be a sad panda).

More to come, stay tuned!

Dates, Date-pickers, and the Devil

When a date range, or time period, is specified in SQL, it’s easiest, clearest, and most concise, to use a “greater-than-or-equal-to Period-Start, and less-than Next-Period-Start” logic. Mathematically speaking, we are defining the range as closed on the left, open on the right.

This is a bit rant-y, but… indulge me.  I’ve been writing/refactoring a lot of old reporting queries.  And, like most reports, they deal with dates and datetimes — as parameters, boundaries, or where/join predicates.  I also got way too intense with a recent SSC post (Sql Server Central), which fueled the fire even more.

fluffy-angry-puppy
I’m so cute and ANGRY!

SQL Server is very good at handling temporal datatypes and calculations against them.  We’ve got functions like dateadd, datediff, dateparts, datatypes datetime2 and datetimeoffset, datetime, etc.  It supports all sorts of format conversions if you need to display them in various ways.

..even though that should be left to the presentation layer!

Here’s the issue.  Well, there are several issues, but we only have time for a few.

Here’s the first problem

Report users don’t understand the “end of a time period” problem.  I don’t have a good name for it; others might call it the “Day plus one” problem or the “Less than date” problem.  What do I mean by this?  Well, let’s back up a bit, to DBA Commandment #6, “Thou shalt not use between with datetimes.”  In order to first understand the issue, we have to understand why this is a commandment.

When a date range, or time period, is specified in SQL, it’s easiest, clearest, and most concise, to specify it like so: @TheDate >= @StartOfPeriod and @TheDate < @StartOfNextPeriod.  Mathematically speaking, we’re defining the range as “closed on the left, open on the right”.  In other words, Min <= X < Max.

The reason we do this with datetimes is found right there in the name of the datatype — it has (or can have) a time component!

stone-tablets-with-roman-numerals-to-10
There are probably more than 10, but it’s a good starting point…

Let’s talk examples

Say you’d like to report on the month of March 2017.  How do you determine if your data-points (stored as datetime or, hopefully, datetime2) are within that period, that month?  Well sure, you could write where month(MyDateColumn) = 3 and year(myDateColumn) = 2017 

NO.  That is horrible, don’t do that.

It’s not SARGable and renders your index on that column useless.  (You do have an index on it, don’t you? No? Make one!)  Okay, let’s stick with something SARGable.  How about MyDateColumn between '20170301' and '2017-03-31T23:59:59.999'?  (You did read this post about using culture-neutral datetime literals right?)  But wait!  If your data is a datetime, it’s not actually that precise — your literal gets rounded up to 20170401 and you’re now including dates from April 1st (at midnight)!

Oh that’ll never happen… until it does.

Second problem

Many developers and report-writers assume that the values in their data will never be within the typical “1 second before midnight” or “1/300th of a second before midnight” escape window of your “3/31/2017 23:59:59.997” bounding value.  But can you guarantee that?  Didn’t think so.  Worse, if you use the .999 fraction as given in the 2nd example, you’re either “more” or “less” correct, and nobody can actually tell you which way that pendulum swings because it depends on the statistical likelihood of your data having actual literal “midnight” values vs. realistic (millisecond-y, aka “continuous”) values.  Sure, if you’re storing just a date, these things become a lot less complicated and more predictable.

But then why aren’t you storing it as an actual date, not a datetime!?

So what’s the right answer?

As I said, “greater than or equal to  ‘Start’, and less than ‘End'”, where ‘End’ is the day after the end of the period, at midnight (no later!).  Hence, MyDateColumn >= '20170301' and MyDateColumn < '20170401'.  Simple, yes?
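
Putting the bad and good predicates side by side, on a made-up dbo.Orders table with an index on OrderDate:

    -- BAD: wrapping the column in functions kills the index seek
    select OrderID, OrderDate
    from dbo.Orders
    where month(OrderDate) = 3 and year(OrderDate) = 2017;

    -- GOOD: closed on the left, open on the right -- fully SARGable
    select OrderID, OrderDate
    from dbo.Orders
    where OrderDate >= '20170301'
      and OrderDate <  '20170401';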

keep calm and keep it simple
KCKS

But wait, there’s more!

I mentioned “date-pickers” in the title.  When it comes to UX, date-pickers are a sore subject, and rightly so — it’s difficult to truly “get it right”.  On a “desktop-ish” device (i.e. something with a keyboard), it may be easiest on the user to give them a simple text-box which can handle various formats and interpret them intelligently — this is what SSRS does.  But on mobile devices, you often see those “spinner” controls, which is a pain in the arse when you have to select, say, your birth date and the “Year” spinner starts at 2017.  #StopIt

I mean, I’m not that old, but spinning thru a few decades is still slower than just typing 4 digits on my keyboard — especially if your input-box is smart enough to flip my keyboard into “numeric only” mode.

Another seemingly popular date-picker UX is the “calendar control”.  Oh gawd.  It’s horrible!  Clicking thru pages and pages of months to find and click (tap?) on an itty bitty day box, only to realize “Oh crap, that was the wrong year… ok let me go back.. click, click, tap..” ad-nauseum.

stop-it-sign
#StopIt again

The point here is, use the type of date-picker that’s right for the context.  If it’s meant to be a date within a few days/weeks of today, past/future — OK, spinner or calendar is probably fine.  If it’s a birth date or something that could reasonably be several years in the past or future, just give me a damn box.  (Heck, I’ll take a series of 3 boxes, M/D/Y or Y/M/D, as long as they’re labeled and don’t break when I omit the leading-zero from a single-digit month #!)  If there’s extra pre-validation logic that “blocks out” certain dates (think bill-payer calendars or Disneyland annual-pass blackout-days), that probably needs to be a calendar too.

..just make sure it’s responsive on a mobile device.

And in all cases, pass that “ending date” to your SQL queries in a consistent, logical, sensible manner.  For reporting, where the smallest increment of a period is 1 day, that probably means automagically “adding 1 day” to their given end-date, because the end-user tends to think in those terms.  I.e. if I say “show me my bank activity from 1/1/2017 to 1/31/2017”, I really mean “through the end of the month“, i.e. the end of the day of 1/31.  So your query is going to end up wanting the end-date parameter to be 2/1/2017, because it’s using the correct & consistent “greater than or equal to start, and less than start-of-next” logic.
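
On the query side, that works out to something like this (table and parameter names invented for the example):

    -- the user picks 1/1/2017 through 1/31/2017; the query quietly treats the end as "< 2/1/2017"
    declare @StartDate date = '20170101',
            @EndDate   date = '20170131';   -- exactly what the user entered

    select ba.ActivityDate, ba.Amount
    from dbo.BankActivity as ba
    where ba.ActivityDate >= @StartDate
      and ba.ActivityDate <  dateadd(day, 1, @EndDate);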

context-consistency-clarity
The 3 C’s

Final thoughts

I know it’s not easy to explain to business folks, and it’s not easy to implement correctly.  But it’s important.  The >= & < logic is clear, concise, and can be used consistently regardless of underlying datatype.  You just need to adjust your presentation layer (whether that’s SSRS parameters or a .NET date-picker) to convey their intent to the user, whether that’s “show/enter the last day of the month, but translate to the next day to feed to the query/proc.”, or “make them enter the next-day (day after the end of the month/period) and understand the ‘less than’ logic.”  I’m more inclined to the first, but it depends on your audience.

Thanks for reading, and happy date-ing!

Reverse Engineering SSAS Reports

MDX is not SQL. It may look like it has SELECT/FROM/WHERE clauses, but god help you if you start drawing parallels to your standard TSQL query.

This is an exercise I had to go through recently, because A) the reports in question were deployed in SSRS but used an SSAS backing, i.e. cubes, and the source queries (MDX) were not stored in source-control, and B) I don’t write MDX queries.

a basic MDX query example with labeled parts
shamelessly borrowed from an excellent article

The outline:

  1. Run Profiler or XEvents against the SSAS server
    1. set to capture “Query Begin” events only, with “Event Subtype = 0” (for MDX query)
    2. optionally, set filter on NTUserName to the dedicated SSRS account (if you have it set up that way)
  2. Run the SSRS report(s) that you want to dive into
  3. For each event in the Trace, copy-paste the MDX query to a new MDX editor window
  4. SSRS parameter substitution happens via some XML at the bottom; but in MDX, the parameters are standard @params like in T-SQL.  So we need to manually substitute our parameter values.
    1. 2 blocks of XML: the “Parameters”, and the “PropertyList” — delete the latter.
    2. In the former, text-replace &amp; with simply & .
    3. Side-bar: You’ll notice that the MDX parameters are usually inside STRTOMEMBER() or STRTOSET(), which are built-in MDX functions that do exactly what they sound like — parse a string into a dimension’s attribute’s member or set of members.  That’s why they’ll usually have at least 3 .‘s (dots) — Dimension.Attribute.&MemberValue, for example.  I’m grossly oversimplifying that because it’s beyond the scope of this post, but read the docs if you need more gritty details.
  5. For each parameter node:
    1. Copy/cut the <Value> node content (I like to ‘cut’ because it helps me keep track of which ones I’ve done already)
    2. Find-and-Replace @ParameterName with that Value node’s content, surrounded in single-quotes
    3. Example: we have parameter ​@ReportDate (in MDX), corresponding to <Parameter><Name>ReportDate</Name> in XML, with <Value xsi:type="xsd:string">[Some Dimension].[Some Attribute].[Some sub-attribute].&[2017-05-01T00:00:00]</Value> — where that last bit is a standard SQL datetime literal.
    4. So you replace @ReportDate with '[Some Dimension].[Some Attribute].[Some sub-attribute].&[2017-05-01T00:00:00]' .
  6. Delete the XML block.
  7. Boom, now you have a valid MDX query that you can run and view results.

Why do this?  Well, it can help you learn MDX from a working example, instead of from super-basic dummy examples.  That’s not always a good learning style — you should still learn the fundamentals of MDX and why it’s so very different from SQL.  Especially if you’ll be responsible for writing and maintaining more than a few MDX queries.  But, in a pinch, if you need to start somewhere, and possibly all that the MDX / overlaying report needs is a slight tweak, this may be enough to get you going.

Lessons learned:

  • Profiler can still be a useful tool, despite some people’s attempts to kill it.
  • MDX is not SQL.  It may look like it has select, from, and where clauses, but god help you if you start drawing parallels to your standard TSQL query.
  • SSRS does parameter-passing in an odd way.
  • SSAS & MDX are fascinating and I need to learn more about them!
the-more-you-know
Canadians, eh?

DBA Holy Wars Part 2

Battle 4: GUIDs vs Identities

This is an oldie but goody.  A) Developers want their apps to manage the record identifiers, but DBAs want the database to do it.  B) Developers prefer abstracting the identity values out of sight/mind, DBAs know that occasionally (despite your best efforts to avoid it) your eyeballs will have to look at those values and visually connect them with their foreign key relationships while troubleshooting some obscure bug.

but-wait-theres-more-billy-mays
there’s ALWAYS more…

But there’s more to it than that.  See, none of those arguments really matter, because there are easy answers to those problems.  The real core issue lies with the lazy acceptance of GUI/designer defaults, instead of using a bit of brainpower to make a purposeful decision about your Primary Key and your Clustered Index.

Now wait a minute Mr. DBA, aren’t those the same thing?

NO!  That’s where this problem comes from!

A good Clustered Index is: narrow (fewer bytes), unique (or at least, highly selective), static (not subject to updates), and ever-increasing (or decreasing, if you really want).  NUSE, as some writers have acronym’d it.  A GUID fails criteria ‘N’ and ‘E’.  However, that’s not to say a GUID isn’t a fine Primary Key!  See, your PK really only needs to be ‘U’; and to a lesser extent, ‘S’.  See how those don’t overlap each other?  So sure, use those GUIDs, make them your PK.  Just don’t let your tool automagically also make that your CX (Clustered indeX).  Spend a few minutes making a conscious effort to pick a different column (or couple columns) that meet more of these requirements.

For example, a datetime column that indicates the age of each record.  Chances are, you’re using this column in most of your queries on this table anyway, so clustering on it will speed those up.

Most of the time, though, if your data model is reasonably normalized and you’re indexing your foreign keys (because you should!), your PKs & CX’s will be the same.  There’s nothing wrong with that.  Just be mindful of the trade-offs.
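
To make that concrete, here’s a hedged sketch (the table is invented for illustration): the GUID stays the primary key, but the clustered index goes on a narrow, static, ever-increasing datetime instead.

    -- the GUID stays the primary key (nonclustered); the clustered index goes on the datetime
    create table dbo.WidgetEvent (
        WidgetEventID  uniqueidentifier not null
            constraint DF_WidgetEvent_ID default newid(),      -- or generated by the app
        EventStamp     datetime2(3)     not null
            constraint DF_WidgetEvent_Stamp default sysdatetime(),
        Payload        nvarchar(200)    null,
        constraint PK_WidgetEvent primary key nonclustered (WidgetEventID)
    );

    create clustered index CX_WidgetEvent_EventStamp
        on dbo.WidgetEvent (EventStamp);

The random NEWID() values will still fragment the PK’s own (nonclustered) index, but that’s a much smaller structure to maintain than the table itself.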

Battle 5: CSV vs TAB

bluray-vs-hddvd-fight
Who doesn’t love a good format-war?

Often, we have to deal with data from outside sources that gets exchanged via “flat files”, i.e. text files that represent a single monolithic table of data.  Each line is a row, and within each line, each string between each delimiting character is a column value.  So the question is, which is easier to deal with as that delimiter: comma, or tab?

String data values often have commas in them, so usually, the file also needs a “quoting character”, i.e. something that surrounds the string values so that the reader/interpreter of the file knows that anything found inside those quotes is all one value, regardless of any commas found within it.

But tabs are bigger.. aren’t they?  No, they’re still just 1 byte (or 2, in Unicode).  So that’s a non-argument.  Compatibility?  Every program that can read and automatically parse a .csv can just as easily do so with a .tab, even if Windows Explorer’s file icon & default-program handler would lead you to believe otherwise.

I recently encountered an issue with BCP (a SQL command-line utility for bulk copying data into / out of SQL server), where the csv was just being a pain in the arse. I tried a tab and all was well! I’m sure it was partially my fault but regardless, it was the path of least resistance.
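
For reference, here’s roughly what that looks like with bcp (server, database, and paths are placeholders, not from my actual scenario):

    rem comma-delimited export: you have to specify the terminator, and still worry about embedded commas & quoting
    bcp WidgetDB.dbo.MyTable out C:\exports\MyTable.csv -S MyServer -T -c -t,

    rem tab-delimited export: same command, and with -c the tab terminator is actually the default
    bcp WidgetDB.dbo.MyTable out C:\exports\MyTable.tab -S MyServer -T -c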

Battle 6: designers vs scripting

no-wizard-allowed
Wizards are usually good, but in this case, they’re lazy and bad for you…

This should be a no-brainer. There is absolutely no excuse for using the table designer or any other wizardy GUIs for database design and maintenance, unless you’re just learning the ropes. And even then, instead of pressing ‘OK’, use the ‘Script’ option to let SSMS generate a T-SQL script to perform whatever actions you just clicked-thru.  Now yes, admittedly those generated scripts are rarely a shining example of clean code, but they get the job done, even with some unnecessary filler and fluff.  Learn the critical bits and try to write the script yourself next time – and sure, use the GUI-to-script to double check your work, if you still need to.

Confession: I still use the GUI to create new SQL Agent Jobs. It’s not that I don’t know how to script it, it’s just that there are so many non-intuitive parameters to those msdb system-sp’s that I usually have to look them up, thereby spending the time I would have otherwise saved.
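
For what it’s worth, a bare-bones scripted job looks roughly like this — job name, schedule, and command are all made up for the example:

    -- minimal sketch of scripting an Agent job instead of clicking through the GUI
    use msdb;

    exec dbo.sp_add_job
        @job_name = N'Nightly - Refresh WidgetStats',
        @enabled  = 1,
        @description = N'Demo job created by script rather than the GUI.';

    exec dbo.sp_add_jobstep
        @job_name  = N'Nightly - Refresh WidgetStats',
        @step_name = N'Run refresh proc',
        @subsystem = N'TSQL',
        @database_name = N'WidgetDB',
        @command   = N'exec dbo.usp_RefreshWidgetStats;';

    exec dbo.sp_add_jobschedule
        @job_name = N'Nightly - Refresh WidgetStats',
        @name     = N'Nightly at 2 AM',
        @freq_type = 4,            -- daily
        @freq_interval = 1,        -- every 1 day
        @active_start_time = 020000;

    exec dbo.sp_add_jobserver
        @job_name = N'Nightly - Refresh WidgetStats',
        @server_name = N'(local)';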

Bonus round: the pronunciation of “Data”

its-data-not-data
Call me “big Data” one more time…

Dah-tuh, or Day-tuh?  Or, for the 3 people in the world who can actually read those ridiculous pronunciation glyphs, /ˈdeɪtə/ or /ˈdætə/ ?  It’s a question as old as the industry itself… or maybe not.  Anecdotally, it seems like most data professionals, and people in related industries, tend to say “day-tuh”; while those in the media and generally less technical communities tend to say “dah-tuh”.  (Where the first syllable is the same vowel-sound as in “dad” or “cat”.)  This likely means that the latter is more popular, but the former is more industrially accepted.

In either case, it doesn’t really matter, because at the end of the day, we’re talking about the same thing.  So if some dogmatic DBA or pedantic PHB tries to correct your pronunciation, tell ’em to stop being so persnickety and get on with the task at hand!

Until next time…