Replication “Just Trust Me”

For what seems like years, I’ve bemoaned the fact that SQL Transactional Replication doesn’t come with a “Just Trust Me” option. I’ll explain more about what I mean in a moment. The other thing I’ve complained about is that there’s no “Pause” button, which is not entirely accurate, since obviously you could just stop the distribution and subscription agents. But specifically what I mean is, it’s not easy to ‘put it on hold so you can make some schema changes to one of the tables that’s being replicated’, and then easily “Resume” it after you’re done with said changes.

Well, I’m happy to say that now I have both of these tools/methodologies in my arsenal!

Quick level-set: If you’ve been living under a virtual rock, SQL replication is an old-hat “tried-and-true” method of producing readable copies of your data on other SQL servers, whether for reporting or DR. It’s not an HA technology per se, although I suppose you could use it for that if you were feeling adventurous. It’s more for “I need a reasonably up-to-date copy of my data ‘over there’ so I can run reports / crappy user-formed / EF-generated queries against it without slowing down my production OLTP system.”

Yes, I did just take a pot-shot at Entity Framework. #DealWithIt

i don't always break replication but when i do it drives me to drink
But not Dos Equis. That stuff is terrible. =P

Why?

The word that comes to most DBAs’ minds when they think of replication is ‘brittle’. And for good reason: when it breaks, it breaks hard, and you’re often left trying to pick up the pieces while wondering how much worse it could be if you just started over from scratch (i.e. dropped all the replications and re-created them). Which, honestly, sometimes is easier. But not if you have a large volume of data, and certainly not if that data is indexed and you don’t want your apps to experience a performance-crisis!

Now, because this post has been sitting in my ‘Drafts’ area for far too long, I’m going to break this up into 2 parts, so I can get something out the door. In part 1, I’ll briefly explain each of the key components of the process. In part 2, I’ll dive into a little more step-by-step detail.

Primary resources that went into this: docs, article1, article2, article3. And my very own dba.SEanswer where I apparently went through a similar process back in 2016 and subsequently forgot about it (mostly).

Key 1: Sync-Type

TL;DR: the “Just Trust Me” option is to specify @sync_type = 'none' when you create the subscription with sys.sp_addsubscription. Huge thanks to @garethn in the SQL Community Slack.

Sidebar: if you haven’t yet joined the SQL Community Slack, WHAT ARE YOU WAITING FOR?!?!? DO IT, DO IT NOW!!!

Ahnold ‘teh Governator’

@sync_type = 'replication support only' may be applicable in some scenarios as well, but I’m not 100% clear on the difference / use-cases at the moment. More to come later, hopefully.
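To make Key 1 concrete, here’s a minimal sketch of what that call might look like for a push subscription. The publication, server, and database names are placeholders I made up, so swap in your own. As far as I can tell from the docs, 'none' is flagged as deprecated in favor of 'replication support only', and both assume the subscriber already has the schema and the data. Treat this as a starting point, not gospel.

```sql
-- Run at the Publisher, in the publication database.
-- N'MyPub', N'SubscriberServer' and N'SubscriberDb' are hypothetical names.
EXEC sys.sp_addsubscription
    @publication       = N'MyPub',
    @subscriber        = N'SubscriberServer',
    @destination_db    = N'SubscriberDb',
    @subscription_type = N'push',
    @article           = N'all',
    -- "Just Trust Me": no snapshot is applied; the subscriber is assumed
    -- to already hold matching tables and data.
    @sync_type         = N'none';

-- A push subscription still needs its Distribution Agent job, which is
-- typically created afterwards with sys.sp_addpushsubscription_agent.
```

The catch, of course, is that replication really will trust you: if the subscriber’s data doesn’t match, you’ll find out later when a command fails to apply.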

Key 2: Script Publication Procs

Protip: sys.sp_scriptpublicationcustomprocs @publication = 'PublicationName' generates the internal repl-procs (the insert/update/delete procedures) that apply changes to the replicated tables on the subscriber. You run this ‘script’ command on the publisher, then get the results (the script it generates), copy-paste to a new SQL file, and run on the subscriber.

This has come in handy on several recent occasions, wherein I had to either swap tables behind-the-scenes due to a PK change, or make a column & index change that involved truncation. Using the “stop, shuffle, start” method, which I’ll get into in part 2, I’m able to tell the subscriber “Hey, the definition of this table has changed, you need to grab these new repl-procs so you can handle it correctly!”
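For reference, the ‘script’ step is a one-liner. This is just a sketch; N'MyPub' is a placeholder publication name, and I’d suggest switching SSMS to “Results to Text” so the generated script is easy to copy.

```sql
-- Run at the Publisher, in the publication database.
-- N'MyPub' is a hypothetical publication name.
EXEC sys.sp_scriptpublicationcustomprocs
    @publication = N'MyPub';

-- The result set is itself a T-SQL script (CREATE PROCEDURE statements for
-- each article's insert/update/delete procs). Save it to a .sql file and
-- run that file against the subscription database on the Subscriber.
```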

Key 3: Publication Properties

In order to tell our publication that “We’re gonna be making some changes, don’t panic!”, we want to turn OFF 2 properties (assuming they’re true, which they likely are by default) using sys.sp_changepublication @publication='MyPub'. The properties are 'allow_anonymous' and 'immediate_sync', and you simply append the arguments to the proc call like so: @property='allow_anonymous', @value='false' / @property='immediate_sync', @value='false'.

Later, after we’re all done with our under-the-hood changes, we’ll want to turn them back on, in reverse order: first enable 'immediate_sync', then 'allow_anonymous'. Cool? Don’t ask me why; DBAs much smarter than I have decreed it so.
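Spelled out as actual calls, the off/on sequence might look like the sketch below (N'MyPub' is a placeholder). My best guess at the reason for the ordering, based on the docs for creating publications, is that 'allow_anonymous' can only be true while 'immediate_sync' is true, so the dependent property has to go off first and come back on last.

```sql
-- Run at the Publisher, in the publication database.
-- N'MyPub' is a hypothetical publication name.

-- Before the changes: turn both properties OFF, allow_anonymous first.
EXEC sys.sp_changepublication
    @publication = N'MyPub', @property = N'allow_anonymous', @value = N'false';
EXEC sys.sp_changepublication
    @publication = N'MyPub', @property = N'immediate_sync',  @value = N'false';

-- ... make the under-the-hood changes here ...

-- Afterwards: turn them back ON in reverse order.
EXEC sys.sp_changepublication
    @publication = N'MyPub', @property = N'immediate_sync',  @value = N'true';
EXEC sys.sp_changepublication
    @publication = N'MyPub', @property = N'allow_anonymous', @value = N'true';
```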

OMG, remember Xena Warrior Princess? Holy wow that’s some nostalgia for ya.

Honorable Mention: Pull Subscriptions

In one instance, I was using a PULL subscription (as opposed to PUSH). I had to re-start the Distribution agent (on the subscriber) twice for it to work (to start actually synchronizing). It STILL shows as ‘Uninitialized Subscription’ in the repl-monitor, though. Kinda annoying.

Pull subscriptions can be nice because they shift the burden to the subscriber DB, so that your publisher (master, primary, whatever you wanna call it) doesn’t get too bogged-down. But as always, there are trade-offs. Check out this handy little comparison guide on the topic from a fellow DBA blogger.

That’s all for now; stay tuned for more as I go into detail about how I used these in what scenarios. Thanks for reading! ❤


Facepalms Per Hour

My current velocity is sometimes measured in FPH – facepalms per hour.

This is a rant. Fair warning.

I guess the new ‘Millennial’ colloquialism for “grumpy” or “sarcastic” is “salty”. So I’m feeling extra salty this week. For several reasons. One, it’s audit season. Two, I had to churn out about a dozen new reports in the span of 4 days because the manager who was supposed to be tracking that project dropped the ball and forgot they were due by the end of this month until… yeah, last Friday. Wheeeeee!

Thus, I decided, my current ‘velocity’ (a SCRUM/DevOps term for “how much work are you getting done”) shall be measured in FPH – Facepalms Per Hour. Currently I’m at 3. Earlier this week I was approaching the double-digits, when the lovely report consumers kept thinking of “just one more little thing” they forgot about until after I’d delivered the ‘final’ product.

‘Final’ actually being a meaningful adjective in this context approximately NEVER.

facepalm original picard
The original gangsta.

Change Logs

How best to describe this scenario while still maintaining separation of “real job” from “blog land”… Hrm. So let’s say we have a CRM, like most companies. This stores customers, among other things, in a database. And since it also stores sales transactions and financials, it’s heavily audited — it has a lot of change-tracking mechanisms.

Now, auditors come along and want a report of some specific type of change over time. I happily oblige. Then… PANIC! And not at the disco. “What are all these changes to these customers by these users who don’t have permission to make said changes?!?”

K, calm down sparky. Try not to sound the alarm; auditors are a sensitive bunch.

Turns out, those changes are, in a word, “fake”. You see, there’s this background “customer sync” process that keeps them up to date with another part of the CRM where the actual changes were made. But, because it’s written poorly, it thinks that ANY field change, even just the Name or Address (which a lot more CSRs, customer service reps, have the permission to change, because, you know, that’s their job), constitutes a change to the ENTIRE customer record on the other end. So the change tracker logs a change to every single field on the receiving end of that sync process, even though nothing really changed on the source side except maybe one or two fields.

With me so far? Great. So now the question is, “Well, can we get a report that doesn’t show those ‘fake’ changes?” But wait, it has to be “system generated” and you’re not allowed to “filter” or “add special exceptions” to it, because it still needs to be audit-able.

So what you’re saying is, give me a report that shows me what I care about, but you’re not allowed to change the logic behind said report.

Riiiiiiight.

So I give them a new report. I don’t explain how the sausage is made, I just make it and serve it up. “But why is this different from the original report?”

double facepalm

Well, do you want the audit-able answer, or the real answer? The audit-able answer is, “We made a system change that allowed us to prevent the ‘fake’ changes from being logged incorrectly.”

The real answer is, “B*tch, I AM the system!” — meaning yes, I excluded those with some hacky logic, and you need to stop asking questions about it.

Anyway. Change Logs are super fun.

Reports

Speaking of reporting. I could really go on for pages about how terrible and broken this whole system of “request-based report development” is. But it’s frankly all we have right now. Until there’s sufficient business buy-in to the concept of agile data warehousing and collaborative cross-functional data modeling, shit just comes in one funnel and goes out another with a little sparkle spackled to it. And we call it a report.

Example, you say? Sure! Let’s say we run a special sale on certain types of widgets every quarter. We want to track how these ‘specials’ perform — do they increase our sales of those widgets? By what factor, compared to the other not-on-sale widgets? Can we trend this over several quarters?

Oh but wait. The data structures that govern widget pricing and time-span-based sale pricing, and the logic that relates customer orders to what pricing structure they used at the time of ordering, are awful and terrible, and they change every time there’s a new quarterly promotional sale.

So you’re saying you want a report that trends sales of widgets based on arbitrarily changing promotional pricing as compared to other widgets that may or may not be subject to ‘normal’ pricing during that same time period, all without a simple definitive data-point that says “This is a Quarterly Promo sale, and That is Not.”

elrond's facepalm
Mister Anderson… I mean, wait.

Let’s try to get at the root of the problem, shall we? The business doesn’t seem to understand that the way they implement promo-sales is detrimental to long-term/comparative reporting. The data model makes this harder, not easier. Can we perhaps put some heads together and come up with a compromise that both A) makes more business sense, and B) improves the data model to be a bit more intuitive?

In Closing…

What’s your FPH? What causes you to facepalm on a regular basis? Let me know in the comments!  :o)

T-SQL Tuesday #115: Dear 20-year-old Self

Don’t be afraid of that big change, that big opportunity.

This month’s #tsql2sday is brought to you by Mohammad Darab, a relatively new #SQLSaturday speaker (congrats!) and all-around-great-guy. Feeling introspective, he invites us to write a letter to our 20-year-old self. Quite coincidental that I’ve been in a “letter writing mood” lately on my other blog. 😉

Dear Nate_the_College_Student

A couple things. First, DTB (rated PG-13 link). SRSLY. The high school girlfriend is NOT the one, nor does she treat you well at all. It’s gonna be over a year after you graduate, so do yourself a favor and break it off now so you can enjoy college more.

Second, study more, game less. It’s gotten ridiculous. You had straight A’s. You’ll get your first D EVER if you don’t take that network programming class seriously. And it will hurt. Your ego AND your GPA.

Finally! Hardware ain’t for you. Software is where it’s at. And data stuff. ALL the data! Someone’s gonna coin the phrase “data is the new oil”. Hey! That should be YOU! Do it.

Where the Rubber Meets the Road

Your first job is going to be amazing. You’ll spend NINE years at that company. Probably a bit too long. They weren’t very understanding when your wife was very sick and needed your help running around to doctors and pharmacies and such. But you’ll work with some truly awesome people and make connections that will last many years. That’s called ‘networking’: making sure you stay in touch with those people. It will help open up future career opportunities.

Yes, I said “wife”. We’ll get to that in a bit.

Spend your 20s doing silly things, adventurous things, but focus on your career. If I had to do it all over again, I would branch out sooner, explore more opportunities earlier, and look at the world of tech beyond this little suburb. There is so much more out there. Not that there’s anything wrong with suburbia — our town is great. I love it. But it’s not a tech-job haven. So don’t be afraid of that big change, that big opportunity.

Because once you go for it, once you finally take that chance, you’ll be oh-so-much happier! Ride that wave of confidence and of learning new things. Start attending SQL Saturdays earlier, and look into other data-centric meetups and events. There is so much to learn, and never enough time.

Oh, and PS: be super careful driving. Please.

Love

Yes, you will find love. You will find the love of your life. The One. Your soulmate. When you least expect it. You will spend your latter 20s and early 30s having the best years of your life with her. She will be your everything. She is laughter, passion, heartache, support, grace, care, light, love, and life itself. Cherish every moment.

For our time on this earth is but a whisper on the winds of eternity.

A question that’s often asked of grievers is, if they knew what would happen, would they go back, change anything, do anything different. Even in this thought experiment, would I tell my past self what would happen to my wife? No, I don’t think I would. For even now, as I look back on the million little moments that we shared and loved and laughed and cried, I would do it all again in a heartbeat. Our love was once-in-a-lifetime. I know that in my soul. And I carry that flame in my heart. May it never be extinguished.

i love you 3000 with picture of wife inside
Always. ❤

Follow-up: Cribbage “15’s Counter”

The actual method involves joining 5 copies of the table together, by each right-side table only including cards with higher ID values than the table to its left.

To be honest, my T-SQL Tuesday puzzle was a bit of a last-minute idea, which is why I didn’t have a solution ready-made. But, dear reader, you’re in luck! I have one now.

The code is over here in Gist. You can read thru it, but since the final query — the actual “answer” — is kinda ugly, let me explain my thought process.

Modeling is Important

Even when I’m putting together a silly little demo script like this, I feel that good habits and fundamentals are important. You never know what future developer might read it, copy-paste it, and say to themselves “Cool, I’m gonna follow this example when I do this other thing over here!” So you’ll see my formatting preferences, naming convention (though I must admit, I argued with myself over whether to pluralize the table names or not!), and correctly allocated Primary Keys. And since we’re modeling a card deck, even though I didn’t need to store the ‘NumValue’ (which is what you’d use for a straight/run, where the Jack is 11, Queen is 12, etc.), I did anyway.

Now, when we set up our “Hands”, we’re going to use two ‘PlayerNum’s, just so we can test two different hands at the same time. Cribbage can be played with 3 or 4 players, but we’re keeping this simple. Also, I could have built the hands more aesthetically, i.e. by selecting from Cards using PtValue and Suit, but again, I was trying to script quickly, so I just used the IDs that I knew from the previous query (the “full deck”). And again, there’s a “little extra” tidbit, the ‘IsCut’ indicator — we won’t be using that right now. If you’re still not sure what that means, go read the rules.

The Method

At the end of the original post, I mentioned loops and cursors as possible routes to a solution. That may still be true, but I decided to challenge myself to avoid them. Not because they’re “always bad”, as popular media would have you believe; they’re just often an indicator that a developer isn’t thinking in set-theory when they probably should be.

Let’s start with some basic principles. You have 5 cards in your hand. It takes a minimum of two cards to make 15 (examples include Jack+5, 6+9, etc.), and up to a maximum of.. you guessed it, five cards. So we need to check all combinations of any two, three, four, or five cards. We cannot re-use a card within the same combination; and putting the same three cards in a different order, for example, does NOT count as a separate combo (another ’15’).

So as you start to think about these rules, and if you’ve been around data for a while, especially data with identity values, you might have a little light-bulb. “Aha! I know how to do that. We can simply order the combos by the ID value, and that way we won’t allow duplicates!” And that’s kinda what I did, by enforcing the JOIN predicates that every subsequent derived-table have a ‘CardID’ greater than the prior one. But I’m getting ahead of myself.

The actual method here involves JOINing 5 copies of the table together, mainly just on PlayerNum, but also, as I said, by each right-side table only including cards with higher ID values than the left-side. In this way, we ensure that we’re not allowing the same cards to be “joined” to each other, i.e. we’re removing them from the right-side tables.

And finally, we have four OR’d conditions: simply “do any of those combinations add up to 15, by the Card’s PtValue?” These are echoed in the CASE-expression in the SELECT line, where we want to essentially “show the combo”, i.e. tell you what cards make up the ’15’. (Again, for style’s sake, we have an ELSE, but we don’t really need it because it’ll never actually happen.)
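If you’d rather see the shape of it without clicking through, here’s a condensed sketch of the technique. It is NOT the Gist code: the @Hand table variable, its column names, and the card IDs are stand-ins I made up so the example is self-contained, and the DISTINCT is my own addition, because with LEFT JOINs a two-card ’15’ can otherwise show up once for every higher card that happens to join on after it.

```sql
-- Hypothetical stand-in for the Hands + Cards tables: one row per card held.
DECLARE @Hand TABLE (
    PlayerNum int        NOT NULL,
    CardID    int        NOT NULL,
    CardName  varchar(3) NOT NULL,
    PtValue   int        NOT NULL,
    PRIMARY KEY (PlayerNum, CardID)
);

-- Sample hand from the puzzle post: 3h, 6s, 6d, 9c, plus the cut card 3c.
INSERT INTO @Hand (PlayerNum, CardID, CardName, PtValue)
VALUES (1, 1, '3h', 3), (1, 2, '3c', 3), (1, 3, '6s', 6),
       (1, 4, '6d', 6), (1, 5, '9c', 9);

SELECT DISTINCT
       c1.PlayerNum,
       Combo = CASE
                   WHEN c1.PtValue + c2.PtValue = 15
                       THEN c1.CardName + '+' + c2.CardName
                   WHEN c1.PtValue + c2.PtValue + c3.PtValue = 15
                       THEN c1.CardName + '+' + c2.CardName + '+' + c3.CardName
                   WHEN c1.PtValue + c2.PtValue + c3.PtValue + c4.PtValue = 15
                       THEN c1.CardName + '+' + c2.CardName + '+' + c3.CardName
                            + '+' + c4.CardName
                   WHEN c1.PtValue + c2.PtValue + c3.PtValue + c4.PtValue + c5.PtValue = 15
                       THEN c1.CardName + '+' + c2.CardName + '+' + c3.CardName
                            + '+' + c4.CardName + '+' + c5.CardName
                   ELSE NULL  -- never reached: the WHERE clause mirrors the WHENs
               END
FROM   @Hand AS c1
-- Each copy only sees cards with a HIGHER CardID than the copy to its left,
-- so no card repeats and no combo appears in a second ordering.
JOIN      @Hand AS c2 ON c2.PlayerNum = c1.PlayerNum AND c2.CardID > c1.CardID
LEFT JOIN @Hand AS c3 ON c3.PlayerNum = c2.PlayerNum AND c3.CardID > c2.CardID
LEFT JOIN @Hand AS c4 ON c4.PlayerNum = c3.PlayerNum AND c4.CardID > c3.CardID
LEFT JOIN @Hand AS c5 ON c5.PlayerNum = c4.PlayerNum AND c5.CardID > c4.CardID
-- A NULL card (no match on a LEFT JOIN) makes its sum NULL, so that
-- condition simply fails to match instead of erroring.
WHERE  c1.PtValue + c2.PtValue = 15
   OR  c1.PtValue + c2.PtValue + c3.PtValue = 15
   OR  c1.PtValue + c2.PtValue + c3.PtValue + c4.PtValue = 15
   OR  c1.PtValue + c2.PtValue + c3.PtValue + c4.PtValue + c5.PtValue = 15;
```

For that sample hand, this should list the same five combos worked out by hand in the puzzle post. Multiply the row count by 2 and you’ve got the points for ’15’s.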

Now, it does look kinda ugly. It’s not very extensible — meaning, if you wanted to scale it up to find the ’15’s in a 6- or 7-card hand, or you wanted to look for other kinds of combos (like ’18’s or ’27’s), you’d end up re-writing a good portion of it, or at least copy-pasting a lot. Fortunately for us, Cribbage is fairly simple in this regard — your hand is always the same size, and you only ever care about ’15’s.

(Well, and pairs, 3- and 4-of-a-kinds, straights, flushes, knobs, etc., but again, read the rules if you’re curious. We kept this very simple by limiting ourselves to just one small fraction of the game mechanics.)

The cool thing about this sample, though, at least to me, is that you’re already set up to build on it if you want to try out other Cribbage mechanics. Or even other card games, if you just use the base Suits & Cards.

What Did We Learn?

What’s the point of a puzzle like this? Well, besides introducing you to a fantastic card game, if you didn’t already know about it. The point is to make your brain think in a different way than usual. Are any of us programming card games using a SQL back-end? Probably not. (Although an in-memory equivalent like SQLite or something might be viable!) But the next time you have a “combinations problem” with some real-world data, you might wonder if a method like this could come in handy. Or at least, if it could work out better than a double-nested-loop. =)

PS: I believe, instead of the LEFT JOINs, we could have used OUTER APPLYs. We’d move the conditions from the JOINs into the inner WHERE clause of each derived table, i.e. “this ID > previous ID” and “PlayerNums are equal”. If you’re curious, try it out!

T-SQL Tuesday #114: A Puzzle

One of the main things a new cribbage player needs to learn is how to easily spot the combos that make ‘a 15’ (the ways to combine cards to add up to a numeric value of 15). Let’s do that with SQL!

It’s that time again! The 2nd Tuesday of the month, T-SQL Tuesday. This month’s invitation is on the lighter side, which is nice, and it comes from Matthew McGiffen (b | t). The theme is “Puzzle Party!” And I’m going to cheat, since it’s getting horribly late already and I’m lacking in inspiration.

So, I propose a puzzle! Which you must solve using SQL. Then I’ll post my own solution in a day or two. Bwahahaha.

I actually really wanted to do a Sudoku solver, but @SQLRnnr beat me to it. By a few years. =P   I might still work on that when I’m bored, just to have a standby for another blog post. Maybe we’ll compare notes.

But for now…

Do You Even Cribbage, Bro?

If you’ve never heard of the card game cribbage, it might sound weird. When you read the rules, it sounds even weirder. Legend has it that it was invented by drunk Englishmen in a pub. Reality is actually not that far off. It’s also heavily played by Navy submariners, and that’s how it was passed down in my family.

There are already many great mobile & web versions of the game, and it will quickly become obvious to anyone who’s tried to program a card game before, that a query language like T-SQL is NOT suited (omg see what I did there?) to the task. However, we can probably come up with a small sub-task of the game that’s acceptable for our purposes.

Enter: the hand scorer. There’s a nice example of a finished product here. The input would be a set of 5 ‘cards’ — the ‘hand’ has 4, and the ‘cut’ adds 1 more, used as part of each player’s hand in scoring (like community property). A ‘card’ is simply an alphanumeric value — 1-10 plus JQK (which are ‘worth’ 10 for arithmetic, but can be used like normal for ‘straights’ aka ‘runs’) — and a ‘suit’ (heart, spade, diamond, club). Think for a moment on how you’d store that as a data structure.

The output, then, is a single numeric value, the ‘score’. But how do you score? You look for the following: combinations of any numeric values that add up to 15; pairs, 3-of-a-kinds, or 4-of-a-kinds; straights (suit does not matter); a flush, if all 4 ‘hand’ cards are the same suit (and a bonus point if the ‘cut’ card matches as well). And then there’s a funky thing where you get an extra point if you have a Jack that matches the suit of the ‘cut’ card. o_@

Dude… What?

Wow, that sounds complicated, no? Let’s make it simpler. One of the main things a new cribbage player needs to learn is how to easily spot the combos that make ‘a 15’ (the ways to combine cards to add up to a numeric value of 15). For each ’15’ you make, you score 2 points. That sounds pretty feasible in SQL, right?

For starters, we don’t really care about suit anymore. But we do need some way to distinguish the cards from each other. This is a single-deck game, so you’re never going to have more than 4 of the same number; never more than one of the same card (like the Ace of Spades). And when you’re counting combinations (or is it permutations?), you can’t use the same card twice. So let’s still use the suits for card distinction; I’ll just suffix the number with an ‘h’, ‘s’, ‘d’, or ‘c’.

We also don’t care about differentiating a 10 or J/Q/K, since they’re all just worth 10, numerically. So your ‘input’ can just consist of five numbers between 1 and 10. Cool? Just find the ’15’s!

Example:

  • Your hand is 3h, 6s, 6d, 9c, and the ‘cut’ is 3c.
  • Combos for ’15’: 6s+9c, 6d+9c, 3h+3c+9c, 3h+6s+6d, 3c+6s+6d.

That’s five unique combos, for a total of 10 points! Good job, that’s a bit better than an average hand. In cribbage lingo, you’d say it like so: “fifteen two, fifteen four, fifteen six, fifteen eight, and fifteen ten.” Or if you’re playing with more experience, you’d abbreviate to simply “two four six eight ten”.

In “normal” programming land, we’d probably use a loop and some branching logic. What will we do in SQL? A loop, a cursor, or something more (or less!) elegant? You decide!

I’ll come up with something solution-y soon. Update: Solution posted! Enjoy! ❤

cribbage board close-up of winning peg and partial hand
Red won by 2 points! Close game.

Quickie: SSRS Sort Descending

Let’s make it real simple for the end-user to get the ‘best’ behavior by default. Let’s sort by our ‘Revenue’ column in DEscending order first.

You ever have a paginated report, i.e. a standard old-fashioned report with a big table (tablix, matrix), where you want the users to be able to sort by any column they want? And let’s say one such column is ‘Revenue’ — some money amount where the “biggest is the most important”. Like your top customer, for example.

Now, you know about ‘Interactive Sorting’ right? No?!? Oh it’s awesome! Let me remind you. It’s super easy to configure: just get to the Text Box Properties of the header box of whatever column you want to use, & go to the ‘Interactive Sorting’ tab. Like so:

ssrs interactive sorting property
Simple example — we just have ‘Details’ rows, no other row-groups or hierarchies.

Make sure you pick the appropriate Field in the Dataset to “Sort by”. Hint: it’s usually the one you’re displaying in that column!

But you know what’s kinda annoying? You can’t dictate a “first sort direction” — it just assumes that the first time you click the sort-arrows, you want ‘Ascending’ (lowest first). Then you can switch to ‘DEscending’ (highest first). This makes perfect sense for alpha values (strings), but not always for numeric values — at least not when you’re dealing with money, when generally the highest dollar amount is the most important!

So let’s make it real simple for the end-user to get the ‘best’ behavior by default. Let’s try to make it sort by our ‘Revenue’ column in DEscending order first. Ready?

ssrs interactive sort expression negative revenue
Set the ‘Sort by’ expression to the NEGATIVE of the field value! Simple as that.

It’s so small in that picture you might miss it. I’m doing this, but I’ll add whitespace below for easier reading:

= - Fields!Revenue.Value
..^look, there's the minus sign!
(assuming your viewing device hasn't mangled the fixed-width text)

You could also get more pedantic/readable and spell-it-out like so:

=(-1)*(Fields!Revenue.Value)

Right? Beautiful. So what does it look like on the report?

ssrs interactive sort by revenue descending
I clicked it! And it sorted highest-first just like I wanted!

Notice my silly hyphenated column names, like ‘Custom-er No’ and ‘Rev-enue’. That’s because those sort-icons take up space within the textbox, and you can’t control it, so the text wraps in a non-obvious manner. Thus, I purposefully added a hyphen and a line-break to the text, so I could control how they looked, instead of leaving it to the wiles of the SSRS render-er.

That’s all there is to it! Happy reporting. =)

 

The EAVil Cycle, Part 2

In which we discuss more about the EAV model and some of its merits and pitfalls.

continued from last week…

The Ugly (aka the “Wow really?!?”)

You’ll see this ‘creep’ even in product-catalog systems as mature as Amazon’s. If you search for (geeky as we are) graphics cards, and use the product attribute filters in the left pane to narrow it down, you’ll find that some correctly have their memory type (GDDR5, etc.) listed, while others may not. If you’re really unfortunate, there will be two semi-redundant attribute-sets that you’ll have to “juggle” between to really get at what you want. TVs, for example, may see both an “HDR support” (yes/no) and an “HDR type” (standard, ultra, etc.) — I’m kinda pulling those out of my arse for example’s sake, but you get the point.

Why does this happen? Because at some level, humans are still responsible for ‘tagging’ those products (for lack of better word). And as much encouragement and guidance as the ‘admin software’ may give them, they can (and do) still decide at times to side-step that guidance and say “Nope, I know better; make me this new thing!”

But isn’t that a problem with nearly all data-driven systems? Sure, of course it is. Yet with a model as flexible as EAV, the problem is intensely magnified by the fact that it’s made so easy to do — to ‘extend’.

so preoccupied with whether or not you could, you didn't stop to think if you should
It’s probably not the exact quote, for you pedants, but it’s close enough for government blog work.

And unfortunately, the biggest contributor to this problem is the lack of development-time and forethought given to the administration, or management, of the data. You see, this problem could be largely assuaged if the admin-toolset were the FIRST thought and priority in the roadmap. But so often, that thought comes LAST, if at all. So sure, your product feature tagging system looks great, it’s flexible and your customers love it. But you’re throwing tickets over the wall to your data team every time a requirement or use-case changes, or when you need to fix a data-quality problem caused by the users not knowing there was already a “Widget Type” before creating their new “Widget Kind” tag, or misspelling “Is Wierd” because English is weird and has more exceptions to the “I before E” rule than not.

Does this problem go away with a dedicated search-index or NoSQL technology like Elasticsearch or (shudder) MongoDB? Of course not! If anything, it may be worse. Maybe. But wait, those systems make it easier to de-dupe and manage redundancy & data quality, don’t they? Probably. I can’t speak from experience myself, but I’ve heard good things. Once again, it all comes down to the effort you’re willing to invest in the system. If you make data quality a priority, you’ll be happier with the experience. If you don’t, well you’re just another amateur data scientist complaining about dirty non-standardized/non-validated address fields, aren’t ya?  =P

I joke with the data scientists, of course. What they do is awesome. They just tend to often re-invent the wheel of data-cleansing/data-wrangling that we DBAs have been doing for a few decades, because they didn’t know the right questions to ask or the right place to look. We need to get better at working together WITH them, not ‘for’ or ‘against’ them.

ninja cat riding a unicorn with laser-eyes
How the data scientists see themselves…

The Why or When (aka “Is it a decent model for this?”)

The long-story-short version is, consider your business and your data. Try to plan for the future, and anticipate potential changes and growth. It’s not easy, and we never “get it right the first time”. But we can try.

When your attributes are fairly static, and you know that you can tightly control them, you might consider a more rigid model. Something with a handful of lookup tables referenced by the main product entity. This is advantageous for performance and management, at the expense of scalability and extensibility.
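As a rough illustration of that more rigid shape, here’s a sketch using the graphics-card example from earlier. The table and column names are purely hypothetical. The point is that a new memory type can only come into existence as a row in the lookup table, which is exactly where you get your control.

```sql
-- Hypothetical lookup table: the only place a memory type can be defined.
CREATE TABLE dbo.MemoryType (
    MemoryTypeID int          IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Name         nvarchar(50) NOT NULL UNIQUE
);

-- Hypothetical main entity: each attribute is a real, typed column, and
-- the foreign key rejects any memory type that isn't in the lookup table.
CREATE TABLE dbo.GraphicsCard (
    GraphicsCardID int           IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Name           nvarchar(200) NOT NULL,
    MemoryTypeID   int           NOT NULL
        REFERENCES dbo.MemoryType (MemoryTypeID),
    MemoryGB       int           NOT NULL
);
```

The flip side is the extensibility cost mentioned above: every new attribute means a schema change.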

When you literally need to support on-the-fly extension, and you’re absolutely married to SQL (i.e. not ready to venture out into NoSQL land just yet), the EAV model may fit the bill. Aaron’s article, and the comments therein, present some fairly valid and reasonable implementation suggestions to make it a little more palatable. Just beware the date — that was written back in 2009. Before we had such things as Elasticsearch and its ilk. I’d heavily encourage the consideration of purpose-built data-stores for this sort of thing, if you have any hope of scaling-out.
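For comparison, a bare-bones EAV layout in SQL might look something like the sketch below. Again, every name here is made up for illustration, and this is not lifted from Aaron’s article. Notice that adding a brand-new attribute is just an INSERT, which is both the appeal and the danger.

```sql
-- Hypothetical entity table.
CREATE TABLE dbo.Product (
    ProductID int           IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Name      nvarchar(200) NOT NULL
);

-- Hypothetical attribute table. UNIQUE stops exact-duplicate names, but it
-- can't stop "Widget Type" vs "Widget Kind"; that takes admin tooling.
CREATE TABLE dbo.Attribute (
    AttributeID int           IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Name        nvarchar(100) NOT NULL UNIQUE
);

-- Hypothetical value table: one row per (product, attribute) pair,
-- with every value stored as a string.
CREATE TABLE dbo.ProductAttributeValue (
    ProductID   int           NOT NULL REFERENCES dbo.Product (ProductID),
    AttributeID int           NOT NULL REFERENCES dbo.Attribute (AttributeID),
    Value       nvarchar(400) NOT NULL,
    PRIMARY KEY (ProductID, AttributeID)
);

-- Filtering on a single attribute already costs two joins.
SELECT p.Name
FROM   dbo.Product AS p
JOIN   dbo.ProductAttributeValue AS pav ON pav.ProductID = p.ProductID
JOIN   dbo.Attribute AS a ON a.AttributeID = pav.AttributeID
WHERE  a.Name = N'Memory Type'
  AND  pav.Value = N'GDDR5';
```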

Other tools in your toolbox can help with this, too. For example, consider an in-memory data-grid for super-fast reads. The vast majority of data-access to these attributes & values is going to be reading, using it to filter & slice & dice a data-set. You can pay the small performance cost (e.g. write to the underlying SQL database) on the rare occasion when a write/change needs to occur.

In Conclusion

Proving the age-old rule of “Just because you CAN, doesn’t mean you SHOULD”, the EAV model is sometimes okay and sometimes not. You need to understand your business and your data to make that call. And you need to consider the magnitude of effort that may be involved in pivoting from one model to another. Unfortunately, in many cases, that part overshadows the rest, and the show business must go on.

queen the show must go on
You’re welcome again, ears.

Still, I encourage you to at least think about it, and be ready with that knowledge of pros/cons when the time is right to discuss it with stakeholders.