continued from last week…
The Ugly (aka the “Wow really?!?”)
You’ll see this ‘creep’ even in product-catalog systems as mature as Amazon’s. If you search for (geeky as we are) graphics cards, and use the product attribute filters in the left pane to narrow it down, you’ll find that some correctly have their memory type (GDDR5, etc.) listed, while others may not. If you’re really unfortunate, there will be two semi-redundant attribute-sets that you’ll have to “juggle” between to really get at what you want. TVs, for example, may see both an “HDR support” (yes/no) and an “HDR type” (standard, ultra, etc.) — I’m kinda pulling those out of my arse for example’s sake, but you get the point.
Why does this happen? Because at some level, humans are still responsible for ‘tagging’ those products (for lack of better word). And as much encouragement and guidance as the ‘admin software’ may give them, they can (and do) still decide at times to side-step that guidance and say “Nope, I know better; make me this new thing!”
But isn’t that a problem with nearly all data-driven systems? Sure, of course it is. Yet with a model as flexible as EAV, the problem is intensely magnified by the fact that it’s made so easy to do — to ‘extend’.
And unfortunately, the biggest contributor to this problem is the lack of development-time and forethought given to the administration, or management, of the data. You see, this problem could be largely assuaged if the admin-toolset were the FIRST thought and priority in the roadmap. But so often, that thought comes LAST, if at all. So sure, your product feature tagging system looks great, it’s flexible and your customers love it. But you’re throwing tickets over the wall to your data team every time a requirement or use-case changes, or when you need to fix a data-quality problem caused by the users not knowing there was already a “Widget Type” before creating their new “Widget Kind” tag, or misspelling “Is Wierd” because English is weird and has more exceptions to the “I before E” rule than not.
Does this problem go away with a dedicated search-index or NoSQL technology like Elasticsearch or (shudder) MongoDB? Of course not! If anything, it may be worse. Maybe. But wait, those systems make it easier to de-dupe and manage redundancy & data quality, don’t they? Probably. I can’t speak from experience myself, but I’ve heard good things. Once again, it all comes down to the effort you’re willing to invest in the system. If you make data quality a priority, you’ll be happier with the experience. If you don’t, well you’re just another amateur data scientist complaining about dirty non-standardized/non-validated address fields, aren’t ya? =P
I joke with the data scientists, of course. What they do is awesome. They just tend to often re-invent the wheel of data-cleansing/data-wrangling that we DBAs have been doing for a few decades, because they didn’t know the right questions to ask or the right place to look. We need to get better at working together WITH them, not ‘for’ or ‘against’ them.
The Why or When (aka “Is it a decent model for this?”)
The long-story-short version is, consider your business and your data. Try to plan for the future, and anticipate potential changes and growth. It’s not easy, and we never “get it right the first time”. But we can try.
When your attributes are fairly static, and you know that you can tightly control them, you might consider a more rigid model. Something with a handful of lookup tables referenced by the main product entity. This is advantageous for performance and management, at the expense of scalability and extensibility.
When you literally need to support on-the-fly extension, and you’re absolutely married to SQL (i.e. not ready to venture out into NoSQL land just yet), the EAV model may fit the bill. Aaron’s article, and the comments therein, present some fairly valid and reasonable implementation suggestions to make it a little more palatable. Just beware the date — that was written back in 2009. Before we had such things as Elasticsearch and its ilk. I’d heavily encourage the consideration of purpose-built data-stores for this sort of thing, if you have any hope of scaling-out.
Other tools in your toolbox can help with this, too. For example, consider an in-memory data-grid for super-fast reads. The vast majority of data-access to these attributes & values is going to be reading, using it to filter & slice & dice a data-set. You can pay the small performance cost (e.g. write to the underlying SQL database) on the rare occasion when a write/change needs to occur.
Proving the age-old rule of “Just because you CAN, doesn’t mean you SHOULD”, the EAV model is sometimes okay and sometimes not. You need to understand your business and your data to make that call. And you need to consider the magnitude of effort that may be involved in pivoting from one model to another. Unfortunately, in many cases, that part overshadows the rest, and the
show business must go on.
Still, I encourage you to at least think about it, and be ready with that knowledge of pros/cons when the time is right to discuss it with stakeholders.