This month’s invitation is brought to you by Ewald Cress (blog, twitter), whom I already like based on his tagline —
finds joy in minutiae.
Yes, my friend, don’t we all.
The topic at hand is fairly non-technical, but still important: folks who have made a positive contribution to your career or professional development. So it’s time for a shout-out! About a year ago, I wrote about my first major career move. There were several great influences in my first job, from the developers that taught me how to code, to the DBA who taught me how to keep cool & calm in the face of outages, to the boss who taught me the importance of time management and breadth of knowledge.
Since I am way too late in posting this, and I don’t feel like waxing poetic, I’ll just say a general “thank you” to all those who’ve helped me along in my career so far, with special acknowledgement to my former boss, my current boss, the SQL family, and my own family. Happy belated Thanksgiving and have a safe & pleasant holiday season! I’ll have a real post again quite soon, diving back into the tech stuff.
This month’s party brought to you by Mr. Hammer (b|t).
I apologize in advance for all the hammertime memes. It was just too good to pass up. Surely he must be used to this. Or at least not surprised by it. =D
So, Big Data. What is it? Well, in simple terms, it’s the realization and acceptance of the fact that data is multi-model, multi-faceted, multi-sourced, and constantly growing. It’s the fact that the traditional RDBMS is no longer the be-all end-all source of truth and valuable information. It’s part of a larger ecosystem involving JSON document stores, CSV files, streaming volatile bits of data coming from random devices and user activity that loses its meaning and potential impact almost as quickly as it can be gathered and sifted and stored.
But what do we actually get out of it? As a small-to-medium enterprise NOT in the software business, I have to say, not as much as the hype would have us believe. And look, I’m not so jaded and crusty that I refuse to adopt new tech. I just haven’t seen a meaningful, transformative business use-case for it. Sure, we have Google Analytics telling us how our websites are doing, and someone in marketing knows something about trending our social media traffic. Does it really help us make more money? Heck if I know.
Here’s what I’d like to see from the thought leaders. Give me something I can chew on — a real-world, non-hypothetical, non-frivolous, impactful use-case for adopting and implementing something like Hadoop/Spark or Azure Data Lake. Show me how my business can realistically journey down the path of predictive analytics and what it’s going to take from our Devs, IT staff, and management to actually get there.
Because they don’t get it yet. I have managers still worrying about how much we’re spending on a dinky little flash storage array to support the growing needs of our on-prem converged infrastructure stack. Meanwhile the AWS bill continues to baffle, and Devs want to play with Docker and Lambda. But we can’t seem to convince the higher-ups that they’re short-staffed on the internal-apps team, even after a minor version upgrade takes 4 hours of Ops time and half a dozen end-users doing post-mortem testing just to be sure we didn’t break anything unexpected.
I’m not here to complain. Really. I do want to see something amazing, something inspiring, something that shows me what Big Data truly brings to the table. And sure, I’ve seen the vendor demos; they’re all just a bit outlandish, no? I mean, they look really cool, sure — who doesn’t want to see a chord diagram of who’s killed whom in GoT? — but does that really help my business improve sales and productivity?
My point is, there’s a gap. A chasm of misunderstanding and mismatched expectations between what management thinks Big Data is/means, and what it takes to actually implement it. They see the pretty pictures and the fancy demos, but they don’t see the toil and sweat (or at least, in the cloud, gobs of cash) that go into building & operating the underpinnings and pipelines that drive those nice graphics. Not to mention the fundamental issues of data quality and governance.
So do us a favor, Big Data pundits. Show us something real, something that “the little guy” can use to up his/her game in the market. Something that makes a positive impact on small non-startup non-software businesses with understaffed IT & Dev teams. But more importantly, stop glossing over the effort and resources that it takes to “do Big Data right”. Managers and executives need to understand that it’s not magic. And IT practitioners need to understand that it’s actually worthwhile. Because I believe you — really — that the payoff in the end is there, and is good. But you need to convince the whole stack.
PS: I know this is a full day late for T-SQL Tuesday, and as such, I wasn’t going to post a ping-back in the comments of the invite, but then I saw there were only 8 others, so I felt it would benefit the event if I did add my late contribution. I’ll tweet with a modified hash-tag instead of the standard #tsql2sday, to reflect my late-ness. Hopefully that’s a fair compromise to the community & the event’s intentions. =)
Welcome back! This month’s topic is PowerShell — thanks to one of our prominently bearded community members. PowerShell is a fantastic tool in the IT professional’s toolbelt. I was first introduced to it somewhere in 2014 or 2015 by a colleague, and started making much heavier use of it when my career took me to a new & bigger environment.
Actually, funny side-story. I remember seeing one of the very early incarnations of PowerShell, or what would eventually evolve into it, in college. A graphics programming course, of all things, had an MS partner come in to show us this “cool new” Windows command-shell thing (different and separate from the DOS-style CMD, obviously), where he demonstrated fetching some data from the filesystem, feeding it into a CSV, and doing some kind of super-basic analysis on it to show in a “report” (which was really just another text file). This was 2005-2006, so I couldn’t say what it was specifically, though I seem to remember something about the word “Longhorn”. Although, reading up on some of the Wiki-history, it seems more likely that it was a Monad beta.
Anyway, back on topic. Today’s post is pretty simplistic in comparison to what most people may be writing about. But I’ve already blogged about doing hands-off SQL installation with PowerShell & CLI, and this was another thing kicking-around the back of my mind. So this is about using PowerShell and the BITS framework (*-BitsTransfer cmdlets) to copy very large files across servers. Specifically, database backups. Because let’s face it, they can be really large. And if you’re faced with fetching them off a PROD box, you want to minimize the impact on that box’s resources.
Now sure, there are other ways – xcopy or robocopy with the /J flag (un-buffered IO), or fancy GUI tools. And in an ideal world your backups would be written to a network share that’s not a local drive on the PROD SQL server, right? Right…
Oh, and one more thing. You need to enable BITS via the Windows Features console — search “features” in your Start menu and it should come up as Turn Windows features on or off (Control Panel). On a server, it’s under the Server Role “Web Server (IIS)”, feature “Background Intelligent Transfer Service (BITS)”. Underneath there are 2 sub-feature choices, “IIS Server Extension” and “Compact Server”. Honestly I don’t know which is preferable, but I left it at the default selection, the former. It should go without saying, but don’t do this in production (unless you have the blessing of your SysAdmins).
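On the server side, you can also do this from an elevated PowerShell prompt instead of clicking through the console. A quick sketch — the feature names below are what I believe Get-WindowsFeature reports on recent Windows Server builds, so verify them on yours first:

```powershell
# List the BITS-related role features and their sub-features
Get-WindowsFeature -Name BITS*

# Install BITS with its default sub-feature selection
Install-WindowsFeature -Name BITS

# Or pick the sub-features explicitly (names assumed; confirm with the
# Get-WindowsFeature output above):
# Install-WindowsFeature -Name BITS-IIS-Ext          # IIS Server Extension
# Install-WindowsFeature -Name BITS-Compact-Server   # Compact Server
```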
Why BITS? Well, as per the Docs, it has the following 3 key features (emphasis mine):
Asynchronously transfer files in the foreground or background.
Preserve the responsiveness of other network applications.
Automatically resume file transfers after network disconnects and computer restarts.
Wow, nifty! So it doesn’t hog the network, and it’s resumable (resume-able?) in case of connectivity hiccups. Pretty sweet, no? Also, it can run asynchronously in the background, which means it won’t hog your storage bandwidth or compute resources.
Let’s See an Example
Most of the guts and inspiration for this came from this article over on “Windows OS Hub” (woshub, a somewhat unfortunate sounding acronym, but certainly not as bad as some!). The datePattern nonsense is just to make it “dynamic” in the sense that, if you have a backup scheme like me, with Sunday FULLs, daily DIFFs, and obviously TLogs in some every-X-minutes fashion, you’ll usually want the latest set of FULLs and DIFFs. But you could easily tweak this, make it more point-in-time aware or whatever, as needed.
So, here’s a bit of a talk-thru outline, and then I’ll just link the Gist.
1. Get the list of files we want to copy, from “source”
2. For each file:
   a. Make sure it doesn’t exist in the “destination”
   b. If not, start a BITS transfer job (saving said job to a variable for checking/finishing later)
   c. While said BITS job is pending, print a progress message and sleep for some seconds
   d. Finish (“complete”) said job and move on to the next file
3. Conclude with a message about how much work we just did!
4. Repeat steps 1-3 for another “set of files” (list) if desired
And without further ado, the code.
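To give the outline some shape inline, here’s a minimal sketch of it — the server name, paths, and date filter are placeholders for illustration, not the exact script from the Gist:

```powershell
# Sketch of the talk-thru outline above. Paths and the date pattern are
# hypothetical; adjust for your own backup naming scheme.
Import-Module BitsTransfer

$source      = '\\PRODSQL01\Backups'              # placeholder UNC path
$destination = 'D:\Restores'
$datePattern = (Get-Date).ToString('yyyyMMdd')    # e.g. latest FULL/DIFF set

# 1. Get the list of files we want to copy from the source
$files = Get-ChildItem -Path $source -Filter "*$datePattern*.bak"

foreach ($file in $files) {
    $target = Join-Path $destination $file.Name

    # 2a. Skip anything that already exists at the destination
    if (Test-Path $target) { continue }

    # 2b. Start an asynchronous BITS job, saving it for checking/finishing
    $job = Start-BitsTransfer -Source $file.FullName -Destination $target `
                              -Asynchronous -DisplayName $file.Name

    # 2c. While the job is pending, print progress and sleep a few seconds
    while ($job.JobState -in 'Transferring', 'Connecting') {
        if ($job.BytesTotal -gt 0) {
            $pct = [math]::Round(100 * $job.BytesTransferred / $job.BytesTotal, 1)
            Write-Host "$($file.Name): $pct% transferred"
        }
        Start-Sleep -Seconds 10
    }

    # 2d. "Complete" the job so the file is committed to disk
    if ($job.JobState -eq 'Transferred') {
        Complete-BitsTransfer -BitsJob $job
    }
}

# 3. Conclude with a message about how much work we just did!
Write-Host "Copied $($files.Count) file(s) from $source to $destination."
```

The key piece is saving the job from Start-BitsTransfer -Asynchronous into a variable; without Complete-BitsTransfer at the end, the transferred file stays in a temporary state and never lands as the real file.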
There are some downsides here. First, you cannot use BITS in a non-interactive mode, i.e. inside a Scheduled Task as a User that’s not logged-in. This is because it’s a “desktop”-oriented feature, not a “server” one. Second, I’ve never been able to get multiple transfers going at once — or at least, multiple PoSh scripts which use BITS transfers. This could very well be my fault, but it does seem like the BITS jobs are “serial” in nature, i.e. one must finish before the next one can start. Again, not the expert, just observing what I found during my experiments.
BITS transfer is an interesting method for copying extra-large files around your environment with low overhead. PowerShell makes it easily accessible and lets you wrap it up in loops and checks so you can effectively build a progress-indicative, predictable and reproducible method for copying a whole SQL server’s set of backups from one place to another.
What cool little things have you discovered using PowerShell? Let me know! Thanks for reading.
The problem isn’t so much that the role is vaguely defined. Although, depending on the size of the IT org and the tech stack, it can vary widely from a jack-of-all (DB Dev, report writer, production ops, the works) to a highly specialized performance tuner who works with 7 other teammates, each of whom has a unique surgical specialty in the data platform. But that’s not the problem — well, not the main problem. It is a problem in the sense that the business folks, especially HR, are notoriously and astonishingly ignorant of what a DBA or related role actually involves. But you get past that once you start talking to the tech leads and IT directors.
No, the problem is that, like most higher level technical roles, you don’t really know how a candidate is going to function in it (the role) without actually seeing him or her.. IN it. Do they keep a cool head when production goes down? Do they have a solid plan of attack for the dreaded “everything is slow!” complaint-storm? Do they have a good handle on HA & DR architecture & implementation? Can you rely on them to actually practice and follow thru with those strategies? Can they be a continuous learner and keep abreast of new developments while still tempering that with wisdom & maturity, applying the correct tools to the proper problems? Do they add value to the team and organization by both teaching and learning from others?
These are truly difficult, complex questions that are nearly impossible to deeply assess and fully answer during an interview process. Largely because the only real evidence of their answers lies in actual experience. Sure, a cert here or an MVP there definitely helps your case. But at any rate, we try our best to chip away at the boulder.
Pivoting to a more positive note, I’ll share some of the better questions that I’ve experienced during my career so far.
Some good examples.
How would you design and build a data copy/sync process across/between tiered environments, say DEV-QA-PROD?
Really great question. This is a common problem in small-to-medium enterprises with legacy systems where DevOps hasn’t quite reached down to the depths of the internal application stacks and people are still dealing with “refresh cycles” on the order of months, quarters, or even years. You can approach it purely from a tooling perspective, but that’s not the whole picture. Thus, it calls for some thought and team-culture ideas as well as “knowing the nerd-knobs”.
We have a complex process flow that involves a lot of stored procedures, say 50 total. Some of these are non-sequential, meaning they can be executed in arbitrary order, while others need to be sequenced with each other in “blocks”. This is a vendor product, so ultimately, the customer gets to decide the schedule and order of execution of this flow. The solution needs to be maintainable by field engineers. How would you handle this?
Woah. Talk about diving down a rabbit-hole. This is interesting in the sense that it exposes a bit of the architecture and some of the potential pain-points that the team is hoping to solve, while leaving enough room for improvement and experimentation by the hopeful candidate. More to the point, it’s just an example of a more general technique, which to me is very effective: taking an architectural problem that actually comes from the “real world” (the company/team that’s interviewing) and asking for the candidate’s ideas on how to solve it. You don’t need to get in-the-weeds super-detailed about it, but outlining your ideas helps indicate how you think about complex challenges and shows what kind of value-add you would bring to the team.
And finally, a perennial favorite:
Tell me about a time you broke production, and more importantly, how you addressed and resolved it.
So many stories from the trenches involve downtime and mistakes, it’s good to ‘bond’ over them. It helps bring the egos back down to earth, and reminds us that we’re all just meatbags, making technology to do our bidding, occasionally to our own regret. It shows the candidate’s “pressure cooker” mentality, or at least, what they tell you about it.
If you’re a DBA, Dev, or IT pro, help your managers better understand your team’s needs when it comes to hiring. Get involved in the job description write-ups and screening process questionnaires. Barge your way into those ivory towers, if you have to — or you’ll regret the time you waste on candidates who really belong in a different role than the one you’re after.
If you’re a manager, PLEASE LISTEN to your reports and tech leads. They know what makes a good team member, they’ve been doing it for a long time. Don’t dismiss their advice or block them from being part of the hiring process — yes, they are busy, and yes, they can be crotchety, but their input is highly valuable toward bringing in effective & productive talent.
That’s all folks!
PS: I know I missed the “deadline” by about an hour..ish. I blame DST. Heck, it’s still Tuesday for the majority of the Western hemisphere. I’m not biased, but I write in English, so… ya know. Take it as you will. Now excuse me while I go hide from the blog-police in my ASCII-bunker.
As previously referenced, we had a half-billion row table (which we’ll “round up” to a billion, just for the sake of argument) that needed an index. A clustering key, to be exact. How it lost its original clustering key is a perfect topic for this month’s T-SQL Tuesday (we planned to replace it, before realizing how long it would take, and had to cancel the job after the old key was removed but before the new one could be added). Anybody can drop an index; it takes mere milliseconds. But to create (or rebuild) an index, the SQL engine needs to touch every single row that index will include, and of course for a clustered index, that’s errverybody.
When planning an index replacement for a billion-row table, don’t just schedule the job as “drop old index, then create new index“, and expect nothing to go wrong.
Don’t try to index a billion-row table all at once; use a smarter methodology. That is, unless you can literally afford to have that table offline for several hours (depending on your hardware, and assuming you’re on Standard Edition).
Of course, some of you crazy kids with Expensive Edition are scoffing and thinking “Oh that’s nothin’!”, with your billions of rows and online index rebuilds. Well you can go back to your fancy Always Encrypted data and your terabytes of RAM and just pretend I said “trillion” instead of “billion” — maybe that’ll fit your scale a bit better. But for the rest of us mere mortals…
Anyway, since we’re peons, we can’t just go create the billion-row index without taking the table offline. So we need to come up with a way to minimize that downtime for the table, and avoid causing excessive blocking or I/O overload on the instance. As I said before, SSIS to the rescue!
I call this the “setup, dump, & swap”. Essentially we need to create an empty copy of the table, with the desired index(es), dump all the data into it, and then swap it in. There are a couple of ways you can do this, but it boils down to the same basic premise: it’s “better” (probably not in terms of speed, but definitely in terms of efficiency and overhead) to fill this new copy of the table & its indexes, than it is to build the desired index on the existing table.
So here’s an outline:
Script out the table (SSMS, right-click, script table, create to… or, if you use one of these cool extensions, F12 or similar “get definition” shortcut) — say, if the original is MyTable, script-create & replace MyTable with MyTableCopy
Here’s a little room for choice. You could create the table in the same schema with a new name; or, you could create the table in a different schema, with the same name or another name. This will determine how you do the “swap” toward the end.
In the first case, we’d use sp_rename
In the 2nd, we’d use alter schema transfer
Both are essentially meta-data changes, but the latter could potentially take a hair longer just because it pulls more strings; whereas the former requires taking care of dependencies ahead of time so you don’t break a schema-bound view or orphan a foreign key.
Add the desired index(es), e.g. create clustered index CX_MyTableCopy_Datestamp_ThingName on dbo.MyTableCopy (Datestamp, ThingName)
Build the SSIS task to copy the data from MyTable to MyTableCopy
Schedule said task via SQL Agent.
Use the previous tip to monitor its progress and estimate time to completion!
Once done, prepare to swap!
Again, use sp_rename, and if needed, alter schema transfer.
Likely, this will involve several renames – the constraints and other indexes can’t be named the same either, so get all that stuff renamed with an _old suffix first, then you can swap the actual tables.
Clean up after yourself and drop the old table once you verify everything’s working well and all dependencies are accounted for.
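To make the swap step concrete, here’s a rough sketch of the rename route, using the _old suffix trick from above. The CX_MyTableCopy index name is the one from the outline; everything else is illustrative, and your real table will have more constraints and indexes to account for:

```sql
-- Sketch of the "swap" using sp_rename (same schema, new name route).
-- Run during the brief maintenance window; both renames are meta-data only.
BEGIN TRAN;

-- Move the old table (and its clustered index, if it still has one) aside
EXEC sp_rename 'dbo.MyTable', 'MyTable_old';

-- Swap the freshly loaded copy into place
EXEC sp_rename 'dbo.MyTableCopy', 'MyTable';

-- Rename the copy's index so naming conventions stay sane
EXEC sp_rename 'dbo.MyTable.CX_MyTableCopy_Datestamp_ThingName',
               'CX_MyTable_Datestamp_ThingName', 'INDEX';

COMMIT TRAN;

-- Or, if you built the copy in a staging schema instead, the equivalent
-- meta-data move is ALTER SCHEMA ... TRANSFER:
-- ALTER SCHEMA old     TRANSFER dbo.MyTable;
-- ALTER SCHEMA dbo     TRANSFER staging.MyTable;
```

Note the third argument 'INDEX' on sp_rename when renaming an index rather than a table, and remember to rename any conflicting constraints to their _old names before the table swap itself.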
Let’s back up a bit. Why are we doing it this way? More importantly, what are the general use-cases for this kind of thing? Well as I said, it’s “better” in certain ways, than simply creating the index on the “live” T.O.U.S. It avoids locking said live table, and it has the potential to put less stress on the database and less synchronous I/O against its primary files. Finally, and most importantly, as a bulk insert operation, with configurable batch sizing, the SSIS task will put exponentially less load on the transaction log. Whereas, with the regular inline index creation, it could easily fill up the Tlog and will definitely cause performance issues.
Furthermore, swapping tables, or in more advanced cases, partitions, is a fantastic way to move massive amounts of data around with minimal impact. It also happens to be quite helpful when we have to deal with indexing such massive amounts of data.
In my case, this was actually an archive of 2016 audit-trail data, so if I’d chosen, I could have taken it out of the partitioned view it was part of, and dealt with it being offline for several hours. But because I’m a sucker for a good problem, and because I wanted to be able to semi-accurately monitor the progress to let the boss-man know when it might be done, I chose to take this route instead. It’ll come in handy again soon, I’m sure — we have several other jumbo-tron tables laying about that may need some index tuning and/or partitioning.
So, happy swapping, and don’t let those T.O.U.S.‘s get the best of you!
PS: thanks to @SQLDoubleG for hosting, and thanks in advance to the rest of the community being lenient of my “submission” being a re-purposed existing post with a few extra blurbs thrown in to bring it on-topic! =D