February and March 2025

TABROOM

So we’ve launched an actual formal Tabroom newsletter!  I’ll still talk here about the process of software development in the context of what it means for me, since this space is just mine to ramble and update folks about my goings on.  Leaving out my day job omits too much, and there are always things I’d like to say that apply more to me than to the software. The intent of the newsletter is for people who are more interested in this as a Tabroom user than a Friend of Me, which is fine.  And it’ll still go out in my voice, I’m told.

If you’d prefer just the Tabroom announcements, please sign up here.

At any rate, I’ve started tinkering with TanStack Query thanks to Hardy’s recommendation, and while it simplified things, it did cause me to retread some conquered ground, as it were.  I’m going to have to stop doing that.  The challenge with a rewrite is when your drive for the ultimate purity in code means you just keep rewriting the first parts over and over again.  Coming from a stack that hasn’t changed or had new tooling since 2010, I’m not used to resisting the temptation to switch to the latest shiny tool; and in JS world that tool is released every two weeks.

In operational land, I made it through the Stanford+ weekend (26,000 students & judges) and the Cal/Harvard weekend (23,500 students & judges) without a blink.

WHEREABOUTS

I’m taking my traditional sojourn after the tournament up at the North Coast, in Fort Bragg this time.  Fort Bragg is festooned with signs protesting a proposal to rename the town so it’s no longer named after a Confederate general.  The name predates the Confederacy, as it happens, but not for great reasons.  It turns out ol’ Braxton Bragg was the former commanding officer of the person who built that first fort here.  However, the fort existed to exterminate the local Indian tribes, to clear the land for more redwood logging. So even if it’s sorta free of the Confederate association, it runs smack into another stain on history.

The keep-the-name crew is mostly focused on the fact that the town has never felt a real connection with ol’ General Bragg in any of its traditions, tourism or marketing. And changing the name would cost a pile of money that’s not insignificant for a small town on the remote coast.  Personally, I think their best unmade argument is that Bragg was such a horrendous general, his efforts may have helped the North prevail more effectively than many Union officers could claim.  He lost a ton of battles, opening the door to Sherman’s march, and in the process was so hated by his own soldiers that they plotted and attempted to assassinate him multiple times.

But at any rate, the coastline remains a strip of beautiful hills, trees, fog and dark skies, which are all things I relish, and are hard to find. There are far too few places that humans live where the night is allowed to be dark, and the air is filled with only the sounds of oceans and frogs.

March will find me in Wilmington in the early part of the month, and LA in the middle, and probably hiking in New Hampshire or Maine for the spring melt waterfall season, when a muddy stomp is rewarded by torrents across New England.

THE OTHER STUFF

I’m re-reading A Prayer for Owen Meany, an old favorite I haven’t read in far too long.  Not much writing otherwise these days, because I’ve been trying to get my photographs in order, and helping the neighbor clear out some Old Crappe from our shared basement. There’s an understandable impulse when trying to empty an overly full basement to find new homes for all the items therein; surely some of these 138 VHS tapes would have value to someone who likes the almost-retro!  But at some point you have to realize the energy expended in finding them new homes surpasses their worth.  I am in the self-appointed role of “I’ll find a place that wants them or toss them for you,” which makes it psychologically easier, I think.

 

Jan 2025 Supplemental

Sigh.

So look, I could just make up a pile of believable nonsense sprinkled with technical terms here and pawn the blame off on a crevulating internexus or something.

Nah. I simply screwed up. I built the thing that scales the Tabroom web server base up and down to be a guided process, because I didn’t have enough data to make it automatic; since I didn’t know the proper ratio of tournament/users to servers, I couldn’t tell the computer. It was a judgment call. So the process was manual.

And well, today I simply forgot to do it. I didn’t go to a tournament myself, so I slept in a little bit and didn’t remember until my phone flipped out. And then, two hours later than I should have, I hit the Big Red Button and the gears spun up and it was all fine 15 minutes later. Mea culpa.

There’s a silver lining, however. At first I spun us up from our weekday standard pair of servers to a full 10. Tabroom, for you, came back up and was fine from that point forward. But the performance numbers were cheerfully and consistently in the orange range. Nothing was overloaded, but we had little spare capacity.

That’s interesting, because now I have a sense of where the line is.

It’s especially indicative because Tabroom was maximally busy right then.  One of the challenges I have with Tabroom blips is that for 10-30 minutes after, Tabroom experiences much heavier load than usual. That’s because you all build up a backlog of things for it to do; some rounds are delayed, some judges wait to enter their ballots, and so on.  When the site comes back, everyone rips through their backlog at the same time. So as I watched my newly adequate servers balance on the edge of what they could do, I knew this moment was also likely the limit of how many operations they’d ever be asked to run on a weekend with this many tournaments and users.

This weekend Tabroom is hosting 92 tournaments with 14,688 individual competitors and 6,141 individual judges.  That implies 10 servers can just about handle 20,000 prospective users, which makes 2,000 users per server.  That, my friends, is what we call actual data, not “Palmer’s gut.”

Computer folks sometimes call themselves, or are called, ‘engineers.’  I don’t use the term — I refer to myself as a software developer instead. Real Engineers™ have the duty, but also the luxury, of checking things exhaustively before they actually build anything. We all want them to, since they are building bridges, schools and hospitals. Today fantastic structures are created through a deep understanding of physics, material tolerances, weather, and so on.  The process is guided by obsessive checking and regulation. It takes a lot for an engineer to stamp and sign a set of plans, before any dirt is dug or hammers are swung.

In the Roman era, an engineer who built a bridge or aqueduct had to live under it for a year after with his family.  They understood physics and materials less well than we do, so instead they overbuilt the hell out of things. It’s small wonder so many of their creations still stand.

The pace of change and resources in computing doesn’t permit us software developers that standard of care. We’re expected to produce a swifter pace of change and new features that engineers aren’t asked for. So, we often end up out over our skis and the whole thing comes down. That’s not great for Tabroom, but that’s nowhere near the tragedy of a bridge collapse. More resources don’t help, because with them come more demands: did you know that Facebook and Google each had more total downtime than Tabroom did in 2024?  That fact makes me feel… a bit better.

Therefore, in computing we end up with systems like “Palmer has to press a button every Friday night or the whole thing explodes. Let’s hope he remembers!”  Imagine how quickly everyone involved would be fired if a railroad was built that way.

However, we do have some commonalities with Real Engineering. We share a sacred dedication to safety margins. Ten machines this weekend was just barely enough, so instead of watching like a dunce to see if it would tip over again, I immediately spun up six more until all the numbers were vividly green. In the safety of hindsight, I can say that two more servers would have been fine, and even four more was a touch excessive. Six was blatant overkill, birthed of a morning’s panic. I’m not sorry.

But now, I am armed with actual data. I can set up Tabroom to automatically spin up and run a new server for every 1500 or so anticipated users, that extra 25% being a generous but not overly expensive safety margin.  And that automated process will not be forgotten. What makes computers useful is that they have different strengths & weaknesses than humans. Computers cannot be told to “eyeball the number of tournaments and think about what we’ve done in the past and spin up a bit more than you think we need.”  Even modern AI is likely to take that instruction and try to run 3,400 servers and bankrupt us, or -12 servers and break the laws of reality.  They require a real formula: “Run one per 1500 users”, they can do.
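That rule really is nothing more than a ceiling division with a floor. Here’s a minimal sketch of it; the function name and the weekday minimum of two servers are my own framing of what the post describes, not Tabroom’s actual code:

```python
import math

# Hypothetical sketch of the scaling rule described above.
USERS_PER_SERVER = 1500   # 2,000 observed capacity, minus a ~25% safety margin
MIN_SERVERS = 2           # the weekday standard pair

def servers_needed(anticipated_users: int) -> int:
    """One server per 1,500 anticipated users, never below the weekday pair."""
    return max(MIN_SERVERS, math.ceil(anticipated_users / USERS_PER_SERVER))
```

Fed this weekend’s roughly 20,800 students and judges, a rule like that would call for 14 servers — comfortably above the 10 that just barely coped.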

But if I tell the machines to spin up that many machines every Friday at 4PM Eastern, then they absolutely will do that every weekend within seconds of 4PM Eastern. My imperfect human memory is replaced with a guarantee.  But there’s still a catch: it’s a guarantee the job will be attempted.  That automating code will still be the product of my imperfect hands, and therefore might fail even though it was tried.  If I run the job and it fails, I see and fix it right away.  An automatic job cannot self-correct.

So I’ll still check it.  But let’s say that I was 99% certain to remember to spin up the Tabroom servers manually.  That sounds good, except when you consider that we have 365 days in a year, so that’d be 3 1/2 days of downtime on average per year from this cause alone. Today was that 1%.

That’s not nearly good enough. So we multiply it against another 99% certainty: that I can build an automatic scaling system that runs correctly. Now there’s a 1% chance the automation fails and a 1% chance I forget to check it; for the servers not to spin up, both have to go wrong at once. We land on a 99.99% certainty that at least one of them works on any given day. That would take a decade to explode again. That’s likely good enough, but we’ll still add another layer. We’ll make sure another NSDA staffer also checks every Friday, so they can scream at me if it hasn’t happened and I did not notice.

Now we’re at 99.9999% certainty. At that rate of risk, downtime from this type of screw-up would average half a second per year.  If only we could handle all risks so easily.  Getting to that “six nines” of coverage — which is how the computing industry refers to it — costs me a script, and another employee five minutes each Friday.  Doing it in some other areas of our installation would cost us several million dollars.
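The layered math above fits in a back-of-the-envelope check. The 99% figures are my own estimates, not measurements, and the independence of the three failures is an assumption:

```python
# Each safeguard is assumed to fail independently 1% of the time.
p_forget   = 0.01  # I forget to press the button
p_autofail = 0.01  # the automatic scaler breaks
p_checker  = 0.01  # the second NSDA staffer also misses it

# Downtime requires every layer to fail at once.
p_all_fail = p_forget * p_autofail * p_checker
coverage = 1 - p_all_fail

print(coverage)  # about 0.999999 -- the "six nines"
```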

So we do what we can. Maybe I should have chosen a lower stress career, like disarming landmines or cleaning up nuclear waste or something.

January 2025

TABROOM

Turned a corner over the break; some elements of the new framework fell into place mentally, and I’ve hit the glorious threshold of SEEING a THING.   Doing invisible work is seriously tough; you can’t feel like you’ve changed the world, even if it’s just the world inside a small browser window.

I’ve also learned that Tabroom has been requiring people to write ballot comments for entries marked no-show, which is obviously dumb, but the type of thing that doesn’t get reported often.  Folks tend to report what they use, and my interface to the user base is more often tabbers than judges.  It doesn’t always occur to tab staff to report the issues they’re helping others work around, as opposed to the ones confronting them directly.

I’ll be at ASU this weekend; as I’ve said, it’s not a hard sell getting me to Arizona in January.   Emory awaits me after that.  A quieter January otherwise, punctuated by a short trip to park lands and hiking to try to compensate for the late fall, which I’ve spent as a slug.  And hopefully more SEEING the THINGS as above.

NOT TABROOM

I went to Montréal over the holidays; it was a more holiday-ish way to spend my birthday-that-everyone-else-celebrates than I usually prefer, but it worked.  The food was glorious, the city is lovely, I got away with speaking French far more often than my sense of my own fluency should allow, and Montréal is totally fine with you shoving in an English word when the French one fails you. And the pouding chomeur at Au Pied de Cochon made the whole trip worthwhile on its own, though the venue belied the name of “Poor-Man’s Pudding.”  It’s very similar to a dish my own mémère would make for a special treat. So that was nice.

The temperatures hovered around 10F, but there was no wind, and some lovely snow.

I am beginning to think about the summer trip.  Unlike the last few years, I do not have a short list of I’ve-always-wanted-tos to choose from, so I’m rather at loose ends.  But I’ll come up with something. For the winter, I bailed on thinking too hard and will instead go to my favorite haunt.  Perhaps I should do the same for summer; typically I go full nomad and range across a lot of territory.  Perhaps the play this year is to stick a pin someplace and drink the stillness.

 

December 2024

TABROOM

Had a short hiccup on Friday night; it was largely because I’d spun up extra capacity but the process of spinning it up didn’t quite finish, alas, for a really stupid reason.  But fortunately it was also quick to fix; the process to get around the stupid reason was fast, and we were back after like 6-7 minutes.  I can’t promise never to have issues, after all; nobody can in tech.  But it’s nicely affirming when the problem is just a little turbulence instead of a full plane crash, especially when I recovered so fast as a direct consequence of some blood sweat & tears I’ve recently put in.

That brings me to a wider point about the rewriting process and the concept of a feature freeze.  I’m trying to not code up new material in the old programming environment as much as possible, except for direct bug fixes and flaws, while I get the infrastructure rolling behind the new framework. However, to some degree that is impossible.  Tabroom’s reality is constantly changing, because you all keep using it.

Even if I never add another feature and only fix bugs and errors, Tabroom must change, simply because the scale increases. We get more traffic year over year, more tournaments, more students, and every one of our end participants uses the tech more heavily too; we bring three devices per person to tournaments now. That growing load represents unavoidable change permanently baked into Tabroom, one that will always demand a measure of attention.  Software in active use can never be paused.

So after our fun times in November I combed through our records of moments the database locked up with heavy write traffic, and rewrote every page and query that featured there to avoid them. The big one was the pref entry screen.  Did that cause our corrupted index?  I’ll never know. But it will perhaps make them less likely in the future, and it will definitely make parts of the site run faster and better.

The expanded load also means the software is more unforgiving.  Smaller mistakes become big problems. To a degree the expansion of Tabroom represents an expansion of the world of forensics.  This is good!  But it does demand I keep up with it, so I’ll never be able to entirely focus on the rewrite.

But all the same, I’ve made some good progress there. One big advantage of the new framework is it runs a lot faster on less powerful hardware.  The other big win is that I’m a far better coder than I was twenty years ago; the code I put out after rewriting will be more robust and capable.  I can already feel the system reaping the benefits of both of those things.  I’m not traveling at all in December, either for myself or tournaments, after this weekend.  I’m hoping I can use the stillness to hunker down and turn the corner; I’ve seen its edge, so we are perhaps near to seeing some reality there.

THE OTHERS AROUND

My sister got a new gig already after the old one had a round of layoffs; the new one seems much more comfortable and promising, though, so props to her for landing so quickly.  I continue to have some pretty phenomenal nephews and nieces, as finding things they’ll like during my travels has confirmed for me.  But then, I am somewhat biased.

JUST LITTLE OL’ ME

Welcome to the Holiday Season, such as it is.  I confess a dearth of conventional Christmas spirit, and generally I try to avoid traditional observance of the holidays. For one, Christmas was really my dad’s holiday; he always made it a big deal, and since we lost him some 13 or so years ago, his passion for the day adds a tang to the holiday that I find it better to avoid. Don’t take up smoking, kids.

And as it happens, December 25th is the least common birthday of the year, but it was still the birthday of Humphrey Bogart, Jimmy Buffett, Sissy Spacek, Rickey Henderson, and your humble Tabroom programmer.  I therefore prefer to spend the day away from indoor trees and too much rib roast. Instead I go off and find someplace quiet with more outdoor trees.  It works for me.

My European Gallivant was lovely for the most part. I found Munich warm, comfortable and welcoming. Venice was fun as always especially for me having company there — I’ve never actually traveled with people in Europe before, and they spurred me into seeing and doing things I’d not usually find on my own, such as a performance of Verdi’s Otello at the iconic La Fenice opera house. I confess I’m not much of an opera person, despite loving classical music generally.   But it was still great to go if only the once.  And then I swung through Amsterdam for some time in coffee shops reading tech docs, rijsttafel, and cloudy skies.

I confess, however, that travel to Random European Cities has grown easy for me, but also less interesting and exotic. I found myself walking around places feeling more at home, but as a result less engaged by them as interesting in their own right.  I intend to focus my wanderings on more rural places and probably further afield in the days to come.

November Supplement

At 7:16 AM Central, on Saturday November 16th, Tabroom’s database server had this to say:

2024-11-16 13:16:00 0 [ERROR] InnoDB: tried to purge non-delete-marked record in index uk_ballots of table tabroom.ballot: tuple: TUPLE (info_bits=0, 4 fields): {NULL,[4] \ (0x805CEFD5),[4] u (0x807514D1),[4] ZO(0x82B95A4F)}, record: COMPACT RECORD(info_bits=0, 4 fields): {NULL,[4] \ (0x805CEFD5),[4] u (0x807514D1),[4] ZO(0x82B95A4F)}

Poof goes the ballots table.

A ballot in this context is a data record.  One is created for every judge in a section for every entry in that section.  So someone judging a single flight of debate would have two ‘ballots’; a three-judge panel in a room with six speech contestants would have 18 of them.  Each one tracks what side the entry is on, what order they speak in, when that judge hit start, and whether the round is finished. All the points, wins and losses the judges hand out are stored below it; all the information about room assignments, scheduled times, the flip, and event type are above it. So, it’s a rather critical table.  And it’s large: there are 16.1 million such records in Tabroom, making it the second largest.
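The counting rule above — one record per judge, per entry, per section — is simple enough to state as a one-liner. A toy sketch; the function name is my own invention, not Tabroom’s:

```python
# One ballot record exists for every judge-entry pairing in a section.
def ballots_in_section(judges: int, entries: int) -> int:
    return judges * entries
```

One judge over the two entries in a flight of debate gives 2; a three-judge panel over six speech contestants gives 18, matching the counts above.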

At 7:16 CST this morning, Tabroom had to delete just one of those records. Maybe a judge needed to be replaced. Maybe a round was being re-paired and all the entries were dumped. Whatever the reason, a ballot was queued for deletion.  That happens thousands of times on a Saturday. But in deleting that particular ballot, the database server software wobbled just a little bit. Perhaps it hit a very obscure bug. Perhaps it wrote the information on a part of the disk that has a small chemical flaw buried in its atoms, and so it failed. Or perhaps a cosmic ray hit the memory and flipped a zero to a one, and changed our world. However it happened, the table’s index was transformed to nonsense.

An index is a data structure used to speed up reading the database. If you ask for all the ballots in Section 2424101, the database server would have to scan all 16.1 million ballots in Tabroom to deliver the 12 you are looking for. That’s very slow on a large table. So for commonly accessed data, you create an index, which is a record in order of all the Section IDs in the Ballots table. The database finds the range you’re looking for quickly, and all 12 ballots IDs are listed there together.

But indexes aren’t free; you can’t just create them for every data element. Each one takes up space, increasing the disk size of the database. They also slow down writes: you have to update the index every time you create new data. So you only create them for data elements that you search by; the Section ID of a ballot yes, but the time of your speech, no.
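You can watch this tradeoff in miniature. Tabroom runs MariaDB, but SQLite shows the same planner behavior, and it ships with Python; the table below is a toy stand-in, not the real schema:

```python
import sqlite3

# SQLite stand-in for the index behavior described above; Tabroom itself
# runs MariaDB/InnoDB, and these table/column names are illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ballot (id INTEGER PRIMARY KEY, section INTEGER, side TEXT)")
db.executemany(
    "INSERT INTO ballot (section, side) VALUES (?, ?)",
    [(n % 1000, "AFF" if n % 2 else "NEG") for n in range(10_000)],
)

query = "SELECT * FROM ballot WHERE section = 424"

# Without an index, the planner has no choice but to read every row.
before = db.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]

# With an index on section, it jumps straight to the matching range.
db.execute("CREATE INDEX idx_ballot_section ON ballot (section)")
after = db.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]

print(before)  # a full-table scan, e.g. "SCAN ballot"
print(after)   # an index lookup, e.g. "SEARCH ballot USING INDEX idx_ballot_section ..."
```

On 10,000 rows the difference is invisible; on 16.1 million, it’s the difference between Saturday morning running normally and what actually happened.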

That little glitch at 7:16 AM deleted the index records for that one doomed ballot, but not the ballot itself.  Suddenly the number of rows in the index did not match the table. Therefore, the database stopped using it — it knew the index was no longer reliable. The slowdowns, lockups and downtime on Saturday morning were what it feels like to use the ballots table without indexes: it starts out slow, and goes downhill from there.

First, I tried the gentle fix: a utility that tries to verify the data and rebuild just the indexes, which it does without any invasive changes to the data itself. If it succeeds, the database just starts working afterwards. It takes about 12 minutes to run on that large ballots table, a fact I learned this morning. And then, it failed.  It can fail for a lot of reasons, but mostly it has a very hard time verifying data that is changing as it operates, which is what a live database must do.

So I had to turn to invasive procedures. What you do is cut off the ability of anyone to access the database, so nothing changes in the data in the middle of your surgery. Then you dump a backup copy of the table. Then you run the scariest command I’ve yet typed into a database:

DROP TABLE ballots;

That’s right, that deletes them all. And then you hope beyond hope that your backup data file is accurate and not itself corrupt. In reality, in my paranoia, I took four backups: two of the primary database, and two of the replica. That took eight minutes, which included me comparing them against each other to make sure they were all identical. If they had disagreed as to how many ballots exist, I would have had to figure out, or failing that guess, which one was right.  Today I was spared that.
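The identical-backups check can be done mechanically: if dumps hash the same, they are byte-for-byte identical. A minimal sketch, operating on in-memory bytes for brevity — a real check would stream the dump files from disk, and none of these names come from Tabroom itself:

```python
import hashlib

# Sketch of the backup cross-check described above: identical SHA-256
# digests mean the dumps are byte-for-byte identical.
def digest(dump: bytes) -> str:
    return hashlib.sha256(dump).hexdigest()

def backups_agree(dumps: list[bytes]) -> bool:
    """True when every backup dump is identical to the others."""
    return len({digest(d) for d in dumps}) == 1
```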

Then I had to make a choice between the Right Way and the Fast Way to reload the data. Loading up the ballots data takes about 15 minutes. Deleting all the ballots takes about 0.15 seconds, and can’t be undone. So if I do a test run, your downtime is longer. If I don’t test it but the file is bad, then I’d have nothing to recover it from. In trying to shorten the downtime by 20 minutes, I would lengthen it by several hours.

So today, caution won. I took one of the backup files, and loaded it into my test machine. Simply copying the file took a few minutes, and then I got to sit there and watch as it re-created ballots in batches of about 8,500. All 2,000 of them. Each batch takes about 0.45 seconds to run on average, so it was about 15 minutes total of just sitting and waiting as line after line of data was reloaded, like this:

Query OK, 8770 rows affected (0.260 sec)
Records: 8770 Duplicates: 0 Warnings: 0


Query OK, 8594 rows affected (0.272 sec)
Records: 8594 Duplicates: 0 Warnings: 0


Query OK, 8927 rows affected (0.270 sec)
Records: 8927 Duplicates: 0 Warnings: 0

It’s a real fun thing when you are sitting and waiting and can do nothing while you know everyone else is doing the same. But eventually, it worked. And so I braced, dumped the real database’s ballots, and ran it again.

And phew, it was fine.

The site came back immediately, though naturally was a bit slow at first because then EVERYONE was pairing their first round at once, which is far from typical. But that worked itself out fast.

And then I got to clean up the resulting mess.  I have 31 terminal windows open with full access to the entire database — better not typo in any of those!  I spun up a bunch of servers to get spare capacity going, and found a different, less grievous bug in the process — but that thankfully was Hardy’s fault, and so I shoved it off on him.  And then of course, you know how I get email every time that error message screen happens? I got to clear out the 92,822 error reports that were queued up in the email server before any other messages would send.

And then I wrote this post.

After a downtime, you want to take apart the causes and figure out how to make it not happen again that way. The last year’s downtimes were all capacity related; we had too few resources for too many users. It was mostly me figuring out how powerful our new cloud system was, and sometimes wasn’t. We also lacked a system that could quickly bring new resources online when I guessed short. I spent a fair chunk of August building a system to help; now it takes me about 5 minutes to spin up new servers, instead of an hour.  So, neither of our episodes this fall were caused by that.

The one in October was in the category of “my fault, preventable, but super unlucky.” It’s the type of thing where there does exist a level of care that might have prevented it. But practically speaking that level of care would also paralyze me if I adopted it; I would do nothing else if I were that fanatic about validating code and queries. So instead, I created some automated systems to check for slow queries during the week and notify me, to try to find these issues before they explode. That system has already ferreted out a number of annoyingly — but not tragically — slow functions. These things only blow up when there are hundreds of them running at once, but if they don’t exist at all, then that will never happen instead of rarely happening.
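A sweep like that can be surprisingly small. Here’s a minimal sketch assuming the stock MariaDB slow-query-log format; the threshold and function name are hypothetical, since the real checker’s details aren’t described here:

```python
import re

THRESHOLD_SECONDS = 2.0  # hypothetical cutoff; the real value isn't stated

def slow_query_times(log_text: str, threshold: float = THRESHOLD_SECONDS) -> list[float]:
    """Pull Query_time values from a MariaDB slow query log; keep the bad ones."""
    times = (float(t) for t in re.findall(r"Query_time:\s*([\d.]+)", log_text))
    return [t for t in times if t >= threshold]
```

Point something like that at the week’s log on a schedule, mail yourself anything it returns, and the annoyingly-slow functions surface on a quiet Tuesday instead of a busy Saturday.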

Today’s episode was worse: there’s no way for me to prevent errors deep in the underlying database code. I have neither control nor capacity to address it. I will probably schedule a very early morning downtime in the next week or so — or maybe over the Thanksgiving break — to do a full rebuild of all the database tables, and to deep scan the disk they live on. That’s worth doing anyway; rebuilding the tables gets rid of empty spaces that once held deleted records, and makes the whole thing run a few percent faster.

And I might just move all the data onto a new disk altogether. That’s proactive and reduces some risk, but the truth is I might be chasing chimeras. And that’s life with computing. Technical complexity can cause a lot of grief. Human error causes even more. And sometimes, it’s neither; it’s just the stars decided today is not your day, and you’re going to know it.

And such a day was Saturday the 16th of November. It’s been many years since we’ve had a problem of this particular flavor; may it be many years before we have another.