Logic Fault

Have you considered running your software on a computer?


On Reversibility

Reversibility is the most important quality of good software decisionmaking.

This is not a new idea.

Reversibility is more important than knowing “will this work at all?”. If you make a wrong decision but can reverse it, you’re fine. If you make decisions you can’t reverse and your strategy is “hope it’s the right call”, you’re building systems on the power of prayer.

What we talk about when we talk about reversibility #

Reversibility is not the same as the ability to iterate/improve on incomplete solutions. Iteration is well and good, but if you equate the ability to improve a system with the ability to go back to the way things were before you had that system, your outcomes will land somewhere between “endless struggles while continually wondering if this could have been done a different way” and “existentially annihilating hamster-wheel of layering forward-(not-quite-)fixes onto a cautionary monument to the sunk cost fallacy.”

Reversibility means undoing the thing you changed, un-building the system, going back to how things were before you deployed the bad solution. Reversibility does not mean trying another solution to the problem while the wrong solution is deployed; it means putting the car in reverse.

Reversibility does not mean committing less-than-fully to an approach. It means executing on that approach in such a way that it is possible (maybe not easy, but possible) to go back to the state of the system/business/product/whatever before that approach was taken.
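One way to execute fully on an approach while keeping the exit open is to record, for every forward step, the action that restores the prior state. A minimal sketch in Python; the `ReversibleChange` class and the toy "deployment" are hypothetical illustrations, not anything from a real tool:

```python
# Sketch: pair each forward step with the action that undoes it, so the
# whole change can be backed out in reverse order. Hypothetical example.

class ReversibleChange:
    """Applies steps while remembering how to restore the prior state."""

    def __init__(self):
        self._undo_stack = []

    def apply(self, description, do, undo):
        """Run `do`; remember `undo` so this step can be backed out."""
        do()
        self._undo_stack.append((description, undo))

    def reverse(self):
        """Undo every applied step, most recent first."""
        while self._undo_stack:
            _description, undo = self._undo_stack.pop()
            undo()


# Usage: a toy "deployment" that flips a setting in a config dict.
config = {"tests": "manual"}
change = ReversibleChange()
change.apply(
    "enable CI testing",
    do=lambda: config.update(tests="automated-ci"),
    undo=lambda: config.update(tests="manual"),
)
assert config["tests"] == "automated-ci"

change.reverse()  # put the car in reverse
assert config["tests"] == "manual"
```

The point of the sketch is the discipline, not the class: the undo action is written at the same time as the forward action, while the prior state is still fresh, rather than reconstructed later under pressure.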

Reversing should not be understood as failure–or, if it is, that word should be destigmatized from all “this should never have been done/how dare you” baggage–but rather as a property systems should have. In the same way that a healthy organization wouldn’t deploy a database on the assumption that it will never be down, and in the same way that a healthy organization would practice blameless troubleshooting of downtime when it does occur, reversing a decision should be understood as an inevitable system property to be minimized but not eliminated–think “hardware failure”, not abject dereliction of change control.

Once you have backed out of the swamp, you can decide what to do instead. The right answer to “what to do instead” might be to do nothing and deal with leaving the problem unsolved.

Reversing, trying something else, and deliberately leaving the problem unsolved are all valid outcomes of a healthy architectural decisionmaking environment.

First-order effects of prioritizing reversibility #

Second-order effects of prioritizing reversibility #

Case study: Testing #

I’ve been part of a few groups working on complex software systems which didn’t have any automated tests. Manual tests existed, as did many hand-runnable programmatic unit/integration/system tests, but which tests were run, when, and how was left up to the people releasing changes. This is a surprisingly common situation, and not just in small projects.

Each time, the following scenario played out:

  1. Problem identified: no automated tests. Flagged as a problem either because of released defect rates or because the gap was considered a bad practice outright.
  2. Solution chosen: use automated CI testing for changes being released; people should write tests along with features.
  3. Solution implemented: CI runs automated tests and people write them.
  4. New problem identified: product owners are confused as to why releases take a long time. Engineers answer “we had to spend a long time writing/fixing tests, even though the thing being released was ready before then.”
  5. New problems identified: flaky tests, poor quality tests, slow tests. The usual growing pains of automated-testing adoption.
  6. Solution chosen: switch the automated testing system used (new framework, new CI system, autogenerated test code, etc.)
  7. Solution implemented: new automated testing system used.
  8. New problem: much rejoicing in engineering because the new system was better than the old one, which swiftly ended because, improvement or not, plenty of slowness, flakiness, and quality issues remained in the new testing system. Product still unhappy.
  9. Solution chosen: change testing system again/use TDD/add coverage metrics as a KPI/etc.
  10. End state: things are not great.

So why did this end in a not-great state? There are plenty of well-known lessons to take away: track defect/rollback rates; communicate expectations to product/stakeholders before implementing new engineering processes; make data-driven decisions based on CFR/false-failure-rate/whatever; presence-of-tests != culture-of-testing, and so on.

Those are all good and important things.

Decision reversibility is more important than all of them.

It’s more important because plenty of people will not or cannot do all of the “correct” solutions. (Raise your hand if you think that shipping slower is worth it for quality through automated testing. Cool. Now keep your hand up if you think that all the different parts of your business will agree on how much slower is an acceptable tradeoff for a testing culture. Yeah, that’s what I thought.) If you can only operationalize one good-decisionmaking practice, pick reversibility. (Being able to operationalize at most one good decisionmaking practice is a common situation, and organizations are rarely self-aware enough to notice that they’re in it.)

In those testing-related anecdotes, the long-term fix was surprisingly similar:

  1. Reverse: go back to the bad old days of not running automated tests. Yes, really. With all the tradeoffs that entails: higher predictability in release cadence, higher ownership of feature delivery by engineers, lower reliability, more fear associated with releases, etc.
  2. Honestly assess what went wrong with the attempted solutions. Honesty is impossible to achieve if the attempted solution is still in place. Reversing first makes discussions of what went wrong/what to do instead less fraught, urgent, and fragile.
  3. Decide what alternative solutions–if any–should be tried. This is also extremely hard to do when partial/defective solutions (and people invested in forward-fixing them) are already in place.

In most of those cases, better automated testing systems were built after reversing. They worked out. In a couple of cases, teams chose to remain without automated testing. That worked out, too. Yes, really.

Lessons #

Reversibility means reversibility: not agility or the ability to iterate, not forward fixing, but going back to the way things were before.

“Reverse and try again” is not always the right call, but having the ability to do so is. Put another way: being able to go back to a previous state is hugely beneficial to your organization/software, even if you rarely actually do it, for the same reason that the test that always passes is valuable, too.

Reversibility is a value, not a process. As such, it can be adopted informally (by creating a culture that values people who value reversibility) or formally (ADRs, rollback plans, stability levels, back-compat guarantees, etc.).
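As one illustration of the formal route, an ADR can bake reversibility in by requiring a reversal section up front. This is a hypothetical template with headings of my own invention, not a standard:

```markdown
# ADR-NNN: <decision title>

## Status
Proposed | Accepted | Reversed

## Context
What problem are we solving, and what does the current state look like?

## Decision
What we are going to do.

## Reversal plan
Concretely, how do we get back to the prior state if this goes badly?
What data, config, or process would the reversal need to restore?

## Consequences
Tradeoffs we accept, including the cost of reversing.
```

Writing the reversal plan while the decision is still on paper is the cheap moment to do it; after the change ships, the same thinking happens under sunk cost and time pressure.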

Reversibility applies at all scales.

Nobody makes good decisions while committed to a bad-but-already-shipped solution. No, you’re not special–nobody. Between split focus, sunk-costing, ego, fear, and incentives, this is always a bad move. Yes, plenty of successful groups do this and still succeed. Lots of people smoke and live a long time, too.

Prioritize reversibility even if you have limited resources/maturity/ability to commit to changes. Especially then.

Going back to the old state is never, ever as bad as you think. It’ll be fine. I prommy. Things worked in the old state, however badly, after all.