The Art of Debugging

This isn’t a Go post really. In fact it’s not really a programming post either. What this post covers, though, is highly applicable to programming, in Go, or any other language.

Debugging a program, or a set of programs, or even finding a fault in your car, is easiest when you follow this algorithm.

Step 1.

Triage

The very first thing that you need to know is what the problem is; that is, why a bug is being reported at all.

When someone reports a bug, whoever is looking at it needs to know what the problem is. A good bug report contains three important pieces of information:

* 1 The state - something like “When the application is doing X”, or “When the DB values are in combination Y”
* 2 The input - something like “When I press button A”, or “If I input yesterday’s accounts”
* 3 The output (this is the bit people notice first) - “A panic occurs”, or “An impossible value is calculated”

The hard thing here is getting whoever reports the issue to provide all of the information. Sometimes it takes a few discussions, but without it the only way to provide a fix is to guess, and that could be a very long process indeed.

Step 2.

Repeat the bug.

Create a local environment that matches the state described in the bug, provide the input, and observe the output.

Being able to do this means that a fix (if created) can be confidently asserted to work.

Docker really comes to the fore here: being able to spin up an environment that matches production on your local machine is invaluable.

Having a copy of the production DB in one container, and another container running the same code as production, makes life SO easy.

This is a great time to write a bit of code that provides input to the system/application, and measures the output. This is a test, and being able to automatically test if the bug has been solved is what computers were invented for :-)
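As a minimal sketch of such a test in Go: the `Withdraw` function, its overdraft bug, and the `BugReport` fields below are all hypothetical stand-ins for a real system and a real report. The only point is that the state, input, and output from the report become data, and one function replays them.

```go
package main

import "fmt"

// BugReport holds the three parts of a report: state, input, and output.
// All field names and values here are hypothetical.
type BugReport struct {
	Balance  int // state: account balance before the operation
	Amount   int // input: amount the user tried to withdraw
	Observed int // output: what the reporter actually saw
	Expected int // output that would have been correct
}

// Withdraw stands in for the production code. It has a deliberate bug:
// it lets the balance go negative instead of refusing the withdrawal.
func Withdraw(balance, amount int) int {
	return balance - amount
}

// Reproduce replays the report against the code. It says whether the
// reported output still occurs and whether the expected output occurs.
func Reproduce(r BugReport) (reproduced, fixed bool) {
	got := Withdraw(r.Balance, r.Amount)
	return got == r.Observed, got == r.Expected
}

func main() {
	report := BugReport{Balance: 50, Amount: 80, Observed: -30, Expected: 50}
	reproduced, fixed := Reproduce(report)
	fmt.Println("reproduced:", reproduced, "fixed:", fixed)
}
```

Run before any fix, this reports that the bug is reproduced and not fixed. The same program, unchanged, becomes the check that a later fix works.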

Step 3.

Isolate the bug.

Knowing where the bug is is how it’s fixed. There’s no point looking at code if the cause is a database misconfiguration.

How do I isolate the bug? I break the system down into steps. For example: Input, Processing, Upstream system integration, Database, and Output. I do a binary search; that is, I split the system in two and work out whether the input, state, or output is right or wrong at that point. If it’s good, I know that the bug lies somewhere between my check and the final output. If not, I know it’s somewhere between the input and the checkpoint.
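The binary search over checkpoints can be sketched in Go as follows. The `Stage` type, the example pipeline, and its numbers are invented for illustration. The sketch also assumes that once the data has gone wrong it stays wrong at every later checkpoint, which is what makes a binary search valid here.

```go
package main

import (
	"fmt"
	"sort"
)

// Stage is one step of a hypothetical pipeline, together with the value
// we expect to see after it for the input taken from the bug report.
type Stage struct {
	Name string
	Run  func(int) int
	Want int
}

// valueAfter runs stages 0 through i on the input and returns the result.
func valueAfter(stages []Stage, input, i int) int {
	v := input
	for _, s := range stages[:i+1] {
		v = s.Run(v)
	}
	return v
}

// FirstBadStage binary-searches for the earliest checkpoint whose value
// is wrong. It returns len(stages) if every checkpoint passes.
func FirstBadStage(stages []Stage, input int) int {
	return sort.Search(len(stages), func(i int) bool {
		return valueAfter(stages, input, i) != stages[i].Want
	})
}

// examplePipeline builds a five-stage pipeline with one planted bug.
func examplePipeline() []Stage {
	same := func(v int) int { return v }
	return []Stage{
		{Name: "input", Run: same, Want: 10},
		{Name: "processing", Run: func(v int) int { return v - 2 }, Want: 8},
		// Deliberate bug: this stage should add 1 but doubles instead.
		{Name: "upstream system integration", Run: func(v int) int { return v * 2 }, Want: 9},
		{Name: "database", Run: same, Want: 9},
		{Name: "output", Run: same, Want: 9},
	}
}

func main() {
	stages := examplePipeline()
	i := FirstBadStage(stages, 10)
	if i == len(stages) {
		fmt.Println("every checkpoint passed")
		return
	}
	fmt.Println("first bad checkpoint:", stages[i].Name)
}
```

In a real system the checkpoints are rarely this tidy; each "check" is more likely a log line or a database query inspected by hand, but the halving strategy is the same.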

I cannot write exactly how to check, because the flow of the data, and how to check it, changes with every application. But I lean heavily on logging output, adding more as I narrow down where the bug lives. I also look at the state of datastores before and after the operation, ensuring they are as expected.
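One way to make the before-and-after comparison concrete is sketched below in Go. The in-memory map standing in for a datastore, the `shipOrder` operation, and the key names are all hypothetical; a real check would query the actual database.

```go
package main

import (
	"log"
	"sort"
)

// copyStore takes a snapshot of the hypothetical datastore.
func copyStore(store map[string]int) map[string]int {
	snap := make(map[string]int, len(store))
	for k, v := range store {
		snap[k] = v
	}
	return snap
}

// changedKeys lists, in sorted order, every key that was added, removed,
// or given a different value between the two snapshots.
func changedKeys(before, after map[string]int) []string {
	var keys []string
	for k, v := range after {
		if old, ok := before[k]; !ok || old != v {
			keys = append(keys, k)
		}
	}
	for k := range before {
		if _, ok := after[k]; !ok {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

// shipOrder is the hypothetical operation under suspicion.
func shipOrder(store map[string]int, qty int) {
	store["stock"] -= qty
	store["shipped"] += qty
}

func main() {
	store := map[string]int{"stock": 5, "shipped": 0, "price": 20}

	before := copyStore(store)
	log.Printf("before shipOrder: %v", before)

	shipOrder(store, 2)

	log.Printf("after shipOrder: %v", store)
	log.Printf("changed keys: %v", changedKeys(before, store))
}
```

If a key shows up in the changed list that the operation should never touch, or an expected key is missing from it, the search has narrowed to this operation.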

Once the bug’s location is narrowed down, if it’s in a function, then writing a unit test for this case is a good thing to do. Automatic testing ensures the bug will not be reintroduced, and the test records that the case either wasn’t handled before or was handled wrongly.
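A rough sketch of such a regression test, in Go's table-driven style, is shown below. The `Withdraw` function and its overdraft case are hypothetical. In a real project this would normally live in a `_test.go` file and use the standard `testing` package; it is written as a plain program here so that it runs on its own.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInsufficientFunds is returned when a withdrawal exceeds the balance.
var ErrInsufficientFunds = errors.New("insufficient funds")

// Withdraw is a hypothetical function in which the bug was isolated.
// The fix: refuse a withdrawal that is larger than the balance.
func Withdraw(balance, amount int) (int, error) {
	if amount > balance {
		return balance, ErrInsufficientFunds
	}
	return balance - amount, nil
}

type withdrawCase struct {
	name            string
	balance, amount int
	want            int
	wantErr         error
}

var withdrawCases = []withdrawCase{
	{name: "normal withdrawal", balance: 50, amount: 20, want: 30},
	{name: "exact balance", balance: 50, amount: 50, want: 0},
	// The case from the bug report, which previously produced -30.
	{name: "overdraft from bug report", balance: 50, amount: 80, want: 50, wantErr: ErrInsufficientFunds},
}

// failures runs every case and describes each one that does not pass.
func failures(cases []withdrawCase) []string {
	var out []string
	for _, c := range cases {
		got, err := Withdraw(c.balance, c.amount)
		if got != c.want || !errors.Is(err, c.wantErr) {
			out = append(out, fmt.Sprintf("%s: got (%d, %v), want (%d, %v)",
				c.name, got, err, c.want, c.wantErr))
		}
	}
	return out
}

func main() {
	failed := failures(withdrawCases)
	if len(failed) == 0 {
		fmt.Println("all cases pass")
		return
	}
	for _, f := range failed {
		fmt.Println("FAIL", f)
	}
}
```

Keeping the reported case in the table, with a comment saying where it came from, is what stops the same bug coming back unnoticed.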

Step 4.

Issue the fix

The bug’s been found. The fix for it might be a “workaround”, or a “change”, or “won’t fix”. This depends on the nature of the bug, and what is required for the bug to no longer exist (it could be a simple code change, or it could be that the laws of physics actively prevent any change).

In any case, the change is documented, unit tests are created, and the fix moves from the local development environment through to a quality control environment (sometimes known as staging, or a user acceptance testing area), before being rolled out to production.

Summary.

Following this algorithm is the most efficient way of fixing a bug. There are no shortcuts. Missing information from one step means that the bug cannot be confidently solved. People are just guessing.

That’s not to say some people won’t get lucky, but that sort of luck cannot be relied on, any more than a crap shoot or a lottery win.

Shane Howearth
