Learning debugging

In my learning programming post I wrote about some approaches for learning to program and linked to a bunch of resources. Some of my insightful friends pointed out that most programming teaching doesn’t cover debugging, and that this gap leads to a lot of frustration. This is particularly hard on those working alone or with other people who don’t know how to debug.

Debugging is the art of figuring out why a system doesn’t work. The skills of debugging apply not just to errant programs but to any system that needs diagnosis.

A brief story to illustrate some debugging independent of programming:

I return to my car in a parking lot and hit the alarm disarm button on the key fob and nothing happens. My car doesn’t work. Why not? It’s late and I want to go home.

What might cause the car not to respond when the key fob button is pressed?

Some theories for why the key fob doesn’t work come to mind:

  1. Key fob is broken
  2. Key fob battery is dead
  3. Car alarm isn’t working

A simple test will check #3 – use the key to open the car door or trunk. If the alarm works then it will trigger, which would suggest that the problem is with the fob (#1 or #2). If the alarm doesn’t go off then the car alarm isn’t working for some reason.

I use the key to open the car door. The alarm doesn’t go off and the dome light remains dark.

The next step is to figure out why the car alarm isn’t working. The most likely reason for both the car alarm and the dome light not working is a dead car battery. Trying to start the car confirms that hypothesis – nothing happens.

With a dead battery, the remediation is to get the car jumped so I can start it and drive home. A jump works around the problem but does not address the new “bug” – the car battery shouldn’t be dead. The battery is dead enough to not power the alarm or any lights, rather than just failing to provide enough power to start the car. I formed some theories about why the battery might be dead:

  1. Headlights left on – not likely since the car gets audibly distraught if you remove the key when the headlights are on.
  2. Dome light left on
  3. Other power drain

I checked the dome light switch – on. Whoops. This is when it is useful to know the car has been sitting in this airport long-term parking lot for two weeks. Confident I had probably isolated the “input” that caused the problem, I called the helpful airport parking service to get a jump and drove home. The next morning I did some research to learn if the car alarm might drain the battery and, if so, how long it might take. I knew I had previously left the car for 10-12 days without driving it with no problem. For my particular car and alarm, the answer was three to four weeks – or two weeks or less if the dome light is also on. This probably means that the battery is normally kept charged but the combination of accessories left on while the car was parked drained the battery.

A non-trivial debugging session has a few phases:

  1. Reproduce – Figure out how to make the system have the problem reliably.
  2. Isolate – Isolate the quality of the input or environment that causes the problem. This is trimming down the reproduction to the essential parts. Sometimes this is achieved during the first phase. These first two phases are important for filing good bug reports.
  3. Understand – Figure out why the problem occurs and test that understanding with more tests. When the code works in some cases and fails in others, this should also involve exploring the edges of the failure enough to be sure the problem is really understood.
  4. Fix and Test – This isn’t really part of debugging, but having good isolated reproducible cases and an understanding of the bug is critical to knowing when the bug is really fixed. That knowledge is also what goes into building new tests to help catch other bugs of the same nature and prevent regressions.
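
To make the Isolate phase concrete, here is a minimal sketch in Python of trimming a failing input down to its essential parts. It assumes you already have a reliable reproduction wrapped in a function that returns True when the bug appears. The names `minimize` and `fails` are made up for illustration, and this is a simple greedy approach, not the only or best way to shrink an input.

```python
def minimize(items, fails):
    """Greedily drop chunks of `items` while `fails(items)` stays True.

    `fails` is a hypothetical predicate wrapping the reproduction:
    it returns True when the given input still triggers the bug.
    """
    if not fails(items):
        raise ValueError("start from an input that reproduces the bug")
    chunk = len(items) // 2
    while chunk >= 1:
        i = 0
        while i < len(items):
            # Try the input with one chunk removed.
            candidate = items[:i] + items[i + chunk:]
            if fails(candidate):
                items = candidate  # still fails: keep the smaller input
            else:
                i += chunk         # chunk was needed: move past it
        chunk //= 2                # retry with finer-grained chunks
    return items


# Stand-in for a real bug: "fails" whenever both 3 and 7 are present.
def fails(xs):
    return 3 in xs and 7 in xs


print(minimize(list(range(10)), fails))  # → [3, 7]
```

The result is a much better starting point for the Understand phase, and for a bug report, than the original ten-element input.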

There’s no step-by-step guide to debugging. Sometimes the act of reproducing and isolating the bug yields enough clues for the understanding step to be easy. Sometimes reproducing and isolating is just the beginning.

Experimentation, deduction, and experience (including “external experience” such as searching or asking colleagues) are all important. It is important not to immediately fall back to asking someone for help with a vague description of the problem like “it doesn’t work”. In the worst case, the person you ask will solve your problem and you won’t learn anything. Think of each debugging task as a way to build experience so that future debugging tasks will be easier. After tens, hundreds, or thousands of bugs, you’ll have experience with many different kinds of bugs and a sense for how to form better hypotheses to isolate the bug given only the initial repro case. If you’re mentoring someone, remember that learning to debug is a critical part of learning to program, and avoid short-circuiting that learning by using your greater experience with debugging (and reading code) to just give answers!

Debugging a system is harder than debugging a single program. Frequently when a problem occurs in a system there’s a lot of pressure to restore operation (e.g. bring it back up if it has hung or crashed) and that act may destroy important information you need to reproduce the bug. It’s useful to know how to capture enough state from the hung or crashed system before restarting or rebooting it to have a shot at debugging the problem. It’s also useful to have the system emitting useful logs or crash dumps – thinking ahead of time about what state may be important for debugging problems is an important part of building systems that are easier to debug.
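
As one small illustration of capturing state before a restart, here is a sketch in Python that collects the current stack of every thread in a running process so it can be written to a log. The function name `snapshot_threads` is made up for this example, and it relies on `sys._current_frames()`, which is specific to CPython. A real system would need some way to trigger this from outside the stuck process, which is not shown here.

```python
import sys
import threading
import traceback


def snapshot_threads():
    """Return a text dump of every live thread's current stack."""
    # Map thread ids to readable names where the threading module knows them.
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for ident, frame in sys._current_frames().items():
        lines.append(f"Thread {names.get(ident, '?')} ({ident}):")
        lines.extend(entry.rstrip() for entry in traceback.format_stack(frame))
    return "\n".join(lines)


if __name__ == "__main__":
    # In practice this would go to a log file before the process is restarted.
    print(snapshot_threads())
```

Python’s standard library also has a `faulthandler` module that can dump tracebacks when the interpreter crashes, which covers some of the same ground with less code.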

Articles about debugging are much harder to find than articles about running a debugger for a particular environment. A debugger is a useful way to get information about a program while it runs rather than a magic tool for isolating bugs.

Some resources: