The art of debugging


Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it? – Brian Kernighan, “The Elements of Programming Style”, 2nd edition, chapter 2


Debug hacks (a Japanese book)


Bug or error analysis.
This world is built on causality, cause and effect.
Root cause -> series of causes and effects -> bug.
What is debugging? It is abductive reasoning.
We infer the root cause from observations.
Observations -> Infer -> Causes -> Increase observability -> more observations -> … -> Root causes

Performance issues vs Error issues

Performance issues: bottleneck
Error issues: crash, lockup

Error issues analysis - methodology

Execution ID

Stack backtrace
PID, TID. Interrupt/exception number or handler.
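As a user-space analogue of the execution ID above, the following minimal sketch records the PID, the OS-level TID, and the stack backtrace at the point of interest. The function names execution_id and handler are hypothetical and only serve the illustration; in the kernel the same information would come from the crash dump or the oops message instead.

```python
import os
import threading
import traceback


def execution_id():
    """Return identifiers that say where the current code is executing."""
    return {
        "pid": os.getpid(),                     # process ID
        "tid": threading.get_native_id(),       # OS-level thread ID (Python 3.8+)
        "backtrace": traceback.format_stack(),  # call chain leading to this point
    }


def handler():
    # Hypothetical function standing in for the code under investigation.
    return execution_id()


info = handler()
print(f"PID={info['pid']} TID={info['tid']}")
print("".join(info["backtrace"]))
```

Logging these three items with every observation makes it possible to tell later which execution context produced which symptom.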

Structuralism: concurrency, life cycle, lock.

Expected/intended behavior vs actual behavior.

Timeline - put observations on a timeline
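A minimal sketch of the timeline idea: observations gathered from different sources are merged and ordered by timestamp. The timestamps, sources and messages below are made up for illustration, and build_timeline is a hypothetical helper name.

```python
from datetime import datetime

# Hypothetical observations as (timestamp, source, event); not real log data.
observations = [
    ("2024-01-01T00:00:05", "app log", "request timed out"),
    ("2024-01-01T00:00:01", "kernel log", "I/O error on device"),
    ("2024-01-01T00:00:03", "monitoring", "disk latency spike"),
]


def build_timeline(observations):
    """Order observations from all sources by their timestamp."""
    return sorted(observations, key=lambda o: datetime.fromisoformat(o[0]))


for ts, source, event in build_timeline(observations):
    print(f"{ts}  [{source}]  {event}")
```

Reading the merged list top to bottom shows which event preceded which, a first hint at the chain of causes and effects.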

IO hang issue at OS layer

Observations #TODO


Scientific method: Hypotheses, narrow down
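One common way to narrow down, shown here only as a sketch, is bisection: each test of a hypothesis ("the bug is already present at this point") halves the remaining search space. The sketch assumes the candidates are ordered, that the last one is known to be bad, and that once the bug appears it stays. The version names and the is_bad check are hypothetical.

```python
def first_bad(candidates, is_bad):
    """Return the index of the first candidate for which is_bad() is true.

    Assumes: candidates are ordered, the last one is bad, and every
    candidate after the first bad one is bad as well.
    """
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid          # bug already present: look at earlier candidates
        else:
            lo = mid + 1      # bug not yet present: look at later candidates
    return lo


# Hypothetical example: the bug was introduced in "v4".
versions = ["v1", "v2", "v3", "v4", "v5", "v6"]
is_bad = lambda v: versions.index(v) >= 3
print(versions[first_bad(versions, is_bad)])
```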


Data-race from kernel
Race conditions vs. data races
Instruction fetch fault: exception RIP: unknown or invalid address
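The difference between a race condition and a data race noted above can be sketched as follows. Every access to the balance is protected by a lock, so there is no data race, yet the check and the update are not atomic as a pair, so the result still depends on timing. A barrier forces the unlucky interleaving so that the outcome is reproducible. The Account class and the withdraw function are hypothetical, and Python is used only as a model; in C or in the kernel the lock would be a mutex or spinlock.

```python
import threading


class Account:
    def __init__(self, balance):
        self._balance = balance
        self._lock = threading.Lock()

    def get(self):
        with self._lock:            # every read is synchronized
            return self._balance

    def set(self, value):
        with self._lock:            # every write is synchronized
            self._balance = value


def withdraw(account, amount, barrier, results, idx):
    balance = account.get()         # check
    barrier.wait()                  # both threads have checked before either acts
    if balance >= amount:
        account.set(balance - amount)   # act on a stale value
        results[idx] = amount


account = Account(100)
barrier = threading.Barrier(2)
results = [0, 0]
threads = [
    threading.Thread(target=withdraw, args=(account, 80, barrier, results, i))
    for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Both withdrawals succeed although only one should have.
print("balance:", account.get(), "withdrawn:", sum(results))
```

Holding the lock across the whole check-then-act sequence would remove the race condition. A data-race detector would not flag this code, which is why the two kinds of bug need different tools.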

Increase observability based on causes - more tools need to be created.

Data-race detector: Concurrency bugs should fear the big bad data-race detector (part 1)
Race condition detector: e.g. the rb-tree race detector used in bsc#1167133
Instruction fetch fault: Prevent page fault handler from overwriting unwound stack frames. See attachment System crashing daily around midnight with unable to handle kernel paging request

Bug classifications

Software bug types
CWE VIEW: Research Concepts
CWE VIEW: Simplified Mapping
CWE VIEW: Development Concepts

Hardware Bugs

If this has only happened on a single physical machine, I suggest that machine be considered to be faulty.

Memory corruption

The generic term “memory corruption” is often used to describe the consequences of writing to memory outside the bounds of a buffer, when the root cause is something other than a sequential copy of excessive data from a fixed starting location (i.e., classic buffer overflows or CWE-120). This may include issues such as incorrect pointer arithmetic, accessing invalid pointers due to incomplete initialization or memory release, etc.
An example by Neil Brown: The corrupted list of inodes could be due to one inode being freed and re-used while still on the list - or it could be due to memory corruption of a forward pointer.
Memory corruption is one of the most intractable classes of programming errors, for two reasons:
The source of the memory corruption and its manifestation may be far apart, making it hard to correlate the cause and the effect.
Symptoms appear under unusual conditions, making it hard to consistently reproduce the error.
Memory corruption
Memory safety
uninitialized memory: wild pointer
use after free: dangling pointer
buffer overflow:
unknown source memory corruption: The generic “memory corruption”.
memory leak:


Syntax checking: gcc -Wall, bash -n
Static code analysis: smatch