Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it? – “The Elements of Programming Style”, 2nd edition, chapter 2. Brian Kernighan
Debug hacks (a Japanese book)
Bug or error analysis.
This world is built on causality, cause and effect.
Root cause -> series of causes and effects -> bug.
What is debugging? It’s abductive reasoning.
We infer the root cause from observations.
Observations -> Infer -> Causes -> Increase observability -> more observations -> … -> Root causes
Performance issues vs Error issues
Performance issues: bottleneck
Error issues: crash, lockup
Error issues analysis - methodology
PID, TID, interrupt/exception number or handler.
Structuralism: concurrency, life cycle, lock.
Expected/intended behavior vs actual behavior.
Timeline - put observations on a timeline
Scientific method: Hypotheses, narrow down
Increase observability based on causes - more tools need to be created.
Data-race detector: “Concurrency bugs should fear the big bad data-race detector (part 1)”
Race condition detector: e.g. the rb-tree race detector used in bsc#1167133
Instruction fetch fault: Prevent page fault handler from overwriting unwound stack frames. See attachment System crashing daily around midnight with unable to handle kernel paging request
If this has only happened on a single physical machine, I suggest that machine be considered to be faulty.
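The "put observations on a timeline" step of the methodology above can be sketched in a few lines of Python. This is only an illustration, not a tool referenced by these notes: the `Observation` record, its field names, and the sample log entries are all hypothetical. The idea is to merge observations from several sources into one list ordered by time, with the execution context (PID, TID, interrupt) attached, so that cause and effect can be read in sequence.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Observation:
    """One observed fact. All field names here are hypothetical."""
    when: datetime   # timestamp of the observation
    source: str      # where it came from, e.g. "dmesg" or "app log"
    context: str     # execution context, e.g. "PID 4121" or "IRQ 19"
    what: str        # the observed fact itself


def build_timeline(*sources):
    """Merge observations from several sources into one list ordered by time."""
    merged = [obs for source in sources for obs in source]
    # sorted() is stable, so observations with equal timestamps keep input order.
    return sorted(merged, key=lambda obs: obs.when)


def format_timeline(timeline):
    """Render the timeline as one line per observation."""
    return "\n".join(
        f"{obs.when.isoformat()} [{obs.source}] {obs.context}: {obs.what}"
        for obs in timeline
    )


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    kernel_log = [
        Observation(datetime(2024, 1, 1, 23, 59, 58), "dmesg", "PID 4121",
                    "unable to handle kernel paging request"),
    ]
    app_log = [
        Observation(datetime(2024, 1, 1, 23, 59, 50), "app log", "PID 4121",
                    "nightly job started"),
        Observation(datetime(2024, 1, 1, 23, 59, 55), "app log", "PID 4121",
                    "worker thread stopped responding"),
    ]
    print(format_timeline(build_timeline(kernel_log, app_log)))
```

Gaps in the resulting timeline show where observability is missing and where the next hypothesis should be tested.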
The generic term “memory corruption” is often used to describe the consequences of writing to memory outside the bounds of a buffer, when the root cause is something other than a sequential copy of excessive data from a fixed starting location (i.e., classic buffer overflows or CWE-120). This may include issues such as incorrect pointer arithmetic, accessing invalid pointers due to incomplete initialization or memory release, etc.
An example by Neil Brown: The corrupted list of inodes could be due to one inode being freed and re-used while still on the list - or it could be due to memory corruption of a forward pointer.
Memory corruption is one of the most intractable classes of programming errors, for two reasons:
The source of the memory corruption and its manifestation may be far apart, making it hard to correlate the cause and the effect.
Symptoms appear under unusual conditions, making it hard to consistently reproduce the error.
uninitialized memory: wild pointer
use after free: dangling pointer
unknown-source memory corruption: the generic “memory corruption”.
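Why the source and the manifestation can be far apart is easier to see in a toy model. The sketch below is a simulation in Python, not real memory and not how any particular allocator works: `ToyHeap`, `CANARY`, and the block layout are hypothetical names chosen for illustration. An unchecked write to one block silently overwrites a "next pointer" stored in the neighbouring block, and the damage would only surface later when that pointer is followed. A guard byte after each block, an idea borrowed from red-zoning debug allocators, lets a check right after the write point at the overrun block instead.

```python
CANARY = 0xA5  # hypothetical guard-byte pattern


class ToyHeap:
    """A flat byte array standing in for memory; writes are not bounds-checked."""

    def __init__(self, size):
        self.mem = bytearray(size)
        self.top = 0
        self.guards = []  # offsets of the guard bytes placed after each block

    def alloc(self, size):
        """Hand out `size` bytes followed by one guard byte; return the start offset."""
        start = self.top
        self.mem[start + size] = CANARY
        self.guards.append(start + size)
        self.top = start + size + 1
        return start

    def write(self, addr, data):
        """Copy `data` to `addr` without any bounds check, like memcpy()."""
        self.mem[addr:addr + len(data)] = data

    def damaged_guards(self):
        """Return the offsets of guard bytes that no longer hold CANARY."""
        return [g for g in self.guards if self.mem[g] != CANARY]


def demo():
    heap = ToyHeap(16)
    buf = heap.alloc(4)    # bytes 0..3, guard byte at offset 4
    node = heap.alloc(1)   # byte 5 holds a "next pointer", guard byte at offset 6
    heap.write(node, bytes([0x0A]))   # a valid pointer value
    heap.write(buf, b"AAAAAA")        # the bug: 6 bytes into a 4-byte buffer
    # The write itself raised no error. The stored pointer is now garbage,
    # but nothing fails until some later code follows it.
    return heap.damaged_guards(), heap.mem[node]


if __name__ == "__main__":
    damaged, pointer = demo()
    print("damaged guard offsets:", damaged)
    print("corrupted pointer value:", hex(pointer))
```

Checking the guards right after each suspicious write moves the detection point close to the cause, which is the kind of added observability the methodology above calls for.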
Syntax checking: gcc -Wall, bash -n
static code analysis: smatch
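As a small illustration of the syntax-checking step, the sketch below wraps `bash -n`, which reads a script and reports syntax errors without executing any commands. The helper name `bash_syntax_ok` and the sample scripts are hypothetical; the sketch assumes `bash` is on `PATH` and returns `None` when it is not.

```python
import os
import shutil
import subprocess
import tempfile


def bash_syntax_ok(script_text):
    """Return True if `bash -n` accepts the script, False if it reports an
    error, or None when bash is not installed."""
    bash = shutil.which("bash")
    if bash is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "script.sh")
        with open(path, "w") as f:
            f.write(script_text)
        # -n: parse the script only; no command in it is executed.
        result = subprocess.run([bash, "-n", path],
                                capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    print(bash_syntax_ok("echo hello\n"))
    print(bash_syntax_ok("if true; then\n  echo hello\n"))  # missing "fi"
```

A check like this only catches syntax errors; logic errors still need the static analysis and runtime tools listed above.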