You may invest time in this most heavily after deferral of a bug that you know in your heart is going to cause customer grief.
TACTICS FOR ANALYZING A REPRODUCIBLE BUG
Here are a few tips for achieving the objectives laid out in the previous section:
LOOK FOR THE CRITICAL STEP
When you find a bug, you're looking at a symptom, not a cause. Program misbehavior is the result of an error in the code. You don't see the error because you don't read the code; you just see misbehavior. The underlying error (the mistake in the code) may have happened many steps ago: any of the steps involved in a bug could be the one that triggers the error. If you can isolate the triggering step, you can reproduce the bug more easily and the programmer can fix it more easily.
Look carefully for any hint of an error as you take each step. Often minor indicators are easily missed or ignored. Minor bugs might be the first symptoms of an error that will eventually manifest itself as the problem you're interested in. If they occur on the path to the problem you're analyzing, the odds are reasonable that they're related to it. Look for:
• Error messages: Check error messages against a list of the program's error messages and the events the programmer claims trigger them. Read the message and try to understand why it appears and when (at what step or substep).
• Processing delays: If the program takes an unusually long time to display the next bit of text or to finish a calculation, it may be wildly executing totally unrelated routines. The program may break out of this with inappropriately changed data or it may never return to its old state. When you type the next character, the program may think you're answering a different question (asked in an entirely different section of code) from the one showing onscreen. An unusual delay may be the only indicator that a program has just started to run amok.
• Blinking screen: You may be looking at error recovery when the screen is repainted or part of it flashes then reverts to normal. As part of its response to an error, the program makes sure that what shows on the screen accurately reflects its state and data. The repainting might work, but the rest of the error recovery code may foul up later.
• Jumping cursor: The cursor jumps to an unexpected place. Maybe it comes back (error recovery?) or maybe it stays there. If it stays, the program may have lost track of the cursor's location. Even if the cursor returns, if the program maintains internally distinct input and output cursors (many do), it may have lost one of them.
• Multiple cursors: There are two cursors on the screen when there should be only one. The program may be in a weird state or in a transition between states. (However, this may not be state-dependent. The program may just be misdriving the video hardware, perhaps because it's not updating redundant variables it uses to track the register status of the video card.)
• Misaligned text: Lines of text that are normally printed or displayed in a consistent pattern (e.g., all of them start in the leftmost column) are slightly misprinted. Maybe only one line is indented by one character. Maybe all the text is shifted, evenly or unevenly.
• Characters doubled or omitted: The computer prints out the word error as errrro. Maybe you've found a spelling mistake or maybe the program is having problems reading the data (the string "error") or communicating with the printer. Some race conditions cause character skipping along with other less immediately visible problems.
• In-use light on when the device is not in use: Many disk drives and other peripherals have in-use lights. These show when the computer is reading or writing data to them. When a peripheral's light goes on unexpectedly, the program might be incorrectly reading or writing to memory locations allocated to these peripherals instead of the correct area in memory. Some languages (C, for example) make it especially easy to inadvertently address the wrong area of memory. The program may "save" data to locations reserved for disk control or have previously overwritten control code with data it thought it was saving elsewhere. When this happens you don't see the internal program being overwritten (which will result in horrible bugs when you try to use that part of the program), but you can see the I/O lights blink. This is a classic "wild pointer" bug.
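Processing delays in particular are easier to catch when each step is timed rather than judged by feel. The following is a minimal sketch in Python, not part of any particular test tool. It assumes the steps leading to the bug can be driven by a script. The names `run_steps`, `slow_factor`, and `clock` are hypothetical, chosen only for illustration. It times every step and flags any that take much longer than the typical (median) step.

```python
import time
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]  # (step name, action that performs the step)


def run_steps(
    steps: List[Step],
    slow_factor: float = 5.0,                      # assumed threshold; tune per program
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[List[Tuple[str, float]], List[str]]:
    """Run each step in order, record how long it took, and return
    (timings, suspects), where suspects are the names of steps that took
    more than slow_factor times the median duration."""
    timings: List[Tuple[str, float]] = []
    for name, action in steps:
        start = clock()
        action()
        timings.append((name, clock() - start))

    if not timings:
        return timings, []

    durations = sorted(d for _, d in timings)
    median = durations[len(durations) // 2]        # upper median for even counts
    suspects = [
        name for name, d in timings
        if median > 0 and d > slow_factor * median
    ]
    return timings, suspects
```

A flagged step is only a hint, like the other indicators in the list above. It marks a place to look more closely, not proof that the error was triggered there.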
MAXIMIZE THE VISIBILITY OF THE BEHAVIOR OF THE PROGRAM
The more aspects of program behavior you can make visible, the more things you can see going wrong and the more likely you'll be able to nail down the critical step.
If you know how to use a source code debugger, and have access to one, consider using it. Along with tracing the code path, some debuggers will report which process is active, how much memory or other resources it's using, how much of the stack is in use, and other internal information. The debugger can tell you that:
• A routine always exits leaving more data on the stack (a temporary, size-limited data storage area) than was there when it began. If this routine is called enough times, the stack will fill up and terrible things will happen.
• When one process receives a message from another, an operating system utility that controls message transfer gives the receiving process access to a new area of memory. The message is the data stored in this memory area. When the process finishes with the message, it tells the operating system to take the memory area back. If the process never releases message memory, then as it receives more messages, eventually it gains control of all available memory. No more messages can be sent. The system grinds to a halt. The debugger can show you which process is accumulating memory, before the system crashes.
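The message-memory problem can be illustrated with a toy model. This is a sketch under stated assumptions, not a real operating system interface: `MessagePool`, `deliver`, `release`, `top_holder`, and the process names "logger" and "spooler" are all hypothetical. The pool counts how many message areas each process currently holds, which is the kind of per-process figure a debugger might report.

```python
from collections import Counter
from typing import Optional, Tuple


class MessagePool:
    """Toy stand-in for the utility that hands out message memory."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity          # total message areas available
        self.held: Counter = Counter()    # process name -> areas currently held

    def deliver(self, receiver: str) -> None:
        """Give the receiving process one message area, if any are left."""
        if sum(self.held.values()) >= self.capacity:
            raise MemoryError("no message memory left")
        self.held[receiver] += 1

    def release(self, receiver: str) -> None:
        """The process has finished with a message and hands the area back."""
        if self.held[receiver] <= 0:
            raise ValueError(f"{receiver} holds no message memory")
        self.held[receiver] -= 1

    def top_holder(self) -> Optional[Tuple[str, int]]:
        """Return (process, areas held) for the biggest holder, if any."""
        return self.held.most_common(1)[0] if self.held else None


def simulate(capacity: int, rounds: int) -> Tuple[int, Optional[Tuple[str, int]]]:
    """Each round, 'logger' receives and releases a message, while
    'spooler' receives one and never releases it. Returns the number of
    rounds completed and the biggest holder at that point."""
    pool = MessagePool(capacity)
    for completed in range(rounds):
        try:
            pool.deliver("logger")
            pool.release("logger")        # well-behaved process
            pool.deliver("spooler")       # leaks: never releases
        except MemoryError:
            return completed, pool.top_holder()
    return rounds, pool.top_holder()
```

Checking `top_holder()` every few rounds would point at the leaking process well before the pool runs dry, which is the advantage the debugger offers here.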