# Testing and Debugging
* * *
Topics:
* validation
* coverage
* black-box testing
* glass-box testing
* testing data abstractions
* randomized testing
* debugging
* defensive programming
* * *
## Validation
Many programmers might think of programming as a task largely involving
debugging. So it's worthwhile to take a step back and think about
everything that comes before debugging.
The goal we're after is that programs behave as we intend them to behave.
*Validation* is the process of building our confidence in correct program
behavior. There are many ways to increase that confidence. Social methods,
formal methods, and testing are three, and we discuss them next.
**Social methods** involve developing programs with other people, relying
on their assistance to improve correctness. Some good techniques include
the following:
- *Code walkthrough.* In the walkthrough approach, the programmer
presents the documentation and code to a reviewing team, and
the team gives comments. This is an informal process. The focus
is on the code rather than the coder, so hurt feelings are easier
to avoid. However, the team may not get as much assurance that the
code is correct.
- *Code inspection.* Here, the review team drives the code review
process. Some, though not necessarily very much, team preparation
beforehand is useful. They define goals for the review process
and interact with the coder(s) to understand where there may be
quality problems. Again, making the process as blameless as
possible is important.
- *Pair programming.* The most informal approach to code review
is through pair programming, in which code is developed by a
pair of engineers: the driver who writes the code, and the
observer who watches. The role of the observer is to be a critic,
to think about potential errors, and to help navigate larger
design issues. It's usually better to have the observer be
the engineer with the greater experience with the coding task
at hand. The observer reviews the code, serving as the devil's
advocate that the driver must convince. When the pair is
developing specifications, the observer thinks about how to
make specs clearer or shorter. Pair programming has other
benefits. It is often more fun and educational to work with
a partner, and it helps focus both partners on the task.
If you are just starting to work with another programmer,
pair programming is a good way to understand how your partner
thinks and to establish common vocabulary. It is a good idea
for partners to trade off roles, too.
These social techniques for *code review* can be remarkably effective.
In one study conducted at IBM (Jones, 1991), code inspection
found 65% of the known coding errors and 25% of the known
documentation errors, whereas testing found only 20% of
the coding errors and none of the documentation errors.
The code inspection process may be more effective
than walkthroughs. One study (Fagan, 1976) found that
code inspections resulted in code with 38% fewer failures,
compared to code walkthroughs.
Thorough code review can be expensive, however. Jones found
that preparing for code inspection took one hour per 150 lines of
code, and the actual inspection covered 75 lines of code per hour.
Having up to three people on the inspection team improves the
quality of inspection; beyond that, more inspectors doesn't seem
to help. Spending a lot of time preparing for inspection did not
seem to be useful, either. Perhaps this is because much of the
value of inspection lies in the interaction with the coders.
**Formal methods** use the power of mathematics and logic to validate
program behavior. *Verification* uses the program code and its
specifications to construct a proof that the program behaves correctly
on all possible inputs. There are research tools available to help with
program verification, often based on automated theorem provers, as well
as research languages that are designed for program verification.
Verification tends to be expensive and to require thinking carefully
about and deeply understanding the code to be verified. So in practice,
it tends to be applied to code that is important and relatively short.
Verification is particularly valuable for critical systems where testing
is less effective. Because their execution is not deterministic,
concurrent programs are hard to test, and sometimes subtle bugs can only
be found by attempting to verify the code formally. In fact, tools to
help prove programs correct have been getting increasingly effective and
some large systems have been fully verified, including compilers,
processors and processor emulators, and key pieces of operating systems.
**Testing** involves actually executing the program on sample inputs to
see whether the behavior is as expected. By comparing the actual results
of the program with the expected results, we find out whether the
program really works on the particular inputs we try it on. Testing can
never provide the absolute guarantees that formal methods do, but it is
significantly easier and cheaper to do. It is also the validation
methodology with which you are probably most familiar. Testing is a
good, cost-effective way of building confidence in correct program
behavior.
## Test coverage
We would like to know that a program works on all possible inputs. The
problem with testing is that it is usually infeasible to try all the
possible inputs. For example, suppose that we are implementing a module
that provides an abstract data type for rational numbers. One of its
operations might be an addition function `plus`, e.g.:
```
(* AF: [(p,q)] represents the rational number p/q
* RI: [q] is not 0 *)
type rational = int*int
(* [create p q] is the rational number p/q.
* raises: [Invalid_argument "0"] if [q] is 0 *)
val create : int -> int -> rational
(* [plus r1 r2] is r1 + r2 *)
val plus : rational -> rational -> rational
```
What would it take to exhaustively test just this one function? We'd
want to try all possible rationals as both the `r1` and `r2` arguments.
A rational is formed from two ints, and there are \\(2^{63}\\) ints on a
modern OCaml implementation. Therefore there are approximately
\\((2^{63})^4 = 2^{252}\\) possible inputs to the `plus` function. Even
if we test one addition every nanosecond, it will take about \\(10^{59}\\) years
to finish testing this one function.
Clearly we can't test software exhaustively. But that doesn't mean we
should give up on testing. It just means that we need to think carefully
about what our test cases should be so that they are as effective as
possible at convincing us that the code works.
Consider our `create` function, above. It takes in two integers `p` and
`q` as arguments. How should we go about selecting a relatively small
number of test cases that will convince us that the function works
correctly on all possible inputs? We can visualize the space of all
possible inputs as a large square:
![](create_inputs.gif)
There are about \\(2^{126}\\) points in this square, so we can't afford
to test them all. Nor would we want to: most of the possible inputs
provide nothing new. We need a
way to find a set of points in this space to test that are interesting
and will give a good sense of the behavior of the program across the
whole space.
Input spaces generally comprise a number of subsets in which the
behavior of the code is similar in some essential fashion across the
entire subset. We don't get any additional information by testing more
than one input from each such subset.
If we test all the interesting regions of the input space, we have
achieved good *coverage*. We want tests that in some useful sense
cover the space of possible program inputs.
Two good ways of achieving coverage are *black-box testing*
and *glass-box testing*.
## Black-box testing
In selecting our test cases for good coverage, we might want to consider
both the specification and the implementation of the program or
module being tested. It turns out that we can often do a pretty good job
of picking test cases by just looking at the specification and ignoring
the implementation. This is known as **black-box testing**. The idea is
that we think of the code as a black box about which all we can see is
its surface: its specification. We pick test cases by looking at how the
specification implicitly introduces boundaries that divide the space of
possible inputs into different regions.
When writing black-box test cases, we ask ourselves what set of test
cases will produce distinctive behavior as predicted by the
specification. It is important to try out both *typical inputs* and
inputs that are *boundary cases*, aka *corner cases* or *edge cases*. A
common error is to only test typical inputs, with the result that the
program usually works but fails in less frequent situations. It's
also important to identify ways in which the specification creates
classes of inputs that should elicit similar behavior from the
function, and to test on those *paths through the specification*.
Here are some examples.
**Example 1.**
Here are some ideas for how to test the `create` function:
- Looking at the square above, we see that it has boundaries at
`min_int` and `max_int`. We want to
try to construct rationals at the corners and along the sides of the
square, e.g. `create min_int min_int`, `create max_int 2`, etc.
- The line p=0 is important because p/q is zero all along it. We
should try (0,q) for various values of q.
- We should try some typical (p,q) pairs in all four quadrants of the
space.
- We should try both (p,q) pairs in which q divides evenly into p, and
pairs in which q does not divide into p.
- Pairs of the form (1,q),(-1,q),(p,1),(p,-1) for various p and q also
may be interesting given the properties of rational numbers.
The specification also says that the code will check that q is not zero.
We should construct some test cases to ensure this checking is done as
advertised. Trying (1,0), (max_int,0), (min_int,0), (-1,0), (0,0) to
see that they all raise the specified exception would
probably be an adequate set of black-box tests.
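For concreteness, here is a minimal sketch of a few of those tests,
written with the OUnit2 library and assuming the interface above is
packaged in a module named `Rational` (the module name and test names
are illustrative):
```
open OUnit2

(* A few of the black-box tests described above. *)
let create_tests = "create" >::: [
  (* a corner and a side of the input square *)
  "corner" >:: (fun _ -> ignore (Rational.create min_int min_int));
  "side"   >:: (fun _ -> ignore (Rational.create max_int 2));
  (* the line p = 0 *)
  "zero_p" >:: (fun _ -> ignore (Rational.create 0 7));
  (* q = 0 must raise the specified exception *)
  "zero_q" >:: (fun _ ->
    assert_raises (Invalid_argument "0")
      (fun () -> Rational.create 1 0));
]

let _ = run_test_tt_main create_tests
```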
**Example 2.**
Consider the function `list_max`:
```
(* Return the maximum element in the list. *)
val list_max: int list -> int
```
What is a good set of black-box test cases? Here the input space is the
set of all possible lists of ints. We need to try some typical inputs
and also consider boundary cases. Based on this spec, boundary cases
include the following:
- A list containing one element. In fact, an empty list is probably
the first boundary case we think of. Looking at the spec above, we
realize that it doesn't specify what happens in the case of an empty
list. Thus, thinking about boundary cases is also useful in
identifying errors in the specification.
- A list containing two elements.
- A list in which the maximum is the first element. Or the last
element. Or somewhere in the middle of the list.
- A list in which every element is equal.
- A list in which the elements are arranged in ascending sorted order,
and one in which they are arranged in descending sorted order.
- A list in which the maximum element is `max_int`, and a list in which
the maximum element is `min_int`.
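A sketch of these as OUnit2 tests, assuming `list_max` lives in a
hypothetical module `Lists`; the empty list is deliberately omitted,
since the spec doesn't say what should happen for it:
```
open OUnit2

(* Black-box tests for [list_max], one per boundary case above. *)
let list_max_tests = "list_max" >::: [
  "singleton"  >:: (fun _ -> assert_equal 1 (Lists.list_max [1]));
  "two"        >:: (fun _ -> assert_equal 2 (Lists.list_max [1; 2]));
  "max first"  >:: (fun _ -> assert_equal 3 (Lists.list_max [3; 1; 2]));
  "max middle" >:: (fun _ -> assert_equal 3 (Lists.list_max [1; 3; 2]));
  "max last"   >:: (fun _ -> assert_equal 3 (Lists.list_max [1; 2; 3]));
  "all equal"  >:: (fun _ -> assert_equal 5 (Lists.list_max [5; 5; 5]));
  "descending" >:: (fun _ -> assert_equal 3 (Lists.list_max [3; 2; 1]));
  "max_int"    >:: (fun _ ->
    assert_equal max_int (Lists.list_max [0; max_int]));
]

let _ = run_test_tt_main list_max_tests
```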
**Example 3.**
Consider the function `sqrt`:
```
(* [sqrt x n] is the square root of [x] computed to an accuracy of [n]
* significant digits.
* requires: [x >= 0] and [n >= 1] *)
val sqrt : float -> int -> float
```
The precondition identifies two possibilities for `x` (either it is zero
or greater) and two possibilities for `n` (either it is one or greater).
That leads to four "paths through the specification", i.e., representative
and boundary cases for satisfying the precondition, which we should test:
- `x` is zero and `n` is 1
- `x` is greater than zero and `n` is 1
- `x` is zero and `n` is greater than 1
- `x` is greater than zero and `n` is greater than 1.
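A sketch of those four tests, assuming `sqrt` is provided by a
hypothetical module `Math`. Because the result is only accurate to `n`
significant digits, the tests compare within a tolerance rather than
demanding exact float equality (the tolerances chosen here are
illustrative):
```
open OUnit2

(* [close eps a b] is whether [a] and [b] differ by less than [eps]. *)
let close eps a b = abs_float (a -. b) < eps

(* One test per path through the specification of [sqrt]: loose
   tolerance for [n = 1], tighter tolerance for larger [n]. *)
let sqrt_tests = "sqrt" >::: [
  "x = 0, n = 1" >:: (fun _ ->
    assert_bool "sqrt 0 to 1 digit" (close 0.5 0. (Math.sqrt 0. 1)));
  "x > 0, n = 1" >:: (fun _ ->
    assert_bool "sqrt 4 to 1 digit" (close 0.5 2. (Math.sqrt 4. 1)));
  "x = 0, n > 1" >:: (fun _ ->
    assert_bool "sqrt 0 to 5 digits" (close 1e-4 0. (Math.sqrt 0. 5)));
  "x > 0, n > 1" >:: (fun _ ->
    assert_bool "sqrt 4 to 5 digits" (close 1e-4 2. (Math.sqrt 4. 5)));
]

let _ = run_test_tt_main sqrt_tests
```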
**Summary.**
Black-box testing has some important advantages:
- It doesn't require that we see the code we are testing. Sometimes
code will not be available in source code form, yet we can still
construct useful test cases without it. The person writing the test
cases does not need to understand the implementation.
- The test cases do not depend on the implementation. They can be
written in parallel with or before the implementation. Further, good
black-box test cases do not need to be changed, even if the
implementation is completely rewritten.
- Constructing black-box test cases causes the programmer to think
carefully about the specification and its implications. Many
specification errors are caught this way.
The disadvantage of black box testing is that its coverage may not be as
high as we'd like, because it has to work without the implementation.
## Glass-box testing
Black-box testing is a good place to start when writing test cases, but
ultimately it is not enough. In particular, it's not possible to
determine how much coverage of the implementation a black-box test suite
actually achieves—to know that, we need the implementation
source code. Testing based on that code is known as *glass box* or
*white box* testing. Glass-box testing can improve on black-box by
testing *execution paths* through the implementation code: the series
of expressions that is conditionally evaluated based on if-expressions,
match-expressions, and function applications. Test cases that
collectively exercise all paths are said to be *path-complete*. At a
minimum, path-completeness requires that for every line of code, and
even for every expression in the program, there should be a test case
that causes it to be executed. Any unexecuted code could contain a
bug that testing will never reveal.
For true path completeness we must consider all possible execution paths
from start to finish of each function, and try to exercise every
distinct path. In general this is infeasible, because there are too many
paths. A good approach is to think of the set of paths as
the space that we are trying to explore, and to identify boundary cases
within this space that are worth testing.
For example, consider the following implementation of a function that
finds the maximum of its three arguments:
```
let max3 x y z =
if x>y then
if x>z then x else z
else
if y>z then y else z
```
Black-box testing might lead us to invent many tests, but looking
at the implementation reveals that there are only four paths through
the code—the paths that return `x`, `z`, `y`, or `z` (again).
We could test each of those paths with representative inputs such as:
3,2,1; 3,2,4; 1,2,1; 1,2,3.
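Those four cases translate directly into a small OUnit2 suite;
a sketch, assuming the `max3` above is in scope:
```
open OUnit2

(* One representative input per execution path through [max3]. *)
let max3_tests = "max3" >::: [
  "x>y, x>z: returns x"   >:: (fun _ -> assert_equal 3 (max3 3 2 1));
  "x>y, x<=z: returns z"  >:: (fun _ -> assert_equal 4 (max3 3 2 4));
  "x<=y, y>z: returns y"  >:: (fun _ -> assert_equal 2 (max3 1 2 1));
  "x<=y, y<=z: returns z" >:: (fun _ -> assert_equal 3 (max3 1 2 3));
]

let _ = run_test_tt_main max3_tests
```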
When doing glass-box testing, we should include test cases for each
branch of each (nested) if expression, and each branch of each
(nested) pattern match. If there are recursive functions,
we should include test cases for the base cases as well as each
recursive call. Also, we should include test cases to trigger
each place where an exception might be raised.
Of course, path-complete testing does not guarantee an absence of
errors. We still need to test against the specification, i.e.,
do black-box testing. For example, here is a broken implementation
of `max3`:
```
let max3 x y z =
x
```
The single test `max3 2 1 1` is path-complete for this implementation,
but doesn't reveal the error.
Glass-box testing can be aided by *code-coverage tools* that assess
how much of the code has been exercised by a test suite. The
[bisect][] tool for OCaml can tell you which expressions in your
program have been tested, and which have not.
[bisect]: http://bisect.x9c.fr/
## Testing data abstractions
When testing a data abstraction, a simple first step is to look at the
abstraction function and representation invariant for hints about what
boundaries may exist in the space of values manipulated by a data
abstraction. The rep invariant is a particularly effective tool for
constructing useful test cases. Looking at the rep invariant of the
rational data abstraction above, we see that it requires that q is
non-zero. Therefore we should construct test cases that make q as close
to 0 as possible, i.e. 1 or -1.
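For instance, the following sketch (again assuming a module `Rational`
implementing the interface from earlier) pushes the denominator right
up against that boundary:
```
open OUnit2

(* Tests suggested by the rep invariant: denominators of 1 and -1 are
   as close as possible to the forbidden value 0, so exercise them. *)
let ri_tests = "rep invariant boundary" >::: [
  "q = 1"  >:: (fun _ ->
    ignore Rational.(plus (create 1 1) (create 2 1)));
  "q = -1" >:: (fun _ ->
    ignore Rational.(plus (create 1 (-1)) (create 2 (-1))));
]

let _ = run_test_tt_main ri_tests
```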
We should also test how each *consumer* of the data abstraction handles
every path through each *producer* of it. A consumer is an operation
that takes a value of the data abstraction as input, and a producer is
an operation that returns such a value.
For example, consider this set abstraction:
```
module type Set = sig
(* ['a t] is the type of a set whose elements have type ['a]. *)
type 'a t
(* [empty] is the empty set. *)
val empty : 'a t
(* [size s] is the number of elements in [s].
* [size empty] is [0]. *)
val size : 'a t -> int
(* [add x s] is a set containing all the elements of
* [s] as well as element [x]. *)
val add : 'a -> 'a t -> 'a t
(* [mem x s] is [true] iff [x] is an element of [s]. *)
val mem : 'a -> 'a t -> bool
end
```
The `empty` and `add` functions are producers, and the `size`, `add`,
and `mem` functions are consumers. So we should test how
* `size` handles the `empty` set;
* `size` handles a set produced by `add`, both when `add` leaves the
set unchanged and when it grows the set;
* `add` handles sets produced by `empty` as well as `add` itself;
* `mem` handles sets produced by `empty` as well as `add`, including
paths where `mem` is invoked on elements that have been added as
well as elements that have not.
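Concretely, assuming some structure `ListSet : Set` (any implementation
of the signature will do), those cases might look like this sketch:
```
open OUnit2

(* Consumer/producer tests for an implementation [ListSet : Set]. *)
let set_tests = "set" >::: [
  "size of empty" >:: (fun _ ->
    assert_equal 0 ListSet.(size empty));
  "size after growing add" >:: (fun _ ->
    assert_equal 1 ListSet.(size (add 1 empty)));
  "size after redundant add" >:: (fun _ ->
    assert_equal 1 ListSet.(size (add 1 (add 1 empty))));
  "add to a set made by add" >:: (fun _ ->
    assert_equal 2 ListSet.(size (add 2 (add 1 empty))));
  "mem of empty" >:: (fun _ ->
    assert_bool "nothing in {}" (not ListSet.(mem 1 empty)));
  "mem of added element" >:: (fun _ ->
    assert_bool "1 in {1}" ListSet.(mem 1 (add 1 empty)));
  "mem of absent element" >:: (fun _ ->
    assert_bool "2 not in {1}" (not ListSet.(mem 2 (add 1 empty))));
]

let _ = run_test_tt_main set_tests
```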
## Randomized testing
*Randomized testing*, aka *fuzz testing*, is the process of generating
random inputs and feeding them to a program or a function to see whether
the program behaves correctly. The immediate issue is how to determine
what the correct output is for a given input. If a *reference implementation*
is available—that is, an implementation that is believed to be correct
but in some other way does not suffice (e.g., its performance is too slow,
or it is in a different language)—then the outputs of the two
implementations can be compared. Otherwise, perhaps some *property*
of the output could be checked. For example,
* "not crashing" is a property of interest in user interfaces;
* adding \\(n\\) elements to a data collection
then removing those elements, and ending up with an empty collection,
is a property of interest in data structures; and
* encrypting a string under a key then decrypting it under that key
and getting back the original string is a property of interest
in an encryption scheme like Enigma.
Randomized testing is an incredibly powerful technique. It is often
used in testing programs for security vulnerabilities. The
[`qcheck` package][qcheck] for OCaml supports randomized testing.
[qcheck]: https://github.com/c-cube/qcheck
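To give the flavor, here is a sketch of a qcheck test for a simple
property in the same spirit as the examples above—that reversing a
list twice yields the original list:
```
(* A property-based test with the qcheck package: qcheck generates
   many random int lists and checks the property on each of them. *)
let rev_twice =
  QCheck.Test.make ~count:1000 ~name:"rev (rev l) = l"
    QCheck.(list int)
    (fun l -> List.rev (List.rev l) = l)

let _ = QCheck_runner.run_tests [rev_twice]
```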
## Debugging
The word "bug" suggests something that wandered into a program.
Better terminology would be that there are
* *faults*, which are the result of human errors in software systems, and
* *failures*, which are violations of requirements.
Some faults might never appear to an end user of a system, but failures
are those faults that do. A fault might result because an implementation
doesn't match design, or a design doesn't match the requirements.
*Debugging* is the process of discovering and fixing faults. Testing
clearly is the "discovery" part, but fixing can be more complicated.
Debugging can take even more time than the original
implementation did! So you would do well to make it easy to debug
your programs from the start. Write good specifications for each function.
Document the AF and RI for each data abstraction. Keep modules small,
and test them independently. Utilize both black-box and glass-box
testing.
Inevitably, though, you will discover faults in your programs. When
you do, approach them as a scientist by employing the *scientific method:*
* evaluate the data that are available;
* formulate a hypothesis that might explain the data;
* design a repeatable experiment to test that hypothesis; and
* use the result of that experiment to refine or refute your hypothesis.
Often the crux of this process is finding the simplest, smallest input
that triggers a fault. That's not usually the original input for
which we discover a fault. So some initial experimentation might be needed
to find a *minimal test case*.
Never be afraid to write additional code, even a lot of additional code,
to help you find faults. Functions like `to_string` or `format` can
be invaluable in understanding computations, so writing them up front
before any faults are detected is completely worthwhile.
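For example, a `to_string` for the rational abstraction above is only
a line of code, yet it pays for itself the first time an intermediate
value needs inspecting; a sketch:
```
(* [to_string r] renders a rational as, e.g., "3/4"; handy for
   printing intermediate values while chasing a fault. *)
let to_string ((p, q) : rational) : string =
  string_of_int p ^ "/" ^ string_of_int q
```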
When you do discover the source of a fault, be extra careful in fixing
it. It is tempting to slap a quick fix into the code and move on. This
is quite dangerous. Far too often, fixing a fault just introduces a new
(and unknown) fault! If a bug is difficult to find, it is often because
the program logic is complex and hard to reason about. You should think
carefully about why the bug happened in the first place and what the
right solution to the problem is. *Regression testing* (i.e., recording
test cases that originally failed, and re-running them after every change
to ensure they still pass) is important
whenever a bug fix is introduced, but nothing can replace careful
thinking about the code.
## Defensive programming
*Defensive programming* is a form of *proactive debugging*: implementing
code that will later be easy to debug. Some excellent techniques include
the following:
* *Assert preconditions:* If the precondition fails, your function is permitted
to do anything, so raising an exception because an assertion fails is fair game.
Asserting also guarantees that faults are detected sooner rather than
later—the longer a fault goes undetected, the harder it is to attribute
blame to the right place.
* *Assert invariants, including rep invariants:* for the same reasons
as preconditions.
* *Exhaustively check any conditionals:* A *conditional* is a language
construct that executes one piece of code or another based on the value
of some other piece of code. In OCaml, `if`, `match`, and `function`
are all conditionals. This defensive programming technique says to
always check every possible value of a conditional. For
example, you might be expecting a string value to be one of two possibilities,
`"a"` or `"b"`. Your conditional should explicitly check for both of those,
and if the value is neither, do something to signal that a fault has
occurred, such as raising an exception, as the sketch below illustrates.
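Here is one such sketch; the function name, its asserted precondition,
and the expected values are all hypothetical:
```
(* [action_of s] maps "a" to 0 and "b" to 1. The precondition is
   asserted on entry, and the match handles every possible string:
   an unexpected value loudly signals a fault instead of silently
   misbehaving. *)
let action_of (s : string) : int =
  assert (String.length s = 1);  (* hypothetical precondition *)
  match s with
  | "a" -> 0
  | "b" -> 1
  | _ -> failwith ("action_of: expected \"a\" or \"b\", got " ^ s)
```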
Sometimes programmers worry unnecessarily that defensive programming
will be too expensive—either in terms of the time it costs
them to implement the checks initially, or in the run-time costs that
will be paid in checking assertions. These concerns are far too often
misplaced. The time and money it costs society to repair faults in
software suggests that we could all afford to have programs that
run a little more slowly.
## Terms and concepts
* asserting
* black box
* boundary case
* bug
* code inspection
* code review
* code walkthrough
* consumer
* debugging by scientific method
* defensive programming
* failure
* fault
* formal methods
* glass box
* inputs for classes of output
* inputs that satisfy precondition
* inputs that trigger exceptions
* minimal test case
* pair programming
* path coverage
* paths through implementation
* paths through specification
* producer
* randomized testing
* regression testing
* representative inputs
* social methods
* testing
* typical input
* validation
## Further reading
* [*Program Development in Java: Abstraction, Specification, and
Object-Oriented Design*][liskov-guttag], chapter 10, by Barbara
Liskov with John Guttag.
[liskov-guttag]: https://newcatalog.library.cornell.edu/catalog/9494027