Our Test Gap analysis and Test Impact analysis automatically monitor the execution of your application to determine which code is covered in tests. As a result, they are objective with respect to what has been tested and, more importantly, what hasn’t. No personal judgement or gut feeling involved.
However, when we first set up the analyses with our customers, we often find that the measurements differ (significantly!) from their expectations. Often, this is because other coverage tools report different coverage numbers. This post explores the causes of such differences.
We usually refer to code coverage as a percentage, as in “we have 83.5% coverage on our software system”. To compute such a percentage, we divide the amount of code that has been executed in tests by the total amount of code in the system. When two coverage tools report different coverage, this means that they disagree with regard to either one or both of these amounts.
As we will see, this might be a disagreement in the absolute values as well as in the unit by which they measure the amount, e.g., number of statements vs. number of code branches.
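To make the arithmetic concrete, here is a minimal sketch of how the same test run can yield very different percentages depending on the unit being counted. All numbers are made up for illustration:

```python
def coverage_percent(covered, total):
    """Coverage as a percentage: covered units divided by total units."""
    if total == 0:
        return 0.0
    return 100.0 * covered / total

# Hypothetical measurements of the same test run, counted in two units:
statement_cov = coverage_percent(covered=1670, total=2000)  # statements
method_cov = coverage_percent(covered=150, total=240)       # methods

print(f"{statement_cov:.1f}% statement vs. {method_cov:.1f}% method coverage")
# 83.5% statement vs. 62.5% method coverage
```

The same system, the same tests, yet the reported number changes with the unit.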
Some coverage tools measure statement coverage, others line coverage, method coverage, file coverage, basic-block coverage, branch coverage, path coverage, or yet another type of coverage. These different types of coverage are not directly comparable, because they count different things.
For Test Gap analysis and Test Impact analysis, Teamscale considers method coverage. In the Tests perspective, it reports line coverage. So if Teamscale reports different coverage than your other coverage tool, first check whether you are comparing apples and oranges.
On a related note: It is a common misunderstanding that line coverage and statement coverage are the same thing. You can easily see the difference in a simple example program with just two method-call statements on the same line:
a.foo(); a.bar();
Assume that a test executes this program and foo() throws an exception, which means that bar() is not invoked. Line coverage would report that one line out of the one-line program was executed, i.e., we have 1/1 = 100% coverage. Statement coverage, on the other hand, would report that of the two call statements the first was executed and the second wasn't, i.e., 1/2 = 50% coverage. A huge difference!
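This can be recreated as a small, hypothetical Python experiment; the line and statement counts are tallied by hand here rather than by a real coverage tool:

```python
# Hypothetical recreation of the one-line, two-statement program.
calls = []

class A:
    def foo(self):
        calls.append("foo")
        raise RuntimeError("boom")   # aborts the line mid-way
    def bar(self):
        calls.append("bar")

a = A()
try:
    a.foo(); a.bar()                 # one line, two call statements
except RuntimeError:
    pass

line_coverage = 1 / 1                # the single line was (partially) executed
statement_coverage = len(calls) / 2  # only foo() ran, bar() was never reached
print(f"line: {line_coverage:.0%}, statement: {statement_coverage:.0%}")
# line: 100%, statement: 50%
```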
We find that there are huge differences in how coverage tools determine the total number of statements in the code. Most notably, some coverage tools count only those statements that were loaded during test execution. For example, dotCover (.NET) counts only statements in loaded assemblies, and coverage.py (Python) considers only files that are loaded, unless you explicitly specify the location of all your code via the --source parameter. Both tools, therefore, ignore any code that was never loaded, even though this code was obviously not covered in tests.
As a result, if you write an additional test covering a part of some code that was previously not loaded, the reported coverage may actually go down!
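A back-of-the-envelope example, with made-up module names and numbers, shows the effect:

```python
# Sketch of a tool that only counts loaded code (all numbers made up).
# module_a: 1000 statements, 800 covered by tests.
# module_b: 1000 statements, never loaded and therefore invisible to the tool.

covered, counted = 800, 1000      # only module_a enters the computation
before = covered / counted        # reported: 80%

# A new test loads module_b and covers 50 of its statements.
covered_after = 800 + 50
counted_after = 1000 + 1000       # module_b suddenly enters the denominator
after = covered_after / counted_after

print(f"before: {before:.1%}, after: {after:.1%}")
# before: 80.0%, after: 42.5% -- more testing, lower reported coverage
```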
Also, many coverage tools simply count statements from all executed code. In practice, however, we usually want to distinguish certain types of code, such as test code, internal tools, or generated code, and treat them differently.
Teamscale counts all statements in the source code as they appear in your version-control system, except for what you configure as an explicit exclude (test code, internal tools, generated code, etc.) in your project configuration. This is literally all the code that you declare relevant for testing, regardless of what gets loaded during test execution. Code that is not loaded during testing is consequently counted as not covered. Therefore, Teamscale may report much lower test coverage than other tools. However, with Teamscale, additional testing efforts will only ever increase your test coverage, which gives you a reliable measure to base your decisions on.
Usually, coverage tools don’t simply include all code when computing coverage, but rather only executable code. The rationale is that code that isn’t executable cannot ever be covered in tests and should, therefore, be excluded when measuring coverage. Which code should be counted as “executable” is a widely disputed topic, however. For example, some coverage tools count lines containing only curly braces as executable, while others exclude them. And some coverage tools for dynamic languages, such as Python, count lines containing class or method declarations as executable, while tools for other languages, such as Java or C#, typically exclude them.
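For Python specifically, declarations really are executed at definition time. The following hypothetical snippet uses side effects to show that the class body and the def statement run, while the method body does not:

```python
# Hypothetical experiment: in Python, class and def statements are
# themselves executed, which is why Python coverage tools count them
# as executable lines.
log = []

def note(msg):
    log.append(msg)
    return lambda f: f          # identity decorator, used only to observe execution

class Greeter:                  # executing this class statement runs its body...
    note("class body ran")
    @note("def statement ran")  # ...including the def statement (the decorator
    def greet(self):            # expression is evaluated when the def executes)
        note("greet body ran")  # never reached: greet() is not called

print(log)  # ['class body ran', 'def statement ran']
```

The method body stays uncovered, but the class and def lines already count as executed, which a Java or C# tool would typically not even consider executable.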
This is not to say that one way to count is necessarily better or worse than another or even wrong. It just means that there are some degrees of freedom, which lead to (sometimes even huge) differences between almost any two coverage tools out there, as one tool will count a certain line that another tool excludes, and vice versa.
To see which lines Teamscale considers executable, you can enable the Annotate test coverage option in the right sidebar of the Code perspective. Lines shown in red, green, or yellow are considered executable, while white lines are not.
Code coverage may also be approximated statically, for example, with nDepend (.NET). The idea is to use your tests as entry points and compute which code is reachable from any of them. This approach is tempting, because it does not require any code execution and is, therefore, quite fast. Note, however, that there is a huge difference between which code is statically reachable and which code is actually executed in a test: statically reachable code may never actually run (e.g., a branch that no test input triggers), while dynamically dispatched or reflective calls may not be visible to static analysis at all.
For these reasons, static reachability and coverage are simply two different measures that are not comparable.
As we have seen, the seemingly simple term “coverage” stands for quite a variety of different measures in different flavors. Whether you use Teamscale or not, you should always be aware of what you measure and for what purpose you use these measurements.
For example, if you want to know exactly how well you’ve tested some piece of code, you might consider branch or even path coverage. Achieving high branch or path coverage is usually quite expensive in terms of test runtime though, because you need lots and lots of test cases that add up to a long-running test suite. Additionally, with many technologies, measuring these kinds of coverage itself causes a significant runtime overhead.
Often, a much cheaper approach to increasing the overall effectiveness of your testing is to identify changes that were not tested at all. Teamscale’s Test Gap analysis can do this for you, based on relatively lightweight method-coverage measurements, and we’ve seen it reduce field defects by almost a quarter.
Alternatively, or in addition, you may counter increasing test runtimes by selectively running only relevant tests. Using Teamscale’s Test Impact analysis, which also requires only method coverage, may uncover 90% of the failing tests in only 2% of the test runtime, for example.
Choose your weapons wisely.