Sunday, February 14, 2016

A different kind of test metric - test depth

A better metric for test quality

How do you measure the quality of your tests? Test coverage is one of the most widely used metrics: it shows the percentage of code lines that are executed by unit tests.

This metric says nothing about the quality of the tests, though. It serves as an alarm bell when coverage is low, but high coverage does not guarantee that the tests are good.

One problem: consider a scenario where we only have high-level tests. Let's say we have an application that calculates the weight of a 3D printed model. The application has three layers:

[Figure: the three application layers, with the User interface layer at the top and the Calculation layer at the bottom]

Say that we have a single test that tests the "calculate" button in the User interface layer:

[Test]
public void CalculateButton_Clicked_ShowsCalculationProgress()
{
  // ... verify some GUI state here
}

Even if we intend to test only a GUI feature, the rest of the layers may still be covered by this test. We never verify the calculation results, yet the Calculation layer may end up with 100% code coverage. Clearly, test coverage is not a good metric here.
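
To make this concrete, here is a minimal sketch of the scenario. All class names (WeightCalculator, PrintModelService, CalculatorView) are made up for illustration, and the real layers would of course be larger. The single test asserts only on GUI state, yet running it executes every layer:

using NUnit.Framework;

// Calculation layer: never asserted on, yet fully executed by the GUI test.
class WeightCalculator
{
    public double Weight(double volumeCm3, double densityGPerCm3)
        => volumeCm3 * densityGPerCm3;
}

// Middle layer (name assumed for illustration).
class PrintModelService
{
    readonly WeightCalculator calculator = new WeightCalculator();

    public double EstimateWeight(double volumeCm3)
        => calculator.Weight(volumeCm3, 1.24); // density of e.g. PLA plastic
}

// User interface layer.
class CalculatorView
{
    readonly PrintModelService service = new PrintModelService();

    public bool ProgressVisible { get; private set; }

    public void OnCalculateClicked(double volumeCm3)
    {
        ProgressVisible = true;
        service.EstimateWeight(volumeCm3); // result never checked by the test
    }
}

[TestFixture]
class CalculatorViewTests
{
    [Test]
    public void CalculateButton_Clicked_ShowsCalculationProgress()
    {
        var view = new CalculatorView();
        view.OnCalculateClicked(10.0);
        Assert.That(view.ProgressVisible, Is.True); // GUI state only
    }
}

A line-coverage tool will happily report WeightCalculator.Weight as covered, even though no test ever looks at the number it returns.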

I have some thoughts on a better metric.

Test depth

In the example above, the Calculation layer is tested only through the User interface layer. In other words, the Calculation layer is several levels away from the test code in the call stack. Let's call this distance between the test and the tested code the "test depth". A high test depth (for a particular test) means that the test exercises the code only indirectly, and is probably not testing it very well.
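
How would this depth be counted? As a rough illustration (TestDepthProbe is a hypothetical name, matching the attribute by name is a shortcut for NUnit's [Test], and a real tool would use coverage instrumentation rather than runtime probes), one could walk the call stack:

using System.Diagnostics;
using System.Linq;

static class TestDepthProbe
{
    // Walks the call stack from the instrumented line up to the nearest
    // method marked [Test]. Counting the test method itself as depth 1,
    // the method containing the instrumented line sits at depth i.
    // Returns -1 if no test is on the stack at all.
    public static int LowestDepthOfTest()
    {
        var frames = new StackTrace().GetFrames();
        if (frames == null)
            return -1;

        // frames[0] is this probe; frames[1] holds the instrumented line.
        for (int i = 1; i < frames.Length; i++)
        {
            var method = frames[i].GetMethod();
            if (method != null && method.GetCustomAttributes(false)
                    .Any(a => a.GetType().Name == "TestAttribute"))
                return i;
        }
        return -1;
    }
}

Called from a method that a test invokes directly, the probe returns 2; one layer further down it returns 3, and so on. Note that method inlining in release builds can shorten the observed stack.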

This gives us an improved metric: by weighting the test coverage with the test depth, we get a depth-weighted coverage. In traditional test coverage, each line is either covered or not covered:

C(line) = 0% or 100%, depending on whether the line is executed by a test or not

The coverage for an entire class is then the average over its lines:

C(class) = sum of C(line) / line_count

A depth-weighted coverage would then define the coverage of a covered line as

C(line) = 100% / (lowest_depth_of_test - 1), where lowest_depth_of_test is the smallest call-stack depth at which any test executes the line, counting the test method itself as depth 1.

If a line is executed directly by a test (depth 2), C(line) will be 100%. One level deeper (depth 3) yields 50%, and two levels deeper (depth 4) yields 33%.
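
As a sketch of the aggregation (a hypothetical helper, not an existing tool), assume we have recorded the lowest test depth of every line in a class, with 0 meaning the line is never executed by any test:

static double DepthWeightedCoverage(int[] lowestDepthPerLine)
{
    double sum = 0.0;
    foreach (int depth in lowestDepthPerLine)
    {
        if (depth >= 2)               // covered; depth 1 would be the test itself
            sum += 1.0 / (depth - 1); // C(line) = 100% / (lowest_depth_of_test - 1)
    }
    return 100.0 * sum / lowestDepthPerLine.Length;
}

For a four-line class that is only reached through two intermediate layers, DepthWeightedCoverage(new[] { 4, 4, 4, 4 }) gives about 33%, where traditional line coverage would report 100%.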

Is this the perfect metric?

Is depth-weighted coverage a perfect test metric? No; it still doesn't verify that you are asserting on the right things. It is not a silver bullet, but it does penalize the deep-test scenario above.

How do I calculate this?

I haven't found any tools that actually calculate this kind of metric. Perhaps I need to make one myself... or are there any volunteers out there? ;-)