How do we know our software works? Well, mostly we don’t. If you’re like me, you’ve seen enough bugs to be humbled to our species’s inherent inability to write perfect code.

To mitigate the risk of these errors, we write tests—unit tests, property tests, integration tests, acceptance tests, load tests, accessibility tests, QA tests—you get the point. They give us empirical certainty that our programs work the way we expect them to.

Last week, I learned that we can do one better than empirical verification. We can mathematically prove our code works. In this post, I hope to demystify the role of formal proofs in software by placing them in context of more traditional approaches to testing. Specifically, we’ll compare verification strategies in Python, a normal language, and Coq, a language that allows you to prove properties about code you write in it.

Testing Sort in Python

Let’s say I’ve written an implementation of quicksort in Python, and I want to make sure it works. My first approach might be to write some unit tests.

import unittest

class TestSort(unittest.TestCase):
    def test_empty(self):
        self.assertEqual(quicksort([]), [])

    def test_sorted(self):
        self.assertEqual(quicksort([1,2,3]), [1,2,3])

    def test_unsorted(self):
        self.assertEqual(quicksort([1,4,3,2]), [1,2,3,4])

After writing these tests, I now know that quicksort more or less does what I expect it to. It sorts a list. For additional certainty, I can go a step further and use Python’s Hypothesis library to write tests against properties of my sorting algorithm. For example, I could test that it doesn’t change the length of the list it sorts.

from hypothesis import given
from hypothesis import strategies as st

class TestSort(unittest.TestCase):
    def test_same_length(self, numbers):
        self.assertEqual(len(quicksort(numbers)), len(numbers))

This property might seem trivial to test, yet even these small checks can catch subtle bugs. Here’s an incorrect version of quicksort in Python.

def quicksort(numbers):
    if len(numbers) == 0:
        return numbers
        first = numbers[0]
        remaining = numbers[1:]
        sorted_remaining = quicksort(remaining)

        less_than = [number for number in sorted_remaining if number < first]
        greater_than = [number for number in sorted_remaining if number > first]
        return less_than + [first] + greater_than

The bug here is both less_than and greater_than filter by a strict inequality. That means if there is another value in remaining equal to first, then it will be filtered out. Then quicksort will return a list shorter than the one we passed to it. The bug is subtle enough to pass our original unit tests, but not our property test. Hypothesis is able to find a failing edge case.

-> % python
.Falsifying example: test_same_length(self=<__main__.TestSort testMethod=test_same_length>, numbers=[0, 0])

Sure enough, quicksort([0, 0]) returns [0], not [0, 0]. The bug can easily be fixed by changing one of the strict inequalities to a non-strict one. It’s a one-character fix.

        less_than = [number for number in sorted_remaining if number <= first]
        greater_than = [number for number in sorted_remaining if number > first]

Now tests pass! At this point, I’m pretty confident that my quicksort doesn’t change the length of a list. But can I be more certain?

Proving Sort in Coq

That was a rhetorical question—of course I can!

We use both Hypothesis and Coq to affirm properties about programs. The difference is that with Hypothesis you test properties empirically, where with Coq you formally prove them. Let’s do a proof to see what that looks like.

Below, I’ve written an implementation of quicksort in Coq. While the particulars of the syntax might seem unfamiliar, you can generally see that it’s doing the same thing as the Python version.

Definition filter_lte (a: nat) list := filter (fun n=> n <=? a) list.
Definition filter_gt (a: nat) list := filter (fun n=> negb(n <=? a)) list.

Fixpoint quicksort (l: list nat) :=
    match l with
      nil => nil
    | h :: t => let sorted_t := quicksort t in
                    filter_lte h sorted_t
                    ++ h :: filter_gt h sorted_t

Sure enough, it behaves as we’d expect it to.

> Compute quicksort (nil).
= nil : list nat
> Compute quicksort (2::6::1::2::1::0::nil).
= 0 :: 1 :: 1 :: 2 :: 2 :: 6 :: nil : list nat

Now, we want to check that the length of a list remains unchanged once we run it through quicksort. Rather than write a test, we’re going to write this property as a theorem.

Theorem sortLengthUnchanged :
    forall l:(list nat),
    length l = length (quicksort l).

Coq is meant to be written in an IDE. Once you start writing a proof, the IDE enters an interactive proof-proving session. We proceed line by line, unpacking the definitions of quicksort and length, applying axioms and lemmas until we we prove our theorem.

Since Coq is optimized for this interactive session, its proofs verge on being unreadable. You can see how impenetrable our finished quicksort theorem is.

Theorem sortLengthUnchanged :
    forall l:(list nat),
    length l = length (quicksort l).
    induction l.
    rewrite IHl.
    rewrite lengthIsDistributive.
    rewrite Nat.add_succ_r with
        (n:=length(filter_lte a (quicksort l)))
        (m:=(length (filter_gt a (quicksort l)))).
    rewrite twoFilterLengthEq.

Without going into the proof’s details, which are tedious, you can get a glimpse on how the code makes more sense when run in the context of the interactive session. Watch the right panel, where you can see the proof evolve. Again, don’t worry about the details. I just want to illustrate how Coq proofs are experienced.

gif showing what running the above program line by line looks like in a Coq interpreter.
An interactive proof session in Coq.

In most programming languages, readability matters. It’s not enough to believe your code works. In Coq, because you actually have true certainty about properties of your code, you almost never need to revisit how you arrived at it. It’s enough to know it’s true. Most software is written with far less than mathematical certainty.

Proofs versus Property Testing

There are tradeoffs to the certainty you get from using Coq. For one, proofs are a poor tool for debugging. Property testing, on the other hand, can help you identify edge cases you would not have otherwise anticipated. With Coq, it’s hard to prove something you don’t already have a good sense is true.

Property testing also provides a framework that makes refactoring easy. If I wanted to change my implementation of quicksort, or if I wanted to use a different sorting algorithm altogether, I could keep my property tests unchanged. I would still expect them to pass if my algorithm was correct, and I’d expect them to catch the error if it was not. With Coq, I would have to prove the properties again from scratch.

However, if you do need to show that you absolutely know a fact about your program, then Coq is an awesome tool. If your code needs to satisfy legal regulations, or if the result of those properties not holding can have catastrophic consequences, then it could be worth the extra effort to show that your programs truly satisfy the conditions you say they do.

In any event, learning Coq was really fun, and it made me think differently about the levels of certainty we can have about software. I encourage you to try it out.

Here are links for the Coq code and Python code referenced in this post. If you’re looking for a Coq tutorial, I really liked this one.