Functional Tests as a Tree of Continuations (2010)



By Evan Miller

June 15, 2010

One of the most essential practices for maintaining the long-term quality of computer code is to write automated tests that ensure the program continues to act as expected, even when other people (including your future self) muck with it.

Test code is often longer than the code that is being tested. A former colleague estimates that the right ratio is around 3 lines of functional test code for every line of “real” code. Writing test code is often mindless and repetitive, because all you do is trace out all the steps a user could take and write down all the things that should be true about each one.

A great deal of test code simply exists to set up the tests. Sometimes you can abstract out the logic into setup() and teardown() functions, and put the fake data needed for tests into things called fixtures. Even then, you’ll often have a subset of tests that requires additional setup, and test code often becomes littered with miniature, unofficial setup functions.
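
To make this concrete, here is roughly what the setup/teardown/fixture pattern looks like in Erlang's EUnit. The test_db and signup modules are made-up stand-ins for illustration, not a real API:

-module(signup_tests).
-include_lib("eunit/include/eunit.hrl").

%% A typical EUnit fixture: the setup fun builds the fixture data, the
%% teardown fun destroys it, and the instantiator returns the test list.
%% test_db and signup are hypothetical modules used for illustration.
signup_test_() ->
    {setup,
     fun() -> test_db:load_fixtures("users") end,   % setup()
     fun(Db) -> test_db:destroy(Db) end,            % teardown()
     fun(Db) ->
         [?_assertMatch({ok, _}, signup:step_one(Db, "Continue")),
          ?_assertMatch({ok, _}, signup:step_one(Db, "Cancel"))]
     end}.

Even with a fixture like this, any test that needs state beyond the shared setup ends up growing its own ad-hoc helper.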

The problem

It’s a wonder to me that most testing environments are structured as lists of tests. Lists of tests are sensible for unit tests, but they’re a disaster for functional tests. The list structure is one of the main reasons that functional test suites are so repetitive.

Let’s say you want to test a 5-step process, and test the “Continue” and “Cancel” buttons at each step. You can only get to a step by pressing “Continue” in the previous step. With the traditional “list of tests” paradigm, you have to write something like

test_cancel_at_step_one() {
    Result = do_step_one("Cancel");
    assert_something(Result);
}

test_cancel_at_step_two() {
    Result = do_step_one("Continue");
    assert_something(Result);
    Result = do_step_two("Cancel");
    assert_something(Result);
}

test_cancel_at_step_three() {
    do_step_one("Continue"); # tested above
    Result = do_step_two("Continue");
    assert_something(Result);
    Result = do_step_three("Cancel");
    assert_something(Result);
}

test_cancel_at_step_four() {
    do_step_one("Continue"); # tested above
    do_step_two("Continue"); # tested above
    Result = do_step_three("Continue");
    assert_something(Result);
    Result = do_step_four("Cancel");
    assert_something(Result);
}

test_cancel_at_step_five() {
    do_step_one("Continue"); # tested above
    do_step_two("Continue"); # tested above
    do_step_three("Continue"); # tested above
    Result = do_step_four("Continue");
    assert_something(Result);
    Result = do_step_five("Cancel");
    assert_something(Result);
}

As you can start to see, the length of each test grows linearly with the step being tested, so a full suite covering N steps contains roughly 1 + 2 + ... + N = N(N+1)/2 test lines. In other words, the total test suite ends up being O(N²) in the number of steps.

The solution

A more appropriate data structure for functional tests is a testing tree. The tree essentially maps out the possible actions at each step. At each node, there is a set of assertions, and parent nodes pass a copy of state down to each child (representing a possible user action). Child nodes are free to modify and make assertions on the state received from the parent node, and to pass a copy of the modified state down to their own children. Nodes should never affect the state of their parents or siblings.

Let’s take a concrete example. In a 5-step process, the tree would look like:

  • Step 1
    • Cancel
    • Continue to Step 2
      • Cancel
      • Continue to Step 3
        • Cancel
        • Continue to Step 4
          • Cancel
          • Continue to Step 5

Here, the first “Cancel” and “Continue to Step 2” are like parallel universes. Rather than repeating Step 1 to test each of these, we want to automatically make a copy of the universe at the end of Step 1, then run child tests on each parallel universe. If we can write our tests as a tree in this way, the length of the total test suite will be O(N) in the number of steps, rather than O(N²).

For modern web applications, all the state is stored in a database. Therefore to make a “copy of the universe”, we just need a way to make a copy of the database to pass down to the child tests, while preserving older copies that tests further up the tree can copy and use.

The solution is to implement a stack of databases. As we walk down the testing tree, we push a copy of the current database onto the stack, and the child can play with the database at the top of the stack. After we’ve finished with a set of child nodes and ascend back up the testing tree, we pop the modified databases off the stack, returning to the previous database revisions.
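
Here is a rough sketch of that traversal in Erlang. The node shape and function names are my own invention, not Chicago Boss's actual internals; the "database" here is just an ETS table, so copying the table is the push and deleting the copy is the pop:

-module(test_tree).
-export([run_node/2]).

%% Hypothetical sketch: each node is {Assertions, Children}, where
%% Assertions is a list of funs run against the current database and
%% Children is a list of child nodes.
copy_database(Db) ->
    Copy = ets:new(db_copy, [set, public]),
    ets:insert(Copy, ets:tab2list(Db)),   % snapshot the parent's data
    Copy.

run_node({Assertions, Children}, Db) ->
    lists:foreach(fun(Assert) -> Assert(Db) end, Assertions),
    lists:foreach(
        fun(Child) ->
            DbCopy = copy_database(Db),   % a parallel universe (push)
            run_node(Child, DbCopy),      % the child may thrash it freely
            ets:delete(DbCopy)            % restore the parent's view (pop)
        end,
        Children).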

An example

I won’t go through the details of writing a testing framework or the database stack, but here’s how you’d test a multi-step process with Chicago Boss’s test framework. This is a tree implemented as nested callbacks in Erlang. Each “node” is an HTTP request with a list of callbacks that make assertions on the response, and a list of labeled continuation callbacks — these are the child nodes. Each child node receives a fresh database copy that it can thrash to its heart’s content. Managing the stack of databases is all done under the hood.

The resulting test code is surprisingly elegant:

start() ->
  boss_test:get_request("/step1", [],
    [ % Three assertions on the response
      fun boss_assert:http_ok/1, 
      fun(Res) -> boss_assert:link_with_text("Continue", Res) end
      fun(Res) -> boss_assert:link_with_text("Cancel", Res) end
    ],
    [ % A list of two labeled continuations; each takes the previous 
      % response as the argument

      "Cancel at Step 1", % First continuation
      fun(Response1) -> 
          boss_test:follow_link("Cancel", Response1,
            [ fun boss_assert:http_ok/1 ], []) % One assertion, no continuations
      end,

      "Continue at Step 1", % Second continuation
      fun(Response1) ->
          boss_test:follow_link("Continue", Response1,
            [ fun boss_assert:http_ok/1 ], [ 

              "Cancel at Step 2", % Two more continuations
              fun(Response2) ->
                  boss_test:follow_link("Cancel", Response2,
                    [ fun boss_assert:http_ok/1 ], [])
              end,

              "Continue at Step 2",
              fun(Response2) ->
                  boss_test:follow_link("Continue", Response2,
                    [ fun boss_assert:http_ok/1 ], [ 

                      "Cancel at Step 3",
                      fun(Response3) ->
                          boss_test:follow_link("Cancel", Response3,
                            [ fun boss_assert:http_ok/1 ], [])
                      end,

                      "Continue at Step 3",
                      fun(Response3) ->
                          boss_test:follow_link("Continue", Response3,
                            [ fun boss_assert:http_ok/1 ], [ 

                              "Cancel at Step 4",
                              fun(Response4) ->
                                  boss_test:follow_link("Cancel", Response4,
                                    [ fun boss_assert:http_ok/1 ], [])
                              end,

                              "Continue at Step 4",
                              fun(Response4) ->
                                  boss_test:follow_link("Continue", Response4,
                                    [ fun boss_assert:http_ok/1 ], [])
                              end ]) end ]) end ]) end ]).

If the indentation ever gets out of hand, we can simply put the list of continuations into a new function.
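
For example, the first level of the tree above could be pulled out into named functions, using the same boss_test and boss_assert calls as before:

start() ->
  boss_test:get_request("/step1", [],
    [ fun boss_assert:http_ok/1,
      fun(Res) -> boss_assert:link_with_text("Continue", Res) end,
      fun(Res) -> boss_assert:link_with_text("Cancel", Res) end
    ],
    step_one_continuations()).

step_one_continuations() ->
  [ "Cancel at Step 1",
    fun(Response1) ->
        boss_test:follow_link("Cancel", Response1,
          [ fun boss_assert:http_ok/1 ], [])
    end,

    "Continue at Step 1",
    fun(Response1) ->
        boss_test:follow_link("Continue", Response1,
          [ fun boss_assert:http_ok/1 ], step_two_continuations())
    end ].

%% ...and so on, one function per level of the tree.

Each level of the tree then becomes one flat, readable function.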

Conclusion

There are several benefits to structuring functional tests as a tree of continuations:

  • Eliminates code duplication. We don’t need to repeat steps 1-3 for each action that can be taken at step 4. This shrinks the code base as well as the time needed to execute it.

  • Eliminates most setup code. As long as the setup actions are associated with an HTTP interface, all setup can be done as part of the testing tree with no performance penalty.

  • Pinpoints the source of failing tests. If assertions fail in a parent node, we can immediately stop before running the child nodes. By contrast, lists of tests usually produce long lists of failures for one bug, making it harder to find the source.

  • Tests are well-structured. There is a 1-1 mapping between nodes and something the user sees, and a 1-1 mapping between child nodes and something the user can do next. The tests appear in a well-structured hierarchy, instead of a haphazard “list of everything I could think of”.

  • The trail of previous responses is all in scope. This benefit is unique to a callback implementation in a language that supports closures — it’s handy if you need to compare output from two requests that are separated by several intermediate requests. With a list of tests, you would have to pass around data in return values and function arguments, but here all previous responses are at our fingertips in Response1, Response2, etc.

Why haven’t I been able to find anyone else using this approach? My guess is that all of the side effects of OO languages encourage a wrecking-ball mentality when it comes to unit tests — destroy all possible state after each test. But for functional tests with many steps, this approach is grossly inefficient — if you want to test every rung on a ladder, it’s pointless to climb all the way down to the ground and trudge back up for each test.

To write your own testing framework based on continuation trees, all you need is a stack of databases (or rather, a database that supports rolling back to an arbitrary revision). I don’t know what databases support this kind of revisioning functionality, but adding the feature to Chicago Boss’s in-memory database took about 25 lines of Erlang code.
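
For an in-memory database, a revision stack is easy to sketch. The module below is my own guess at the shape of the feature, not Chicago Boss's actual code; since Erlang terms are immutable, "copying" the current snapshot costs nothing:

-module(db_revisions).
-export([new/0, push/1, pop/1, read/2, write/3]).

%% Hypothetical sketch: the database is a stack of immutable snapshots.
%% Pushing shares the current snapshot; popping rolls back a revision.
new() -> [#{}].                         % one empty database to start

push([Current | _] = Stack) ->          % snapshot: sharing, not copying
    [Current | Stack].

pop([_Discarded | Rest]) ->             % roll back to the parent revision
    Rest.

read(Key, [Current | _]) ->
    maps:get(Key, Current, undefined).

write(Key, Value, [Current | Rest]) ->  % writes touch only the top snapshot
    [Current#{Key => Value} | Rest].

A test runner would call push/1 before descending into each child node and pop/1 on the way back up.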

Once you start writing functional tests as a tree of assertions and continuations, you really will wonder how you did it any other way. It’s just one of those ideas that seems too obvious in hindsight.


Comments

  • By lihaoyi 2025-03-14 0:37

    This is the approach my uTest testing library (https://github.com/com-lihaoyi/utest) takes. I don't think it's unique to functional tests; even unit tests tend towards this pattern. Tests naturally form a tree structure, for multiple reasons:

    - You usually have shared initialization nearer the root and the various cases you want to assert at the leaves.

    - You want to group related tests logically together, so it's not one huge flat namespace which gets messy

    - You want to run groups of tests at the same time, e.g. when testing a related feature

    Typically, these different ways of grouping tests all end up with the same grouping, so it makes a lot of sense to have your tests form a tree rather than a flat list of @Test methods or whatever

    Naturally you can always emulate this yourself. e.g. Having helper setup methods that call each other and form a hierarchy, or having a tagging discipline that forms a hierarchy to let you call tests that are related, or simply using files as the leaf-level of the larger filesystem tree to organize your tests. All that works, but it is nice to be able to simply define a tree of tests in a single file and have all that taken care of for you

  • By simonw 2025-03-13 22:48 (2 replies)

    "One of the most essential practices for maintaining the long-term quality of computer code is to write automated tests that ensure the program continues to act as expected, even when other people (including your future self) muck with it."

    That's such a great condensation of why automated tests are worthwhile.

    "To write your own testing framework based on continuation trees, all you need is a stack of databases (or rather, a database that supports rolling back to an arbitrary revision)."

    PostgreSQL and SQLite and MySQL all support SAVEPOINT these days, which is a way to have a transaction nested inside a transaction. I could imagine building a testing system on top of this which could support the tree pattern described by Evan here (as long as your tests don't themselves need to test transaction-related behavior).

    Since ChatGPT Code Interpreter works with o3-mini now I had that knock up a very quick proof of concept using Python and SQLite SAVEPOINT, which appears to work: https://chatgpt.com/share/67d36883-4294-8006-b464-4d6f937d99...

    • By addled 2025-03-14 4:58

      An approach we've done is to use Postgres' "create database foo2 template foo1" syntax to essentially snapshot the db under test at various points and use those to rollback as needed.

    • By __MatrixMan__ 2025-03-14 3:29 (1 reply)

      I feel uneasy about using transactions like that. Eventually somebody will be puzzled enough to enable statement logging, which will not contain the data they need. Then they'll set a breakpoint in the test and get a shell on the db to see what's actually there, and they'll find it empty.

      And by eventually somebody, I mean two days ago me.

      I'd much rather just have a utility that copies the database actually and hands my test the one it's allowed to mess with.

      • By simonw 2025-03-14 4:48

        One trick I've experimented with is setting things up so that if an assertion fails a copy of the full current state of the database is automatically captured and written out to a file on disk.

  • By turtleyacht 2025-03-13 21:14 (1 reply)

    End-to-end (e2e) tests are slow and flaky. They don't have to be, but effort to fix breakage starts consuming most available time.

    One idea is to separate scraping from verification. The latter would run very fast and be reliable: it only tests against stored state.

    Then scraping is just procedural, clicking things, waiting for page loads, and reading page elements into a database.

    Some consequences are needing integrity checks to ensure data has been read (first name field selector was updated but not populated), self-healing selectors (AI, et al), and certifying test results against known versions (fixing the scraper amid UI redesign).

    A lot of effort is saved by using screenshot diffing of, say, React components, especially edge cases. It also (hopefully) shifts-left test responsibility to the devs.

    Ideally, we only have some e2e tests, mostly happy paths, that also act as integration tests.

    We could combine these ideas with "stacked databases" from the article and save on duplication.

    Finally, the real trick is knowing, in the face of changes, which tests don't have to run, making the whole run take less time.

    • By arkh 2025-03-14 7:47 (1 reply)

      > End-to-end (e2e) tests are slow and flaky.

      If they are slow, it means your application is slow. Good thing your tests make you realize it so you can work on it.

      If they are flaky either your application is flaky or your UI is hard to use. Anyway that's something your tests tell you you have to fix.

      And last: if your tests are all independent, why not run them all in parallel? With IaC you should be able to provision one instance of your architecture per test (or maybe per dozen tests) easily.

      • By turtleyacht 2025-03-14 11:51 (1 reply)

        Even in parallel, there are tradeoffs: run them all in one container, or chunk them out to available workers. The former runs into resource constraints; the latter takes up everything in the shared pool.

        With IaC, emulating a constellation of all dependent services along with the site is technically feasible. (There are other possible constraints.)

        What's your ideal scenario? For example, k8s + cloud, ephemeral db, auto-merged IaC file of a thousand services, push-button perf testing, regression suite with a hundred bots, etc.

        • By arkh 2025-03-14 13:15

          > What's your ideal scenario?

          On-premise k8s cloud where you can deploy many instances of the exact same services you have in prod. Let's say an E2E test takes 5s to run, your deployment takes 2 minutes, and you want to stay under the 5-minute line for running your test suite: deploy an instance of all your services and their databases per batch of 30 tests.

          I can understand this not really being possible at an Amazon scale. But for most businesses? A good beefy server should be enough.
