A team I’ve worked with recently were struggling with poor delivery effectiveness, the result of a few particularly bad decisions compounding. One of the biggest issues was that the team had a lot of automation tests, but no other feedback mechanism. Every simple test was implemented as an automation journey through the entire system, running against a real disc-backed database, starting with creating users, and creating every single part of the data setup through automated user interactions. The test suite had of course grown as new features were added – it actually covered quite a large portion of the codebase – but the tests themselves were at best useless, because the feedback they gave was too slow and indirect, and at worst outright harmful, because they mixed real and fake components and didn’t actually catch that many problems. As is often the case with a team whose delivery is ineffective, the stakeholders were starting to get restless, and had a wish-list of features assigned to a timeline they expected to be met. This effectively ruled out significant investment in tackling the root causes of the problem, and called for something more pragmatic to relieve the worst of the immediate pain.
At the time I joined, the test suite took about 2 ½ hours to run. This was almost all down to IO constraints, with the pseudo-real database constantly writing and reading the disc. The challenge I set, and invited the whole team to come up with ideas to solve, was how to get this time down without compromising the visibility of issues in the codebase. Honestly, my approach with no constraints would be to look at what every test was intending to cover, and replace almost the whole thing. There are other far more suitable types of testing for every single thing covered here. However, investing so much time in a non-feature-delivery activity wouldn’t be acceptable, and understandably so.
My first priority was to change the team’s trajectory in developing new features, and get some more sustainable methodologies in place. I ran workshops on TDD and unit testing, and presented numerous times on the different options available, to ensure that the whole team had a basic knowledge of what was possible. Of course many of the team already knew there were better ways, and how to implement them, so my goal was to empower those members to break the trend and not feel obliged to “be consistent”. I had some of the less familiar members of the team work on proof-of-concept exercises, taking an existing test and establishing equivalent tests at the right levels to give the same coverage. Getting as many of the team as possible on board with writing better tests, establishing some patterns, and putting a foundation for knowledge sharing in place made it possible for the whole team to at least stop contributing to the problem. The total number of slow-running tests did stabilize and start to decrease, as fast component tests took their place.
An easy way to mitigate some of the problem was to investigate hardware and software configurations; the team were at the time using Windows Server virtual machines running on top of Windows 7, with mostly fairly powerful hardware, including SSDs. Some of the team had machines of a much lower spec, so an obvious first step was to bring everyone up to the same spec. Meanwhile, I dug into why the software configuration was the way it was, and campaigned for a switch to Windows 10. The Windows 7 disc drivers, especially combined with the virtual machine configuration, meant that the SSDs weren’t being used anywhere near optimally. The simple switch of operating system roughly halved the run time, to just over an hour. Obviously this still wasn’t good enough for a fast-feedback workflow such as TDD, but at least developers could reasonably run the test suite locally while in meetings or similar, rather than being reduced to overnight runs. The total cost of hardware and software upgrades for the whole team came in at about the monetary equivalent of 6 person-days of work, but crucially without diverting time away from other things.
Meanwhile the same test suite would run on a nightly trigger in TeamCity. This was far from ideal, as multiple changes would be bundled together, and people would have to waste time searching for the root causes when tests broke. The tests would be broken about two thirds of the time; there was actually a sprint where the tests were broken for 13 consecutive days(!), because people had pushed attempted fixes but not kicked off another test run to prove the fix had worked, and others in the team had thought nothing of pushing further changes on top of a failing build, introducing fresh problems. The approach I hoped to take was to run the tests in parallel, but unfortunately there were infrastructure dependencies in the code, as well as static state, which meant tests could only run in series on any given machine. With a little lateral thinking, however, we were able to use multiple build agents to achieve the same effect:
- I added a counter to the test runner code; these particular tests were implemented in SpecFlow, which provides hooks that can be configured to run before each feature, scenario or step, so the counter simply increments before each test
- I added two configuration settings, one for an “instance id” and one for the number of instances
- If the remainder when dividing the counter by the number of instances matches the instance id, the test runs as normal, otherwise the test is skipped
- At this point, by changing 2 config values we can run a non-overlapping, non-underlapping “stripe” of the tests, with a significantly shorter runtime – this works both on a developer machine, as a smoke test, and on a build agent
- We set up a bunch of builds – we chose 10 as an arbitrary number – that would each run a stripe of tests independently; these all used the same artefacts, but simply swapped out the instance id as a configuration transform
- We added a “control build” to orchestrate the flow. This is very simple: a single build containing no steps, but with a snapshot dependency on each of the individual builds. The control build collects together the results of its dependent builds, and reports a pass if all pass and a fail otherwise.
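The striping hook above can be sketched roughly as follows – this is a minimal illustration, not the team’s actual code, assuming SpecFlow with the NUnit runner, and with illustrative names for the class and the two config settings (“InstanceId” and “InstanceCount”):

```csharp
using System.Configuration;
using NUnit.Framework;
using TechTalk.SpecFlow;

[Binding]
public class StripeHooks
{
    // Tests run strictly in series within a process, so a plain counter is enough.
    private static int _scenarioCounter = -1;

    [BeforeScenario]
    public static void SkipScenariosOutsideThisStripe()
    {
        // These two values are what the per-build configuration transform swaps out.
        var instanceId = int.Parse(ConfigurationManager.AppSettings["InstanceId"]);
        var instanceCount = int.Parse(ConfigurationManager.AppSettings["InstanceCount"]);

        _scenarioCounter++;

        // Each scenario belongs to exactly one stripe, so stripes neither overlap
        // nor leave gaps across instance ids 0..instanceCount-1.
        if (_scenarioCounter % instanceCount != instanceId)
        {
            Assert.Ignore("Scenario belongs to another stripe");
        }
    }
}
```

With “InstanceCount” set to 1 every scenario runs, so the same artefacts behave as a normal full run on a developer machine, and as a stripe on a build agent.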
The outcome of using the striped builds was that we could divide the overall runtime by the number of agents that could run a stripe in parallel – in our case 3, but we could have added more agents to improve this further. We could also have experimented with different numbers of stripes, trading off the spin-up and tear-down costs against the benefit of running more instances. The result of these changes, combined with the hardware/software changes, was dramatic: we could run the entire test suite in just under half an hour, and were able to enable build triggers to run the whole suite on every single commit to the main branch! With more visibility of the test results, and a lower barrier to running the tests both locally and on build agents (along with stricter continuous integration disciplines), the test suite went from passing about 35% of the time to passing over 90% of the time. When the tests did fail, if the failure was perceived to be transient (timeouts were not uncommon), we could simply re-run the stripe that had failed, rather than having to run the whole test suite again.
The key message here though is that while none of these solutions were ideal, they were low cost, and provided quick wins. This bought the team a little headroom to look at better options for the longer term. We could have stagnated while trying to find better solutions, but by thinking outside the box and challenging some of the constraints around us, we were able to get something off the ground and enjoy the benefits.