One approach to performance testing collaborating services

The first service-oriented system I worked on was for a retailer, processing the lifecycle of customers’ orders from purchase through dispatch and a customer service cycle. At launch, the system had performance issues which my team was tasked with identifying and resolving. I’ll not go too much into the cause of the issues or the solution – that’s a whole other topic – but the performance testing approach we took was interesting in its own right. I’ve found that many teams aspire to include performance testing in the release lifecycle, but it is one of those things that tends not to actually happen because it can be hard to get started.


As should often be the case, the need to investigate performance arose as a business concern (rather than a technical whim!). The business had captured data that showed the traffic on the system at different times of day, and identified a consistent profile: customers tended to place the most orders at three specific times, 8:00 to 9:00 (before work), 12:00 to 14:00 (lunch) and 19:00 to 20:00 (evening), of which by far the largest was the morning spike.

Retail businesses can also predict to some extent when spikes in load will come, the obvious ones being Christmas or other holidays or events, and the response to advertising campaigns. In our case, previously captured data had established that the second week in December was the busiest week of the year. We knew the historical ratio between this peak and the average, and the ratio between the current year’s sales and the previous year’s, which meant we were able to estimate with reasonable confidence what load the system would be placed under. We also estimated a trendline, so that we could ensure there was headroom for a couple of years of future growth; on top of this, a certain degree of redundancy was required.

All said and done, the target we set was that we should be able to cope with 10x the previous load profile. Our definition of “coping” was that the backlog of messages waiting to be processed stayed under an agreed threshold during the test, and reduced to zero within an agreed time after the test ended (as the remaining messages were picked up in the background with no more coming in).
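To make the definition of “coping” concrete, here is a minimal sketch of how such a pass/fail check could be expressed over sampled queue depths. It is an illustration rather than the tool we built: the `Sample` type, the `coped` function and the threshold values are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    t: float      # seconds since the start of the test
    backlog: int  # messages waiting on the queue at time t


def coped(samples, load_end, max_backlog, max_drain_seconds):
    """Return True if the backlog never exceeded max_backlog and reached
    zero within max_drain_seconds of the load ending (at time load_end)."""
    if any(s.backlog > max_backlog for s in samples):
        return False
    # First sample taken after the load stopped at which the queue was empty.
    drained_at = next(
        (s.t for s in samples if s.t >= load_end and s.backlog == 0), None
    )
    return drained_at is not None and drained_at - load_end <= max_drain_seconds


# Hypothetical run: load stops at t=120s, queue is empty again at t=240s.
samples = [Sample(0, 0), Sample(60, 400), Sample(120, 900),
           Sample(180, 300), Sample(240, 0)]
print(coped(samples, load_end=120, max_backlog=1000, max_drain_seconds=180))
```

Expressing the criterion as a single yes/no answer is what makes it possible to compare runs at different load multipliers without arguing over how to read a chart.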


The first thing that is necessary in testing the performance of any system is deciding what to measure. There are plenty of choices, and in some teams the engineers may have a hunch about where the weak spot will be. I think of two distinct classifications: performance and scale. Performance is the speed at which things happen; scale is the volume of things happening at a given time. The two are not mutually exclusive of course; it is very common for performance to degrade at scale. In this particular system, the architecture was built on asynchronous services that managed their workload via queues, which meant that the scale of the system was essentially determined by how many processes could handle messages at the same time. An interesting characteristic of this particular implementation was that the services were quite “chatty”: a service receiving one message would handle it by sending multiple messages of its own, in a fan-out (I’ll write about this separately another time). From this insider insight, our hunch was that latency and capped capacity in message transit would be the major issues, while things like resource saturation (memory usage, CPU usage) would probably not be a problem.


One of the nice characteristics of a message-driven system is that it is relatively easy to interact with: you simply send messages. In our case, the single entry point to the system was a customer placing an order. The message was relatively easy to reconstruct from the data that ended up in the database, changing identifiers to make sure no duplicates were created. We wrote a simple tool to query the orders in the database for the relevant period, transform these into messages that were ready to send, and send them in sequence with the same timing relationships as the original, essentially replaying this block of data. By compressing the timing windows (dividing the gaps between messages by a chosen factor), we were able to simulate higher load. This gave us a nice deterministic, repeatable set of inputs to test against different system configurations. An important point here is that the test harness itself has to be able to outperform the system under test! In this case, all the test harness was doing was injecting messages onto a queue, which is quite quick and was happening in parallel threads, having pre-calculated everything it could up front to avoid that cost while the test was running. Before going too much further, we tested our test harness, and established it was able to simulate over 2000x real-time speed, more than enough. We built a simple user interface on top of the tool, so that anyone with a similar need in future would be able to leverage our efforts.
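The replay idea can be sketched in a few lines. This is a simplified, single-threaded illustration under my own assumptions, not the original tool: the `Order` shape, `build_schedule`, `replay` and the `send` callback are hypothetical names, and a real harness would publish to the message queue from several threads. What it does show is the split between pre-calculating everything up front (fresh identifiers, compressed send times) and a send loop that does as little work as possible.

```python
import time
import uuid
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Order:
    order_id: str
    placed_at: datetime
    payload: dict


def build_schedule(orders, speedup):
    """Pre-compute (offset_seconds, message) pairs. Relative timing is kept,
    but every gap is divided by `speedup`, so 10 means ten times the load."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    ordered = sorted(orders, key=lambda o: o.placed_at)
    if not ordered:
        return []
    start = ordered[0].placed_at
    schedule = []
    for order in ordered:
        offset = (order.placed_at - start).total_seconds() / speedup
        # Fresh identifier so the replayed order cannot clash with the original.
        message = {**order.payload, "order_id": str(uuid.uuid4())}
        schedule.append((offset, message))
    return schedule


def replay(schedule, send, clock=time.monotonic, sleep=time.sleep):
    """Send each message at its scheduled offset from the start of the run."""
    began = clock()
    for offset, message in schedule:
        delay = offset - (clock() - began)
        if delay > 0:
            sleep(delay)
        send(message)
```

Passing `clock` and `sleep` in as parameters is a design choice for the sketch: it lets the schedule be checked with a fake clock, which is one way of testing the test harness before trusting its results.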

On running the test against the production codebase, we found that the system was able to cope with just 2.5x the standard load rate, a long way short of the 10x target. It was very clear from monitoring the queues for the various services that there was one service in particular that had serious performance issues. We’d had a hunch it could be this service that needed attention, but now with a repeatable test we were able to demonstrate the problem and prove that it merited investment of time and money to solve. We were then able to follow a “test driven” cycle of tweaking the problem area and confirming by measurement that the results were actually an improvement.
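The measuring half of that cycle amounts to stepping up the load multiplier until the system stops coping. The sketch below illustrates the idea against a toy single-queue model; the function names, the model itself and every number in it are invented for the example and are not data from the real system, where `passes_at` would instead kick off a full replay and inspect the queues.

```python
def max_sustained_multiplier(passes_at, multipliers):
    """Try multipliers in ascending order and return the largest one that
    passed before the first failure (None if even the smallest fails)."""
    best = None
    for m in sorted(multipliers):
        if not passes_at(m):
            break
        best = m
    return best


def simulate_backlog(arrivals_per_tick, capacity_per_tick, multiplier, drain_ticks):
    """Toy model: one queue, fixed capacity per tick. Returns the peak backlog
    and whether the queue emptied within drain_ticks after the load ended."""
    backlog = 0.0
    peak = 0.0
    for arrivals in arrivals_per_tick:
        backlog = max(0.0, backlog + arrivals * multiplier - capacity_per_tick)
        peak = max(peak, backlog)
    for _ in range(drain_ticks):
        backlog = max(0.0, backlog - capacity_per_tick)
    return peak, backlog == 0.0


# Made-up baseline profile and limits, purely for illustration.
baseline = [10, 40, 20, 10]


def passes_at(multiplier):
    peak, drained = simulate_backlog(baseline, capacity_per_tick=50,
                                     multiplier=multiplier, drain_ticks=2)
    return peak <= 60 and drained


print(max_sustained_multiplier(passes_at, [1, 2, 4, 8]))
```

After each change to the problem area, re-running the same search against the same inputs gives a single number to compare against the previous run.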

As it happened, we identified a much better way of implementing the service in question. We went from easily being able to put the system under load it couldn’t cope with to not being able to even see the new performance metrics on the original chart. In the end we weren’t able to measure the actual capacity of the new implementation, because the load generation tool couldn’t keep up; what we could demonstrate, in a repeatable way, was a multiple-thousand-to-one improvement.