The cost of messaging between systems

When building a distributed system, you need a way for the components or services to communicate, and this is often a message bus. Message buses come in different forms, but ultimately use the same paradigms – commands and events. One of the earlier distributed systems I worked on suffered from performance problems because of an over-use of messages: in many places messages were being used instead of method calls, with a Command initiating an action, an Event being published when it completed, and the workflow continuing on receipt of the event. In small tests, on developer machines, with no network latency and no ambient traffic, this had seemed to do no harm; when released into the wild, with real customers, real load, and a network with latency, it of course proved otherwise!

I often see conversations, when teams are thinking of using a message bus, where there is a concern over latency – other suggestions are often put forward, like ZeroMQ, which trade durability for latency. It comes down to this: if you care about latency, message buses, queues, and this entire paradigm are not what you’re looking for! Message buses are great when you want something to happen, and you’re not too fussy about when. Some things to consider are:

Overheads: To send a message, first you need to serialize it; this has a cost. You then need to send the message, normally over a network or at the very least via a write to disk; this has a cost. The consumer needs to receive it, normally over a network or at the very least by reading from disk; this has a cost. The sends and receives may be wrapped in a distributed transaction, which again adds cost. The consumer deserializes the message, at a cost, dispatches it, at a cost, and then starts processing. If you’re waiting for a response, repeat the whole cycle!
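Even before any network or broker is involved, the serialize/deserialize steps alone are measurable. A minimal sketch (using JSON and a made-up payload, purely for illustration):

```python
import json
import time

# Hypothetical payload standing in for a command message.
message = {"command": "CreateOrder", "order_id": 42, "items": ["widget", "gadget"]}

start = time.perf_counter()
body = json.dumps(message).encode("utf-8")   # serialize: CPU cost on the sender
restored = json.loads(body.decode("utf-8"))  # deserialize: CPU cost on the consumer
elapsed = time.perf_counter() - start

assert restored == message
# Network hops, disk writes, and distributed transactions all come on top
# of this, and a request-response exchange pays the whole cycle twice.
```

The point isn’t the exact number – it’s that every hop in the pipeline adds a cost that a plain method call simply doesn’t have.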

What’s the receiver doing: You’re building a distributed system, so you really want to make sure you understand fault tolerance, partitioning, CAP theorem and the like! Just because you’re online and working doesn’t mean that anything else is; you may be sending a message that will sit in a queue until the receiving service is rebooted, patched, switched back on, or whatever. This is sometimes known as “temporal coupling”, because you are dependent on a thing being online at a specific time (now). A big part of building a distributed system successfully is to break down the dependencies as much as possible, and to use the topology to introduce fault domains rather than suffer from cascading failure. You need to not know, or care, whether the receiver is online – and certainly not let it take you down with it!

Traffic: You’re sending a message to another service… but you’ll not be the only one! If you’re using a queue, then queueing theory applies; the time before your message is processed is a combination of the actual handling time, the number of messages ahead of it in the queue, and the number of workers processing messages. You don’t control any of these things, and especially after an outage there might be a considerable backlog. Who knows how long your message will sit in the queue before even being looked at?
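A back-of-envelope version of that calculation makes the point concrete. The numbers below are assumed for illustration, not taken from any real system:

```python
# Rough queueing estimate: your message waits behind everything already queued.
messages_in_queue = 10_000     # backlog after an outage (assumed)
workers = 4                    # consumers processing in parallel (assumed)
handling_time_s = 0.05         # seconds per message (assumed)

wait_s = (messages_in_queue / workers) * handling_time_s
print(f"Expected wait before your message is handled: ~{wait_s:.0f}s")
```

With those numbers your message waits over two minutes before anyone even looks at it – and you control none of the three inputs.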

In summary, messaging is a great way to build distributed systems, so long as you use it for what it’s good at. Many messaging systems are really reliable, with things like guaranteed at-least-once delivery, but this comes at a cost, such as latency. Embrace this, and use messaging when time isn’t a factor, such as for handling events (which by definition are things that happened in the past, and can be eventually consistent) or triggering long-running work in the background. Reject messaging anti-patterns that need request-response – use other things, like HTTP, for this. Use the right tool for the job!
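The rule of thumb above can be sketched as two contrasting call shapes. The URL and function names here are hypothetical, purely to show the shape of each choice:

```python
import json
import urllib.request

def get_price_now(item_id):
    # Request-response: you need the answer before you can continue,
    # so a synchronous HTTP call fits. (Hypothetical endpoint.)
    with urllib.request.urlopen(f"https://pricing.example/items/{item_id}") as resp:
        return json.load(resp)["price"]

def on_order_placed(publish, order):
    # Event: the order already happened; downstream work (emails, analytics)
    # can be eventually consistent, so publishing a message fits.
    publish({"event": "OrderPlaced", "order": order})
```

If the caller is blocked until the answer arrives, you’re in request-response territory and HTTP is the honest tool; if the work can trail behind, an event on the bus is.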