Debugging Async timing

In an application I’ve been looking at recently, I came across a problem with diagnosing and improving performance problems in a .NET process. The application was fairly simple, loading remote data in over a network and transforming it – a perfect use case for asynchronous processing, as there is no point a CPU thread waiting around for data over the network. The remote calls were being managed as async Tasks, with several being dispatched together in parallel, rather than running one by one in series, in a Task.WaitAll block, part of the .NET core framework to allow multiple tasks to run in parallel, blocking until all Tasks have completed (or errored).

One of the issues in this particular instance seemed to be that some of the tasks in the WaitAll block were taking significantly longer than others – the question was how to prove it, and which ones. Some of the tasks too a long enough time that it looked like the application had frozen. To help in this diagnosis, I wrote a small wrapper function, to periodically log out each task that was still running, and how long it had been running for.

I’m passing in a function that will generate a task, so that I can wrap its execution. The other two parameters are a name – so we can make sense of the log output – and a timespan that we can set to a reasonable interval relative to the tasks being timed. A CancellationTokenSource gives us a means of cancelling in-flight Tasks, by managing the lifecycle of a CancellationToken. Most of the .NET framework methods that work with Tasks have an overload that can take a cancellation token, so that they can subscribe to this mechanism. The token source can be disposed, so it’s wrapped in a using block, to ensure the resources it holds are released as quickly as possible.

The inner function is simply logging when it starts and ends, and while it is running. The while loop is driven by the cancellation token, so that we can return from the logging at as close as possible to the same time as the function being logged. Task.Delay is a CPU-sympathetic way to wait for a specified time, with the async framework capabilities freeing up the thread while waiting – in contrast to Thread.Sleep. The Task.Delay also takes in the cancellation token, so that the delay can be cut short when the function being logged finishes.

This is a relatively simple little utility, but by using it I was able to gather more information about the performance of the system without polluting each individual method with logging concerns. It was possible to wrap the calls to the Tasks from the coordinating code, leaving their innards unchanged and get metrics about what was going on. As it happens, there was 1 in particular that took far longer than the others in its parallel block. Moving the tasks around into groups that took similar time meant that the number of in-flight tasks was kept higher, and the overall runtime was much lower.

A way to improve this further would to move away from imperatively coordinating the Task execution order with control flow statements, and to use something like the Reactive Extensions framework. It would be possible with this approach to set up a sequence of actions, and declaratively ask for a certain number to be run in parallel. Whereas the WaitAll will always wait for everything to finish before starting the next block, the Reactive “Merge” will manage how many tasks (Observables in the Reactive context) are running at any given time by starting a new one as soon as the previous finishes.