Capturing the bug
There wasn’t much information to go on, but the fact that the error was intermittent suggested that timing might be a factor. Modern applications do a lot of work asynchronously, especially when loading data from a server, so my initial hunch was that this would turn out to be a timing problem. The ideal approach would be a reliable way of triggering the issue, one that can be run again and again; this is a job for automation tests. Unfortunately the application had close to zero tests, so we started by building a simple set of automation scripts that simulated a journey through the application.
The way automation tests work is fairly simple: an external controlling process communicates with the browser, sending commands and querying the state of the window. A typical flow is to check whether a particular element is visible on the page, interact with it in some way, and then check whether the resulting state of the page is what you expect. By mapping out the user’s intent, and translating it into elements on the page and interactions, we can form an automatable journey through the system. We can let this journey run over and over, and capture any errors that are triggered with logging. Ideally this would be a logging setup that we can later release into production, so that we can see not only errors that happen in development environments, but also any errors that a customer sees.
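The check/interact/verify loop described above can be sketched in a few lines. This is a minimal illustration, not a real driver: the tiny in-memory page and the element names (`loginButton`, `welcomeBanner`) are hypothetical stand-ins for a browser and a WebDriver-style API.

```javascript
// A tiny in-memory "page" standing in for a real browser driver.
class FakePage {
  constructor() {
    this.elements = {
      loginButton: { visible: true },
      welcomeBanner: { visible: false },
    };
  }
  isVisible(id) {
    return Boolean(this.elements[id] && this.elements[id].visible);
  }
  click(id) {
    // In this simulation, clicking the login button reveals the banner.
    if (id === 'loginButton') this.elements.welcomeBanner.visible = true;
  }
}

// One step of the automated journey: check, interact, verify.
function runJourneyStep(page) {
  // 1. Check the element we want is on the page.
  if (!page.isVisible('loginButton')) throw new Error('login button not found');
  // 2. Interact with it.
  page.click('loginButton');
  // 3. Verify the page ends up in the state we expect.
  return page.isVisible('welcomeBanner');
}
```

A real journey chains many such steps together, one per user action, so the whole flow can be replayed on demand.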
Capturing errors that happen in the browser is traditionally done with bug reports, but this isn’t ideal:
- Not all users will send a bug report – many will suffer in silence or simply move on to a competitor
- End users’ bug reports rarely contain much diagnostic detail, such as a stack trace or the exact steps to reproduce
- By the time an end user is reporting a bug the damage to your reputation is already done
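Automatic error capture sidesteps all three problems. A minimal sketch of what a global error handler does is below; the `sendToLog` function and the in-memory store are hypothetical stand-ins for the HTTP call a real logging service would make.

```javascript
// Stand-in for shipping a record to a logging service over HTTP.
const capturedErrors = [];
function sendToLog(record) {
  capturedErrors.push(record);
}

// In a browser this function would be assigned to window.onerror, so every
// uncaught error is reported with enough context to investigate later.
function onGlobalError(message, source, line, column, error) {
  sendToLog({
    message,
    source,
    line,
    column,
    stack: error && error.stack,
    // Recording the time lets us correlate errors with screenshots later.
    timestamp: new Date().toISOString(),
  });
  return false; // allow the browser's default error handling to continue
}
```

Services like Track.js install a richer version of this hook for you, but the principle is the same: every error the user sees is reported, with detail no bug report would include.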
With error logging set up, the next step was to point both the automation scripts and manual testing at the application, and watch for the error. Sure enough, before long we started seeing errors being triggered and captured. This turned out to be easier than I had expected: it was indeed a timing issue, and because the automated tests interacted with the page more quickly than users do, they triggered the problem more reliably. As well as logging, we had configured the automation tests to take screenshots of the application. The error log gave us the exact time each error was triggered, so we could go straight to the matching screenshot. We then showed that screenshot to the user who raised the bug, who confirmed it was the behavior he had seen.
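Matching an error to a screenshot is just a nearest-timestamp lookup. A sketch, assuming screenshots are recorded with their capture time in epoch milliseconds (the field names here are hypothetical):

```javascript
// Given the time an error was logged and a list of screenshots, return the
// screenshot captured closest to that moment.
function closestScreenshot(errorTime, screenshots) {
  let best = null;
  for (const shot of screenshots) {
    const distance = Math.abs(shot.takenAt - errorTime);
    if (best === null || distance < best.distance) {
      best = { file: shot.file, distance };
    }
  }
  return best && best.file;
}
```

For example, an error logged at t=1000 against screenshots taken at t=900 and t=1500 resolves to the earlier capture:

```javascript
closestScreenshot(1000, [
  { file: 'a.png', takenAt: 900 },
  { file: 'b.png', takenAt: 1500 },
]); // → 'a.png'
```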
One of the really nice features of the Track.js error log is that it contains very rich information about each error, including timelines, source maps, and more. We were able to go straight to the error and step back to its cause. In this case it was a classic timing problem: a method was being called on an object that hadn’t been set up yet, because the network call that populated it was still in flight, resulting in the familiar “undefined” runtime error. This is simple to resolve with a little defensive coding; the easiest approach is to check that an object exists before interacting with it. The actual fix was to add these guard clauses to all the places where these objects were being accessed.
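A guard clause of this kind looks like the sketch below. The names (`currentUser`, `getName`, `refreshGreeting`) are hypothetical; the pattern is what matters.

```javascript
// Sketch of the guard-clause fix: currentUser may still be undefined while
// the network call that populates it is in flight, so check before use.
function refreshGreeting(currentUser) {
  // Without this guard, calling a method on undefined throws the familiar
  // runtime error whenever the data hasn't arrived yet.
  if (!currentUser || typeof currentUser.getName !== 'function') {
    return null; // data not loaded yet; do nothing and try again later
  }
  return 'Hello, ' + currentUser.getName();
}
```

Modern JavaScript can express the same guard more tersely with optional chaining (`currentUser?.getName()`), but the explicit check makes the intent clear and works in older runtimes.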
Making sure it is fixed and stays fixed!
Once we think the bug is fixed, we need to measure the impact of the fix. We want to make sure we’ve fixed the problem, and not introduced any new ones in its place! This is where the automated tests once again prove their worth. We can simply deploy the fix to a test environment, and set up the automation tests to run again and again until we’re happy. We can keep an eye on the error log, and if we still see the error we know we have more work to do. If we’re happy that the error is fixed, we can release the fix into production.
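The soak run described above amounts to a simple loop: run the journey repeatedly and count any errors it triggers. A minimal sketch, where `runJourney` is a hypothetical stand-in for the real automation script:

```javascript
// Run the automated journey `iterations` times and count failures, so a fix
// can be validated against a test environment before it ships.
function soak(runJourney, iterations) {
  let failures = 0;
  for (let i = 0; i < iterations; i++) {
    try {
      runJourney();
    } catch (err) {
      failures++; // in the real setup this would also land in the error log
    }
  }
  return failures;
}
```

A zero failure count over many iterations, combined with a quiet error log, gives real confidence that the intermittent bug is gone rather than just unlucky to reproduce.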