Why I check packages into version control

For a long time now, Git has been my de facto version control tool. It is fast and efficient, does everything I need and much more, and gets the things that are important to me right. If you’re not familiar with Git, one of its big strengths is that a branch is just a lightweight pointer to a commit, and file contents that haven’t changed are shared between commits rather than copied, which means that things like branching are very quick and cheap operations. Before that I had used Subversion, where every branch is essentially a separate folder, and the volumes of data being transferred for almost every operation are much larger.

There’s a very common practice when it comes to packaged dependencies, which is to exclude the downloaded dependencies from source control. In version control like SVN (or, god forbid, TFS), this makes a fair amount of sense: including the packages will result in a large and slow repository, because all the bytes get transferred around with multiple operations. Especially on self-hosted solutions, this means more demands are placed on hardware, in particular disk space and download bandwidth. With Git, content that hasn’t changed between commits is stored only once, and a pull or push only transfers the objects (and metadata) that the other side doesn’t already have, so these problems are far smaller, often to the point of being negligible. I’m not aware of any other compelling reason to exclude packages from source control: I don’t know of any significant benefit of leaving them out, or any major cost of including them.
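
As a rough illustration of why an unchanged package costs nothing extra in later commits, here is a minimal sketch in Python of how Git names file contents. Git identifies a file’s contents (a “blob”) by hashing a short header plus the bytes, so identical bytes always resolve to the same stored object. The function name `git_blob_id` and the sample byte strings are made up for illustration, and the sketch assumes a repository using the default SHA-1 object format.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the ID Git gives a file with these contents (SHA-1 repositories)."""
    # Git hashes the header "blob <size>\0" followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical package bytes, standing in for real package archives.
package_v1 = b"pretend this is some-package-1.0.0"
package_v2 = b"pretend this is some-package-1.1.0"

# The same bytes always map to the same object, however many commits include them.
assert git_blob_id(package_v1) == git_blob_id(bytes(package_v1))

# Different bytes get a different object, so only an upgrade adds new data.
assert git_blob_id(package_v1) != git_blob_id(package_v2)
```

In other words, a checked-in package adds to the repository when it is first committed and again when it is upgraded, not on every commit in between.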

Ideally, version control is your “time machine” for the project. It should be possible to switch to a particular commit and work from that point, so anything needed to do that should, where practical, be included. I draw the line at things like operating system images or installers for IDEs, because the software I write is not dependent on a fixed version of these and they can safely be considered stable and available from other sources. If there are build scripts that relate to a particular version of the codebase, they should be included. Tooling that is rare or might become unavailable, and that relates to a particular version, should be included. The reason for this is very simple: I want to have confidence that I can recreate any previous state if I need to, and not find myself in a situation where I have an undocumented external dependency. Packages should in theory be available from a known place, but history has repeatedly shown this isn’t reliable. Package repositories have themselves moved to new versions, needing tooling changes to keep up; URLs have changed or moved; in some cases individual packages have been taken down completely. Last year there was the infamous “left-pad” incident, where an individual developer who had written a tiny npm module that simply padded a string with spaces at the start decided to remove that package, and it caused widespread disruption: everyone who had included it as a dependency, or who had included something that depended on it, however indirectly, was suddenly unable to restore packages and build their project! In this particular case, there were a number of widely used low-level packages that depended on “left-pad”, which magnified the impact even more.

Package tools normally offer a command line interface to restore packages; this allows build systems to restore them as needed. Let’s think about that, though: we want our build system to be as deterministic and predictable as possible. Why repeat the development step of bringing in packages, which means the build server is allowed and encouraged to download content from the public internet? Sometimes teams will do things like fork a package and publish their own version to a private package repository. This then means there are multiple package repositories to depend on, and they need to be checked in the right order. If we just check package artefacts in, the build of any given commit uses exactly the packages in that commit, with no external dependencies or moving parts.

One argument that is made for excluding packages from version control is that including them increases the repository size and makes cloning slower. This is true, but irrelevant! Most people will clone a repository very infrequently, and cloning a repository implicitly means a need to be online. Everyday workflows are more about pull/push and reset/clean. By excluding packages, cloning is faster, but simply cloning is no longer enough to get going: a network connection is needed to clone, and needed again to install packages. One of the great things about Git is that it enables offline workflows. Being fully distributed, everything except the operations that talk to a remote (such as clone, fetch, pull and push) works without network access, including all branching operations. If the packages are included in version control, and you clone while online, you can then work freely offline until it’s time to push a branch. If you have private package repositories, which for security should be hidden behind a VPN, the situation is even worse, because there’s a dependency not just on being online, but on being online on the corporate network. If the corporate network has problems, or the private repository goes offline, or the VPN connections aren’t stable, you will be blocked; check packages in and you can get on with things.

When I’ve seen teams creating their own packages and serving them from a network drive, there’s a whole new and incredibly nasty category of problem that can be created. While public package repositories will normally have an application layer that stops the same version of a package being republished, if your sharing mechanism is a lightweight network share (which works perfectly well, and is the cheapest and fastest to get up and running) there is no such safeguard. I’ve seen nasty bugs happen when someone accidentally overwrites a package; there is no longer a single source of truth for what this package contains. I’ve had to resort to decompiling the deployed binaries on production servers to try and track down an issue, only to discover that the problem was a mis-labelled package version that got checked in between the build pipeline starting and deployment finishing. If you have your build server set up to automatically publish internal packages and the build scripts contain errors, this can happen without a person even being involved and having a chance to recognize what is going on. If people are manually incrementing version numbers, a new category of concurrency problem is introduced: the team need to coordinate the version numbers that are being applied to make sure they are in sequence with no duplicates. At least if the packages are checked in, the git diff between commits will show that the version number is the same but the binary has changed, which could save a lot of fiddly debugging.
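
The same kind of check can be approximated outside Git. The following is a minimal sketch in Python, assuming packages are plain files in a single folder and that a checksum was recorded the first time each versioned file name was seen. The names `digest` and `find_overwritten`, and the example file name in the docstring, are hypothetical and not part of any package tool.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_overwritten(known: Dict[str, str], package_dir: Path) -> List[str]:
    """List package files whose name is unchanged but whose contents differ.

    `known` maps a file name (for example "Acme.Utils.1.2.0.pkg") to the
    digest recorded when that version was first seen. Files that are not
    in `known` are treated as new packages and ignored.
    """
    changed = []
    for path in sorted(package_dir.iterdir()):
        recorded = known.get(path.name)
        if path.is_file() and recorded is not None and digest(path) != recorded:
            changed.append(path.name)
    return changed
```

Running something like this at the start of a build would flag a package that was republished under an existing version number, although it only helps if the recorded checksums themselves live somewhere trustworthy, such as the repository.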

Whenever something is considered “best practice”, it’s a good idea to ask “why”. Often “best practice” simply means “someone said so”, or “it used to be the case”. If there are valid reasons why something is considered good practice (if we believe in continuous improvement, there is no such thing as best), then we should understand what those reasons are, as well as what might make those reasons become invalid. We should always look for ways to do something better instead of blindly following, and try to have “strong beliefs held loosely”, meaning that while it’s good to have opinions, it’s more important to be able to let them go in the face of fresh information.