How Not to Write Your Git Update Hooks

Written by Fabian Streitel | Jan 9, 2017 11:00:00 pm

This piece collects some lessons I learned while optimizing the performance of our Git hooks. The information here is certainly not new, but I haven’t found it aggregated and explained in one place yet.

We recently switched our main code repository from SVN to Git and with that came many challenges and improvements to our software development process. One feature that Git offers is so-called hooks. These are small programs or scripts that are run before or after a commit, when pushing to a repository and at other times. They may write to the console during these Git processes (making the output look like it came from Git) and even abort them, e.g. if the user is trying to use Git in a way you don’t want them to.

Directly after the migration to Git, we had several recurring problems that we addressed with these commit hooks. In this post, I’ll summarize a few of the interesting things we learned along the way and I’ll show you how to

write a commit hook that prevents big files from being pushed to a repository
ensure that commit messages follow a certain format

But first, let’s dive a bit into one specific Git hook: the update hook.

The update hook

Hooks in general live in your .git/hooks directory (or your_repo.git/hooks for a bare repository). A hook is simply an executable file with a certain name. The update hook is particularly interesting if you follow a development model in which you have one central repository to which all changes should eventually be pushed. This hook is run every time someone pushes something to this repository, once for each ref that is being updated, and it lets you accept or reject that update. This makes it ideal to enforce central rules that all developers should follow while also giving developers the freedom of temporarily violating these rules in their local repositories. Once they try to push their changes, though, they’ll have to follow these rules or the push will be rejected.

This hook is called with three parameters:

the name of the ref that is being updated (in the most common scenario this is the name of the branch the developer is pushing, e.g. refs/heads/my_fancy_branch)
the old object name this ref used to point to (i.e. the commit at the tip of the branch in the remote repository, the format is the usual SHA-1)
the new object name this ref should point to (i.e. the commit at the tip of the branch in the local repository from which the developer is pushing, the format is the usual SHA-1)
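To make the three parameters concrete, here is a minimal sketch of an update hook that only reports what it was called with. The helper name describe_update and the variable names are illustrative assumptions; Git itself only prescribes the order of the arguments and that a non-zero exit status rejects the update.

```shell
#!/bin/sh
# Sketch of a hooks/update script. Git calls it once per pushed ref as:
#   update <refname> <old object name> <new object name>

# describe_update is a hypothetical helper, not part of Git.
describe_update() {
    refname=$1   # e.g. refs/heads/my_fancy_branch
    old_ref=$2   # tip of the ref in the central repository before the push
    new_ref=$3   # tip the ref should point to after the push
    echo "Updating $refname: $old_ref -> $new_ref"
}

# Only act when invoked with three arguments, the way Git invokes the hook.
if [ "$#" -eq 3 ]; then
    describe_update "$1" "$2" "$3"
    exit 0   # zero accepts the update, any other status rejects it
fi
```

Anything the hook prints is shown to the developer who is pushing, so the echo line is also where a helpful rejection message would go.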

The naive approach to commit verification

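With this info, we have everything we need to check every commit the developer is pushing. The easiest way to do this is to get all commits between the old and the new SHA-1 and perform our check on each of them. As a shell sketch, where check_commit stands for whatever check you want to run:

# Walk over every commit in the pushed range, one at a time
for commit in $(git rev-list "$old_ref..$new_ref"); do
    if ! check_commit "$commit"; then
        # Replace this with some helpful error message
        echo "Commit $commit violates our rules" >&2
        exit 1
    fi
done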

This does what we want. But. If your check_commit function isn’t blazingly fast, it’ll take too long. The problem here is threefold:

For every commit we run the check function, which usually involves spawning some subprocesses per commit (e.g. grep to check the commit message), which is rather slow.
The first violating commit will make our script exit with an error code. If there are other violating commits, the developer doing the push won’t know until they fix the first commit and try pushing again.
We also check a lot of commits that we don’t have to check as some commits in the range may be from merges and not actually new in the repository.

The last point is very interesting. Assume the following scenario:

The developer has their own feature_branch that they work on
The developer merges master into their feature_branch
The developer makes some commits on the feature_branch
The developer now tries to push those changes

If we follow the naive approach above, what will happen is that not only will we verify all commits the developer made on the feature branch, we’ll also re-verify all the commits they merged into the feature branch from master. Take a feature branch that’s been delayed for a while on a repository with several developers and you’ll get pushes that take up to several minutes. That’s unacceptable. Reminds me of the old times where compiling used to be the go-to excuse for slacking off.

The smarter approach

What we need instead is a solution that satisfies both of these criteria:

Don’t verify commits that we have already verified
Don’t verify each commit separately, verify them all in one go

To achieve the first goal, we need a smarter rev-list call. git rev-list $old_ref..$new_ref will give us all commits that are reachable from the new commit but not from the old one, even if they are already known to the repository through another branch. So we have to filter out all commits that are on existing branches, e.g. master, at the time of the push. Luckily, the update hook is run after all our new commits have been received, but before any of the branch labels are adjusted. So if we push some new commits to master, at the time the hook is run, the master label will still point to where it was before. We can use this to our advantage and simply specify that we don’t want any commits that are reachable from any of the known branch labels:

git rev-list $old_ref..$new_ref --not --branches='*'

This drastically reduces the number of commits we need to look at and thus improves the performance of our hook.

Next, we’ll need to make sure we verify all those commits in one go. I’m going to show you how we did this with two specific examples.

How to prevent big files from being checked into your repository

To check all file modifications and additions for a maximum file size (big files like binaries in Git repos are problematic in the long run), we’ll first need to list all those touched files, then get their size and finally filter for those that are too large.

This is how we do it:

git rev-list --objects $old_ref..$new_ref --not --branches='*' List the SHA1 hashes and paths of all touched objects (including but not limited to touched files)
git cat-file --batch-check='%(objectname) %(objectsize) %(rest)' List the SHA1, object size in bytes and path for each of those objects
awk '$2 > '$filesize_limit' { print }' | sort -k3 | uniq --skip-fields=2 Filter on those objects that are too big, sort them by path (uniq only removes adjacent duplicates) and remove duplicate paths

Combine these three steps in one long pipe and you will have a list of SHA1 hashes and paths for files that are too big. Running this on a commit range with hundreds of commits in our repository takes about 3 seconds. For any normal push this should therefore be sufficient in terms of performance.
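Put together, the pipe could look like the following sketch. The 5 MiB limit and the function names filter_large_objects and find_large_objects are assumptions made for illustration. The sketch passes the limit to awk with -v, sorts on the path field so that uniq sees entries for the same path next to each other, and uses -f 2, the POSIX spelling of --skip-fields=2.

```shell
#!/bin/sh
# Hypothetical limit of 5 MiB; pick whatever suits your repository.
filesize_limit=$((5 * 1024 * 1024))

# Reads "<sha> <size> <path>" lines on stdin and prints those whose size
# exceeds the limit given as $1, keeping one line per path.
filter_large_objects() {
    awk -v limit="$1" '$2 > limit' | sort -k3 | uniq -f 2
}

# Lists the oversized objects a push would add ($1 = old, $2 = new object name).
find_large_objects() {
    git rev-list --objects "$1..$2" --not --branches='*' |
        git cat-file --batch-check='%(objectname) %(objectsize) %(rest)' |
        filter_large_objects "$filesize_limit"
}

# In the update hook ($2 = old object name, $3 = new object name):
#   large=$(find_large_objects "$2" "$3")
#   if [ -n "$large" ]; then
#       echo "These files are too big:" >&2
#       echo "$large" >&2
#       exit 1
#   fi
```

Splitting the filter into its own function is just a convenience: it can be tried on hand-written input without a repository.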

How to check commit messages for a certain naming scheme

If you want to make sure that your teammates don’t forget to reference an issue ID when committing, for example, all you have to do is

git log --no-merges --pretty='%H %s' $old_ref..$new_ref --not --branches='*' List all commits with their SHA1 and first message line
grep -E -v "$message_regex" Filter for those that do not match your message regex (or use your favorite grepping tool instead of grep -E; keep in mind that each line starts with the SHA1, so the regex has to allow for it)
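As a sketch, the two steps can be combined so that all offending commits are reported in one go instead of stopping at the first one. The issue ID format (uppercase project key, dash, number, colon) and the function names are assumptions for illustration. The regex is matched against the whole output line, so it starts with the hash that %H prints.

```shell
#!/bin/sh
# Hypothetical convention: "<hash> ABC-123: subject". The hash comes from %H.
message_regex='^[[:xdigit:]]+ [[:upper:]]+-[0-9]+: '

# Reads "<hash> <subject>" lines on stdin and prints those that do NOT match $1.
filter_bad_messages() {
    grep -E -v "$1"
}

# Lists new, non-merge commits whose first message line violates the convention
# ($1 = old object name, $2 = new object name).
find_bad_messages() {
    git log --no-merges --pretty='%H %s' "$1..$2" --not --branches='*' |
        filter_bad_messages "$message_regex"
}

# In the update hook ($2 = old object name, $3 = new object name):
#   bad=$(find_bad_messages "$2" "$3")
#   if [ -n "$bad" ]; then
#       echo "Commit messages without an issue ID:" >&2
#       echo "$bad" >&2
#       exit 1
#   fi
```

Because the hook tests whether the output is empty rather than looking at the exit status of grep, it does not matter that grep returns a non-zero status when every message is fine.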

Wrapup

As you can see, hook performance needs to be carefully monitored. Bad hooks can easily slow down the development process and be a hindrance to the developers rather than a help. But with some simple tricks you can easily create hooks that run in a matter of seconds or less. There are, of course, some edge cases to consider that I skipped over here (e.g. creation of a new branch, deletion of a branch, …) but I’ll leave those up to you.
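For the two edge cases named above, Git passes an object name consisting only of zeros: as the old value when a ref is created and as the new value when it is deleted. Here is a minimal sketch of how a hook might pick its revision argument accordingly. The function names is_zero and commit_range are hypothetical, and skipping all checks on deletion is an assumption about what you want.

```shell
#!/bin/sh
# True if the object name consists only of zeros (works for any hash length).
is_zero() {
    case $1 in
        *[!0]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Prints the revision argument to hand to git rev-list or git log,
# or nothing if there are no commits to check.
commit_range() {
    old_ref=$1
    new_ref=$2
    if is_zero "$new_ref"; then
        :                            # ref is being deleted: nothing to verify
    elif is_zero "$old_ref"; then
        echo "$new_ref"              # new ref: everything reachable from its tip
    else
        echo "$old_ref..$new_ref"    # normal update
    fi
}

# In the update hook ($2 = old object name, $3 = new object name):
#   range=$(commit_range "$2" "$3")
#   [ -z "$range" ] || git rev-list "$range" --not --branches='*'
```

For a new branch, the --not --branches filter still does its job: only commits that are not yet reachable from any existing branch are listed.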
