This piece collects some lessons I learned while optimizing the performance of our Git hooks. The information here is certainly not new, but I haven’t found it aggregated and explained in one single place yet.
We recently switched our main code repository from SVN to Git, and with that came many challenges and improvements to our software development process. One feature that Git offers is so-called hooks. These are small programs or scripts that are run before or after a commit, when pushing to a repository and at other times. They may write to the console during these Git processes (making the output look like it came from Git) and even abort them, e.g. if the user is trying to use Git in a way you don’t want them to.
Directly after the migration to Git, we had several recurring problems that we addressed with these commit hooks. In this post, I’ll summarize a few of the interesting things we learned along the way.
But first, let’s dive a bit into one specific Git hook: the update hook.
Hooks in general live in your .git/hooks directory (or your_repo.git/hooks for a bare repository). A hook is simply an executable file with a certain name. The update hook is particularly interesting if you follow a development model in which you have one central repository to which all changes should eventually be pushed. This hook is run every time someone pushes something to this repository and it lets you accept or reject that push. This makes it ideal for enforcing central rules that all developers should follow while also giving developers the freedom to temporarily violate these rules in their local repositories. Once they try to push their changes, though, they’ll have to follow these rules or the push will be rejected.
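This hook is called with three parameters: the name of the ref being updated (e.g. refs/heads/my_fancy_branch), the old SHA-1 the ref currently points to, and the new SHA-1 it should point to after the push.
With this info, we have everything we need to check every commit the developer is pushing. The easiest way to do this is to get all commits between the old and the new SHA-1 and perform our check on each of them. In pseudo code: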
$commits = git rev-list $old_ref..$new_ref
for ($commit in $commits) {
    if (!check_commit($commit)) {
        print some helpful error message
        exit 1
    }
}
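As a rough shell sketch of the pseudo code above: the three arguments are what Git passes to the update hook, while check_commit and check_push are hypothetical names, and the check itself (a non-empty subject line) is only an assumption for illustration.

```shell
#!/bin/sh
# Naive update hook: checks every commit between the old and the new SHA-1.

# Hypothetical placeholder check: accept a commit if its subject line is not empty.
check_commit() {
    [ -n "$(git log -1 --pretty='%s' "$1")" ]
}

# $1 = ref name, $2 = old SHA-1, $3 = new SHA-1 (the update hook's arguments).
check_push() {
    ref_name=$1
    old_ref=$2
    new_ref=$3
    for commit in $(git rev-list "$old_ref..$new_ref"); do
        if ! check_commit "$commit"; then
            echo "Rejected $ref_name: commit $commit failed the check" >&2
            return 1
        fi
    done
    return 0
}

# Only run when invoked as a hook, i.e. with Git's three arguments.
if [ "$#" -eq 3 ]; then
    check_push "$@"
    exit $?
fi
```

Saved as an executable file named update in the hooks directory, a non-zero exit status rejects the push for that ref.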
This does what we want. But if your check_commit function isn’t blazingly fast, it’ll take too long. The problem here is threefold:
[…] grep to check the commit message etc.) which is rather slow.
The last point is very interesting. Assume the following scenario:
A developer creates a feature_branch that they work on. After a while, they merge master into their feature_branch. Finally, they push feature_branch to the central repository.
If we follow the naive approach above, what will happen is that not only will we verify all commits the developer made on the feature branch, we’ll also re-verify all the commits they merged into the feature branch from master. Take a feature branch that’s been delayed for a while on a repository with several developers and you’ll get pushes that take up to several minutes. That’s unacceptable. Reminds me of the old times when compiling used to be the go-to excuse for slacking off.
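What we need instead is a solution that satisfies both of these criteria: it only looks at commits that are actually new to the repository, and it verifies all of those commits in one go instead of one at a time.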
To achieve the first goal, we need a smarter rev-list call. git rev-list $old_ref..$new_ref will give us all commits between the two refs, even if they are already known to the repository. So we have to filter out all commits that are on existing branches, e.g. master, at the time of the push. Luckily, the update hook is run after all our new commits have been received, but before any of the branch labels are adjusted. So if we push some new commits to master, at the time the hook is run, the master label will still point to where it was before. We can use this to our advantage and simply specify that we don’t want any commits that are reachable from any of the known branch labels:
git rev-list $old_ref..$new_ref --not --branches='*'
This drastically reduces the number of commits we need to look at and thus improves the performance of our hook.
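To see how much the filter saves on a given push, the two rev-list calls can be wrapped in small shell functions and compared. This is only a sketch; the function names naive_commits, new_commits and report_savings are made up for illustration.

```shell
# $1 = old SHA-1, $2 = new SHA-1 (as passed to the update hook).

# Every commit between the two refs, known to the repository or not.
naive_commits() {
    git rev-list "$1..$2"
}

# Only the commits that no existing branch label can reach yet.
new_commits() {
    git rev-list "$1..$2" --not --branches='*'
}

# Print how many commits the filtered call leaves us to check.
report_savings() {
    naive=$(naive_commits "$1" "$2" | wc -l)
    filtered=$(new_commits "$1" "$2" | wc -l)
    # The arithmetic expansion strips the padding some wc versions add.
    echo "checking $(($filtered)) of $(($naive)) commits"
}
```

Calling report_savings "$old_ref" "$new_ref" from inside the hook prints the two counts side by side.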
Next, we’ll need to make sure we verify all those commits in one go. I’m going to show you how we did this with two specific examples.
To check all file modifications and additions for a maximum file size (big files like binaries in Git repos are problematic in the long run), we’ll first need to list all those touched files, then get their size and finally filter for those that are too large.
This is how we do it:
git rev-list --objects $old_ref..$new_ref --not --branches='*'
This returns the SHA1 hashes and paths of all touched objects (including but not limited to touched files).
git cat-file --batch-check='%(objectname) %(objectsize) %(rest)'
List the SHA1, object size in bytes and path for each of those objects.
awk '$2 > '$filesize_limit' { print }' | sort | uniq --skip-fields=2
Filter for those objects that are too big, sort them and remove duplicates.
Combine these three steps in one long pipe and you will have a list of SHA1 hashes and paths for files that are too big. Running this on a commit range with hundreds of commits in our repository takes about 3 seconds. For any normal push this should therefore be sufficient in terms of performance.
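Put together, the three steps might look like the following shell sketch. The function names are hypothetical, the limit is passed to awk with -v instead of being spliced into the program text, and uniq -f 2 is the POSIX spelling of --skip-fields=2.

```shell
# Print "<sha1> <size> <path>" for every object introduced by the push.
# $1 = old SHA-1, $2 = new SHA-1.
list_pushed_objects() {
    git rev-list --objects "$1..$2" --not --branches='*' |
        git cat-file --batch-check='%(objectname) %(objectsize) %(rest)'
}

# Read "<sha1> <size> <path>" lines on stdin and keep those above the limit.
# $1 = size limit in bytes.
filter_too_large() {
    awk -v limit="$1" '$2 + 0 > limit + 0 { print }' | sort | uniq -f 2
}

# $1 = old SHA-1, $2 = new SHA-1, $3 = size limit in bytes.
find_large_files() {
    list_pushed_objects "$1" "$2" | filter_too_large "$3"
}
```

A hook could then reject the push whenever find_large_files "$old_ref" "$new_ref" "$filesize_limit" prints anything, and show that output to the developer.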
If you want to make sure that, e.g., your teammates don’t forget to reference an issue ID when committing, all you have to do is:
git log --no-merges --pretty='%H %s' $old_ref..$new_ref --not --branches='*'
List all commits with their SHA1 and first message line.
egrep -v "$message_regex"
Filter for those that do not match your message regex (replace egrep with your favorite grepping tool).
As you can see, hook performance needs to be carefully monitored. Bad hooks can easily slow down the development process and be a hindrance to the developers rather than a help. But with some simple tricks you can easily create hooks that run in a matter of seconds or less. There are, of course, some edge cases to consider that I skipped over here (e.g. creation of a new branch, deletion of a branch, …) but I’ll leave those up to you.
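For what it's worth, here is one possible shell sketch of the commit message check that also covers the two edge cases just mentioned. It relies on Git passing an all-zero SHA as the old value when a branch is created and as the new value when it is deleted. The function names are hypothetical, and the regex is whatever your team agrees on.

```shell
# True if the argument consists only of zeros (Git's "no such object" SHA).
is_zero_sha() {
    case $1 in
        *[!0]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Print the revision range to inspect, or nothing if the branch is deleted.
# $1 = old SHA-1, $2 = new SHA-1.
push_range() {
    if is_zero_sha "$2"; then
        return 0        # branch deleted: nothing to check
    elif is_zero_sha "$1"; then
        echo "$2"       # new branch: everything reachable from the new tip
    else
        echo "$1..$2"
    fi
}

# Print "<sha1> <subject>" for each new non-merge commit that fails the regex.
# $1 = old SHA-1, $2 = new SHA-1, $3 = message regex.
# Note: the regex is matched against the whole line, including the SHA1.
find_bad_messages() {
    range=$(push_range "$1" "$2")
    [ -n "$range" ] || return 0
    # grep exits with 1 when it prints nothing, so callers should look at
    # the output rather than at the exit status.
    git log --no-merges --pretty='%H %s' "$range" --not --branches='*' |
        grep -E -v -- "$3" || true
}
```

If find_bad_messages "$old_ref" "$new_ref" "$message_regex" prints anything, the hook can show those lines to the developer and exit with a non-zero status.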