Solving buildhistory slowness

The buildhistory class in oe-core is incredibly useful for analysing the changes in packages and images over time, but with frequent builds all of this metadata accumulates and the resulting git repository can become quite unwieldy. I recently noticed that updating my buildhistory repository was often taking several minutes, with git frantically doing huge amounts of I/O. This wasn’t surprising once I realised that my buildhistory repository was now 2.9GB, covering every build I’ve done since April. Historical metrics are useful but I only ever go back a few days, so this is slightly over the top. Deleting the entire repository is one idea, but a better solution would be to drop everything but the last week or so.

Luckily Paul Eggleton had already been looking into this and pointed me at a StackOverflow page which used “git graft points” to erase history. The basic theory is that it’s possible to tell git that a certain commit has specific parents, or in this case no parent, so it becomes the end of history. One git filter-branch and a re-clone later to clean out the stale history, the repository is far smaller.

$ git rev-parse "HEAD@{1 month ago}" > .git/info/grafts

This tells git that the commit a month before HEAD has no parents. The documentation for graft points explains the syntax, but for this purpose that’s all you need to know.
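
Note that `HEAD@{1 month ago}` is resolved through the reflog, that is, where HEAD pointed in this particular clone a month ago. That works here because the buildhistory commits are made locally. As a rough sketch of an alternative, the cut-off could instead be chosen by commit date with `git rev-list`. The function name `write_graft` and its arguments are made up for illustration, and it assumes `.git` is an ordinary directory inside the working tree.

```shell
# Hypothetical helper: make the newest commit older than a cut-off date the
# new start of history by writing it to the grafts file with no parents.
write_graft() {
    repo=$1      # path to the buildhistory working tree
    cutoff=$2    # any date git understands, e.g. "1 month ago"

    # Newest commit reachable from HEAD whose commit date is before the cut-off.
    commit=$(git -C "$repo" rev-list -1 --before="$cutoff" HEAD) || return 1
    if [ -z "$commit" ]; then
        echo "no commit older than '$cutoff'" >&2
        return 1
    fi

    # A graft line holding only a commit ID declares that commit parentless.
    mkdir -p "$repo/.git/info"
    printf '%s\n' "$commit" > "$repo/.git/info/grafts"
}
```

It would be called as `write_graft /your/path/here/buildhistory "1 month ago"`, after which the filter-branch step proceeds as before.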

$ git filter-branch

This rewrites the repository from the new start of history. This isn’t a quick operation: the manpage for filter-branch suggests using a tmpfs as a working directory and I have to agree it would have been a good idea.

$ git clone file:///your/path/here/buildhistory buildhistory.new
$ rm -rf buildhistory
$ mv buildhistory.new buildhistory

After filter-branch all of the previous objects still exist in reflogs and so on, so this is the easiest way of reducing the repository to just the objects needed for the revised history. My newly shrunk repository is a fraction of the original size, and more importantly doesn’t take several minutes to run git status in.

3 thoughts on “Solving buildhistory slowness”

  1. The amount of time a ‘git status’ takes to complete really *should* mostly be independent of the number of revisions of history that you have. I’ve seen this violated a number of times, similar to what you report in your post, and in the cases I’ve seen it almost always boils down to a repository where no ‘git gc’ has been run in a really long time. Your filtering of the history might be mostly a distraction, with the recloning of the repository and replacing the original with the new clone having the same effect as a gc. While git gc --auto is triggered by lots of commands to try to avoid this kind of nasty slow buildup, I have seen a number of specialized tools or workflows that never trigger one. This buildhistory thing could easily be one.

    If you see this again, it would be interesting to see the output of ‘git count-objects -v’ in that repository. If there are a large number of loose objects (greater than, say, 5000; check the ‘count’ field of the count-objects output) or a large number of packs (larger than, say, 50; check the ‘packs’ field of the count-objects output), then you’re in need of a gc. I’ve seen the numbers go an order of magnitude or two higher than those amounts, and when it does, all git operations slow to a complete crawl. (If this is the problem, the buildhistory tool can work around it by periodically calling ‘git gc --auto’ or calling some git operation that will call that as a side-effect.)

  2. Brace yourself:

    $ git count-objects -v
    count: 588119
    size: 2399988
    in-pack: 0
    packs: 0
    size-pack: 0
    prune-packable: 0
    garbage: 0

    The operations this repository goes through are lots of files added and committed, then some generally minor changes and committed again. No pushes, no branching. I suspect we’re never triggering a garbage collect.

    1. Post git-gc:

      $ git count-objects -v
      count: 97
      size: 408
      in-pack: 587666
      packs: 1
      size-pack: 57446
      prune-packable: 0
      garbage: 0

      Before the garbage collection with cold caches “git status” took two minutes ten seconds. After running “git gc --auto” and clearing the cache, “git status” takes 27 seconds.

      Paul just put a call to “git gc --auto” in the script so that should help quite dramatically. I didn’t realise that garbage collection didn’t happen if all you did were commits, thanks for pointing that out!
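
The two ideas from the comments, checking `git count-objects -v` and running `git gc --auto` after each commit, can be sketched in shell as below. This is an illustration only and not the actual buildhistory code. The function names `needs_gc` and `commit_and_gc` are hypothetical, and the thresholds are the rough figures suggested in the first comment.

```shell
# Hypothetical helper: reads `git count-objects -v` output on stdin and
# succeeds when the loose-object or pack counts exceed the rough limits
# suggested in the comments.
needs_gc() {
    awk -v max_loose=5000 -v max_packs=50 '
        $1 == "count:" { loose = $2 }
        $1 == "packs:" { packs = $2 }
        END { exit !(loose > max_loose || packs > max_packs) }
    '
}

# Hypothetical helper: commit whatever changed in the repository, then let
# git decide whether housekeeping is due.
commit_and_gc() {
    repo=$1
    message=$2
    git -C "$repo" add -A || return 1
    # `git diff --cached --quiet` exits non-zero when something is staged,
    # so the commit is skipped when there is nothing new to record.
    if ! git -C "$repo" diff --cached --quiet; then
        git -C "$repo" commit --quiet -m "$message" || return 1
    fi
    git -C "$repo" gc --auto --quiet
}
```

For a one-off check, `git count-objects -v | needs_gc && echo "gc overdue"` can be run inside the repository. A build script could call something like `commit_and_gc /your/path/here/buildhistory "Build results"` in place of a bare commit.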
