Debugging parallelism problems in Make

Posted by Ross Burton on August 2, 2012

As I'm now working on the Yocto Project, I've a new i7 build machine which builds all of the distro with -j8 for speed (and builds up to 8 packages at once, just to make sure that all the cores are busy). I don't actually know what -j level the autobuilders are using but they've 24 cores each... Anyway, lots of code is being built daily with high Make parallelism, so we're good at finding subtle races in makefiles. Debugging these isn't trivial or obvious at first, so I thought I'd blog about a few that I've encountered recently.

telepathy-glib

| Making all in telepathy-glib
| make[2]: Entering directory `/buildarea1/yocto-autobuilder/yocto-slave/nightly-x86/build/build/tmp/work/i586-poky-linux/telepathy-glib-0.19.2-r0/telepathy-glib-0.19.2/telepathy-glib'
| /bin/mkdir -p _gen
| ( cd . && cat versions/0.7.0.abi [...] versions/0.19.2.abi  ) | /bin/grep '^tp_cli_.run.' > _gen/reentrant-methods.list.tmp
| /bin/sh: line 1: _gen/reentrant-methods.list.tmp: No such file or directory
| make[2]: *** [_gen/reentrant-methods.list] Error 1

So it creates a directory, and then fails to create a file? The hint is that the error is "No such file or directory", which tells you that _gen/ isn't present. What isn't obvious from the output is that make is running the mkdir and the pipeline that generates reentrant-methods.list in parallel, which you can confirm by looking at the makefile. It's rather large, but the gist of it is that the rule that does the mkdir isn't a dependency of the rule that generates reentrant-methods.list, so they must both be dependencies of some higher target and are therefore being run in parallel.

Most of the time the mkdir happens first, but occasionally the pipeline wins the race and _gen/ doesn't exist yet. Once this is understood it's a simple matter to add the missing dependencies to the makefile.
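As a rough illustration of that kind of fix (not the actual telepathy-glib patch), here is a minimal GNU Make sketch. The ABI_FILES variable and the plain cat standing in for the real cat-and-grep pipeline are my own assumptions; only the _gen directory and the reentrant-methods.list names come from the log above. Recipe lines must start with a tab.

```makefile
# Hypothetical fragment: ABI_FILES and the "all" target are illustrative,
# the real makefile is much larger and may be fixed differently.
ABI_FILES := versions/0.7.0.abi versions/0.19.2.abi

.PHONY: all
all: _gen/reentrant-methods.list

# The directory gets its own rule...
_gen:
	mkdir -p $@

# ...and the generated file names it after the "|", making it an
# order-only prerequisite: _gen must exist before this recipe starts,
# but a newer timestamp on the directory never forces a rebuild.
_gen/reentrant-methods.list: $(ABI_FILES) | _gen
	cat $(ABI_FILES) > $@.tmp   # stand-in for the real cat | grep pipeline
	mv $@.tmp $@
```

With the dependency stated explicitly, make -j can no longer schedule the generator before the mkdir has finished, however many jobs are running.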

gThumb

This was more fun. When building with any level of parallelism, make would busy-loop forever. Annoying on your desktop, not so funny on a build server.

When make is running tasks sequentially, it knows when each task has completed and can then check whether the expected files have appeared, and so on. This logic changes with any level of parallelism because multiple things are happening at once. Strangely, make solves this by busy-looping, watching for file changes (you can see this with --debug). Generally the expected files either appear or there is an error, but in this case make was spinning forever.

Digging into the rules for the enumeration generator shows some dependencies that are not required, and rather complex logic for putting the generated files in the right place. Complicated, and Doing It Wrong.

Writing to a temporary file and then atomically moving that to the right file is a good thing, and essential in parallel builds, as otherwise dependent rules could read a partially-written file. But this makefile is comparing the temporary file with the target and copying the file only if it's different. This looks like an attempted optimisation to reduce rebuilds caused by the enum timestamp changing (won't work: the enum re-generation is happening for a reason, so the rest of the source will rebuild too) and this is what is causing the problem: make is waiting for a file to change when it won't ever change. Once this is understood the fix is simple and results in a cleaner makefile.
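For comparison, the plain write-then-move pattern described above looks something like the following sketch. This is not the gThumb makefile: the enum-types.h target, the HEADERS list and the sed command standing in for the real enumeration generator are all hypothetical names picked for illustration.

```makefile
# Hypothetical fragment: target, HEADERS and the generator are illustrative.
HEADERS := a.h b.h

enum-types.h: $(HEADERS)
	# Generate into a temporary file so that nothing can read a
	# partially-written enum-types.h during a parallel build.
	sed -n 's/^enum //p' $(HEADERS) > $@.tmp
	# Move it into place unconditionally: the rename is atomic on the
	# same filesystem, and the target is always updated once the
	# recipe has run.
	mv -f $@.tmp $@
```

The important difference from the compare-and-copy variant is that the target always changes when the recipe runs, so make is never left expecting an update that doesn't come.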

WebKitGTK

Oh, WebKit... The one package that you need to build with -j to get a build time of less than two days, and it exposes a bug in Make 3.82 causing it to fail with -j. Thanks for that, Make. For reference this is the WebKitGTK+ bug and this is the two-year-old Make bug.

tags: tech, yocto