This can only end badly
I spent a good 5 hours this week, tracking down a problem with distcc hanging up in our Debian Testing build farm. We use distcc to speed up compilation of the c++ sources in the development of the trading shim. Interestingly, our end user community forced us to this decision of developing on Debian testing, as they are using later gcc versions than we were on CentOS, and it was useful to be able to see their errors, BEFORE they reported them to us
On the new compile farm, sometimes we would get a compile in, say, 44 seconds; other times it would drag out for several minutes. This is a problem as we had just slotted a new unit into harness, and expected better results
In checking the logs in the client doing the distribution of compilation tasks, we were seeing a symptom of 'segfaults' in that client's process; other times, the client would stall, seemingly blocked waiting for a compilation result to come back from a remote buildfarm peer, that never came back. Checking on the remote build unit, one of the distccd children would die for mysterious reasons, leaving a message in the dmesg record. Once that failed build timed out, the needed file would be built locally, and the build proceed. Checking the log files nothing obvious jumped out
The obvious debugging technique is to get a minimal reproducer, and then to partition the problem into smaller and smaller possible causes using that reproducer tool. the issue will manifest on one setup, but not the other, ans so one can rule out more and more issues, until the answer is left, staring you in the face
Looking at my Debian helper tool, it had rotted, and was in sorry need of removal of some constraints: It did not use distcc when available; it did not use proper -J parallel compiles; it did not use -O3 optimization in the compiles. My test tool was not set up to see what I needed to see
Time to pay down some 'technical debt' (If you've not read martinfowler piece, and viewed Ward Cunningham's video, stop now, and do so). And so I made some payment there. After testing, I got these results:
Master | Clients | Elapsed time (real) |
pippin | nfs2, 10.16.1.231 | 0m23.281s |
nfs2 | 10.16.1.231, pippin, localhost | 0m23.702s |
10.16.1.231 | pippin, nfs2, localhost | 0m22.551s |
My first thought looking at this: Well, that pretty conclusively rules out machine specific errors, or network path issues. It must be something different in the setup of the user provoking the issue that my tool does not duplicate. NOTE: This is wrong-headed, of course, as: 'An absence of evidence is not evidence of absence of a problem' but was an easy trap to fall into
For every complex problem, there is a solution that is simple, neat, and wrong.
— H. L. Mencken
For every problem there is a solution which is simple, obvious, and wrong."
— Albert Einstein
I tossed my results at that user for their thoughts on the results, and went back to work on another issue
Later in the day, doing some thought experiments with the user, we could not pin down where to look yet. But as a team, I had him provoke the issue with his setup, while I watched the logs on the various machines through several consoles. And the error appeared, and then jumped out and tickled my eyeballs. I was watching nothing in particular, until I saw the failure on process 29673, and then traced that back up. A successful and a failed session looked like this, respectively:
distccd[29673] | (dcc_check_client) | connection from :ffff:10.16.1.249:41771 |
distccd[29673] | (dcc_r_file_timed) | 909179 bytes received in 0.078651s, rate 11289 kB/s |
distccd[29627] | (dcc_collect_child) | cc times: user 1.132070s, system 0.144009s, 23039 minflt, 0 majflt |
distccd[29673] | (dcc_collect_child) | cc times: user 1.092068s, system 0.104006s, 22481 minflt, 0 majflt |
distccd[29673] | (dcc_check_client) | connection from ::ffff:10.16.1.249:41775 |
distccd[29673] | (dcc_r_file_timed) | 818437 bytes received in 0.071648s, rate 11155 kB/s |
distccd[31248] | (dcc_check_client) | connection from ::ffff:10.16.1.249:41779 |
distccd[31248] | (dcc_r_file_timed) | 886761 bytes received in 0.076688s, rate 11292 kB/s |
distccd[29627] | (dcc_collect_child) | cc times: user 1.068066s, system 0.112007s, 23890 minflt, 0 majflt |
distccd[29673] | (dcc_collect_child) | cc times: user 1.108069s, system 0.112007s, 22012 minflt, 0 majflt |
distccd[29673] | (dcc_pump_sendfile) | Notice: sendfile: partial transmission of 15868 bytes; retrying 344332 @15868 |
distccd[1995] | (dcc_log_child_exited) | ERROR: child 29673: signal 11 (no core) |
A-ha! Now we know what to look for:
dhcp-231:/var/log# grep dcc_pump_sendfile distccd-transition-log
distccd[29673] (dcc_pump_sendfile) Notice: sendfile: partial transmission of 15868 bytes; retrying 344332 @15868
distccd[31248] (dcc_pump_sendfile) Notice: sendfile: partial transmission of 15868 bytes; retrying 586732 @15868
distccd[30262] (dcc_pump_sendfile) Notice: sendfile: partial transmission of 15868 bytes; retrying 4655916 @15868
distccd[2005] (dcc_pump_sendfile) Notice: sendfile: partial transmission of 16384 bytes; retrying 74824 @16384
distccd[2128] (dcc_pump_sendfile) Notice: sendfile: partial transmission of 16384 bytes; retrying 286560 @16384
distccd[2170] (dcc_pump_sendfile) Notice: sendfile: partial transmission of 16384 bytes; retrying 97440 @16384
distccd[2129] (dcc_pump_sendfile) Notice: sendfile: partial transmission of 16384 bytes; retrying 301000 @16384
dhcp-231:/var/log#
The TCP process of shuttling code to compile, and the binary results of such compiles are failing the same way, over and over again: partial transmission of 15868 bytes is present every time. Looking at the log entry again, the form of the connecting hosts is unusual: ::ffff:127.0.0.1 and ::ffff:10.16.1.249. Why that is IPv6 notation? And I reach back to my logs as I remember I had an issue like this a year or so on a Debian box
And so, Google with the search argument: debian ipv6 distcc confirms as its first result: 1. #481951 - distcc: zeroconf support broken wrt IPv6 - Debian Bug ... ... and the bug is still open. Killing off IPv6 is the obvious next step, and so, back to Google with: debian disable IPv6 to find: Disabling IPv6 under a 2.6 kernel. Reading the post, there is some back and forth, and the answer seems to be, there is not an 'official Debian answer', but this is what people are doing. Back to Google with: site:debian.org debian disable IPv6 seems to confirm that there is not a single well documented answer which has floated up in Google's searching
Compare: CentOS addresses the matter directly, and as the first Google hit with: site:centos.org disable IPv6
7. How do I disable IPv6?
* Edit /etc/sysconfig/network and set "NETWORKING_IPV6" to "no"
* Add the following to /etc/modprobe.conf :
alias ipv6 off
alias net-pf-10 off
* Run chkconfig ip6tables off to disable the IPv6 firewall
* Reboot the system
Alternative (which might be easier and works on any release with /etc/modprobe.d):
echo "install ipv6 /bin/true" > /etc/modprobe.d/disable-ipv6
Sadly, there is something else on Debian testing in play as well, and it is not just an IPv6 issue (although turning off IPv6 has drastically reduced the frequency of the issue). When I look in today to make sure the 'fix' is working
[74988.951989] distccd[8671]: segfault at 1 ip 7fdd2250e030 sp 7fff2b025da8 error 4 in libc-2.7.so[7fdd22493000+14a000]
[74989.017836] distccd[8651]: segfault at 1 ip 7fdd2250e030 sp 7fff2b025da8 error 4 in libc-2.7.so[7fdd22493000+14a000]
[74989.518050] distccd[8664]: segfault at 1 ip 7fdd2250e030 sp 7fff2b025da8 error 4 in libc-2.7.so[7fdd22493000+14a000]
[74994.152461] distccd[8659]: segfault at 1 ip 7fdd2250e030 sp 7fff2b025da8 error 4 in libc-2.7.so[7fdd22493000+14a000]
Where is that coffee cup? I knew this would not end well