This post will deviate from the type of content I generally post here. It isn’t really related to QA and instead deals with a recent problem we encountered when upgrading our build servers. I’m posting about it here in hopes that others may save themselves time and trouble from the lesson that we learned. If you’re interested, read on, if not, hopefully I get some more QA posts up over the holidays. If you just want to know what the problem was and how we fixed it, you can scroll right to the bottom.
We recently decided to give our build servers an upgrade. Our existing setup was a Mac mini running Jenkins and 4 Xserves that served as slaves. While the Xserves were great as servers, their product line has been discontinued, they’re noisey, they’re large, and they require more power than Mac minis. And even though redundant networking and power supplies are nice, they’re overkill for our needs. So when we went to update the servers we decided go to with Mac mini servers.
Setting up the first slave took a little bit of time and patience. We decided to start clean and set these up from scratch rather than migrate the existing servers. We had a general idea of what was needed and figured out what pieces we were missing through trial-and-error. Taking notes on steps as we went through the first one allowed for an expedited process on the subsequent servers. When we finally got the first one setup we launched it as a slave on Jenkins, assigned a project to build on it, kicked it off and, with some tweaking, got the build to succeed.
With my set of steps to follow, I quickly ran through the second server and had it up and going in an hour or so. Feeling fairly confident in the process at this point, I decided to try a third server near the end of the day, but this one was a special case. This one was going to replace the server that also ran our VPN. Initially to try and avoid having to re-create all the VPN users and have everybody set up a new password, I set up this Mac mini as a clone of the existing server it would be replacing. During the cloning process though I began to give things more thought. After talking with my boss we decided that VPN would actually be moved off of the build servers to a more dedicated box. Once the cloning finished, I booted the mini into the recovery menu and did a re-install.
An hour or so later the re-install was complete, but it hadn’t gone quite as I had hoped. There still seemed to be certain user data and settings that had persisted through the re-install, but it was good enough for now. I ran through my list of steps, skipping a few for software that had persisted through the re-install, and got the server added to Jenkins. Feeling satisfied with my work for the day, I decided to hold off until the next day to do the remaining servers.
The next morning we noticed something strange; Jenkins was showing that two of the new servers had response times of a few seconds while the old servers were showing response times of a few milliseconds. Jenkins was also reporting a clock difference on the two new servers of being a few seconds ahead while the old ones all showed in sync. We also noticed a few builds had failed overnight, but all at different points in the build process. Some of them were even failing in the middle of uploading builds to an external server.
Our first guess was a networking problem, but why? When we created the new build servers we gave them the same names as their predecessors that were to be retired. We also assigned them the same IP addresses, but due to some technical hurdles, the static IPs weren’t being handled by the router. Instead we were telling the machines to use DHCP, but were manually assigning them the IPs we wanted them to have. We wondered if these factors had caused some weird networking problems when trying to reach these new machines that had taken over old hostnames and IPs. While trying to diagnose the issue I noticed that if I pinged one of the new servers continuously, intermittently several packets would timeout.
Oddly, the third server wasn’t exhibiting any of the symptoms that the other two servers were. I had more or less followed the same steps, but one notable difference is this server had gotten a new hostname and IP since the existing Xserve needed to stay up for now to run our VPN. We were already thinking it seemed like some kind of network issue, and now we had one server that didn’t take over an existing IP and hostname and it wasn’t having any of the same problems.
Now that we had figured out the issue, it was just a matter of moving the other two servers we had already setup to new hostnames and IPs. The change only took a couple minutes and we were back in business. Jenkins was showing the response times as low and the clocks in sync. Everything looked great… until a little while later when I got an email notifying me that a build had failed. I looked and sure enough, it was one of our new Mac minis and like the other failures, it happened in the middle of the build. Looking at the Jenkins interface, the slave had now been marked as down as its response times had gotten so bad. I looked to the third server that had shown the initial success and it was still fine. Once again, I was perplexed.
I tried searching online for any clues on what to do about a drifting clock in Jenkins with no luck. I went to an OS X Server room on IRC and asked for help, but was directed to try the Jenkins room. I went to the Jenkins room for help and nobody had any suggestions. I spent the night mulling things over and was running out of ideas.
The next morning I talked with Casey, one of our developers who had been helping me try to troubleshoot the issue the last couple of days. Casey had an idea; what about the energy saving preferences? I quickly took a look at the two problematic servers and noticed they were set to put the computer to sleep after 30 minutes of inactivity. I checked the third server and this option was disabled. I quickly disabled the option on the other two servers and wished for the best. Initially things looked better, but I had already been fooled several times with other solutions that initially had good results.
We let the servers go for a few hours and this time the good results actually stuck. That was it; Casey figured it out. In retrospect it all makes sense. Well… almost all of it. The build processes didn’t count as activity because they were running in the background; the OS was relying on mouse/keyboard input to tell it the computer was in use. The reason the third server didn’t have any issues is because one of the settings that persisted across the reinstallation was the Xserve’s Energy Saver preferences. The Xserve’s seemed to have shipped with a much saner default of “Never” for the computer sleep option. I still don’t quite understand the symptoms of showing the clock as drifting ahead or why the response time appeared slow rather than just non-existant. Regardless, this was a much harder lesson to learn than it probably should have been. It should have occurred to us much sooner to check the Energy Saver preferences, but I do hope that Apple will change the default on the Mac mini servers to have these preferences disabled and save others the trouble and confusion. I don’t imagine that many people out there running servers want them going to sleep every 30 minutes.
tl;dr: Mac mini servers running as build servers had symptoms of slow response times with Jenkins, Jenkins reporting their system clocks as ahead, and builds failing at random points, but not all the time. This was fixed by changing the Mac mini server Energy Saver preferences from the default of putting it to sleep after 30 minutes of inactivity, to never putting it to sleep.