Thursday, March 29, 2012

Amazon Linux 2012.03, Ruby, and Chef

Yesterday Chef suddenly stopped working on my AWS boxes. It would proceed normally until it attempted to install a package, and then it would segfault, like so:

Wed, 28 Mar 2012 23:26:07 +0000 INFO: Processing package telnet action install ((irb#1) line 1)
/usr/lib/ruby/gems/1.8/gems/chef-0.10.8/lib/chef/provider/package/yum.rb:420: BUG Segmentation fault
ruby 1.8.7 (2011-12-28 patchlevel 357) x86_64-linux

It wouldn't always segfault on line 420, nor always on yum.rb, but it would always segfault.

Not good!

I discovered that this problem was only happening on newly deployed machines. Machines launched a day earlier could complete their Chef runs, and they could install packages from recipes constructed with shef, Chef's interactive console.

How might a newly deployed machine differ from one that's been running? Using cloud-init and cfn-init, I've configured my machines to install a few things when they spin up. One of these is Chef, but I've locked the version down to 0.10.8, and I was able to verify that both new and old boxes were running the same version of Chef.

I also use cloud-init and cfn-init to install a few packages through yum – just the bare necessities required to get Chef installed. One of these is ruby-devel, and I had not locked down the version on it. If ruby-devel had a later version available, it would have been installed, and ruby would have been updated as a dependency. I checked the old box:

$ ruby -v
ruby 1.8.7 (2011-12-28 patchlevel 357) [x86_64-linux]

And the new:

$ ruby -v
ruby 1.8.7 (2011-12-28 patchlevel 357) [x86_64-linux]

The same version! I wondered what yum might have to say about these packages. On the old box:

$ yum list installed | grep ruby
ruby.x86_64        1.8.7.357-1.10.amzn1  @amzn-updates
ruby-devel.x86_64  1.8.7.357-1.10.amzn1  @amzn-updates
ruby-irb.noarch    1.8.7.357-1.10.amzn1  @amzn-updates
ruby-libs.x86_64   1.8.7.357-1.10.amzn1  @amzn-updates
ruby-rdoc.noarch   1.8.7.357-1.10.amzn1  @amzn-updates
rubygems.noarch    1.3.7-1.7.amzn1       @amzn-main

And the new:

$ yum list installed | grep ruby
ruby.x86_64        1.8.7.357-1.16.amzn1  @amzn-main
ruby-devel.x86_64  1.8.7.357-1.16.amzn1  @amzn-main
ruby-irb.noarch    1.8.7.357-1.16.amzn1  @amzn-main
ruby-libs.x86_64   1.8.7.357-1.16.amzn1  @amzn-main
ruby-rdoc.noarch   1.8.7.357-1.16.amzn1  @amzn-main
rubygems.noarch    1.3.7-1.7.amzn1       @amzn-main

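Aha! A tiny change in the right-most portion of the version string, a portion I hadn't even known existed: the release suffix went from 1.10 to 1.16, and the source repository changed from amzn-updates to amzn-main.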
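Spotting a one-digit difference like this by eye is easy to get wrong, so here is a minimal Python sketch of how the two listings could be compared automatically. The function names (`parse_yum_list`, `diff_packages`) are my own inventions for illustration, and the sample data is an abbreviated copy of the output above. It assumes each package occupies one line with exactly three whitespace-separated fields; real `yum list installed` output can wrap long rows, and this sketch simply skips anything that doesn't fit.

```python
def parse_yum_list(text):
    """Map package name -> (version-release, repo) from `yum list installed` output."""
    packages = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip prompts, headers, blank lines, and wrapped rows
        name, version_release, repo = fields
        packages[name] = (version_release, repo)
    return packages


def diff_packages(old, new):
    """Return {name: (old_version, new_version)} for packages present in both
    listings whose version-release strings differ."""
    changed = {}
    for name in sorted(set(old) & set(new)):
        if old[name][0] != new[name][0]:
            changed[name] = (old[name][0], new[name][0])
    return changed


# Abbreviated sample data copied from the two boxes above.
OLD_BOX = """\
ruby.x86_64 1.8.7.357-1.10.amzn1 @amzn-updates
ruby-devel.x86_64 1.8.7.357-1.10.amzn1 @amzn-updates
rubygems.noarch 1.3.7-1.7.amzn1 @amzn-main
"""

NEW_BOX = """\
ruby.x86_64 1.8.7.357-1.16.amzn1 @amzn-main
ruby-devel.x86_64 1.8.7.357-1.16.amzn1 @amzn-main
rubygems.noarch 1.3.7-1.7.amzn1 @amzn-main
"""

if __name__ == "__main__":
    changes = diff_packages(parse_yum_list(OLD_BOX), parse_yum_list(NEW_BOX))
    for name, (before, after) in changes.items():
        print(f"{name}: {before} -> {after}")
```

With this sample data, only ruby and ruby-devel are reported; rubygems is identical on both boxes and is left out.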

Digging through the AWS website revealed that today was launch day for the newest version of Amazon Linux, 2012.03. The last release was 2011.09. What's more, it turns out that Amazon Linux employs a system that allows 2011.09 instances to run a yum update to upgrade all packages and effectively become 2012.03. That means that anything installed via yum on a 2011.09 instance now gets the package built for the new 2012.03 release rather than the one built for 2011.09. That was the root of the problem: I was running 2012.03's Ruby on my 2011.09 instances.

I ran some more tests, deployed a couple of 2012.03 instances, and confirmed that it was the Ruby shipping with 2012.03 that was at fault. Even on a fresh 2012.03 instance, Chef would throw the same segfault errors when attempting to install packages.

What to do? I found that Amazon provides a way to tell 2011.09 instances not to pull packages from later releases. Just add this to the cloud-config:

#cloud-config
repo_releasever: 2011.09

I made the changes, which included converting my instance's userdata to use cloud-config syntax instead of being ingested as a Bash script, and the segfaults disappeared. I was able to confirm that the instance was once again sticking with the Ruby from 2011.09:

$ yum list installed | grep ruby
ruby.x86_64        1.8.7.357-1.10.amzn1  @amzn-updates
ruby-devel.x86_64  1.8.7.357-1.10.amzn1  @amzn-updates
ruby-irb.noarch    1.8.7.357-1.10.amzn1  @amzn-updates
ruby-libs.x86_64   1.8.7.357-1.10.amzn1  @amzn-updates
ruby-rdoc.noarch   1.8.7.357-1.10.amzn1  @amzn-updates
rubygems.noarch    1.3.7-1.7.amzn1       @amzn-main
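Since the Ruby version number alone didn't reveal the change, a check like this has to compare the full version-release string. Below is a rough Python sketch of such a check, which could run before Chef does. The `EXPECTED` table and the `find_mismatches` helper are hypothetical names of my own, and the pinned values are simply the ones from the 2011.09 listing above. It makes the same simplifying assumption as before: one package per line, three whitespace-separated fields.

```python
# Hypothetical pin table: the version-release strings seen on a working 2011.09 box.
EXPECTED = {
    "ruby.x86_64": "1.8.7.357-1.10.amzn1",
    "ruby-devel.x86_64": "1.8.7.357-1.10.amzn1",
    "ruby-libs.x86_64": "1.8.7.357-1.10.amzn1",
}


def find_mismatches(yum_output, expected):
    """Return a list of (name, wanted, actual) for every expected package whose
    installed version-release differs; actual is None if it is not installed."""
    installed = {}
    for line in yum_output.splitlines():
        fields = line.split()
        if len(fields) == 3:  # name, version-release, repo
            installed[fields[0]] = fields[1]

    mismatches = []
    for name, wanted in sorted(expected.items()):
        actual = installed.get(name)
        if actual != wanted:
            mismatches.append((name, wanted, actual))
    return mismatches


# Sample listings copied (abbreviated) from this post.
PINNED_BOX = """\
ruby.x86_64 1.8.7.357-1.10.amzn1 @amzn-updates
ruby-devel.x86_64 1.8.7.357-1.10.amzn1 @amzn-updates
ruby-libs.x86_64 1.8.7.357-1.10.amzn1 @amzn-updates
"""

UPGRADED_BOX = """\
ruby.x86_64 1.8.7.357-1.16.amzn1 @amzn-main
ruby-devel.x86_64 1.8.7.357-1.16.amzn1 @amzn-main
ruby-libs.x86_64 1.8.7.357-1.16.amzn1 @amzn-main
"""

if __name__ == "__main__":
    for name, wanted, actual in find_mismatches(UPGRADED_BOX, EXPECTED):
        print(f"{name}: expected {wanted}, found {actual}")
```

On a real instance the listing would come from running `yum list installed` rather than from a hard-coded string; an empty result means the box is still on the pinned Ruby.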

I reported the issue to Amazon and commented on the related Chef bug. Hopefully they'll have a fix soon, because I prefer to run the latest releases when possible.