Tuesday, July 22, 2008

The bleeding edge Linux desktop

The obvious advantage of closely following the development of the various components of a Linux distribution is that you get the enhancements as soon as they are deemed "stable enough" by the developer. The obvious downside of this is that "stable enough" for them is possibly not "stable enough" for you, and sometimes the bleeding edge is not stable by design.

Consider for example the Linux kernel. Current Ubuntu distributions are using a stable 2.6.24.something kernel. The latest stable kernel is 2.6.26. You might say "But that's not much of a difference". It depends. Under the new kernel development model, all kernel development happens between those (not so) "minor" releases. The first two weeks after a release make up the "merge window", in which new features and updates are merged into Linus' tree. After the merge window, 2.6.(x+1)-rc1 is published, and the rest of the release cycle consists of fixing bugs in what was merged.

One of the things that got merged for the last 2.6.26 release was LED support for the iwl3945 driver (the driver for the wireless NIC on my laptop). I had been waiting for this since the release of this driver (there was an older driver for this card with LED support, but the driver used a closed-source userspace component, which was not cool). I wanted to have the LEDs as soon as possible. How could I do that?

The first option is to regularly get tarballs from kernel.org. That's not a bad option, but having to load kernel.org each time there's a new release or release candidate is a bit of a bother. The best option is compiling directly from Linus' git tree.

To do this we first need to clone Linus' tree. Of course, if you don't have git installed you need to install it first. On Ubuntu:

$ sudo apt-get install git-core

Now we are ready to clone Linus' tree. This is the equivalent of a check-out in older (svn, cvs) terminology. Position yourself in the directory you will use as the parent of your Linux source tree and do:

$ git-clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

This will take a while. After it's done, change into the new linux-2.6 directory. To compile the kernel, we first need to configure it. Use make defconfig to automatically select the default options, and then tailor your config using the curses-based make menuconfig. Most of the defaults are sane; you should just check that the drivers for your hardware are selected. Once you are happy with your config, just do make to build the kernel, and sudo make install && sudo make modules_install after the build is done to install the kernel.
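As a rough sketch, the steps above could be wrapped in a small shell function. The run helper, the build_kernel name and the DRY_RUN variable are made up for illustration; the make targets and their order are the ones from the paragraph above.

```shell
# Hypothetical helper: with DRY_RUN=1 it prints the command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# build_kernel TREE_DIR
# Configure, build and install the kernel found in TREE_DIR.
# The parentheses run everything in a subshell, so the caller's
# working directory is left alone.
build_kernel() (
    run cd "$1" &&
    run make defconfig &&
    run make menuconfig &&
    run make &&
    run sudo make install &&
    run sudo make modules_install
)
```

Calling it with DRY_RUN=1 first, for example DRY_RUN=1 build_kernel linux-2.6, only lists the commands, which is a cheap way to check the sequence before committing to a long build.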

There are still a few things you might need to do in order to boot your freshly compiled kernel. If you use drivers that require firmware (such as wireless NIC drivers), you need to add a symlink in /lib/firmware that points to the old firmware directory and is named after your newly built kernel. After that, just do:

$ sudo update-initramfs -c -k <new-kernel-version>

This will create the initramfs image for your new kernel. The initramfs holds important kernel modules (filesystem, PATA/SATA controller) that are needed to bring up the system, to avoid having to include them directly in the kernel. Next, you need to link /boot/initrd.img to your new initramfs. Finally, use update-grub to install the new kernel into the GRUB bootloader. After a reboot, you should be running your new kernel.
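The two symlinks mentioned above (the firmware one and /boot/initrd.img) could be created along these lines. This is only a sketch: the function names and the optional directory arguments are hypothetical, and it assumes that the firmware lives in a per-version directory under /lib/firmware and that the new image is called initrd.img-<version>, which is the usual Ubuntu naming.

```shell
# link_firmware OLD_VERSION NEW_VERSION [FIRMWARE_DIR]
# Make FIRMWARE_DIR/NEW_VERSION a symlink to the OLD_VERSION directory.
link_firmware() {
    fw_dir="${3:-/lib/firmware}"
    # Refuse to create a dangling link.
    [ -d "$fw_dir/$1" ] || { echo "no firmware directory for $1" >&2; return 1; }
    # Relative link, resolved inside FIRMWARE_DIR.
    ln -s "$1" "$fw_dir/$2"
}

# link_initrd NEW_VERSION [BOOT_DIR]
# Point BOOT_DIR/initrd.img at the image built for NEW_VERSION.
link_initrd() {
    boot_dir="${2:-/boot}"
    [ -f "$boot_dir/initrd.img-$1" ] || { echo "no initramfs for $1" >&2; return 1; }
    # -f replaces an existing initrd.img link.
    ln -sf "initrd.img-$1" "$boot_dir/initrd.img"
}
```

Both need root when used on the real /lib/firmware and /boot. The old version is typically whatever uname -r prints on the running kernel. With the links in place and update-grub done, a reboot should bring up the new kernel.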

Or not.

Depending on when you did the clone, you could have landed in different places of the release cycle. Maybe you cloned an rc1, or worse, just after a release. This means you got a snapshot of the kernel tree in its most unstable state. The way I handle that is by not pulling (git-pull, the equivalent of svn/cvs update) before rc1. Even then, there are times when a particular bug hits your setup, making your new kernel unbootable. At that point you can do one of two things: wait for the release (or for the next release candidate), or look for the bug.
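The "no pulling before rc1" rule can be checked mechanically. git describe prints the nearest tag followed by the number of commits since it and an abbreviated hash, so a tree in the merge window looks like v2.6.26-150-g0123abc, while one past rc1 looks like v2.6.27-rc1-14-g0123abc. Here is a minimal sketch; the function name is hypothetical and the version strings are made-up examples.

```shell
# is_past_rc1 DESCRIBE_OUTPUT
# Succeeds if the nearest tag is a release candidate (contains -rcN),
# i.e. the merge window for this cycle has already closed.
is_past_rc1() {
    case "$1" in
        *-rc[0-9]*) return 0 ;;
        *)          return 1 ;;
    esac
}
```

One possible use, assuming the clone tracks Linus' tree as origin/master:

$ git fetch

$ is_past_rc1 "$(git describe origin/master)" && git pull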

git gives you a very powerful (albeit time-consuming) way of finding the bug, or more exactly, the commit that broke your build. The assumption is that the tree booted at some earlier point, but one commit since then broke it. A simple binary search (time-consuming, since each test means compiling and installing a new kernel) gives you the commit that broke your build: the tree built and booted before that particular commit, and fails from it onward. This is called a bisect.

Let's say that you were running kernel 2.6.25 and you wanted to try 2.6.26-rc1. You clone Linus' tree and you compile, install and try to boot the new kernel, only to be greeted by some obscure error message or just an incomplete boot sequence. You decide you want to hunt down that bug. What do you do? You go to your kernel tree and do (taken from the git-bisect manpage):

$ git-bisect start

$ git-bisect bad

This means the current version (2.6.26-rc1 in our example) is bad, i.e. it does not boot. In other situations "bad" could stand for any other problem with the kernel.

$ git-bisect good v2.6.25

2.6.25 was the last version tested that was good, i.e. booted correctly (in our example). git will pick the commit halfway between the two versions and arrange the tree to reflect that commit. Now you need to compile, install, reboot and test the new kernel. If it works, you can do git-bisect good from the working kernel. If it doesn't, you need to boot another kernel and do git-bisect bad. Either way, the binary search continues (obviously in a different direction in each case). In a logarithmic number of steps you will find the first commit that makes your kernel unbootable.
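To see why the number of steps is logarithmic, here is a toy version of the search. It is not how git implements bisect (real history is a graph, not a numbered line of commits); it assumes the commits are numbered in order, that the first one is known good and the last one is known bad. The function names are hypothetical.

```shell
# find_first_bad GOOD BAD IS_GOOD
# GOOD and BAD are commit numbers (GOOD known to work, BAD known to fail).
# IS_GOOD is a command that succeeds when the given commit works.
# Prints the number of the first bad commit and how many tests were needed.
find_first_bad() {
    good=$1
    bad=$2
    is_good=$3
    steps=0
    while [ $((bad - good)) -gt 1 ]; do
        mid=$(( (good + bad) / 2 ))
        steps=$((steps + 1))
        if "$is_good" "$mid"; then
            good=$mid    # the breakage is somewhere after mid
        else
            bad=$mid     # mid is already broken
        fi
    done
    echo "$bad $steps"
}

# Stand-in for "compile, install, reboot and see": pretend that
# commit number 700 introduced the problem.
boots() {
    [ "$1" -lt 700 ]
}
```

Running find_first_bad 0 1000 boots narrows a thousand commits down to the culprit in about ten tests, which is why bisecting a whole merge window is slow but feasible.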

Usually, at least for most people, this commit serves more as a hint of where the problem is than as a pointer to the bug itself. A kernel developer might start debugging that commit. I usually look at the configuration options associated with that commit to see if I can work around the bug (in the best case, by disabling the feature). In any case, you have a clearer idea of where the problem is coming from, and that can lead to more information if you search the web.

Some personal experience: the 2.6.26-rc series failed to boot on my laptop due to a SATA-related change that was introduced in the 2.6.26 merge window. By bisecting, I isolated the problem to a specific SATA feature that was changed in that merge window. Disabling the feature made my kernel boot as before =).
