Yesterday, I decided to upgrade my Linode-hosted, Xen-virtualized server from Debian 6.0 “Squeeze” to Debian 7.0 “Wheezy”. Usually this process is quite easy: add the new software channels and upgrade the packages. This time, however, I ran into a nasty bug in how mismatched Xen host and guest Linux kernels handle disk I/O write barriers.
After upgrading the software and rebooting, I was greeted with a system that couldn’t remount the root filesystem in read-write mode. Digging around, I saw errors like “ext3_journal_start_sb: Detected aborted journal”. I assumed that the filesystem had become slightly corrupted, as the server had an uptime of 176 days. However, ‘fsck -vf’ didn’t solve the problem; the journal kept reporting problems. So, I used ‘tune2fs -O’ to remove and re-create the journal, with an ‘fsck’ in between; still the problem persisted.
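For anyone retracing the journal rebuild, here is a minimal sketch of the sequence. It only prints the commands instead of running them, since they should be run against an unmounted (or read-only) filesystem, for example from a rescue environment. The function name `print_journal_rebuild` and the device `/dev/xvda` are assumptions for illustration; substitute your own root device.

```shell
#!/bin/sh
# Hypothetical helper: prints the remove / check / re-create journal steps
# for the device given as $1. Nothing is executed against the disk.
print_journal_rebuild() {
    dev="$1"
    echo "tune2fs -O '^has_journal' $dev"   # drop the existing journal
    echo "fsck -vf $dev"                    # force a full check while no journal exists
    echo "tune2fs -O has_journal $dev"      # create a fresh journal
}

# /dev/xvda is an assumed device name.
print_journal_rebuild /dev/xvda
```

Once the printed commands look right for your device, they can be run by hand one at a time. In my case this did not help, because the journal itself was never the real problem.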
Googling around, I found this, which hinted at the fix: disable disk I/O write barriers. It wasn’t clear why, though. As I read more, articles mentioned other errors earlier in the kernel initialization, specifically errors like “blkfront: barrier: empty write xvda op failed” and “blkfront: xvda: barrier or flush: disabled”; I had those entries too. Eventually, I came to articles like this, which explained that there can be an inconsistency in how disk I/O write barriers are handled between the host and guest OS. The quick solution was to disable the feature in my guest OS. The real solution was to upgrade Linode’s host OS…
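A quick way to check whether a guest shows the same symptoms is to filter the kernel log for the blkfront barrier messages quoted above. This is a minimal sketch; the function name `find_barrier_errors` is my own, and it simply wraps `grep`.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and prints only the
# blkfront lines that mention barriers. Exits non-zero when nothing matches.
find_barrier_errors() {
    grep -E 'blkfront:.*barrier'
}

# Typical use on the affected guest (may require root on some systems):
#   dmesg | find_barrier_errors
```

If this prints nothing, the barrier mismatch described here is probably not what you are hitting.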
I wanted to ensure my data was accessible, so I found the flags (‘barrier=0’ for ext3/ext4) and added them to the ‘/etc/fstab’ entries for my filesystems. And lo, the first error about the “aborted journal” disappeared, as did some of the “empty write xvda op failed” entries. My data was accessible, and I could boot up into a functioning system. Success!
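To illustrate the ‘/etc/fstab’ change, here is a rough sketch that appends ‘barrier=0’ to the mount options of every ext3/ext4 entry. The function name `add_barrier_off` is hypothetical, and the script only prints the modified table; it never writes to the file it reads, so the result can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: $1 is the path to an fstab-style file.
# Prints the table with ",barrier=0" appended to ext3/ext4 mount options.
add_barrier_off() {
    awk '
        # Pass comments and lines with fewer than four fields through untouched.
        /^[[:space:]]*#/ || NF < 4 { print; next }
        # Field 3 is the filesystem type, field 4 the mount options.
        ($3 == "ext3" || $3 == "ext4") && $4 !~ /barrier/ {
            $4 = $4 ",barrier=0"   # rewritten lines come out single-space separated
        }
        { print }
    ' "$1"
}

# Example: review the proposed result before editing anything by hand.
#   add_barrier_off /etc/fstab
```

Swap entries are left alone, which matches what I saw: the mount option only applies to the ext3/ext4 filesystems.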
However, there was still an entry about my swap partition having barrier problems, and there was no obvious workaround for it. Additionally, this felt like a workaround for a known problem, and it’s something that could impact efficiency.
Now, I had been conversing with Linode the whole day. Once I got to this point, I forwarded them an article describing how it is a Xen host/guest mismatch problem. Their reply was quite welcome: “Would you like to be migrated to some newer host OSes?” Yes, yes I would!
The new host OS worked, and it didn’t have the errors regarding my swap partition. Additionally, I was able to remove the ‘barrier=0’ options from ‘/etc/fstab’, and my system is working smoothly. Well, this was an unexpectedly happy ending!