Daemonic Dispatches - November 2017

FreeBSD/EC2 on C5 instances

Last week, Amazon released the "C5" family of EC2 instances, continuing their trend of improving performance by both providing better hardware and reducing the overhead associated with virtualization. Due to the significant changes in this new instance family, Amazon gave me advance notice of their impending arrival several months ago, and starting in August I had access to (early versions of) these instances so that I could test FreeBSD on them. Unfortunately the final launch date took me slightly by surprise — I was expecting it to be later in the month — so there are still a few kinks which need to be worked out for FreeBSD to run smoothly on C5 instances. I strongly recommend that you read the rest of this blog post before you use FreeBSD on EC2 C5 instances. (Or possibly skip to the end if you're not interested in learning about any of the underlying details.)

Ever since the first EC2 instances launched — the ones which were retrospectively named "m1.small" — Amazon has relied on the Xen hypervisor. No longer: C5 instances use KVM. This sounds like it would be a problem, but in fact that change didn't bother FreeBSD at all: Now that everything uses hardware-based paging virtualization, the core of the FreeBSD kernel barely noticed the change. (This would have been a much larger problem if FreeBSD/EC2 images were using Xen paravirtualized paging, but EC2 has provided hardware virtualization in all of their new instance types since October 2012.) As usual, it's the drivers which have caused problems for FreeBSD.

Under Xen, EC2 was able to provide FreeBSD with Xen "paravirtualized" devices: A privileged virtual machine within each physical EC2 host had access to the physical disks and network, and FreeBSD would interact with it via the Xen "netfront/netback" and "blkfront/blkback" drivers. There were a lot of tricks used to eke out every last scrap of performance, but this had an inevitable performance cost: Every network packet or disk I/O would need to be handled not just by the FreeBSD kernel but also by the Linux kernel running in the "Dom0" domain. Starting a few years ago, Amazon offered "Enhanced Networking" where an EC2 instance could talk directly to network adapter hardware — first with Intel 10GbE network interfaces, but later with Amazon's custom-designed "Elastic Network Adapter" hardware; FreeBSD gained support for the ENA network interface in FreeBSD 11.1, thanks to Amazon taking the step of proactively looking for (and paying) someone to port their Linux driver. Until very recently, there was no similar "pass-through" for disks; from the original m1.small in August 2006 until the I3 family arrived in February 2017, disks always showed up as Xen block devices. With the I3 "high I/O" family, ephemeral disks were exposed as directly accessible NVMe devices for the first time — but EBS volumes were still exposed as Xen devices block devices.

I had the first hint that Amazon was going to be doing something interesting when I was asked if FreeBSD would boot if its root disk was NVMe instead of being a Xen block device. As I recall it, my answer was as follows:

"Yeah, it should work just fine; FreeBSD supports NVMe disks, so it will taste the disk, read the GPT labels, and boot from the one marked as rootfs. I never hard-coded the device name of the boot disk anywhere.
But wait, how is this going to work? AMI boot disks are EBS volumes. You can't be copying the disk image to a local NVMe disk before booting; that would take too long, and changes would be orphaned if the node failed. YOU'RE BUILDING A HARDWARE FRONT-END TO EBS? You guys are insane! Even rolling out a software update to EBS must be a nightmare at your scale, and now you want to add the headache of dealing with hardware on top of that?

Well, apparently Amazon has some engineers who are both very brave and extremely talented: EBS volumes show up on EC2 instances as "NVMe" hardware.

Of course, EBS volumes aren't NVMe devices — and herein lies the problem. You can attach and detach EBS volumes from a running EC2 instance with a single API call (or a few mouse clicks if you prefer a GUI), and I doubt anyone has ever tested hotplug and hotunplug of physical NVMe disks on a FreeBSD system. Moreover, I'm absolutely certain that nobody has ever tested hotplugging and hotunplugging physical NVMe disks from a Legacy PCI bus which is hanging off an Intel 440FX chipset — which is what a C5 instance looks like! Unsurprisingly, with untested code paths came new and interesting bugs.

The first problem I ran into is that when I attached or detached EBS volumes, nothing happened. FreeBSD isn't expecting new devices to appear and disappear on the PCI bus! It turns out that EC2 is sending an ACPI notification about hotplug events, and when I compiled a FreeBSD kernel with ACPI debugging enabled, it was clear what was happening:

kernel: evmisc-0267 EvQueueNotifyRequest : No notify handler for Notify, ignoring (S1F_, 1) node 0xfffff80007832400

FreeBSD wasn't listening for the ACPI hotplug notifications, so they were simply getting dropped on the floor. Fortunately the FreeBSD project includes smarter people than me, and John Baldwin pointed out that we have a tool for this: devctl rescan pci0 prompts FreeBSD to rescan the PCI bus and detect any changes in the hardware. Attaching an EBS volume and running this command makes the new disk promptly appear, exactly as expected.

Unfortunately the detach case doesn't work quite so well. When I removed an EBS volume and ran devctl rescan pci0, rather than FreeBSD removing the disk, I got a kernel panic. It turned out that the FreeBSD NVMe driver was marking its device nodes as "eternal" (which allows for some locking optimizations) and you're not allowed to remove such device nodes. OK, get rid of that flag and recompile and try again and... another kernel panic. Turns out that some (completely untested) teardown code was freeing a structure and then calling a function pointer stored within it; without kernel debugging enabled this might have worked, but as it was, it turned out that calling 0xdeadc0dedeadc0de is not a very good idea. OK, fix the order of the teardown code, recompile, try again... and FreeBSD didn't panic when I instructed it to rescan the PCI bus and detect that the NVMe disk went away. But it didn't remove all of its device nodes either, and as soon as anything touched the orphan device node, I got another kernel panic. Apparently nobody ever got around to finishing the NVMe device removal code.

So the situation is as follows:

FreeBSD versions prior to FreeBSD 11.1 will not run on C5, because they lack support for the ENA networking hardware — on Xen-based EC2 instances, earlier FreeBSD versions can get virtualized Xen networking, but of course that's not available on C5.
FreeBSD 11.1 and HEAD (as of mid-November 2017) will boot and run just fine on C5 as long as you never attach or detach EBS volumes.
If you attach or detach an EBS volume and then reboot, you'll see the devices you expect.
If you attach an EBS volume and run devctl rescan pci0, you'll see the new volume.
If you detach an EBS volume and run devctl rescan pci0, you will either get an immediate kernel panic or be left with a device node which causes a kernel panic as soon as it is touched.
In FreeBSD 11.2 and later, everything should Just Work.

That last bit, which depends on fixing the NVMe driver, is currently being worked on by Warner Losh (not to be confused with Warren Lash, who is a completely unrelated character in Michael Lucas' git commit murder). It also depends on someone figuring out how to catch the ACPI events in question, but that's more of a question of finding the best way rather than finding a way: In the worst case, I could ignore the ACPI events completely and ship 11.2 AMIs with a daemon which runs a bus rescan every few seconds.

Thanks to Matthew Wilson, David Duncan, John Baldwin, and Warner Losh for their help with figuring things out here.

Posted at 2017-11-17 01:45 | Permanent link | Comments