How to port your OS to EC2

I've been the maintainer of the FreeBSD/EC2 platform for about 7.5 years now, and as far as "running things in virtual machines" goes, that remains the only operating system and the only cloud which I work on. That said, from time to time I get questions from people who want to port other operating systems into EC2, and being a member of the open source community, I do my best to help them. I realized a few days ago that rather than replying to emails one by one it would be more efficient to post something publicly; so — for the benefit of the dozen or so people who want to port operating systems to run in EC2, and the curiosity of maybe a thousand more people who use EC2 but will never build AMIs themselves — here's a rough guide to building EC2 images.

Prerequisites

Before we can talk about building images, there are some things you need:

Building a disk image

The first step to building an EC2 AMI is to build a disk image. This needs to be a "live" disk image, not an installer image; but if you have a "live USB disk" image, that's almost certainly going to be the place to start. EC2 instances boot with a virtual BIOS so a disk image which can boot from a USB stick is almost certainly going to boot — at least as far as the boot loader — in EC2.

You're going to want to make some changes to what goes into that disk image later, but for now just build a disk image.

Building an AMI

I wrote a simple tool for converting disk images into EC2 instances: bsdec2-image-upload. It uploads a disk image to Amazon S3; makes an API call to import that disk image into an EBS volume; creates a snapshot of that volume; then registers an EC2 AMI using that snapshot.

To use bsdec2-image-upload, you'll first need to create an S3 bucket for it to use as a staging area. You can call it anything you like, but I recommend that you

You'll also need to create an AWS key file in the format which bsdec2-image-upload expects:

ACCESS_KEY_ID=...
ACCESS_KEY_SECRET=...

Having done that, you can invoke bsdec2-image-upload:

# bsdec2-image-upload disk.img "AMI Name" "AMI Description" aws-region S3bucket awskeys

There are three additional options you can specify:

After it uploads the image and registers the AMI, bsdec2-image-upload will print the AMI IDs for the relevant region(s). (Either for every region, or just for the single region where you uploaded it.)

Go ahead and create an AMI now, and try launching it.

Boot configuration

Odds are that your instance started booting and got as far as the boot loader launching the kernel, but at some point after that things went sideways. Now we start the iterative process of building disk images, turning them into AMIs, launching said AMIs, and seeing where they break. Some things you'll probably run into here:

At this point, you should be able to launch an EC2 instance, get console output showing that it booted, and connect to the SSH daemon. (Remember to allow incoming connections on port 22 when you launch the EC2 instance!)

EC2 configuration

Now it's time to make the AMI behave like an EC2 instance. To this end, I prepared a set of rc.d scripts for FreeBSD. Most importantly, they If your OS has an rc system derived from NetBSD's rc.d, you may be able to use these scripts without any changes by simply installing them and enabling them in /etc/rc.conf; otherwise you may need to write your own scripts using mine as a model.

Firstboot scripts

A feature I added to FreeBSD a few years ago is the concept of "firstboot" scripts: These startup scripts are only run the first time a system boots. The aforementioned configinit and SSH key fetching scripts are flagged this way — so if your OS doesn't support the "firstboot" keyword on rc.d scripts you'll need to hack around that — but EC2 instances also ship with other scripts set to run on the first boot:

While none of these are strictly necessary, I find them to be extremely useful and highly recommend implementing similar functionality in your systems.

Support my work!

I hope you find this useful, or at very least interesting. Please consider supporting my work in this area; while I'm happy to contribute my time to supporting open source software, it would be nice if I had money coming in which I could use to cover incidental expenses (e.g., conference travel) so that I didn't end up paying to contribute to FreeBSD.

Posted at 2018-07-14 06:30 | Permanent link | Comments

Tarsnap pricing change

I launched the current Tarsnap website in 2009, and while we've made some minor adjustments to it over the years — e.g., adding a page of testimonials, adding much more documentation, and adding a page with .deb binary packages — the changes have overall been relatively modest. One thing people criticized the design for in 2009 was the fact that prices were quoted in picodollars; this is something I have insisted on retaining for the past eight years.

One of the harshest critics of Tarsnap's flat rate picodollars-per-byte pricing model is Patrick McKenzie — known to much of the Internet as "patio11" — who despite our frequent debates can take credit for ten times more new Tarsnap customers than anyone else, thanks to a single ten thousand word blog post about Tarsnap. The topic of picodollars has become something of an ongoing debate between us, with Patrick insisting that they communicate a fundamental lack of seriousness and sabotage Tarsnap's success as a business, and me insisting that using they communicate exactly what I want to communicate, and attract precisely the customer base I want to have. In spite of our disagreements, however, I really do value Patrick's input; indeed, the changes I mentioned above came about in large part due to the advice I received from him, and for a long time I've been considering following more of Patrick's advice.

A few weeks ago, I gave a talk at the AsiaBSDCon conference about profiling the FreeBSD kernel boot. (I'll be repeating the talk at BSDCan if anyone is interested in seeing it in person.) Since this was my first time in Tokyo (indeed, my first time anywhere in Asia) and despite communicating with him frequently I had never met Patrick in person, I thought it was only appropriate to meet him for dinner; fortunately the scheduling worked out and there was an evening when he was free and I wasn't suffering too much from jetlag. After dinner, Patrick told me about a cron job he runs:

I knew then that the time was coming to make a change Patrick has long awaited: Getting rid of picodollars. It took a few weeks before the right moment arrived, but I'm proud to announce that as of today, April 1st 2018, Tarsnap's storage pricing is 8333333 attodollars per byte-day.

This addresses a long-standing concern I've had about Tarsnap's pricing: Tarsnap bills customers for usage on a daily basis, but since 250 picodollars is not a multiple of 30, usage bills have been rounded. Tarsnap's accounting code works with attodollars internally (Why attodollars? Because it's easy to have 18 decimal places of precision using fixed-point arithmetic with a 64-bit "attodollars" part.) and so during 30-day months I have in fact been rounding down and billing customers at a rate of 8333333 attodollars per byte-day for years — so making this change on the Tarsnap website brings it in line with the reality of the billing system.

Of course, there are other advantages to advertising Tarsnap's pricing in attodollars. Everything which was communicated by pricing storage in picodollars per byte-month is communicated even more effectively by advertising prices in attodollars per byte-day, and I have no doubt that Tarsnap users will appreciate the increased precision.

Posted at 2018-04-01 00:00 | Permanent link | Comments

FreeBSD/EC2 history

A couple years ago Jeff Barr published a blog post with a timeline of EC2 instances. I thought at the time that I should write up a timeline of the FreeBSD/EC2 platform, but I didn't get around to it; but last week, as I prepared to ask for sponsorship for my work I decided that it was time to sit down and collect together the long history of how the platform has evolved and improved over the years.

Normally I don't edit blog posts after publishing them (with the exception of occasional typographical corrections), but I do plan on keeping this post up to date with future developments.

The current status

The upcoming FreeBSD release (11.2) supports: IPv6, Enhanced Networking (both generations), Amazon Elastic File System, Amazon Time Sync Service, both consoles (Serial + VGA), and every EC2 instance type (although I'm not sure if FreeBSD has drivers to make use of the FPGA or GPU hardware on those instances).

When a FreeBSD/EC2 instance first launches, it uses configinit to perform any desired configuration based on user-data scripts, and then (unless configinit is used to change this) resizes its root filesystem to fit the provided root disk, downloads and installs critical updates, sets up the ec2-user user for SSH access, and prints SSH host keys to the consoles.

If there's something else you think FreeBSD should support or a change you'd like to see to the default configuration, please let me know.

Posted at 2018-02-12 19:50 | Permanent link | Comments

Some thoughts on Spectre and Meltdown

By now I imagine that all of my regular readers, and a large proportion of the rest of the world, have heard of the security issues dubbed "Spectre" and "Meltdown". While there have been some excellent technical explanations of these issues from several sources — I particularly recommend the Project Zero blog post — I have yet to see anyone really put these into a broader perspective; nor have I seen anyone make a serious attempt to explain these at a level suited for a wide audience. While I have not been involved with handling these issues directly, I think it's time for me to step up and provide both a wider context and a more broadly understandable explanation.

The story of these attacks starts in late 2004. I had submitted my doctoral thesis and had a few months before flying back to Oxford for my defense, so I turned to some light reading: Intel's latest "Optimization Manual", full of tips on how to write faster code. (Eking out every last nanosecond of performance has long been an interest of mine.) Here I found an interesting piece of advice: On Intel CPUs with "Hyper-Threading", a common design choice (aligning the top of thread stacks on page boundaries) should be avoided because it would result in some resources being overused and others being underused, with a resulting drop in performance. This started me thinking: If two programs can hurt each others' performance by accident, one should be able to measure whether its performance is being hurt by the other; if it can measure whether its performance is being hurt by people not following Intel's optimization guidelines, it should be able to measure whether its performance is being hurt by other patterns of resource usage; and if it can measure that, it should be able to make deductions about what the other program is doing.

It took me a few days to convince myself that information could be stolen in this manner, but within a few weeks I was able to steal an RSA private key from OpenSSL. Then started the lengthy process of quietly notifying Intel and all the major operating system vendors; and on Friday the 13th of May 2005 I presented my paper describing this new attack at BSDCan 2005 — the first attack of this type exploiting how a running program causes changes to the microarchitectural state of a CPU. Three months later, the team of Osvik, Shamir, and Tromer published their work, which showed how the same problem could be exploited to steal AES keys. (Note that there were side channel attacks discovered over the preceding years which relied on microarchitectural details; but in those cases information was being revealed by the time taken by the cryptographic operation in question. My work was the first to demonstrate that information could leak from a process into the microarchitectural state and then be extracted from there by another process.)

Over the following years there have been many attacks which expoit different aspects of CPU design — exploiting L1 data cache collisions, exploiting L1 code cache collisions, exploiting L2 cache collisions, exploiting the TLB, exploiting branch prediction, etc. — but they have all followed the same basic mechanism: A program does something which interacts with the internal state of a CPU, and either we can measure that internal state (the more common case) or we can set up that internal state before the program runs in a way which makes the program faster or slower. These new attacks use the same basic mechanism, but exploit an entirely new angle. But before I go into details, let me go back to basics for a moment.

Understanding the attacks

These attacks exploit something called a "side channel". What's a side channel? It's when information is revealed as an inadvertant side effect of what you're doing. For example, in the movie 2001, Bowman and Poole enter a pod to ensure that the HAL 9000 computer cannot hear their conversation — but fail to block the optical channel which allows Hal to read their lips. Side channels are related to a concept called "covert channels": Where side channels are about stealing information which was not intended to be conveyed, covert channels are about conveying information which someone is trying to prevent you from sending. The famous case of a Prisoner of War blinking the word "TORTURE" in Morse code is an example of using a covert channel to convey information.

Another example of a side channel — and I'll be elaborating on this example later, so please bear with me if it seems odd — is as follows: I want to know when my girlfriend's passport expires, but she won't show me her passport (she complains that it has a horrible photo) and refuses to tell me the expiry date. I tell her that I'm going to take her to Europe on vacation in August and watch what happens: If she runs out to renew her passport, I know that it will expire before August; while if she doesn't get her passport renewed, I know that it will remain valid beyond that date. Her desire to ensure that her passport would be valid inadvertantly revealed to me some information: Whether its expiry date was before or after August.

Over the past 12 years, people have gotten reasonably good at writing programs which avoid leaking information via side channels; but as the saying goes, if you make something idiot-proof, the world will come up with a better idiot; in this case, the better idiot is newer and faster CPUs. The Spectre and Meltdown attacks make use of something called "speculative execution". This is a mechanism whereby, if a CPU isn't sure what you want it to do next, it will speculatively perform some action. The idea here is that if it guessed right, it will save time later — and if it guessed wrong, it can throw away the work it did and go back to doing what you asked for. As long as it sometimes guesses right, this saves time compared to waiting until it's absolutely certain about what it should be doing next. Unfortunately, as several researchers recently discovered, it can accidentally leak some information during this speculative execution.

Going back to my analogy: I tell my girlfriend that I'm going to take her on vacation in June, but I don't tell her where yet; however, she knows that it will either be somewhere within Canada (for which she doesn't need a passport, since we live in Vancouver) or somewhere in Europe. She knows that it takes time to get a passport renewed, so she checks her passport and (if it was about to expire) gets it renewed just in case I later reveal that I'm going to take her to Europe. If I tell her later that I'm only taking her to Ottawa — well, she didn't need to renew her passport after all, but in the mean time her behaviour has already revealed to me whether her passport was about to expire. This is what Google refers to "variant 1" of the Spectre vulnerability: Even though she didn't need her passport, she made sure it was still valid just in case she was going to need it.

"Variant 2" of the Spectre vulnerability also relies on speculative execution but in a more subtle way. Here, instead of the CPU knowing that there are two possible execution paths and choosing one (or potentially both!) to speculatively execute, the CPU has no idea what code it will need to execute next. However, it has been keeping track and knows what it did the last few times it was in the same position, and it makes a guess — after all, there's no harm in guessing since if it guesses wrong it can just throw away the unneeded work. Continuing our analogy, a "Spectre version 2" attack on my girlfriend would be as follows: I spend a week talking about how Oxford is a wonderful place to visit and I really enjoyed the years I spent there, and then I tell her that I want to take her on vacation. She very reasonably assumes that — since I've been talking about Oxford so much — I must be planning on taking her to England, and runs off to check her passport and potentially renew it... but in fact I tricked her and I'm only planning on taking her to Ottawa.

This "version 2" attack is far more powerful than "version 1" because it can be used to exploit side channels present in many different locations; but it is also much harder to exploit and depends intimately on details of CPU design, since the attacker needs to make the CPU guess the correct (wrong) location to anticipate that it will be visiting next.

Now we get to the third attack, dubbed "Meltdown". This one is a bit weird, so I'm going to start with the analogy here: I tell my girlfriend that I want to take her to the Korean peninsula. She knows that her passport is valid for long enough; but she immediately runs off to check that her North Korean visa hasn't expired. Why does she have a North Korean visa, you ask? Good question. She doesn't — but she runs off to check its expiry date anyway! Because she doesn't have a North Korean visa, she (somehow) checks the expiry date on someone else's North Korean visa, and then (if it is about to expire) runs out to renew it — and so by telling her that I want to take her to Korea for a vacation I find out something she couldn't have told me even if she wanted to. If this sounds like we're falling down a Dodgsonian rabbit hole... well, we are. The most common reaction I've heard from security people about this is "Intel CPUs are doing what???", and it's not by coincidence that one of the names suggested for an early Linux patch was Forcefully Unmap Complete Kernel With Interrupt Trampolines (FUCKWIT). (For the technically-inclined: Intel CPUs continue speculative execution through faults, so the fact that a page of memory cannot be accessed does not prevent it from, well, being accessed.)

How users can protect themselves

So that's what these vulnerabilities are all about; but what can regular users do to protect themselves? To start with, apply the damn patches. For the next few months there are going to be patches to operating systems; patches to individual applications; patches to phones; patches to routers; patches to smart televisions... if you see a notification saying "there are updates which need to be installed", install the updates. (However, this doesn't mean that you should be stupid: If you get an email saying "click here to update your system", it's probably malware.) These attacks are complicated, and need to be fixed in many ways in many different places, so each individual piece of software may have many patches as the authors work their way through from fixing the most easily exploited vulnerabilities to the more obscure theoretical weaknesses.

What else can you do? Understand the implications of these vulnerabilities. Intel caught some undeserved flak for stating that they believe "these exploits do not have the potential to corrupt, modify or delete data"; in fact, they're quite correct in a direct sense, and this distinction is very relevant. A side channel attack inherently reveals information, but it does not by itself allow someone to take control of a system. (In some cases side channels may make it easier to take advantage of other bugs, however.) As such, it's important to consider what information could be revealed: Even if you're not working on top secret plans for responding to a ballistic missile attack, you've probably accessed password-protected websites (Facebook, Twitter, Gmail, perhaps your online banking...) and possibly entered your credit card details somewhere today. Those passwords and credit card numbers are what you should worry about.

Now, in order for you to be attacked, some code needs to run on your computer. The most likely vector for such an attack is through a website — and the more shady the website the more likely you'll be attacked. (Why? Because if the owners of a website are already doing something which is illegal — say, selling fake prescription drugs — they're far more likely to agree if someone offers to pay them to add some "harmless" extra code to their site.) You're not likely to get attacked by visiting your bank's website; but if you make a practice of visiting the less reputable parts of the World Wide Web, it's probably best to not log in to your bank's website at the same time. Remember, this attack won't allow someone to take over your computer — all they can do is get access to information which is in your computer's memory at the time they carry out the attack.

For greater paranoia, avoid accessing suspicious websites after you handle any sensitive information (including accessing password-protected websites or entering your credit card details). It's possible for this information to linger in your computer's memory even after it isn't needed — it will stay there until it's overwritten, usually because the memory is needed for something else — so if you want to be safe you should reboot your computer in between.

For maximum paranoia: Don't connect to the internet from systems you care about. In the industry we refer to "airgapped" systems; this is a reference back to the days when connecting to a network required wires, so if there was a literal gap with just air between two systems, there was no way they could communicate. These days, with ubiquitous wifi (and in many devices, access to mobile phone networks) the terminology is in need of updating; but if you place devices into "airplane" mode it's unlikely that they'll be at any risk. Mind you, they won't be nearly as useful — there's almost always a tradeoff between security and usability, but if you're handling something really sensitive, you may want to consider this option. (For my Tarsnap online backup service I compile and cryptographically sign the packages on a system which has never been connected to the Internet. Before I turned it on for the first time, I opened up the case and pulled out the wifi card; and I copy files on and off the system on a USB stick. Tarsnap's slogan, by the way, is "Online backups for the truly paranoid".)

How developers can protect everyone

The patches being developed and distributed by operating systems — including microcode updates from Intel — will help a lot, but there are still steps individual developers can take to reduce the risk of their code being exploited.

First, practice good "cryptographic hygiene": Information which isn't in memory can't be stolen this way. If you have a set of cryptographic keys, load only the keys you need for the operations you will be performing. If you take a password, use it as quickly as possible and then immediately wipe it from memory. This isn't always possible, especially if you're using a high level language which doesn't give you access to low level details of pointers and memory allocation; but there's at least a chance that it will help.

Second, offload sensitive operations — especially cryptographic operations — to other processes. The security community has become more aware of privilege separation over the past two decades; but we need to go further than this, to separation of information — even if two processes need exactly the same operating system permissions, it can be valuable to keep them separate in order to avoid information from one process leaking via a side channel attack against the other.

One common design paradigm I've seen recently is to "TLS all the things", with a wide range of applications gaining understanding of the TLS protocol layer. This is something I've objected to in the past as it results in unnecessary exposure of applications to vulnerabilities in the TLS stacks they use; side channel attacks provide another reason, namely the unnecessary exposure of the TLS stack to side channels in the application. If you want to add TLS to your application, don't add it to the application itself; rather, use a separate process to wrap and unwrap connections with TLS, and have your application take unencrypted connections over a local (unix) socket or a loopback TCP/IP connection.

Separating code into multiple processes isn't always practical, however, for reasons of both performance and practical matters of code design. I've been considering (since long before these issues became public) another form of mitigation: Userland page unmapping. In many cases programs have data structures which are "private" to a small number of source files; for example, a random number generator will have internal state which is only accessed from within a single file (with appropriate functions for inputting entropy and outputting random numbers), and a hash table library would have a data structure which is allocated, modified, accessed, and finally freed only by that library via appropriate accessor functions. If these memory allocations can be corralled into a subset of the system address space, and the pages in question only mapped upon entering those specific routines, it could dramatically reduce the risk of information being revealed as a result of vulnerabilities which — like these side channel attacks — are limited to leaking information but cannot be (directly) used to execute arbitrary code.

Finally, developers need to get better at providing patches: Not just to get patches out promptly, but also to get them into users' hands and to convince users to install them. That last part requires building up trust; as I wrote last year, one of the worst problems facing the industry is the mixing of security and non-security updates. If users are worried that they'll lose features (or gain "features" they don't want), they won't install the updates you recommend; it's essential to give users the option of getting security patches without worrying about whether anything else they rely upon will change.

What's next?

So far we've seen three attacks demonstrated: Two variants of Spectre and one form of Meltdown. Get ready to see more over the coming months and years. Off the top of my head, there are four vulnerability classes I expect to see demonstrated before long:

Final thoughts on vulnerability disclosure

The way these issues were handled was a mess; frankly, I expected better of Google, I expected better of Intel, and I expected better of the Linux community. When I found that Hyper-Threading was easily exploitable, I spent five months notifying the security community and preparing everyone for my announcement of the vulnerability; but when the embargo ended at midnight UTC and FreeBSD published its advisory a few minutes later, the broader world was taken entirely by surprise. Nobody knew what was coming aside from the people who needed to know; and the people who needed to know had months of warning.

Contrast that with what happened this time around. Google discovered a problem and reported it to Intel, AMD, and ARM on June 1st. Did they then go around contacting all of the operating systems which would need to work on fixes for this? Not even close. FreeBSD was notified the week before Christmas, over six months after the vulnerabilities were discovered. Now, FreeBSD can occasionally respond very quickly to security vulnerabilities, even when they arise at inconvenient times — on November 30th 2009 a vulnerability was reported at 22:12 UTC, and on December 1st I provided a patch at 01:20 UTC, barely over 3 hours later — but that was an extremely simple bug which needed only a few lines of code to fix; the Spectre and Meltdown issues are orders of magnitude more complex.

To make things worse, the Linux community was notified and couldn't keep their mouths shut. Standard practice for multi-vendor advisories like this is that an embargo date is set, and nobody does anything publicly prior to that date. People don't publish advisories; they don't commit patches into their public source code repositories; and they definitely don't engage in arguments on public mailing lists about whether the patches are needed for different CPUs. As a result, despite an embargo date being set for January 9th, by January 4th anyone who cared knew about the issues and there was code being passed around on Twitter for exploiting them.

This is not the first time I've seen people get sloppy with embargoes recently, but it's by far the worst case. As an industry we pride ourselves on the concept of responsible disclosure — ensuring that people are notified in time to prepare fixes before an issue is disclosed publicly — but in this case there was far too much disclosure and nowhere near enough responsibility. We can do better, and I sincerely hope that next time we do.

Posted at 2018-01-17 02:40 | Permanent link | Comments

FreeBSD/EC2 on C5 instances

Last week, Amazon released the "C5" family of EC2 instances, continuing their trend of improving performance by both providing better hardware and reducing the overhead associated with virtualization. Due to the significant changes in this new instance family, Amazon gave me advance notice of their impending arrival several months ago, and starting in August I had access to (early versions of) these instances so that I could test FreeBSD on them. Unfortunately the final launch date took me slightly by surprise — I was expecting it to be later in the month — so there are still a few kinks which need to be worked out for FreeBSD to run smoothly on C5 instances. I strongly recommend that you read the rest of this blog post before you use FreeBSD on EC2 C5 instances. (Or possibly skip to the end if you're not interested in learning about any of the underlying details.)

Ever since the first EC2 instances launched — the ones which were retrospectively named "m1.small" — Amazon has relied on the Xen hypervisor. No longer: C5 instances use KVM. This sounds like it would be a problem, but in fact that change didn't bother FreeBSD at all: Now that everything uses hardware-based paging virtualization, the core of the FreeBSD kernel barely noticed the change. (This would have been a much larger problem if FreeBSD/EC2 images were using Xen paravirtualized paging, but EC2 has provided hardware virtualization in all of their new instance types since October 2012.) As usual, it's the drivers which have caused problems for FreeBSD.

Under Xen, EC2 was able to provide FreeBSD with Xen "paravirtualized" devices: A privileged virtual machine within each physical EC2 host had access to the physical disks and network, and FreeBSD would interact with it via the Xen "netfront/netback" and "blkfront/blkback" drivers. There were a lot of tricks used to eke out every last scrap of performance, but this had an inevitable performance cost: Every network packet or disk I/O would need to be handled not just by the FreeBSD kernel but also by the Linux kernel running in the "Dom0" domain. Starting a few years ago, Amazon offered "Enhanced Networking" where an EC2 instance could talk directly to network adapter hardware — first with Intel 10GbE network interfaces, but later with Amazon's custom-designed "Elastic Network Adapter" hardware; FreeBSD gained support for the ENA network interface in FreeBSD 11.1, thanks to Amazon taking the step of proactively looking for (and paying) someone to port their Linux driver. Until very recently, there was no similar "pass-through" for disks; from the original m1.small in August 2006 until the I3 family arrived in February 2017, disks always showed up as Xen block devices. With the I3 "high I/O" family, ephemeral disks were exposed as directly accessible NVMe devices for the first time — but EBS volumes were still exposed as Xen devices block devices.

I had the first hint that Amazon was going to be doing something interesting when I was asked if FreeBSD would boot if its root disk was NVMe instead of being a Xen block device. As I recall it, my answer was as follows:

"Yeah, it should work just fine; FreeBSD supports NVMe disks, so it will taste the disk, read the GPT labels, and boot from the one marked as rootfs. I never hard-coded the device name of the boot disk anywhere.

But wait, how is this going to work? AMI boot disks are EBS volumes. You can't be copying the disk image to a local NVMe disk before booting; that would take too long, and changes would be orphaned if the node failed. YOU'RE BUILDING A HARDWARE FRONT-END TO EBS? You guys are insane! Even rolling out a software update to EBS must be a nightmare at your scale, and now you want to add the headache of dealing with hardware on top of that?

Well, apparently Amazon has some engineers who are both very brave and extremely talented: EBS volumes show up on EC2 instances as "NVMe" hardware.

Of course, EBS volumes aren't NVMe devices — and herein lies the problem. You can attach and detach EBS volumes from a running EC2 instance with a single API call (or a few mouse clicks if you prefer a GUI), and I doubt anyone has ever tested hotplug and hotunplug of physical NVMe disks on a FreeBSD system. Moreover, I'm absolutely certain that nobody has ever tested hotplugging and hotunplugging physical NVMe disks from a Legacy PCI bus which is hanging off an Intel 440FX chipset — which is what a C5 instance looks like! Unsurprisingly, with untested code paths came new and interesting bugs.

The first problem I ran into is that when I attached or detached EBS volumes, nothing happened. FreeBSD isn't expecting new devices to appear and disappear on the PCI bus! It turns out that EC2 is sending an ACPI notification about hotplug events, and when I compiled a FreeBSD kernel with ACPI debugging enabled, it was clear what was happening:

kernel: evmisc-0267 EvQueueNotifyRequest : No notify handler for Notify, ignoring (S1F_, 1) node 0xfffff80007832400
FreeBSD wasn't listening for the ACPI hotplug notifications, so they were simply getting dropped on the floor. Fortunately the FreeBSD project includes smarter people than me, and John Baldwin pointed out that we have a tool for this: devctl rescan pci0 prompts FreeBSD to rescan the PCI bus and detect any changes in the hardware. Attaching an EBS volume and running this command makes the new disk promptly appear, exactly as expected.

Unfortunately the detach case doesn't work quite so well. When I removed an EBS volume and ran devctl rescan pci0, rather than FreeBSD removing the disk, I got a kernel panic. It turned out that the FreeBSD NVMe driver was marking its device nodes as "eternal" (which allows for some locking optimizations) and you're not allowed to remove such device nodes. OK, get rid of that flag and recompile and try again and... another kernel panic. Turns out that some (completely untested) teardown code was freeing a structure and then calling a function pointer stored within it; without kernel debugging enabled this might have worked, but as it was, it turned out that calling 0xdeadc0dedeadc0de is not a very good idea. OK, fix the order of the teardown code, recompile, try again... and FreeBSD didn't panic when I instructed it to rescan the PCI bus and detect that the NVMe disk went away. But it didn't remove all of its device nodes either, and as soon as anything touched the orphan device node, I got another kernel panic. Apparently nobody ever got around to finishing the NVMe device removal code.

So the situation is as follows:

  1. FreeBSD versions prior to FreeBSD 11.1 will not run on C5, because they lack support for the ENA networking hardware — on Xen-based EC2 instances, earlier FreeBSD versions can get virtualized Xen networking, but of course that's not available on C5.
  2. FreeBSD 11.1 and HEAD (as of mid-November 2017) will boot and run just fine on C5 as long as you never attach or detach EBS volumes.
  3. If you attach or detach an EBS volume and then reboot, you'll see the devices you expect.
  4. If you attach an EBS volume and run devctl rescan pci0, you'll see the new volume.
  5. If you detach an EBS volume and run devctl rescan pci0, you will either get an immediate kernel panic or be left with a device node which causes a kernel panic as soon as it is touched.
  6. In FreeBSD 11.2 and later, everything should Just Work.

That last bit, which depends on fixing the NVMe driver, is currently being worked on by Warner Losh (not to be confused with Warren Lash, who is a completely unrelated character in Michael Lucas' git commit murder). It also depends on someone figuring out how to catch the ACPI events in question, but that's more of a question of finding the best way rather than finding a way: In the worst case, I could ignore the ACPI events completely and ship 11.2 AMIs with a daemon which runs a bus rescan every few seconds.

Thanks to Matthew Wilson, David Duncan, John Baldwin, and Warner Losh for their help with figuring things out here.

Posted at 2017-11-17 01:45 | Permanent link | Comments

Recent posts

Monthly Archives

Yearly Archives


RSS