A year of funded FreeBSD
I've been maintaining FreeBSD on the Amazon EC2 platform ever since I first got it booting in 2010, but in November 2023 I added to my responsibilities the role of FreeBSD release engineering lead — just in time to announce the availability of FreeBSD 14.0, although Glen Barber did all the release engineering work for that release. While I receive a small amount of funding from Antithesis and from my FreeBSD/EC2 Patreon, it rapidly became clear that my release engineering duties were competing with — in fact, out-competing — FreeBSD/EC2 for my available FreeBSD volunteer hours: In addition to my long list of "features to implement" stagnating, I had increasingly been saying "huh, that's weird... oh well, no time to investigate that now". In short, by early 2024 I was becoming increasingly concerned that I was not in a position to be a good "owner" of the FreeBSD/EC2 platform.

For several years leading up to this point I had been talking to Amazonians on and off about the possibility of Amazon sponsoring my FreeBSD/EC2 work; rather predictably, most of those conversations ended up with my contacts at Amazon saying some variation of "Amazon should definitely sponsor the work you're doing... but I don't have any money available in my budget for this". Finally, in April 2024, I found someone with a budget, and after some discussions around timeline, scope, and process, it was determined that Amazon would support me for a year via GitHub Sponsors. I'm not entirely sure if the year in question was June through May or July through June — money had to move within Amazon, from Amazon to GitHub, from GitHub to Stripe, and finally from Stripe into my bank account, so when I received money doesn't necessarily reflect when Amazon intended to give me money — but either way the sponsorship either has come to an end or is coming to an end soon, so I figured now was a good time to write about what I've done.
Amazon was nominally sponsoring me for 40 hours/month of work on FreeBSD release engineering and FreeBSD/EC2 development — I made it clear to them that sponsoring one and not the other wasn't really feasible, especially given dependencies between the two — and asked me to track how much time I was spending on things. In the end, I spent roughly 50 hours/month on this work, averaging 20 hours/month spent on EC2-specific issues, 20 hours/month making FreeBSD releases happen, and 10 hours/month on other release engineering related work — although the exact breakdown varied dramatically from month to month.
Following FreeBSD's quarterly release schedule (which I announced in July 2024, but put together and presented at the FreeBSD developer summit at BSDCan in May 2024), I managed four FreeBSD releases during the past year: FreeBSD 13.4, in September 2024; FreeBSD 14.2, in December 2024; FreeBSD 13.5, in March 2025; and FreeBSD 14.3, currently scheduled for release on June 10th. The work involved in managing each of these releases — nagging developers to get their code into the tree in time, approving (or disapproving!) merge requests, coordinating with other teams, building and testing images (usually three Betas, one Release Candidate, and the final Release), writing announcement text, and fixing any release-building breakage which arose along the way — mostly happened in the month prior to the release (I refer to the second month of each calendar quarter as "Beta Month") and ranged from a low of 33.5 hours (for FreeBSD 13.5) to a high of 79 hours (for FreeBSD 14.2). As one might imagine, the later in a stable branch you get, the fewer things break and the less work a release requires; while I wasn't tracking hours when I managed FreeBSD 14.1, I suspect it took close to 100 hours of release engineering time, and FreeBSD 15.0 is very likely to be well over that.
On the FreeBSD/EC2 side of things, there were two major features which Amazon encouraged me to prioritize: The "power driver" for AWS Graviton instances (aka "how the EC2 API tells the OS to shut down" — without this, FreeBSD ignores the shutdown signal and a few minutes later EC2 times out and yanks the virtual power cable), and device hotplug on AWS Graviton instances. The first of these was straightforward: On Graviton systems, the "power button" is a GPIO pin, the details of which are specified via an ACPI _AEI object. I added code to find those in ACPI and pass the appropriate configuration through to the driver for the PL061 GPIO controller; when the GPIO pin is asserted, the controller generates an interrupt which causes the ACPI "power button" event to be triggered, which in turn now shuts down the system. There was one minor hiccup: The ACPI tables provided by EC2 specify that the GPIO pin in question should be configured as a "Pull Up" pin, but the PL061 controller in fact doesn't have any pullup/pulldown resistors; this didn't cause problems on Linux because Linux silently ignores GPIO configuration failures, but on FreeBSD we disabled the device after failing to configure it. I believe this EC2 bug will be fixed in future Graviton systems; but in the mean time I ship FreeBSD/EC2 AMIs with a new "quirk": ACPI_Q_AEI_NOPULL, aka "Ignore the PullUp flag on GPIO pin specifications in _AEI objects".
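For the curious, here is a minimal standalone sketch of the quirk's effect. The real logic lives in FreeBSD's ACPI/GPIO glue rather than in a little program like this, and the constant values and helper function are illustrative assumptions; but the decision it encodes (ignore the firmware's PullUp request when the quirk is set) is the one described above.

```c
/*
 * Minimal sketch (not the real kernel code): honour an
 * ACPI_Q_AEI_NOPULL-style quirk when deciding which pull configuration
 * to program for the _AEI "power button" GPIO pin.  The constant values
 * and the quirk bit's position are assumptions for illustration.
 */
#include <stdio.h>
#include <stdint.h>

/* ACPICA-style pin configuration values. */
#define ACPI_PIN_CONFIG_DEFAULT  0
#define ACPI_PIN_CONFIG_PULLUP   1

/* Hypothetical quirk bit, alongside the existing ACPI_Q_* flags. */
#define ACPI_Q_AEI_NOPULL        (1 << 4)

static uint8_t
aei_pin_config(uint8_t fw_config, int quirks)
{
	/*
	 * EC2's Graviton firmware asks for PullUp, but the PL061 has no
	 * pull resistors; with the quirk set, fall back to the default
	 * configuration instead of failing to configure the pin.
	 */
	if (fw_config == ACPI_PIN_CONFIG_PULLUP &&
	    (quirks & ACPI_Q_AEI_NOPULL) != 0)
		return (ACPI_PIN_CONFIG_DEFAULT);
	return (fw_config);
}

int
main(void)
{
	uint8_t cfg = aei_pin_config(ACPI_PIN_CONFIG_PULLUP, ACPI_Q_AEI_NOPULL);

	printf("pin config used: %s\n",
	    cfg == ACPI_PIN_CONFIG_DEFAULT ? "default (no pull)" : "pull-up");
	return (0);
}
```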
Getting hotplug working — or more specifically, getting hot unplug working, since that's where most of the problems arose — took considerably more work, largely because there were several different problems, each presenting on a subset of EC2 instance types:
- On some Graviton systems, we leaked a (virtual) IRQ reservation during PCI attach; this is harmless in most cases, but after attaching and detaching an EBS volume 67 times we would run out of IRQs and the FreeBSD kernel would panic. This IRQ leakage was happening from some "legacy" PCI interrupt routing code, and in the end I simply added a boot loader setting to turn that code off in EC2.
- On some Graviton systems, the firmware uses PCI device power state as an indication that the OS has finished using a device and is ready for it to be "ejected". This is 100% a bug in EC2, and I believe it will be fixed in due course; in the mean time, FreeBSD/EC2 AMIs have an ACPI quirk ACPI_Q_CLEAR_PME_ON_DETACH which instructs them to flip some bits in the PCI power management register before ejecting a device.
- On the newest generation of EC2 instances (both x86 and Graviton), FreeBSD's nvme driver would panic after a PCIe unplug. This bug I didn't need to fix myself, beyond pointing it out to our nvme driver maintainer.
- On some EC2 instances (both x86 and Graviton), we would see a "ghost" device on the PCI bus after it was ejected; attempts to access the device would fail, but the "ghost" would block any attempt to attach new devices. This turned out to be another EC2 bug: The Nitro firmware managing the PCI bus operated asynchronously from the firmware managing the PCI devices, so there was a window of a few ms where a device had been unplugged but the PCI bus still reported it as being present. On Linux this is (almost always) not a problem since Linux scans buses periodically and typically loses the race; but on FreeBSD we immediately re-scan the PCI bus after detaching a device, so we usually won the race against the Nitro firmware, causing us to "see" the device which was no longer present. My understanding is that this is being fixed in Nitro to ensure that the PCI bus "knows" about a device detach before the detach is acknowledged to the operating system; but in the mean time, FreeBSD/EC2 AMIs have an ACPI quirk ACPI_Q_DELAY_BEFORE_EJECT_RESCAN which adds a 10 ms delay between signalling that a device should be ejected and rescanning the PCI bus.
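To illustrate the last of these fixes, here is a tiny userspace sketch of the ordering change behind ACPI_Q_DELAY_BEFORE_EJECT_RESCAN. The function names, the quirk bit value, and the use of usleep() are stand-ins for the real kernel paths, but the shape of the fix (eject, wait roughly 10 ms, then rescan) is the point.

```c
/*
 * Userspace sketch of the ACPI_Q_DELAY_BEFORE_EJECT_RESCAN idea: give
 * the Nitro firmware a moment to update the PCI bus's view of the world
 * before rescanning it, so we don't "see" a device which has already
 * been ejected.  Names and the quirk bit value are assumptions.
 */
#include <stdio.h>
#include <unistd.h>

#define ACPI_Q_DELAY_BEFORE_EJECT_RESCAN  (1 << 5)  /* assumed bit */

static int acpi_quirks = ACPI_Q_DELAY_BEFORE_EJECT_RESCAN;

/* Stand-ins for the real eject and rescan paths in the kernel. */
static void eject_device(const char *dev) { printf("ejecting %s\n", dev); }
static void rescan_pci_bus(void)          { printf("rescanning PCI bus\n"); }

int
main(void)
{
	eject_device("nvme1");

	/*
	 * Without the delay, FreeBSD's immediate rescan usually wins the
	 * race against the Nitro firmware and finds a ghost device;
	 * 10 ms is enough for the firmware to catch up.
	 */
	if (acpi_quirks & ACPI_Q_DELAY_BEFORE_EJECT_RESCAN)
		usleep(10 * 1000);

	rescan_pci_bus();
	return (0);
}
```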
While those two were Amazon's top priorities for FreeBSD/EC2 work, they were by no means the only things I worked on; in fact they only took up about half of the time I spent on EC2-specific issues. I did a lot of work in 2021 and 2022 to speed up the FreeBSD boot process, but among the "that's weird but I don't have time to investigate right now" issues I had noticed in late 2023 and early 2024 was that FreeBSD/EC2 instances sometimes took a surprisingly long time to boot. I hadn't measured how long they took, mind you; but as part of the FreeBSD weekly snapshot process I ran test boots of a few EC2 instance types, and I had needed to increase the sleep time between launching instances and trying to SSH into them.
Well, the first thing to do with any sort of performance issue is to collect data; so I benchmarked boot time on weekly EC2 AMI builds dating back to 2018 — spinning up over ten thousand EC2 instances in the process — and started generating FreeBSD boot performance plots. Collecting new data and updating those plots is now part of my weekly snapshot testing process; but even without drawing plots, I could immediately see some issues. I got to work:
- Starting in the first week of 2024, the FreeBSD boot process suddenly got about 3x slower. I started bisecting commits, and tracked it down to... a commit which increased the root disk size from 5 GB to 6 GB. Why? Well, I reached out to some of my friends at Amazon, and it turned out that the answer was somewhere between "magic" and "you really don't want to know"; but the important part for me was that increasing the root disk size to 8 GB restored performance to earlier levels.
- FreeBSD was also taking a long time to boot on Graviton 2 (aka c6g and similar) instances, and after some investigation I tracked this down to a problem with kernel entropy seeding: If the FreeBSD kernel doesn't have enough entropy to generate secure random numbers, the boot process stalls until it collects more entropy — and in a VM, that can take a while. Now, we have code to obtain entropy via the EFI boot loader — which effectively means asking the Nitro firmware to give us a secure seed — but that ran into two problems: First, it wasn't actually running in EC2, and second, when it did run, it was absurdly slow on the Graviton 2. The first problem was easily solved: The entropy seeding request was being made from the "boot menu" lua code, and (ironically, in order to improve boot performance) we bypass that menu in EC2; moving the request to the right place in the boot loader lua code ensured that it ran regardless of whether the menu was enabled. The second problem turned out to be a Graviton 2 issue: It could provide a small amount of entropy quickly, but took a long time to provide the 2048 bytes which the FreeBSD boot loader was requesting. This large request was due to the way FreeBSD seeded its entropy system: 32 pools each needed 64 bytes of entropy. Since this was in no way cryptographically necessary — the multiple pools exist only as a protection in case a pool state is leaked — I rewrote the code to take 64 bytes from EFI and use PBKDF2 as an "entropy spreader" to turn that into the 2048 bytes our entropy-feeding API needed (see the sketch after this list). This took the boot time of FreeBSD arm64/base/UFS from ~25 seconds down to ~8 seconds.
- I also noticed that ZFS images were taking quite a bit longer to boot than UFS images — and interestingly, this delta varied depending on the amount of data on the disk (but not the disk size itself). I tracked this down to a weird interaction between our filesystem-building code (makefs) and what happens when you attach a ZFS pool: ZFS performs some filesystem verification steps which involve traversing the most recent transaction group, but makefs puts everything into a single transaction group — so when EC2 ZFS images booted, they had to read and process metadata for every single file on disk (hence the number of files on disk affecting the time taken). Once I tracked down the issue, I was able to report it to FreeBSD's makefs guru (Mark Johnston), who solved the problem simply by recording a higher transaction group number on the filesystem — that way, the single transaction group was not "recent enough" to trigger the ZFS verification logic. ZFS images promptly dropped from ~22 seconds down to ~11 seconds of boot time.
- Finally — and this issue was one I caught promptly as a result of including boot performance in my weekly testing — in December 2024 I updated the net/aws-ec2-imdsv2-get port to support IPv6. This port provides a command-line interface to the EC2 Instance MetaData Service, and is necessary because when Amazon launched "IMDSv2" to paper over (but not properly fix) the security problem inherent in exposing IAM credentials over HTTP, they made it impossible to use FreeBSD fetch(1) to access the IMDS. Unfortunately when IPv6 support was added to aws-ec2-imdsv2-get, two mistakes were made: First, it attempted to connect on IPv6 first (even though IPv4-only is the default IMDS instance configuration); and second, it kept the default TCP timeout (75 seconds). Thanks to my testing, I got this fixed promptly, to attempt IPv4 first and reduce the timeout to 100 ms — considering that IMDS requests are serviced without ever leaving the physical system, waiting more than 100 ms seemed unnecessary!
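To make the entropy-spreading trick from the Graviton 2 item concrete: a single 64-byte seed gets stretched into the 32 pools x 64 bytes = 2048 bytes that the kernel's entropy-feeding interface wants. The sketch below uses OpenSSL's PKCS5_PBKDF2_HMAC purely for illustration; the boot loader has its own code, and the salt string and iteration count here are assumptions.

```c
/*
 * Userspace sketch of the "entropy spreader" idea: stretch one 64-byte
 * seed (stand-in for bytes obtained via the EFI RNG) into 2048 bytes of
 * pool material using PBKDF2.  Illustrative only; not the loader code.
 * Build with: cc spread.c -lcrypto
 */
#include <stdio.h>
#include <openssl/evp.h>
#include <openssl/rand.h>

#define POOLS      32
#define POOLSIZE   64

int
main(void)
{
	unsigned char seed[POOLSIZE];
	unsigned char spread[POOLS * POOLSIZE];
	const unsigned char salt[] = "loader-entropy-spread";	/* assumed */

	/* Stand-in for the 64 bytes of entropy handed to us by EFI. */
	if (RAND_bytes(seed, sizeof(seed)) != 1)
		return (1);

	/*
	 * One PBKDF2 call turns 64 input bytes into 2048 output bytes.
	 * Iteration count is 1: we're spreading entropy, not hardening
	 * a low-entropy password.
	 */
	if (PKCS5_PBKDF2_HMAC((const char *)seed, sizeof(seed),
	    salt, sizeof(salt) - 1, 1, EVP_sha256(),
	    sizeof(spread), spread) != 1)
		return (1);

	printf("spread %zu seed bytes into %zu bytes of pool material\n",
	    sizeof(seed), sizeof(spread));
	return (0);
}
```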
One thing which had long been on my "features to implement" list for FreeBSD/EC2 but I hadn't found time for earlier was adding more AMI "flavours": A year ago, we had base (the FreeBSD base system, with minimal additional code installed from the ports tree to make it "act like an EC2 AMI") and cloud-init (as the name suggests, FreeBSD with Cloud-init installed). I added two more flavours of FreeBSD AMI to the roster: small AMIs, which are like base except without debug symbols, the LLDB debugger, 32-bit libraries, the FreeBSD test suite, the Amazon SSM Agent, or the AWS CLI — which collectively reduces the disk space usage from ~5 GB to ~1 GB while not removing anything which most people will use — and builder AMIs, "FreeBSD AMI Builder" images which provide an easy path for users to create customized FreeBSD AMIs.
Of course, with 4 flavours of FreeBSD AMIs — and two filesystems (UFS and ZFS), two architectures (amd64 and arm64), and three versions of FreeBSD (13-STABLE, 14-STABLE, and 15-CURRENT) — all of the weekly snapshot builds were starting to add up; so in May I finally got around to cleaning up old images (and their associated EBS snapshots). While I don't pay for these images — the FreeBSD release engineering AWS account is sponsored by Amazon — it was still costing someone money; so when I realized I could get rid of 336 TB of EBS snapshots, I figured it was worth spending a few hours writing shell scripts.
While most of my time was spent on managing release cycles and maintaining the FreeBSD/EC2 platform, I did also spend some time on broader release engineering issues — in fact, part of the design of the "quarterly" release schedule is that it leaves a few weeks between finishing one release and starting the next to allow for release engineering work which can't effectively be done in the middle of a release cycle. The first issue I tackled here was parallelizing release building: With a large number of EC2 AMIs being built, a large proportion of the release build time was being spent not building but rather installing FreeBSD into VM images. I reworked the release code to parallelize this, but found that it caused sporadic build failures — which were very hard to isolate, since they only showed up with a complete release build (which took close to 24 hours) and not with any subset of the build. After many hours of work I finally tracked the problem down to a single missing Makefile line: We weren't specifying that a directory should be created before files were installed into it. With that fix, I was able to reduce the release build from ~22 hours down to ~13 hours, and also "unlock" the ability to add more EC2 AMI flavours (which I couldn't do earlier since it would have increased the build time too much).
Another general release engineering issue I started tackling was the problem of build reproducibility — aided by the fact that I had EC2 to draw upon. As part of my weekly testing of snapshot images, I now spin up EC2 instances and have them build their own AMIs — and then use diffoscope to compare the disk images they built against the ones they were launched from. This has already found several issues — including some which appeared partway through the year and were identified quickly thanks to the regular testing — of which I've fixed a few and some others I've passed on to other developers to tackle.
Of course, in addition to the big projects there's also a plethora of smaller issues to tackle. Build breakage (weekly snapshot builds are good at finding this!); reviewing patches to the ENA driver; helping Dave Cottlehuber add support for building OCI Containers and uploading them to repositories; teaching my bsdec2-image-upload tool to gracefully handle internal AWS errors; reporting an AWS security issue I stumbled across... some days everything falls under the umbrella of "other stuff which needs to get done", but a lot of it is just as important as the larger projects.
So what's next? Well, I'm still the FreeBSD release engineering lead and the maintainer of the FreeBSD/EC2 platform — just with rather less time to devote to this work. FreeBSD releases will continue to happen — 15.0 should land in December, followed in 2026 by 14.4, 15.1, 14.5, and 15.2 — but I probably won't have time to jump in and fix things as much, so late-landing features are more likely to get removed from a release than fixed in time for it; we were only able to ship OCI Containers starting in FreeBSD 14.2 because I had funded hours to make sure all the pieces landed intact, and that sort of involvement won't be possible. On the EC2 side, now that I have regression testing of boot performance set up, I'll probably catch any issues which need to be fixed there; but the rest of my "features to implement" list — automatically growing filesystems when EBS volumes expand, better automatic configuration with multiple network interfaces (and network interface hot plug), rolling "pre-patched" AMIs (right now FreeBSD instances update themselves when they first boot), putting together a website to help users generate EC2 user-data files (e.g., for installing packages and launching daemons), returning to my work on FreeBSD/Firecracker and making it a supported FreeBSD platform, etc. — is likely to stagnate unless I find more time.
I've been incredibly lucky to get this sponsorship from Amazon; it's far more than most open source developers ever get. I wish it wasn't ending; but I'm proud of the work I've done and I'll always be grateful to Amazon for giving me this opportunity.