How to support FreeBSD on your cloud
As most of my readers are aware, I ported FreeBSD to run on the Amazon EC2 cloud, and have been maintaining the platform ever since. Between this and my (far more recent) role as FreeBSD release engineering lead, I sometimes get a question from other cloud providers: "What's involved in supporting FreeBSD?" — or sometimes, "How can we make FreeBSD on our cloud work as well as FreeBSD works on EC2?". This came up again while I was at the FreeBSD Vendor Summit in San Jose, and it occurred to me that it would be useful to write things down once — even if there's only a handful of clouds which aspire to compete with Amazon, it will save me time and allow me to give a more complete answer than if I'm trying to answer off the top of my head at a conference.

1. Contact us. It may sound obvious, but we can't help you if you don't get in touch with us. Perhaps slightly less obvious: Please email the release engineering team (re@freebsd.org). Don't just email me directly; in addition to the fact that additional eyes from the release engineering team make it less likely that I'll miss an email, I'm not going to be the release engineering lead forever; I try to CC the team on as much as possible so that whoever succeeds me will have the maximum possible opportunity to see what's involved.
2. We'll need a sponsored account. In AWS I actually have three sponsored accounts — the account I use for development, the release engineering account (for security reasons, this needs to be an account which is only used for publishing images), and an account which hosts some FreeBSD project infrastructure. The first and third are optional, but you want us to publish images and we don't want to pay to do this. (Why should you want us to publish images, rather than doing it yourself? Among other things, because that dramatically increases the odds that I can write "is available" in a release announcement rather than "will hopefully be available soon".)
3. Tell us how to publish images. Hopefully you've already determined that FreeBSD can run in your cloud; but we need to know how to package it up (do you need a raw disk image, or something more complex?), upload it into your cloud, and often as a third step convert the uploaded disk image into whatever your version of an "AMI" is. We would like to integrate this into the release building process and publish images as part of our weekly snapshot build process; experience shows that this works much better than only uploading images when a release arrives.
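To make the shape of this concrete, here is a minimal sketch of the sort of publishing flow involved; the "cloudctl" tool and its subcommands are entirely hypothetical stand-ins for whatever your cloud's CLI actually provides, and in practice the real steps get wired into the FreeBSD release-building code rather than living in a standalone script:

    #!/bin/sh
    # Hypothetical image-publishing flow for a new cloud provider.
    # "cloudctl" is a placeholder for your cloud's real CLI.
    set -e

    IMG=FreeBSD-15.0-CURRENT-amd64.raw    # raw disk image from the release build
    NAME="FreeBSD 15.0-CURRENT amd64 $(date +%Y%m%d)"

    # Step 1: upload the raw disk image into the cloud's image store.
    UPLOAD_ID=$(cloudctl image upload --file "${IMG}" --format raw)

    # Step 2: convert the uploaded disk into a bootable machine image
    # (the equivalent of an EC2 "AMI").
    IMAGE_ID=$(cloudctl image register --upload "${UPLOAD_ID}" --name "${NAME}")

    # Step 3: make the image public so that anyone can launch it.
    cloudctl image set-visibility --image "${IMAGE_ID}" --public

    echo "Published ${NAME} as ${IMAGE_ID}"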
4. Find someone who can be a liaison between you and the FreeBSD project. The ideal person here is a FreeBSD committer who works for your company. It's important that they are able to contact the right people in your company when issues arise, and it's essential that they're sufficiently involved in FreeBSD to understand the project culture and be aware of ongoing events (e.g. release cycles).
5. Find someone who will test the weekly images. At a minimum, this means making sure the images exist, boot, and pass some minimal testing every week. Things break, often for wildly unexpected reasons; detecting and reporting problems promptly is important. The person who does this doesn't need to be a developer, although some basic shell scripting ability is certainly useful; the most important thing is that they are reliable and diligent. This is probably a couple hours a week on average, but that is a bit misleading; it's more like an hour a week if nothing breaks, but five hours a week if something goes wrong and needs to be investigated. It's ideal if this is your FreeBSD liaison, since once they find problems they'll need to talk to people to get them fixed, but it's not absolutely necessary. You may be tempted to just run automated QA. That's great, but I don't think it can replace a human poking at virtual machines; my experience with EC2 is that images break in all sorts of wild ways, and nothing beats a human for saying "huh, I have no idea what's going on here but something seems weird".
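For a sense of what "minimal testing" means, a smoke test along these lines catches a surprising number of problems. This is only a sketch: the instance would be launched with your cloud's own tooling, and the address and login user here are placeholders:

    #!/bin/sh
    # Minimal weekly smoke test for a freshly launched FreeBSD instance.
    # Assumes the instance was launched from this week's image and that
    # we can SSH in; the address and user below are placeholders.
    IP=198.51.100.17
    LOGIN=ec2-user    # or whatever default user your images configure

    # Does it boot far enough to answer SSH?
    if ! ssh -o ConnectTimeout=30 "${LOGIN}@${IP}" true; then
        echo "FAIL: cannot SSH into this week's image" >&2
        exit 1
    fi

    # Basic sanity: right version and architecture, disks and network alive.
    ssh "${LOGIN}@${IP}" \
        'freebsd-version; uname -m; df -h /; fetch -o /dev/null https://www.FreeBSD.org/'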
6. Make sure your FreeBSD liaison is aware of upcoming hardware well in advance. You're probably doing new "generations" of cloud hardware every two years, and you probably only have it ready for testing a couple months before it launches. But each FreeBSD stable branch only gets a release every 6 months (two branches alternating means a release every quarter); if you wait until you have hardware available, there won't be a release with any changes you need on launch day. A lot of useful planning can happen before hardware is available, but only if there's someone who is aware of what's coming and is in a position to say "hey, do we support foo in our bar driver?"
7. Provide drivers for any special hardware you have. This probably means network interfaces, since we seem to be moving towards a world where everyone uses nvme disks for storage but everyone has their own custom network hardware. Here you're going to need a kernel developer with some experience with FreeBSD, but assuming you wrote a Linux driver, a lot of that work can be reused as long as you don't mind putting a BSD license on it. Ideally your driver maintainer will get a FreeBSD commit bit and maintain it in our tree — we have a process for giving commit privileges to driver maintainers who are employed by vendors — if this isn't an option, you'll need your "FreeBSD liaison" to push driver updates in for you. There is absolutely no need for the person maintaining your network driver to be the person doing weekly testing.
That's the bare minimum to have a functional FreeBSD offering. EC2 goes beyond this; they spent a year sponsoring me to work part-time on FreeBSD and got neat features like boot performance plots and multiple AMI "flavours" out, and I've also done a lot of things over the years as unfunded work aimed at improving FreeBSD on EC2 — for example, "install security updates when we first boot" and "make FreeBSD boot fast" — which can also yield benefits for FreeBSD on other clouds.
It's going to take a lot of work to make any other cloud's FreeBSD support as good as EC2's FreeBSD support. I intend to keep it that way. But I'd like to have more clouds at least providing functional FreeBSD offerings — and to put in the work needed to keep them that way.
Thoughts on (Amazonian) Leadership
Amazon's Leadership Principles are famous, not just within Amazon but also in the tech world at large. While they're frequently mocked — including by Amazonians — they're also generally sensible rules by which to run a company. I've been an Amazon customer for over 25 years, an AWS customer for almost 20 years, and an AWS Hero for 6 years; while I've never worked for Amazon, I feel that I've seen behind the curtain enough to offer some commentary on a few of these principles.
-
Customer Obsession: Leaders start with the customer and work backwards.
They work vigorously to earn and keep customer trust. Although leaders pay
attention to competitors, they obsess over customers.
Customer Obsession is great, but I often see Amazonians taking this too simplistically: "Start with the customer" doesn't have to mean "ask customers what they want and then give them faster horses". In the early days of AWS I saw a lot of what I call "cool engineering driven" products: When EC2 launched, it wasn't really clear what people would do with it, but it was very cool and it was clear that it could be a big deal in some form, sooner or later. Some time around 2012, the culture in AWS seemed to shift from "provide cool building blocks" to "build what customers are asking for", and in my view this was a step in the wrong direction (mind you, not nearly as much as the ca. 2020 shift to "build what analysts are asking for in quarterly earnings calls").

This tension between what customers are asking for and what customers really need shows up in areas like resilience. Amazon's "Well-Architected Framework" strongly exhorts customers to avoid building production workloads in a single Availability Zone — but Amazon's cross-AZ bandwidth pricing is painful, and Amazon doesn't provide useful tools for building durable multi-AZ applications. Most customers are not going to implement Paxos, and very few customers — certainly not executives who are removed from actual development processes — are going to ask Amazon for Paxos-as-a-service; but if Amazonians sat down and asked themselves "what do customers need in order to design their applications well" they could probably come up with several services which Amazon already has internally. AWS should return to its roots and release important building blocks — the things customers will need, not necessarily what they're asking for.
-
Ownership: Leaders are owners. They think long term and don't sacrifice
long-term value for short-term results. They act on behalf of the entire
company, beyond just their own team. They never say "that's not my job."
This principle is both too narrow, and not being fulfilled, in my view. It's not enough to simply act on behalf of the entire company: It's important to act on behalf of the entire technological ecosystem. Some Amazonians are great at this — I recently committed patches to FreeBSD's bhyve because an Amazonian was putting together a standard for interrupt handling in large VMs, and even though Amazon doesn't make any use of bhyve (at least, I don't think it does!) he understood the importance of getting standards widely accepted across the entire virtualization space rather than narrowly in the code Amazon relied upon. There's a saying in computer security that anything which makes one of us less secure makes all of us less secure: Attackers will leverage an exploit against one system to allow them to attack another system. While the same does not directly apply in other fields, working with others to produce the best results for everyone will be much better in the long term than focusing solely on what Amazon needs right now.

But in general Amazon doesn't even live up to its stated (narrow) promise of having leaders act on behalf of the entire company — it's simply too siloed. Amazon is famously secretive, and this applies internally as well as externally: When AWS launches two similar services, it's often because two teams didn't know what each other was working on. How can leaders act across the entire company if nobody knows what's happening outside of their team? They can't; and if Amazon wants to allow its best people to be true Owners, Amazon needs to start breaking down walls.
-
Bias for Action: Speed matters in business. Many decisions and actions
are reversible and do not need extensive study. We value calculated risk
taking.
Amazonians talk about "one-way doors" and "two-way doors", and it is quite true that many decisions can be reversed... but that doesn't always mean that there is no cost associated with reversing a decision. There is a clear and widely recognized tension between "Bias for Action" and another principle, "Insist on the Highest Standards"; but there is also a tension between this and earning and keeping customer trust. When AWS ships a service which is half-baked, it diminishes customer trust in AWS as a whole; even if the problems in that service ultimately get corrected (either by fixing them or in some cases by simply getting rid of a service which should never have existed in the first place), the memory of a failed launch will live on in customers' minds for years to come.

During my seven-year tenure as FreeBSD Security Officer, people knew me as the guy sending out security advisories; but the most important thing I did wasn't shipping Security Advisories — it was stopping the train and saying "no, we are not going to send this out yet". I knew that for all the importance of getting patches into people's hands in a timely manner, it was even more important to establish trust: If I gave people a broken patch, even once, they would be much slower to install security updates in the future. My team became familiar with the phrase "convince me that this is correct", and I'd like to see more of that at senior levels of Amazon: Principal and Distinguished Engineers need to step in with a bias for inaction, and use the respect they have earned to stop projects which do not meet the highest standards before they undermine trust. Amazon's hiring process famously includes "bar raisers" who can veto hiring decisions; Amazon should also have service bar raisers who can veto launches.
A year of funded FreeBSD
I've been maintaining FreeBSD on the Amazon EC2 platform ever since I first got it booting in 2010, but in November 2023 I added to my responsibilities the role of FreeBSD release engineering lead — just in time to announce the availability of FreeBSD 14.0, although Glen Barber did all the release engineering work for that release. While I receive a small amount of funding from Antithesis and from my FreeBSD/EC2 Patreon, it rapidly became clear that my release engineering duties were competing with — in fact, out-competing — FreeBSD/EC2 for my available FreeBSD volunteer hours: In addition to my long list of "features to implement" stagnating, I had increasingly been saying "huh, that's weird... oh well, no time to investigate that now". In short, by early 2024 I was becoming increasingly concerned that I was not in a position to be a good "owner" of the FreeBSD/EC2 platform.

For several years leading up to this point I had been talking to Amazonians on and off about the possibility of Amazon sponsoring my FreeBSD/EC2 work; rather predictably, most of those conversations ended with my contacts at Amazon saying some variation of "Amazon should definitely sponsor the work you're doing... but I don't have any money available in my budget for this". Finally in April 2024 I found someone with a budget, and after some discussions around timeline, scope, and process, it was determined that Amazon would support me for a year via GitHub Sponsors. I'm not entirely sure if the year in question was June through May or July through June — money had to move within Amazon, from Amazon to GitHub, from GitHub to Stripe, and finally from Stripe into my bank account, so when I received money doesn't necessarily reflect when Amazon intended to give me money — but either way the sponsorship either has come to an end or is coming to an end soon, so I figured now was a good time to write about what I've done.
Amazon was nominally sponsoring me for 40 hours/month of work on FreeBSD release engineering and FreeBSD/EC2 development — I made it clear to them that sponsoring one and not the other wasn't really feasible, especially given dependencies between the two — and asked me to track how much time I was spending on things. In the end, I spent roughly 50 hours/month on this work, averaging 20 hours/month spent on EC2-specific issues, 20 hours/month making FreeBSD releases happen, and 10 hours/month on other release engineering related work — although the exact breakdown varied dramatically from month to month.
Following FreeBSD's quarterly release schedule (which I announced in July 2024, but put together and presented at the FreeBSD developer summit at BSDCan in May 2024), I managed four FreeBSD releases during the past year: FreeBSD 13.4, in September 2024; FreeBSD 14.2, in December 2024; FreeBSD 13.5, in March 2025; and FreeBSD 14.3, currently scheduled for release on June 10th. The work involved in managing each of these releases — nagging developers to get their code into the tree in time, approving (or disapproving!) merge requests, coordinating with other teams, building and testing images (usually three Betas, one Release Candidate, and the final Release), writing announcement text, and fixing any release-building breakage which arose along the way — mostly happened in the month prior to the release (I refer to the second month of each calendar quarter as "Beta Month") and ranged from a low of 33.5 hours (for FreeBSD 13.5) to a high of 79 hours (for FreeBSD 14.2). As one might imagine, the later in a stable branch you get, the fewer things break and the less work a release requires; while I wasn't tracking hours when I managed FreeBSD 14.1, I suspect it took close to 100 hours of release engineering time, and FreeBSD 15.0 is very likely to be well over that.
On the FreeBSD/EC2 side of things, there were two major features which Amazon encouraged me to prioritize: The "power driver" for AWS Graviton instances (aka "how the EC2 API tells the OS to shut down" — without this, FreeBSD ignores the shutdown signal and a few minutes later EC2 times out and yanks the virtual power cable), and device hotplug on AWS Graviton instances. The first of these was straightforward: On Graviton systems, the "power button" is a GPIO pin, the details of which are specified via an ACPI _AEI object. I added code to find those in ACPI and pass the appropriate configuration through to the driver for the PL061 GPIO controller; when the GPIO pin is asserted, the controller generates an interrupt which causes the ACPI "power button" event to be triggered, which in turn now shuts down the system. There was one minor hiccup: The ACPI tables provided by EC2 specify that the GPIO pin in question should be configured as a "Pull Up" pin, but the PL061 controller in fact doesn't have any pullup/pulldown resistors; this didn't cause problems on Linux because Linux silently ignores GPIO configuration failures, but on FreeBSD we disabled the device after failing to configure it. I believe this EC2 bug will be fixed in future Graviton systems; but in the mean time I ship FreeBSD/EC2 AMIs with a new "quirk": ACPI_Q_AEI_NOPULL, aka "Ignore the PullUp flag on GPIO pin specifications in _AEI objects".
Getting hotplug working — or more specifically, getting hot unplug working, since that's where most of the problems arose — took considerably more work, largely because there were several different problems, each presenting on a subset of EC2 instance types:
- On some Graviton systems, we leaked a (virtual) IRQ reservation during PCI attach; this is harmless in most cases, but after attaching and detaching an EBS volume 67 times we would run out of IRQs and the FreeBSD kernel would panic. This IRQ leakage was happening from some "legacy" PCI interrupt routing code, and in the end I simply added a boot loader setting to turn that code off in EC2.
- On some Graviton systems, the firmware uses PCI device power state as an indication that the OS has finished using a device and is ready for it to be "ejected". This is 100% a bug in EC2, and I believe it will be fixed in due course; in the mean time, FreeBSD/EC2 AMIs have an ACPI quirk ACPI_Q_CLEAR_PME_ON_DETACH which instructs them to flip some bits in the PCI power management register before ejecting a device.
- On the newest generation of EC2 instances (both x86 and Graviton), FreeBSD's nvme driver would panic after a PCIe unplug. This bug I didn't need to fix myself, beyond pointing it out to our nvme driver maintainer.
- On some EC2 instances (both x86 and Graviton), we would see a "ghost" device on the PCI bus after it was ejected; attempts to access the device would fail, but the "ghost" would block any attempt to attach new devices. This turned out to be another EC2 bug: The Nitro firmware managing the PCI bus operated asynchronously from the firmware managing the PCI devices, so there was a window of a few ms where a device had been unplugged but the PCI bus still reported it as being present. On Linux this is (almost always) not a problem since Linux scans buses periodically and typically loses the race; but on FreeBSD we immediately re-scan the PCI bus after detaching a device, so we usually won the race against the Nitro firmware, causing us to "see" the device which was no longer present. My understanding is that this is being fixed in Nitro to ensure that the PCI bus "knows" about a device detach before the detach is acknowledged to the operating system; but in the mean time, FreeBSD/EC2 AMIs have an ACPI quirk ACPI_Q_DELAY_BEFORE_EJECT_RESCAN which adds a 10 ms delay between signalling that a device should be ejected and rescanning the PCI bus.
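Most of the debugging here boiled down to attaching and detaching EBS volumes over and over while watching what the kernel did. A rough sketch of that exercise loop, assuming the AWS CLI is configured and using placeholder volume and instance IDs:

    #!/bin/sh
    # Repeatedly hot-plug and hot-unplug an EBS volume to exercise the
    # PCI attach/detach paths.  Volume and instance IDs are placeholders.
    VOL=vol-0123456789abcdef0
    INST=i-0123456789abcdef0

    i=0
    while [ $i -lt 100 ]; do
        aws ec2 attach-volume --volume-id "$VOL" --instance-id "$INST" --device /dev/sdf
        aws ec2 wait volume-in-use --volume-ids "$VOL"

        aws ec2 detach-volume --volume-id "$VOL"
        aws ec2 wait volume-available --volume-ids "$VOL"

        i=$((i + 1))
        echo "completed $i attach/detach cycles"
    done
    # Meanwhile, on the instance itself, watch for ghosts and leaks with
    # something like: while true; do pciconf -l; nvmecontrol devlist; sleep 5; done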
While those two were Amazon's top priorities for FreeBSD/EC2 work, they were by no means the only things I worked on; in fact they only took up about half of the time I spent on EC2-specific issues. I did a lot of work in 2021 and 2022 to speed up the FreeBSD boot process, but among the "that's weird but I don't have time to investigate right now" issues I had noticed in late 2023 and early 2024 was that FreeBSD/EC2 instances sometimes took a surprisingly long time to boot. I hadn't measured how long they took, mind you; but as part of the FreeBSD weekly snapshot process I ran test boots of a few EC2 instance types, and I had needed to increase the sleep time between launching instances and trying to SSH into them.
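Measuring this is conceptually simple: time how long it takes from launching an instance until SSH answers. A stripped-down sketch, assuming the AWS CLI, a key pair, and a security group which allows SSH (the AMI ID and key name are placeholders; my real harness is considerably more careful):

    #!/bin/sh
    # Rough boot-time measurement: seconds from RunInstances until the
    # instance answers on port 22.
    AMI=ami-0123456789abcdef0
    START=$(date +%s)

    ID=$(aws ec2 run-instances --image-id "$AMI" --instance-type c6g.medium \
        --key-name test-key --query 'Instances[0].InstanceId' --output text)

    # Wait for a public IP address to be assigned.
    IP=None
    while [ "$IP" = "None" ]; do
        sleep 1
        IP=$(aws ec2 describe-instances --instance-ids "$ID" \
            --query 'Reservations[0].Instances[0].PublicIpAddress' --output text)
    done

    # Poll until SSH is accepting connections.
    while ! nc -z -w 1 "$IP" 22; do
        sleep 1
    done
    echo "Boot took $(( $(date +%s) - START )) seconds"

    aws ec2 terminate-instances --instance-ids "$ID" > /dev/null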
Well, the first thing to do with any sort of performance issue is to collect data; so I benchmarked boot time on weekly EC2 AMI builds dating back to 2018 — spinning up over ten thousand EC2 instances in the process — and started generating FreeBSD boot performance plots. Collecting new data and updating those plots is now part of my weekly snapshot testing process; but even without drawing plots, I could immediately see some issues. I got to work:
- Starting in the first week of 2024, the FreeBSD boot process suddenly got about 3x slower. I started bisecting commits, and tracked it down to... a commit which increased the root disk size from 5 GB to 6 GB. Why? Well, I reached out to some of my friends at Amazon, and it turned out that the answer was somewhere between "magic" and "you really don't want to know"; but the important part for me was that increasing the root disk size to 8 GB restored performance to earlier levels.
-
FreeBSD was also taking a long time to boot on Graviton 2 (aka c6g
and similar) instances, and after some investigation I tracked this down to
a problem with kernel entropy seeding: If the FreeBSD kernel doesn't have
enough entropy to generate secure random numbers, the boot process stalls
until it collects more entropy — and in a VM, that can take a while.
Now, we have code to obtain entropy via the EFI boot loader — which
effectively means asking the Nitro firmware to give us a secure seed —
but that ran into two problems: First, it wasn't actually running in EC2,
and second, when it did run, it was absurdly slow on the Graviton 2.
The first problem was easily solved: The entropy seeding request was being made from the "boot menu" lua code, and (ironically, in order to improve boot performance) we bypass that menu in EC2; moving that to the right place in the boot loader lua code ensured that it ran regardless of whether the menu was enabled. The second problem turned out to be a Graviton 2 issue: It could provide a small amount of entropy quickly, but took a long time to provide the 2048 bytes which the FreeBSD boot loader was requesting. This large request was due to the way that FreeBSD seeded its entropy system; 32 pools each needed 64 bytes of entropy. Since this was in no way cryptographically necessary — the multiple pools exist only as a protection in case a pool state is leaked — I rewrote the code to make it possible to take 64 bytes from EFI and use PBKDF2 as an "entropy spreader" to turn that into the 2048 bytes our entropy-feeding API needed. This took the boot time of FreeBSD arm64/base/UFS from ~25 seconds down to ~8 seconds.
- I also noticed that ZFS images were taking quite a bit longer to boot than UFS images — and interestingly, this delta varied depending on the amount of data on the disk (but not the disk size itself). I traced this to a weird interaction of our filesystem-building code (makefs) and what happens when you attach a ZFS pool: ZFS performs some filesystem verification steps which involve traversing the most recent transaction group, but makefs puts everything into a single transaction group — so when EC2 ZFS images booted, they had to read and process metadata for every single file on disk (hence the number of files on disk affecting the time taken). Once I tracked down the issue, I was able to report it to FreeBSD's makefs guru (Mark Johnston), who solved the problem simply by recording a higher transaction group on the filesystem — that way, the single transaction group was not "recent enough" to prompt the ZFS transaction group verification logic. ZFS images promptly dropped from ~22 seconds down to ~11 seconds of boot time.
- Finally — and this issue was one I caught promptly as a result of including boot performance in my weekly testing — in December 2024 I updated the net/aws-ec2-imdsv2-get port to support IPv6. This port provides a command-line interface to the EC2 Instance MetaData Service, and is necessary because when Amazon launched "IMDSv2" to paper over (but not properly fix) the security problem inherent in exposing IAM credentials over HTTP, they made it impossible to use FreeBSD fetch(1) to access the IMDS. Unfortunately when IPv6 support was added to aws-ec2-imdsv2-get, two mistakes were made: First, it attempted to connect on IPv6 first (even though IPv4-only is the default IMDS instance configuration); and second, it kept the default TCP timeout (75 seconds). Thanks to my testing, I got this fixed promptly, to attempt IPv4 first and reduce the timeout to 100 ms — considering that IMDS requests are serviced without ever leaving the physical system, waiting more than 100 ms seemed unnecessary!
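For anyone curious what IMDSv2 access looks like, the moral equivalent of what the port does, written with curl rather than fetch(1) for the reason above, is roughly the following; the short connection timeout reflects the fact that IMDS requests never leave the physical host:

    #!/bin/sh
    # Fetch EC2 instance metadata via IMDSv2: obtain a session token first,
    # then use it for the actual request.  Use the IPv4 endpoint with a
    # deliberately short connection timeout.
    TOKEN=$(curl -s --connect-timeout 0.1 -X PUT \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60" \
        http://169.254.169.254/latest/api/token)
    curl -s --connect-timeout 0.1 \
        -H "X-aws-ec2-metadata-token: $TOKEN" \
        http://169.254.169.254/latest/meta-data/instance-id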
One thing which had long been on my "features to implement" list for FreeBSD/EC2 but I hadn't found time for earlier was adding more AMI "flavours": A year ago, we had base (the FreeBSD base system, with minimal additional code installed from the ports tree to make it "act like an EC2 AMI") and cloud-init (as the name suggests, FreeBSD with Cloud-init installed). I added two more flavours of FreeBSD AMI to the roster: small AMIs, which are like base except without debug symbols, the LLDB debugger, 32-bit libraries, FreeBSD tests, or the Amazon SSM Agent or AWS CLI — which collectively reduces the disk space usage from ~5 GB to ~1 GB while not removing anything which most people will use — and builder AMIs, which are FreeBSD AMI Builder AMIs, providing an easy path for users to create customized FreeBSD AMIs.
Of course, with 4 flavours of FreeBSD AMIs — and two filesystems (UFS and ZFS), two architectures (amd64 and arm64), and three versions of FreeBSD (13-STABLE, 14-STABLE, and 15-CURRENT) — all of the weekly snapshot builds were starting to add up; so in May I finally got around to cleaning up old images (and their associated EBS snapshots). While I don't pay for these images — the FreeBSD release engineering AWS account is sponsored by Amazon — it was still costing someone money; so when I realized I could get rid of 336 TB of EBS snapshots, I figured it was worth spending a few hours writing shell scripts.
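The cleanup itself isn't complicated: deregister old AMIs, then delete the EBS snapshots which backed them. A simplified sketch, with a purely illustrative 90-day retention policy standing in for the real rules:

    #!/bin/sh
    # Clean up old weekly-snapshot AMIs and their backing EBS snapshots.
    # The 90-day cutoff here is illustrative, not the real policy.
    CUTOFF=$(date -v-90d +%Y-%m-%d)    # FreeBSD date(1); GNU date uses -d '90 days ago'

    aws ec2 describe-images --owners self \
        --query 'Images[].[ImageId,CreationDate,BlockDeviceMappings[0].Ebs.SnapshotId]' \
        --output text |
    awk -v cutoff="$CUTOFF" '$2 < cutoff' |
    while read -r AMI DATE SNAP; do
        echo "Removing ${AMI} from ${DATE} (snapshot ${SNAP})"
        aws ec2 deregister-image --image-id "${AMI}"
        aws ec2 delete-snapshot --snapshot-id "${SNAP}"
    done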
While most of my time was spent on managing release cycles and maintaining the FreeBSD/EC2 platform, I did also spend some time on broader release engineering issues — in fact, part of the design of the "quarterly" release schedule is that it leaves a few weeks between finishing one release and starting the next to allow for release engineering work which can't effectively be done in the middle of a release cycle. The first issue I tackled here was parallelizing release building: With a large number of EC2 AMIs being built, a large proportion of the release build time was being spent not building but rather installing FreeBSD into VM images. I reworked the release code to parallelize this, but found that it caused sporadic build failures — which were very hard to isolate, since they only showed up with a complete release build (which took close to 24 hours) and not with any subset of the build. After many hours of work I finally tracked the problem down to a single missing Makefile line: We weren't specifying that a directory should be created before files were installed into it. With that fix, I was able to reduce the release build from ~22 hours down to ~13 hours, and also "unlock" the ability to add more EC2 AMI flavours (which I couldn't do earlier since it would have increased the build time too much).
Another general release engineering issue I started tackling was the problem of build reproducibility — aided by the fact that I had EC2 to draw upon. As part of my weekly testing of snapshot images, I now spin up EC2 instances and have them build their own AMIs — and then use diffoscope to compare the disk images they built against the ones they were launched from. This has already found several issues — including some which appeared partway through the year and were identified quickly thanks to the regular testing — of which I've fixed a few and some others I've passed on to other developers to tackle.
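The comparison step itself is straightforward once both disk images are in hand; the interesting work is tracking down whatever diffoscope reports. A sketch of that step, with placeholder file names:

    #!/bin/sh
    # Compare the VM image rebuilt on an EC2 instance against the image
    # it was launched from; any difference is a reproducibility bug.
    ORIG=FreeBSD-15.0-CURRENT-amd64.raw              # image the instance booted from
    REBUILT=rebuilt/FreeBSD-15.0-CURRENT-amd64.raw   # image the instance built itself

    if diffoscope --text differences.txt "$ORIG" "$REBUILT"; then
        echo "Images are bit-for-bit identical"
    else
        echo "Reproducibility failure; see differences.txt" >&2
    fi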
Of course, in addition to the big projects there's also a plethora of smaller issues to tackle. Build breakage (weekly snapshot builds are good at finding this!); reviewing patches to the ENA driver; helping Dave Cottlehuber add support for building OCI Containers and uploading them to repositories; teaching my bsdec2-image-upload tool to gracefully handle internal AWS errors; reporting an AWS security issue I stumbled across... some days everything falls under the umbrella of "other stuff which needs to get done", but a lot of it is just as important as the larger projects.
So what's next? Well, I'm still the FreeBSD release engineering lead and the maintainer of the FreeBSD/EC2 platform — just with rather less time to devote to this work. FreeBSD releases will continue to happen — 15.0 should land in December, followed in 2026 by 14.4, 15.1, 14.5, and 15.2 — but I probably won't have time to jump in and fix things as much, so late-landing features are more likely to get removed from a release rather than fixed in time for the release; we were only able to ship OCI Containers starting in FreeBSD 14.2 because I had funded hours to make sure all the pieces landed intact, and that sort of involvement won't be possible. On the EC2 side, now that I have regression testing of boot performance set up, I'll probably catch any issues which need to be fixed there; but the rest of my "features to implement" list — automatically growing filesystems when EBS volumes expand, better automatic configuration with multiple network interfaces (and network interface hot plug), rolling "pre-patched" AMIs (right now FreeBSD instances update themselves when they first boot), putting together a website to help users generate EC2 user-data files (e.g., for installing packages and launching daemons), returning to my work on FreeBSD/Firecracker and making it a supported FreeBSD platform, etc. — is likely to stagnate unless I find more time.
I've been incredibly lucky to get this sponsorship from Amazon; it's far more than most open source developers ever get. I wish it wasn't ending; but I'm proud of the work I've done and I'll always be grateful to Amazon for giving me this opportunity.
Chunking attacks on Tarsnap (and others)
Ten years ago I wrote that it would require someone smarter than me to extract information from the way that Tarsnap splits data into chunks. Well, I never claimed to be the smartest person in the world! Working with Boris Alexeev and Yan X Zhang, I've just uploaded a paper to the Cryptology ePrint Archive describing a chosen-plaintext attack which would allow someone with access to the Tarsnap server (aka me, Amazon, or the NSA) or potentially someone with sufficient ability to monitor network traffic (e.g. someone watching your wifi transmissions) to extract Tarsnap's chunking parameters. We also present both known and chosen plaintext attacks against BorgBackup, and known plaintext attacks against Restic.

And, of course, because Tarsnap is intended to be "Online backups for the truly paranoid", I've released a new version of Tarsnap today (version 1.0.41) which contains mitigations for these attacks, bringing us back to "I can't see any computationally feasible attack"; but I'm also exploring possibilities for making the chunking provably secure.
I'm sure many people reading this right now are asking the same question: Are my secrets safe? To this I have to say "almost certainly yes". The attack we have to leak Tarsnap's chunking parameters is a chosen plaintext attack — you would have to archive data provided to you by the attacker — and the chosen plaintext has a particular signature (large blocks of "small alphabet" data) which would show up on the Tarsnap server (I can't see your data, but I can see block sizes, and this sort of plaintext is highly compressible). Furthermore, even after obtaining Tarsnap's chunking parameters, leaking secret data would be very challenging, requiring an interactive attack which mixes chosen plaintext with your secrets.
Leaking known data (e.g. answering the question "is this machine archiving a copy of the FreeBSD 13.5-RELEASE amd64 dvd1.iso file") is possible given knowledge of the chunking parameters; but this doesn't particularly enhance an attacker's capabilities since an attacker who can perform a chosen plaintext attack (necessary in order to extract Tarsnap's chunking parameters) can already determine if you have a file stored, by prompting you to store it again and using deduplication as an oracle.
In short: Don't worry, but update to the latest version anyway.
Thanks to Boris Alexeev, Yan X Zhang, Kien Tuong Truong, Simon-Philipp Merz, Matteo Scarlata, Felix Günther and Kenneth G. Paterson for their assistance. It takes a village.
My re:Invent asks
As an AWS Hero I get free admission to the AWS re:Invent conference; while it's rare that I'm interested in many talks — in previous years I've attended "Advanced" talks which didn't say anything which wasn't already in the published documentation — I do find that it provides a very good opportunity to talk to Amazonians.

While I'm sure many of the things I ask for get filed under "Colin is weird", I know sometimes Amazon does pay attention — at least, once I find the right person to talk to. Since I have quite a list this year, and I know some Amazonians (and maybe even non-Amazonians) may be interested, I figured I might as well post them here.
- More Amazonian OSS developers at re:Invent. I'm looking forward to meeting some Valkey developers on Wednesday, but I was disappointed that none of the Firecracker developers are in attendance. Amazon has a policy of not having engineers attend re:Invent unless they're giving talks (and I'd love to see this policy changed in general) but it's absolutely essential for Open Source developers to go to conferences; that's how we meet potential contributors. If your open source team doesn't go to conferences, they're not really doing open source, no matter what license you put on the code.
- Lower cross-AZ bandwidth pricing. I don't even particularly care about the cost; but being worried about avoiding cross-AZ bandwidth is making people design bad systems. One of the guidelines in Amazon's "well-architected framework" is to deploy the workload to multiple locations, and Amazon specifically calls out using a single Availability Zone as a problem — but concerns about cross-AZ bandwidth (even if it turns out that the concerns are unwarranted!) are preventing people from following this guideline.
-
On-the-rack EBS storage. I don't know how Amazon datacenters are set up,
but the latency of disk I/O to "SSD" EBS volumes strongly suggests that they
are a significant distance away from EC2 instances which are accessing them.
At the other end of the latency scale, some EC2 instance types have SSDs
directly attached to the instance hardware, with dramatically better I/O
performance — but have low durability (if the instance dies the data
is gone) and no elasticity (each instance type has a certain amount of disk
attached).
Having EBS storage available on the same racks as EC2 nodes would provide an intermediate point, allowing lower latency than the speed of light allows for across-the-datacenter I/Os, while allowing some flexibility in the size of volumes. Users would have to accept that "provision me a volume on the same rack as this instance" might return "sorry all the disks on that rack are full"; but at least at instance launch time requests could be satisfied by searching for a rack with sufficient rack-local disk.
- CHERI capable instances. This has been a long standing wishlist item for me and I know I'm not going to get it any time soon; but I know Amazon (and other clouds) have Morello boards for research purposes. CHERI has huge advantages for security and whichever cloud pursues this first will be miles ahead of the competition.
-
Marketplace support for "pending" or "scheduled" releases. When I add new
FreeBSD releases to the AWS Marketplace, they first go through an approval
process and then get copied out to all the EC2 regions; once that is done,
the Marketplace updates the "product" listing with the new version and sends
out emails to all the current users telling them about the new version.
This often means that Amazon is sending emails announcing new FreeBSD
releases a couple days before I send out the official FreeBSD release
announcement.
I don't want to wait and add new versions to the Marketplace later, because the timeline is unpredictable — usually a couple hours but sometimes a day or more — so I'd like to be able to tell the Marketplace about the upcoming FreeBSD release and have them get everything ready but not update the website or send out email until I'm ready to send out our announcement (we usually allow a few days for mirrors and clouds to sync).
I don't know if or when I'm likely to get any of these, but I like to think that I convinced people that what I was asking for was at least somewhat sensible. Maybe between them and other Amazonians who will no doubt read this, I'll get at least a few of the things on my wishlist.
For the sake of transparency: In addition to giving me (and other AWS Heroes) free admission and travel to re:Invent, Amazon is sponsoring my FreeBSD work. About half of what they're paying for is EC2-specific stuff; the other half is FreeBSD release engineering. Without their support, a number of important features would not have landed in FreeBSD 14.2-RELEASE; thank you Amazon.