Interesting tarsnap statistics

I admit it: I'm a numbers junkie. I like taking streams of numbers and looking for patterns, and I like trying to figure out the reasons behind those patterns. Running my tarsnap online backup service has provided me with a great source of numbers: I keep extensive logs, and there are enough tarsnap users now that the randomness of individual users is starting to get washed away. In the interest of science, then -- or if not science, at the very least curiosity -- here are some statistics I've gathered.

Average data stored: Ignoring inactive accounts (people who sign up for tarsnap but never use it), the amount of data tarsnap users have stored closely follows a lognormal distribution; the median is roughly 1 GB, while the mean is roughly 8 GB.
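
To make the gap between median and mean concrete, here is a minimal sketch that recovers the lognormal parameters implied by those two figures. It treats the rounded values (1 GB and 8 GB) as exact, which they are not, and the function names are made up for illustration. It relies on the standard identities median = exp(mu) and mean = exp(mu + sigma^2/2).

```python
import math
from statistics import NormalDist

def lognormal_params(median, mean):
    """Return (mu, sigma) of a lognormal with the given median and mean."""
    mu = math.log(median)                            # median = exp(mu)
    sigma = math.sqrt(2.0 * math.log(mean / median)) # mean/median = exp(sigma^2 / 2)
    return mu, sigma

def fraction_below(x, mu, sigma):
    """Share of the distribution lying below x (x > 0)."""
    return NormalDist(mu, sigma).cdf(math.log(x))

# Rounded figures from the post, in GB; treated as exact for illustration only.
mu, sigma = lognormal_params(median=1.0, mean=8.0)
share_below_mean = fraction_below(8.0, mu, sigma)

print(f"mu = {mu:.3f}, sigma = {sigma:.3f}")
print(f"share of users storing less than the mean: {share_below_mean:.1%}")
```

Under these assumptions sigma comes out a little above 2, and well over four fifths of users would be storing less than the mean: a small number of large accounts pull the average up.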

Machines per user: Users can register any number of machines as belonging to the same account; the service treats them entirely independently aside from the financial/accounting aspects. Out of the set of active tarsnap users, 57% have just one machine registered; 22% have two machines registered; 9% have three machines registered; 7% have four machines registered; and 5% have five or more machines registered.

Data downloaded: Backups are often described as a "write once, read never" storage problem -- it's important that the data be available if and when needed, but the hope is always that you won't ever need it. So far, on average 3% of data stored on tarsnap has been downloaded each month.

Frequency of archives: Approximately 30% of systems running tarsnap have created an archive in the past 24 hours. Subjectively (i.e. I'm too lazy to write a script to figure out exact statistics for this, but I've noticed this by eye) it looks like most of these systems are creating backups from cron jobs, since they create archives at the same time each day.

Archive creation time of day: Archive creation is spread quite evenly around the clock; the only statistically significant peaks are at 06:00-06:59, 10:00-10:59, and 13:00-13:59. Cron jobs running at 6AM in UTC, EDT, and PDT time zones respectively, perhaps?
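
The time zone guess is easy to check. The sketch below assumes the peak hours quoted above are in UTC (the post does not say so explicitly) and uses fixed offsets of UTC-4 for EDT and UTC-7 for PDT; the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed summer-time offsets; an assumption that holds for the date of this post.
ZONES = {
    "UTC": timezone.utc,
    "EDT": timezone(timedelta(hours=-4)),
    "PDT": timezone(timedelta(hours=-7)),
}

def utc_hour_of_local_6am(tz):
    """Hour of day in UTC at which a 06:00 local-time cron job would fire."""
    local = datetime(2009, 8, 21, 6, 0, tzinfo=tz)
    return local.astimezone(timezone.utc).hour

peaks = {name: utc_hour_of_local_6am(tz) for name, tz in ZONES.items()}
print(peaks)
```

The three results are hours 6, 10, and 13, which match the three observed peaks.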

Archive creation time of hour: In contrast, archive creation is not evenly spread around the hour: There is a large spike in traffic at :00, and smaller spikes at :10, :15, :25, :30, and :50. This is a very clear sign of cron jobs.
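
For anyone who wants to look for the same pattern in their own logs, a minute-of-hour histogram is only a few lines. This is a sketch, not the script used for the numbers above: the function names, the spike threshold, and the sample timestamps are all invented for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

def minute_histogram(timestamps):
    """Count events per minute-of-hour; returns a list of 60 counts."""
    counts = Counter(ts.minute for ts in timestamps)
    return [counts.get(m, 0) for m in range(60)]

def spikes(hist, factor=3.0):
    """Minutes whose count exceeds `factor` times the average count."""
    baseline = sum(hist) / len(hist)
    return [m for m, c in enumerate(hist) if c > factor * baseline]

# Synthetic sample: one event every minute, plus extra events at :00 and :30.
base = datetime(2009, 8, 21, 12, 0)
sample = [base + timedelta(minutes=m) for m in range(60)]
sample += [base] * 20
sample += [base + timedelta(minutes=30)] * 10

hist = minute_histogram(sample)
print(spikes(hist))
```

On the synthetic sample this flags minutes 0 and 30. A fixed multiple of the average is a crude threshold; a proper significance test would be needed to back a claim like "statistically significant".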

Unearned revenue: Tarsnap works by having people prepay for their usage (with so many people storing under 1 GB and paying under $0.30 each month, charging credit cards every month would be infeasible). The money people have sitting in their tarsnap accounts waiting to be spent is defined by accountants as "unearned revenue". At the present time, tarsnap has roughly 6 months of unearned revenue -- that is, on average, tarsnap users have enough money in their accounts to pay for the next 6 months of their usage. Naturally, this number varies dramatically from account to account, and is negatively correlated with the amount of data stored -- if you only have 1 MB of data stored, $5 will last you over a thousand years. (For the record: The money tarsnap users have deposited into their accounts but not spent yet is sitting and waiting safely -- it's not my money yet, so I'm not going to do anything crazy with it.)
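
The "over a thousand years" figure follows directly from the storage price of $0.30 per GB-month. A quick check, assuming 1 GB = 1000 MB and ignoring bandwidth charges entirely (the function name is hypothetical):

```python
PRICE_PER_GB_MONTH = 0.30  # USD, storage only; bandwidth costs are ignored here

def months_of_storage(balance_usd, stored_gb):
    """How many months a prepaid balance covers at a constant stored size."""
    return balance_usd / (stored_gb * PRICE_PER_GB_MONTH)

one_mb_in_gb = 0.001  # assumes decimal units: 1 GB = 1000 MB
years = months_of_storage(5.00, one_mb_in_gb) / 12
median_user_months = months_of_storage(5.00, 1.0)

print(f"$5 at 1 MB stored: about {years:.0f} years")
print(f"$5 at 1 GB stored: about {median_user_months:.1f} months")
```

By the same arithmetic, $5 covers roughly a year and a half of storage for someone at the 1 GB median.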

Payment sizes: Tarsnap users can deposit money into their accounts whenever they like, in any amount so long as it's $5 or more (allowing smaller payments would result in too much being eaten up by processing fees). Of the payments received to date, 26% have been $5; 24% have been $10; 15% have been $20; 11% have been $50; 5% have been $100; 5% have been $15; 4% have been $30; 3% have been $25; and 7% have been other sizes. The popularity of 5/10/20/50/100 is unsurprising (give people freedom to pick numbers, and they'll usually pick round numbers), but I'm not sure why $15 and $30 are so popular (even at 5% and 4% of the payments, their popularity is statistically significant). Perhaps tarsnap's pricing of $0.30 per GB of bandwidth and $0.30 per GB-month of storage is responsible for making people "think three"?

Is there anything else interesting I can pull out of my log files? Submit questions via the comments below. I will not publish information from which tarsnap's revenue or profits can be derived (since tarsnap's profits are my income, I consider that to be personally private), nor will I publish information which could be tied to individual tarsnap users (e.g., I will not answer questions like "what is the most data any one user has stored"); but aside from those limitations, anything is fair game.

Posted at 2009-08-21 22:20