Here is a lightly edited compilation of explanations of the methodology used by Hugh Howey and Data Guy for their 10 (so far) Author Earnings reports.
I try to give a source for each extract, and will probably update this page as new explanations appear.
General
From the 50K report:
For techies out there who geek out on methodology, the spider works like this: It crawls through all the categories, sub-categories, and sub-sub-categories listed on Amazon, starting from the very top and working its way down. It scans each product page and parses the text straight from the source HTML. Along with title, author, price, star-rating, and publisher information, the spider also grabs the book's overall Amazon Kindle store sales ranking. This overall sales ranking is then used to slot each title into a single master list. Duplicate entries, from books appearing on multiple bestseller lists, get discarded.
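For readers who want to see the shape of that logic, here is a minimal sketch of the crawl/parse/dedupe/sort pipeline described above. It is not the actual AE spider: the ASIN handling and the rank-extraction pattern are illustrative assumptions, and Amazon's real page structure differs and changes over time.

```python
# Minimal sketch of the crawl/parse/dedupe/sort pipeline -- not the actual AE
# spider. The ASIN handling and the rank regex are illustrative assumptions.
import re
import requests
from bs4 import BeautifulSoup

def parse_product_page(html, asin):
    """Pull the fields the report needs from one product page."""
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    # Overall Kindle store rank (not the per-category rank); the pattern is illustrative.
    m = re.search(r"#([\d,]+)\s+(?:Paid\s+)?in Kindle Store", text)
    rank = int(m.group(1).replace(",", "")) if m else None
    # ...title, author, price, star rating, publisher, and "Sold By" would be parsed similarly.
    return {"asin": asin, "overall_rank": rank}

def build_master_list(product_urls):
    """Crawl product pages gathered from every (sub-)category bestseller list,
    discard duplicates, and slot the rest into a single rank-ordered master list."""
    seen, master = set(), []
    for url in product_urls:
        asin = url.rstrip("/").split("/")[-1]      # ASIN used as the dedup key (assumption)
        if asin in seen:                           # same book on multiple lists: keep one copy
            continue
        seen.add(asin)
        book = parse_product_page(requests.get(url).text, asin)
        if book["overall_rank"] is not None:
            master.append(book)
    return sorted(master, key=lambda b: b["overall_rank"])
```

The two points the sketch preserves are that duplicates from overlapping bestseller lists are discarded, and that the overall store rank (not the category rank) drives the master ordering.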
Our spider is looking at a snapshot of sales rankings for one particular day. Extrapolation is only useful for determining relative market share and theoretical earnings potential. Our conclusions assume that the proportion of self-published to traditionally published titles doesn’t change dramatically from day to day, and the similarity of datasets lends that assumption some support.
The preponderance of nonfiction in the February and later samples does not reflect market share. Rather, it reflects the many hundreds of detailed Amazon sub-sub-sub-category bestseller lists for nonfiction (Health, Fitness & Dieting > Alternative Medicine > Holistic, for example) that make lower-selling nonfiction more visible to the spider than equally low-selling fiction.
A few things make doing so a little challenging: the once-per-quarter frequency of our data capture and the high turnover of the bestseller lists and sublists.
(In our October report, we found that almost 80,000 of the 120,000 July bestsellers had since fallen off the lists to be replaced by 80,000 others.)
We are getting a very comprehensive look at Amazon sales every time, though — the data “holes” are mostly down where titles are selling fewer than a handful of copies each. With each dataset, we’re capturing:
– practically all of the top several hundred ranks
– 95% of the top 1,000
– 80% of the top 5,000
– 68% of the top 10,000
– 52% of the top 25,000
– 42% of the top 50,000
– 33% of the top 100,000
– 11% of the top 1,000,000
– some additional ones ranked in the 2,000,000-3,000,000 range (mostly from really specific nonfiction bestseller lists like “Renaissance Painter Biographies” or whatever.)
Ideally, I’d like to grab all 3 million-ish every single day instead…
But the comprehensiveness of our snapshots comes at a nontrivial technical cost. For the technically curious out there, the data collection for this last report used 40 enterprise-grade servers (with 8 high-speed CPUs each) to crawl Amazon's bestseller lists and product pages, pulling almost 600 gigabytes of HTML pages across the Internet, ripping them apart to extract the information we need, and storing it in a MySQL database. Each run takes a few hours, after which we shut the servers down before they burn a hole in our bank accounts.
Each report is thus a deep cross-sectional study of Amazon’s sales that day, but each is a single snapshot taken on a particular day. Their compositional consistency from quarter to quarter strongly suggests that we wouldn’t find much variation on the days in between, either. But perhaps we’ll try a longitudinal study in parallel at some point (or even better, someone else will) using a smaller set of titles.
For the May 2015 data set (which lists 200K ebooks), I launched the spider simultaneously on 120 servers, each with 8 CPUs and 16 GB of RAM. This Author Earnings data run took roughly an hour and a half, while running over a thousand separate webcrawler threads on those 120 servers. During that time, it read and extracted data from nearly a million Amazon.com product pages — print and audio books as well as ebooks — over a terabyte of data in all.
But the anonymized spreadsheet we publish is just the tip of the iceberg. Even so, it's an unwieldy 60MB or so in size — we may trim it back down to 120K titles in future reports, just to keep things manageable.
On the rank-to-sales conversion
For this report, Author Earnings threw out all of our previous assumptions. We built a brand new rank-to-sales conversion curve from the ground up. This time we based it on raw, Amazon-reported sales data: the precise daily sales figures for hundreds of individual books from many different authors, spanning a period of many months. Our raw sales data included titles ranked in Amazon's Overall Top 5 — titles whose KDP reports verified that they were each selling many thousands of copies a day — and it also included books ranked in the hundreds of thousands, whose KDP reports revealed they were selling less than a single copy a day. We combined that mass of hard sales data with a complete daily record of Amazon Kindle sales rankings for each of those books, pulled directly from individual Author Central graphs. We ended up with nearly a million distinct data points in total.
Why did we need so many data points? Because Amazon’s Overall Best Seller Rankings aren’t a simple calculation based on each book’s single-day sales — they also factor in time-decaying sales from previous days as well. To reverse-engineer Amazon’s ranking algorithms, the more raw sales and ranking data we used, the more accurate our results would get. So we fired up some powerful computers, fed them all that raw data, and let them crunch the numbers.
For our fellow geeks: We applied both old-school statistical curve-fitting approaches and more modern machine learning techniques, iterating our underlying numerical model until we zeroed in on the solution that yielded the best predictive accuracy. Taking advantage of a neat mathematical series-convergence trick (one whose applicability was no accident, because Amazon’s algorithms undoubtedly rely on it, too), we ended up with a brand new, simpler, more elegant, and far more accurate rank-to-sales conversion formula for Kindle ebooks.
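As a toy illustration of the curve-fitting half of that work — not the actual AE model, which also has to untangle Amazon's time-decayed ranking and uses nearly a million data points — one can fit a single power law to the old published rank-to-sales points listed further down this page:

```python
# Toy illustration only: fit a single power law, sales ~ a * rank**k, to the
# published points from the old AE rank-to-sales table shown further down.
# The real AE model is far more elaborate and data-driven.
import numpy as np

ranks = np.array([1, 5, 20, 35, 100, 200, 350, 500, 750,
                  1500, 3000, 5500, 10000, 50000, 100000], dtype=float)
sales = np.array([7000, 4000, 3000, 2000, 1000, 500, 250, 175, 120,
                  100, 70, 25, 15, 5, 1], dtype=float)

# Least-squares line in log-log space: log(sales) = log(a) + k*log(rank)
k, log_a = np.polyfit(np.log(ranks), np.log(sales), 1)
a = np.exp(log_a)

def sales_per_day(rank):
    """Estimated daily sales for a given overall Kindle store rank."""
    return a * rank ** k          # k comes out negative, as expected

print(f"sales/day ~ {a:.0f} * rank^{k:.2f}")
print("estimate at rank 2,000:", round(sales_per_day(2000)))
```

The real curve is not one clean power law — as noted below, it has kinks where bestseller-list visibility changes — which is part of why AE's fitted model is considerably more involved than this sketch.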
For the non-geeks: Our data-science awesomesauce now tastes even better.
Here’s what the new rank-to-sales curve looks like:
In retrospect, it’s striking how well AE’s old, crowdsourced rank-to-sales curve (in black) matches our new data-derived one. Graphically, the old AE curve ping-pongs back and forth between the new computed upper bound (shown in red), defined by the higher number of daily sales required to first “hit” a rank when spiking up from a much lower sales baseline, and the new computed lower bound (shown in blue), defined by the more modest number of daily sales required to steadily “hold” the same rank through consistent day-to-day sales.
(The old rank-to-sales table was:)
http://www.hughhowey.com/the-january-author-earnings-report/#comment-233671
| Sales Rank | Sales Per Day |
|---|---|
| 1 | 7,000 |
| 5 | 4,000 |
| 20 | 3,000 |
| 35 | 2,000 |
| 100 | 1,000 |
| 200 | 500 |
| 350 | 250 |
| 500 | 175 |
| 750 | 120 |
| 1,500 | 100 |
| 3,000 | 70 |
| 5,500 | 25 |
| 10,000 | 15 |
| 50,000 | 5 |
| 100,000 | 1 |
Mostly, it still follows: http://www.theresaragan.com/salesrankingchart.html with a few additional data points added (like the one at rank 100) to increase curve accuracy.
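For anyone wanting to apply a table like this programmatically, a one-line interpolation between the published points is enough (linear interpolation here, matching what the integration note further down says AE used between data points; the rank-2,000 query is just an example):

```python
# Estimate daily sales for an arbitrary overall rank by interpolating linearly
# between the published points of the old table above.
import numpy as np

RANKS = [1, 5, 20, 35, 100, 200, 350, 500, 750, 1500, 3000, 5500, 10000, 50000, 100000]
SALES = [7000, 4000, 3000, 2000, 1000, 500, 250, 175, 120, 100, 70, 25, 15, 5, 1]

print(np.interp(2000, RANKS, SALES))   # a book ranked 2,000 -> 90.0 sales/day under this table
```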
We’ve left it consistent since we started to avoid introducing yet another variable into the report-to-report comparisons.
Hugh clearly stated that these numbers were based on data gathered by numerous writers on their own books' ranks and corresponding sales. He included three different links. Numerous authors have corroborated these correlations.
The rank within a category or sub-category is irrelevant. Sales numbers are generated based on overall store rank.
Even if you don’t believe these correlations are accurate, they are applied uniformly to all books – both self-published and traditionally published. So no matter what you plug in, the relative sales will remain the same. If you think self-published authors aren’t making as much as the charts indicate, then that means the traditionally published authors aren’t making as much either.
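A quick hypothetical makes the point concrete: if one rank-to-sales model credited indie titles with 600,000 daily sales and Big-5 titles with 400,000, indie share would be 600,000 / 1,000,000 = 60%; halve every estimate (300,000 vs. 200,000) and the share is still 300,000 / 500,000 = 60%. Any error that scales all books equally cancels out of the ratios, though not, of course, out of the absolute dollar figures.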
Initial report (5th footnote)
Daily sales according to Amazon rank can be found in numerous places, including here, here, and here. Depending on the source, the model changes, but not enough to greatly affect the results. Keep in mind that the dollar figures and the exact sales are irrelevant to the ratio and percentages shown. Any change in those numbers impacts all books equally, so the picture of how authors are doing according to how they publish remains the same. These daily sales figures are adjustable in our spreadsheet, which contains our full data set and which we are offering at the low, low price of absolutely zilch.
Integration for missing books
But we know what the shape of the sales-to-rank curve is, and so we know what the “missing” books at ranks in between the ones we captured are selling. We then numerically integrate the whole curve to get a total daily sales number for all ebooks at all ranks. In other words, for each rank, whether or not we happened to capture that particular book in our data set, we add up its corresponding unit sales to compute Amazon’s total unit sales. Picture “shading in the area under the curve.”
While the books in the long tail below rank 100,000 are shown as having 0 daily sales in our spreadsheet, they actually do sell a book every few days in the 100,000-500,000 range, a book a week in the 500,000-1,000,000 range, etc. (We zeroed those out in the spreadsheet because we didn't want to get caught up explaining to the math-challenged how a book can sell a fraction of a copy a day. But we do include those fraction-sellers in the integrated total of 1,542,000 ebooks sold per day, of which 1,331,910 are ranked 1-100,000.)
The thing that makes [numerical integration] easy (and accurate) is the by-definition monotonically-decreasing nature of the sales-to-rank curve (it's a Pareto distribution, more or less, with a couple of kinks in it caused by different "list visibility" regimes).
So it just becomes a choice of what numerical-integration interpolation strategy you use. We used linear interpolation between sales-to-rank data points, to get an appropriate level of accuracy.
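Here is a minimal sketch of that "shade in the area under the curve" step, using the old published table above as a stand-in for AE's finer-grained curve, with linear interpolation between points as described. Because the stand-in curve differs from AE's, the total it prints will not match the 1,331,910 figure exactly, though it lands in the same ballpark.

```python
# Numerically integrate the rank-to-sales curve over ranks 1-100,000:
# estimate sales/day at every rank (captured or not) and sum them up.
import numpy as np

RANKS = [1, 5, 20, 35, 100, 200, 350, 500, 750, 1500, 3000, 5500, 10000, 50000, 100000]
SALES = [7000, 4000, 3000, 2000, 1000, 500, 250, 175, 120, 100, 70, 25, 15, 5, 1]

all_ranks = np.arange(1, 100_001)                       # every rank, missing books included
per_rank_sales = np.interp(all_ranks, RANKS, SALES)     # linear interpolation between points
print(f"estimated ebooks/day, ranks 1-100,000: {per_rank_sales.sum():,.0f}")
```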
Error magnitude didn't matter as much before, as our focus was mainly the relative performance of books published via each path. An error of that kind affected all sectors consistently and equally, so it didn't change those relative results.
However, now we’re looking at predicting the actual absolute number of ebook sales on Amazon.com, and the actual absolute size of the market as a whole. That requires more accuracy.
“Within 20%” is no longer good enough — we need a better handle on the accuracy. That’ll be our next focus.
The data, however, doesn't follow a strict Pareto or power-law distribution — it's close, but not exact. There are those rank regimes I mentioned where the slope steepens or flattens, most likely due to sharp differences in how much bestseller-list visibility books get in those ranges.
Below rank 100,000
http://authorearnings.com/report/january-2015-author-earnings-report/#comment-224419
To make the spreadsheet simpler, we left out the roughly 13% of Amazon’s sales that live down in the deep long tail below rank 100,000. But we do account for them when scaling up our daily sample to estimate total daily or annual sales.
Ranks 1 to 100,000 of the rank-to-sales curve add up to a total of 1,331,910 sales per day.
Ranks 100,001 to 3 million+ add up to roughly 210,000 more sales per day.
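As a quick sanity check, those two pieces add up to the integrated total quoted in the previous section: 1,331,910 + ~210,000 ≈ 1,542,000 ebooks sold per day on Amazon.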
The reason that we didn’t put them in the spreadsheet is we didn’t want to have to keep explaining to the less mathematically inclined folks how a book can sell a fraction of a copy in a day.
Categories
A frequent question in the comments is:
How were books classified as "Indie-Published," "Small/Medium Publisher," or "Uncategorized Single-Author Publisher"?
Here’s how:
1) The Big-5 Published books were easy to separate out, no matter what imprint they were published under, by checking the "Sold By" line in the Amazon Product Details, which listed one of: Random House, Penguin, Hachette, Macmillan, HarperCollins, or Simon & Schuster as seller.
2) If multiple author names used the same listed Publisher, and the book’s “Sold By” wasn’t one of the Big-5, it was considered a Small/Medium Publisher. A lot of these might indeed be Indie Publishers, but we wanted to be conservative and err on the side of understating–rather than overstating–Indie numbers.
3) If no Publisher at all was listed under Product Details, the book was considered Indie-Published.
4) If the full name of the author was included in the Publisher name, the book was considered Indie-Published.
5) The remaining books, whose publishers represented only a single author name, were initially grouped under Uncategorized Single-Author Publisher, and sorted by revenue. Then we rolled up our sleeves.
Going down the list one by one, we Googled the publisher names and author names. We were able to classify hundreds of them. Many were already known to us… for example: Broad Reach Publishing (Hugh), Laree Bailey Press (H.M. Ward), Reprobatio Inc. (Russell Blake), etc. We started from the biggest earners and went down, until the names became too obscure to find and we ran out of energy and time, and none of the remaining Uncategorized Single-Author Publishers individually accounted for a significant chunk of revenue.
So the vast majority of the remaining Uncategorized Single-Author Publishers are most likely “Indies in disguise.” But there are also a few examples of poor-selling imprints of small and medium traditional publishers in the mix (such as Baen), so again we didn’t want to overstate Indie market share by lumping them all in with the Indies.
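The decision rules above translate fairly directly into code. A minimal sketch follows; the field names, the "known indie" lookup set, and the exact rule ordering are illustrative rather than AE's actual implementation, and the manual Googling step obviously can't be automated this way.

```python
# Minimal sketch of the classification rules above. Field names, the known-indie
# set, and the exact ordering are illustrative, not AE's actual implementation.
BIG5_SELLERS = {"Random House", "Penguin", "Hachette", "Macmillan",
                "HarperCollins", "Simon & Schuster"}
KNOWN_INDIE_PUBLISHERS = {"Broad Reach Publishing", "Laree Bailey Press",
                          "Reprobatio Inc."}   # built up by hand, per the text

def classify(book, publisher_author_counts):
    """book: dict with 'sold_by', 'publisher', 'author'.
    publisher_author_counts: publisher name -> number of distinct author names seen."""
    sold_by, publisher, author = book["sold_by"], book["publisher"], book["author"]
    if sold_by in BIG5_SELLERS:                                      # rule 1
        return "Big-5 Published"
    if publisher and publisher_author_counts.get(publisher, 0) > 1:  # rule 2
        return "Small/Medium Publisher"
    if not publisher:                                                # rule 3
        return "Indie-Published"
    if author and author.lower() in publisher.lower():               # rule 4
        return "Indie-Published"
    if publisher in KNOWN_INDIE_PUBLISHERS:                          # rule 5, resolved by hand
        return "Indie-Published"
    return "Uncategorized Single-Author Publisher"                   # rule 5, unresolved
```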
Is there any way to quantify how much of the Small/Medium Publisher and Uncategorized Single-Author Publisher → Indie market-share shift can be attributed to re-classification of publishers as Indie?
The answer is very little – I just checked. Less than 0.1% of what was originally classified as Small/Medium Publisher income has been reclassified over the course of the last few reports. What you are seeing there is actual market-mix shift.
On the other hand, ~1.3% of what was “Uncategorized” income back in Feb 2014 report has since been definitively classified as indie, while another ~0.2% of it has since moved into Small/Medium Publisher income.
Kindle Unlimited
The methodology is explained in the October 2014 report
http://authorearnings.com/report/october-2014-author-earnings-report-2/
The amount paid per borrow is independent of price and depends instead on how much Amazon funds a shared pool. The rate per borrow has averaged $1.62 over the three months since KU launched. Each borrow appears to affect ebook ranking just as a sale does, so we have to take the borrow-to-sales rate into account for our earnings projection. As you will soon see, our data is robust enough that even wildly varying estimates for this rate do not appreciably affect our results. Before we get to our new baseline earnings report, let's look at what our final graph would look like with five different assumptions for the borrow rate.
The total share of earnings by publishing type shifts by only a few percent, even under wildly impossible assumptions about the borrow rate. In order to determine which of these charts to go with, we collected data from hundreds of authors and their individual titles, and their data showed an average borrow-to-sales rate close to 1:1. The 50% borrow/50% sales data will be used for the rest of the report, and it will provide a baseline for our future reports.
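A quick sketch of how a borrow-rate assumption feeds into the per-unit earnings estimate; the $2.99 price and 70% royalty are illustrative placeholders, while $1.62 is the average per-borrow payout quoted above.

```python
# How a borrow-rate assumption feeds into per-unit author earnings.
# The $2.99 price and 70% royalty are illustrative; $1.62 is the average
# per-borrow payout quoted above for KU 1.0's first three months.
PER_BORROW_PAYOUT = 1.62

def earnings_per_ranked_unit(price, royalty_rate, borrow_share):
    """Blend sale royalties and borrow payouts, since both move the sales rank.
    borrow_share = fraction of rank-moving units that are borrows
    (0.5 = the 1:1 borrow/sale baseline AE adopted)."""
    per_sale = price * royalty_rate
    return (1 - borrow_share) * per_sale + borrow_share * PER_BORROW_PAYOUT

for share in (0.0, 0.25, 0.5, 0.75):
    print(f"borrow share {share:.0%}: "
          f"${earnings_per_ranked_unit(2.99, 0.70, share):.2f} per rank-moving unit")
```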
(Note from TheSFReader: the amount paid per borrow is updated in each subsequent report based on the most recent per-borrow rate.)
Update for KU 2.0 in the September 2015 report
from http://authorearnings.com/report/september-2015-author-earnings-report/#comment-296014
Kindle Unlimited does make things a little trickier. But Amazon also provides us a nice monthly mechanism for calibrating our model: the overall KU payout size and the number of KENP read. With the KU 2.0 switch to compensation for pages read, the ghost-borrow issue is no longer a source of error. Our model for KU compensation now factors in the page-length of each title, the per-page KU 2.0 payout, and an average-%-read factor that lets us exactly match Amazon's announced $11.5 million / 2-billion-KENP-read numbers from July.
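As a worked example of that calibration: the announced July figures imply a per-page rate of $11.5M / 2B KENP ≈ $0.00575, so a title's KU earnings per borrow come out to roughly KENP length × average %-read × $0.00575. The 300-KENP length and 60% read-through below are purely illustrative.

```python
# Worked example of the KU 2.0 model described above. The per-page rate follows
# directly from Amazon's announced July figures; the page length and the
# average-%-read value here are purely illustrative placeholders.
KU_POOL = 11_500_000                  # announced July 2015 KU payout, in dollars
KENP_READ = 2_000_000_000             # announced KENP read that month
RATE_PER_PAGE = KU_POOL / KENP_READ   # ~$0.00575 per KENP page

def ku_earnings_per_borrow(kenp_length, avg_pct_read):
    return kenp_length * avg_pct_read * RATE_PER_PAGE

print(f"rate per page: ${RATE_PER_PAGE:.5f}")
print(f"example: 300-KENP title, 60% read-through -> "
      f"${ku_earnings_per_borrow(300, 0.60):.2f} per borrow")
```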
(Updated on 05/06/2015 with additional data related to the 200K sample)
(Updated on 05/07/2015 with clarifications on the impact of KU borrows, plus additional data on sales for books ranked 100K to 3M+)
(Updated on 09/14/2015 with the KU 2.0-specific methodology)
(Updated on 02/10/2016 with the reverse-engineered rank-to-sales conversion)