Here is a lightly edited compilation of explanations of the methodology used by Hugh Howey and Data Guy for their 10 (so far) Author Earnings reports.
I try to give a source for each extract, and will probably update this page if new explanations arrive.
General
From The 50K report
For techies out there who geek out on methodology, the spider works like this: It crawls through all the categories, sub-categories, and sub-sub-categories listed on Amazon, starting from the very top and working its way down. It scans each product page and parses the text straight from the source html. Along with title, author, price, star-rating, and publisher information, the spider also grabs the book’s overall Amazon Kindle store sales ranking. This overall sales ranking is then used to slot each title into a single master list. Duplicate entries, from books appearing on multiple bestseller lists, get discarded.
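The merge-and-deduplicate step described above can be sketched roughly as follows. This is only an illustration, not the actual spider: the `Listing` fields, the use of an `asin` identifier as the de-duplication key, and the sample books are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    asin: str          # hypothetical unique product ID, used here to spot duplicates
    title: str
    overall_rank: int  # overall Kindle store sales rank parsed from the product page


def build_master_list(category_lists):
    """Merge per-category bestseller lists into one list ordered by overall rank."""
    seen = {}
    for listings in category_lists:
        for item in listings:
            # A book that appears on several bestseller lists is kept only once.
            seen.setdefault(item.asin, item)
    return sorted(seen.values(), key=lambda item: item.overall_rank)


# Made-up sample data: "Book B" appears on two category lists.
thrillers = [Listing("B001", "Book A", 12), Listing("B002", "Book B", 340)]
mysteries = [Listing("B002", "Book B", 340), Listing("B003", "Book C", 95)]

master = build_master_list([thrillers, mysteries])
print([item.title for item in master])
```

Keying on a product identifier rather than the title is a design choice in this sketch, so that two different books sharing a title would not be collapsed into one entry.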
(In our October report, we found that almost 80,000 of the 120,000 July bestsellers had since fallen off the lists to be replaced by 80,000 others.)
– 95% of the top 1,000
– 80% of the top 5,000
– 68% of the top 10,000
– 52% of the top 25,000
– 42% of the top 50,000
– 33% of the top 100,000
– 11% of the top 1,000,000
– some additional ones ranked in the 2,000,000-3,000,000 range (mostly from really specific nonfiction bestseller lists like “Renaissance Painter Biographies” or whatever.)
For the May 2015 data set (which lists 200K ebooks), I launched the spider simultaneously on 120 servers, each with 8 CPUs and 16 GB of RAM. This Author Earnings data run took roughly an hour and a half, while running over a thousand separate webcrawler threads on those 120 servers. During that time, it read and extracted data from nearly a million Amazon.com product pages — print and audio books as well as ebooks — over a terabyte of data in all.
But the anonymized spreadsheet we publish is just the tip of the iceberg. Even so, it’s an unwieldy 60MB or so in size — we may trim it back down to 120K in future reports, just to keep things manageable.
On rank-to-sales conversion
|Sales Rank||Sales Per Day|
Mostly, it still follows: http://www.theresaragan.com/salesrankingchart.html with a few additional data points added (like the one at rank 100) to increase curve accuracy.
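One common way to turn a handful of (rank, sales per day) data points into a continuous curve is to interpolate between them on a log-log scale. The sketch below assumes that approach; the reports do not say which interpolation is used, and the calibration points are placeholders invented for the example, not the figures from the chart.

```python
import math
from bisect import bisect_right

# PLACEHOLDER calibration points (rank, sales per day). These values are
# illustrative only and are NOT the ones used by Author Earnings.
CALIBRATION = [
    (1, 5000.0),
    (100, 500.0),
    (10_000, 10.0),
    (100_000, 1.0),
    (1_000_000, 0.05),
]


def sales_per_day(rank, points=CALIBRATION):
    """Estimate daily sales for a rank by log-log interpolation between points."""
    ranks = [r for r, _ in points]
    # Outside the calibrated range, clamp to the nearest known point.
    if rank <= ranks[0]:
        return points[0][1]
    if rank >= ranks[-1]:
        return points[-1][1]
    i = bisect_right(ranks, rank)
    (r0, s0), (r1, s1) = points[i - 1], points[i]
    # Position of this rank between the two neighbours, measured in log space.
    t = (math.log(rank) - math.log(r0)) / (math.log(r1) - math.log(r0))
    return math.exp(math.log(s0) + t * (math.log(s1) - math.log(s0)))


print(round(sales_per_day(1000), 2))
```

Adding an extra data point, such as the one at rank 100, simply means adding another entry to the list, which tightens the curve in that region.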
Initial report (5th footnote)
Daily sales according to Amazon rank can be found in numerous places, including here, here, and here. Depending on the source, the model changes, but not enough to greatly affect the results. Keep in mind that the dollar figures and the exact sales are irrelevant to the ratio and percentages shown. Any change in those numbers impacts all books equally, so the picture of how authors are doing according to how they publish remains the same. These daily sales figures are adjustable in our spreadsheet, which contains our full data set and which we are offering at the low, low price of absolutely zilch.
Integration for missing books
Ranks 100,000 and beyond
The reason that we didn’t put them in the spreadsheet is we didn’t want to have to keep explaining to the less mathematically inclined folks how a book can sell a fraction of a copy in a day.
The amount paid per borrow is independent of price and depends instead on how much Amazon funds a shared pool. The rate per borrow has averaged $1.62 over the three months since KU launched. Each borrow appears to affect ebook ranking just as a sale does, so we have to take the borrow-to-sales rate into account for our earnings projection. As you will soon see, our data is robust enough that even wildly varying estimates for this rate do not appreciably affect our results. Before we get to our new baseline earnings report, let’s look at what our final graph would look like with five different assumptions for the borrow rate.
The difference in the total share of earnings by publishing type is only affected by a few percent even with wildly impossible assumptions about the borrow rate. In order to determine which of these charts to go with, we collected data from hundreds of authors and their individual titles, and these averages showed an average borrow/sales rate close to 1:1. The 50% borrow/50% sales data will be used for the rest of the report, and it will provide a baseline for our future reports.
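To make the borrow-rate adjustment concrete, here is a minimal sketch of how a title's estimated daily earnings could be blended from sales and borrows. It assumes the title is enrolled in Kindle Unlimited. The `royalty_rate` of 70%, the price, the unit count, and the borrow shares tried in the loop are illustrative assumptions, not figures from the report; only the $1.62 per-borrow payout comes from the text above.

```python
BORROW_PAYOUT = 1.62  # average payout per borrow quoted above, in USD


def daily_earnings(units_per_day, price, borrow_share, royalty_rate=0.70):
    """Estimate author earnings per day when some rank-driving units are borrows.

    royalty_rate is an assumption made for this illustration only.
    """
    if not 0.0 <= borrow_share <= 1.0:
        raise ValueError("borrow_share must be between 0 and 1")
    borrows = units_per_day * borrow_share
    sales = units_per_day - borrows
    # Sales pay a share of the list price; borrows pay a flat amount.
    return sales * price * royalty_rate + borrows * BORROW_PAYOUT


# Hypothetical title: 100 rank-driving units per day at $3.99.
for share in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"{share:.0%} borrows: ${daily_earnings(100, 3.99, share):.2f}/day")
```

A `borrow_share` of 0.5 corresponds to the 1:1 borrow/sales baseline chosen in the report.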
(Note from TheSFReader: the amount paid per borrow is updated in each subsequent report to reflect the most recent rate.)
Update for KU 2.0 in the September 2015 report
from http://authorearnings.com/report/september-2015-author-earnings-report/#comment-296014
Kindle Unlimited does make things a little trickier. But Amazon also provides us a nice monthly mechanism for calibrating our model: the overall KU payout size and the number of KENP read. With the KU 2.0 switch to compensation for pages read, the ghost-borrow issue is no longer a source of error. Our model for KU compensation now factors in the page-length of each title, the per-page KU 2.0 payout, and an average-%-read factor that lets us exactly match Amazon’s announced $11.5 million / 2-billion-KENP-read numbers from July.
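The three factors named above can be combined in a short sketch. The per-page payout follows directly from the July numbers quoted in the text ($11.5 million over 2 billion KENP read). Everything else is an assumption for illustration: the function name, the idea of multiplying by a borrow count, and the sample values for page length and average fraction read. The report does not publish its actual average-%-read factor here.

```python
KU_FUND_USD = 11_500_000      # July payout quoted above
KENP_READ = 2_000_000_000     # July KENP read quoted above

# Per-page payout implied by the two announced numbers: $0.00575 per page.
PER_PAGE_PAYOUT = KU_FUND_USD / KENP_READ


def ku2_payout(borrows, kenp_pages, avg_fraction_read):
    """Estimate KU 2.0 earnings for a title.

    borrows and avg_fraction_read are hypothetical inputs for this sketch.
    """
    if not 0.0 <= avg_fraction_read <= 1.0:
        raise ValueError("avg_fraction_read must be between 0 and 1")
    pages_read = borrows * kenp_pages * avg_fraction_read
    return pages_read * PER_PAGE_PAYOUT


# Hypothetical title: 10 borrows of a 300-page book, 60% read on average.
print(f"${ku2_payout(10, 300, 0.6):.2f}")
```

Because the per-page rate is derived from the fund size and total pages read, it can be recalculated each month from Amazon's announced figures.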
(Updated on 05/06/2015 with additional data related to the 200K sample)
(Updated on 05/07/2015 with clarifications on the impact of KU borrows, plus additional data on sales of books ranked 100K to 3M+)
(Updated on 09/14/2015 with the KU 2.0-specific methodology)
(Updated on 02/10/2016 with the reverse-engineered rank-to-sales conversion)