Wednesday, February 4, 2015

An "Author Earnings" Methodology primer

(updated in September 2015)

Here is a lightly edited compilation of explanations of the methodology used by Hugh Howey and Data Guy for their 10 (so far) Author Earnings reports.

I try to give a source for each extract, and will probably update this post if new explanations arrive.


From The 50K report

For techies out there who geek out on methodology, the spider works like this: It crawls through all the categories, sub-categories, and sub-sub-categories listed on Amazon, starting from the very top and working its way down. It scans each product page and parses the text straight from the source html. Along with title, author, price, star-rating, and publisher information, the spider also grabs the book’s overall Amazon Kindle store sales ranking. This overall sales ranking is then used to slot each title into a single master list. Duplicate entries, from books appearing on multiple bestseller lists, get discarded.
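As a rough illustration of the "single master list" step, here is a minimal Python sketch. It is not the Author Earnings code: the `Listing` fields and the use of an `asin` identifier as the duplicate key are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    asin: str          # hypothetical unique product identifier
    title: str
    overall_rank: int  # overall Kindle store sales rank


def build_master_list(scraped):
    """Merge listings scraped from many bestseller lists into one rank-ordered list."""
    first_seen = {}
    for item in scraped:
        # The same book shows up on several category lists; keep the first copy only.
        if item.asin not in first_seen:
            first_seen[item.asin] = item
    # Slot every remaining title by its overall store rank (best rank first).
    return sorted(first_seen.values(), key=lambda book: book.overall_rank)


scraped = [
    Listing("B001", "Book A", 50),
    Listing("B002", "Book B", 3),
    Listing("B001", "Book A", 50),  # duplicate from a second bestseller list
]
master = build_master_list(scraped)
```

Keying on a product identifier rather than on the title avoids merging two different books that happen to share a name.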
Our spider is looking at a snapshot of sales rankings for one particular day. Extrapolation is only useful for determining relative market share and theoretical earnings potential. Our conclusions assume that the proportion of self-published to traditionally published titles doesn’t change dramatically from day to day, and the similarity of datasets lends that assumption some support.
The preponderance of nonfiction in the February and later samples does not reflect market share. Rather, it reflects the many hundreds of detailed Amazon sub-sub-sub-category bestseller lists for nonfiction (Health, Fitness & Dieting > Alternative Medicine > Holistic, for example), which make lower-selling nonfiction more visible to the spider than equally low-selling fiction.
A few things make doing so a little challenging: the once-per-quarter frequency of our data capture and the high turnover of the bestseller lists and sublists.
(In our October report, we found that almost 80,000 of the 120,000 July bestsellers had since fallen off the lists to be replaced by 80,000 others.)
We are getting a very comprehensive look at Amazon sales every time, though — the data “holes” are mostly down where titles are selling fewer than a handful of copies each. With each dataset, we’re capturing:
– practically all of the top several hundred ranks
– 95% of the top 1,000
– 80% of the top 5,000
– 68% of the top 10,000
– 52% of the top 25,000
– 42% of the top 50,000
– 33% of the top 100,000
– 11% of the top 1,000,000
– some additional ones ranked in the 2,000,000-3,000,000 range (mostly from really specific nonfiction bestseller lists like “Renaissance Painter Biographies” or whatever.)
Ideally, I’d like to grab all 3 million-ish every single day instead… :)
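The coverage percentages above are the kind of figure one could compute from the set of captured ranks. A minimal sketch, assuming the captured overall ranks are available as a plain list of integers (the function name and sample data are made up for the example):

```python
def coverage(captured_ranks, top_n):
    """Fraction of the ranks 1..top_n that appear in the captured sample."""
    unique_ranks = set(captured_ranks)  # ignore duplicate captures of the same rank
    hits = sum(1 for rank in unique_ranks if 1 <= rank <= top_n)
    return hits / top_n


# Made-up sample: ranks 1, 2 and 4 were captured out of the top 5.
sample = [1, 2, 4, 4, 9]
share_of_top_5 = coverage(sample, 5)
```

With real data, calling this for 1,000, 5,000, 10,000 and so on would give a table like the one above.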
But the comprehensiveness of our snapshots comes at a nontrivial technical cost. For the technically curious out there, the data collection for this last report used 40 enterprise-grade servers (with 8 high-speed CPUs each) to crawl Amazon’s best seller lists and product pages, sucking almost 600 Gigabytes of HTML webpages across the Internet and ripping their HTML apart to extract the information we need and store it into a MySQL database. Each run takes a few hours, after which we shut the servers down before they burn a hole in our bank accounts.
Each report is thus a deep cross-sectional study of Amazon’s sales that day, but each is a single snapshot taken on a particular day. Their compositional consistency from quarter to quarter strongly suggests that we wouldn’t find much variation on the days in between, either. But perhaps we’ll try a longitudinal study in parallel at some point (or even better, someone else will) using a smaller set of titles.

For the May 2015 data set (which lists 200K ebooks), I launched the spider simultaneously on 120 servers, each with 8 CPUs and 16 GB of RAM. This Author Earnings data run took roughly an hour and a half, while running over a thousand separate webcrawler threads on those 120 servers. During that time, it read and extracted data from nearly a million product pages — print and audio books as well as ebooks — over a terabyte of data in all.

But the anonymized spreadsheet we publish is just the tip of the iceberg. Even so, it’s an unwieldy 60MB or so in size — we may trim it back down to 120K in future reports, just to keep things manageable.

On Rank to Sales ranking

For this report, Author Earnings threw out all of our previous assumptions. We built a brand new rank-to-sales conversion curve from the ground up. This time we based it on raw, Amazon-reported sales data on the precise daily sales figures for hundreds of individual books from many different authors, spanning a period of many months. Our raw sales data included titles ranked in Amazon’s Overall Top 5 — titles whose KDP reports verified that they were each selling many thousands of copies a day — and it also included books ranked in the hundreds of thousands — whose KDP reports revealed were selling less than a single copy a day. We combined that mass of hard sales data with a complete daily record of Amazon Kindle sales rankings for each of those books, pulled directly from individual AuthorCentral graphs. We ended up with nearly a million distinct data points in total.
Why did we need so many data points? Because Amazon’s Overall Best Seller Rankings aren’t a simple calculation based on each book’s single-day sales — they also factor in time-decaying sales from previous days as well. To reverse-engineer Amazon’s ranking algorithms, the more raw sales and ranking data we used, the more accurate our results would get. So we fired up some powerful computers, fed them all that raw data, and let them crunch the numbers.
For our fellow geeks: We applied both old-school statistical curve-fitting approaches and more modern machine learning techniques, iterating our underlying numerical model until we zeroed in on the solution that yielded the best predictive accuracy. Taking advantage of a neat mathematical series-convergence trick (one whose applicability was no accident, because Amazon’s algorithms undoubtedly rely on it, too), we ended up with a brand new, simpler, more elegant, and far more accurate rank-to-sales conversion formula for Kindle ebooks.
For the non-geeks: Our data-science awesomesauce now tastes even better.
Here’s what the new rank-to-sales curve looks like:
[Figure: the new rank-to-sales curve]
In retrospect, it’s striking how well AE’s old, crowdsourced rank-to-sales curve (in black) matches our new data-derived one. Graphically, the old AE curve ping-pongs back and forth between the new computed upper bound (shown in red), defined by the higher number of daily sales required to first “hit” a rank when spiking up from a much lower sales baseline, and the new computed lower bound (shown in blue), defined by the more modest number of daily sales required to steadily “hold” the same rank through consistent day-to-day sales.
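The report does not publish the formula, so the following is only a toy model of the "hit" versus "hold" distinction. It assumes, purely for illustration, that the ranking score is an exponentially decayed sum of daily sales; the `decay` value of 0.5 is hypothetical. Under that assumption the geometric series converges, which gives a closed form for the steady sales needed to hold a score.

```python
def ranking_score(daily_sales, decay=0.5):
    """Toy score: today's sales plus decayed sales from earlier days (oldest first)."""
    score = 0.0
    for sales in daily_sales:
        score = sales + decay * score
    return score


def sales_to_hold(score, decay=0.5):
    """Steady daily sales s that sustain `score`: s + d*s + d^2*s + ... = s / (1 - d)."""
    return score * (1 - decay)


def sales_to_hit(score, baseline, decay=0.5):
    """One-day sales needed to reach `score` after a long run at `baseline` sales per day."""
    carried_over = decay * baseline / (1 - decay)  # decayed history from the baseline days
    return score - carried_over


steady = ranking_score([100] * 60)      # approaches 100 / (1 - 0.5) = 200
hold = sales_to_hold(200)               # 100 copies a day keeps the score at 200
hit = sales_to_hit(200, baseline=0)     # 200 copies in one day, starting from nothing
```

In this toy model a book spiking from a low baseline needs more single-day sales than a steady seller at the same score, which is the same direction as the red (hit) and blue (hold) bounds described above.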
(The old rank-to-sales table was:)
Sales Rank | Sales Per Day

Mostly, it still follows: with a few additional data points added (like the one at rank 100) to increase curve accuracy.
We’ve left it consistent since we started to avoid introducing yet another variable into the report-to-report comparisons.
Hugh clearly stated that these numbers were based on data gathered by numerous writers of their own books and corresponding rank/sales numbers. He included three different links. Numerous authors have corroborated these correlations.
The rank within a category or sub-category is irrelevant. Sales numbers are generated based on overall store rank.
Even if you don’t believe these correlations are accurate, they are applied uniformly to all books – both self-published and traditionally published. So no matter what you plug in, the relative sales will remain the same. If you think self-published authors aren’t making as much as the charts indicate, then that means the traditionally published authors aren’t making as much either.
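A small sketch of that argument, using made-up books and a hypothetical rank-to-sales function (none of these numbers come from the reports). It shows that rescaling the whole curve by a constant factor leaves each publishing path's share unchanged. Note that this is exact only for a uniform rescaling; a change in the shape of the curve could shift the shares somewhat.

```python
def market_share(books, sales_for_rank):
    """Share of daily unit sales per publishing path, given a rank-to-sales function."""
    totals = {}
    for path, rank in books:
        totals[path] = totals.get(path, 0.0) + sales_for_rank(rank)
    grand_total = sum(totals.values())
    return {path: units / grand_total for path, units in totals.items()}


# Made-up sample of (publishing path, overall rank) pairs.
books = [("indie", 10), ("indie", 500), ("big5", 50), ("big5", 2000)]


def curve(rank):
    return 10000 / rank        # hypothetical rank-to-sales curve


def halved(rank):
    return 0.5 * curve(rank)   # same curve, every estimate cut in half


shares_original = market_share(books, curve)
shares_halved = market_share(books, halved)
```

Both calls return the same shares, because the factor 0.5 cancels between each path's total and the grand total.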

Initial report (5th footnote)

Daily sales according to Amazon rank can be found in numerous places, including here, here, and here. Depending on the source, the model changes, but not enough to greatly affect the results. Keep in mind that the dollar figures and the exact sales are irrelevant to the ratio and percentages shown. Any change in those numbers impacts all books equally, so the picture of how authors are doing according to how they publish remains the same. These daily sales figures are adjustable in our spreadsheet, which contains our full data set and which we are offering at the low, low price of absolutely zilch.

Integration for missing books

But we know what the shape of the sales-to-rank curve is, and so we know what the “missing” books at ranks in between the ones we captured are selling. We then numerically integrate the whole curve to get a total daily sales number for all ebooks at all ranks. In other words, for each rank, whether or not we happened to capture that particular book in our data set, we add up its corresponding unit sales to compute Amazon’s total unit sales. Picture “shading in the area under the curve.”
While the books in the long tail below rank 100,000 are shown as having 0 daily sales in our spreadsheet, they actually do sell a book every few days in the 100,000-500,000 range, a book a week in the 500,000-1,000,000 range, etc. (We zeroed those out in the spreadsheet because we didn’t want to get caught up explaining to the math-challenged how a book can sell a fraction of a copy a day. ;) But we do include those fraction-sellers in the integrated total of 1,542,000 total ebooks sold per day (of which 1,331,910 are ranked 1-100,000).
The thing that makes [numerical integration] easy (and accurate) is the by-definition monotonically-decreasing nature of the sales-to-rank curve (it’s a Pareto distribution, more or less, with a couple of kinks in it caused by different “list visibility” regimes).
So it just becomes a choice of what numerical-integration interpolation strategy you use. We used linear interpolation between sales-to-rank data points, to get an appropriate level of accuracy.
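"Shading in the area under the curve" with linear interpolation can be sketched as follows. The calibration points here are invented so the sums are easy to check by hand; they are not the Author Earnings data points, and the function names are mine.

```python
def interpolate_sales(rank, points):
    """Linearly interpolate daily sales at `rank` from sorted (rank, sales) points."""
    for (r0, s0), (r1, s1) in zip(points, points[1:]):
        if r0 <= rank <= r1:
            return s0 + (s1 - s0) * (rank - r0) / (r1 - r0)
    raise ValueError("rank outside the calibrated range")


def total_daily_sales(points):
    """Add up the interpolated sales of every rank, captured in the sample or not."""
    first_rank, last_rank = points[0][0], points[-1][0]
    return sum(interpolate_sales(rank, points) for rank in range(first_rank, last_rank + 1))


# Invented calibration points: (overall rank, copies sold per day).
points = [(1, 100.0), (5, 20.0), (10, 10.0)]
total = total_daily_sales(points)  # 100+80+60+40+20 + 18+16+14+12+10 = 370
```

Because the curve only ever decreases with rank, each interpolated value is bracketed by the two known points on either side of it, which keeps the interpolation error bounded.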
Error magnitude didn’t matter as much before, as our focus was mainly the relative performance of books published via each path. Therefore, an error affected all sectors consistently and equally and didn’t change those relative results.
However, now we’re looking at predicting the actual absolute number of ebook sales on Amazon, and the actual absolute size of the market as a whole. That requires more accuracy.
“Within 20%” is no longer good enough — we need a better handle on the accuracy. That’ll be our next focus.
The data, however, doesn’t follow a strict Pareto or power-law distribution; it’s close, but not exact. There are those rank regimes I mentioned where the slope steepens or flattens, most likely due to sharp differences in how much bestseller list visibility books get in those ranges.

Ranks 100,000 and below

To make the spreadsheet simpler, we left out the roughly 13% of Amazon’s sales that live down in the deep long tail below rank 100,000. But we do account for them when scaling up our daily sample to estimate total daily or annual sales.
Ranks 1 to 100,000 of the rank-to-sales curve add up to a total of 1,331,910 sales per day.
Ranks 101,000 to 3 million+ add up to roughly 210,000 more sales per day.
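The figures quoted above can be cross-checked with a few lines of arithmetic. This uses only the two numbers given in the extract; the long-tail figure is itself a rounded estimate ("roughly 210,000"), so the share is approximate.

```python
top_100k_sales = 1_331_910   # ranks 1 to 100,000, sales per day
long_tail_sales = 210_000    # lower-ranked books, sales per day (rounded estimate)

total_sales = top_100k_sales + long_tail_sales    # 1,541,910
rounded_total = round(total_sales, -3)            # 1,542,000, the integrated total quoted earlier
long_tail_share = long_tail_sales / total_sales   # about 0.136
```

So the two partial sums agree with the 1,542,000 total, and the long tail comes out between 13% and 14% of unit sales, in line with the "roughly 13%" stated above.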

The reason that we didn’t put them in the spreadsheet is we didn’t want to have to keep explaining to the less mathematically inclined folks how a book can sell a fraction of a copy in a day.


A frequent question in the comments is:
How were books classified as “Indie-Published,” “Small/Medium Publisher,” or “Uncategorized Single-Author Publisher”?
Here’s how:
1) The Big-5 Published books were easy to separate out, no matter what imprint they were published under, by checking the “Sold By” line in the Amazon Product Details, which listed one of: Random House, Penguin, Hachette, Macmillan, HarperCollins, or Simon & Schuster as seller.
2) If multiple author names used the same listed Publisher, and the book’s “Sold By” wasn’t one of the Big-5, it was considered a Small/Medium Publisher. A lot of these might indeed be Indie Publishers, but we wanted to be conservative and err on the side of understating–rather than overstating–Indie numbers.
3) If no Publisher at all was listed under Product Details, the book was considered Indie-Published.
4) If the full name of the author was included in the Publisher name, the book was considered Indie-Published.
5) The remaining books, whose publishers represented only a single author name, were initially grouped under Uncategorized Single-Author Publisher, and sorted by revenue. Then we rolled up our sleeves.
Going down the list one by one, we Googled the publisher names and author names. We were able to classify hundreds of them. Many were already known to us… for example: Broad Reach Publishing (Hugh), Laree Bailey Press (H.M. Ward), Reprobatio Inc. (Russell Blake), etc. We started from the biggest earners and went down, until the names became too obscure to find and we ran out of energy and time, and none of the remaining Uncategorized Single-Author Publishers individually accounted for a significant chunk of revenue.
So the vast majority of the remaining Uncategorized Single-Author Publishers are most likely “Indies in disguise.” But there are also a few examples of poor-selling imprints of small and medium traditional publishers in the mix (such as Baen), so again we didn’t want to overstate Indie market share by lumping them all in with the Indies.
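The automatic part of those rules (steps 1 to 5, before the manual Googling) could be sketched like this. The dictionary keys, the exact-match test on the seller name, and the `authors_per_publisher` lookup are all assumptions for the example; real "Sold By" strings may need fuzzier matching.

```python
BIG5_SELLERS = {
    "Random House", "Penguin", "Hachette",
    "Macmillan", "HarperCollins", "Simon & Schuster",
}


def classify(book, authors_per_publisher):
    """Apply rules 1-5 in order; `book` is a dict with author, publisher and sold_by."""
    publisher = book.get("publisher") or ""
    if book.get("sold_by") in BIG5_SELLERS:
        return "Big-5 Published"                    # rule 1: seller is a Big-5 house
    if publisher and len(authors_per_publisher.get(publisher, ())) > 1:
        return "Small/Medium Publisher"             # rule 2: several authors share the publisher
    if not publisher:
        return "Indie-Published"                    # rule 3: no publisher listed
    if book["author"].lower() in publisher.lower():
        return "Indie-Published"                    # rule 4: author's full name in publisher name
    return "Uncategorized Single-Author Publisher"  # rule 5: left for manual review


# Hypothetical lookup built from the whole data set: publisher -> set of author names.
authors_per_publisher = {
    "Acme Press": {"A. Writer", "B. Writer"},
    "Jane Doe Books": {"Jane Doe"},
    "Solo House": {"X. Author"},
}
```

The last category is deliberately a holding bucket: as described above, its entries were then checked by hand, largest earners first.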
Is there any way to quantify how much of the Small Medium Publisher/Single –> Indie market share can be attributed to re-classification of the publishers to Indie ?
The answer is very little – I just checked. Less than 0.1% of what was originally classified as Small/Medium Publisher income has been reclassified over the course of the last few reports. What you are seeing there is actual market-mix shift.
On the other hand, ~1.3% of what was “Uncategorized” income back in Feb 2014 report has since been definitively classified as indie, while another ~0.2% of it has since moved into Small/Medium Publisher income.

Kindle Unlimited

The methodology is explained in the October 2014 report

The amount paid per borrow is independent of price and depends instead on how much Amazon funds a shared pool. The rate per borrow has averaged $1.62 over the three months since KU launched. Each borrow appears to affect ebook ranking just as a sale does, so we have to take the borrow-to-sales rate into account for our earnings projection. As you will soon see, our data is robust enough that even wildly varying estimates for this rate do not appreciably affect our results. Before we get to our new baseline earnings report, let’s look at what our final graph would look like with five different assumptions for the borrow rate.

The difference in the total share of earnings by publishing type is only affected by a few percent even with wildly impossible assumptions about the borrow rate. In order to determine which of these charts to go with, we collected data from hundreds of authors and their individual titles, and these averages showed an average borrow/sales rate close to 1:1. The 50% borrow/50% sales data will be used for the rest of the report, and it will provide a baseline for our future reports.
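Here is a minimal sketch of how a borrow/sales split feeds into an earnings estimate for a title enrolled in KU. The $1.62 payout and the 50/50 split come from the extract above; the 70% royalty rate and the function itself are assumptions for illustration, not the report's actual model.

```python
def daily_author_earnings(units_from_rank, price, borrow_share=0.5,
                          borrow_payout=1.62, royalty_rate=0.70):
    """Split rank-implied daily units into borrows and sales, then price each part."""
    borrows = units_from_rank * borrow_share
    sales = units_from_rank - borrows
    # Borrows earn a flat payout; sales earn a share of list price (assumed 70%).
    return borrows * borrow_payout + sales * price * royalty_rate


# Example: rank implies 100 units a day for a $3.99 title.
baseline = daily_author_earnings(100, 3.99)                       # 50/50 split
no_borrows = daily_author_earnings(100, 3.99, borrow_share=0.0)   # all units are sales
```

Changing `borrow_share` between 0 and 1 reproduces the kind of sensitivity test described above, where several borrow-rate assumptions are compared.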

(Note from TheSFReader: the amount paid per borrow is updated in each subsequent report, based on the most recent rate per borrow.)

Update for KU 2.0 in the September 2015 report

Kindle Unlimited does make things a little trickier. But Amazon also provides us a nice monthly mechanism for calibrating our model: the overall KU payout size and the number of KENP read. With the KU 2.0 switch to compensation for pages read, the ghost-borrow issue is no longer a source of error. Our model for KU compensation now factors in the page-length of each title, the per-page KU 2.0 payout, and an average-%-read factor that lets us exactly match Amazon’s announced $11.5 million / 2-billion-KENP-read numbers from July.
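The July figures quoted above imply a per-page rate, which can be turned into a rough per-borrow payout. The pool size and KENP count are from the extract; the page length and the average fraction read below are invented values for illustration, and the function is a sketch rather than the report's model.

```python
KU_POOL_DOLLARS = 11_500_000      # announced July payout pool
KENP_READ = 2_000_000_000         # announced July pages read

PER_PAGE_RATE = KU_POOL_DOLLARS / KENP_READ   # 0.00575 dollars per page


def payout_per_borrow(kenp_pages, avg_fraction_read, per_page_rate=PER_PAGE_RATE):
    """Expected payout for one borrow: pages actually read times the per-page rate."""
    return kenp_pages * avg_fraction_read * per_page_rate


# Hypothetical title: 300 KENP pages, readers finish 60% on average.
example = payout_per_borrow(300, 0.6)   # 300 * 0.6 * 0.00575 = 1.035 dollars
```

In a calibrated model, the average fraction read would be the free parameter tuned so that the payouts summed over all KU titles match the announced pool.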

(Updated on 05/06/2015 with additional data on the 200K sample)
(Updated on 05/07/2015 with clarifications on the impact of KU borrows and additional data on sales of books ranked 100K to 3M+)
(Updated on 09/14/2015 with the KU 2.0-specific methodology)
(Updated on 02/10/2016 with the reverse-engineered rank-to-sales conversion)


  1. This is the most important point:

    "Even if you don’t believe these correlations are accurate, they are applied uniformly to all books – both self-published and traditionally published. So no matter what you plug in, the relative sales will remain the same. If you think self-published authors aren’t making as much as the charts indicate, then that means the traditionally published authors aren’t making as much either."

    That is the clincher that shows their data indicates a real sea change in publishing.

