E-Reading And RayTracing: An "Author Earnings" Methodology primer

(updated in September 2015)

Here is a lightly edited compilation of explanations of the methodology used by Hugh Howey and Data Guy for their 10 (so far) Author Earnings reports.

I try to give a source for each extract, and will probably update this post if new explanations arrive.

General

From The 50K report

For techies out there who geek out on methodology, the spider works like this: It crawls through all the categories, sub-categories, and sub-sub-categories listed on Amazon, starting from the very top and working its way down. It scans each product page and parses the text straight from the source html. Along with title, author, price, star-rating, and publisher information, the spider also grabs the book’s overall Amazon Kindle store sales ranking. This overall sales ranking is then used to slot each title into a single master list. Duplicate entries, from books appearing on multiple bestseller lists, get discarded.
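The "single master list" step described above can be sketched in a few lines of Python. This is only an illustration, not the actual spider: the field names `id` and `overall_rank` are assumptions of mine, and the real crawler's data layout is not described in the source.

```python
def build_master_list(bestseller_lists):
    """Merge per-category bestseller lists into one list ordered by overall rank.

    Each list is a sequence of dicts. The keys "id" and "overall_rank" are
    hypothetical names used for this sketch only.
    """
    seen = {}
    for entries in bestseller_lists:
        for book in entries:
            # A book found on several lists is kept only once.
            seen.setdefault(book["id"], book)
    return sorted(seen.values(), key=lambda book: book["overall_rank"])


fiction = [{"id": "A", "overall_rank": 12}, {"id": "B", "overall_rank": 3}]
thrillers = [{"id": "B", "overall_rank": 3}, {"id": "C", "overall_rank": 150}]
master = build_master_list([fiction, thrillers])
print([book["id"] for book in master])
```

Book "B" appears on both lists but ends up in the master list once, slotted by its overall store rank rather than by its position within either category.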

Our spider is looking at a snapshot of sales rankings for one particular day. Extrapolation is only useful for determining relative market share and theoretical earnings potential. Our conclusions assume that the proportion of self-published to traditionally published titles doesn’t change dramatically from day to day, and the similarity of datasets lends that assumption some support.

The preponderance of nonfiction in the February and later samples does not reflect market share. Rather, it reflects the many hundreds of detailed Amazon sub-sub-sub-category bestseller lists for nonfiction (Health, Fitness & Dieting > Alternative Medicine > Holistic, for example), that make lower-selling nonfiction more visible to the spider than equally low-selling fiction.

From a comment under the January 2015 report

A few things make doing so a little challenging: the once-per-quarter frequency of our data capture and the high turnover of the bestseller lists and sublists.
(In our October report, we found that almost 80,000 of the 120,000 July bestsellers had since fallen off the lists to be replaced by 80,000 others.)

We are getting a very comprehensive look at Amazon sales every time, though — the data “holes” are mostly down where titles are selling fewer than a handful of copies each. With each dataset, we’re capturing:

– practically all of the top several hundred ranks
– 95% of the top 1,000
– 80% of the top 5,000
– 68% of the top 10,000
– 52% of the top 25,000
– 42% of the top 50,000
– 33% of the top 100,000
– 11% of the top 1,000,000
– some additional ones ranked in the 2,000,000-3,000,000 range (mostly from really specific nonfiction bestseller lists like “Renaissance Painter Biographies” or whatever.)
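A coverage figure like the ones in this list is simply the share of the top N ranks that show up in the captured data. A minimal sketch, assuming the captured overall ranks are available as a plain list of integers (the function name is mine, not the report's):

```python
def coverage(captured_ranks, top_n):
    """Fraction of ranks 1..top_n that appear in the captured data."""
    unique_in_range = {rank for rank in captured_ranks if 1 <= rank <= top_n}
    return len(unique_in_range) / top_n


# Toy sample: six captured entries, one of them a duplicate.
sample = [1, 2, 3, 5, 5, 12]
print(coverage(sample, 5), coverage(sample, 10))
```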

Ideally, I’d like to grab all 3 million-ish every single day instead…

But the comprehensiveness of our snapshots comes at a nontrivial technical cost. For the technically curious out there, the data collection for this last report used 40 enterprise-grade servers (with 8 high-speed CPUs each) to crawl Amazon’s best seller lists and product pages, sucking almost 600 Gigabytes of HTML webpages across the Internet and ripping their HTML apart to extract the information we need and store it into a MySQL database. Each run takes a few hours, after which we shut the servers down before they burn a hole in our bank accounts.

Each report is thus a deep cross-sectional study of Amazon’s sales that day, but each is a single snapshot taken on a particular day. Their compositional consistency from quarter to quarter strongly suggests that we wouldn’t find much variation on the days in between, either. But perhaps we’ll try a longitudinal study in parallel at some point (or even better, someone else will) using a smaller set of titles.

http://authorearnings.com/report/may-2015-author-earnings-report/#comment-295886

For the May 2015 data set (which lists 200K ebooks), I launched the spider simultaneously on 120 servers, each with 8 CPUs and 16 GB of RAM. This Author Earnings data run took roughly an hour and a half, while running over a thousand separate webcrawler threads on those 120 servers. During that time, it read and extracted data from nearly a million Amazon.com product pages — print and audio books as well as ebooks — over a terabyte of data in all.

But the anonymized spreadsheet we publish is just the tip of the iceberg. Even so, it’s an unwieldy 60MB or so in size — we may trim it back down to 120K in future reports, just to keep things manageable.

On rank-to-sales conversion

http://authorearnings.com/report/february-2016-author-earnings-report/

For this report, Author Earnings threw out all of our previous assumptions. We built a brand new rank-to-sales conversion curve from the ground up. This time we based it on raw, Amazon-reported sales data on the precise daily sales figures for hundreds of individual books from many different authors, spanning a period of many months. Our raw sales data included titles ranked in Amazon’s Overall Top 5 — titles whose KDP reports verified that they were each selling many thousands of copies a day — and it also included books ranked in the hundreds of thousands — whose KDP reports revealed were selling less than a single copy a day. We combined that mass of hard sales data with a complete daily record of Amazon Kindle sales rankings for each of those books, pulled directly from individual AuthorCentral graphs. We ended up with nearly a million distinct data points in total.

Why did we need so many data points? Because Amazon’s Overall Best Seller Rankings aren’t a simple calculation based on each book’s single-day sales — they also factor in time-decaying sales from previous days as well. To reverse-engineer Amazon’s ranking algorithms, the more raw sales and ranking data we used, the more accurate our results would get. So we fired up some powerful computers, fed them all that raw data, and let them crunch the numbers.

For our fellow geeks: We applied both old-school statistical curve-fitting approaches and more modern machine learning techniques, iterating our underlying numerical model until we zeroed in on the solution that yielded the best predictive accuracy. Taking advantage of a neat mathematical series-convergence trick (one whose applicability was no accident, because Amazon’s algorithms undoubtedly rely on it, too), we ended up with a brand new, simpler, more elegant, and far more accurate rank-to-sales conversion formula for Kindle ebooks.

For the non-geeks: Our data-science awesomesauce now tastes even better.

Here’s what the new rank-to-sales curve looks like:

In retrospect, it’s striking how well AE’s old, crowdsourced rank-to-sales curve (in black) matches our new data-derived one. Graphically, the old AE curve ping-pongs back and forth between the new computed upper bound (shown in red), defined by the higher number of daily sales required to first “hit” a rank when spiking up from a much lower sales baseline, and the new computed lower bound (shown in blue), defined by the more modest number of daily sales required to steadily “hold” the same rank through consistent day-to-day sales.
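The gap between "hit" and "hold" sales follows naturally if the ranking score is a time-decayed sum of daily sales. The report does not publish its formula, so the sketch below is only an assumption: a geometric decay with a made-up factor of 0.5 per day. Under that assumption, steady sales of s per day converge to a score of s / (1 - decay), which is a geometric series, while a book spiking from a low baseline has to earn nearly the whole score in a single day.

```python
def decayed_score(daily_sales, decay=0.5):
    """Time-decayed sum of daily sales, oldest day first, newest day last.

    The geometric decay and the factor 0.5 are assumptions for illustration.
    """
    score = 0.0
    for sales in daily_sales:
        score = score * decay + sales
    return score


def hold_sales_needed(score, decay=0.5):
    """Steady daily sales whose decayed sum converges to `score`."""
    return score * (1 - decay)


def hit_sales_needed(score, baseline, decay=0.5):
    """One-day sales needed to reach `score` after a long run at `baseline`."""
    carried_over = decay * baseline / (1 - decay)
    return score - carried_over


target = 200.0
print(hold_sales_needed(target), hit_sales_needed(target, baseline=10))
```

With these hypothetical numbers, holding the score takes far fewer daily sales than hitting it from a baseline of 10 a day, which is the same ordering as the blue (hold) and red (hit) curves described above.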

(The old rank-to-sales table was:)

http://www.hughhowey.com/the-january-author-earnings-report/#comment-233671

Sales Rank	Sales Per Day
1	7,000
5	4,000
20	3,000
35	2,000
100	1,000
200	500
350	250
500	175
750	120
1,500	100
3,000	70
5,500	25
10,000	15
50,000	5
100,000	1

Mostly, it still follows: http://www.theresaragan.com/salesrankingchart.html with a few additional data points added (like the one at rank 100) to increase curve accuracy.

We’ve left it consistent since we started to avoid introducing yet another variable into the report-to-report comparisons.

A comment from Daniel Knight under the 50K report

Hugh clearly stated that these numbers were based on data gathered by numerous writers of their own books and corresponding rank/sales numbers. He included three different links. Numerous authors have corroborated these correlations.

The rank within a category or sub-category is irrelevant. Sales numbers are generated based on overall store rank.

Even if you don’t believe these correlations are accurate, they are applied uniformly to all books – both self-published and traditionally published. So no matter what you plug in, the relative sales will remain the same. If you think self-published authors aren’t making as much as the charts indicate, then that means the traditionally published authors aren’t making as much either.

Initial report (5th footnote)

Daily sales according to Amazon rank can be found in numerous places, including here, here, and here. Depending on the source, the model changes, but not enough to greatly affect the results. Keep in mind that the dollar figures and the exact sales are irrelevant to the ratio and percentages shown. Any change in those numbers impacts all books equally, so the picture of how authors are doing according to how they publish remains the same. These daily sales figures are adjustable in our spreadsheet, which contains our full data set and which we are offering at the low, low price of absolutely zilch.

Integration for missing books

Data Guy on Hugh Howey's blog

But we know what the shape of the sales-to-rank curve is, and so we know what the “missing” books at ranks in between the ones we captured are selling. We then numerically integrate the whole curve to get a total daily sales number for all ebooks at all ranks. In other words, for each rank, whether or not we happened to capture that particular book in our data set, we add up its corresponding unit sales to compute Amazon’s total unit sales. Picture “shading in the area under the curve.”

While the books in the long tail below rank 100,000 are shown as having 0 daily sales in our spreadsheet, they actually do sell a book every few days in the 100,000-500,000 range, a book a week in the 500,000-1,000,000 range, etc. (We zeroed those out in the spreadsheet because we didn’t want to get caught up explaining to the math-challenged how a book can sell a fraction of a copy a day.)

But we do include those fraction-sellers in the integrated total of 1,542,000 total ebooks sold per day (of which 1,331,910 are ranked 1-100,000).

Another from Data Guy

The thing that makes [numerical integration] easy (and accurate) is the by-definition monotonically-decreasing nature of the sales-to-rank curve (it’s a pareto distribution, more or less, with a couple kinks in it caused by different “list visibility” regimes).

So it just becomes a choice of what numerical-integration interpolation strategy you use. We used linear interpolation between sales-to-rank data points, to get an appropriate level of accuracy.
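To make the "linear interpolation, then add up every rank" idea concrete, here is a minimal Python sketch using the old rank-to-sales table quoted earlier in this post. It is not Data Guy's code. The published table stops at rank 100,000, so the sketch refuses ranks outside 1 to 100,000, and because the report mentions extra data points that are not all listed here, the total it prints will not necessarily reproduce the 1,331,910 figure.

```python
from bisect import bisect_right

# (overall rank, sales per day) points from the old table quoted in this post.
RANK_TO_SALES = [
    (1, 7000), (5, 4000), (20, 3000), (35, 2000), (100, 1000),
    (200, 500), (350, 250), (500, 175), (750, 120), (1500, 100),
    (3000, 70), (5500, 25), (10000, 15), (50000, 5), (100000, 1),
]
RANKS = [rank for rank, _ in RANK_TO_SALES]


def sales_at_rank(rank):
    """Daily sales at a rank, linearly interpolated between table points."""
    if not RANKS[0] <= rank <= RANKS[-1]:
        raise ValueError("rank is outside the range covered by the table")
    # Index of the table point just above `rank`, capped at the last point.
    upper = min(bisect_right(RANKS, rank), len(RANKS) - 1)
    (r0, s0), (r1, s1) = RANK_TO_SALES[upper - 1], RANK_TO_SALES[upper]
    return s0 + (s1 - s0) * (rank - r0) / (r1 - r0)


def total_daily_sales(first_rank=1, last_rank=100000):
    """Add up the interpolated sales of every rank, captured or not."""
    return sum(sales_at_rank(rank) for rank in range(first_rank, last_rank + 1))


print(round(total_daily_sales()))
```

Summing rank by rank is the discrete version of "shading in the area under the curve": every rank contributes its interpolated sales whether or not the spider happened to capture the book sitting at that rank.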

Still Data Guy

Error magnitude didn’t matter as much before, as our focus was mainly the relative performance of books published via each path. Therefore, an error affected all sectors consistently and equally and didn’t change those relative results.

However, now we’re looking at predicting the actual absolute number of ebook sales on Amazon.com, and the actual absolute size of the market as a whole. That requires more accuracy.

“Within 20%” is no longer good enough — we need a better handle on the accuracy. That’ll be our next focus.

The data, however, doesn’t follow a strict pareto or power-law distribution; it’s close, but not exact. There are those rank regimes I mentioned where the slope steepens or flattens, most likely due to sharp differences in how much bestseller list visibility books get in those ranges.

Ranks below 100,000

http://authorearnings.com/report/january-2015-author-earnings-report/#comment-224419

To make the spreadsheet simpler, we left out the roughly 13% of Amazon’s sales that live down in the deep long tail below rank 100,000. But we do account for them when scaling up our daily sample to estimate total daily or annual sales.

Ranks 1 to 100,000 of the rank-to-sales curve add up to a total of 1,331,910 sales per day.

Ranks 101,000 to 3 million+ add up to roughly 210,000 more sales per day.
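The two totals quoted above can be checked against the "roughly 13%" long-tail share and the 1,542,000 daily total with a few lines of arithmetic. Only the two input numbers come from the source; the variable names are mine.

```python
TOP_100K_DAILY_SALES = 1_331_910  # ranks 1 to 100,000, as quoted above
LONG_TAIL_DAILY_SALES = 210_000   # ranks beyond that, "roughly", as quoted above

total_daily_sales = TOP_100K_DAILY_SALES + LONG_TAIL_DAILY_SALES
long_tail_share = LONG_TAIL_DAILY_SALES / total_daily_sales

print(total_daily_sales, f"{long_tail_share:.1%}")
```

The sum lands within a rounding error of the 1,542,000 figure, and the long-tail share comes out between 13% and 14%.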

The reason that we didn’t put them in the spreadsheet is we didn’t want to have to keep explaining to the less mathematically inclined folks how a book can sell a fraction of a copy in a day.

Kindle Unlimited

The methodology is explained in the October 2014 report

http://authorearnings.com/report/october-2014-author-earnings-report-2/

The amount paid per borrow is independent of price and depends instead on how much Amazon funds a shared pool. The rate per borrow has averaged $1.62 over the three months since KU launched. Each borrow appears to affect ebook ranking just as a sale does, so we have to take the borrow-to-sales rate into account for our earnings projection. As you will soon see, our data is robust enough that even wildly varying estimates for this rate do not appreciably affect our results. Before we get to our new baseline earnings report, let’s look at what our final graph would look like with five different assumptions for the borrow rate.

The difference in the total share of earnings by publishing type is only affected by a few percent even with wildly impossible assumptions about the borrow rate. In order to determine which of these charts to go with, we collected data from hundreds of authors and their individual titles, and these averages showed an average borrow/sales rate close to 1:1. The 50% borrow/50% sales data will be used for the rest of the report, and it will provide a baseline for our future reports.
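Here is a hedged sketch of how the 50% borrow / 50% sales assumption could feed into an earnings estimate for one title. The $1.62 payout and the 1:1 split come from the report. The royalty rate is an input I made up for the example, and the function itself is my illustration, not the report's spreadsheet formula.

```python
KU_PAYOUT_PER_BORROW = 1.62  # average per-borrow payout quoted in the report


def daily_author_earnings(units, price, royalty_rate, in_ku,
                          borrow_fraction=0.5,
                          payout_per_borrow=KU_PAYOUT_PER_BORROW):
    """Estimated daily earnings for one title.

    `units` is the rank-implied daily unit count. For a KU title, a share of
    those units (50% by default) is treated as borrows paid at a flat rate.
    `royalty_rate` is a hypothetical input, not a figure from the report.
    """
    if not in_ku:
        return units * price * royalty_rate
    borrows = units * borrow_fraction
    sales = units - borrows
    return sales * price * royalty_rate + borrows * payout_per_borrow


# Hypothetical title: 100 units a day at $2.99 with a 70% royalty.
print(daily_author_earnings(100, 2.99, 0.70, in_ku=True))
print(daily_author_earnings(100, 2.99, 0.70, in_ku=False))
```

Changing `borrow_fraction` is the equivalent of switching between the five borrow-rate assumptions mentioned above.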

(Note from TheSFReader: the amount paid per borrow is updated in each subsequent report, based on the most recent per-borrow rate.)

Update for KU 2.0 in the September 2015 report

from http://authorearnings.com/report/september-2015-author-earnings-report/#comment-296014
Kindle Unlimited does make things a little trickier. But Amazon also provides us a nice monthly mechanism for calibrating our model: the overall KU payout size and the number of KENP read. With the KU 2.0 switch to compensation for pages read, the ghost-borrow issue is no longer a source of error. Our model for KU compensation now factors in the page-length of each title, the per-page KU 2.0 payout, and an average-%-read factor that lets us exactly match Amazon’s announced $11.5 million / 2-billion-KENP-read numbers from July.
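The July numbers quoted above imply a per-page payout, and the "average-%-read factor" can be seen as the one free parameter tuned so that modeled pages match the announced KENP total. The sketch below is my reading of that description, not Data Guy's model. The function names and the sample title in the usage line are hypothetical.

```python
JULY_2015_KU_PAYOUT_USD = 11.5e6  # announced payout, as quoted above
JULY_2015_KENP_READ = 2e9         # announced pages read, as quoted above

PER_PAGE_RATE = JULY_2015_KU_PAYOUT_USD / JULY_2015_KENP_READ


def ku2_payout_per_borrow(kenp_pages, avg_fraction_read,
                          per_page_rate=PER_PAGE_RATE):
    """Estimated payout for one borrow of a title with `kenp_pages` pages."""
    return kenp_pages * avg_fraction_read * per_page_rate


def calibrate_fraction_read(titles, total_kenp_read):
    """Average fraction read that makes modeled pages equal the announced total.

    `titles` is an iterable of (estimated_borrows, kenp_pages) pairs.
    """
    pages_if_fully_read = sum(borrows * pages for borrows, pages in titles)
    return total_kenp_read / pages_if_fully_read


# Hypothetical 300-page title with 60% of its pages read on average.
print(ku2_payout_per_borrow(300, 0.6))
```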

(Updated on 05/06/2015 with additional data related to the 200K sample)
(Updated on 05/07/2015 with clarifications on the impact of KU borrows, plus additional data on book sales in the 100K to 3M+ rank range)
(Updated on 09/14/2015 with the KU 2.0 specific methodology)
(Updated on 02/10/2016 with the reverse-engineered rank-to-sales conversion)

Comments:

Nirmala, February 24, 2015 at 7:32 PM
This is the most important point:

"Even if you don’t believe these correlations are accurate, they are applied uniformly to all books – both self-published and traditionally published. So no matter what you plug in, the relative sales will remain the same. If you think self-published authors aren’t making as much as the charts indicate, then that means the traditionally published authors aren’t making as much either."

That is the clincher that shows their data indicates a real sea change in publishing.
TheSFReader, March 21, 2015 at 3:14 PM
Yes it does. Thanks Nirmala :-)