Posted: Thu Feb 13, 2025 8:42 am
We had to make a tough choice: do we start with a known strong bias that doesn't favor Moz, or with a known weak bias that does? We could use a random selection from our own index as the starting point, which would be pseudo-random but could potentially favor Moz, or we could start with a small, public index like the Quantcast Top Million, which would be strongly biased toward good sites.
We decided to go with the latter as a starting point because the Quantcast data is:
Reproducible : We weren't going to make "random URL selection" part of the Moz API, so we needed a starting point that other people in the industry could use as well. The Quantcast Top Million is free for everyone.
Not biased towards Moz : We would prefer to err on the side of caution, even if it means more work to remove bias.
Known bias : The bias in the Quantcast Top Million was easy to understand: these are important sites, and we needed to address that bias.
Quantcast bias is natural : Any link graph already shares some of the Quantcast bias (powerful sites are more likely to be well-linked).
With this in mind, we selected 10,000 domains from the Quantcast Top Million and began the process of removing bias.
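As a rough illustration of the reproducibility goal, here is a minimal sketch of drawing a 10,000-domain seed set from a public list. It assumes the selection is a uniform random draw with a fixed seed, which the text above does not spell out. The function name, the seed value, and the placeholder domain names are all hypothetical.

```python
import random

def sample_domains(domains, k=10_000, seed=42):
    """Return k domains drawn uniformly without replacement.

    Assumption: a uniform draw with a fixed seed, so anyone holding the
    same list can reproduce the same sample.
    """
    rng = random.Random(seed)
    return rng.sample(domains, k)

# Hypothetical stand-in for the Quantcast Top Million list.
top_million = [f"site{i}.example" for i in range(1_000_000)]
seed_domains = sample_domains(top_million)
```

Because the seed is fixed, anyone who runs the same draw on the same list gets the same 10,000 domains.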
2. Choosing based on domain size rather than importance
Because we knew that the Quantcast Top Million was ranked by traffic and wanted to reduce this bias, we introduced a new bias based on site size. For each of the 10,000 sites, we used the "site:" command to identify the number of pages on the site according to Google, and we also retrieved the top 100 pages for the domain. We could now balance the "importance bias" against the "size bias," which is more reflective of the number of URLs on the web. This was the first step in reducing the known bias toward high-quality sites in the Quantcast Top Million.
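One way to picture the size-based balancing is a weighted draw: choose a domain with probability proportional to its reported page count, then choose one of its retrieved pages. This is only a sketch under that assumption, not necessarily the exact procedure used. The data layout, function name, and example domains are hypothetical.

```python
import random

def pick_urls(sites, n, seed=0):
    """Pick n URLs, weighting each domain by its estimated page count.

    sites maps domain -> (page_count, list_of_top_urls).
    Assumption: page_count comes from a "site:" query and the URL list
    holds up to 100 top pages for that domain.
    """
    rng = random.Random(seed)
    domains = list(sites)
    weights = [sites[d][0] for d in domains]
    # Larger sites are chosen more often (sampling with replacement).
    chosen = rng.choices(domains, weights=weights, k=n)
    # Within a chosen domain, pick one of its retrieved pages uniformly.
    return [rng.choice(sites[d][1]) for d in chosen]

# Hypothetical toy data: one large site and one small site.
sites = {
    "big.example": (9_000, ["https://big.example/a", "https://big.example/b"]),
    "small.example": (1_000, ["https://small.example/"]),
}
urls = pick_urls(sites, n=1000)
```

In this toy setup the larger site supplies most of the sampled URLs, which is the intended effect of trading some "importance bias" for "size bias."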