6 Jun 2011
And here’s the not-so-good news: This is why you can’t totally rely on synthetic page test data when you’re optimizing your site.
A couple of weeks ago, I had an interesting conversation with a customer in Europe. He said that he’d read my post about how most non-landing-page views are actually flow views, and he wanted to know how this related to the total number of object requests on each page. In my quest to look at the statistics we throw around every day in light of real-world data, I started to investigate.
First, I examined the provenance of three sets of widely cited statistics about the size of web pages:
Google (2010): Average page is 320K and contains 44 objects
Methodology: Google collected this data from a sample of several billions of pages that were processed as part of Google’s crawl and indexing pipeline. In processing these pages, they not only take into account the main HTML of the page, but also discover and process all embedded resources such as images, scripts and stylesheets.
Caveats:
- The tests from which these numbers are derived assume that users are arriving at the site with an empty cache.
- Some sites may present a different view of the resources to Googlebot than to regular users. For example, until recently, Google’s own servers used to serve CSS and JS uncompressed to Googlebot, while compressing them for regular user browsers.
- Pages are rendered and subresources are discovered through the eye of WebKit. If a page serves resources differently for Internet Explorer or Firefox, those won’t be visible here.
- Sampling of pages for processing is not uniformly random or unbiased. For example, pages with higher PageRank are more likely to be included in these metrics.
Charzinski (2010): Average page is 507K and contains 65 objects
Methodology: These numbers have been widely publicized via performance consultant Andrew King’s well-known graph (above) and blog post. Andrew derived the 2009 data points from this paper by Joachim Charzinski, who analyzed a number of popular websites as part of his research into the efficiency of client-side caching.
Caveat: Tests assume an empty cache.
HTTP Archive (2011): Average page is 678K and contains 78 objects
Caveats:
- Tests assume an empty cache.
- Pages are rendered and subresources are discovered through the eye of Internet Explorer. If a page serves resources differently for Chrome or Firefox, those won’t be visible here.
- Heavily focused on the largest websites.
- Only home pages are tested.
How do these numbers compare to real-world data?
There’s a pretty big discrepancy between 44 objects and 78 objects, and between 320K and 678K. But these tests all have one thing in common: they are all looking at the number of requests on a page with an empty cache.
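To put that discrepancy in rough numbers, here is a minimal Python sketch that compares the three published averages quoted above. The figures come straight from the studies listed earlier; the variable and function names are my own, and the "KB per object" figure is only a crude ratio of two averages, not something any of the studies reported.

```python
# Published averages quoted above: (page size in KB, objects per page).
PUBLISHED = {
    "Google (2010)": (320, 44),
    "Charzinski (2010)": (507, 65),
    "HTTP Archive (2011)": (678, 78),
}


def spread(values):
    """Ratio of the largest value to the smallest one."""
    return max(values) / min(values)


def kb_per_object(size_kb, objects):
    """Crude ratio of average page size to average object count."""
    return size_kb / objects


size_spread = spread([size for size, _ in PUBLISHED.values()])
object_spread = spread([count for _, count in PUBLISHED.values()])

for study, (size_kb, objects) in PUBLISHED.items():
    print(f"{study}: {kb_per_object(size_kb, objects):.1f} KB per object")
print(f"Page size spread: {size_spread:.2f}x")
print(f"Object count spread: {object_spread:.2f}x")
```

By this measure, the largest published object count is about 1.77 times the smallest, and the largest page size is about 2.12 times the smallest.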
This is obviously not how we use sites in the real world. I wanted to look at the actual number of resources sent when pages are served to real users.
1. Selecting test subjects
To gain some insight into this, I turned to Strangeloop’s data warehouse, which captures some very interesting statistics on page resources. I narrowed my search by using the following criteria:
- Sites that were segmenting traffic. (Web content optimization can dramatically change the number of requests, and I was not interested in looking at post-acceleration data.)
- Sites that were using expires headers on most resources.
- Sites that did not use a content delivery network (CDN). (I needed to see all of the resource requests. Although this biases the insights towards smaller customers, I did not want to go through the hassle of cross-referencing my findings with client CDN analytics, which are notoriously fickle.)
- Sites with 10M page views per month or more. (Although the non-CDN criterion biased me towards smaller customers, I did not want to go too small.)
- Sites with generally templated pages. (Given that I needed to run WebPagetest on the sites to see how many requests they served with an empty cache, I wanted to find sites where this would be easy.)
Using these criteria, I found three customers who made excellent candidates for my test. (Obviously, this can’t compare with the billions of sites crawled by Google, but I was looking for consistent trends. If my findings had been all over the place, I wouldn’t be writing this post right now.)
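The selection criteria above amount to a simple filter. The Python sketch below is only an illustration: the `Site` record, its field names, and the 50% threshold for "most resources" are all assumptions made for the example, not the actual schema or queries of Strangeloop's data warehouse.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical record; field names are illustrative only."""
    name: str
    segments_traffic: bool
    expires_coverage: float  # fraction of resources served with expires headers
    uses_cdn: bool
    monthly_page_views: int
    templated_pages: bool


MIN_MONTHLY_PAGE_VIEWS = 10_000_000  # "10M page views per month or more"
MIN_EXPIRES_COVERAGE = 0.5  # assumption: "most resources" means more than half


def is_candidate(site: Site) -> bool:
    """Apply the five selection criteria listed above."""
    return (
        site.segments_traffic
        and site.expires_coverage > MIN_EXPIRES_COVERAGE
        and not site.uses_cdn
        and site.monthly_page_views >= MIN_MONTHLY_PAGE_VIEWS
        and site.templated_pages
    )


# Made-up sample data for demonstration.
sites = [
    Site("shop-a", True, 0.9, False, 12_000_000, True),
    Site("shop-b", True, 0.9, True, 40_000_000, True),   # uses a CDN
    Site("shop-c", True, 0.8, False, 2_000_000, True),   # too little traffic
]
candidates = [site.name for site in sites if is_candidate(site)]
print(candidates)
```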
2. Determining number of requests for real-world pages
I selected key landing pages and representative sub pages for each site. I then searched our data warehouse to see how many resources were sent to browsers for each page.
3. Determining number of requests for pages with an empty cache
I then went to WebPagetest and ran tests on the same pages to see how many requests they had with an empty cache. Then I cross-referenced this with the numbers I gathered in step 2.
From steps 2 and 3, I came up with the following tables:
| Landing pages | Client 1 | Client 2 | Client 3 | Average |
|---|---|---|---|---|
| Empty cache: # of requests sent | 112 | 89 | 153 | 118 |
| Real world: # of requests sent | 73 | 61 | 112 | 82 |

| Secondary pages | Client 1 | Client 2 | Client 3 | Average |
|---|---|---|---|---|
| Empty cache: # of requests sent | 134 | 72 | 123 | 110 |
| Real world: # of requests sent | 52 | 34 | 45 | 44 |
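The averages in the two tables above, and the share of empty-cache requests that real users actually triggered, can be checked with a few lines of Python. The per-client numbers are copied from the tables; the dictionary keys and the `summarize` helper are names I made up for this sketch.

```python
from statistics import mean

# Per-client request counts copied from the two tables above.
TABLES = {
    "first_table": {
        "empty_cache": [112, 89, 153],
        "real_world": [73, 61, 112],
    },
    "second_table": {
        "empty_cache": [134, 72, 123],
        "real_world": [52, 34, 45],
    },
}


def summarize(empty_cache, real_world):
    """Return rounded averages and the real-world share of empty-cache requests."""
    avg_empty = mean(empty_cache)
    avg_real = mean(real_world)
    return {
        "avg_empty": round(avg_empty),
        "avg_real": round(avg_real),
        "share_sent": avg_real / avg_empty,
    }


for label, rows in TABLES.items():
    summary = summarize(rows["empty_cache"], rows["real_world"])
    print(
        f"{label}: {summary['avg_empty']} empty cache, "
        f"{summary['avg_real']} real world, "
        f"{summary['share_sent']:.0%} of empty-cache requests"
    )
```

On these numbers, real users triggered roughly 69% of the empty-cache requests in the first table and roughly 40% in the second.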
The real-world numbers looked suspiciously like repeat view numbers, so I ran a series of repeat view tests to see what was up. The results didn’t correlate with the real-world numbers at all, but I wanted to dig a bit deeper before laying this issue to rest. I ran WebPagetest flows based on a sample of real user flows, and I saw that users were getting very different resources depending on how they got to that page.
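A toy model shows why the path matters. The Python sketch below is not how WebPagetest or our data warehouse works; it simply treats the browser cache as a set, assumes every resource is cacheable and never expires during the visit, and uses made-up page and file names.

```python
# Hypothetical pages and the resources each one references.
PAGE_RESOURCES = {
    "home": {"site.css", "site.js", "logo.png", "hero.jpg"},
    "category": {"site.css", "site.js", "logo.png", "grid.js",
                 "thumb1.jpg", "thumb2.jpg"},
    "product": {"site.css", "site.js", "logo.png", "grid.js",
                "zoom.js", "product.jpg"},
}


def requests_along_flow(flow, page_resources):
    """Count resource requests per page for a visit that starts with an empty cache.

    Simplification: every resource is cacheable and stays cached for the
    whole visit, so a resource is only requested the first time it is needed.
    """
    cache = set()
    counts = []
    for page in flow:
        needed = page_resources[page]
        missing = needed - cache  # only uncached resources hit the network
        counts.append((page, len(missing)))
        cache |= needed
    return counts


direct = requests_along_flow(["product"], PAGE_RESOURCES)
via_home = requests_along_flow(["home", "product"], PAGE_RESOURCES)
via_category = requests_along_flow(["home", "category", "product"], PAGE_RESOURCES)

for flow in (direct, via_home, via_category):
    print(flow)
```

In this toy example, the same product page costs six requests when it is the entry page, three after a visit to the home page, and two after the home and category pages. A single empty-cache test of that page would only ever report six.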
Takeaway: To improve KPIs, keep your eyes on the real world
This is a pretty small sample size, and I don’t expect my findings to rock your world. We can’t draw any sweeping conclusions, but I found it interesting how consistently the real-world data differed from the synthetic test data. It serves as yet another reminder of something that’s easy to forget: We can’t look at synthetic tests as our primary benchmark.
I continue to be evangelical about the fact that we can’t focus on synthetic tests because they don’t represent what actual users are seeing. If you think your users are seeing 110 resources and they are actually seeing half that — or even less — on secondary pages, you might change how you optimize or how you choose to implement optimization. If we want to improve key performance indicators, we must always keep our eyes on the real world.