Performance measurement

Tim Morrow (Betfair): Why third-party content keeps him up at night [PODCAST]

Tim Morrow has rocked the performance community on at least three distinct occasions.

The first time was at Velocity 2009, when he shared a case study from Shopzilla, where he was senior architect. Its findings became a cornerstone of how I and many others talked about the business value of performance: Shopzilla cut above-the-fold load times to less than 2 seconds and, as a result, saw revenue gains of 5-12%.

From where I sit, it’s pretty hard to top findings like this, but Tim managed to do it when he came back to Velocity a year later and offered an awesomely candid case study showing how Shopzilla took its eye off the ball, performance-wise. As developers were occupied with other projects, load times slowly deteriorated until pages were once again taking 5 or more seconds to load. Customers were quick to notice and complain, which spurred a renewed internal effort to make pages faster.

More recently, as head of sports delivery at Betfair, Tim brings his commitment to customer satisfaction to the creation of another industry first, which he helped launch in the summer of 2011: a customer-facing charter that addresses the issue of page speed and makes a clear pledge to users:

After reliability, we believe that speed is a key feature of our products. We also acknowledge that we have a long way to go but we are working on it. In simple terms we commit to ensure our site becomes faster. To be more specific, we aim for 99.9% of bets placed in less than a second and our aspirational website Service Level Agreement is as follows. Under peak loads, with performance measured at the 95th percentile, for typical user bandwidths and a 0% error rate, our users shall experience Visual Progress (header loaded) in less than 1 second, Time to Interact with useful content within 1.5 seconds and full page loads within 3 seconds.

Like so many of the people I meet in our community, Tim Morrow is a practical idealist when it comes to performance. He has an inspiring combination of aspirational, visionary thinking, and the savvy to back up thought with action. It was my great privilege to speak with him about topics ranging from third-party content to performance testing. I hope you enjoy listening.

Listen to the podcast: Tim Morrow

Big Data vs. Big Enough Data

These days, there’s a lot of excitement around big data, and for good reason. It gives companies unprecedented power to harness customer information and increase their competitiveness in a time when the ability to compete globally has never been more important.

I have a lot of smart friends — some, like Eric Goldsmith, Cliff Crocker, and Buddy Brewer, who have been kind enough to come on as guests for my podcast — who are working with big data in meaningful and important ways.

But inevitably, where there’s excitement, there’s also hype. The tech community loves a new altar to worship at (and I’m putting up my hand here as well), and “big data” is the official shrine of 2013.

I’ve also noticed a growing conviction that, given the choice between grabbing all the data or grabbing a sample of the data, we should always choose to work with all the data. The problem with this conviction is that it divides companies into two groups: the data haves and the data have-nots. If you don’t have access to billions and billions of data points, then there’s an understandable sense of frustration at being left behind.

In today’s post, I want to do two things:

  • talk about when it’s okay — and possibly even better — to use big enough data, rather than big data, and
  • as a caveat to the point above, explore the question of how big is big enough by using a recent example from our work here at Strangeloop.

When is “big enough” good enough?

In TechCrunch a couple of months back, there was a really great interview with Dr. Michael Wu, Principal Scientist of Analytics at Lithium. It may seem funny that I’m using it to argue for big enough data, because the thrust of the interview is that we actually need to look at even larger data sets, but I think the two arguments can peacefully co-exist.

Dr. Wu says:

While data does give you information, the fallacy of big data is that more data doesn’t mean you will get “proportionately” more information. In fact, the more data you have, the less information you gain as a proportion of the data.

In other words:

  • Are massive data sets going to ensure that your insights are statistically relevant? Definitely yes.
  • Are massive data sets going to deliver a proportionately massive number of amazing insights? Probably not.

In my opinion, there’s one scenario in which it’s fine to use “big enough” data:

To generate a hypothesis to be further tested by bigger data.

I had a great chat with Eric Goldsmith a while back, where he gave the best explanation I’ve heard for how and why to mine big data. His mantra is “Mine the data for correlation and then experiment for causation.” While Eric didn’t specify the sizes of data sets he uses for mining and experimentation, I’m going to take the liberty of borrowing his mantra to offer this advice to anyone looking to make their data mining process more agile:

  1. Start with a smaller (but still statistically significant) data set.
  2. Identify trends.
  3. Develop hypotheses.
  4. Look to your larger data set to test your hypotheses.

These four steps are all good and sound easy, but your smaller data set still needs to be large enough to be statistically significant. For example, the amount of variance in your data affects how large the data set needs to be, and so does the number of variables you want to analyze and correlate. “How big is big enough?” is the crucial question, which leads us to the second part of this post…

So, how big is big enough? An experiment in finding the sweet spot.

Today I’m sharing just one example of how we answered this question here at Strangeloop. While it doesn’t totally conform to the four points outlined above, it does demonstrate how we figured out how big was big enough in a specific scenario.

Objective

This was an experiment conducted by Ken Jackson, one of our senior software engineers, to compare and analyze the effects of varying numbers of WebPagetest runs for a real customer’s site (whose name I can’t share, for obvious reasons) to show before-and-after acceleration results. The goal was to test the assumption that 10 test runs is enough to deliver enough data for us to draw meaningful conclusions, given the amount of variance in the data we were collecting. It’s important to note that we were looking at just one variable: load time.

Methodology

  1. Using a WebPagetest private instance, gathered data for 100 runs, both treated and untreated, on the site’s home page. This generated the baseline for comparing the other tests.
  2. Fed that data into a statistical resampling exercise that simulated 10,000 treated vs untreated tests with 3 runs, 10 runs, and 30 runs.
  3. To reduce variability as much as possible, used first-view only, a single browser (IE9), a single location (San Jose), and a single connectivity (cable).
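The resampling exercise in step 2 can be sketched roughly as follows. This is a minimal illustration, not Ken’s actual code: the load times are synthetic (log-normal draws with made-up parameters standing in for the 100 measured runs), the function names are hypothetical, and acceleration is assumed to mean (untreated median - treated median) / untreated median.

```python
import random
import statistics


def acceleration(untreated, treated):
    """Fractional speed-up based on median load times (assumed definition)."""
    u = statistics.median(untreated)
    t = statistics.median(treated)
    return (u - t) / u


def simulate(untreated, treated, runs, trials=10_000, seed=0):
    """Draw `runs` load times (with replacement) from each baseline,
    `trials` times, and return the acceleration seen in each simulated test."""
    rng = random.Random(seed)
    return [
        acceleration(rng.choices(untreated, k=runs), rng.choices(treated, k=runs))
        for _ in range(trials)
    ]


# Synthetic stand-ins for the 100 measured runs (seconds); not real data.
data_rng = random.Random(42)
untreated = [data_rng.lognormvariate(1.6, 0.35) for _ in range(100)]
treated = [data_rng.lognormvariate(1.25, 0.35) for _ in range(100)]

print(f"baseline acceleration: {acceleration(untreated, treated):.0%}")
for runs in (3, 10, 30):
    results = sorted(simulate(untreated, treated, runs))
    n = len(results)
    low, high = results[int(0.025 * n)], results[int(0.975 * n)]
    negative = sum(r < 0 for r in results) / n
    print(f"{runs:>2} runs: middle 95% of tests span {low:.0%} to {high:.0%}, "
          f"{negative:.1%} show negative acceleration")
```

Plotting each `results` list as a histogram would give charts of the same kind as the ones described in the results, although the exact shapes depend entirely on the input data.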

Results

Overall, the acceleration from the median doc complete times for the 100 runs was 31%. So, as already stated, the goal was to find out which smaller set of runs, if any, yielded similar results.

The results are shown as a series of histograms below.

3 runs

The first histogram below shows the results for tests with 3 runs. The height of each bar indicates the number of tests that gave a particular acceleration value. For example, the peak in this histogram shows about 900 treated vs untreated tests that resulted in a 50% acceleration. That seems a little high compared to the 31% obtained from the 100-run baseline, so already the 3-run results seem suspect.

The most striking thing about this graph is the number of bars that are less than 0% acceleration. There are cases where a 3-run test of treated vs untreated will show a negative acceleration even though we can be confident that the acceleration on this page is close to 31%. In rare cases, a 3-run test can even show an acceleration of -100%.

10 runs

The next histogram shows the results from simulating 10,000 treated vs untreated tests with 10 runs. The peak is closer to 30% and overall the shape of the curve is much narrower.

However, there are still cases where the acceleration measured from a 10-run test would be negative, by as much as -50%. The curve is still quite wide, so we would expect 10-run tests to often vary between 20% and 40% acceleration.

30 runs

The histogram for 10,000 simulated 30-run tests shows a stronger peak at 30% acceleration and the overall curve is narrower still.

Conclusions

The results indicate that up to 30 runs are often needed to reliably demonstrate acceleration value — even on a site with 30% acceleration and fairly low variability. As Ken pointed out when he wrote this up in our internal blog, even more runs would be needed on sites that have higher variability and/or lower acceleration.

Why? The more variance there is in your data, the more data you need. Many big data projects are about capturing hundreds of variables and comparing across many of them. But if there’s not a lot of variance (if there’s a small number of variables or correlations between elements), then you can get away with using smaller data sets.
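As a rough illustration of why variance drives sample size, the textbook formula for estimating a mean to within a margin of error E is n ≈ (z × σ / E)². The sketch below applies it. Note the assumptions: this is the normal-approximation formula for a mean, not the median-based resampling used in the experiment above, and the function name and noise levels are invented for the example.

```python
import math


def runs_needed(std_dev, margin, z=1.96):
    """Approximate number of measurements needed to estimate a mean to within
    +/- `margin` at roughly 95% confidence (z=1.96), assuming near-normal noise."""
    return math.ceil((z * std_dev / margin) ** 2)


# Hypothetical load-time noise levels in seconds; the target margin is 0.25 s.
for std_dev in (0.5, 1.0, 2.0):
    print(f"std dev {std_dev} s -> about {runs_needed(std_dev, 0.25)} runs")
```

Because the standard deviation is squared, doubling the noise roughly quadruples the number of runs required for the same precision.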

Takeaway

It’s impossible to cover all the finer points of the science of data collection and analysis in a single blog post. Our findings are specific to this unique scenario and shouldn’t be extrapolated to other scenarios.

Instead, I’m hoping this post will serve as an example of a real-world situation that made us ask “How big is big enough?” and then made us look for a way to answer that question. In this case, we learned that the answer is “More than we think”, due to the amount of variance in the data. So while we had to make our data bigger in order to make it statistically viable, we didn’t have to blow it up to capital-B-capital-D Big Data proportions.

Getting back to my point at the top of this post: big data doesn’t mean wasting compute cycles and testing forever, and it’s not about collecting a lot of the same measurement well past statistical viability. Depending on the complexity of whatever you’re testing and the rate of variance in your results, you may be able to find a point at which you’ve controlled enough variables and have enough measurements that you don’t need to keep testing.

Ilya Grigorik (Google): When it comes to tackling web performance, we all still have a lot to learn [PODCAST]

Whatever area of web performance you specialize in, if you’re anywhere in the performance space, this week’s podcast has something for you:

  • social analytics
  • RUM
  • SPDY
  • HTTP 2.0
  • CDNs
  • mobile performance
  • scuttlebutt about sharing an office with Steve Souders

The guy who’s delivering the goods on all these facets of performance is Ilya Grigorik, one of the smartest people I’ve ever talked to, inside or outside our industry. Ilya has a fascinating breadth of experience. He’s deep in the weeds on a few sides of the performance equation: from protocols to analytics to how pages are actually put together. As developer advocate at Google, he’s not only chin deep in Google Analytics and SPDY, but also walking the DOM and fiddling with CSS. And he gets to share an office with Steve, which you’ll hear me grill him mercilessly about. :)

What makes talking with Ilya really refreshing is that he doesn’t just have a lot to share about what he knows — he’s also really candid about what we don’t know and what we could be doing better. And he’s remarkably upbeat about the Sisyphean task of making the entire web faster. Enjoy.

Listen to the podcast: Ilya Grigorik

Eric Goldsmith (AOL): Why everyone has to think like a data scientist [PODCAST]

It was a great honour and a privilege to chat with AOL performance evangelist and operations architect Eric Goldsmith for this week’s podcast. If you’re newer to the performance scene, Eric’s name may not be familiar to you. That’s because in 2010 he got pulled off the web performance circuit sideways into the world of big data, and went from headlining at Velocity to headlining at Strata.

But web performance and big data are intersecting worlds, and make no mistake about it, Eric is still very much a big thinker — and doer — in the performance world. With real user measurement (RUM) poised to become a topic on every site owner’s lips, Eric shares some important insights about how to extract actionable metrics from your RUM data, and how to avoid falling into the trap of confusing correlation with causation.

While Eric is in the enviable position of having massive amounts of data to mine (in his own words: “I revel in the scale”), he points out that, for those mining smaller data sets, the fundamentals remain the same. He also mentions that preaching the importance of statistics is an uphill battle at any size of company, even AOL.

We covered a lot of ground in this podcast, from how to teach stats within a corporate culture to changes in the RUM world over the past seven years. Enjoy.

Listen to the podcast: Eric Goldsmith

This week on the Web Performance Today podcast: Cliff Crocker and Buddy Brewer

I feel extremely lucky in the calibre of guests who’ve been kind enough to chat with me on the Web Performance Today podcast. Last week, we kicked things off with Pat Meenan and Stephen Thair. This week, I’m talking to Cliff Crocker and Buddy Brewer.

Cliff Crocker is a great guy whom I’m happy to also call a friend. If you’re in the performance space, Cliff is a really interesting person to talk to because he’s sat on different sides of the fence, first as a solution vendor at Keynote, then as a customer at Walmart, and then again as a solution vendor — this time at SOASTA, where he’s currently VP Product. If you want to learn about the dynamics of the buying and selling process, he’s the person to talk to. And as you might also guess, he has a massive amount of insight into real user measurement and where that industry is heading. Which is a good segue into my other guest this week…

Buddy Brewer is one of the co-founders, along with Philip Tellis, of LogNormal, one of the most innovative RUM tools on the market — so innovative, in fact, that it was recently acquired by SOASTA. Everyone who’s ever worked with Buddy agrees he’s a rare combination — an awesome guy who also happens to be really sharp. A lot of technologists dream of taking their startup from bootstrap to successful acquisition, but Buddy and Philip actually made that dream come true — and in just over a year. When you listen to this podcast, you’ll get some great insight into how they made it happen.

I hope you enjoy this week’s interviews. If you have any feedback or suggestions for future podcasts, let me know.
