Mitigating the winner’s curse in online experiments

Experimentation plays a central role in how we test and deploy new ideas at Etsy. Not only does it provide teams with a scientific and practical procedure to identify which ideas affect our users, but it also allows us to estimate the size of that impact. This latter task is crucial for quantifying the business value created by new changes, and feeds directly into Etsy’s strategic decisions and financial planning.

The way we choose which changes to deploy follows a traditional hypothesis testing approach. If randomly exposing a small group of users to a particular treatment (a new design, added functionality, etc.) yields a positive lift in some metric of interest — and if that lift qualifies as statistically significant (unlikely to be due to chance alone in the absence of any effect) — then we call the treatment a “win” and feel confident deploying it at scale.

However, when we try to gauge the underlying effect of that winning treatment, naively taking the observed lift at face value will lead to an estimate that overshoots the mark. This phenomenon — commonly referred to as the winner’s curse — is a built-in limitation of our decision-making protocol. It is an artifact of how experimenters select winning treatments and it plagues how we estimate their impact, despite our best intentions.

In this article, we review how our experimentation framework gives rise to the winner’s curse. We present techniques from Bayesian statistics that can help break this curse, by discounting reported lifts to counteract the tendency toward overestimation. We discuss the challenges and benefits of this methodology, and how it has led us to a more accurate accounting of business impact for experiments at Etsy.

What is the winner’s curse?

To assess whether a particular treatment has a positive impact on our customers, we typically run a randomized experiment known as an A/B test, where we compare a random sample of users exposed to the treatment with another sample of control users exposed to the current experience. As the experiment concludes, we observe some measurable lift — a difference, either positive or negative — in our chosen success metric (Figure 1).

Figure 1. Toy visual of a randomized experiment. From a subset of our user population, we expose a random group (A) of users to a prospective treatment (symbolized by the flag), while another group (B) of users is presented with the current experience. At the end of the experiment, we measure the empirical lift (i.e. the relative difference) in a chosen success metric (e.g. the proportion of users who make at least one purchase) between the two groups.
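To make this concrete, here is a minimal sketch (with made-up counts, not real Etsy data) of how the observed relative lift and its standard error might be computed for the conversion metric in Figure 1, using the delta method for the ratio of two independent proportions:

```python
# A minimal sketch (with made-up counts) of measuring the empirical lift from
# Figure 1: the relative difference in conversion rate between the treatment
# group (A) and the control group (B), plus its standard error.
import numpy as np

converted_a, users_a = 10_450, 200_000   # treatment group (A), hypothetical counts
converted_b, users_b = 10_200, 200_000   # control group (B), hypothetical counts

rate_a = converted_a / users_a
rate_b = converted_b / users_b
observed_lift = rate_a / rate_b - 1      # relative lift of treatment over control

# Standard error of the relative lift via the delta method for a ratio of
# two independent proportions.
var_a = rate_a * (1 - rate_a) / users_a
var_b = rate_b * (1 - rate_b) / users_b
std_err = (rate_a / rate_b) * np.sqrt(var_a / rate_a**2 + var_b / rate_b**2)

print(f"observed lift: {observed_lift:+.2%} (standard error: {std_err:.2%})")
```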

Our first task is to determine whether the treatment is in fact improving our success metric. We have to keep in mind that the observed lift is only an approximation of the true lift (roughly speaking, the lift we would observe if our entire population of users were exposed to the treatment). Observed lifts inevitably deviate from the truth by some degree, as a consequence of the unmeasurable noise intrinsic to randomized experiments.

Happily, the random assignment of users to treatments enforces some helpful properties on this noise. In particular, the noise is symmetrical: the observed lift is equally likely to overestimate or underestimate the truth, and its average over many independent replications is expected to equal the true lift (Figure 2, panel 1). And we can also quantify the characteristic size of the noise (its standard deviation).

Together, these properties allow for a simple and practical decision rule: when the observed lift exceeds a specified threshold, we deem the result statistically significant and regard the treatment as a win (Figure 2, panel 2). By choosing our threshold appropriately, we can achieve desired error rates regarding both false positives (incorrectly claiming a win) and false negatives (missing out on a win). In other words, we can reliably detect lifts that are truly positive, in spite of the fact that we can't observe their true values directly.
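As a rough illustration of that decision rule, the sketch below declares a win when the observed lift clears a threshold derived from a one-sided test at the 5% level; the significance level, the one-sided form, and the numbers are purely illustrative rather than a description of our exact criteria:

```python
# A sketch of the decision rule: call the treatment a win when the observed
# lift clears a threshold set by the desired false-positive rate. A one-sided
# 5% test is used here purely for illustration.
from scipy.stats import norm

def is_win(observed_lift, std_err, alpha=0.05):
    """Win if the observed lift exceeds the alpha-level significance threshold."""
    threshold = norm.ppf(1 - alpha) * std_err
    return observed_lift > threshold

print(is_win(observed_lift=0.010, std_err=0.004))  # True: the lift clears the threshold
print(is_win(observed_lift=0.005, std_err=0.004))  # False: too noisy to call a win
```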

Knowing which treatments are wins is one thing, but estimating the size of their effects is another. Impact is a question of key importance for Etsy’s strategic and financial planning, and answering it is not trivial. As it turns out, naively trusting the observed lifts of the reported wins will generally lead to a substantial overestimation of their real impact (Figure 2, panel 3).

Figure 2. Observed lifts (y-axis) against their true lifts (x-axis). Each point represents an experiment, simulated independently over a grid of possible values of true lifts, using a fixed standard deviation (for illustrative purposes, since in practice the true lifts are not known). For reference, we plot the diagonal line whose slope is 1 (corresponding to points for which y = x). The left panel (1) shows that the observed lifts are noisy but centered at their true lifts. The middle panel (2) illustrates how we select wins (orange points) by keeping the experiments whose observed lifts exceed a chosen threshold (red dashed horizontal line). The right panel (3) shows that, on average, the observed lifts from reported wins lie above the diagonal and thus tend to overestimate their true lifts.
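A few lines of simulation are enough to reproduce the pattern in Figure 2: taken all together, the observed lifts are roughly unbiased, but once we keep only the “winners” they overestimate their true lifts on average. The grid of true lifts, noise level, and threshold below are illustrative, not the ones used to draw the figure:

```python
# A small simulation in the spirit of Figure 2: noise averages out across all
# experiments, but conditioning on "winning" leaves an upward bias.
# The grid of true lifts, the noise level, and the threshold are illustrative.
import numpy as np

rng = np.random.default_rng(1)
std_err = 0.004                                   # fixed noise level, as in the figure
true_lifts = np.linspace(0.0, 0.015, 5_000)       # grid of possible true lifts
observed = true_lifts + rng.normal(0, std_err, true_lifts.size)

threshold = 1.645 * std_err                       # one-sided 5% significance threshold
wins = observed > threshold

print(f"average bias, all experiments: {np.mean(observed - true_lifts):+.5f}")  # ~0
print(f"average bias, winners only:    {np.mean(observed[wins] - true_lifts[wins]):+.5f}")  # > 0
```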

Every treatment that meets our winning criteria is a winner, but not all winners are created equal. We can expect that some number of the less solid winners will have snuck in with values that are higher than their true lifts. In other words, conditional upon being reported as a win, the observed lift is now expected to overestimate its true lift. And this is what we call the winner’s curse, well known in the scientific literature [1, 2, 3]: a provable form of selection bias, which leads us to overstate the value of our wins and thereby exaggerate their true impact.

The winner’s curse has nothing to do with human biases (confirmation bias, p-hacking, etc.). It is a systematic bias, inherent in our use of a selection protocol. And unless we want to see winning treatments consistently underperforming our too-high expectations for them, we need a principled way to correct for it.

Breaking the curse

Theory tells us not to take the observed lifts of our winning experiments at face value. Since we expect them to exaggerate the truth, it seems natural to apply a discount to the observed lift to offset any overestimation. The name of the game is to determine how much the discount should be.

Having run many thousands of experiments over the years, we've developed a pretty good sense of what believable lift values look like. In particular, we acknowledge that it is genuinely hard to move our success metrics by a meaningful amount, more so as Etsy’s services become increasingly mature over time, making the control experience an ever harder benchmark to beat. This is reflected in the high concentration of past observed lifts around zero, which suggests that most true lifts are likely to be small. On the other hand, we also believe that major breakthroughs are possible (e.g. adding a brand new service, shifting paradigms for Etsy’s search algorithm, etc.), however infrequent.

As probability distributions are essentially mathematical representations of beliefs, we can formalize our acquired understanding by fitting a statistical model on past historical lifts. The chosen model — inspired by Deng et al. [4] — mixes together light- and heavy-tailed distributions, thus capturing the higher plausibility of small and incremental lifts, while still leaving room for larger (but rarer) ones.
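As a rough sketch of that idea (not our production model), one could fit a two-component Gaussian mixture to historical lifts, with a narrow component capturing the mass of small effects and a wide component standing in for the rarer large ones. For simplicity the snippet fits the mixture directly to simulated observed lifts; a more careful treatment would account for each experiment's measurement noise when estimating the prior on true lifts, as in Deng et al. [4]:

```python
# A rough sketch (not our production model) of building a prior from history:
# fit a two-component Gaussian mixture to historical lifts, with a narrow
# component for the mass of small effects and a wide component for rare large
# ones. The "historical" lifts here are simulated placeholders.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
historical_lifts = np.concatenate([
    rng.normal(0.0, 0.002, size=4_500),   # most lifts hover around zero
    rng.normal(0.0, 0.020, size=500),     # occasional larger effects
]).reshape(-1, 1)

prior_model = GaussianMixture(n_components=2, random_state=0).fit(historical_lifts)
for weight, mean, cov in zip(prior_model.weights_, prior_model.means_, prior_model.covariances_):
    print(f"weight {weight:.2f}, mean {mean[0]:+.4f}, sd {np.sqrt(cov[0, 0]):.4f}")
```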

To recap, we have two forces at play: the observed lifts from a set of winning experiments (which we know are cursed to some degree) and our prior belief, based on past experiments, of what values of true lifts are plausible (Figure 3).

Figure 3. Representation of our beliefs using probability distributions. Conceptually, the density (y-axis) indicates values (x-axis) that we believe to be the most plausible for the true lift. The higher the density (y), the more credibility we give to the corresponding value (x). The red blob on the right represents a belief that naively trusts the observed lift (red vertical dashed line) for a given experiment reported as a win. On the other hand, the blue blob on the left represents a historically informed belief, built upon thousands of past observations (blue hatched histogram), inducing a range of more credible lift values.

To combine these two beliefs into a single, coherent one, we take a Bayesian approach, which provides a framework for quantifying and updating knowledge probabilistically. Bayesian statistics enables formal answers to questions like, “Given our domain expertise and what was actually observed during an experiment, what value of true lift is the most credible?” At a high level, the process can be thought of as an algorithmic way for a prior belief to evolve into a posterior belief, on the basis of newly observed data. More specifically, we use a form of Gibbs sampling [5], a technique from the Markov Chain Monte Carlo literature, to produce the full distribution of our posterior belief. Although we won't go into technical detail here, this posterior distribution allows for a more informed guess of the true lift that the winning treatment can be expected to produce (Figure 4).

Figure 4. Combining the two components from Figure 3 into a single coherent belief. The resulting posterior belief is represented by the purple blob, which can be used to infer the value of the true lift. For example, we can define a “discounted” lift as the mean of this posterior distribution (purple vertical line).
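To give a flavor of the mechanics, here is a minimal Gibbs sampler for a simplified version of this setup: a two-component Gaussian mixture prior on the true lift (a stand-in for the light- and heavy-tailed mixture described above) and a Gaussian likelihood for the observed lift with known standard error. The prior weights, scales, and experiment numbers are all hypothetical:

```python
# A minimal Gibbs sampler for a simplified version of the model: a
# two-component Gaussian mixture prior on the true lift (standing in for the
# light- and heavy-tailed mixture described above) and a Gaussian likelihood
# for the observed lift with known standard error. All parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.9, 0.1])       # most effects are small, a few are large
prior_sd = np.array([0.002, 0.020])  # narrow and wide components, centered at zero

def gibbs_posterior(observed_lift, std_err, n_draws=5_000, burn_in=500):
    """Draw from p(true_lift | observed_lift) by alternating between the
    mixture-component indicator and the true lift."""
    theta = observed_lift            # initialize at the observed value
    draws = []
    for i in range(n_draws + burn_in):
        # 1) Sample the component indicator given the current true lift.
        log_probs = np.log(weights) - np.log(prior_sd) - 0.5 * (theta / prior_sd) ** 2
        probs = np.exp(log_probs - log_probs.max())
        probs /= probs.sum()
        k = rng.choice(len(weights), p=probs)

        # 2) Sample the true lift given the component (conjugate Gaussian update,
        #    with the prior components centered at zero).
        post_var = 1.0 / (1.0 / std_err**2 + 1.0 / prior_sd[k]**2)
        post_mean = post_var * (observed_lift / std_err**2)
        theta = rng.normal(post_mean, np.sqrt(post_var))

        if i >= burn_in:
            draws.append(theta)
    return np.array(draws)

posterior = gibbs_posterior(observed_lift=0.010, std_err=0.004)
print(f"observed lift: 1.00%, discounted (posterior mean) lift: {posterior.mean():.2%}")
```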

The estimated lift given by our posterior belief can be thought of as a discounted version of the raw observed lift, where the discount combats the inflationary bias of the winner’s curse. The discounted lift behaves as a weighted average of the lift observed in the current experiment and the lift we anticipate based solely on our knowledge of past experiments. How much weight we put on the observed lift directly relates to how much we trust the experiment: the greater the precision of the observed lift (the smaller its standard deviation), the less skeptical we are about it, hence the more inclined we are to abandon our prior belief, and vice versa (Figure 5).

Figure 5. As the precision of the observed lift (red vertical dashed line) decreases (i.e. as the red blob flattens and becomes more spread out), the discounted lift (purple vertical line) becomes more conservative and gets pulled more heavily toward our prior belief (blue blob on the left). Conversely, the higher the precision of the observed lift (i.e. the more concentrated the red blob), the more closely the discounted lift coincides with the observed lift.
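The weighted-average behavior is easiest to see in an even simpler setting: with a single Gaussian prior (instead of the mixture), the posterior mean is exactly a precision-weighted average of the prior mean and the observed lift. The prior scale below is a made-up placeholder:

```python
# A simplified illustration of the weighted-average behavior: with a single
# Gaussian prior (instead of the mixture), the posterior mean is exactly a
# precision-weighted average of the prior mean and the observed lift.
# The prior scale below is a made-up placeholder.
def discounted_lift(observed_lift, std_err, prior_mean=0.0, prior_sd=0.005):
    weight = prior_sd**2 / (prior_sd**2 + std_err**2)   # weight on the observed lift
    return weight * observed_lift + (1 - weight) * prior_mean

# The more precise the observed lift, the smaller the discount.
print(discounted_lift(0.010, std_err=0.002))  # high precision -> mild discount
print(discounted_lift(0.010, std_err=0.008))  # low precision  -> heavy discount
```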

One appeal of this method is that it effectively produces discounts that are tailored to each experiment’s respective credibility, as opposed to applying a fixed common haircut to every experiment equally. This flexibility can be tuned to ensure that we are not over-discounting the lifts of experiments whose effects are strong enough for the selection bias to be negligible (Figure 6).

Figure 6. Going back to panel 3 from Figure 2, we see that our discounts (abstractly represented by blue arrows) aim to pull the observed lifts toward more conservative values. We also see that for large enough effects (right end of the x-axis), the upward bias from the winner’s curse is negligible and therefore no discounting is needed.

In summary, by discounting the observed lifts of reported wins, we are able to reliably mitigate the issue of the winner's curse. Our Bayesian methodology induces an adaptive discounting mechanism that appropriately reflects each experiment’s respective uncertainty (Figure 7).

Figure 7. Example of two experiments (1 and 2) with similar observed lifts in some (redacted) success metric of interest but with different precisions due to their respective sample sizes (number of browsers in the experiment). The observed lift in Experiment 2 has a higher precision (smaller standard error) and is thus more believable, which leads to it receiving a less severe discount.
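Reusing the gibbs_posterior sketch (and its hypothetical prior) from earlier, we can mimic the comparison in Figure 7 with two invented experiments that share the same observed lift but differ in precision; the numbers are not the redacted ones from the figure:

```python
# Usage example, continuing the gibbs_posterior sketch above: same observed
# lift, different standard errors (i.e. different sample sizes).
noisy_exp = gibbs_posterior(observed_lift=0.010, std_err=0.006)    # fewer browsers
precise_exp = gibbs_posterior(observed_lift=0.010, std_err=0.002)  # more browsers

# The less precise experiment gets pulled much closer to the prior (near zero),
# while the more precise one keeps most of its observed lift.
print(f"noisy experiment:   observed 1.00%, discounted {noisy_exp.mean():.2%}")
print(f"precise experiment: observed 1.00%, discounted {precise_exp.mean():.2%}")
```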

Large-scale experimentation: a curse and a blessing

Trustworthy experimentation requires us to be alert to more than just traditional type I and type II errors (false positive and false negative rates, respectively). The winner's curse is connected to the broader notion of type M error, also known as the exaggeration ratio: the factor by which the magnitude of an effect is overestimated [6]. This type of error can be especially prominent in experiments that are underpowered (when sample sizes are not large enough).
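A quick simulation makes the exaggeration ratio tangible for an underpowered setup in which the true lift is only half the size of the noise; the numbers are illustrative:

```python
# A quick, illustrative simulation of the type M error (exaggeration ratio)
# for an underpowered experiment: the true lift is only half the noise level.
import numpy as np

rng = np.random.default_rng(3)
true_lift, std_err = 0.002, 0.004
observed = true_lift + rng.normal(0, std_err, size=100_000)
significant = observed > 1.645 * std_err            # one-sided 5% threshold

exaggeration = observed[significant].mean() / true_lift
print(f"power: {significant.mean():.0%}, exaggeration ratio: {exaggeration:.1f}x")
```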

Closely related to this is the concept of false discovery rate, which introduces the idea that, even among treatments whose lifts appear statistically significant, a substantial portion could still turn out to have no actual effect [7]. Without proper precautions, false discovery is actually exacerbated as the number of treatments increases, which makes it of central concern in our age of high-throughput experimentation, where we routinely have hundreds of experiments running each quarter, of which only a small minority are likely to have meaningful impact.
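For completeness, here is a generic sketch of the Benjamini-Hochberg procedure from [7] for controlling the false discovery rate across many experiments; it illustrates the concept and is not necessarily the multiple-testing adjustment used at Etsy:

```python
# A generic sketch of the Benjamini-Hochberg procedure [7] for controlling the
# false discovery rate across many experiments (an illustration of the concept,
# not necessarily the adjustment used at Etsy).
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of which experiments count as discoveries."""
    p = np.asarray(p_values)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m      # step-up thresholds
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])              # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject

print(benjamini_hochberg([0.001, 0.009, 0.040, 0.041, 0.200, 0.740]))
# -> [ True  True False False False False]
```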

In a prior post, we discussed the complex and large-scale ecosystem of experimentation at Etsy. As it happens, at this kind of scale, discounting lifts can have benefits beyond questions of selection bias. We are effectively trying to find a needle of signal in a haystack of noise, and discounting — more commonly known as shrinkage in the statistical literature — is a well-established technique for improving the performance of estimators [8, 9]. (In much the same way, regularization is ubiquitous in machine learning when fitting models involving a large number of parameters, out of which only a few will end up mattering.)

A/B testing has gained popularity in large part from the simplicity (on paper at least) of its classical hypothesis testing framework. But we should always keep in mind that real-life experiments involve extra conditioning and selection procedures, which can alter the inference and must be properly accounted for. Fortunately, despite the additional challenges brought about by dealing with ever more experiments, we can still hope to achieve sensible insights by leveraging the fact that these experiments are run together as part of a broader collective, which opens the door to sharing learnings and borrowing information across multiple experiments.

Acknowledgements

Special thanks to Kevin Gaan for helping with the internal review of this post, and to Michael Dietz, our external editor. I would also like to thank* Anastasia Erbe, Clare Burke, Gerald van den Berg, Michelle Borczuk, Samantha Emanuele, and Zach Armentrout from the Product Analytics and Strategic Finance team — as well as Alexander Tank and Julie Beckley from the Experimentation Science team — for their feedback and thoughtful discussions.

* Listed by team and alphabetical order of first names.

References

[1] M. Lee, M. Sheng (2018). Winner's curse: bias estimation for total effects of features in online controlled experiments.

[2] E. W. van Zwet, E. A. Cator (2021). The significance filter, the winner's curse and the need to shrink.

[3] I. Andrews, T. Kitagawa, A. McCloskey (2019). Inference on winners.

[4] A. Deng, Y. Li, J. Lu, V. Ramamurthy (2021). On post-selection inference in A/B testing.

[5] D. A. van Dyk, T. Park (2008). Partially collapsed Gibbs samplers: theory and methods.

[6] A. Gelman, J. Carlin (2014). Beyond power calculations: assessing type S (Sign) and type M (Magnitude) errors.

[7] Y. Benjamini, Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing.

[8] D. Coey, T. Cunningham (2019). Improving treatment effect estimators through experiment splitting.

[9] B. Efron (2012). Large-scale inference: empirical Bayes methods for estimation, testing, and prediction.