# Mitigating the winner’s curse in online experiments

Experimentation plays a central role in how we test and deploy new ideas at Etsy. Not only does it provide teams with a scientific and practical procedure to identify which ideas affect our users, but it also allows us to estimate the size of that impact. This latter task is crucial for quantifying the business value created by new changes, and feeds directly into Etsy’s strategic decisions and financial planning.

The way we choose which changes to deploy follows a traditional hypothesis testing approach. If randomly exposing a small group of users to a particular treatment (a new design, added functionality, etc.) yields a positive lift in some metric of interest — and if that lift qualifies as statistically significant (unlikely to be due to chance alone in the absence of any effect) — then we call the treatment a “win” and feel confident deploying it at scale.

However, when we try to gauge the underlying effect of that winning treatment, naively taking the observed lift at face value will lead to an estimate that overshoots the mark. This phenomenon — commonly referred to as the *winner’s curse* — is a built-in limitation of our decision-making protocol. It is an artifact of how experimenters select winning treatments and it plagues how we estimate their impact, despite our best intentions.

In this article, we review how our experimentation framework gives rise to the winner’s curse. We present techniques from Bayesian statistics that can help break this curse, by discounting reported lifts to counteract the tendency toward overestimation. We discuss the challenges and benefits of this methodology, and how it has led us to a more accurate accounting of business impact for experiments at Etsy.

## What is the winner’s curse?

To assess whether a particular treatment has a positive impact on our customers, we typically run a randomized experiment known as an *A/B test*, where we compare a random sample of users exposed to the treatment with another sample of control users exposed to the current experience. As the experiment concludes, we observe some measurable lift — a difference, either positive or negative — in our chosen success metric (Figure 1).

Our first task is to determine whether the treatment is in fact improving our success metric. We have to keep in mind that the *observed lift* is only an approximation of the *true lift* (roughly speaking, the lift we would observe if our entire population of users were exposed to the treatment). Observed lifts inevitably deviate from the truth by some degree, as a consequence of the unmeasurable noise intrinsic to randomized experiments.

Happily, the random assignment of users to treatments enforces some helpful properties on this noise. In particular, the noise is symmetrical: the observed lift is equally likely to overestimate or underestimate the truth, and its average over many independent replications is expected to equal the true lift (Figure 2, panel 1). We can also quantify the characteristic size of the noise (its *standard deviation*).

Together, these properties allow for a simple and practical decision rule: when the observed lift exceeds a specified threshold, we deem the result statistically significant and regard the treatment as a win (Figure 2, panel 2). By choosing our threshold appropriately, we can achieve desired error rates regarding both false positives (incorrectly claiming a win) and false negatives (missing out on a win). In other words, we can reliably detect lifts that are truly positive, in spite of the fact that we can't observe their true values directly.
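As a minimal sketch of such a decision rule, assume the observed lift is approximately normal around the true lift and that we run a one-sided z-test at a 5% false positive rate. The article does not specify the exact test used at Etsy, so the function names, the one-sided form, and the default `alpha` below are illustrative assumptions.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def win_threshold(std_err: float, alpha: float = 0.05) -> float:
    """Smallest observed lift declared significant by a one-sided z-test."""
    return STD_NORMAL.inv_cdf(1.0 - alpha) * std_err


def is_win(observed_lift: float, std_err: float, alpha: float = 0.05) -> bool:
    """Apply the decision rule: a win is an observed lift above the threshold."""
    return observed_lift > win_threshold(std_err, alpha)


def power(true_lift: float, std_err: float, alpha: float = 0.05) -> float:
    """Probability that a treatment with this true lift is declared a win."""
    threshold = win_threshold(std_err, alpha)
    return 1.0 - NormalDist(true_lift, std_err).cdf(threshold)
```

Under these assumptions, `power(0.0, std_err)` equals `alpha` (the false positive rate), and one minus the power at a given positive true lift is the false negative rate for that lift.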

Knowing which treatments are wins is one thing, but estimating the size of their effects is another. Impact is a question of key importance for Etsy’s strategic and financial planning, and answering it is not trivial. As it turns out, naively trusting the observed lifts of the reported wins will generally lead to a substantial overestimation of their real impact (Figure 2, panel 3).

Every treatment that meets our winning criteria is a winner, but not all winners are created equal. A treatment whose true lift sits near or below the significance threshold can only cross it when noise happens to push its observed lift upward, so we can expect some of these less solid winners to have snuck in with values higher than their true lifts. In other words, conditional upon being reported as a win, the observed lift is now expected to overestimate its true lift. This is what we call the winner’s curse, well known in the scientific literature [1, 2, 3]: a provable form of selection bias, which leads us to overstate the value of our wins and thereby exaggerate their true impact.
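A small simulation can make this selection effect concrete. The sketch below replays the same hypothetical experiment many times with a fixed true lift and Gaussian noise, then compares the average observed lift across all replications with the average among those declared wins. The true lift, standard error, one-sided 5% threshold, and function name are assumptions chosen for illustration, not values from Etsy's experiments.

```python
import random
from statistics import NormalDist, fmean


def simulate_winners(true_lift, std_err, n_experiments=100_000, alpha=0.05, seed=0):
    """Replicate one experiment many times and average the observed lifts."""
    rng = random.Random(seed)
    threshold = NormalDist().inv_cdf(1.0 - alpha) * std_err
    # Each replication observes the true lift plus symmetric Gaussian noise.
    observed = [rng.gauss(true_lift, std_err) for _ in range(n_experiments)]
    # Selection step: keep only the replications declared wins.
    winners = [x for x in observed if x > threshold]
    return fmean(observed), fmean(winners), len(winners) / n_experiments


mean_all, mean_winners, win_rate = simulate_winners(true_lift=1.0, std_err=1.0)
print(f"average over all replications: {mean_all:.2f}")
print(f"average over winners only:     {mean_winners:.2f}")
print(f"share declared wins:           {win_rate:.2%}")
```

The average over all replications stays close to the true lift, while the average over winners alone lands well above it, even though no individual measurement is biased.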

The winner’s curse has nothing to do with human biases (confirmation bias, p-hacking, etc.). It is a systematic bias, inherent in our use of a selection protocol. And unless we want to see winning treatments consistently underperforming our too-high expectations for them, we need a principled way to correct for it.

## Breaking the curse

Theory tells us not to take the observed lifts of our winning experiments at face value. Since we expect them to exaggerate the truth, it seems natural to apply a discount to the observed lift to offset any overestimation. The name of the game is to determine how much the discount should be.

Having run countless thousands of experiments over the years, we've developed a pretty good sense of what believable lift values look like. In particular, we acknowledge that it is genuinely hard to move our success metrics by a meaningful amount, more so as Etsy’s services become increasingly mature over time, making the control experience an ever harder benchmark to beat. This is reflected in the high concentration of past observed lifts around zero, which suggests that most true lifts are likely to be small. On the other hand, we also believe that major breakthroughs are possible (e.g. adding a brand new service, shifting paradigms for Etsy’s search algorithm, etc.), however infrequent.

As probability distributions are essentially mathematical representations of beliefs, we can formalize our acquired understanding by fitting a statistical model on past historical lifts. The chosen model — inspired by Deng et al. [4] — mixes together light- and heavy-tailed distributions, thus capturing the higher plausibility of small and incremental lifts, while still leaving room for larger (but rarer) ones.

To recap, we have two forces at play: the observed lifts from a set of winning experiments (which we know are cursed to some degree) and our *prior* belief, based on past experiments, of what values of true lifts are plausible (Figure 3).

To combine these two beliefs into a single, coherent one, we take a Bayesian approach, which provides a framework for quantifying and updating knowledge probabilistically. Bayesian statistics enables formal answers to questions like, “Given our domain expertise and what was actually observed during an experiment, what value of true lift is the most credible?” At a high level, the process can be thought of as an algorithmic way for a prior belief to evolve into a *posterior* belief, on the basis of newly observed data. More specifically, we use a form of Gibbs sampling [5], a technique from the Markov chain Monte Carlo literature, to produce the full distribution of our posterior belief. Although we won't go into technical detail here, this posterior distribution allows for a more informed guess of the true lift that the winning treatment can be expected to produce (Figure 4).

The estimated lift given by our posterior belief can be thought of as a discounted version of the raw observed lift, where the discount combats the inflationary bias of the winner’s curse. The discounted lift behaves as a weighted average of the lift observed in the current experiment and the lift we anticipate based solely on our knowledge of past experiments. How much weight we put on the observed lift directly relates to how much we trust the experiment: the greater the precision of the observed lift (the smaller its standard deviation), the less skeptical we are about it, hence the more inclined we are to abandon our prior belief, and vice versa (Figure 5).
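The weighted-average behavior is exact in one simplified case: a Gaussian prior combined with Gaussian noise. The sketch below uses that simplification, which is not the mixture prior described in this article, to show how the weight on the observed lift grows as its standard deviation shrinks. The function name and the default prior parameters are hypothetical.

```python
def shrink(observed_lift, std_err, prior_mean=0.0, prior_sd=1.0):
    """Posterior mean of the true lift under a Gaussian prior (an assumption).

    Returns the posterior mean and the weight placed on the observed lift.
    """
    data_precision = 1.0 / std_err**2
    prior_precision = 1.0 / prior_sd**2
    # The more precise the experiment, the closer this weight gets to 1.
    weight = data_precision / (data_precision + prior_precision)
    posterior_mean = weight * observed_lift + (1.0 - weight) * prior_mean
    return posterior_mean, weight


for std_err in (2.0, 1.0, 0.5):
    estimate, weight = shrink(observed_lift=2.0, std_err=std_err)
    print(f"std_err={std_err}: weight on observation={weight:.2f}, estimate={estimate:.2f}")
```

With a prior centered at zero, the same observed lift is pulled less toward zero when it comes from a more precise experiment.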

One appeal of this method is that it effectively produces discounts that are tailored to each experiment’s respective credibility, as opposed to applying a fixed common haircut to every experiment equally. This flexibility can be tuned to ensure that we are not over-discounting the lifts of experiments whose effects are strong enough for the selection bias to be negligible (Figure 6).
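To illustrate how a prior mixing light and heavy tails can yield experiment-specific discounts, the sketch below approximates the posterior mean on a grid. This is a deliberately simpler stand-in for the Gibbs sampler mentioned above, and the prior itself (a narrow Gaussian mixed with a Cauchy), its weights and scales, and the grid bounds are all hypothetical choices rather than the fitted model.

```python
import math


def prior_density(theta, w_light=0.9, light_sd=0.5, heavy_scale=1.0):
    """Hypothetical mixture prior: narrow Gaussian plus heavy-tailed Cauchy."""
    light = math.exp(-0.5 * (theta / light_sd) ** 2) / (light_sd * math.sqrt(2 * math.pi))
    heavy = 1.0 / (math.pi * heavy_scale * (1.0 + (theta / heavy_scale) ** 2))
    return w_light * light + (1.0 - w_light) * heavy


def posterior_mean(observed_lift, std_err, lo=-30.0, hi=30.0, n_points=6001):
    """Posterior mean of the true lift, approximated on an evenly spaced grid."""
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    # Unnormalised posterior = prior * Gaussian likelihood of the observed lift.
    weights = [
        prior_density(t) * math.exp(-0.5 * ((observed_lift - t) / std_err) ** 2)
        for t in grid
    ]
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)


def discount(observed_lift, std_err):
    """Fraction of the observed lift removed by the posterior mean."""
    return 1.0 - posterior_mean(observed_lift, std_err) / observed_lift


for lift in (2.0, 8.0):
    print(f"observed lift {lift}: discount {discount(lift, std_err=1.0):.0%}")
```

With these assumed parameters, a borderline observed lift is discounted heavily, while a lift far out in the tail keeps most of its value because the heavy-tailed component makes large true lifts plausible.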

In summary, by discounting the observed lifts of reported wins, we are able to reliably mitigate the issue of the winner's curse. Our Bayesian methodology induces an adaptive discounting mechanism that appropriately reflects each experiment’s respective uncertainty (Figure 7).

## Large-scale experimentation: a curse and a blessing

Trustworthy experimentation requires us to be alert to more than just traditional *type I* and *type II* errors (false positive and false negative rates, respectively). The winner's curse is connected to the broader notion of *type M* error, also known as the exaggeration ratio: the factor by which the magnitude of an effect is overestimated [6]. This type of error can be especially prominent in experiments that are underpowered (when sample sizes are not large enough).
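The link between low power and exaggeration can be computed in closed form under simplifying assumptions. The sketch below assumes Gaussian noise, a positive true lift, and a one-sided significance threshold, and uses the mean of a truncated normal distribution. Note that Gelman and Carlin [6] define the ratio with a two-sided test and absolute values, so this one-sided variant and its function name are illustrative only.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def exaggeration_ratio(true_lift, std_err, alpha=0.05):
    """E[observed lift | declared a win] / true lift, for a positive true lift."""
    threshold = STD_NORMAL.inv_cdf(1.0 - alpha) * std_err
    a = (threshold - true_lift) / std_err  # standardised distance to the threshold
    win_probability = 1.0 - STD_NORMAL.cdf(a)  # the power of the experiment
    # Mean of a normal truncated from below at the threshold (inverse Mills ratio).
    mean_given_win = true_lift + std_err * STD_NORMAL.pdf(a) / win_probability
    return mean_given_win / true_lift


for true_lift in (0.5, 1.0, 2.0, 4.0):
    ratio = exaggeration_ratio(true_lift, std_err=1.0)
    print(f"true lift {true_lift}: winners overstate it by a factor of {ratio:.2f}")
```

Holding the noise fixed, the ratio approaches 1 as the true lift (and hence the power) grows, and it increases sharply for small true lifts.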

Closely related to this is the concept of *false discovery rate*, which introduces the idea that, even among treatments whose lifts appear statistically significant, a substantial portion could still turn out to have no actual effect [7]. Without proper precautions, false discovery is actually exacerbated as the number of treatments increases, which makes it of central concern in our age of high-throughput experimentation, where we routinely have hundreds of experiments running each quarter, of which only a small minority are likely to have meaningful impact.
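A back-of-the-envelope calculation shows how large this portion can be. The sketch below approximates the false discovery rate as the expected number of false wins divided by the expected number of all wins, assuming every experiment shares the same power and false positive rate. The function name and the example inputs are hypothetical.

```python
def expected_false_discovery_rate(share_with_effect, power, alpha=0.05):
    """Approximate share of significant results that come from null treatments."""
    false_wins = alpha * (1.0 - share_with_effect)  # nulls that pass by chance
    true_wins = power * share_with_effect  # real effects that are detected
    return false_wins / (false_wins + true_wins)


for share in (0.5, 0.1, 0.02):
    fdr = expected_false_discovery_rate(share_with_effect=share, power=0.8)
    print(f"{share:.0%} of treatments have an effect: about {fdr:.0%} of wins are false")
```

Under these assumptions, the rarer real effects are among the treatments being tested, the larger the share of reported wins that are false discoveries, even though the per-experiment false positive rate never changes.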

In a prior post, we discussed the complex and large-scale ecosystem of experimentation at Etsy. As it happens, at this kind of scale, discounting lifts can have benefits beyond questions of selection bias. We are effectively trying to find a needle of signal in a haystack of noise, and discounting — more commonly known as *shrinkage* in the statistical literature — is a well-established technique for improving the performance of estimators [8, 9]. (In much the same way, *regularization* is ubiquitous in machine learning when fitting models involving a large number of parameters, out of which only a few will end up mattering.)

A/B testing has gained popularity in large part from the simplicity (on paper at least) of its classical hypothesis testing framework. But we should always keep in mind that real-life experiments involve extra conditioning and selection procedures, which can alter the inference and must be properly accounted for. Fortunately, despite the additional challenges brought about by dealing with ever more experiments, we can still hope to achieve sensible insights by leveraging the fact that these experiments are run together as part of a broader collective, which opens the door to sharing learnings and borrowing information across multiple experiments.

## Acknowledgements

Special thanks to Kevin Gaan for helping with the internal review of this post, and to Michael Dietz, our external editor. I would also like to thank* Anastasia Erbe, Clare Burke, Gerald van den Berg, Michelle Borczuk, Samantha Emanuele, and Zach Armentrout from the Product Analytics and Strategic Finance team — as well as Alexander Tank and Julie Beckley from the Experimentation Science team — for their feedback and thoughtful discussions.

* *Listed by team and alphabetical order of first names.*

## References

[1] M. Lee, M. Sheng (2018). *Winner's curse: bias estimation for total effects of features in online controlled experiments*.

[2] E. W. van Zwet, E. A. Cator (2021). *The significance filter, the winner's curse and the need to shrink*.

[3] I. Andrews, T. Kitagawa, A. McCloskey (2019). *Inference on winners*.

[4] A. Deng, Y. Li, J. Lu, V. Ramamurthy (2021). *On post-selection inference in A/B testing*.

[5] D. A. van Dyk, T. Park (2008). *Partially collapsed Gibbs samplers: theory and methods*.

[6] A. Gelman, J. Carlin (2014). *Beyond power calculations: assessing type S (Sign) and type M (Magnitude) errors*.

[7] Y. Benjamini, Y. Hochberg (1995). *Controlling the false discovery rate: a practical and powerful approach to multiple testing*.

[8] D. Coey, T. Cunningham (2019). *Improving treatment effect estimators through experiment splitting*.

[9] B. Efron (2012). *Large-scale inference: empirical Bayes methods for estimation, testing, and prediction*.