# Imbalance detection for healthier experimentation

Deciding to launch a new product feature at Etsy requires teams to rigorously assess whether it will improve the user experience. This assessment is generally performed via a randomized experiment, also known as an *A/B test*, which quantifies the impact of the feature by presenting it to a random set of users. The point of A/B testing is not only to measure how Etsy’s success metrics are affected, but also — and most importantly — to reliably attribute an observed change in metrics to the new feature.

Despite their conceptual simplicity, A/B tests are complex to implement, and flawed setups can lead to incorrect conclusions. One problem that can arise in misconfigured experiments is *imbalance*, where the groups being compared consist of such dissimilar user populations that any attempt to credit the feature under test with a change in success metrics becomes questionable. Yet as our company grows, so does the number of experiments running concurrently across different teams, making manual validation of each experimental setup unsustainable. The need to preserve scientific validity while increasing our experimentation capacity has motivated us to build automated guardrails that can protect experimenters against reaching wrong conclusions.

In this post, we present the system we built to automatically detect imbalance, explain the statistical methodology that powers it, and describe how we scaled it to accommodate hundreds of experiments every day. We also reflect on the importance of building guardrails for our expanding experimentation platform in order to preserve its trustworthiness as experimentation becomes more widespread.

## Why does balance matter?

An A/B test aims to quantify the impact of a treatment (e.g. a new product feature) by measuring the change in a metric of interest (e.g. the percentage of users who make a purchase) between two groups of users: a *control group* who do not see the feature, and a *treatment group* who do. Provided that the groups are *balanced* — composed of user populations that are similar in every aspect but the *variant* presented to them, treatment or control — observing a significant difference in our metric of interest between the two groups is a reliable indication that the treatment has had a direct impact on it (Figure 1).

Balance can be engineered. By allocating users to treatment or control groups randomly and with no reliance on any user information, the groups are guaranteed to become increasingly similar as their sizes grow. In particular, we expect them to exhibit equal distributions of pre-treatment user attributes, for every possible attribute (e.g. user location, among others). This allows for a meaningful comparison of metric values between the groups (Figure 2).
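
This convergence is easy to see in a small simulation. The sketch below uses a synthetic population with made-up segment names and proportions (all illustrative, not Etsy data): users are assigned to variants with no reliance on their attributes, and the segment distributions of the two groups come out nearly identical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic population: each user has a pre-treatment attribute ("segment").
# The segment names and proportions here are purely illustrative.
n_users = 100_000
segments = rng.choice(["US", "EU", "APAC"], size=n_users, p=[0.5, 0.3, 0.2])

# Randomized allocation: assign each user to a variant with no reliance
# on any user information.
variants = rng.choice(["control", "treatment"], size=n_users)

# With large groups, the segment distributions should be nearly identical.
for variant in ["control", "treatment"]:
    mask = variants == variant
    props = {s: float(np.mean(segments[mask] == s)) for s in ["US", "EU", "APAC"]}
    print(variant, props)
```

With 100,000 users, the per-segment proportions of the two groups typically agree to within a fraction of a percentage point.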

In real life, of course, implementing randomized allocation involves non-trivial code, data pipelines, case-specific logic and so on, all of which are susceptible to errors despite best intentions. It is thus not uncommon to see experimental setups that inadvertently produce *imbalanced* groups, where the distributions of specific user attributes differ drastically between treatment and control. In these cases, a change in the metric cannot be read as the sole effect of the treatment, since disparities in other user attributes might also have had an impact on the outcome. This makes results from an imbalanced experiment ambiguous and unreliable (Figures 3 and 4).

Since balance is expected from a properly randomized allocation, the presence of imbalanced groups in an experiment is a red flag that warrants investigation. Observing uneven distributions of a particular (pre-treatment) user attribute could suggest even more alarming imbalance in other, unmeasured attributes (known as *confounding factors*), as well as deeper issues with how the experiment was implemented. Identifying such imbalances is therefore crucial to preventing experimenters from reaching invalid conclusions.

## Imbalance lurks in many places

Imbalance is related to but not quite the same as *bucketing skew* (also referred to in [1] as *sample ratio mismatch*), where the sizes of the groups differ significantly from their expected sizes. For instance, bucketing skew could consist of observing a treatment group that ends up much larger than the control, despite an intention to assign each user with equal probability to either group. The imbalance we are addressing here puts the emphasis on the composition of the groups and can be regarded as a collective assessment of bucketing skew at the level of different subgroups.

Although there is no comprehensive list of all possible causes of imbalance, there are typical patterns of experimental setups that tend to produce it. One common cause of imbalance is when the implementation of the treatment introduces unforeseen behaviors in the data pipelines. Another frequent source of imbalance can be the design of an experiment itself. This is especially true for experiments that select their users based on particular conditions. For instance, when the treatment is targeted to specific pages (e.g. enabling videos in the thumbnails of listing pages), it could seem preferable to restrict the experiment only to users who reach those pages. But consider what happens if the treatment unintentionally causes page loads to fail more often for users with unreliable internet connections. If our test only logs data when the targeted page loads successfully, we skew the treatment group against that subset of users. In contrast with the control, the treatment group will exhibit an overrepresentation of users from locations with faster internet connections, hindering a fair comparison.

Generally speaking, asymmetrical triggering conditions or data logging (sometimes referred to as *activation biases*) between treatment and control groups are quite susceptible to creating imbalance.

## Detecting imbalance by testing for dependence

Implementation and design flaws in A/B tests are difficult to preempt, more so as the volume and complexity of experiments increase. Our experience of this at Etsy led us to look for an automated solution: we wanted to build a generic detection system that could run autonomously and would be valid for any given experiment. Since data collected by flawed processes is rarely salvageable and online traffic is a precious resource, a key desideratum for the system was that it should warn our teams about potential imbalances as early in the lifetime of their experiments as possible.

How do we test for imbalance? Let’s start by looking at the data. Our user population is described through the lens of *segmentations*, which are different ways of partitioning a population into a finite number of distinct subgroups, according to some pre-treatment variables. For example, a segmentation by region would consist of several possible segments, one for each region. Given a particular segmentation and an experiment with N users, our data consists of N independent pairs of labels, indicating which segment and which variant each user belongs to. The data can be summarized by counting the number of users in each of the combinations of segment and variant. These counts are collected into what is known as a *contingency table* (Figures 5 and 6).

The contingency table fully captures the distribution of segments for each variant in our data. Asking whether the groups are balanced means asking whether the conditional distribution of segments, within a particular variant, remains the same regardless of what variant we look at. In statistical terms, this amounts to assessing whether the two variables “segment” and “variant” are *independent*. Therefore, the detection of imbalance can be formally cast as a test for the lack of independence between two variables.

The first thing to note is that, given our data, we know exactly what the *expected* contingency table would be if segment and variant were truly independent. Therefore, a natural way to quantify the lack of independence is to measure how far our *observed* table is from this *expected* table. We do so by using the well-designed discrepancy function presented in [2], which we denote as U. The important point here is that larger values of U signal stronger lack of independence, and thus greater imbalance (Figure 7).
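
The expected table is easy to compute from the observed one: under independence, each cell's expected count is (row total × column total) / N. For illustration, the sketch below measures discrepancy as the plain squared distance between observed and expected tables; this is only a simplified stand-in for the carefully bias-corrected U-statistic of [2], but it conveys the same idea of larger values signaling greater imbalance.

```python
import numpy as np

def expected_table(observed):
    """Expected counts if segment and variant were independent:
    (row total x column total) / grand total, for each cell."""
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    return row_totals * col_totals / observed.sum()

def discrepancy(observed):
    """Illustrative discrepancy: squared distance between the observed and
    expected tables. (The statistic of [2] is a bias-corrected refinement.)"""
    return float(((observed - expected_table(observed)) ** 2).sum())

# A visibly imbalanced 2x2 table: each cell's expected count is 50.
observed = np.array([[40, 60],
                     [60, 40]])
print(discrepancy(observed))  # → 400.0; larger values signal more imbalance
```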

Our next step is to decide when to declare a statistically significant imbalance. If we could know the values U can take when independence truly holds, then we would know which values are unusual enough to constitute evidence against a hypothesized independence. This is where an elegant property of independence comes to our rescue: if segment and variant are indeed independent, then all the possible permutations of the observed variant labels will have equal probabilities of occurring (Figure 8).

In light of this fact, we can perform what is called a *permutation test*. Each permutation of the variant labels induces a new contingency table, which produces its own value of U. By randomly permuting these labels M times (for a chosen large number M), we generate M values of U, all equally probable under independence. This collection of values gives us a sense of how U would be distributed if segment and variant were truly independent (Figure 9).

Since larger values of U indicate more imbalance, our decision rule declares the imbalance statistically significant if our observed U exceeds the upper 𝛼-quantile of the distribution, for some chosen 𝛼 that regulates how conservative we want to be with false detections. Equivalently, by defining the *p-value* as the proportion of values (out of M+1) that are at least as large as our observed U, we would declare significance if this p-value is less than 𝛼 (Figure 10).
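
Putting the pieces together, the whole test can be sketched in a few lines. This is the naive label-permuting version, using the same simplified squared-distance discrepancy as before in place of the statistic of [2]; all names and inputs are illustrative.

```python
import numpy as np

def contingency(segment_idx, variant_idx, n_segments, n_variants):
    # Count users in each (variant, segment) combination.
    table = np.zeros((n_variants, n_segments), dtype=int)
    np.add.at(table, (variant_idx, segment_idx), 1)
    return table

def discrepancy(observed):
    # Squared distance between observed and expected tables; a simplified
    # stand-in for the bias-corrected U-statistic of [2].
    row = observed.sum(axis=1, keepdims=True)
    col = observed.sum(axis=0, keepdims=True)
    expected = row * col / observed.sum()
    return float(((observed - expected) ** 2).sum())

def permutation_test(segment_idx, variant_idx, n_segments, n_variants,
                     n_permutations=999, seed=0):
    rng = np.random.default_rng(seed)
    u_obs = discrepancy(contingency(segment_idx, variant_idx,
                                    n_segments, n_variants))
    count = 0
    for _ in range(n_permutations):
        permuted = rng.permutation(variant_idx)
        u_perm = discrepancy(contingency(segment_idx, permuted,
                                         n_segments, n_variants))
        if u_perm >= u_obs:
            count += 1
    # p-value: proportion of the M+1 values (M permuted + 1 observed)
    # that are at least as large as the observed U.
    return (count + 1) / (n_permutations + 1)
```

On perfectly dependent data (every control user in one segment, every treatment user in the other) this returns a p-value near 1/(M+1), while on perfectly balanced data it returns 1.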

In addition to having optimal detection properties [3, 4], the main appeal of this permutation test is that it belongs to the family of *exact* tests. Unlike other common tests that rely on large-sample approximations (e.g. chi-squared test, G-test, etc.), the probability of a false detection with this permutation test is at most 𝛼, regardless of the sample size! This is particularly important given our desire to alert teams about imbalances as early as possible, while the number of data points in their experiments is still small.

## The importance of scalability

The previous section presented a reliable method to detect imbalance for a given experiment and a specific user segmentation: perform a permutation test to produce a p-value, then declare imbalance if the p-value passes a chosen threshold. Our next challenge is to scale this test so we can apply it to the dozens of segmentations and hundreds of experiments that Etsy runs every day, each experiment potentially involving up to tens of millions of users.

Our first consideration for scalability is purely computational. A naive implementation would store individual labels from all N users and perform the required permutation tests by successively permuting labels (possibly tens of millions), repeating that step about a hundred thousand times, and doing the entire procedure for every experiment and each segmentation. That would require quadrillions of operations for just a single day of data, every day, which would be prohibitively costly and would drastically slow our cadence of experimental results.

What makes permutation tests viable is the realization that: 1) all we need in order to compute the statistic U is a contingency table, not individual labels; 2) permuting individual labels induces a particular probability distribution of contingency tables, which can be derived analytically [5]; and 3) it is actually possible to sample from that distribution directly [6]. This leads to a huge improvement, since performing one permutation with R variants and C segments can now be reduced to sampling only R × C numbers instead of N. With our experiments typically involving two variants and about five segments (for a given segmentation), this means sampling only ten numbers per permutation rather than tens of millions.
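
To give a flavor of point 3, here is one simple, exact way to draw a table with the right distribution, using NumPy's multivariate hypergeometric sampler: each row's segment counts are drawn from the segment totals left over by the previous rows, which matches the distribution induced by randomly permuting variant labels. (Patefield's cell-by-cell algorithm [6], used in practice, is more efficient; the totals below are hypothetical.)

```python
import numpy as np

def sample_table(row_totals, col_totals, rng):
    """Draw one contingency table with the given margins, distributed as if
    the variant labels had been randomly permuted (i.e. under independence)."""
    remaining = np.array(col_totals, dtype=np.int64)
    table = np.zeros((len(row_totals), len(col_totals)), dtype=np.int64)
    for i, n in enumerate(row_totals[:-1]):
        # Segment counts of this variant: a multivariate hypergeometric draw
        # of n users from the segments not yet consumed by previous rows.
        table[i] = rng.multivariate_hypergeometric(remaining, n)
        remaining -= table[i]
    table[-1] = remaining  # the last row is forced by the margins
    return table

rng = np.random.default_rng(seed=0)
# Two variants of 5M users each and five segments: one "permutation" now
# costs sampling a handful of numbers instead of shuffling 10M labels.
row_totals = [5_000_000, 5_000_000]
col_totals = [4_000_000, 3_000_000, 1_500_000, 1_000_000, 500_000]
table = sample_table(row_totals, col_totals, rng)
print(table)
```

Recent SciPy versions also expose margin-preserving table sampling (including Patefield's method) as `scipy.stats.random_table`.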

We obtain another major speed gain by leveraging the parallelizable structure of the problem: all our permutation tests can be independently run in parallel (one test per experiment per segmentation); and for a given test, all permutations can be drawn in parallel as well. This allows us to leverage distributed computing to perform the tests at scale.

However, being able to run thousands of permutation tests at high speed is not good enough. With such a large number of tests at play, we now need to address the scalability question from a statistical perspective: the more tests we perform, the more susceptible we are to raising false alerts, unless we take active precautions. This is known as the *multiple testing problem*, which we mitigate by using the well-established Benjamini-Hochberg (BH) procedure [7, 8]. Rather than interpreting each test separately, the BH procedure collects the p-values from all the tests, and applies an adaptive thresholding rule to determine which ones are statistically significant while guaranteeing a desired *false discovery rate* (i.e. bounding the expected ratio of the number of false alerts over the number of alerts).
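
The BH procedure itself is short enough to sketch in full: sort the p-values, compare the k-th smallest against its adaptive threshold k/m × 𝛼, and declare significant everything up to the largest k that passes. The p-values below are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of the p-values declared significant by the
    Benjamini-Hochberg procedure at false discovery rate alpha."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    sorted_p = p[order]
    # Compare the k-th smallest p-value with its adaptive threshold k/m * alpha.
    below = sorted_p <= alpha * np.arange(1, m + 1) / m
    significant = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest k passing its threshold
        significant[order[: k + 1]] = True
    return significant

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9]
print(benjamini_hochberg(p_values, alpha=0.05))
```

Note that 0.039 would pass a naive per-test threshold of 0.05, yet BH (rightly) declines to flag it once the whole collection of p-values is taken into account.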

With M permutations, the smallest possible p-value that a permutation test can generate by construction is equal to 1/(M+1). On the other hand, the smallest threshold involved in a BH procedure decreases with the number of tests performed. In other words, as the number of tests we can computationally afford increases, corrective actions to preserve statistical validity become stricter and require us to also increase the number of permutations within each test, which in turn makes each test more computationally intensive. This nicely illustrates how computational scalability and statistical scalability are closely intertwined.
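
The arithmetic behind this trade-off is simple. With hypothetical (not actual Etsy) volumes of, say, 300 experiments and 40 segmentations tested daily:

```python
import math

# Hypothetical scale: 300 experiments x 40 segmentations tested daily.
n_tests = 300 * 40
alpha = 0.05

# The smallest BH threshold across n_tests tests is alpha / n_tests, so each
# permutation test must be able to produce p-values at least that small.
smallest_threshold = alpha / n_tests

# A test with M permutations cannot yield a p-value below 1 / (M + 1),
# so we need 1 / (M + 1) <= alpha / n_tests, i.e. M >= n_tests / alpha - 1.
min_permutations = math.ceil(n_tests / alpha) - 1
print(min_permutations)  # → 239999
```

At these illustrative volumes, each test already needs on the order of a quarter of a million permutations, and the requirement grows linearly with the number of tests.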

By combining all the above ingredients, we built an imbalance detection system that is scalable both computationally and statistically (Figure 11).

Our pipeline begins by collecting the number of users for each combination of experiment, segment, and variant. Storage requirements can be kept to a minimum, since we only need access to aggregated counts of users rather than individual labels. These counts constitute the inputs of our permutation tests, which are performed via an implementation in Spark that takes advantage of the parallel computing and efficient sampling tricks presented earlier. The outputs of the tests are then post-processed collectively to ensure an acceptable rate of false alerts.

Finally, detected imbalances trigger warning banners on the concerned experiments’ dashboards, so that teams at Etsy get automatically informed of experiments that show signs of imbalance (Figure 12).

Upon detecting an imbalance, our system prompts experimenters to inspect the affected segmentations and provides a tool to visualize the corresponding distributions of segments. These inspections can then lead to further actions (e.g. interrupting the experiment) depending on the severity of the situation.

## Impact and reflections

Thanks to our automated detection system, Etsy teams can now receive timely and actionable alerts when suspicious imbalances arise in their experiments. These alerts can help protect them against reaching flawed conclusions.

Considering our growing number of experiments, monitoring tools and guardrails have become vital to preserving the trustworthiness of experimental results. That places considerations of scalability — both computational *and* statistical — at the center of the advancement of our experimentation platform.

As seen with permutation tests, the question of scalability reveals a subtle interplay between engineering and statistics. On the one hand, statistical methods require clever engineering in order to run at acceptable speeds on our large amounts of data. On the other hand, as these methods become computationally more scalable and get applied to an increasing number of experiments, we enter a whole new realm of statistical issues, with the previously discussed *multiple testing problem* only being the tip of the iceberg. Many of these statistical issues are non-existent at the level of individual experiments but start to manifest themselves once we look at experiments collectively, and call for further statistical methods to combat them.

These challenges are a good reminder that understanding how engineering and statistics feed off each other is not only educational, but truly indispensable for enabling trustworthy learnings and sustainable growth.

## Acknowledgements

We would like to thank Alaina Waagner, Allison McKnight, Anastasia Erbe, Ercan Yildiz, Gerald van den Berg, John Mapelli, Kevin Gaan, MaryKate Guidry, Michael Dietz, Mike Lang, Nick Solomon, Samantha Emanuele, and Zach Armentrout for their thoughtful feedback and helpful discussions.

## References

[1] R. Kohavi, D. Tang, Y. Xu (2020). *Trustworthy online controlled experiments: a practical guide to A/B testing*.

[2] T. B. Berrett, R. J. Samworth (2021). *USP: an independence test that improves on Pearson’s chi-squared and the G-test*.

[3] T. B. Berrett, I. Kontoyiannis, R. J. Samworth (2021). *Optimal rates for independence testing via U-statistic permutation tests*.

[4] I. Kim, S. Balakrishnan, L. Wasserman (2020). *Minimax optimality of permutation tests*.

[5] J. H. Halton (1969). *A rigorous derivation of the exact contingency formula*.

[6] W. M. Patefield (1981). *An efficient method of generating random R × C tables with given row and column totals*.

[7] Y. Benjamini, Y. Hochberg (1995). *Controlling the false discovery rate: a practical and powerful approach to multiple testing*.

[8] Y. Benjamini, D. Yekutieli (2001). *The control of the false discovery rate in multiple testing under dependency*.