
Code as Craft


Imbalance detection for healthier experimentation


Deciding to launch a new product feature at Etsy requires teams to rigorously assess whether it will improve the user experience. This assessment is generally performed via a randomized experiment, also known as an A/B test, which quantifies the impact of the feature by presenting it to a random set of users. The point of A/B testing is not only to measure how Etsy’s success metrics are affected, but also — and most importantly — to reliably attribute an observed change in metrics to the new feature.

Despite their conceptual simplicity, A/B tests are complex to implement, and flawed setups can lead to incorrect conclusions. One problem that can arise in misconfigured experiments is imbalance, where the groups being compared consist of such dissimilar user populations that any attempt to credit the feature under test with a change in success metrics becomes questionable. Yet as our company grows, so does the number of experiments running concurrently across different teams, making manual validation of each experimental setup unsustainable. The need to preserve scientific validity while increasing our experimentation capacity has motivated us to build automated guardrails that can protect experimenters against reaching wrong conclusions.

In this post, we present the system we built to automatically detect imbalance, explain the statistical methodology that powers it, and describe how we scaled it to accommodate hundreds of experiments every day. We also reflect on the importance of building guardrails for our expanding experimentation platform in order to preserve its trustworthiness as experimentation becomes more widespread.

Why does balance matter?

An A/B test aims to quantify the impact of a treatment (e.g. a new product feature) by measuring the change in a metric of interest (e.g. the percentage of users who make a purchase) between two groups of users: a control group who do not see the feature, and a treatment group who do. Provided that the groups are balanced — composed of user populations that are similar in every aspect but the variant presented to them, treatment or control — observing a significant difference in our metric of interest between the two groups is a reliable indication that the treatment has had a direct impact on it (Figure 1).

Figure 1. If the only differentiating trait between the treatment group (labeled “on”) and the control group (labeled “off”) is the variant they are presented (i.e. their exposure or non-exposure to the treatment, symbolized by the flag), then we can confidently conclude that a change in our metric of interest is caused by the treatment.

Balance can be engineered. By allocating users to treatment or control groups randomly and with no reliance on any user information, the groups are guaranteed to become increasingly similar as their sizes grow. In particular, we expect them to exhibit equal distributions of pre-treatment user attributes, for every possible attribute (e.g. user location, among others). This allows for a meaningful comparison of metric values between the groups (Figure 2).
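To make the allocation mechanism concrete, here is a deliberately minimal sketch of attribute-blind random assignment, assuming a hypothetical 50/50 split. It illustrates the principle only and is not our production bucketing code.

```python
import random

def assign_variant(rng: random.Random) -> str:
    """Assign a user to treatment ("on") or control ("off") uniformly at
    random, without consulting any user attribute."""
    return "on" if rng.random() < 0.5 else "off"

rng = random.Random(42)  # illustrative seed and group count, not a real experiment
assignments = [assign_variant(rng) for _ in range(10_000)]
print(assignments.count("on"), assignments.count("off"))  # roughly 5,000 in each group
```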

Figure 2. In the presence of other user attributes (e.g. US-based users, colored blue, vs. non-US-based users in green), we may still credit the treatment (symbolized by the flag) with the change in metric value, as long as the attributes have the same distributions in both groups. The intuition is that, thanks to such balance, the effects from user attributes offset each other when computing the difference in metric values, leaving only the treatment as the main contributor to the observed change.

In real life, of course, implementing randomized allocation involves non-trivial code, data pipelines, case-specific logic and so on, all of which are susceptible to errors despite best intentions. It is thus not uncommon to see experimental setups that inadvertently produce imbalanced groups, where the distributions of specific user attributes differ drastically between treatment and control. In these cases, a change in the metric cannot be read as the sole effect of the treatment, since disparities in other user attributes might also have had an impact on the outcome. This makes results from an imbalanced experiment ambiguous and unreliable (Figures 3 and 4).

Figure 3. Example of imbalance in an experiment. The distributions of user regions in the control group (top row, labeled “off”) and the treatment group (bottom row, labeled “on”) are visibly different. Such an imbalance is not intended by the randomized allocation mechanism and casts doubt on the validity of the experiment.
Figure 4. When a pre-treatment user attribute exhibits different distributions between the treatment group (labeled “on”) and the control group (labeled “off”), the contribution of the treatment (symbolized by the flag) to the change in metric values becomes ambiguous. The causes of that change are now unclear since we are comparing two groups made of dissimilar user populations, whose metric values would naturally differ even in the complete absence of any treatment.

Since balance is expected from a properly randomized allocation, the presence of imbalanced groups in an experiment is a red flag that warrants investigation. Observing uneven distributions of a particular (pre-treatment) user attribute could suggest even more alarming imbalance in other, unmeasured attributes (known as confounding factors), as well as deeper issues with how the experiment was implemented. Identifying such imbalances is therefore crucial to preventing experimenters from reaching invalid conclusions.

Imbalance lurks in many places

Imbalance is related to but not quite the same as bucketing skew (also referred to in [1] as sample ratio mismatch), where the sizes of the groups differ significantly from their expected sizes. For instance, bucketing skew could consist of observing a treatment group that ends up much larger than the control, despite an intention to assign each user with equal probability to either group. The imbalance we are addressing here puts the emphasis on the composition of the groups and can be regarded as a collective assessment of bucketing skew at the level of different subgroups.
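For readers curious about what a bucketing-skew check can look like in practice, the snippet below applies a chi-squared goodness-of-fit test to made-up group counts. This is one common way to test for sample ratio mismatch, not necessarily the exact check we run.

```python
from scipy.stats import chisquare

# Hypothetical counts for an experiment intended to split 10,000 users 50/50.
observed = [5_250, 4_750]   # users observed in treatment ("on") and control ("off")
expected = [5_000, 5_000]   # group sizes expected under the intended allocation

statistic, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:
    print(f"Possible sample ratio mismatch (p = {p_value:.2g})")
```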

Although there is no comprehensive list of all possible causes of imbalance, there are typical patterns of experimental setups that tend to produce it. One common cause of imbalance is when the implementation of the treatment introduces unforeseen behaviors in the data pipelines. Another frequent source of imbalance can be the design of an experiment itself. This is especially true for experiments that select their users based on particular conditions. For instance, when the treatment is targeted to specific pages (e.g. enabling videos in the thumbnails of listing pages), it could seem preferable to restrict the experiment only to users who reach those pages. But consider what happens if the treatment unintentionally causes page loads to fail more often for users with unreliable internet connections. If our test only logs data when the targeted page loads successfully, we skew the treatment group against that subset of users. In contrast with the control, the treatment group will exhibit an overrepresentation of users from locations with faster internet connections, hindering a fair comparison.

Generally speaking, asymmetric triggering conditions or data logging (sometimes referred to as activation biases) between the treatment and control groups are especially prone to creating imbalance.

Detecting imbalance by testing for dependence

Implementation and design flaws in A/B tests are difficult to preempt, more so as the volume and complexity of experiments increase. Our experience of this at Etsy led us to look for an automated solution: we wanted to build a generic detection system that could run autonomously and would be valid for any given experiment. Since data collected by flawed processes is rarely salvageable and online traffic is a precious resource, a key desideratum for the system was that it should warn our teams about potential imbalances as early in the lifetime of their experiments as possible.

How do we test for imbalance? Let’s start by looking at the data. Our user population is described through the lens of segmentations, which are different ways of partitioning a population into a finite number of distinct subgroups, according to some pre-treatment variables. For example, a segmentation by region would consist of several possible segments, one for each region. Given a particular segmentation and an experiment with N users, our data consists of N independent pairs of labels, indicating which segment and which variant each user belongs to. The data can be summarized by counting the number of users in each of the combinations of segment and variant. These counts are collected into what is known as a contingency table (Figures 5 and 6).

Figure 5. Given an experiment with N users, our data consists of N independent pairs of labels. Each pair indicates the segment (represented by different colors) and the variant (labeled “on” for treatment or “off” for control) of the corresponding user. Segments can take any finite number of values (3 here for illustration).
Figure 6. The contingency table consists of 1 column per possible segment and 1 row per possible variant, where each cell indicates the number of users having the corresponding combination of segment and variant. We indicate the total number of users in each variant to help build intuition (but that last column is not part of the contingency table). This illustration arbitrarily uses 2 variants, 3 segments, and N = 10,000 users (more generally, with R variants and C segments, the contingency table would be of size R by C).
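To make the data layout concrete, the short sketch below (using pandas, with randomly generated labels standing in for real users) collapses N pairs of (segment, variant) labels into a contingency table of counts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_users = 10_000

# Hypothetical per-user labels: one segment and one variant for each user.
segments = rng.choice(["US", "EU", "Other"], size=n_users, p=[0.5, 0.3, 0.2])
variants = rng.choice(["off", "on"], size=n_users)

# One row per variant, one column per segment; each cell counts users.
contingency_table = pd.crosstab(index=variants, columns=segments)
print(contingency_table)
```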

The contingency table fully captures the distribution of segments for each variant in our data. Asking whether the groups are balanced means asking whether the conditional distribution of segments, within a particular variant, remains the same regardless of what variant we look at. In statistical terms, this amounts to assessing whether the two variables “segment” and “variant” are independent. Therefore, the detection of imbalance can be formally cast as a test for the lack of independence between two variables.

The first thing to note is that, given our data, we know exactly what the expected contingency table would be if segment and variant were truly independent. Therefore, a natural way to quantify the lack of independence is to measure how far our observed table is from this expected table. We do so by using the well-designed discrepancy function presented in [2], which we denote as U. The important point here is that larger values of U signal stronger lack of independence, and thus greater imbalance (Figure 7).

Figure 7. Given our observed contingency table (top), we can derive what the expected contingency table would be if segment and variant were independent (bottom). The statistic U captures a notion of distance between these two tables, hence larger values of U indicate larger lack of independence. For curious readers, the exact expression of U can be found in [2]. In this illustration, we assume equal probabilities of allocation to the treatment and control groups (respectively labeled “on” and “off”), so the expected contingency table is obtained by keeping the total numbers of units in each segment fixed and forming equally sized groups within each segment.

Our next step is to decide when to declare a statistically significant imbalance. If we could know the values U can take when independence truly holds, then we would know which values are unusual enough to constitute evidence against a hypothesized independence. This is where an elegant property of independence comes to our rescue: if segment and variant are indeed independent, then all the possible permutations of the observed variant labels will have equal probabilities of occurring (Figure 8).

Figure 8. Under the assumption that segment and variant are independent, all the possible permutations of the observed variant labels share the same probability of occurrence. The animation illustrates some possible permutations of the labels.

In light of this fact, we can perform what is called a permutation test. Each permutation of the variant labels induces a new contingency table, which produces its own value of U. By randomly permuting these labels M times (for a chosen large number M), we generate M values of U, all equally probable under independence. This collection of values gives us a sense of how U would be distributed if segment and variant were truly independent (Figure 9).

Figure 9. Construction of the distribution of U under independence by randomly permuting the variant labels. Each permutation (top-left) induces a particular contingency table (bottom-left) and a corresponding value of U (moving light-blue ribbon on the right). The simulated values of U are collected to build the histogram on the right (values of U on the x-axis, their frequencies on the y-axis). The animation shows 1 permutation per frame (sped up after 30 permutations), with N = 10,000 users and M = 1,000 permutations (arbitrarily chosen for the sake of illustration).

Since larger values of U indicate more imbalance, our decision rule declares the imbalance statistically significant if our observed U exceeds the upper 𝛼-quantile of the distribution, for some chosen 𝛼 that regulates how conservative we want to be with false detections. Equivalently, by defining the p-value as the proportion of values (out of M+1) that are at least as large as our observed U, we would declare significance if this p-value is less than 𝛼 (Figure 10).

Figure 10. Given the distribution of U under the assumption of independence between segment and variant, we declare the imbalance statistically significant if our observed U lands inside the rejection region, defined as the upper tail of the distribution with level 𝛼 (for a chosen 𝛼 between 0 and 1). In other words, we reject our hypothesis of independence if the observed U is greater than 100 × (1 - 𝛼) percent of the (M+1) values. This is equivalent to rejecting our hypothesis when the p-value is less than 𝛼, where the p-value is computed by adding up the heights of all the blue bars on the right side of the observed U.
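Putting the last few steps together, here is a simplified, self-contained version of the permutation test. Since the exact expression of U lives in [2], the sketch uses a chi-squared-style distance between the observed and expected tables as a stand-in discrepancy; everything else (the label permutations and the p-value computed out of M+1 values) follows the procedure described above.

```python
import numpy as np


def contingency(variants: np.ndarray, segments: np.ndarray) -> np.ndarray:
    """Count users for each (variant, segment) combination."""
    table = np.zeros((variants.max() + 1, segments.max() + 1), dtype=np.int64)
    np.add.at(table, (variants, segments), 1)
    return table


def discrepancy(table: np.ndarray) -> float:
    """Distance between the observed table and the table expected under
    independence (a chi-squared-style stand-in for the statistic U of [2])."""
    expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0, keepdims=True) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())


def permutation_p_value(variants, segments, n_permutations=1_000, seed=0):
    """p-value of the permutation test of independence between the two labels."""
    rng = np.random.default_rng(seed)
    observed = discrepancy(contingency(variants, segments))
    at_least_as_large = 1  # the observed value counts as one of the M + 1 values
    for _ in range(n_permutations):
        permuted = rng.permutation(variants)  # permute the variant labels
        at_least_as_large += discrepancy(contingency(permuted, segments)) >= observed
    return at_least_as_large / (n_permutations + 1)  # smallest possible value: 1/(M+1)


# Hypothetical integer-coded labels: variants 0/1 ("off"/"on"), segments 0..2.
rng = np.random.default_rng(1)
variants = rng.integers(0, 2, size=10_000)
segments = rng.integers(0, 3, size=10_000)
print(permutation_p_value(variants, segments))  # labels independent by construction: large p-value expected
```

The stand-in discrepancy is only for illustration; the optimality guarantees discussed next rely on the statistic from [2], which is what our system actually computes.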

In addition to having optimal detection properties [3, 4], the main appeal of this permutation test is that it belongs to the family of exact tests. Unlike other common tests that rely on large-sample approximations (e.g. chi-squared test, G-test, etc.), the probability of a false detection with this permutation test is at most 𝛼 ... regardless of the sample size! This is particularly important given our desire to alert teams about imbalances as early as possible, while the number of data points in their experiments is still small.

The importance of scalability

The previous section presented a reliable method to detect imbalance for a given experiment and a specific user segmentation: perform a permutation test to produce a p-value, then declare imbalance if the p-value passes a chosen threshold. Our next challenge is to scale this test so we can apply it to the dozens of segmentations and hundreds of experiments that Etsy runs every day, each experiment potentially involving up to tens of millions of users.

Our first consideration for scalability is purely computational. A naive implementation would store individual labels from all N users and perform the required permutation tests by successively permuting labels (possibly tens of millions), repeating that step about a hundred thousand times, and doing the entire procedure for every experiment and each segmentation. That would require quadrillions of operations for just a single day of data, every day, which would be prohibitively costly and would drastically slow our cadence of experimental results.

What makes permutation tests viable is the realization that: 1) all we need in order to compute the statistic U is a contingency table, not individual labels; 2) permuting individual labels induces a particular probability distribution of contingency tables, which can be derived analytically [5]; and 3) it is actually possible to sample from that distribution directly [6]. This leads to a huge improvement, since performing one permutation with R variants and C segments can now be reduced to sampling only R × C numbers instead of N. With our experiments typically involving two variants and about five segments (for a given segmentation), this means sampling only ten numbers per permutation rather than tens of millions.
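As a rough illustration of points 2 and 3, consider the common two-variant case: permuting the variant labels while keeping both margins fixed is equivalent to drawing the treatment row of the table from a multivariate hypergeometric distribution, which NumPy can sample directly. The sketch below uses made-up margins and is a simplification of the idea, not our Spark implementation; the general R × C case relies on Patefield's algorithm [6].

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical margins of the observed contingency table.
segment_totals = np.array([5_000, 3_000, 2_000])  # users per segment (column totals)
n_treatment = 5_000                               # users in the treatment group (row total)

# Each draw plays the role of one label permutation: it yields the treatment
# row of a permuted table directly, without touching any individual label.
treatment_rows = rng.multivariate_hypergeometric(
    colors=segment_totals, nsample=n_treatment, size=1_000
)
control_rows = segment_totals - treatment_rows    # the control row follows from the margins
```

Each sampled pair of rows forms one permuted contingency table that can be fed to the discrepancy function, at a fraction of the cost of permuting tens of millions of labels.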

We obtain another major speed gain by leveraging the parallelizable structure of the problem: all our permutation tests can be independently run in parallel (one test per experiment per segmentation); and for a given test, all permutations can be drawn in parallel as well. This allows us to leverage distributed computing to perform the tests at scale.

However, being able to run thousands of permutation tests at high speed is not good enough. With such a large number of tests at play, we now need to address the scalability question from a statistical perspective: the more tests we perform, the more susceptible we are to raising false alerts, unless we take active precautions. This is known as the multiple testing problem, which we mitigate by using the well-established Benjamini-Hochberg (BH) procedure [7, 8]. Rather than interpreting each test separately, the BH procedure collects the p-values from all the tests, and applies an adaptive thresholding rule to determine which ones are statistically significant while guaranteeing a desired false discovery rate (i.e. bounding the expected ratio of the number of false alerts over the number of alerts).
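For concreteness, a minimal implementation of the BH step-up rule could look like the following sketch (the p-values are made up; in our pipeline this step runs on the outputs of all the permutation tests).

```python
import numpy as np


def benjamini_hochberg(p_values, fdr=0.05):
    """Boolean mask of the tests declared significant by the Benjamini-Hochberg
    step-up procedure at false discovery rate `fdr`."""
    p_values = np.asarray(p_values)
    m = len(p_values)
    order = np.argsort(p_values)
    thresholds = fdr * np.arange(1, m + 1) / m    # threshold for the k-th smallest p-value
    below = p_values[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        last = np.nonzero(below)[0].max()         # largest k whose p-value clears its threshold
        significant[order[: last + 1]] = True     # reject it and every smaller p-value
    return significant


# Hypothetical p-values from five permutation tests:
print(benjamini_hochberg([0.001, 0.21, 0.04, 0.0002, 0.6]))  # flags the tests with p = 0.0002 and 0.001
```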

With M permutations, the smallest possible p-value that a permutation test can generate by construction is equal to 1/(M+1). On the other hand, the smallest threshold involved in a BH procedure decreases with the number of tests performed. In other words, as the number of tests we can computationally afford increases, corrective actions to preserve statistical validity become stricter and require us to also increase the number of permutations within each test, which in turn makes each test more computationally intensive. This nicely illustrates how computational scalability and statistical scalability are closely intertwined.
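To put rough numbers on this interplay (purely for illustration): with m tests, the smallest BH threshold is 𝛼/m, so a permutation test can only clear it if its smallest attainable p-value, 1/(M+1), is at least that small.

```latex
\frac{1}{M + 1} \le \frac{\alpha}{m}
\quad\Longleftrightarrow\quad
M \ge \frac{m}{\alpha} - 1,
\qquad \text{e.g. } m = 1000,\ \alpha = 0.05 \;\Rightarrow\; M \ge 19\,999 .
```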

By combining all the above ingredients, we built an imbalance detection system that is scalable both computationally and statistically (Figure 11).

Figure 11. Bird's-eye view of our imbalance detection system. Given the number of users in each segment and variant for each experiment, permutation tests are performed in parallel, while accounting for multiple testing. Teams are then notified of detected imbalances and can inspect the segment distributions in their experiments.

Our pipeline begins by collecting the number of users for each combination of experiment, segment, and variant. Storage requirements can be kept to a minimum, since we only need access to aggregated counts of users rather than individual labels. These counts constitute the inputs of our permutation tests, which are performed via an implementation in Spark that takes advantage of the parallel computing and efficient sampling tricks presented earlier. The outputs of the tests are then post-processed collectively to ensure an acceptable rate of false alerts.

Finally, detected imbalances trigger warning banners on the concerned experiments’ dashboards, so that teams at Etsy get automatically informed of experiments that show signs of imbalance (Figure 12).

Figure 12. Detected imbalances are surfaced directly on the monitoring pages of experiments via a warning banner (top picture). This banner indicates which segmentations are suspicious, and invites experimenters to inspect the distributions of segments using a dedicated visualization tool (bottom picture). Specific details about the experiment and segmentations are inessential to this illustration and have been redacted.

Upon detecting an imbalance, our system prompts experimenters to inspect the affected segmentations and provides a tool to visualize the corresponding distributions of segments. These inspections can then lead to further actions (e.g. interrupting the experiment) depending on the severity of the situation.

Impact and reflections

Thanks to our automated detection system, Etsy teams can now receive timely and actionable alerts when suspicious imbalances arise in their experiments. These alerts can help protect them against reaching flawed conclusions.

Considering our growing number of experiments, monitoring tools and guardrails have become vital to preserving the trustworthiness of experimental results. That places considerations of scalability — both computational and statistical — at the center of the advancement of our experimentation platform.

As seen with permutation tests, the question of scalability reveals a subtle interplay between engineering and statistics. On the one hand, statistical methods require clever engineering in order to run at acceptable speeds on our large amounts of data. On the other hand, as these methods become computationally more scalable and get applied to an increasing number of experiments, we enter a whole new realm of statistical issues, with the previously discussed multiple testing problem only being the tip of the iceberg. Many of these statistical issues are non-existent at the level of individual experiments but start to manifest themselves once we look at experiments collectively, and call for further statistical methods to combat them.

These challenges are a good reminder that understanding how engineering and statistics feed off each other is not only educational, but truly indispensable for enabling trustworthy learnings and sustainable growth.

Acknowledgements

We would like to thank Alaina Waagner, Allison McKnight, Anastasia Erbe, Ercan Yildiz, Gerald van den Berg, John Mapelli, Kevin Gaan, MaryKate Guidry, Michael Dietz, Mike Lang, Nick Solomon, Samantha Emanuele, and Zach Armentrout for their thoughtful feedback and helpful discussions.

References

[1] R. Kohavi, D. Tang, Y. Xu (2020). Trustworthy online controlled experiments: a practical guide to A/B testing.

[2] T. B. Berrett, R. J. Samworth (2021). USP: an independence test that improves on Pearson’s chi-squared and the G-test.

[3] T. B. Berrett, I. Kontoyiannis, R. J. Samworth (2021). Optimal rates for independence testing via U-statistic permutation tests.

[4] I. Kim, S. Balakrishnan, L. Wasserman (2020). Minimax optimality of permutation tests.

[5] J. H. Halton (1969). A rigorous derivation of the exact contingency formula.

[6] W. M. Patefield (1981). An efficient method of generating random R × C tables with given row and column totals.

[7] Y. Benjamini, Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing.

[8] Y. Benjamini, D. Yekutieli (2001). The control of the false discovery rate in multiple testing under dependency.