Code as Craft

How Etsy Uses Thermodynamics to Help You Search


Etsy shoppers love the large and diverse selection of our marketplace. But, for those who don’t know exactly what they’re looking for, the sheer number and variety of items available can be more frustrating than delightful. In July, we introduced a new user interface which surfaces the top categories for a search request to help users explore the results for queries like “gift.” Searchers who issue broad queries like this often don’t have a specific type of item in mind, and are especially likely to finish their visit empty-handed. Our team lead, Gio, described our motivations and process in an (excellent) blog post last month, which gives more background on the project. In this post, I’ll focus on how we developed and iterated on our heuristic for classifying queries as "broad."

Our navigable interface, shown for a query for “geeky gift”

Quantifying “broadness”

When I describe what I’m working on to people outside the team, they often jump in with a guess about how we use machine learning techniques to determine, in code, which queries are broad. While we could have used complex, offline signals like click or purchasing behavior to learn which queries should trigger the category interface, we actually base the decision on a single calculation, evaluated entirely at runtime, which uses very basic statistics about the search result set.

There have been several advantages to sticking with a simpler metric. By avoiding query-specific behavioral signals, our approach works for all languages and long-tail queries out of the gate. It’s performant and relatively easy to debug. It’s also (knock on wood) stable and easy to maintain, with very few external dependencies or moving parts. I'll explain how we do it, and arguably justify the title of this post in the process.

Let’s take “geeky” as an example of a broad query, one that tells us very little about what type of item the user is looking for. Jewelry is the top category for “geeky,” but there are many items in all of the top-level categories.

[Chart: top categories for “geeky” — results spread across many top-level categories]

Compare to the distribution of results for “geeky mug,” which are predictably concentrated in the Home & Living category.

[Chart: top categories for “geeky mug” — results concentrated in Home & Living]

Put plainly, the calculation measures how spread out across the marketplace the items returned for a query are. The distribution of results for “geeky” suggests that the user might benefit from seeing the top categories, which demonstrate the breadth of geeky paraphernalia available on the site, from a periodic table-patterned bow tie to a “cutie pi” mug. The distribution for “geeky mug” is dominated by one category, and shouldn’t trigger the category interface.

The categories shown for a query for “geeky”

Doing the math

In order to quantify how “spread out” items are, we start by taking the number of results returned for the query in each of the top-level categories and deriving the probability that an item is in each category. Since 20% of the items returned are in the Jewelry category and 15% of items are in the Accessories category, the probability values for Jewelry and Accessories would be .2 and .15 respectively. We use these values as the inputs to the Shannon entropy formula:

$$H = -\sum_{i} p_i \log p_i$$

This formula is a measure of the disorder of a probability distribution. It’s essentially equivalent to the formula used to calculate the thermodynamic entropy of a physical system, which models a similar concept. For our purposes, let $r_t$ be the total number of results and $r_i$ be the number of results for category $i$. Then the probability value in the above equation is $p_i = r_i / r_t$, and the entropy of the distribution of a search result set across its categories can be expressed as:

$$H = -\sum_{i} \frac{r_i}{r_t} \log \frac{r_i}{r_t}$$
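The calculation is simple enough to sketch in a few lines of Python. This is an illustration rather than Etsy’s production code; the function name and the sample counts are made up:

```python
import math

def category_entropy(counts):
    """Shannon entropy (base 2) of a result set's spread across categories.

    counts: per-category result counts r_i; their sum is the total r_t.
    """
    total = sum(counts)
    probs = (c / total for c in counts if c > 0)
    return -sum(p * math.log2(p) for p in probs)

# A broad query like "geeky" spreads results across many categories...
broad = category_entropy([20, 15, 14, 13, 13, 12, 13])
# ...while "geeky mug" concentrates them in a single category.
narrow = category_entropy([80, 8, 5, 4, 3])
```

Zero-count categories are skipped, since $p \log p \to 0$ as $p \to 0$; a perfectly uniform spread over $n$ categories yields the maximum entropy of $\log_2 n$.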

In this way, we can determine when to show categories without using any offline signals. This is not to say that we didn’t use data in our development process at all. To determine the entropy threshold above which we should show categories, we looked at the entropies for a large sample of queries and made a fairly liberal judgment call on a good dividing line (i.e. a low threshold). Once we had results from an A/B experiment which showed the new interface to real users, we looked at how it affected user behavior for queries with lower entropy levels, and refined the cut-off based on the numbers. But this was a one-off analysis; we expect the threshold to remain static over time, since the distribution of our marketplace across categories changes slowly.

Taking it to the next level

A broad query may not necessarily have high entropy at the top level of our taxonomy. Results for “geeky jewelry” are unsurprisingly concentrated in our Jewelry category, but there are still many types of items that are returned. We’d like to guide users into more specific subcategories, like Earrings and Necklaces, so we introduced a secondary round of entropy calculations for queries that don’t qualify as broad at the top level. It works like this: if the result set does not have sufficiently high entropy to trigger the categories at the top level, we determine the entropy within the most populous category (i.e. the entropy of its subcategories) and show those subcategories if that value exceeds our entropy cut-off.
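The two-level fallback amounts to a small decision function. Here is a sketch, assuming per-category counts are available at runtime; the threshold value and function names are invented for illustration:

```python
import math

ENTROPY_THRESHOLD = 1.5  # illustrative; the real cut-off was tuned with A/B data

def entropy(counts):
    """Shannon entropy (base 2) of a set of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def choose_groups(top_level_counts, subcats_of_biggest):
    """Decide which grouping (if any) to display for a result set.

    top_level_counts:   {top-level category: result count}
    subcats_of_biggest: {subcategory: result count} within the most
                        populous top-level category
    """
    if entropy(top_level_counts.values()) >= ENTROPY_THRESHOLD:
        return "top-level"      # broad query: show top-level categories
    if entropy(subcats_of_biggest.values()) >= ENTROPY_THRESHOLD:
        return "subcategories"  # e.g. "geeky jewelry": dive into Jewelry
    return None                 # specific query: no category interface
```

The second entropy calculation runs only when the first check fails, so a query that is already broad at the top level never pays for the subcategory pass.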

[Chart: top subcategories of Jewelry for the query “jewelry”]

The graph above demonstrates the level of spread of results for the query “jewelry” across subcategories of the top-level Jewelry category. This method allows us to dive into top-level categories in cases like this, while sticking to a simple runtime decision based on category counts.

Showing subcategories for a query for “geek jewelry”

Iterating on entropy

While we were testing this approach, we noticed that a query like “shoes,” which we hoped would be high entropy within the Shoes category, was actually high entropy at the top level.

Top-level categories for the “shoes” query… doesn’t seem quite right

Items returned for “shoes” are apparently spread widely enough across the whole marketplace to trigger the top-level groups, even though an unusually high number of them are in the Shoes category.

[Chart: top categories for “shoes” — raw result counts]

More generally, items in our marketplace tend to be concentrated in the most popular categories. A result set is likely to have many more Accessories items than Shoes items, because the former category is an order of magnitude larger than the latter. We want to be able to compensate for this uneven global distribution of items when we calculate the probabilities that we use in our entropy calculation.

By dividing the number of items in each category that are returned for the active search by the total number of items in that category, we get a number we can think of as the affinity between the category and the search query. Although fewer than 50% of the results that come back for a query for “shoes” are in the Shoes category, 100% of items in the Shoes category are returned for a query for “shoes,” so its category affinity is much higher than its raw share of the result set.

[Chart: category affinities for “shoes” — the Shoes category dominates]

Normalizing the affinity values so they sum to one, we use these measurements as the inputs to the same Shannon entropy formula that we used in the first iteration. The normalization step ensures that we can compare entropy values across search result sets of different sizes. Letting $r_i$ represent the number of items in category $i$ for the active search query, and $t_i$ represent the total number of items in that category, the affinity value for category $i$ is simply $a_i = r_i / t_i$. Taking $s$ as the sum of all affinity values, the affinity-based entropy is:

$$H = -\sum_{i} \frac{a_i}{s} \log \frac{a_i}{s}$$
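The revised calculation can be sketched as follows. The function name and the catalog counts are hypothetical; only the formula itself comes from the derivation above:

```python
import math

def affinity_entropy(result_counts, catalog_totals):
    """Entropy over normalized category affinities a_i = r_i / t_i.

    result_counts:  r_i, items returned per category for the active query
    catalog_totals: t_i, total items per category in the whole marketplace
    """
    affinities = [r / t for r, t in zip(result_counts, catalog_totals) if r > 0]
    s = sum(affinities)
    return -sum((a / s) * math.log2(a / s) for a in affinities)

# "shoes": most of the small Shoes category matches the query, so its
# affinity dwarfs those of the larger categories even though the raw
# counts look spread out. (Counts here are invented for illustration.)
results = [40, 35, 25]      # Shoes, Accessories, Clothing
catalog = [40, 4000, 3000]  # Shoes is an order of magnitude smaller
```

On raw counts this distribution looks spread out (high entropy); on affinities the Shoes category dominates, the entropy drops, and the top-level check no longer fires for “shoes.”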

From a Bayesian perspective, both the original result-count-based values and the affinity values estimate the probability that a listing is in a category given that it is returned for the search query. The difference is that the affinity formulation corresponds to a flat prior distribution over categories, whereas the original formulation corresponds to the observed category distribution of items in our marketplace. By controlling for the uneven distribution of items across categories on Etsy, affinity-based entropy fixed our “shoes” problem and improved the quality of our system in general.

Refining by recipient on a query for “geeky shoes” Refining by recipient on a query for “geeky shoes”

Keeping it simple

Although our iterations on entropy have introduced more complexity than we had at the outset, we still reap the benefits of avoiding opaque offline computations and dependencies on external infrastructure. Big data signals can be incredibly powerful, but they introduce architectural costs that, it turns out, aren’t necessary for a functional broad-query classifier.

On the user-facing level, making Etsy easier to explore is something I’ve wanted to work on since before I started working here many years ago. It’s very frustrating for searchers to navigate through the millions of items of all types that we return for many popular queries. If you’ll indulge my thermodynamics metaphor once more, by helping to guide users out of high-entropy result sets, we’re battling the heat death of Etsy search—and that’s literally pretty cool.

Couldn’t stomach that “heat death” joke? Leave a comment or let me know on Twitter.

Huge thanks due to Giovanni Fernandez-Kincade, Stan Rozenraukh, Jaime Delanghe and Rob Hall.