Code as Craft

Deep Learning for Search Ranking at Etsy

Until recently, search ranking at Etsy was powered by a gradient-boosted decision tree model. It performed well and served relevant results, and we used it to deliver Etsy's first-ever personalized search. Over time, though, we saw diminishing returns. Decision trees rely heavily on manually engineered features and are limited in the types of input they can take. Even as we added more, and more powerful, features to the tree model, our relevancy gains slowly began to plateau.

Deep learning models provide a breadth of exploration for improvements beyond the limits of traditional tree feature engineering: using embedding features directly, incorporating multi-modal data, altering the network architecture itself. In our situation, approaching a relevancy ceiling, that looked like an attractive prospect. But migrating a production decision tree model to a neural ranking model brought us into unfamiliar territory. We were confident that the benefits of the new model would usher in the next generation of search ranking, but since migration would require major changes in the development pipelines and serving infrastructure, we knew it came with challenges. (Etsy search is composed of two ranking layers for optimal search results; in this post we focus exclusively on our second-pass layer.) Over the course of exactly one year (down to the day!) we iterated, experimented, and re-iterated until we finally launched Etsy's first-ever unified deep learning model for search ranking.

Along the way we had to modernize our ML development pipeline, and we moved to open-source tools and functions to build the new model, with TF Ranking, TensorFlow's learning-to-rank library, at the core, along with off-the-shelf losses and metrics. Migrating to neural ranking has meant decreased model training time, a significantly improved developer experience, and access to brand-new types of features that were unavailable in the tree model. And for us this is just the beginning. In this post, we share the journey we took to evolve our ranking model and where we see ourselves going from here.

To ensemble or not to ensemble?

Our first thought was to go with an ensemble model: a gradient-boosted decision tree and a simple neural ranking model with three layers and softmax loss. The general idea for ensembling was that adding predictions from both models would yield an improvement over the baseline. The features that we had created for the tree model were a product of years of manual feature engineering and domain expertise. We wanted to combine that accumulated knowledge with whatever we were going to learn from the neural ranking model. Once the ensemble model proved out, we'd turn our attention to further developing the neural ranking model, until it reached the point where it could stand alone.

Our ensemble model would train the neural network using tree scores as a bias. Neural networks are known to be universal approximators, with the potential to learn complex functions while avoiding heavy reliance on engineered features. We hypothesized that the neural ranking model would learn additional information about queries, users, and listings beyond the features engineered in the tree, and that this would be a win for relevancy. We developed the following scoring and loss function for a neural network to combine the outputs of both models:

score(g, h, x) = g(x) + h(x)

loss(g, h, x) = -NDCG(g(x) + h(x))

where g is the GBDT model and h is the neural network model
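Since NDCG itself isn't differentiable, an implementation would optimize a smooth surrogate in its place. Here's a minimal sketch of how such an ensemble could be wired up in Keras, with TF Ranking's ApproxNDCGLoss standing in for the -NDCG term above; the layer sizes, names, and shapes are illustrative assumptions, not our production code:

```python
import tensorflow as tf
import tensorflow_ranking as tfr

# Hypothetical sketch, not Etsy's production code. Precomputed GBDT scores
# g(x) enter as a frozen per-listing input; only the DNN scorer h(x) trains.
def build_ensemble_model(num_features: int) -> tf.keras.Model:
    # Dense listing features, shape [batch, list_size, num_features].
    features = tf.keras.Input(shape=(None, num_features), name="features")
    # Frozen GBDT scores, shape [batch, list_size].
    gbdt_scores = tf.keras.Input(shape=(None,), name="gbdt_scores")

    # A small feed-forward scorer h(x).
    h = tf.keras.layers.Dense(256, activation="relu")(features)
    h = tf.keras.layers.Dense(64, activation="relu")(h)
    dnn_scores = tf.squeeze(tf.keras.layers.Dense(1)(h), axis=-1)

    # Final score: g(x) + h(x), as in the formula above.
    model = tf.keras.Model(
        inputs=[features, gbdt_scores], outputs=gbdt_scores + dnn_scores)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        # A differentiable surrogate for the -NDCG loss.
        loss=tfr.keras.losses.ApproxNDCGLoss(),
        metrics=[tfr.keras.metrics.NDCGMetric(topn=10)],
    )
    return model
```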

Offline, though, the ensemble showed no improvement over baseline on NDCG. Additionally, when we scoped out the steps needed to load-test the ensemble model with live traffic, it became clear that the design of Etsy's search application itself was a roadblock. We would have to build support for running inference on multiple models in parallel, which would require a non-trivial development effort, and for a system that was really just an intermediate step on the way to the final unified model.

When we took a step back, we saw that the ensemble model came with even more costs. Each iteration of new feature engineering would mean double the usual time and effort, since we'd have to modify input features for two models. Though the tree and the neural ranking models could potentially use the same inputs in different ways, we'd be spending more time exploring the optimal methods to incorporate new feature sets into them both. The whole development and prototyping cycle for the ensemble looked much more complicated than we could reasonably support.

Factor in the poor offline performance, the infrastructure complications, and the added latency and deploy time of the two-model setup, and the ensemble simply had too many cons for us to move forward with it.

Towards a unified deep learning model

After learning from the failures and complications of the ensemble model, our next step was to see how far the neural network could go on its own. Compared to a DNN + GBDT ensemble model, a single DNN model has a number of advantages:

  • From a developer's perspective, it creates a friendlier, more unified, and more productive experience. Along with the single DNN model, we would redesign our machine learning pipeline around Kubeflow Pipelines and other Google Cloud services. The new pipeline shrinks the gap between the local and cloud development environments, something not easily achieved with the ensemble method.
  • A DNN delivers fresher models by reducing model training time. At best, training the ensemble would take as long as the slower of its two component models. For the DNN alone, training could easily be accelerated with GPUs: in practice, adding 8 GPUs reduced model training time to ⅛ of the original. Limited by the slow training of the GBDT model, the ensemble could never have taken advantage of that speed gain (see the sketch after this list).
  • For model serving, the DNN + GBDT ensemble adds extra inference latency compared to inference on a single DNN model, which might have hurt user experience.
  • It's less costly to maintain one model than two. DNN and GBDT models have little in common under the hood, so an ensemble means maintaining two end-to-end pipelines, adding cost and introducing instability to the system.
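As a generic illustration of that multi-GPU speedup (not Etsy's actual training code), TensorFlow's MirroredStrategy replicates a Keras model across all visible GPUs and averages gradients across replicas each step; the model below is a placeholder:

```python
import tensorflow as tf

# Data-parallel training across however many GPUs are visible; with 8 GPUs
# each step processes 8x the examples, giving speedups like the one above.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Any Keras model built inside this scope is mirrored on every replica.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
```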

We started building the neural model by porting over our engineered features from the decision tree. Most of these were simple first- or second-order numerical ratio features such as click rate and listing price. These were added to the neural ranking model alongside raw, normalized features. After experimenting with several normalization methods, we landed on log1p, which performed best; it also happens to be one of the faster methods, since it doesn't depend on the dataset distribution.
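For illustration, here's what log1p normalization looks like in TensorFlow; the feature values are made up, and clamping negatives to zero is our assumption rather than a detail from the production pipeline:

```python
import tensorflow as tf

# log(1 + x) compresses heavy-tailed features (counts, prices) while keeping
# 0 at 0, and needs no statistics from the training distribution.
def log1p_normalize(x: tf.Tensor) -> tf.Tensor:
    return tf.math.log1p(tf.maximum(x, 0.0))

raw_prices = tf.constant([0.0, 4.99, 25.0, 1200.0])
print(log1p_normalize(raw_prices))  # approx [0.0, 1.79, 3.26, 7.09]
```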

To represent text features in the model we added custom embeddings trained on Etsy data, which gave a 10% boost to offline purchase NDCG. Etsy query and listing text differs greatly from the news stories and other corpora that off-the-shelf models are typically trained on. Since the TensorFlow ecosystem offers easy access to pretrained embeddings, we also experimented with off-the-shelf embedding models like NNLM and BERT to represent our listings. Offline results showed them to be suboptimal for ranking, demonstrating the power of domain-specific embeddings. In the tree world, we would create similarity scores between text features as model inputs; in the new neural ranking world we could also directly leverage semantic representations of text.
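To make the contrast concrete, here's a small sketch of the kind of query-listing similarity a text embedding gives you. The TF Hub handle is a real public NNLM module; the example texts and the cosine-similarity setup are purely illustrative, and a custom Etsy-trained embedding would slot in the same way:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Pretrained NNLM embedding: maps a batch of strings to 128-d vectors.
nnlm = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim128/2")

query_vecs = nnlm(tf.constant(["handmade ceramic mug"]))
listing_vecs = nnlm(tf.constant(["stoneware coffee cup, hand thrown"]))

# Cosine similarity between query and listing in the shared embedding space.
similarity = tf.reduce_sum(
    tf.nn.l2_normalize(query_vecs, axis=-1)
    * tf.nn.l2_normalize(listing_vecs, axis=-1),
    axis=-1,
)
print(similarity)  # higher = semantically closer
```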

Prototyping showed us that the neural ranking model required longer training windows than the tree model to achieve similar performance. Our adjustments to the window ended up increasing data size by about 40%; past that point, adding more training days yielded diminishing returns. Deep learning models tend to be data-hungry, so it came as no surprise that the neural ranking model performed better with more data.

Surpassing parity offline in our prototype of Etsy’s first-ever unified neural ranking model was a milestone for us. But replicating that success online was a much greater challenge. Serving the neural ranking model at scale with TensorFlow presented a number of infrastructure problems that we had to resolve, which we will detail in a forthcoming post about scaling Etsy's first deep learning model for search ranking. At the same time, we faced the challenges of migrating a new kind of model to production, and making sure our offline wins would convert to online.

Journey to launch

While we made improvements to our neural ranking model offline, the tree-based ranking system was continuing to evolve in production. We had a moving target to follow: an unfortunate reality of sizable software migrations, from which not even machine learning projects are exempt. The work required to convert features from various data sources inside Etsy into data types compatible with our new TensorFlow model imposed some delays, and serving errors emerged related to our novel use of TensorFlow for search ranking. With every delay, the gap between our model and the production baseline widened. But as a testament to how much the new pipeline improved the developer experience of prototyping models, we were able to catch up with and even surpass baseline performance.

Our first online experiment tested neural ranking against a production model that had extra real-time and browser-level features ours didn't. Those features first had to be converted into TensorFlow Examples before they could be used at inference, and at the time of the experiment they weren't ready yet. But we launched an experiment with the beta neural ranking model anyway, as a serving test and to gain what insights we could into the model's online performance.
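As an illustration of that conversion step, here's how a handful of ranking features could be packed into a tf.train.Example; the feature names and values are hypothetical, not Etsy's actual schema:

```python
import tensorflow as tf

# Hedged sketch: serializing per-listing features into the tf.train.Example
# format that the model consumes at inference. Names are illustrative.
def to_example(click_rate: float, price: float, query: str) -> tf.train.Example:
    return tf.train.Example(features=tf.train.Features(feature={
        "click_rate": tf.train.Feature(
            float_list=tf.train.FloatList(value=[click_rate])),
        "price": tf.train.Feature(
            float_list=tf.train.FloatList(value=[price])),
        "query": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[query.encode("utf-8")])),
    }))

serialized = to_example(0.031, 24.99, "handmade ceramic mug").SerializeToString()
```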

The beta experiment showed cold start listings gaining more impressions, and more purchases from new users, than under the tree model, despite the neural ranking model lacking real-time/browser features. One hypothesis is that the gains in these two cold start areas can be attributed to our custom-trained embeddings, which had learned Etsy-specific query and listing pairs. In the tree model, text features consisted largely of similarities derived from ngrams; the custom embeddings, by contrast, learn semantic representations for a user's query and for candidate listings in the same space. Overall, though, and as expected, the beta model underperformed the tree baseline on the majority of key metrics.

Once we had access to all the features at inference time, we could finally launch an experiment focused on model performance. This would be the first time we were truly going head-to-head with the production model in a fair fight, with the goal that the neural model come out neutral or positive on key metrics. In this final round of online testing, we reached parity with the production baseline on web platforms and actually exceeded its performance on our mobile app platform. That was a solid signal that we could begin replacing the legacy model.

Where we are now

The new development lifecycle we created for our neural ranking model cut iteration time and time-to-deploy in half. This meant faster offline testing of new features and fresher models. It was largely thanks to the new framework that we were able to play multiple rounds of catch-up with the production model prior to launch.

With a streamlined process comes a better developer experience. It used to be that making a change might require pull requests in four different repositories; we've unified them and reduced context switching. Moving to open-source libraries for feature processing and model deployment means less dependence on the idiosyncrasies of bespoke, in-house tools. We've reduced or eliminated a number of the sources of developer fatigue built into the previous pipeline, and as a result have increased the throughput of models we prototype.

In effect, the neural ranking model is already paying for itself: compared with the previous system, Etsy is saving hundreds of thousands of dollars annually in model serving and training costs. And we expect even greater impact as we continue refining and extending our new approach. We want to see what we can do for search ranking with incremental model training, using the latest available data to deploy fresh models faster and more frequently. We plan to experiment with new model architectures that can learn better semantic representations of Etsy listings and their neighborhoods, and better understand our users and their query intentions. And we're going to investigate data augmentation for improving search rankings, which matters more than ever given how sensitive neural models are to their training data.

Acknowledgements

This work is brought to you by the Search Ranking team with deep gratitude to the ML Infrastructure, Platform, & Systems (MIPS) and IR platform teams for their invaluable support.