Etsy Icon>

Code as Craft

Scaling Etsy Payments with Vitess: Part 2 – The “Seamless” Migration main image

Scaling Etsy Payments with Vitess: Part 2 – The “Seamless” Migration


Between Dec 2020 and May 2022, the Etsy Payments Platform, Database Reliability Engineering and Data Access Platform teams moved 23 tables, totaling over 40 billion rows, from four unsharded payments databases into a single sharded environment managed by Vitess. While Vitess was already managing our database infrastructure, this was our first usage of vindexes for sharding our data.

This is part 2 of our series on Sharding Payments with Vitess. In the first post, we focused on making application data models available for sharding. In this part, we’ll discuss what it takes to cut over a crucial high traffic system, and part 3 will go into detail about reducing the risks that emerge during the cutover.

Migrating Data

Once we had chosen a sharding method, our payments data had to be redistributed to forty new shards, the data verified as complete, reads and writes switched over from existing production databases, and any additional secret sauce applied. Considering this was only the payments path for all of, the pressure was on to get it right.

The only way to make things work would be to test, and test again. And while some of that could be done on an ad-hoc basis by individual engineers, to reorganize a system as critical as payments there is no substitute for a production-like environment. So we created a staging infrastructure where we could operationally test Vitess's internal tooling against snapshots of our production MySQL data, make whatever mistakes we needed to make, and then rebuild the environment, as often as necessary.

Staging was as real as we could make it: we had a clone of the production dataset, replicas and all, and were equipped with a full destination set of forty Vitess shards. Redistributing the data in this playground, seeing how the process behaved, throwing it all away and doing it again (and again), we built confidence that we could safely wrangle the running production infrastructure. We discovered, and documented, barriers and unknowns that had to be adjusted for (trying to shard on a nullable column, for instance). We learned to use VDiff to confirm data consistency and verify performance of the tooling, we learned to run test queries through Vitess proper, checking behavior and estimating workload, and we discovered various secondary indexing methods and processes (using CreateLookupVindex and ExternalizeVindex) to overcome some of these barriers.

One of the key things we found out about was making the read/write switch. Vitess's VReplication feature is the star performer here. Via the MoveTables workflow, VReplication sets up streams that replicate all writes in whichever direction you're currently operating. If you're writing to the source side, VReplication has streams to replicate those writes into the sharded destination hosts, properly distributing them to the correct shard. When you switch writes with Vitess, those VReplication streams are also reversed, now writing from the destination shards to the source database(s). Knowing we had this as an option was a significant confidence booster. If the first switch wasn't perfect, we could switch back and try again: both sides would stay consistent and in sync. Nothing would get lost in between.

MoveTables and SwitchWrites diagram
Top: The MoveTables workflow has VReplication setup to move existing data to the new destination shards and to propagate new writes. Bottom: After SwitchWrites the Vreplication stream reverses to the source database, keeping the origin data in sync with new writes to the sharded setup.

Cutting Over Production

VReplication paved most of the road for us to cut over production, but we did have our share of “whoa, I didn’t expect that” moments. As powerful as the Vitess workflow system is, it does have some limitations when attempting to move a substantial dataset, and that requires finding additional solutions for the same problem. We were traversing new ground here for Etsy, so even as we pushed onward with the migration we were still learning how to operate within Vitess.

In one phase of the project we switched traffic and within a couple of minutes saw a massive 40x jump in query volume (from ~5k/second to ~200k/second) on the same workload. What is going on here? After some quick investigation we found that a query gathering data formerly colocated on a monolithic database now required that data from some portion of the new shard set. The problem was, Vitess didn’t know exactly which of our forty shards that data existed on, since it was being requested via a column that was not part of the sharding column.

An explosion of queries per second
An explosion: This massive increase in query volume surprised us initially. A dive into what was generating this volume quickly revealed scatter queries were at the root of it.

Enter the story's second superhero: CreateLookupVindex. While we switched traffic back and headed to our staging environment to come up with a solution, we quickly realized the 40x increase was due to Vitess using a scatter gather process, sending queries to all shards in an effort to see which of them had pieces of the data. Besides those scatter queries being inefficient, they were also bombarding the databases with requests that returned empty result sets. To combat this, we created Secondary Vindexes to tell Vitess where the data was located using a different column than the Primary Vindex. These additional Vindexes allowed the VTGates to use a lookup table to identify the shards housing the relevant data and make sure they were the only ones receiving queries.

Impact of applying Secondary Vindexes
A solution: Application of Lookup Vindexes quelled the rampant scatter queries immediately, returning query volume to manageable levels.

Knowing there were additional datasets that would behave similarly, we were able to find those callsites and query patterns early, and apply those Secondary Vindexes while the MoveTables workflow was in progress. This prevented the same overwhelming traffic pattern from recurring and kept the final transitions from being disruptively resource-intensive. (For more information on the scatter traffic pattern, part 3 of this series has additional details.)

As we grew more comfortable with our command of the migration process, we decided that since Vitess supported multiple workflows per keyspace, we could get both the MoveTables and the Secondary Vindexes deployed simultaneously. However, there's a caveat: due to the way Secondary Vindexes are maintained by VReplication streams, the Vindexes cannot be externalized (made available to the system) until after the switching of writes is complete. These indexes work by inserting rows into a lookup table as writes are made to the owner table, keeping the lookup table current. While the MoveTables is in motion, the Secondary Vindex is doing its job inserting while the data is being sharded via the vReplication streams. And there's the rub: if you externalize the Vindex before writes are switched, there aren’t any writes being made to the destination shards, and you are going to miss all of those lookup records coming from the vReplication stream.

Taking all this into account, we performed the switch writes, externalized the two Secondary Vindexes we were creating, and found the machines we were running on couldn’t handle the query load effectively. No problem: we'd switch the writes back to the source database. Oops! We just broke our fancy new Vindexes because CreateLookupVindex was no longer running!

While it wasn’t the end of the world, it did mean we had to remove those Secondary Vindexes, remove all of their artifacts from the sharded dataset (drop the lookup tables), then rebuild them a second time. During the small window this work created we raised the specs for the destination cluster, then did the dance again. Switch writes, externalize the Vindexes, and roll forward. This time, happily, roll forward is exactly what things did.


Thanks to extensive testing on a full production copy in our staging environment, Vitess tooling being as powerful and performing as well as we expected, and meticulous precursor work on the data models, we encountered no disruption, no downtime, and no impact to normal operation throughout the course of these migrations. While there was tolerance for the idea of some disruption, the most important thing to our team was that our processes should run transparently. Completing this monumental project on a highly sensitive path without anyone noticing (inside Etsy or out) was very satisfying.

This is part 2 in our series on Sharding Payments with Vitess. Check out part 1 to read about the challenges of migrating our data models, and part 3 to read about reducing cutover risks.