Code as Craft

Enhancing Cloud Usage Forecasting, Monitoring & Optimizing

In 2020, Etsy concluded its migration from an on-premises data center to the Google Cloud Platform (GCP). During this transition, a dedicated team of program managers ensured the migration's success. Post-migration, this team evolved into the Etsy FinOps team, dedicated to maximizing the organization's cloud value by fostering collaboration within and outside the organization, particularly with our cloud providers.

Positioned within the Engineering organization under the Chief Architect, the FinOps team operates independently of any one Engineering org or function and optimizes globally rather than locally. This positioning, combined with Etsy's robust engineering culture focused on efficiency and craftsmanship, has fostered what we believe is a mature and successful FinOps practice at Etsy.

Forecast Methodology

A critical aspect of our FinOps approach is a strong forecasting methodology. A reliable forecast establishes an expected spending baseline against which we track actual spending, enabling us to identify deviations. We classify costs into distinct buckets:

  • Core Infrastructure: Includes the costs of infrastructure and services essential for operating the Etsy.com website.
  • Machine Learning & Product Enablement: Encompasses costs related to services supporting machine learning initiatives like search, recommendations, and advertisements.
  • Data Enablement: Encompasses costs related to shared platforms for data collection, data processing and workflow orchestration.
  • Dev: Encompasses non-production resources.

The FinOps forecasting model relies on a trailing Cost Per Visit (CPV) metric. While CPV provides valuable insights into changes, it's not without limitations:

  • A meaningful portion of web traffic to Etsy comes from non-human activity, such as web crawlers, that is not accounted for in CPV.
  • Some services have weaker correlations to user visits.
  • Dev, data, and ML training costs lack direct correlations to visits and are susceptible to short-term spikes during POCs, experiments or big data workflows.
  • A/B tests for new features can lead to short-term CPV increases, potentially resulting in long-term CPV changes upon successful feature launches.
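To make the trailing CPV idea concrete, here is a minimal sketch of how such a forecast could be computed. It is not Etsy's actual model: the function names, the three-period window, and all figures are assumptions chosen for illustration. The sketch divides total cost by total visits over a trailing window, then multiplies that rate by projected visits.

```python
def trailing_cpv(costs, visits, window=3):
    """Cost per visit over the last `window` periods (total cost / total visits)."""
    if len(costs) != len(visits):
        raise ValueError("costs and visits must have the same length")
    if window < 1 or len(costs) < window:
        raise ValueError("not enough history for the requested window")
    recent_visits = sum(visits[-window:])
    if recent_visits <= 0:
        raise ValueError("visits in the window must be positive")
    return sum(costs[-window:]) / recent_visits


def forecast_cost(costs, visits, projected_visits, window=3):
    """Project spend by applying the trailing CPV to each projected visit count."""
    cpv = trailing_cpv(costs, visits, window)
    return [cpv * v for v in projected_visits]


# Hypothetical monthly figures (not Etsy data).
monthly_cost = [100_000.0, 110_000.0, 120_000.0, 130_000.0]
monthly_visits = [1_000_000, 1_100_000, 1_200_000, 1_300_000]

baseline = forecast_cost(monthly_cost, monthly_visits, [1_400_000, 1_500_000])
```

Because the rate is a single ratio over the window, any cost that does not scale with visits (dev, data, or ML training spend, for example) shifts the CPV without a matching change in traffic, which is exactly the limitation the list above describes.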

Periodically, we run regression tests to validate whether CPV should drive our forecasts. In addition to visits, we have looked into headcount, GMV (Gross Merchandise Value), and revenue as independent variables. Thus far, visits have consistently exhibited the highest correlation to costs.
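One simple way to run this kind of comparison is to rank each candidate driver by its Pearson correlation with cost. For a single driver, the squared correlation equals the R² of a simple linear regression, so this is a reasonable stand-in for a one-variable regression test. The sketch below is an assumption about how such a check might look, not the team's actual procedure; the helper names and every number are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def rank_drivers(cost, candidates):
    """Sort candidate drivers by absolute correlation with cost, strongest first."""
    scores = {name: pearson(series, cost) for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical monthly series (not Etsy data).
cost = [100, 110, 120, 130, 140]
candidates = {
    "visits": [10, 11, 12, 13, 14],
    "headcount": [50, 50, 51, 51, 52],
    "gmv": [5, 7, 6, 8, 7],
}

ranking = rank_drivers(cost, candidates)
```

With these invented numbers, visits rank first; on real data the ranking could change over time, which is why the check is worth repeating.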

Monitoring and Readouts

We monitor costs using internal tools built on BigQuery and Looker. Customized dashboards for all of our Engineering teams display cost trends, CPV, and breakdowns by labels and workflows. Additionally, we've set up alerts to identify sudden spikes or gradual week-over-week/month-over-month growth.
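The two kinds of alerts mentioned above can be sketched as follows. This is an illustrative example only: the real alerts are built on BigQuery and Looker, and the thresholds (a 30% day-over-day jump, 5% growth sustained over two consecutive weeks) and function names here are assumptions.

```python
def pct_change(previous, current):
    """Relative change from `previous` to `current`."""
    if previous <= 0:
        raise ValueError("previous value must be positive")
    return (current - previous) / previous


def detect_spikes(daily_costs, threshold=0.30):
    """Indices of days whose cost rose more than `threshold` over the prior day."""
    return [
        i
        for i in range(1, len(daily_costs))
        if pct_change(daily_costs[i - 1], daily_costs[i]) > threshold
    ]


def weekly_totals(daily_costs):
    """Sum daily costs into consecutive 7-day weeks, dropping a trailing partial week."""
    full_weeks = len(daily_costs) // 7
    return [sum(daily_costs[w * 7:(w + 1) * 7]) for w in range(full_weeks)]


def sustained_growth(period_totals, threshold=0.05, periods=2):
    """True if each of the last `periods` period-over-period changes exceeds `threshold`."""
    if len(period_totals) < periods + 1:
        return False
    recent = period_totals[-(periods + 1):]
    return all(pct_change(a, b) > threshold for a, b in zip(recent, recent[1:]))


# Hypothetical daily spend: two weeks of gradual growth, then a one-day jump.
daily = [100] * 7 + [110] * 7 + [121] * 6 + [200]

spike_days = detect_spikes(daily)
growing = sustained_growth(weekly_totals(daily))
```

Separating the two checks matters because gradual growth never trips a day-over-day spike threshold, while a single large spike may be averaged away in a weekly total.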

Collaboration with the Finance department occurs weekly to compare actual costs against forecasts, identifying discrepancies for timely corrections. Furthermore, the FinOps team conducts recurring meetings with major cost owners and monthly readouts for Engineering and Product leadership to review forecasted figures and manage cost variances.

While we track costs at the organization/cost center level, we don't charge costs back to the teams. This both lowers our overhead and, more importantly, provides flexibility to make tradeoffs that enable Engineering velocity.

Cost Increase Detection & Mitigation

Maintaining a healthy CPV involves swiftly identifying and mitigating cost increases. To achieve this, we follow these steps:

  • Analysis: Gather information on the increase's source, whether from specific cloud products, workflows, or usage pattern changes (i.e., variance in resource utilization).
  • Collaboration: Engage relevant teams, sharing insights and seeking additional context.
  • Validation: Validate cost increases from product launches or internal changes, securing buy-in from leadership if needed.
  • Mitigation: Troubleshoot unexpected increases jointly, outlining and assigning action items to owners until the issues are resolved.
  • Communication: Inform our finance partners about recent cost trends and their incorporation into the expected spend forecast post-confirmation or resolution with teams and engineering leadership.

Cost Optimization Initiatives

Another side of maintaining a healthy CPV involves cost optimization, offsetting increases from product launches. Cost-saving ideas come from collaboration between FinOps and engineering teams, with the Architecture team validating and implementing efficiency improvements. Notably, we focus on the engineering or business impact of the cost optimization rather than solely on savings, recognizing that inefficiencies often signal larger problems.

Based on effort vs. value evaluations, some ideas are added to backlogs, while major initiatives warrant dedicated squads. Below is a breakout of some of the major wins we have had in the last year or so.

  • GCS Storage Optimization - In 2023 we stood up a squad focused on optimizing Etsy’s use of GCS, as it has been one of the largest growth areas for us over the past few years. The squad delivered a number of improvements, including improved monitoring of usage, automation features for data engineers, implementation of TTLs that match data access patterns and business needs, and the adoption of Intelligent tiering. Due to these efforts, Etsy’s GCS usage is now less than it was two years ago.
  • Compute Optimization - We migrated over 90% of the Etsy infrastructure that serves traffic to the latest and greatest CPU platform. This improved our serving latency while reducing cost.
  • Increased Automation for model deployment - In an effort to improve the developer experience, our machine learning enablement team developed a tool to automate the compute configurations for new models being deployed, which also ended up saving us money.
  • Network Compression - Enabling network compression between our high throughput services both improved the latency profile and drastically reduced the networking cost.

What's Next

While our core infrastructure spend is well understood, our focus is on improving visibility into our Machine Learning platform's spend. As these systems are shared across teams, dissecting costs tied to individual product launches is challenging. Enhanced visibility will help us refine our ROI analysis of product experiments and pinpoint future areas of opportunity for optimization.