Etsy Logo

Code as Craft

Managing Hadoop Job Submission to Multiple Clusters main image

Managing Hadoop Job Submission to Multiple Clusters


At Etsy we have been running a Hadoop cluster in our datacenter since 2012.  This cluster handled both our scheduled production jobs as well as all ad hoc jobs.  After several years of running our entire workload on this one production Hadoop cluster, we recently built a second.  This has greatly expanded our capacity and ability to manage production and ad hoc workloads, and we got to have fun coming up with names for them (we settled on Pug and Basset!).  However, having more than one cluster has brought new challenges.  One of the more interesting issues that came up was how to manage the submission of ad hoc jobs with multiple clusters.

The Problem

As part of building out our second cluster we decided to split our current workload between the two clusters.  Our initial plan was to divide the Hadoop workload by having all scheduled production jobs run on one cluster and all ad hoc jobs on the other.  However, we recognized that those roles would change over time.  First, if there were an outage or we were performing maintenance on one of the clusters, we may shift all the workload to the other.  Also, as our workload changes or we introduce new technology, we may balance the workload differently between the two clusters.

When we had only one Hadoop cluster, users of Hadoop would not have to think about where to run their jobs.  Our goal was to keep it easy to run an ad hoc job without users needing to continually keep abreast of changes in which cluster to use.  The major obstacle for this goal is that all Hadoop users submit jobs from their developer VMs.  This means we would have to ensure that the changes necessary to switch which cluster should be used for ad hoc jobs propagate to all of the VMs in a timely fashion.  Otherwise some users would still be submitting their jobs to the wrong cluster, which could mean those jobs would fail or otherwise be disrupted. To simplify this and avoid such issues, we wanted a centralized mechanism for determining which cluster to use.

Other Issues

There were two related issues that we decided to address at the same time as managing the submission of ad hoc jobs to the correct cluster.  First, we wanted the cluster administrators to have the ability to disable ad hoc job submission entirely.  Previously we had relied on asking users via email and IRC to not submit jobs, which is only effective if everyone checks and sees that request before launching a job.  We wanted a more robust mechanism that would truly prevent running ad hoc jobs.  Also, we wanted a centralized location to view the client-side logs from running ad hoc jobs.  These would normally only be available in the user’s terminal, which complicates sharing these logs when getting help with debugging a problem.  We wanted both of these features regardless of having the second Hadoop cluster.  However, as we considered various approaches for managing ad hoc job submission to multiple clusters, we found that we could solve these problems at the same time.

Our Approach

We chose to use Apache Oozie to manage ad hoc job submission.  Using Oozie had several significant advantages for us.  First, we already were using Oozie for all of our scheduled production workflows.  As such we already understood it well and had it properly operationalized.  It also allowed us to reuse existing infrastructure rather than setting up something new, which greatly reduced the time and effort necessary to complete this project. Next, using Oozie let us distribute the load from the job client processes across the Hadoop cluster.  When ad hoc job submission occurred on users’ VMs, this load was naturally distributed.  Distributing this load across the Hadoop cluster allows this approach to grow with the cluster.  Moreover, using Oozie automatically provided a central location for viewing the client logs from job submission.  Since the clients run on the Hadoop cluster, their logs are available just like the logs from any other Hadoop job.  As such they can be shared and examined without needing to retrieve them from the user’s terminal.

There was one downside to using Oozie: it did not support automatically directing ad hoc jobs to the appropriate cluster or disabling the submission of ad hoc jobs.  We had to build this ourselves, but as Oozie was handling everything else it was very lightweight.  To minimize the amount of new infrastructure for this component, we used our existing internal API framework to manage this state.  We call this component the “state service”.

The Job Submission Process

Previously the process of submitting an ad hoc job looked like this: Original Job Submission Sequence Diagram

Now submitting an ad hoc job looks like this instead:   Job Submission Server Sequence Diagram

From the perspective of users nothing had changed; they would still launch jobs using our run_scalding script on their VM.  Internally, it would request the active ad hoc cluster using the API for the state service.  This API call would also indicate if ad hoc job submission was disabled, allowing the script to terminate.  Administrators can also set a message that would be displayed to users when this happens, which we use to provide information about why ad hoc jobs were disabled and the ETA on re-enabling them.

Once the script determined the cluster on which the job should run, it would generate an Oozie workflow from a template that would run the user’s job.  This occurs transparently to the user so that they do not have to be concerned about the details of the workflow definition.  The script then submits this generated workflow to Oozie, and the job runs.  The change most visible to users in this process is that the client logs no longer appear in their terminal as the job executes.  We considered trying to stream them from the cluster during execution, but to minimize complexity the script prints a link to the logs on the cluster after the job completes.

Other Options

While using Oozie ended up being the best choice for us, there were several other approaches we considered.

Apache Knox

Apache Knox is a gateway for Hadoop REST APIs.  The project primarily focuses on security, so it’s not an immediate drop-in solution for this problem.  However, it provides a gateway, similar to a reverse proxy, that maps externally exposed URLs to the URLs exposed by the actual Hadoop clusters.  We could have used this functionality to define URLs for an “ad hoc” cluster and change the Knox configuration to point that to the appropriate cluster. Nevertheless, we felt Knox was not a good choice for this problem.  Knox is a complex project with a lot of features, but we would have been using only a small subset of these.  Furthermore, we would be using it outside of its intended use case, which could complicate applying it to solve our problem.  Since we did not have experience operating Knox at scale, we felt it would be better to stick with Oozie, which we already understood and would not have to shoehorn into our use case.

Custom Job Submission Server

We also considered implementing our own custom process to both manage the state of which cluster was currently active for ad hoc jobs as well as handling centralized job submission.  While this would have provided the most flexibility, it also meant building a lot of new infrastructure.  We would have essentially been reimplementing Oozie, but without any of the community testing or support.  Since we were already using Oozie and it met all our requirements, there was no need to build something custom.

Gateway Server

The final approach we considered was having a “gateway server” and requiring users to SSH to that server and launch jobs from there instead of from their VM.  This would have simplified the infrastructure components for job submission.  The Hadoop configuration changes to point ad hoc job submissions to the appropriate cluster or disable job submission entirely would only need to be deployed there.  By its very nature it would provide a central location for the client logs.  However, we would have to manage scaling and load balancing for this approach ourselves.  Furthermore, it would represent a significant departure from how development is normally done at Etsy.  Allowing users to write and run Hadoop jobs from their VM is important for keeping Hadoop as accessible as possible.  Adding the additional step of moving changes and SSH-ing to a gateway server compromises that goal.


Using Oozie to manage ad hoc job submission in this way has worked well for us.  Reusing the Oozie infrastructure we already had let us quickly build this out, and having this new process for running jobs made the transition to having two Hadoop clusters much easier.  Moreover, we were able to keep the process of submitting an ad hoc job almost identical to the previous process, which minimized the disruption for users of Hadoop. As we were developing this, we found that there was only minimal discussion online about how other organizations have managed ad hoc job submission with multiple clusters.  Our hope is that this review of our approach as well as other options we considered is helpful if you are in the same situation and are looking for ideas for your own process of ad hoc job submission.