Our Experiences Setting Up Our Own Spark Cluster

At Sortable, we use Spark for many of our large-scale computing needs. In production, we have regularly scheduled Spark jobs that collect and aggregate the millions of data points that flow through our servers every day. We also use Spark for running large one-off jobs in development, which often require massively scaling up the cluster for the duration of the job and then scaling it back down after the job is complete.

When we first started using Spark, we relied on an AWS service called Amazon EMR. Amazon EMR is great for getting started, as it provides you with cloud-based machines that come pre-configured with everything you need for running a Spark cluster.

However, there were drawbacks to this approach. At the time, the version of Spark that came installed on EMR boxes only supported Scala 2.10, which was an issue since we compile all of our Scala code using Scala 2.11. While there were a few different approaches we could’ve taken to resolve this, the one we used in practice was to completely isolate all of our Spark code from the rest of our codebase. Our Spark code effectively lived in an entirely separate project: it had no dependencies on the rest of our codebase, nor did the rest of our codebase have any dependencies on it.

Another downside to using Amazon EMR was the cost. On top of what you pay for the underlying machines, Amazon charges a premium for the EMR service itself. If you’re using Amazon Spot Instances for your cluster, that premium can exceed the cost of the instances themselves.

Ultimately, the pressure from these downsides built up to the point where we decided to try running our own Spark cluster on bare EC2 instances.

When we chose to run our own Spark cluster, the first question we asked ourselves was which cluster manager to use. There are two main options for clusters of non-trivial size: Mesos and YARN. A great many paragraphs have been written about the differences between them, and while we decided to go with Mesos (having been less than satisfied with some parts of YARN on EMR), either one probably would’ve worked for our needs.

To very briefly touch on how Mesos works: in Mesos you set up one or more “Mesos masters” that are aware of all the other machines in the cluster. These other machines are called “slaves” or “agents”, and parameters like the number of CPUs, the amount of RAM, and the amount of disk space on each slave are called “resources”. The master(s) then send “resource offers” to applications like Spark, which can either accept or reject them. If the application accepts an offer, it is free to start running its jobs using the resources provided.
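
To make that concrete, here is a minimal sketch of how a Spark application points itself at a Mesos cluster so that it can receive and accept those resource offers. The hostnames, application name, and resource limits below are placeholders, not our actual configuration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Point Spark at the Mesos master(s) so it registers as a framework and
// starts receiving resource offers. All names and sizes are illustrative.
val conf = new SparkConf()
  .setAppName("example-aggregation-job")
  // A single master can be addressed as mesos://host:5050; with several
  // masters behind ZooKeeper, the zk:// form follows the elected leader.
  .setMaster("mesos://zk://mesos-master-1:2181,mesos-master-2:2181/mesos")
  .set("spark.executor.memory", "8g")
  .set("spark.cores.max", "32") // cap how many offered CPUs this job accepts

val sc = new SparkContext(conf)
```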

(Image: an example of the Mesos web interface)

We got our Mesos cluster up and running fairly quickly, but getting Spark to work properly with it was another matter. One of the main issues we encountered involved Spark’s two modes of operation: client mode and cluster mode. In client mode, the machine that submits the job runs the “driver program”, which executes all of the code not performed on an RDD. All the other machines involved in running the program report their results back to the driver program whenever a collect or count action is performed. The driver machine does not have to be under the control of Mesos. In cluster mode, the machine that runs the driver code is picked and managed by the cluster manager (Mesos).
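
The split matters in practice. The toy job below (reusing the `sc` context from the earlier sketch, with an illustrative input path) shows which parts of a program run on the cluster and which parts run on whatever machine is acting as the driver.

```scala
// RDD transformations are executed on the cluster; everything outside them,
// and the results of actions like collect() or count(), live on the driver.
val lines = sc.textFile("hdfs:///logs/2016-11-01/*") // path is illustrative

val errorCounts = lines
  .filter(_.contains("ERROR"))                  // runs on executors
  .map(line => (line.split(" ")(0), 1))         // runs on executors
  .reduceByKey(_ + _)                           // runs on executors

// collect() ships the results back to the driver program; in client mode
// that is the machine (and terminal) the job was launched from.
val results = errorCounts.collect()
results.take(10).foreach(println)               // plain Scala, driver only
```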

We originally set out to use cluster mode, because we were more familiar with the setup from our time spent on EMR. However, we were unable to get our Spark jobs to function correctly using cluster mode, a problem that we eventually tracked down to this unresolved issue. We were able to apply our own patch to Spark to solve this issue for our immediate needs, but since we didn’t want to have to maintain this patched version whenever we upgraded, we decided to explore client mode instead.

Client mode actually ended up providing a number of advantages for us. With a dedicated machine running the driver programs, we could give that machine much more memory than the other machines in our cluster, which lets the driver run Spark jobs that have to collect a lot of data onto it during their operation.

Client mode is also much friendlier for developers writing new Spark applications. In cluster mode, to see the output of a program you must navigate to the correct spot within a web-based UI, and killing a running program early can only be done from that UI or by sending a hard-to-remember command to the Mesos master. In client mode, however, output from the program is printed directly to the terminal from which the application was launched, and killing the application early is as easy as hitting Ctrl-C.

One feature we wanted from our new cluster was the ability for developers to run resource-intensive jobs without stealing resources from our critical production jobs. To do this, we made use of Mesos roles. On each slave (or agent) in the cluster, Mesos allows you to specify that a certain amount of that machine’s resources are reserved for jobs belonging to a particular role. This allowed us to reserve a portion of the cluster for production jobs, leaving the remainder free to run less mission-critical jobs.
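
As a rough sketch, role-based reservations involve two pieces: a static reservation on each agent, and a Spark setting telling the framework which role’s resources to request. The role name, hostnames, and numbers below are illustrative.

```scala
import org.apache.spark.SparkConf

// Agent side (done when the Mesos agent is started), e.g. with a flag like:
//   --resources="cpus(production):8;mem(production):24576;cpus(*):4;mem(*):8192"
// which statically reserves part of the machine for the "production" role.
//
// Spark side: ask Mesos to offer this framework the resources reserved for
// "production". Jobs submitted without a role only see unreserved ("*") resources.
val productionConf = new SparkConf()
  .setAppName("nightly-aggregation")
  .setMaster("mesos://zk://mesos-master-1:2181/mesos")
  .set("spark.mesos.role", "production")
```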

Now that our cluster was broken out into roles, it made sense to reduce the size of the cluster partition used by developers during off-hours and then increase it again the next morning. EC2 offers a very natural way to do this via Auto Scaling Groups. An Auto Scaling Group allows you to launch and manage groups of similar EC2 instances; for us, it was as simple as setting up the group to launch a bunch of Spark servers into the cluster every morning and terminate them every night.
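
One way to express that schedule is with scheduled scaling actions on the Auto Scaling Group. The sketch below uses the AWS SDK for Java (v1) from Scala; the group name, sizes, and cron expressions are placeholders, and the same actions can just as easily be created from the console or the aws CLI.

```scala
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder
import com.amazonaws.services.autoscaling.model.PutScheduledUpdateGroupActionRequest

val autoScaling = AmazonAutoScalingClientBuilder.defaultClient()

// Grow the developer partition of the cluster every weekday morning (times are UTC).
autoScaling.putScheduledUpdateGroupAction(
  new PutScheduledUpdateGroupActionRequest()
    .withAutoScalingGroupName("spark-dev-agents")
    .withScheduledActionName("scale-up-morning")
    .withRecurrence("0 12 * * 1-5")
    .withMinSize(0).withMaxSize(40).withDesiredCapacity(20))

// Shrink it again at night so we are not paying for idle instances.
autoScaling.putScheduledUpdateGroupAction(
  new PutScheduledUpdateGroupActionRequest()
    .withAutoScalingGroupName("spark-dev-agents")
    .withScheduledActionName("scale-down-night")
    .withRecurrence("0 23 * * 1-5")
    .withMinSize(0).withMaxSize(40).withDesiredCapacity(0))
```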

Of course, since we were managing all the instances in the cluster ourselves, the instances launched by Amazon each morning did not come preconfigured to work within our cluster. They would launch as vanilla Ubuntu 14.04 servers, and we would then run a bunch of scripts on each new instance to upgrade all of its packages to the latest versions and set it up to report its available resources to the Mesos master. Launching stock instances and upgrading and configuring them immediately afterward is a common practice for us, but this was the first time we were doing it on a large number of machines daily. As a result, we learned about a few pitfalls of this approach:

  • Our configuration scripts would take 10–15 minutes to complete. This was fine when the instances would launch before everyone came in in the morning, but was less fine if we wanted to scale up the size of the cluster mid-day.

  • Configuring an instance would sometimes fail, and the instance would exist in a zombie state until it was manually killed by a developer.

  • Occasionally, a developer would break the configuration script, and all of our newly-launched instances would fail to configure.

To solve these issues, we first ensured that whenever an instance failed to configure, we would terminate it after a period of time so that we didn’t end up paying for instances we weren’t using. We also started to make use of EC2 custom AMIs (Amazon Machine Images). These allow you to configure an EC2 instance to be exactly as you want it, create a snapshot of the instance in that state, and then launch new instances from that snapshot. The improvements have been dramatic: new instances are available in under two minutes (more than five times faster than before), and they almost never fail to configure correctly.
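
For reference, baking such an AMI can be as simple as snapshotting an instance that has already been configured by hand (or by the old scripts). The sketch below uses the AWS SDK for Java (v1) from Scala; the instance ID and names are placeholders, and the resulting image ID is what the Auto Scaling Group’s launch configuration would point at so that new instances boot already configured.

```scala
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder
import com.amazonaws.services.ec2.model.CreateImageRequest

val ec2 = AmazonEC2ClientBuilder.defaultClient()

// Snapshot a fully configured Mesos agent into a reusable custom AMI.
val imageId = ec2
  .createImage(new CreateImageRequest()
    .withInstanceId("i-0123456789abcdef0") // the hand-configured agent
    .withName("spark-mesos-agent-image")
    .withDescription("Ubuntu 14.04 + Mesos agent + Spark, preconfigured"))
  .getImageId()

println(s"Baked AMI: $imageId")
```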

Running our own Spark cluster has proved to be a valuable experience for us. In the end, we’ve gained greater control over our cluster, fine-tuned which jobs run on which parts of it, moved to fully supporting Scala 2.11 everywhere in our codebase rather than keeping a separate project compiled with Scala 2.10 for our Spark jobs, and saved thousands of dollars each month compared to using Amazon EMR.
