Satya's blog - Spark with Databricks and AWS EMR

Nov 17 2017 12:18 Spark with Databricks and AWS EMR

Databricks is a web-based Spark notebook system, and also the company that wrote several spark libraries. The notebooks support Python and Scala.

Databricks runs your jobs in your AWS account, on EC2 instances that Databricks manages. The initial setup is a little tricky.

Once everything it set up, it's a pretty good system for developing and debugging Spark-based jobs. It can run jobs on a cron-like schedule. Databricks has support for retrying failed jobs, and can notify about success/failure by email. It gives good access to the spark worker logs for debugging.

However, this is expensive to run. I came up with a workflow that involves development/debugging on databricks, and then export the notebook as a script to be run in EMR (Elastic Map-Reduce, an AWS product). I use Python and pyspark, so this works pretty well.

I do need to import the libraries that Databricks imports automatically. I forget the exact import statements I used, but they are easy enough to figure out.

I used Luigi, which is a Python library for setting up task workflows (see also: Airflow). I set up a few abstract classes to encapsulate "this is an EMR job". Then I extended one of those classes to actually run a spark job, something like:

class ActualJob(AbstractSparkJob):
    luigi_param = luigi.Parameter()  # I usually pass a timestamp to every task

    def spark_cmd:
        return " --runs --spark-script --located-in s3://"

spark_cmd returns a command line string.

My AbstractSparkJob takes the given command and does one of two things: either submit an EMR step to the EMR cluster, using command_runner.jar to run the command, or ssh into the cluster and run spark-submit with the given command as its parameters.

(I don't have all the code do that available to post right now, sorry)

The abstract classes encapsulate all the work of spinning up an EMR cluster (and shutting it down), and making the AWS API calls via the boto library, and the work to ssh in and run spark-submit.

The abstract classes makes it easy for any other data infrastructure developer person to add more jobs.

The actual script that was exported from Databricks lives in S3 and is referenced by the command shown above. EMR does the work of fetching it from S3 and running it. Part of my data pipeline startup is to copy all the EMR scripts, which I keep in a convenient subdirectory, up to a specific prefix in S3. That way my actual Spark scripts all live in the same code repository as the rest of the pipeline.

Tag: spark aws