Satya's blog - 2017/

Nov 17 2017 12:18 Spark with Databricks and AWS EMR

Databricks is a web-based Spark notebook system, and also the company that wrote several spark libraries. The notebooks support Python and Scala.

Databricks runs your jobs in your AWS account, on EC2 instances that Databricks manages. The initial setup is a little tricky.

Once everything it set up, it's a pretty good system for developing and debugging Spark-based jobs. It can run jobs on a cron-like schedule. Databricks has support for retrying failed jobs, and can notify about success/failure by email. It gives good access to the spark worker logs for debugging.

However, this is expensive to run. I came up with a workflow that involves development/debugging on databricks, and then export the notebook as a script to be run in EMR (Elastic Map-Reduce, an AWS product). I use Python and pyspark, so this works pretty well.

I do need to import the libraries that Databricks imports automatically. I forget the exact import statements I used, but they are easy enough to figure out.

I used Luigi, which is a Python library for setting up task workflows (see also: Airflow). I set up a few abstract classes to encapsulate "this is an EMR job". Then I extended one of those classes to actually run a spark job, something like:

class ActualJob(AbstractSparkJob):
    luigi_param = luigi.Parameter()  # I usually pass a timestamp to every task

    def spark_cmd:
        return " --runs --spark-script --located-in s3://"

spark_cmd returns a command line string.

My AbstractSparkJob takes the given command and does one of two things: either submit an EMR step to the EMR cluster, using command_runner.jar to run the command, or ssh into the cluster and run spark-submit with the given command as its parameters.

(I don't have all the code do that available to post right now, sorry)

The abstract classes encapsulate all the work of spinning up an EMR cluster (and shutting it down), and making the AWS API calls via the boto library, and the work to ssh in and run spark-submit.

The abstract classes makes it easy for any other data infrastructure developer person to add more jobs.

The actual script that was exported from Databricks lives in S3 and is referenced by the command shown above. EMR does the work of fetching it from S3 and running it. Part of my data pipeline startup is to copy all the EMR scripts, which I keep in a convenient subdirectory, up to a specific prefix in S3. That way my actual Spark scripts all live in the same code repository as the rest of the pipeline.

Tag: spark aws

Feb 20 2017 14:17 Power consumption
CPU+RAM+fans23W 210mA
+ HDD 37W 240mA
+ GTX 105057W 500mA

Tag: geeky tech

Jan 22 2017 17:11 How to back up Samsung Galaxy S4 using adb

Connect the phone via USB. Go into Settings, About Phone, click on Build Version about 7 times. Then Developer Options becomes available, use that to turn on USB Debugging.

Use `lsusb` to get the phone's code. Example output:

Bus 001 Device 059: ID 04e8:6866 Samsung Electronics Co., Ltd GT-I9...

Create a file or directory `~/.android/adbusb.ini` containing the line "0x04e8" based on the lsusb output above.

As root, drop a file called /etc/udev/rules.d/s4-android.rules containing the following:

SUBSYSTEM=="usb", SYSFS{idVendor}=="04e8", MODE="0666"

Note that the vendor id is again from the lsusb above.

Run this too (as root):

chmod 644 /etc/udev/rules.d/s4-android.rules

Chef recipe for that file:

file 'udev_rule_s4_android' do
    path '/etc/udev/rules.d/s4-android.rules'
    mode '0644'
    owner 'root'
    group 'root'

execute 'udev-restart' do
    command '/etc/init.d/udev restart'
    subscribes :create, 'udev_rule_s4_android', :immediately
    action :nothing

Connect the phone in 'PTP' mode:
Connect the phone via USB. Drag down the notifications, click on the 'Connected as...' notification, and set it to PTP or Camera mode.

`adb shell` should now get you a shell on the phone.


Tag: android

In this folder: