Add an Apache Zeppelin UI to your Spark cluster on AWS EMR


This is a small guide on how to add Apache Zeppelin to your Spark cluster on AWS Elastic MapReduce (EMR). It's the easiest way to get interactive access to Spark and be able to view results immediately.

TL;DR: You need to a) start a non-auto-terminating cluster with the Zeppelin add-on and an EC2 key pair, b) open an SSH tunnel to the cluster, c) enable a proxy in your browser (such as FoxyProxy) and d) navigate to http://your-cluster-url:8890

Apache Zeppelin

Apache Zeppelin is an incubating Apache project whose aim is to provide generic, notebook-like access to various backends. Spark is but one of them.

For info on how to launch a Spark cluster on AWS EMR, look at Creating a Spark Cluster on AWS EMR: a tutorial

A word of advice here: in order to use Zeppelin on your cluster, the cluster must not terminate automatically once it has finished running.

This raises the possibility of your leaving your cluster active for an extended period of time if you're not careful. If you forget to manually terminate it after you're done using it, you can incur a large bill, denominated in U.S. dollars.

Don't forget to terminate your cluster after you're done using Zeppelin.
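
As a reminder of what "terminate" looks like in practice, here is a minimal sketch using the AWS CLI. The function names list_active_clusters and terminate_cluster are just illustrative wrappers (not part of the AWS CLI), and the cluster ID is assumed to be the j-... identifier that create-cluster prints when the cluster is launched.

```shell
# List clusters that are still starting, running, or waiting (and still billing).
# Prints one line per cluster: ID, name, state.
list_active_clusters() {
    aws emr list-clusters --active \
        --query 'Clusters[].[Id,Name,Status.State]' --output text
}

# Terminate a single cluster. $1 is the cluster ID, e.g. j-XXXXXXXXXXXXX.
terminate_cluster() {
    aws emr terminate-clusters --cluster-ids "$1"
}
```

Running list_active_clusters once you are done is a cheap way to confirm nothing was left on; terminate_cluster j-XXXXXXXXXXXXX then shuts down whatever shows up.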

Step 1) Launch cluster with Zeppelin as add-on

Replace YOUR_KEY_NAME with the name of your EC2 key pair (the name registered in AWS, not the path to the .pem file).

$ aws emr create-cluster \
    --name "1-node Zeppelin cluster (turn me off after use)" \
    --instance-type m3.xlarge \
    --release-label emr-4.1.0 \
    --instance-count 1 \
    --ec2-attributes KeyName=YOUR_KEY_NAME \
    --use-default-roles \
    --applications Name=Spark Name=Zeppelin-Sandbox \
    --no-auto-terminate
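
The command above prints a cluster ID (something like j-XXXXXXXXXXXXX). The next step needs the cluster's public address, so here is a small sketch of one way to get it with the AWS CLI. The wrapper name master_dns is made up for this example; it assumes a reasonably recent AWS CLI that supports the "wait" subcommand.

```shell
# Block until the cluster is up, then print the master node's public DNS name.
# $1 is the cluster ID returned by `aws emr create-cluster`.
master_dns() {
    aws emr wait cluster-running --cluster-id "$1" &&
    aws emr describe-cluster --cluster-id "$1" \
        --query 'Cluster.MasterPublicDnsName' --output text
}
```

Usage: master_dns j-XXXXXXXXXXXXX. Startup usually takes several minutes, so the call will sit there for a while before printing the address.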

Step 2) Open SSH Tunnel to cluster

Once you know your cluster's public address (the master node's public DNS name), you can open an SSH tunnel to it:

  • <port> can be any local port that's not in use and is higher than 1024, for example 8157
  • Your <cluster_address> will vary; it should look something like ec2-99-999-999-99.compute-1.amazonaws.com
  • hadoop is the default login user on EMR nodes

ssh -i /path/to/your/key.pem -ND <port> hadoop@<cluster_address>

Step 3) Install and configure FoxyProxy

Follow the instructions in this AWS post to point FoxyProxy at the SOCKS proxy on localhost:<port> that you opened in Step 2.

Step 4) Navigate to your cluster URL, port 8890

Make sure FoxyProxy (the little blue dot to the right of the address bar) is enabled!
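
If the page doesn't load, it can help to rule out the browser. Below is a rough sketch that requests the Zeppelin page with curl through the same SOCKS tunnel. The function name check_zeppelin is hypothetical; it assumes the tunnel from Step 2 is still open and that curl is installed locally.

```shell
# Request the Zeppelin page through the SOCKS proxy opened by `ssh -ND <port>`
# and print only the HTTP status code.
# $1 is the local tunnel port, $2 is the cluster address.
check_zeppelin() {
    curl --silent --output /dev/null --write-out '%{http_code}\n' \
        --socks5-hostname "localhost:$1" \
        "http://$2:8890/"
}
```

Usage: check_zeppelin 8157 ec2-99-999-999-99.compute-1.amazonaws.com. A success status here suggests the tunnel and Zeppelin are fine and the problem is in the browser proxy settings; a connection error points back at the tunnel or the cluster.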

[Screenshot: Zeppelin welcome screen]

