Using the AWS CLI to manage Spark Clusters on EMR: Examples and Reference


WIP Alert: This is a work in progress. Current information is correct, but more content will probably be added in the future.

All of these commands require that you install awscli. On Ubuntu: sudo apt-get install python-pip, then pip install awscli; then run aws configure to set up your credentials. More info here
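
If you'd rather script this step than answer aws configure's interactive prompts, aws configure set writes the same config files non-interactively. A sketch, using AWS's documented placeholder credentials (substitute your own):

aws configure set aws_access_key_id AKIAIOSFODNN7EXAMPLE
aws configure set aws_secret_access_key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
aws configure set region us-east-1
aws configure set output json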

Send a file to the master node in the cluster

If you omit --dest, the file will be sent to the home directory of the hadoop user, i.e. /home/hadoop/.

aws emr put \
  --cluster-id <yourclusterid> \
  --key-pair-file <pathtopemfile> \
  --src <pathtofile> \
  --dest <pathonremoteserver>
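
For instance, assuming a cluster id of j-2AXXXXXXGAPLF and a key at ~/mykey.pem (both hypothetical), copying a local JAR to the master node would look like:

# All ids and paths here are hypothetical -- substitute your own
aws emr put \
  --cluster-id j-2AXXXXXXGAPLF \
  --key-pair-file ~/mykey.pem \
  --src target/myfatjarfile.jar \
  --dest /home/hadoop/myfatjarfile.jar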

SSH to the master node in the cluster

aws emr ssh \
  --cluster-id <yourclusterid> \
  --key-pair-file <pathtopemfile>

SSH to the master node in the cluster AND run a command manually

As an example, say we'd like to run two hadoop fs commands on our cluster. <<'ENDSSH' starts a heredoc; quoting the delimiter ('ENDSSH' rather than ENDSSH) prevents local variable expansion, so you don't need to escape anything in the command body.

aws emr ssh \
  --cluster-id <yourclusterid> \
  --key-pair-file <pathtopemfile> <<'ENDSSH'
hadoop fs -rm somefile
hadoop fs -put /path/to/myfile
ENDSSH
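
You can see the quoting rule locally with plain cat, no cluster needed: a quoted delimiter passes the body through verbatim, while an unquoted one expands variables on your local machine before anything is sent.

# Quoted delimiter: $HOME is NOT expanded locally; the line is passed as-is
cat <<'EOF'
echo $HOME
EOF

# Unquoted delimiter: $HOME IS expanded locally before cat sees it
cat <<EOF
echo $HOME
EOF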

Submit a Spark Application to a running cluster (JAR is in the cluster already)

The file /home/hadoop/myfatjarfile.jar is the fat jar containing your job.

See Creating Scala fat JARs for Spark for more info

It needs to exist on the master node or you'll get an error. To send a file to the master node you can use aws emr put, as explained above.

Note that foo and bar are the parameters passed to the main method of your job.

aws emr add-steps \
  --cluster-id <yourclusterid> \
  --steps Type=Spark,\
Name="Sample Step to Run",\
Args=[--class,com.queirozf.SampleStep,--deploy-mode,cluster,--master,yarn,/home/hadoop/myfatjarfile.jar,foo,bar]
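
The inline Args=[...] shorthand gets unwieldy for longer argument lists. The AWS CLI also accepts the step definition as a JSON file via file://, which you can validate before submitting. A sketch (the jar path and step name mirror the example above; the cluster id is a placeholder):

# Write the step definition to a file
cat > steps.json <<'EOF'
[
  {
    "Type": "Spark",
    "Name": "Sample Step to Run",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "--class", "com.queirozf.SampleStep",
      "--deploy-mode", "cluster",
      "--master", "yarn",
      "/home/hadoop/myfatjarfile.jar",
      "foo", "bar"
    ]
  }
]
EOF

# Validate the JSON before submitting
python3 -m json.tool steps.json > /dev/null && echo "steps.json is valid"

# Then submit with:
# aws emr add-steps --cluster-id <yourclusterid> --steps file://steps.json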

Submit a Spark Application to a running cluster (JAR is on S3)

If you would rather upload the fat JAR to S3 than to the EMR cluster, you can use an S3 path in Args instead:

aws emr add-steps \
  --cluster-id <yourclusterid> \
  --steps Type=Spark,\
Name="Sample Step to Run",\
Args=[--class,com.queirozf.SampleStep,--deploy-mode,cluster,--master,yarn,s3://my-s3-bucket/path/to/myfatjarfile.jar,foo,bar]
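
In either case, add-steps prints the id of the newly created step; you can then poll its state with describe-step. A sketch (both ids below are hypothetical placeholders):

# Hypothetical ids -- add-steps prints the real step id on submission
aws emr describe-step \
  --cluster-id j-2AXXXXXXGAPLF \
  --step-id s-1234567890ABC \
  --query 'Step.Status.State' \
  --output text
# Prints one of PENDING, RUNNING, COMPLETED, FAILED, etc.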
