Using the AWS CLI to manage Spark Clusters on EMR: Examples and Reference
- Send a file to the master node of the cluster
- SSH to master node in cluster
- SSH to master node in cluster AND run command
- Submit Spark Application to running cluster
- Submit Spark Application to running cluster (JAR on S3)
All of these commands require that you install awscli. On Ubuntu: sudo apt-get install python-pip; pip install awscli; then run aws configure to set up your credentials. More info here.
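For reference, an install-and-configure session might look roughly like this; the access key, region and output format values are placeholders you would replace with your own:
# install pip and the AWS CLI (Ubuntu)
sudo apt-get install python-pip
pip install awscli

# configure credentials; the values below are placeholders
aws configure
AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: ****************************************
Default region name [None]: us-east-1
Default output format [None]: json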
Send a file to the master node of the cluster
If you omit --dest, the file will be sent to the home directory of the hadoop user, i.e. /home/hadoop/.
aws emr put \
--cluster-id <yourclusterid> \
--key-pair-file <pathtopemfile> \
--src <pathtofile> \
--dest <pathonremoteserver>
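As a concrete (hypothetical) example, this sends a locally built fat JAR to the master node's home directory; the cluster ID, key file and paths are placeholders:
# all values below are placeholders
aws emr put \
--cluster-id j-2AXXXXXXGAPLF \
--key-pair-file ~/keys/my-emr-key.pem \
--src ./target/myfatjarfile.jar \
--dest /home/hadoop/myfatjarfile.jar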
SSH to master node in cluster
aws emr ssh \
--cluster-id <yourclusterid> \
--key-pair-file <pathtopemfile>
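Under the hood this opens an SSH session as the hadoop user on the master node. If you'd rather use plain ssh, the sketch below shows one way to look up the master node's public DNS name first (cluster ID and key path are placeholders):
# look up the master node's public DNS name
aws emr describe-cluster \
--cluster-id <yourclusterid> \
--query Cluster.MasterPublicDnsName \
--output text

# then connect manually as the hadoop user
ssh -i <pathtofile> hadoop@<masterpublicdns>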
SSH to master node in cluster AND run command
As an example, say we'd like to run two hadoop fs commands on our cluster. The <<'ENDSSH' marker starts a heredoc (and ENDSSH ends it); quoting the delimiter avoids having to escape special characters in the commands.
aws emr ssh \
--cluster-id <yourclusterid> \
--key-pair-file <pathtopemfile> <<'ENDSSH'
hadoop fs -rm somefile
hadoop fs -put /path/to/myfile
ENDSSH
Submit Spark Application to running cluster
The JAR file must already be on the cluster!
The file /home/hadoop/myfatjarfile.jar is the fat JAR containing your job.
See Creating Scala fat JARs for Spark for more info.
It needs to exist on the master node or you'll get an error. To send a file to the master node you can use aws emr put, as explained above.
Note that foo and bar are the parameters passed to the main method of your job.
aws emr add-steps \
--cluster-id <yourclusterid> \
--steps Type=Spark,\
Name="Sample Step to Run",\
Args=[--class,com.queirozf.SampleStep,--deploy-mode,cluster,--master,yarn,/home/hadoop/myfatjarfile.jar,foo,bar]
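add-steps prints the ID of the step it just created; a simple way to keep an eye on it is to poll with list-steps or describe-step (the IDs below are placeholders):
# list all steps on the cluster and their states
aws emr list-steps --cluster-id <yourclusterid>

# or inspect a single step using the ID printed by add-steps
aws emr describe-step \
--cluster-id <yourclusterid> \
--step-id <yourstepid>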
Submit Spark Application to running cluster (JAR on S3)
If you would rather upload the fat JAR to S3 than to the EMR cluster, you can use the S3 path in Args instead:
aws emr add-steps \
--cluster-id <yourclusterid> \
--steps Type=Spark,\
Name="Sample Step to Run",\
Args=[--class,com.queirozf.SampleStep,--deploy-mode,cluster,--master,yarn,s3://my-s3-bucket/path/to/myfatjarfile.jar,foo,bar]
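To get the JAR onto S3 in the first place you can copy it up with aws s3 cp; the bucket and paths below are placeholders:
# upload the fat JAR to S3 (bucket and paths are placeholders)
aws s3 cp ./target/myfatjarfile.jar s3://my-s3-bucket/path/to/myfatjarfile.jar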
See also
AWS EMR command line reference - It's funny that there is **no mention** of how to add Spark steps on this page. It looks like it's an undocumented feature.