Using the AWS CLI to manage Spark Clusters on EMR: Examples and Reference

WIP Alert: This is a work in progress. Current information is correct, but more content will probably be added in the future.

All of these commands require that you install awscli. On Ubuntu: sudo apt-get install python-pip, then pip install awscli; then run aws configure to set up your credentials. More info here.
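
For reference, the whole install-and-configure sequence on Ubuntu looks like this (newer systems may ship python3-pip instead of python-pip):

sudo apt-get install python-pip
pip install awscli
aws configure

aws configure will prompt you for your access key, secret key, default region and default output format.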

Send a file to the master node in the cluster

If you omit --dest, the file will be sent to the home directory of the hadoop user, i.e. /home/hadoop/.

aws emr put \
  --cluster-id <yourclusterid> \
  --key-pair-file <pathtopemfile> \
  --src <pathtofile> \
  --dest <pathonremoteserver>
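
For example, to copy a local jar into the hadoop user's home directory on the master node (the cluster id, key file and jar name below are just placeholders):

aws emr put \
  --cluster-id j-XXXXXXXXXXXXX \
  --key-pair-file ~/mykeypair.pem \
  --src ./somejarfile.jar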

SSH to the master node in the cluster

aws emr ssh \
  --cluster-id <yourclusterid> \
  --key-pair-file <pathtopemfile>

SSH to the master node in the cluster AND run a command

As an example, say we'd like to run two hadoop fs commands on our cluster. <<'ENDSSH' starts a heredoc whose body is sent to the remote shell; quoting the delimiter ('ENDSSH') keeps the local shell from expanding anything, so you don't need to escape variables. Don't forget the closing ENDSSH line.

aws emr ssh \
  --cluster-id <yourclusterid> \
  --key-pair-file <pathtopemfile> <<'ENDSSH'
hadoop fs -rm somefile
hadoop fs -put /path/to/myfile
ENDSSH
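
The emr ssh subcommand essentially wraps plain ssh. If you'd rather connect with ssh directly (say, from a script), you can look up the master node's public DNS name with describe-cluster and log in as the hadoop user; a minimal sketch, assuming your security group allows inbound SSH:

MASTER_DNS=$(aws emr describe-cluster \
  --cluster-id <yourclusterid> \
  --query Cluster.MasterPublicDnsName \
  --output text)
ssh -i <pathtopemfile> hadoop@"$MASTER_DNS" 'hadoop fs -ls /'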

Add a step to a running cluster

The file /home/hadoop/somejarfile.jar is the fat jar containing your job. It needs to exist on the master node or you'll get an error. To send a file to the master node you can use aws emr put, as explained above.

Note that foo and bar are the parameters to the main method of your job. In the command below, <yourmainclass> stands for the fully-qualified name of the class containing that main method, and ActionOnFailure=CONTINUE tells EMR to keep the cluster running even if the step fails.

aws emr add-steps \
  --cluster-id <yourclusterid> \
  --steps Type=Spark,\
Name="Sample Step to Run",\
ActionOnFailure=CONTINUE,\
Args=[--class,<yourmainclass>,/home/hadoop/somejarfile.jar,foo,bar]
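
add-steps prints the id of the step it created; you can then check its progress with list-steps or describe-step (the step id below is a placeholder):

aws emr list-steps --cluster-id <yourclusterid>

aws emr describe-step \
  --cluster-id <yourclusterid> \
  --step-id s-XXXXXXXXXXXXX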
