Using the AWS CLI to manage Spark Clusters on EMR: Examples and Reference
- Send a file to the master node of the cluster
- SSH to master node in cluster
- SSH to master node in cluster AND run command
- Submit Spark Application to running cluster
- Submit Spark Application to running cluster (JAR on S3)
All of these commands require that you install awscli. On Ubuntu: sudo apt-get install python-pip; pip install awscli; then run aws configure to set up your credentials. More info here.
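For reference, an install-and-configure session might look roughly like this; the access key, region and output format values are placeholders you would replace with your own:
# install pip and the AWS CLI (Ubuntu)
sudo apt-get install python-pip
pip install awscli

# configure credentials; the values below are placeholders
aws configure
AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: ****************************************
Default region name [None]: us-east-1
Default output format [None]: json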
Send a file to the master node of the cluster
If you omit --dest, the file will be sent to the home directory of the hadoop user, i.e. /home/hadoop/.
aws emr put \
--cluster-id <yourclusterid> \
--key-pair-file <pathtopemfile> \
--src <pathtofile> \
--dest <pathonremoteserver>
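As a concrete (hypothetical) example, this sends a locally built fat JAR to the master node's home directory; the cluster ID, key file and paths are placeholders:
# all values below are placeholders
aws emr put \
--cluster-id j-2AXXXXXXGAPLF \
--key-pair-file ~/keys/my-emr-key.pem \
--src ./target/myfatjarfile.jar \
--dest /home/hadoop/myfatjarfile.jar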
SSH to master node in cluster
aws emr ssh \
--cluster-id <yourclusterid> \
--key-pair-file <pathtopemfile>
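Under the hood this opens an SSH session as the hadoop user on the master node. If you'd rather use plain ssh, the sketch below shows one way to look up the master node's public DNS name first (cluster ID and key path are placeholders):
# look up the master node's public DNS name
aws emr describe-cluster \
--cluster-id <yourclusterid> \
--query Cluster.MasterPublicDnsName \
--output text

# then connect manually as the hadoop user
ssh -i <pathtofile> hadoop@<masterpublicdns>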
SSH to master node in cluster AND run command
As an example, say we'd like to run two hadoop fs commands on our cluster. The <<'ENDSSH' marker starts a heredoc (and ENDSSH ends it); quoting the delimiter avoids having to escape special characters in the commands.
aws emr ssh \
--cluster-id <yourclusterid> \
--key-pair-file <pathtopemfile> <<'ENDSSH'
hadoop fs -rm somefile
hadoop fs -put /path/to/myfile
ENDSSH
Submit Spark Application to running cluster
The JAR file must already be on the cluster!
The file /home/hadoop/myfatjarfile.jar is the fat JAR containing your job.
See Creating Scala fat JARs for Spark for more info.
It needs to exist on the master node or you'll get an error. To send a file to the master node you can use aws emr put, as explained above.
Note that foo and bar are the parameters passed to the main method of your job.
aws emr add-steps \
--cluster-id <yourclusterid> \
--steps Type=Spark,\
Name="Sample Step to Run",\
Args=[--class,com.queirozf.SampleStep,--deploy-mode,cluster,--master,yarn,/home/hadoop/myfatjarfile.jar,foo,bar]
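add-steps prints the ID of the step it just created; a simple way to keep an eye on it is to poll with list-steps or describe-step (the IDs below are placeholders):
# list all steps on the cluster and their states
aws emr list-steps --cluster-id <yourclusterid>

# or inspect a single step using the ID printed by add-steps
aws emr describe-step \
--cluster-id <yourclusterid> \
--step-id <yourstepid>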
Submit Spark Application to running cluster (JAR on S3)
If you would rather upload the fat JAR to S3 than to the EMR cluster, you can use the S3 path in Args instead:
aws emr add-steps \
--cluster-id <yourclusterid> \
--steps Type=Spark,\
Name="Sample Step to Run",\
Args=[--class,com.queirozf.SampleStep,--deploy-mode,cluster,--master,yarn,s3://my-s3-bucket/path/to/myfatjarfile.jar,foo,bar]
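To get the JAR onto S3 in the first place you can copy it up with aws s3 cp; the bucket and paths below are placeholders:
# upload the fat JAR to S3 (bucket and paths are placeholders)
aws s3 cp ./target/myfatjarfile.jar s3://my-s3-bucket/path/to/myfatjarfile.jar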
See also
AWS EMR command line reference - It's funny that there is **no mention** of how to add Spark steps on this page. It looks like it's an undocumented feature.