WIP Alert: This is a work in progress. Current information is correct, but more content will probably be added in the future.
All of these commands require that you install awscli. On Ubuntu:

sudo apt-get install python-pip
pip install awscli

Then run aws configure to set it up. More info here
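For reference, aws configure prompts interactively for four values; a session looks roughly like this (the keys below are the standard AWS documentation placeholders, not real credentials):

aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: json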
Send a file to the master node in the cluster
If you omit --dest, the file will be sent to the home directory of the hadoop user, i.e. /home/hadoop
aws emr put \
    --cluster-id <yourclusterid> \
    --key-pair-file <pathtopemfile> \
    --src <pathtofile> \
    --dest <pathonremoteserver>
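For instance, with made-up values for the cluster id, key file and paths, copying a jar to the master node could look like this:

aws emr put \
    --cluster-id j-2AXXXXXXGAPLF \
    --key-pair-file ~/keys/my-emr-key.pem \
    --src ./target/somejarfile.jar \
    --dest /home/hadoop/somejarfile.jar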
SSH to the master node in the cluster
aws emr ssh \
    --cluster-id <yourclusterid> \
    --key-pair-file <pathtopemfile>
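As far as I know this is just a convenience wrapper: it looks up the master node's public DNS and opens an SSH session as the hadoop user, roughly equivalent to:

ssh -i <pathtopemfile> hadoop@<master-node-public-dns>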
SSH to the master node in the cluster AND run a command
As an example, say we'd like to run two hadoop fs commands on our cluster. The <<'ENDSSH' part opens a heredoc (closed by the final ENDSSH line); because the delimiter is quoted, nothing inside is expanded by the local shell, so there's no need to escape anything.
aws emr ssh \
    --cluster-id <yourclusterid> \
    --key-pair-file <pathtopemfile> <<'ENDSSH'
hadoop fs -rm somefile
hadoop fs -put /path/to/myfile
ENDSSH
Add a step to a running cluster
/home/hadoop/somejarfile.jar is the fat jar containing your job. It needs to exist on the master node or you'll get an error; to send a file to the master node you can use aws emr put, as explained above. foo and bar are the parameters passed to the main method of your job.
aws emr add-steps \
    --cluster-id <yourclusterid> \
    --steps Type=Spark,\
Name="Sample Step to Run",\
Args=[--class,com.queirozf.SampleStep,--deploy-mode,cluster,--master,yarn,/home/hadoop/somejarfile.jar,foo,bar]
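After submitting, you can check whether the step is pending, running, completed or failed. aws emr list-steps shows recent steps for the cluster, and aws emr describe-step gives details for a single step (using the step id returned by add-steps):

aws emr list-steps --cluster-id <yourclusterid>
aws emr describe-step --cluster-id <yourclusterid> --step-id <yourstepid>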