Elastic MapReduce: merge outputs from multiple reducers into a single file



If you use EMR, your output may be split into several files with names such as part-r-00000, part-r-00001, and so on.

This is fine for Hadoop and Spark, which can read a whole directory of part files, but you may want all results bundled into a single file to simplify reporting and make the output easier to inspect.

Hadoop MapReduce

To merge all outputs into a single file, add another step to your workflow: an IdentityReducer configured with a single reducer. It performs no actual operation on your data; it just merges everything into one file.

The fully qualified name of the IdentityReducer is org.apache.hadoop.mapred.lib.IdentityReducer.
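If your job uses Hadoop Streaming rather than the Java API, the same idea can be sketched as a tiny Python reducer that copies its input straight to its output. This is a minimal sketch, not EMR-specific code; the function names are my own:

```python
# identity_reducer.py -- a no-op reducer for Hadoop Streaming.
# Run with a single reduce task, it funnels every record from all
# mappers into one part file without transforming anything.
import sys


def identity_reduce(lines):
    """Yield each key/value line unchanged -- the identity operation."""
    for line in lines:
        yield line


def main(stdin=None, stdout=None):
    """Copy stdin to stdout, as Hadoop Streaming expects of a reducer."""
    stdin = stdin if stdin is not None else sys.stdin
    stdout = stdout if stdout is not None else sys.stdout
    for line in identity_reduce(stdin):
        stdout.write(line)
```

You would point the streaming job's `-reducer` option at this script and force a single reduce task (e.g. with `-D mapreduce.job.reduces=1`), so that the one reducer writes the one merged file.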

Use this approach if you don't have console access to the cluster where your job runs. If you can run HDFS commands directly, hadoop fs -getmerge is the way to go: it concatenates every file in an HDFS directory into a single local file.
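For a quick illustration of what getmerge does, here is a local stand-in sketched in Python: it concatenates the part files of an output directory, in name order, into one file. The paths and function name are illustrative, not part of Hadoop:

```python
# merge_parts.py -- local stand-in for `hadoop fs -getmerge`:
# concatenate the part-* files of a job output directory, in
# lexicographic order, into a single file.
import glob
import os


def merge_parts(output_dir, merged_path):
    """Concatenate all part-* files in output_dir into merged_path."""
    parts = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    with open(merged_path, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                dst.write(src.read())
    return parts  # the files that were merged, in order
```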

Spark

Call coalesce(1) on your DataFrame/Dataset before writing it to disk or S3; this collapses the data into a single partition, so Spark writes a single part file.
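In PySpark this looks roughly like the sketch below; the input and output paths are placeholders, and it assumes a running Spark session:

```python
# Sketch: write a DataFrame as a single part file.
# The bucket name and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("single-file-output").getOrCreate()

df = spark.read.csv("s3://my-bucket/input/", header=True)

# coalesce(1) collapses the data into one partition, so the write
# below produces a single part-00000 file in the output directory.
df.coalesce(1).write.csv("s3://my-bucket/output/", header=True)
```

Note that Spark still writes a directory containing that one part file, not a bare file; you rename or download it afterwards if a standalone file is required.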


P.S.: You can force any Hadoop workflow to use a single reducer for the reduce phase (thereby generating a single output file), but you risk crashing the application if all your data doesn't fit on a single node.
