Creating Scala Fat Jars for Spark on SBT with sbt-assembly Plugin

To submit a Spark job to a cluster with spark-submit, you need to package every dependency other than Spark itself into the jar. Otherwise those libraries will not be available to your job at runtime.

Create fat Scala Jars using sbt-assembly

One way to do it (for Scala-based projects) is to use the sbt-assembly plugin.

Add sbt-assembly plugin to sbt

Create a file called assembly.sbt under the project/ directory, like this:

├── src/
└── project/
    └── assembly.sbt

In that file, add:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.2")

Once the plugin is added, you can run $ sbt assembly from the project root. This builds a single jar containing your code and all of its dependencies (a so-called fat jar).
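If you want a predictable file name to pass to spark-submit, sbt-assembly lets you set the jar name and entry point in build.sbt. The following is a minimal sketch in the sbt 0.13-style syntax that matches sbt-assembly 0.14.x; the jar name and the main class com.example.MySparkJob are hypothetical placeholders, not names from this project.

```scala
// build.sbt

// Hypothetical jar name. If you leave this out, the plugin derives a name
// from the project name and version.
assemblyJarName in assembly := "my-spark-job-assembly.jar"

// Hypothetical entry point. Replace it with the main class of your own job.
mainClass in assembly := Some("com.example.MySparkJob")
```

With these settings the jar should be written under target/scala-<scala binary version>/, for example target/scala-2.10/ for a Scala 2.10 project.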

Troubleshooting: "deduplicate: different file contents found in the following:"

This is a very common error. It occurs when two or more of the jars being packaged together contain a file at the same path but with different contents, so sbt-assembly cannot decide which copy to keep.

To get a cleanly packaged jar, you need to tell sbt-assembly how to resolve each of these conflicts by defining a merge strategy.
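As a minimal sketch of what such a rule set looks like (again in the sbt 0.13-style syntax used by sbt-assembly 0.14.x, with the two paths picked purely for illustration), each case maps a path inside the jar to a strategy, and the final case hands everything else to the previously defined strategy:

```scala
// build.sbt
assemblyMergeStrategy in assembly := {
  // PathList matches on path segments; xs @ _* captures the rest of the path.
  // Here: keep the last copy found of anything under javax/servlet/.
  case PathList("javax", "servlet", xs @ _*) => MergeStrategy.last
  // A plain string matches one exact path.
  case "log4j.properties" => MergeStrategy.last
  // Anything not listed above is handled by the previously defined strategy.
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
```

MergeStrategy.last keeps one copy and drops the others, which silences the error but can also hide a genuine version conflict between libraries. The plugin offers other strategies as well, such as MergeStrategy.first, MergeStrategy.discard, MergeStrategy.concat and MergeStrategy.rename, so it is worth checking which one fits each conflicting file.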

The following build.sbt, which I used in a Spark Streaming project, can serve as an example. Paste the assemblyMergeStrategy block into your build file and most of these errors should go away; if a conflict remains, add a case for the path named in the error message.

Note: The following works for Spark 1.x!

If you are using Spark 2, skip to the Spark 2 section further down.

// this file was written for spark 1.6.0 and scala 2.10.4
// it will not work on spark 2!

version := "1.0"

name := "my-sample-spark-streaming-project"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.6.0"
libraryDependencies += "com.amazonaws" % "amazon-kinesis-client" % "1.6.1"
libraryDependencies += "com.amazonaws" % "amazon-kinesis-producer" % "0.10.2"

assemblyMergeStrategy in assembly := {
  case PathList("javax", "servlet", xs @ _*) => MergeStrategy.last
  case PathList("javax", "activation", xs @ _*) => MergeStrategy.last
  case PathList("org", "apache", xs @ _*) => MergeStrategy.last
  case PathList("com", "google", xs @ _*) => MergeStrategy.last
  case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
  case PathList("com", "codahale", xs @ _*) => MergeStrategy.last
  case PathList("com", "yammer", xs @ _*) => MergeStrategy.last
  case "about.html" => MergeStrategy.rename
  case "META-INF/ECLIPSEF.RSA" => MergeStrategy.last
  case "META-INF/mailcap" => MergeStrategy.last
  case "META-INF/mimetypes.default" => MergeStrategy.last
  case "plugin.properties" => MergeStrategy.last
  case "log4j.properties" => MergeStrategy.last
  // anything not matched above falls back to the default strategy
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}

Troubleshooting Spark 2: "deduplicate: different file contents found in the following:"

For Spark 2 you need to add a couple of lines to the above solution.
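Judging by the two build files on this page, the merge-strategy additions are the two cases for org.aopalliance and javax.inject. A sketch that isolates just those additions, assuming the rest of the Spark 1.x block stays unchanged:

```scala
// build.sbt (only the Spark 2 additions are spelled out here)
assemblyMergeStrategy in assembly := {
  // Added for Spark 2:
  case PathList("org", "aopalliance", xs @ _*) => MergeStrategy.last
  case PathList("javax", "inject", xs @ _*) => MergeStrategy.last
  // The remaining cases from the Spark 1.x block go here, unchanged.
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
```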

Here's a working build.sbt for Spark 2:

// this file was written for spark 2.0.0 and scala 2.11.8

version := "1.0"

name := "my-sample-spark2-streaming-project"

scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided"

assemblyMergeStrategy in assembly := {
  case PathList("org","aopalliance", xs @ _*) => MergeStrategy.last
  case PathList("javax", "inject", xs @ _*) => MergeStrategy.last
  case PathList("javax", "servlet", xs @ _*) => MergeStrategy.last
  case PathList("javax", "activation", xs @ _*) => MergeStrategy.last
  case PathList("org", "apache", xs @ _*) => MergeStrategy.last
  case PathList("com", "google", xs @ _*) => MergeStrategy.last
  case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
  case PathList("com", "codahale", xs @ _*) => MergeStrategy.last
  case PathList("com", "yammer", xs @ _*) => MergeStrategy.last
  case "about.html" => MergeStrategy.rename
  case "META-INF/ECLIPSEF.RSA" => MergeStrategy.last
  case "META-INF/mailcap" => MergeStrategy.last
  case "META-INF/mimetypes.default" => MergeStrategy.last
  case "plugin.properties" => MergeStrategy.last
  case "log4j.properties" => MergeStrategy.last
  // anything not matched above falls back to the default strategy
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
