Creating Scala Fat Jars for Spark on SBT with sbt-assembly Plugin
In order to submit Spark jobs to a Spark cluster (via spark-submit), you need to include all dependencies (other than Spark itself) in the Jar; otherwise you won't be able to use them in your job.
Create fat Scala Jars using sbt-assembly
One way to do it (for Scala-based projects) is to use the sbt-assembly plugin.
Add sbt-assembly plugin to sbt
Create a file called assembly.sbt under the project/ directory, like this:
├── src/
└── project/
    └── assembly.sbt
In that file, add:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.2")
After adding it to SBT, you can run $ sbt assembly and a Jar file with all dependencies (a so-called fat jar) will be created for you.
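By default, sbt-assembly writes the jar to target/scala-<binary version>/ and names it <name>-assembly-<version>.jar. If a fixed file name is more convenient for spark-submit, a setting along these lines can be added to build.sbt. This is only a sketch, not part of the original build files, and the file name my-spark-job.jar is a hypothetical placeholder:

```scala
// build.sbt (sbt 0.13 / sbt-assembly 0.14.x syntax)
// "my-spark-job.jar" is a hypothetical name chosen for illustration
assemblyJarName in assembly := "my-spark-job.jar"
```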
Spark 1.x
Running sbt assembly often fails with a deduplicate error. It is caused by files that have the same path but different contents in the many dependencies you need to package together to form the fat jar.
You need to tell sbt-assembly how to resolve those conflicts in order to get a cleanly packaged jar.
The following build.sbt file, which I used in a Spark Streaming project, can serve as an example; just paste the assemblyMergeStrategy block into your build file and the deduplicate errors should go away.
Note: The following works for Spark 1.x!
// this file was written for spark 1.6.0 and scala 2.10.4
// it will not work on spark 2!
version := "1.0"
name := "my-sample-spark-streaming-project"
scalaVersion := "2.10.4"
// "provided": supplied by the Spark cluster at runtime, so left out of the fat jar
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.0" % "provided"
// no "provided" scope: these are bundled into the fat jar
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.6.0"
libraryDependencies += "com.amazonaws" % "amazon-kinesis-client" % "1.6.1"
libraryDependencies += "com.amazonaws" % "amazon-kinesis-producer" % "0.10.2"
assemblyMergeStrategy in assembly := {
  case PathList("javax", "servlet", xs @ _*) => MergeStrategy.last
  case PathList("javax", "activation", xs @ _*) => MergeStrategy.last
  case PathList("org", "apache", xs @ _*) => MergeStrategy.last
  case PathList("com", "google", xs @ _*) => MergeStrategy.last
  case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
  case PathList("com", "codahale", xs @ _*) => MergeStrategy.last
  case PathList("com", "yammer", xs @ _*) => MergeStrategy.last
  case "about.html" => MergeStrategy.rename
  case "META-INF/ECLIPSEF.RSA" => MergeStrategy.last
  case "META-INF/mailcap" => MergeStrategy.last
  case "META-INF/mimetypes.default" => MergeStrategy.last
  case "plugin.properties" => MergeStrategy.last
  case "log4j.properties" => MergeStrategy.last
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
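To make the shape of the merge block easier to follow: the cases are tried top to bottom, the first one that matches a conflicting path decides the strategy, and anything unmatched falls through to the previous (default) strategy. The snippet below is a plain-Scala model of that first-match behaviour, written only for illustration. It does not use sbt-assembly; MergeModel and strategyFor are hypothetical names, and the strategies are represented by plain strings:

```scala
// Plain-Scala model of first-match-wins path matching.
// Hypothetical illustration only; it does not call sbt-assembly.
object MergeModel {
  // Returns the name of the strategy the first matching case would select.
  def strategyFor(path: String): String =
    path.split("/").toList match {
      case "javax" :: "servlet" :: _ => "last"    // like PathList("javax", "servlet", xs @ _*)
      case "org" :: "apache" :: _    => "last"    // like PathList("org", "apache", xs @ _*)
      case "about.html" :: Nil       => "rename"  // like the exact-path case "about.html"
      case _                         => "default" // like falling back to oldStrategy(x)
    }

  def main(args: Array[String]): Unit = {
    println(strategyFor("org/apache/commons/logging/Log.class")) // last
    println(strategyFor("about.html"))                           // rename
    println(strategyFor("META-INF/MANIFEST.MF"))                 // default
  }
}
```

In the real build file, MergeStrategy.last keeps the last of the conflicting files in classpath order, and MergeStrategy.rename renames files that originate from jars instead of choosing one.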
Spark 2.x
For Spark 2 you need to add a couple of lines to the above solution.
Here's a working build.sbt for Spark 2:
// this file was written for spark 2.0.0 and scala 2.11.8
version := "1.0"
name := "my-sample-spark2-streaming-project"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided"
assemblyMergeStrategy in assembly := {
  case PathList("org", "aopalliance", xs @ _*) => MergeStrategy.last
  case PathList("javax", "inject", xs @ _*) => MergeStrategy.last
  case PathList("javax", "servlet", xs @ _*) => MergeStrategy.last
  case PathList("javax", "activation", xs @ _*) => MergeStrategy.last
  case PathList("org", "apache", xs @ _*) => MergeStrategy.last
  case PathList("com", "google", xs @ _*) => MergeStrategy.last
  case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
  case PathList("com", "codahale", xs @ _*) => MergeStrategy.last
  case PathList("com", "yammer", xs @ _*) => MergeStrategy.last
  case "about.html" => MergeStrategy.rename
  case "META-INF/ECLIPSEF.RSA" => MergeStrategy.last
  case "META-INF/mailcap" => MergeStrategy.last
  case "META-INF/mimetypes.default" => MergeStrategy.last
  case "plugin.properties" => MergeStrategy.last
  case "log4j.properties" => MergeStrategy.last
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}