Debugging NullPointerException in Apache Spark


A variable may be null inside a map, flatMap, etc. block

Code inside blocks like map and flatMap is executed on worker (executor) nodes.

Particularly when you're running on an actual multi-node cluster (rather than a standalone setup), a variable or function defined outside these blocks may fail to be correctly serialized and shipped across the network to the executor nodes where the code actually runs. The reference then ends up as null on the executor, and calling a method on it throws a NullPointerException.
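The usual culprit is closure serialization: when the lambda references a field of an enclosing class, Spark must serialize the whole enclosing instance and ship it to the executors, and if that fails the field arrives as null. A common fix is to copy the field into a local val so that only the value itself is captured. A minimal sketch of this pattern (the class name Pipeline and method name clean are hypothetical):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import scala.util.matching.Regex

// Hypothetical example class; names are illustrative.
class Pipeline(spark: SparkSession) {
  import spark.implicits._

  // Field of the enclosing instance: referencing it inside map forces
  // Spark to serialize the whole Pipeline object along with the closure.
  val HTML_TAGS_PATTERN: Regex = """<[^>]+>""".r

  def clean(ds: Dataset[String]): Dataset[String] = {
    // Copy the field into a local val: the closure now captures only
    // this value, not the enclosing Pipeline instance.
    val pattern = HTML_TAGS_PATTERN
    ds.map(str => pattern.replaceAllIn(str, " "))
  }
}
```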

Example (Spark 2.1)

val HTML_TAGS_PATTERN = """<[^>]+>""".r

spark
.sparkContext
.textFile(pathToInputFile, numPartitions)
.toDS()
.map { str =>

  var body: String = "" 

  // NEXT LINE TRIGGERS NPE
  body = HTML_TAGS_PATTERN.replaceAllIn(str, " ")

  // other code here

}

The example above causes a NullPointerException (NPE), while the code below doesn't:

spark
.sparkContext
.textFile(pathToInputFile, numPartitions)
.toDS()
.map { str =>

  var body: String = "" 

  // NEXT LINE DOES NOT TRIGGER NPE
  body = """<[^>]+>""".r.replaceAllIn(str, " ")

  // other code here

}
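Besides inlining the regex literal as above, two other common workarounds are worth knowing; this is a sketch under the same assumptions as the examples above (ds stands for the Dataset[String] produced by textFile(...).toDS()):

```scala
// Option 1: create the regex inside the closure, on the executor itself.
// mapPartitions compiles it once per partition instead of once per record.
ds.mapPartitions { iter =>
  val pattern = """<[^>]+>""".r
  iter.map(str => pattern.replaceAllIn(str, " "))
}

// Option 2: keep constants in a top-level object. An object is not
// serialized with the closure; each executor JVM initializes its own
// copy on first use.
object TextConstants {
  val HTML_TAGS_PATTERN = """<[^>]+>""".r
}

ds.map(str => TextConstants.HTML_TAGS_PATTERN.replaceAllIn(str, " "))
```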
