How to fix null values in Spark DataFrame columns
No telling if I’ll ever need this again, but this weekend I was helping someone with some Scala Spark work, and the short version of the story is that they were ending up with null values in their data after a Spark join. The null values were showing up in two fields, one named balance and another named accountId, so I created these two Spark udf functions to fix the data, converting the balance values to Long (with null becoming 0L) in the first example, and converting null accountId values into empty strings in the second example:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf   // the $"..." syntax also needs `import spark.implicits._`
val fixBalance = udf((s: String) => if (s == null) 0L else s.toLong)   // null => 0L
val df2: DataFrame = df.withColumn("balance", fixBalance($"balance"))
val fixAccountId = udf((s: String) => if (s == null) "" else s)        // null => ""
val df3: DataFrame = df2.withColumn("accountId", fixAccountId($"accountId"))
Notice that I started with a Spark DataFrame named df, then created df2, and then df3. So the final solution involved using df3, which had the corrected, non-null data thanks to the udf functions, like this:
val res: Dataset[CustomerAccounts] = df3.groupBy( ...
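The actual groupBy call and the CustomerAccounts case class are specific to that project, so I’ve left them out here. But if you want to see the whole flow end to end, here’s a minimal, self-contained sketch, with a made-up local SparkSession and a few sample rows standing in for the real join output:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.udf

object NullFixDemo extends App {
  // a hypothetical local SparkSession, just for this demo
  val spark = SparkSession.builder.master("local[*]").appName("NullFixDemo").getOrCreate()
  import spark.implicits._

  // sample rows with nulls in both columns, mimicking the bad join output
  val df: DataFrame = Seq(("100", "A1"), (null, "A2"), ("250", null))
    .toDF("balance", "accountId")

  val fixBalance   = udf((s: String) => if (s == null) 0L else s.toLong)
  val fixAccountId = udf((s: String) => if (s == null) "" else s)

  val df3 = df
    .withColumn("balance", fixBalance($"balance"))
    .withColumn("accountId", fixAccountId($"accountId"))

  df3.show()   // the nulls are now 0 and ""
  spark.stop()
}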
In summary, if you ever have null values in Spark DataFrame columns, I hope these examples of how to fix them are helpful. There may be other ways to solve this problem, but this solution worked for what we were doing this weekend.
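For the record, one of those other ways would be Spark’s built-in DataFrameNaFunctions, which can fill null values without writing a udf at all. Here’s a sketch, assuming the same df and column names as above:

import org.apache.spark.sql.functions.{coalesce, lit}

// fill null accountId values with "" using the built-in na.fill,
// and replace the balance udf with a cast plus coalesce
val df3 = df
  .na.fill(Map("accountId" -> ""))
  .withColumn("balance", coalesce($"balance".cast("long"), lit(0L)))

One difference worth noting: cast("long") quietly turns any non-numeric string into null (which coalesce then maps to 0L), while the toLong udf would throw an exception on bad data, so pick whichever failure mode you prefer.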
Reporting live from Boulder, Colorado,
Alvin Alexander