How to fix null values in Spark DataFrame columns
No telling if I’ll ever need this again, but this weekend I was helping someone with some Scala Spark work, and the short version of the story is that they were ending up with null values in their data after creating a Spark join
. The null values were ending up in two fields, one named balance and another named accountId, so I created these two Spark udf
functions to fix the data, converting null values into Long
in the first example, and null values into empty strings in the second example:
val fixBalance = udf((s: String) => if (s==null) 0 else s.toLong)
val df2: DataFrame = df.withColumn("balance", fixBalance($"balance"))
val fixAccountId = udf((s: String) => if (s==null) "" else s)
val df3: DataFrame = df2.withColumn("accountId", fixAccountId($"accountId"))
Notice that I started with a Spark DataFrame
named df
, then created df2
and then df3
. So then the final solution involved using df3
— which had the corrected, non-null data, thanks to the udf
functions — like this:
val res: Dataset[CustomerAccounts] = df3.groupBy( ...
In summary, if you ever have null values in Spark DataFrame columns, I hope these examples of how to fix those null values is helpful. There may be other ways to solve this problem, but this solution worked for what we were doing this weekend.
Reporting live from Boulder, Colorado,
Alvin Alexander
Recent blog posts
- Business Analyst: How to write accurate software requirements
- Business Analyst: A simple secret to running a great meeting
- One thing a business analyst should ask about any requirement
- Business Analysts and Use Case quality: Questions to ask yourself when writing a Use Case
- The three things a Business Analyst should think about during meetings
- Testing web applications Selenium with Scala 3 and ScalaTest
- Scala Cookbook 2021: A best-selling new release in OOP and FP
- Salar Rahmanian's newsletter (and Functional Programming, Simplified)
- Our “Back To Now” app: Now available on iOS and Android
- An Android location “Fused Location Provider API” example