
Spark UDF With Dictionary Argument Fails

I have a column (myCol) in a Spark DataFrame that has the values 1 and 2, and I want to create a new column with the description of these values, like 1 -> 'A', 2 -> 'B', etc. I know that th…

Solution 1:

It is possible; you just have to do it a bit differently.

from pyspark.sql.functions import udf

dictionary = {1: 'A', 2: 'B'}

def add_descriptions(in_dict):
    # close over the plain Python dict and return a UDF that looks each value up in it
    def f(x):
        return in_dict.get(x)
    return udf(f)

df.withColumn(
    "description",
    add_descriptions(dictionary)(df.myCol)
)

If you want to pass your dict directly into the UDF call, you can't: UDFs only accept columns as arguments, so you would need a map column in place of the dict (see the sketch below).
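
For example, here is a minimal sketch (my addition, not part of the original answer) that builds such a map column from the same dictionary with create_map and skips the UDF entirely; it assumes the df and dictionary defined above:

from itertools import chain
from pyspark.sql.functions import create_map, lit

# flatten the dict into alternating key/value literals: lit(1), lit('A'), lit(2), lit('B')
mapping = create_map([lit(x) for x in chain(*dictionary.items())])

df.withColumn("description", mapping[df.myCol])

Keys that are not in the map come back as null, just like the .get() call in the UDF.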


Solution 2:

If you are using Spark >= 2.4.0 you can also use the built-in map_from_arrays function to create a map column on the fly and then look up the desired value with getItem, as shown below:

from pyspark.sql.functions import lit, map_from_arrays, array

df = spark.createDataFrame([[1], [2], [3]]).toDF("key")

lookup = {1: 'A', 2: 'B'}

# build parallel arrays of literal keys and values, then zip them into a map column
map_keys = array([lit(k) for k in lookup.keys()])
map_values = array([lit(v) for v in lookup.values()])
map_func = map_from_arrays(map_keys, map_values)

df = df.withColumn("description", map_func.getItem(df.key))

Output:

+---+-----------+
|key|description|
+---+-----------+
|  1|          A|
|  2|          B|
|  3|       null|
+---+-----------+
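
As the output shows, keys missing from the map (3 here) come back as null. If you would rather substitute a default, one option (my addition, not from the original answer, with "unknown" as an assumed placeholder) is to wrap the lookup in coalesce:

from pyspark.sql.functions import coalesce, lit

# fall back to a placeholder string when the key is not present in the map
df = df.withColumn(
    "description",
    coalesce(map_func.getItem(df.key), lit("unknown"))
)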

Solution 3:

Here's how to solve this with a broadcast dictionary, which is the most reliable approach because it also works with large dictionaries:

from pyspark.sql import functions as F

def add_descriptions(dict_b):
    # look each value up in the broadcast dict on the executors
    def f(x):
        return dict_b.value.get(x)
    return F.udf(f)

df = spark.createDataFrame([[1,], [2,], [3,]]).toDF("some_num")
dictionary = {1: 'A', 2: 'B'}
dict_b = spark.sparkContext.broadcast(dictionary)

df.withColumn(
    "res",
    add_descriptions(dict_b)(F.col("some_num"))
).show()
+--------+----+
|some_num| res|
+--------+----+
|       1|   A|
|       2|   B|
|       3|null|
+--------+----+

Great question, this is an important design pattern for PySpark programmers to master.
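
For larger datasets, a vectorized pandas UDF combined with the same broadcast variable can reduce per-row Python overhead. This is a sketch of my own, not from the original answers; it assumes Spark >= 3.0 with pyarrow installed and reuses the df and dict_b from Solution 3:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def describe(s: pd.Series) -> pd.Series:
    # vectorized lookup against the broadcast dict; missing keys become null
    return s.map(dict_b.value)

df.withColumn("res", describe(F.col("some_num"))).show()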
