Tag Archives: Row to dict

Convert nested spark table Row structs to python dicts in functions, and then convert processed Rows to dataframes

Its been a while since I used flatMaps in spark, and most often it can be replaced by using Explode. But in my case today, I constructed a list of Rows and then converted them to an actual DF without the painful schema registry using rdd operations.

now, input is kinda messy and I used my convert functions to convert everything into dictionaries should they come in:

Of course, there are many other processing steps, this is just converting the massively nested data into the ones that I like to see, into easy python dictionaries instead of the nasty spark Row objects.

Now I can do my processing based on the converted data.

After that when I have to save the data, I did something like

def process_df(a,b,c):

....

return [T.Row(**item) for item in processed_data.values()]

so they become one list of Rows. Then, in order to convert the list of Row back to the df, I did this

df = data.rdd.flatMap(lambda x:process_df(x[0], x[1], x[2])).toDF()

there it is! The nested ugly dataframe now has been turned into a flat structure and much easier to get analysis going.