
Spark Streaming: Read CSV String From Kafka, Write To Parquet

There are lots of online examples of reading JSON from Kafka (to write to Parquet), but I cannot figure out how to apply a schema to a CSV string coming from Kafka. The streamed data: cu

Solution 1:

This is how I did it. Instead of using from_json, first extract the CSV string from the Kafka value:

from pyspark.sql.functions import col

# Cast the binary Kafka "value" column to a plain string
interval = df.select(col("value").cast("string")).alias("csv").select("csv.*")
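For context, the df used above would normally come from a Kafka readStream. The original question's source setup is not shown here, so the broker address and topic name below are placeholders in a rough sketch:

# Sketch of a typical Kafka source; broker and topic are placeholders,
# not taken from the original post.
df = (
    spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "some_topic")
        .load()
)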

And then split it into columns. The result can be written out as a Parquet file using the same statement as above (a sketch of such a sink follows the code below):

interval2 = interval.selectExpr(
    "split(value,',')[0] as customer_id",
    "split(value,',')[1] as customer_acct_id",
    "split(value,',')[2] as serv_acct_id",
    "split(value,',')[3] as installed_service_id",
    "split(value,',')[4] as meter_id",
    "split(value,',')[5] as channel_number",
    # ... etc. for the remaining columns
)
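The write statement referred to above is not reproduced in this post, so as a rough sketch, assuming placeholder output and checkpoint paths, a streaming Parquet sink would look something like this:

# Minimal sketch of the Parquet sink; the paths below are placeholders,
# not taken from the original post.
query = (
    interval2.writeStream
        .format("parquet")
        .option("path", "/output/interval_parquet")
        .option("checkpointLocation", "/output/interval_checkpoint")
        .outputMode("append")
        .start()
)
query.awaitTermination()

Note that the Parquet file sink in Structured Streaming only supports append output mode and requires a checkpoint location.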
