
Hourly Aggregation In Pyspark

I'm looking for a way to aggregate my data by hour. First I want to keep only the hour part of my evtTime. My DataFrame looks like this: Row(access=u'WRITE', agentHost=u'xxxxxx50.ha

Solution 1:

There are multiple possible solutions, the simplest one is to use only the required part as a string:

from pyspark.sql.functions import substring, to_timestamp

df = spark.createDataFrame(["2017-10-01 23:03:51.337"], "string").toDF("evtTime")

df.withColumn("hour", substring("evtTime", 0, 13)).show()
# +--------------------+-------------+                                            # |             evtTime|         hour|# +--------------------+-------------+# |2017-10-01 23:03:...|2017-10-01 23|# +--------------------+-------------+

or as a timestamp:

df.withColumn("hour", to_timestamp(substring("evtTime", 0, 13), "yyyy-MM-dd HH")).show()
# +--------------------+-------------------+
# |             evtTime|hour|
# +--------------------+-------------------+
# |2017-10-0123:03:...|2017-10-0123:00:00|
# +--------------------+-------------------+
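Incidentally, once the hour is a real timestamp, Spark's built-in window function can bucket rows into hourly windows directly, with no string manipulation at all. A minimal sketch of that alternative (not part of the original answer; the count aggregation and the events alias are placeholders):

from pyspark.sql.functions import col, count, window

# Tumbling 1-hour windows straight from the timestamp column.
(df.withColumn("ts", col("evtTime").cast("timestamp"))
   .groupBy(window("ts", "1 hour"))
   .agg(count("*").alias("events"))
   .show(truncate=False))

The remaining variants below stick to deriving an explicit hour column.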

You could also use date_format:

from pyspark.sql.functions import date_format, col

df.withColumn("hour", date_format(col("evtTime").cast("timestamp"), "yyyy-MM-dd HH:00")).show()
# +--------------------+----------------+
# |             evtTime|            hour|
# +--------------------+----------------+
# |2017-10-01 23:03:...|2017-10-01 23:00|
# +--------------------+----------------+

or date_trunc:

from pyspark.sql.functions import col, date_trunc

df.withColumn("hour", date_trunc("hour", col("evtTime").cast("timestamp"))).show()
# +--------------------+-------------------+
# |             evtTime|               hour|
# +--------------------+-------------------+
# |2017-10-01 23:03:...|2017-10-01 23:00:00|
# +--------------------+-------------------+
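Whichever variant you pick, the hourly aggregation the question asks about is then just a groupBy on the derived column. A minimal sketch building on the date_trunc version (available since Spark 2.3); the count aggregation and the events alias are placeholders, since the question doesn't say which metric is wanted:

from pyspark.sql.functions import col, count, date_trunc

# Bucket each row into its hour, then aggregate per bucket.
hourly = (df
    .withColumn("hour", date_trunc("hour", col("evtTime").cast("timestamp")))
    .groupBy("hour")
    .agg(count("*").alias("events"))
    .orderBy("hour"))

hourly.show()

The timestamp-typed buckets (to_timestamp, date_trunc) sort chronologically as-is; the string buckets only sort correctly because the zero-padded yyyy-MM-dd HH layout happens to order lexicographically.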
