Hourly Aggregation In Pyspark
I'm looking for a way to aggregate my data by hour. First, I want to keep only the hour part of my evtTime. My DataFrame looks like this: Row(access=u'WRITE', agentHost=u'xxxxxx50.ha
Solution 1:
There are multiple possible solutions, the simplest one is to use only the required part as a string:
from pyspark.sql.functions import substring, to_timestamp
df = spark.createDataFrame(["2017-10-01 23:03:51.337"], "string").toDF("evtTime")
df.withColumn("hour", substring("evtTime", 0, 13)).show()
# +--------------------+-------------+
# |             evtTime|         hour|
# +--------------------+-------------+
# |2017-10-01 23:03:...|2017-10-01 23|
# +--------------------+-------------+
or as a timestamp:
df.withColumn("hour", to_timestamp(substring("evtTime", 0, 13), "yyyy-MM-dd HH")).show()
# +--------------------+-------------------+
# |             evtTime|               hour|
# +--------------------+-------------------+
# |2017-10-01 23:03:...|2017-10-01 23:00:00|
# +--------------------+-------------------+
You could also use date_format:
from pyspark.sql.functions import date_format, col
df.withColumn("hour", date_format(col("evtTime").cast("timestamp"), "yyyy-MM-dd HH:00")).show()
# +--------------------+----------------+
# |             evtTime|            hour|
# +--------------------+----------------+
# |2017-10-01 23:03:...|2017-10-01 23:00|
# +--------------------+----------------+
or date_trunc:
from pyspark.sql.functions import date_trunc
df.withColumn("hour", date_trunc("hour", col("evtTime").cast("timestamp"))).show()
# +--------------------+-------------------+
# |             evtTime|               hour|
# +--------------------+-------------------+
# |2017-10-01 23:03:...|2017-10-01 23:00:00|
# +--------------------+-------------------+
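Once you have the truncated hour column, the hourly aggregation itself is just a groupBy on it. A minimal sketch, assuming you simply want an event count per hour (the question's real DataFrame has more columns, such as access and agentHost, so swap in whatever aggregation you actually need):

from pyspark.sql.functions import date_trunc, col, count

hourly = (df
    .withColumn("hour", date_trunc("hour", col("evtTime").cast("timestamp")))
    .groupBy("hour")                      # one group per hour bucket
    .agg(count("*").alias("events")))     # hypothetical metric: number of rows per hour
hourly.show()
# +-------------------+------+
# |               hour|events|
# +-------------------+------+
# |2017-10-01 23:00:00|     1|
# +-------------------+------+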