
Combine Pivoted And Aggregated Column In PySpark Dataframe

My question is related to this one. I have a PySpark DataFrame, named df, as shown below:

    date       | recipe | percent | volume
    ----------------------------------------
    2019-01-0

Solution 1:

Looking at Spark's source code, the pivot implementation has a special branch for the single-aggregation case:

    val singleAgg = aggregates.size == 1

    def outputName(value: Expression, aggregate: Expression): String = {
      val stringValue = value.name

      if (singleAgg) {
        stringValue <--- Here
      } 
      else {
        val suffix = {...}
        stringValue + "_" + suffix
      }
    }
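In plain terms: with exactly one aggregation, the pivot value alone becomes the column name; with several aggregations, Spark appends a suffix derived from the aggregate. A rough plain-Python mimic of that branch, just to illustrate the resulting names (a sketch for illustration, not Spark's actual code; the function and parameter names are made up):

```python
def output_name(value, num_aggregates, suffix):
    """Mimics Spark's outputName branch for pivoted column naming.

    value          -- the pivot value (e.g. "A" or "B")
    num_aggregates -- number of aggregate expressions in the pivot
    suffix         -- suffix Spark would derive from the aggregate
    """
    if num_aggregates == 1:
        # single aggregation: the pivot value alone names the column
        return value
    # multiple aggregations: pivot value plus aggregate-derived suffix
    return value + "_" + suffix


print(output_name("A", 1, "avg(percent)"))  # A
print(output_name("A", 2, "avg(percent)"))  # A_avg(percent)
```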

I don't know the reason for this design, but the only remaining option is to rename the columns afterwards.

Here is a simplified version of such a renaming helper (Scala):

    def rename(identity: Set[String], suffix: String)(df: DataFrame): DataFrame = {
      val fieldNames = df.schema.fields.map(field => field.name)
      val renamed = fieldNames.map { fieldName =>
        if (identity.contains(fieldName)) fieldName
        else fieldName + suffix
      }
      df.toDF(renamed: _*)
    }

Usage:

rename(Set("date"), "_percent")(pivoted).show()

+----------+---------+---------+
|      date|A_percent|B_percent|
+----------+---------+---------+
|2019-01-01|    0.025|     0.05|
|2019-01-02|     0.11|     0.06|
+----------+---------+---------+
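Since the question is about PySpark, a rough Python port of the same idea might look like this (a sketch; `suffixed_names` is a helper name I made up, and the final `toDF(*renamed)` call assumes a live pivoted DataFrame):

```python
def suffixed_names(columns, identity, suffix):
    """Return new column names: keep names in `identity`, suffix the rest."""
    return [c if c in identity else c + suffix for c in columns]


def rename(identity, suffix):
    """Curried helper mirroring the Scala version: rename(...)(df)."""
    def apply(df):
        # DataFrame.toDF(*cols) returns a new DataFrame with renamed columns
        return df.toDF(*suffixed_names(df.columns, identity, suffix))
    return apply


# usage against a pivoted DataFrame:
# rename({"date"}, "_percent")(pivoted).show()
print(suffixed_names(["date", "A", "B"], {"date"}, "_percent"))
# ['date', 'A_percent', 'B_percent']
```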
