Skip to content Skip to sidebar Skip to footer

Pandas Groupby Result Using Different Combinations Of Boolean Array As Keys

I tried to understand the groupby using boolean array as key, here is the test code, a = pd.DataFrame([[True,False,False],[False,True,False]], columns=['A','B','C']) print(a)

Solution 1:

This is because the grouping parameter you are passing is not able to slice the dataframe into the required parts. So it simply copies the whole dataframe and 'zips' it to the grouping array you are passing. An example -

a = pd.DataFrame([[True,False,False],[False,True,False]], columns=['A','B','C'])

c = a.groupby([True,False])

print('length of grouper object:',len(c))
print(' ')
print(list(c)[0])
print(' ')
print(list(c)[1])
length of grouper object: 2
 
(False,        A     B      C
1FalseTrueFalse)
 
(True,       A      B      C
0TrueFalseFalse)

Notice that with the grouping array [False, True], its simply zipping the whole dataframe a with each. If you had instead passed something that it could find in the axis 1, it would consider the items in that series for grouping the dataset.

Another way to look at it is that the grouping parameter needs to be a series with a name and this name should be found in the columns of the data frame. If you pass an array [True, False], its basically an interpreted as a nameless series and therefore is unable to get its keys to do the splitting.

A good imagery to follow is the following if you want to understand how groupby works -

enter image description here

The split happens on the key provided, and the key needs to be referenceable in the index / columns, else its unable to split and simply returns unsplit dataframe with each group. This first step is what the grouper does, next is the apply and combine steps which are straightforward. In the tuples that you see above (printed output), the apply function operates on the t[1] element of each tuple, after which it combines it with the t[0] element of each tuple and concatenates vertically.

Solution 2:

Let's break it down

.groubpy().apply(pd.DataFrame) as you use in all variants takes the rows from each group and creates a dataframe, which basically returns self, so the output looks the same, but the way pandas gets there is different in every case

  1. b=a.groupby([False,False]): both rows belong to the same group (group_id False), and are parsed together once to form the same df

  2. c=a.groupby([True,False]): there are two groups with one row each. Apply takes each group and builds two separate DataFrames (one per group). Then concatenates and returs a df identical to original

  3. d=a.groupby([False,True]): same as #2 but now the first row belongs to group False. If you aggregated or applied a different function (other than pandas.DataFrame) you would see the df with True, False as the index (groupby sorts by default) and row 1 would appear as the first row, because it belongs to group True

Post a Comment for "Pandas Groupby Result Using Different Combinations Of Boolean Array As Keys"