
How To Sum In Pandas By Unique Index In Several Columns?

I have a pandas DataFrame which details online activities in terms of 'clicks' during a user session. There are as many as 50,000 unique users, and the dataframe has around 1.5 mi

Solution 1:

IIUC you can use groupby, sum and reset_index:

print(df)
   User_ID Registration    Session  clicks
0  2349876   2012-02-22 2014-04-24       2
1  1987293   2011-02-01 2013-05-03       1
2  2234214   2012-07-22 2014-01-22       7
3  9874452   2010-12-22 2014-08-22       2

print(df.groupby('User_ID')['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2

If the first column, User_ID, is the index:

print(df)
        Registration    Session  clicks
User_ID
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
2234214   2012-07-22 2014-01-22       7
9874452   2010-12-22 2014-08-22       2

print(df.groupby(level=0)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2

Or:

print(df.groupby(df.index)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2
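The three variants above can be checked with a small self-contained sketch. The data here is hypothetical: it reuses the column names from the question, but user 2349876 is given two sessions so that the sum is actually visible.

```python
import pandas as pd

# Hypothetical sample data: user 2349876 appears twice (2 + 7 clicks).
df = pd.DataFrame({
    "User_ID": [2349876, 1987293, 2349876, 9874452],
    "Registration": pd.to_datetime(
        ["2012-02-22", "2011-02-01", "2012-02-22", "2010-12-22"]),
    "Session": pd.to_datetime(
        ["2014-04-24", "2013-05-03", "2014-05-01", "2014-08-22"]),
    "clicks": [2, 1, 7, 2],
})

# Variant 1: User_ID is an ordinary column.
by_column = df.groupby("User_ID")["clicks"].sum().reset_index()

# Variants 2 and 3: User_ID is the index.
indexed = df.set_index("User_ID")
by_level = indexed.groupby(level=0)["clicks"].sum().reset_index()
by_index = indexed.groupby(indexed.index)["clicks"].sum().reset_index()

print(by_column)
```

All three should give the same frame, with one row per User_ID and the groups sorted by key.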

EDIT:

As Alexander pointed out, you need to filter the data before the groupby, so that rows where the Session date is earlier than the Registration date for a User_ID are dropped:

print(df)
   User_ID Registration    Session  clicks
0  2349876   2012-02-22 2014-04-24       2
1  1987293   2011-02-01 2013-05-03       1
2  2234214   2012-07-22 2014-01-22       7
3  9874452   2010-12-22 2014-08-22       2

print(df[df.Session >= df.Registration].groupby('User_ID')['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2

I changed the third row of the data to make a better sample:

print(df)
        Registration    Session  clicks
User_ID
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
2234214   2012-07-22 2012-01-22       7
9874452   2010-12-22 2014-08-22       2

print(df.Session >= df.Registration)
User_ID
2349876     True
1987293     True
2234214    False
9874452     True
dtype: bool

print(df[df.Session >= df.Registration])
        Registration    Session  clicks
User_ID
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
9874452   2010-12-22 2014-08-22       2

df1 = df[df.Session >= df.Registration]
print(df1.groupby(df1.index)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2349876       2
2  9874452       2
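As a runnable sketch of the filter-then-group step, the snippet below rebuilds the modified sample, with User_ID kept as a column. It assumes the dates may arrive as strings and parses them with pd.to_datetime first, so that the comparison is chronological; whether the original data needs that step is not stated in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "User_ID": [2349876, 1987293, 2234214, 9874452],
    "Registration": ["2012-02-22", "2011-02-01", "2012-07-22", "2010-12-22"],
    "Session": ["2014-04-24", "2013-05-03", "2012-01-22", "2014-08-22"],
    "clicks": [2, 1, 7, 2],
})

# Assumption: dates are strings on input; convert them to datetime64.
for col in ["Registration", "Session"]:
    df[col] = pd.to_datetime(df[col])

# Keep only sessions on or after the registration date, then sum per user.
valid = df[df["Session"] >= df["Registration"]]
totals = valid.groupby("User_ID")["clicks"].sum().reset_index()

print(totals)
```

User 2234214 drops out of the result because that user's only session predates the registration date.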

Solution 2:

The first thing to do is filter out sessions dated before the registration date, then group on the User_ID and sum.

# Dict-based renaming in agg() was removed in pandas 1.0, so sum and then name the column.
gb = (df[df.Session >= df.Registration]
      .groupby('User_ID')
      .clicks.sum()
      .to_frame('Total_Clicks'))

>>> gb
         Total_Clicks
User_ID              
1987293             1
2234214             7
2349876             2
9874452             2

For the use case you mentioned, I believe this is scalable. It always depends, of course, on your available memory.
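On recent pandas versions (0.25 or later), the same result can also be written with named aggregation. This is only a sketch: the Sessions column is a hypothetical extra, added to show that several output columns can be produced in one pass, and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "User_ID": [2349876, 1987293, 2234214, 9874452],
    "Registration": pd.to_datetime(
        ["2012-02-22", "2011-02-01", "2012-07-22", "2010-12-22"]),
    "Session": pd.to_datetime(
        ["2014-04-24", "2013-05-03", "2014-01-22", "2014-08-22"]),
    "clicks": [2, 1, 7, 2],
})

valid = df[df["Session"] >= df["Registration"]]

# Named aggregation: output_column=(input_column, aggregation).
gb = valid.groupby("User_ID").agg(
    Total_Clicks=("clicks", "sum"),
    Sessions=("Session", "count"),  # hypothetical extra column
)

print(gb)
```

The result is indexed by User_ID; append .reset_index() if User_ID should be a regular column.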

Solution 3:

Suppose your DataFrame is named df; then do the following:

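# as_index=False keeps User_ID as a column; selecting it after a plain groupby().sum() raises a KeyError.
df.groupby('User_ID', as_index=False)['clicks'].sum()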
