How To Sum In Pandas By Unique Index In Several Columns?
I have a pandas DataFrame which details online activities in terms of 'clicks' during a user session. There are as many as 50,000 unique users, and the dataframe has around 1.5 mi
Solution 1:
IIUC you can use groupby, sum and reset_index:
print(df)
   User_ID Registration     Session  clicks
0  2349876   2012-02-22  2014-04-24       2
1  1987293   2011-02-01  2013-05-03       1
2  2234214   2012-07-22  2014-01-22       7
3  9874452   2010-12-22  2014-08-22       2

print(df.groupby('User_ID')['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2
If User_ID is the index rather than a column:
print(df)
        Registration     Session  clicks
User_ID
2349876   2012-02-22  2014-04-24       2
1987293   2011-02-01  2013-05-03       1
2234214   2012-07-22  2014-01-22       7
9874452   2010-12-22  2014-08-22       2

print(df.groupby(level=0)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2
Or:
print(df.groupby(df.index)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2
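The three variants above can be tried end to end with a small sketch like the one below. It rebuilds the sample data from this answer and adds one extra row for user 2349876, which is a made-up row (not in the original data) included only so that at least one user has more than one session to sum.

```python
import pandas as pd

# Sample data from the answer, plus one hypothetical extra session
# for user 2349876 (last row) so the sum has something to add up.
df = pd.DataFrame({
    "User_ID": [2349876, 1987293, 2234214, 9874452, 2349876],
    "Registration": ["2012-02-22", "2011-02-01", "2012-07-22",
                     "2010-12-22", "2012-02-22"],
    "Session": ["2014-04-24", "2013-05-03", "2014-01-22",
                "2014-08-22", "2014-05-01"],
    "clicks": [2, 1, 7, 2, 3],
})

# Case 1: User_ID is an ordinary column.
by_column = df.groupby("User_ID")["clicks"].sum().reset_index()

# Case 2: User_ID is the index; group by index level or by the index itself.
indexed = df.set_index("User_ID")
by_level = indexed.groupby(level=0)["clicks"].sum().reset_index()
by_index = indexed.groupby(indexed.index)["clicks"].sum().reset_index()

print(by_column)
```

All three results should hold the same per-user totals, sorted by User_ID, because groupby sorts the group keys by default.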
EDIT:
As Alexander pointed out, you need to filter the data before the groupby if any Session date is earlier than the Registration date for a User_ID:
print(df)
   User_ID Registration     Session  clicks
0  2349876   2012-02-22  2014-04-24       2
1  1987293   2011-02-01  2013-05-03       1
2  2234214   2012-07-22  2014-01-22       7
3  9874452   2010-12-22  2014-08-22       2

print(df[df.Session >= df.Registration].groupby('User_ID')['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2
I changed the third row of the data to give a better sample:
print(df)
        Registration     Session  clicks
User_ID
2349876   2012-02-22  2014-04-24       2
1987293   2011-02-01  2013-05-03       1
2234214   2012-07-22  2012-01-22       7
9874452   2010-12-22  2014-08-22       2

print(df.Session >= df.Registration)
User_ID
2349876     True
1987293     True
2234214    False
9874452     True
dtype: bool

print(df[df.Session >= df.Registration])
        Registration     Session  clicks
User_ID
2349876   2012-02-22  2014-04-24       2
1987293   2011-02-01  2013-05-03       1
9874452   2010-12-22  2014-08-22       2

df1 = df[df.Session >= df.Registration]
print(df1.groupby(df1.index)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2349876       2
2  9874452       2
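A self-contained sketch of the filter-then-group step, using the modified sample above. It assumes the date columns may arrive as strings and converts them with pd.to_datetime first; that conversion step is my addition, not something the answer states.

```python
import pandas as pd

# Modified sample: the third row's Session precedes its Registration.
df = pd.DataFrame({
    "User_ID": [2349876, 1987293, 2234214, 9874452],
    "Registration": ["2012-02-22", "2011-02-01", "2012-07-22", "2010-12-22"],
    "Session": ["2014-04-24", "2013-05-03", "2012-01-22", "2014-08-22"],
    "clicks": [2, 1, 7, 2],
})

# Assumption: dates are strings on input; convert so the comparison
# is chronological rather than lexicographic.
for col in ["Registration", "Session"]:
    df[col] = pd.to_datetime(df[col])

# Boolean mask: True where the session is on or after registration.
valid = df["Session"] >= df["Registration"]

# Drop the invalid rows first, then sum clicks per user.
totals = df[valid].groupby("User_ID")["clicks"].sum().reset_index()
print(totals)
```

Note that a user whose sessions are all filtered out (2234214 here) does not appear in the result at all.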
Solution 2:
The first thing to do is filter out sessions whose dates precede the registration date, then group on the User_ID and sum.
# Keep only sessions on or after registration, then sum clicks per user.
gb = (df[df.Session >= df.Registration]
      .groupby('User_ID')
      .clicks.agg(Total_Clicks='sum'))
>>> gb
         Total_Clicks
User_ID
1987293             1
2234214             7
2349876             2
9874452             2
For the use case you mentioned, I believe this is scalable. It always depends, of course, on your available memory.
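One rough way to check this on your own machine is to run the same filter and groupby on synthetic data and look at the frame's memory footprint. Everything in the sketch below is an assumption for illustration: the row and user counts are loosely based on the sizes mentioned in the question, and the random dates and click counts are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Assumed sizes, loosely based on the question; adjust to your data.
n_users, n_rows = 50_000, 1_500_000
base = pd.Timestamp("2010-01-01")

# One random registration offset (in days) per user, looked up per row.
reg_days = rng.integers(0, 1000, n_users)
user_ids = rng.integers(0, n_users, n_rows)

df = pd.DataFrame({
    "User_ID": user_ids,
    "Registration": base + pd.to_timedelta(reg_days[user_ids], unit="D"),
    "Session": base + pd.to_timedelta(rng.integers(0, 2000, n_rows), unit="D"),
    "clicks": rng.integers(1, 10, n_rows),
})

# Same approach as above: filter first, then group and sum.
mask = df["Session"] >= df["Registration"]
totals = df[mask].groupby("User_ID")["clicks"].sum()

# Approximate in-memory size of the input frame, in megabytes.
print(df.memory_usage(deep=True).sum() / 1e6, "MB")
print(len(totals), "users with at least one valid session")
```

The boolean filter creates a temporary copy of the surviving rows, so peak memory is somewhat higher than the size of the input frame alone.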
Solution 3:
Suppose your DataFrame is named df; then do the following:
df.groupby('User_ID', as_index=False)['clicks'].sum()