Faster Ways To Sort And Append Large Dataframe
I’m trying to sort some sales data by day of sale and product ID, and then compute some statistics with pandas. Is there an efficient way to do this?
Solution 1:
How about groupby? It handles the iteration much more efficiently than explicit loops, and in much shorter, more readable code. You would group on daySold and productID. This is obviously mock data, but you would want to turn your daySold into a datetime object first so you can easily group on it - I just kept the day, but you could keep the time if needed:
df.daySold = pd.to_datetime(df.daySold.apply(lambda x: x[:9]), format="%d%b%Y")
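A minimal, self-contained sketch of that conversion. The column names come from the question, but the sample rows here are assumptions; daySold is assumed to hold strings such as "31Jan2017 09:00", so the first nine characters are the date:

```python
import pandas as pd

# Hypothetical mock data; column names match the answer, values are assumed.
df = pd.DataFrame({
    "daySold": ["31Jan2017 09:00", "31Jan2017 17:30", "13Feb2017 11:15"],
    "productID": ["Sd23454", "Sd23454", "Rt4564"],
    "quantitySold": [654, 200, 1544],
    "Price": [39, 39, 45],
})

# Keep only the first 9 characters ("31Jan2017") and parse them as dates.
df.daySold = pd.to_datetime(df.daySold.apply(lambda x: x[:9]), format="%d%b%Y")
print(df.daySold.dtype)  # datetime64[ns]
```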
Then it is just a one-liner. With the groupby
object you can pass a number of different aggregation calls.
df.groupby(['daySold', 'productID']).agg({'quantitySold': [sum, np.std], 'Price': [sum, np.std]})
                     quantitySold               Price
                              sum          std    sum  std
daySold    productID
2017-01-31 Sd23454            854   321.026479     78  0.0
2017-02-13 Rt4564            1544          NaN     45  NaN
2017-02-18 Fdgd4             5975  3800.698949    250  0.0
2017-03-18 Fdgd4             4487          NaN    125  NaN
2017-08-30 Sd23454           7895          NaN     39  NaN
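Put together, a self-contained version of the aggregation might look like this. The sample rows are assumptions (two sales of one product on the same day, so the std column is populated), and the string aliases "sum"/"std" are used in place of the bare sum/np.std callables, which newer pandas versions deprecate in agg:

```python
import pandas as pd

# Assumed sample data: two sales of Sd23454 on the same day, one of Rt4564.
df = pd.DataFrame({
    "daySold": pd.to_datetime(["2017-01-31", "2017-01-31", "2017-02-13"]),
    "productID": ["Sd23454", "Sd23454", "Rt4564"],
    "quantitySold": [654, 200, 1544],
    "Price": [39, 39, 45],
})

# Group on day and product, then aggregate each column two ways at once.
stats = df.groupby(["daySold", "productID"]).agg(
    {"quantitySold": ["sum", "std"], "Price": ["sum", "std"]}
)
print(stats)
```

Each group with a single row gets NaN for std, since pandas computes the sample standard deviation (ddof=1) by default.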
EDIT:
You can use the groupby object to apply all manner of functions, off the shelf ones and ones you define yourself.
So you could do a dot product, which uses two columns of the dataframe, like so:
def dotter(df):
    return np.sum(df.quantitySold * df.Price)
    ## or if you want to use numpy--may be faster for large datasets:
    # return np.dot(df.quantitySold, df.Price)
Call it by using the apply method of the groupby object:
df.groupby(['daySold','productID']).apply(dotter)

daySold     productID
2017-01-31  Sd23454      33306
2017-02-13  Rt4564       69480
2017-02-18  Fdgd4       746875
2017-03-18  Fdgd4       560875
2017-08-30  Sd23454     307905
dtype: int64
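When apply itself becomes the bottleneck on large data, a fully vectorized alternative is to build the per-row product once and let groupby sum it, avoiding a Python function call per group. A sketch, reusing the same assumed sample data as above:

```python
import pandas as pd

# Assumed sample data, matching the mock rows used earlier.
df = pd.DataFrame({
    "daySold": pd.to_datetime(["2017-01-31", "2017-01-31", "2017-02-13"]),
    "productID": ["Sd23454", "Sd23454", "Rt4564"],
    "quantitySold": [654, 200, 1544],
    "Price": [39, 39, 45],
})

# Vectorized alternative to apply(dotter): compute per-row revenue once,
# then sum it within each (daySold, productID) group.
revenue = (df.quantitySold * df.Price).rename("revenue")
totals = revenue.groupby([df.daySold, df.productID]).sum()
print(totals)
```

This produces the same per-group dot products as apply(dotter), but the multiplication and the sum both run as single vectorized operations over the whole frame.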