Faster Ways To Sort And Append Large Dataframe
I’m trying to sort some sales data by day of sale and product ID, and then compute some statistics with pandas. Is there an efficient way to do this?
Solution 1:
How about groupby? It handles the iteration far more efficiently than explicit loops, in much shorter and more readable code. You would group on daySold and productID. This is obviously mock data, but you would want to turn your daySold into a datetime object first so you can easily group on it. I kept only the day, but you could keep the time if needed:
df.daySold = pd.to_datetime(df.daySold.apply(lambda x: x[:9]), format="%d%b%Y")
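To make the conversion concrete, here is a minimal self-contained sketch. The mock rows (timestamps like "31Jan2017 09:15") are assumptions for illustration; the real data may look different. It uses the `.str[:9]` accessor, which does the same slicing as the `apply(lambda ...)` above:

```python
import pandas as pd

# Hypothetical mock data; the real frame presumably has more rows and columns.
df = pd.DataFrame({
    "daySold": ["31Jan2017 09:15", "13Feb2017 14:02"],
    "productID": ["Sd23454", "Rt4564"],
    "quantitySold": [854, 1544],
    "Price": [39.0, 45.0],
})

# Keep only the 9-character date part ("31Jan2017") and parse it.
df.daySold = pd.to_datetime(df.daySold.str[:9], format="%d%b%Y")
print(df.daySold.dtype)  # datetime64[ns]
```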
Then it is just a one-liner. With the groupby object you can pass a number of different aggregation calls.
df.groupby(['daySold', 'productID']).agg({'quantitySold': [sum, np.std], 'Price': [sum, np.std]})
                     quantitySold               Price     
                              sum          std    sum  std
daySold    productID                                      
2017-01-31 Sd23454            854   321.026479     78  0.0
2017-02-13 Rt4564            1544          NaN     45  NaN
2017-02-18 Fdgd4             5975  3800.698949    250  0.0
2017-03-18 Fdgd4             4487          NaN    125  NaN
2017-08-30 Sd23454           7895          NaN     39  NaN

EDIT:
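As a side note on the aggregation above: newer pandas versions (0.25+) also support named aggregation, which produces flat, readable column names instead of a two-level header. A minimal sketch with assumed mock data (the output names qty_sum, qty_std, etc. are my own choice):

```python
import pandas as pd

# Hypothetical mock data mirroring the columns used above.
df = pd.DataFrame({
    "daySold": pd.to_datetime(["2017-01-31", "2017-01-31", "2017-02-13"]),
    "productID": ["Sd23454", "Sd23454", "Rt4564"],
    "quantitySold": [100, 200, 50],
    "Price": [10.0, 12.0, 45.0],
})

# Named aggregation: output_name=(input_column, aggregation_function).
stats = df.groupby(["daySold", "productID"]).agg(
    qty_sum=("quantitySold", "sum"),
    qty_std=("quantitySold", "std"),
    price_sum=("Price", "sum"),
    price_std=("Price", "std"),
)
print(stats)
```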
You can use the groupby object to apply all manner of functions, off the shelf ones and ones you define yourself.
So you could do a dot product, requiring two columns / arrays of a dataframe, like so:
def dotter(df):
    return np.sum(df.quantitySold * df.Price)
    # or, using numpy directly -- may be faster for large datasets:
    # return np.dot(df.quantitySold, df.Price)
Call it with the apply method of the groupby object:
df.groupby(['daySold', 'productID']).apply(dotter)

daySold     productID
2017-01-31  Sd23454     33306
2017-02-13  Rt4564      69480
2017-02-18  Fdgd4      746875
2017-03-18  Fdgd4      560875
2017-08-30  Sd23454    307905
dtype: int64
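For large frames there is also a fully vectorized alternative to apply: multiply the two columns first, then group-sum the product. This gives the same per-group dot product without calling a Python function once per group. A sketch, again with assumed mock data:

```python
import pandas as pd

# Hypothetical mock data with the same column names as above.
df = pd.DataFrame({
    "daySold": pd.to_datetime(["2017-01-31", "2017-01-31", "2017-02-13"]),
    "productID": ["Sd23454", "Sd23454", "Rt4564"],
    "quantitySold": [100, 200, 50],
    "Price": [10.0, 12.0, 45.0],
})

# Elementwise product, then a grouped sum: equivalent to the per-group
# dot product, but avoids per-group Python-level function calls.
revenue = (df.quantitySold * df.Price).groupby([df.daySold, df.productID]).sum()
print(revenue)
```

Whether this beats apply in practice depends on group sizes, but avoiding the per-group Python call is generally the faster path in pandas.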