
Faster Ways To Sort And Append Large Dataframe

I’m trying to sort some sales data by day of sale and product ID, and then compute some statistics with pandas. Is there an efficient way to do this?

Solution 1:

How about groupby? It handles the iteration far more efficiently than explicit loops, with much shorter and more readable code. You would group on daySold and productID. The data below is mock data, but you would want to turn your daySold column into a datetime object first so you can easily group on it - I kept only the date, but you could keep the time if needed:

df.daySold = pd.to_datetime(df.daySold.apply(lambda x: x[:9]), format="%d%b%Y")
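To make that line concrete, here is a minimal self-contained sketch; the column values are invented mock data, chosen so the timestamp strings follow the "%d%b%Y" pattern that the slice `x[:9]` assumes:

```python
import pandas as pd

# Hypothetical mock data; the first 9 characters of each daySold
# string are the date in "%d%b%Y" form (e.g. "31Jan2017").
df = pd.DataFrame({
    "daySold": ["31Jan2017 10:15", "13Feb2017 09:00"],
    "productID": ["Sd23454", "Rt4564"],
    "quantitySold": [854, 1544],
    "Price": [78, 45],
})

# Keep only the date part and parse it into a datetime64 column.
df.daySold = pd.to_datetime(df.daySold.apply(lambda x: x[:9]), format="%d%b%Y")
print(df.daySold.dtype)  # datetime64[ns]
```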

Then it is just a one-liner. With the groupby object you can pass a number of different aggregation calls.

df.groupby(['daySold', 'productID']).agg({'quantitySold': [sum, np.std], 'Price': [sum, np.std]})

                     quantitySold               Price
                              sum          std    sum  std
daySold    productID
2017-01-31 Sd23454            854   321.026479     78  0.0
2017-02-13 Rt4564            1544          NaN     45  NaN
2017-02-18 Fdgd4             5975  3800.698949    250  0.0
2017-03-18 Fdgd4             4487          NaN    125  NaN
2017-08-30 Sd23454           7895          NaN     39  NaN
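On pandas 0.25 or later, the same statistics can also be written with named aggregation, which avoids the nested-column MultiIndex; this is a sketch with invented mock data, not part of the original answer:

```python
import pandas as pd

# Hypothetical mock data with two rows for one (day, product) pair.
df = pd.DataFrame({
    "daySold": pd.to_datetime(["2017-01-31", "2017-01-31", "2017-02-13"]),
    "productID": ["Sd23454", "Sd23454", "Rt4564"],
    "quantitySold": [400, 454, 1544],
    "Price": [39, 39, 45],
})

# Named aggregation: each output column gets a flat, explicit name.
stats = df.groupby(["daySold", "productID"]).agg(
    qty_sum=("quantitySold", "sum"),
    qty_std=("quantitySold", "std"),
    price_sum=("Price", "sum"),
    price_std=("Price", "std"),
)
print(stats)
```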

EDIT:

You can use the groupby object to apply all manner of functions, off the shelf ones and ones you define yourself.

So you could do a dot product, requiring two columns / arrays of a dataframe, like so:

def dotter(df):
    return np.sum(df.quantitySold * df.Price)
    # or, if you want to use numpy -- may be faster for large datasets:
    # return np.dot(df.quantitySold, df.Price)

Call it with the apply method of the groupby object:

df.groupby(['daySold','productID']).apply(dotter)

daySold     productID
2017-01-31  Sd23454       33306
2017-02-13  Rt4564        69480
2017-02-18  Fdgd4        746875
2017-03-18  Fdgd4        560875
2017-08-30  Sd23454      307905
dtype: int64
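Since the dot product is just a sum of row-wise products, a fully vectorized alternative (my addition, not from the answer, using the same invented mock data as above) is to multiply the columns once and then do a plain group sum, which avoids calling a Python function per group:

```python
import pandas as pd

# Hypothetical mock data.
df = pd.DataFrame({
    "daySold": pd.to_datetime(["2017-01-31", "2017-01-31", "2017-02-13"]),
    "productID": ["Sd23454", "Sd23454", "Rt4564"],
    "quantitySold": [400, 454, 1544],
    "Price": [39, 39, 45],
})

# Row-wise product computed once, then a plain groupby-sum:
# equivalent to np.dot(quantitySold, Price) within each group.
revenue = (df.quantitySold * df.Price).groupby(
    [df.daySold, df.productID]
).sum()
print(revenue)
```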
