
Transferring Values Between Two Columns In A Pandas Data Frame

I have a pandas data frame like this:

     p    q
   0.5  0.5
   0.6  0.4
   0.3  0.7
   0.4  0.6
   0.9  0.1

I want to know how I can transfer the greater value of each row into the p column, and vice versa for the q column (transferring the smaller value into q).

Solution 1:

You could build conditional series with np.where() and then assign them back to the dataframe:

import numpy as np

s1 = np.where(df['p'] < df['q'], df['q'], df['p'])
s2 = np.where(df['p'] > df['q'], df['q'], df['p'])
df['p'] = s1
df['q'] = s2
df
Out[1]: 
     p    q
0  0.5  0.5
1  0.6  0.4
2  0.7  0.3
3  0.6  0.4
4  0.9  0.1

You could also use .where():

s1 = df['p'].where(df['p'] > df['q'], df['q'])
s2 = df['p'].where(df['p'] < df['q'], df['q'])
df['p'] = s1
df['q'] = s2
df
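For reference, here is a self-contained version of the approach above, using the sample data from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'p': [0.5, 0.6, 0.3, 0.4, 0.9],
                   'q': [0.5, 0.4, 0.7, 0.6, 0.1]})

# np.where picks element-wise: the larger value goes into p,
# the smaller into q.
s1 = np.where(df['p'] < df['q'], df['q'], df['p'])
s2 = np.where(df['p'] > df['q'], df['q'], df['p'])
df['p'] = s1
df['q'] = s2
print(df)
```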

I tested the execution times over varying sizes, from 100 rows to 1 million rows, and the answers that require passing axis=1 can be 10,000 times slower:

  1. Erfan's numpy answer looks to be the fastest executing in milliseconds for large datasets
  2. My .where() answer also has great performance that keeps the time to execute in milliseconds (I assume `np.where()` would have a similar outcome).
  3. I thought MHDG7's answer would be the slowest, but it is actually faster than Alexander's answer.
  4. I guess Alexander's answer is slow because it requires passing axis=1. Since both MHDG7's and Alexander's answers operate row-wise (with axis=1), they can slow things down tremendously for large dataframes.

As you can see, a million-row dataframe was taking minutes to execute. And if you had a 10-million to 100-million row dataframe, these one-liners could take hours to execute.


from timeit import timeit
import numpy as np
import pandas as pd

# sample dataframe from the question
df = pd.DataFrame({'p': [0.5, 0.6, 0.3, 0.4, 0.9],
                   'q': [0.5, 0.4, 0.7, 0.6, 0.1]})

def df_where(df):
    s1 = df['p'].where(df['p'] > df['q'], df['q'])
    s2 = df['p'].where(df['p'] < df['q'], df['q'])
    df['p'] = s1
    df['q'] = s2
    return df


def agg_maxmin(df):
    df[['p', 'q']] = df[['p', 'q']].agg([max, min], axis=1)
    return df


def np_flip(df):
    df = pd.DataFrame(np.flip(np.sort(df), axis=1), columns=df.columns)
    return df


def lambda_x(df):
    df = df.apply(lambda x: [x['p'], x['q']] if x['p'] > x['q'] else [x['q'], x['p']],
                  axis=1, result_type='expand')
    return df


res = pd.DataFrame(
    index=[20, 200, 2000, 20000, 200000],
    columns='df_where agg_maxmin np_flip lambda_x'.split(),
    dtype=float
)

for i in res.index:
    d = pd.concat([df]*i)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        print(stmt, d.shape)
        res.at[i, j] = timeit(stmt, setp, number=1)

res.plot(loglog=True);

[Plot: log-log execution time vs. number of rows for each function]

Solution 2:

Use numpy.sort to sort over the horizontal axis ascending, then flip the arrays over axis=1:

df = pd.DataFrame(np.flip(np.sort(df), axis=1), columns=df.columns)
     p    q
0  0.5  0.5
1  0.6  0.4
2  0.7  0.3
3  0.6  0.4
4  0.9  0.1
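A self-contained sketch of this one-liner, using the question's sample data. Note it assumes every column of the dataframe is numeric and participates in the sort, since np.sort operates on the whole underlying array:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'p': [0.5, 0.6, 0.3, 0.4, 0.9],
                   'q': [0.5, 0.4, 0.7, 0.6, 0.1]})

# np.sort sorts each row ascending; np.flip reverses it to descending,
# so the larger value of each row lands in the first column (p).
df = pd.DataFrame(np.flip(np.sort(df), axis=1), columns=df.columns)
print(df)
```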

Solution 3:

Use agg, passing a list of functions (max and min) and specifying axis=1 so those functions are applied row-wise:

df[['p', 'q']] = df[['p', 'q']].agg([max, min], axis=1)

>>> df
     p    q
0  0.5  0.5
1  0.6  0.4
2  0.7  0.3
3  0.6  0.4
4  0.9  0.1

Simple solutions are not always the most performant (e.g. the one above). The following solution is significantly faster. It masks the dataframe for where column p is less than column q, and then swaps the values.

mask = df['p'].lt(df['q'])
df.loc[mask, ['p', 'q']] = df.loc[mask, ['q', 'p']].to_numpy()
>>> df
     p    q
0  0.5  0.5
1  0.6  0.4
2  0.7  0.3
3  0.6  0.4
4  0.9  0.1
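A self-contained version of the mask-and-swap approach above. The .to_numpy() call matters: without it, pandas would align the ['q', 'p'] selection back to ['p', 'q'] by column label during assignment and the swap would be a no-op:

```python
import pandas as pd

df = pd.DataFrame({'p': [0.5, 0.6, 0.3, 0.4, 0.9],
                   'q': [0.5, 0.4, 0.7, 0.6, 0.1]})

# Rows where p < q need their values swapped.
mask = df['p'].lt(df['q'])

# .to_numpy() strips the column labels, so the assignment is positional
# rather than label-aligned, and the swap actually happens.
df.loc[mask, ['p', 'q']] = df.loc[mask, ['q', 'p']].to_numpy()
print(df)
```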

Solution 4:

You can use the apply function:

df[['p','q']] = df.apply(lambda x: [x['p'],x['q']] if x['p']>x['q'] else [x['q'],x['p']],axis=1,result_type='expand' )
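A runnable sketch of this apply-based version with the question's sample data. As noted in the timings above, this row-wise axis=1 approach is the slowest of the answers here, so it is best reserved for small dataframes:

```python
import pandas as pd

df = pd.DataFrame({'p': [0.5, 0.6, 0.3, 0.4, 0.9],
                   'q': [0.5, 0.4, 0.7, 0.6, 0.1]})

# Row-wise apply: return each pair in (larger, smaller) order and let
# result_type='expand' spread the pair back across the two columns.
df[['p', 'q']] = df.apply(
    lambda x: [x['p'], x['q']] if x['p'] > x['q'] else [x['q'], x['p']],
    axis=1, result_type='expand')
print(df)
```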
