
How To Remove Strings Present In A List From A Column In Pandas

I have a dataframe df:

import pandas as pd
df = pd.DataFrame(
    {
        'ID': [1, 2, 3, 4, 5],
        'name': [
            'Hello Kitty', 'Hello Puppy',

Solution 1:

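I think you need str.replace if you also want to remove substrings. Pass regex=True, because the joined string is a regular expression and pandas 2.0+ treats the pattern as a literal string by default:

df['name'] = df['name'].str.replace('|'.join(To_remove_lst), '', regex=True)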

If the list may contain regex metacharacters, escape them first:

import re
df['name'] = df['name'].str.replace('|'.join(map(re.escape, To_remove_lst)), '', regex=True)

print (df)
   ID            name
0   1           Kitty
1   2           Puppy
2   3     is  example
3   4   stackoverflow
4   5           World

But if you want to remove only whole words, use a nested list comprehension:

df['name'] = [' '.join([y for y in x.split() if y not in To_remove_lst]) for x in df['name']]
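
To make the difference between the two approaches concrete, here is a minimal sketch in plain Python (no pandas needed). The sample strings and the contents of to_remove are made up for illustration; they are not the data from the question.

```python
import re

# Hypothetical sample data, not the data from the question.
names = ["Hello Kitty", "This is an example"]
to_remove = ["Hello", "is"]

# Substring removal: every occurrence is deleted, even inside longer words.
pattern = re.compile("|".join(map(re.escape, to_remove)))
substring_result = [pattern.sub("", text) for text in names]

# Whole-word removal: only tokens that exactly equal a listed word are dropped.
word_result = [
    " ".join(word for word in text.split() if word not in to_remove)
    for text in names
]

print(substring_result)  # [' Kitty', 'Th  an example']
print(word_result)       # ['Kitty', 'This an example']
```

Note that the substring version turns "This" into "Th", while the word-based version leaves it alone. The word-based version also collapses runs of whitespace, because split() and ' '.join() rebuild each string.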

Solution 2:

I'd recommend re.sub in a list comprehension for speed.

import re
p = re.compile('|'.join(map(re.escape, To_remove_lst)))
df['name'] = [p.sub('', text) for text in df['name']]

print (df)
   ID            name
0   1           Kitty
1   2           Puppy
2   3     is  example
3   4   stackoverflow
4   5           World

List comprehensions are heavily optimized in CPython and add very little overhead per element. For the time being, I recommend them over the pandas str methods when working with string and regex data, because the str API is comparatively slow.

The use of map(re.escape, To_remove_lst) is to escape any possible regex metacharacters which are meant to be treated literally during replacement.
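
As a small illustration of why the escaping matters, the sketch below compares an unescaped and an escaped pattern using only the standard re module. The list entries and the sample text are hypothetical, chosen because they contain the metacharacters ".", "(" and ")".

```python
import re

# Hypothetical entries containing regex metacharacters.
to_remove = ["v1.0", "(beta)"]
text = "v1.0 (beta) and v100"

# Unescaped: "." matches any character and "(...)" is a group,
# so "v100" is removed too and the literal parentheses are left behind.
unescaped = re.sub("|".join(to_remove), "", text)

# Escaped: every entry matches only itself, character for character.
escaped = re.sub("|".join(map(re.escape, to_remove)), "", text)

print(repr(unescaped))  # ' () and '
print(repr(escaped))    # '  and v100'
```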

The pattern is precompiled before calling p.sub to avoid the overhead of compiling it on each iteration.

I've let it slide here, but please use PEP 8-compliant variable names such as "to_remove_lst" (lower snake case).


Timings

df = pd.concat([df] * 10000)
%timeit df['name'].str.replace('|'.join(To_remove_lst), '', regex=True)
%timeit [p.sub('', text) for text in df['name']]

100 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
60 ms ± 3.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Solution 3:

You can loop over the list and call str.replace once for each element:

for WORD in To_remove_lst:
    df['name'] = df['name'].str.replace(WORD, '')

Output:

   ID            name
0   1           Kitty
1   2           Puppy
2   3     is  example
3   4   stackoverflow
4   5           World
