
Split Pandas Dataframe By String

I'm new to using Pandas dataframes. I have data in a .csv like this: foo, 1234, bar, 4567 stuff, 7894 New Entry,, morestuff,1345 I'm reading it into the dataframe with df = pd.r

Solution 1:

1) One approach is to split on the fly: read the file line by line and start a new dataframe each time a NewEntry line is encountered.

2) The other approach, if the dataframe already exists, is to find the NewEntry rows and slice the dataframe into several dataframes stored in a dict, dff = {}

df
        col1  col2
0        foo  1234
1        bar  4567
2      stuff  7894
3   NewEntry   NaN
4  morestuff  1345

Find the positions of the NewEntry rows, then add -1 at the front and len(df.index) at the end as boundaries for the first and last chunks

rows = [-1] + np.where(df['col1']=='NewEntry')[0].tolist() + [len(df.index)]
[-1, 3L, 5]

Create the dict of dataframes

dff = {}                                                                            
for i, r in enumerate(rows[:-1]):                                                   
    dff[i] = df[r+1: rows[i+1]]                                                     
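
The two steps above can be combined into one self-contained sketch. The sample frame is rebuilt by hand here, and `iloc` is used so the slices are explicitly positional; both are assumptions made for illustration rather than part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt by hand to mirror the one shown above (an assumption).
df = pd.DataFrame({
    "col1": ["foo", "bar", "stuff", "NewEntry", "morestuff"],
    "col2": [1234, 4567, 7894, np.nan, 1345],
})

# Positions of the marker rows, padded with -1 and len(df) so the first
# and last chunks each get a start and an end boundary.
is_marker = (df["col1"] == "NewEntry").to_numpy()
rows = [-1] + np.where(is_marker)[0].tolist() + [len(df.index)]

# Slice between consecutive boundaries, skipping the marker row itself.
dff = {}
for i, r in enumerate(rows[:-1]):
    dff[i] = df.iloc[r + 1 : rows[i + 1]]
```

Starting each slice at `r + 1` is what drops the NewEntry row, and the leading -1 makes the first slice start at position 0.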

The result is a dict of dataframes, {0: dataframe1, 1: dataframe2}

dff
{0:     col1  col2
 0    foo  1234
 1    bar  4567
 2  stuff  7894,
 1:         col1  col2
 4  morestuff  1345}

Dataframe 1

dff[0]
    col1  col2
0    foo  1234
1    bar  4567
2  stuff  7894

Dataframe 2

dff[1]              
        col1  col2  
4  morestuff  1345
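
Approach 1 (splitting while reading the file) is not shown above. A minimal sketch of how it might look follows, assuming a two-column file and the marker text NewEntry; the helper name `split_csv_on_marker` and its parameters are hypothetical.

```python
import io
import pandas as pd

def split_csv_on_marker(lines, marker="NewEntry", names=("col1", "col2")):
    """Hypothetical helper: split CSV lines into DataFrames at marker rows."""
    chunks, current = [], []
    for line in lines:
        first_field = line.split(",")[0].strip()
        if first_field == marker:
            # A marker row closes the current chunk and is itself discarded.
            if current:
                chunks.append(current)
            current = []
        elif line.strip():
            # Keep every non-blank data line for the chunk being built.
            current.append(line.strip())
    if current:
        chunks.append(current)
    # Parse each chunk of raw lines into its own DataFrame.
    return [
        pd.read_csv(io.StringIO("\n".join(chunk)), header=None, names=list(names))
        for chunk in chunks
    ]

# In-memory stand-in for the file contents (an assumption for the example).
raw = "foo,1234\nbar,4567\nstuff,7894\nNewEntry,\nmorestuff,1345\n"
frames = split_csv_on_marker(raw.splitlines())
```

With a real file, the open file handle can be passed in place of `raw.splitlines()`, since iterating over it also yields one line at a time.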

Solution 2:

Using your example data, concatenated 3 times (I named the columns 'a', 'b', 'c' for convenience when loading), we first find the indices where 'New Entry' occurs and then produce a list of tuples pairing consecutive positions to mark the beginning and end of each range.

We can then iterate over this list of tuples, slice the original df, and append each slice to a list:

In [22]:
t="""foo,1234,
bar,4567
stuff,7894
New Entry,,
morestuff,1345"""
df = pd.read_csv(io.StringIO(t), header=None, names=['a','b','c'])
df = pd.concat([df]*3, ignore_index=True)
df

Out[22]:
            a     b   c
0         foo  1234 NaN
1         bar  4567 NaN
2       stuff  7894 NaN
3   New Entry   NaN NaN
4   morestuff  1345 NaN
5         foo  1234 NaN
6         bar  4567 NaN
7       stuff  7894 NaN
8   New Entry   NaN NaN
9   morestuff  1345 NaN
10        foo  1234 NaN
11        bar  4567 NaN
12      stuff  7894 NaN
13  New Entry   NaN NaN
14  morestuff  1345 NaN

In [30]:
import itertools
idx = df[df['a'] == 'New Entry'].index
idx_list = [(0, idx[0])]
idx_list = idx_list + list(zip(idx, idx[1:]))
idx_list

Out[30]:
[(0, 3), (3, 8), (8, 13)]

In [31]:
df_list = []
for i in idx_list:
    print(i)
    if i[0] == 0:
        df_list.append(df[i[0]:i[1]])
    else:
        df_list.append(df[i[0]+1:i[1]])
df_list

(0, 3)
(3, 8)
(8, 13)
Out[31]:
[       a     b   c
 0    foo  1234 NaN
 1    bar  4567 NaN
 2  stuff  7894 NaN,
            a     b   c
 4  morestuff  1345 NaN
 5        foo  1234 NaN
 6        bar  4567 NaN
 7      stuff  7894 NaN,
             a     b   c
 9   morestuff  1345 NaN
 10        foo  1234 NaN
 11        bar  4567 NaN
 12      stuff  7894 NaN]
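
Note that the pairs in idx_list stop at the last 'New Entry' row, so any rows after it (index 14 here) do not end up in a chunk. A minimal sketch of one way to include that tail follows, assuming the same sample data and column names; the helper name `split_on_marker` is hypothetical and this is a variation on the answer, not the answer's own code.

```python
import io
import pandas as pd

t = """foo,1234,
bar,4567
stuff,7894
New Entry,,
morestuff,1345"""
df = pd.read_csv(io.StringIO(t), header=None, names=["a", "b", "c"])
df = pd.concat([df] * 3, ignore_index=True)

def split_on_marker(frame, column="a", marker="New Entry"):
    """Hypothetical helper: return the chunks of rows between marker rows."""
    # Positional indices of the marker rows.
    idx = [i for i, hit in enumerate(frame[column] == marker) if hit]
    # Boundaries: one before the first row, each marker, one past the last row.
    bounds = [-1] + idx + [len(frame)]
    chunks = [frame.iloc[lo + 1 : hi] for lo, hi in zip(bounds, bounds[1:])]
    # Drop empty chunks (two markers in a row, or a trailing marker).
    return [chunk for chunk in chunks if not chunk.empty]

df_list = split_on_marker(df)
```

Padding the boundaries with -1 and `len(frame)` removes the need for the special case on the first tuple and picks up the rows after the final marker.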
