Split Pandas Dataframe By String
Solution 1:
1) One approach is to split on the fly: read the file line by line and start a new dataframe whenever a NewEntry line appears.
2) The other way, if the dataframe already exists, is to find the NewEntry rows and slice the dataframe into several smaller ones, stored in a dict dff = {}
df
        col1  col2
0        foo  1234
1        bar  4567
2      stuff  7894
3   NewEntry   NaN
4  morestuff  1345
Find the NewEntry rows, then add [-1] and [len(df.index)] for the boundary conditions
rows = [-1] + np.where(df['col1']=='NewEntry')[0].tolist() + [len(df.index)]
[-1, 3L, 5]
Create the dict of dataframes
dff = {}
for i, r in enumerate(rows[:-1]):
    dff[i] = df[r+1: rows[i+1]]
Dict of dataframes {0: dataframe1, 1: dataframe2}
dff
{0:     col1  col2
 0    foo  1234
 1    bar  4567
 2  stuff  7894,
 1:         col1  col2
 4  morestuff  1345}
Dataframe 1
dff[0]
    col1  col2
0    foo  1234
1    bar  4567
2  stuff  7894
Dataframe 2
dff[1]
        col1  col2
4  morestuff  1345
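Approach 1 above (splitting while reading the file) is not shown in code, so here is a minimal sketch of how it might look. The helper name split_on_marker, the column names, and the sample text are assumptions made for illustration. Note that each resulting dataframe gets a fresh index starting at 0, unlike the slicing approach, which keeps the original row labels.

```python
import io

import pandas as pd


def split_on_marker(lines, marker="NewEntry", names=("col1", "col2")):
    """Hypothetical helper: start a new chunk each time a marker line is seen."""
    chunks, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.split(",")[0] == marker:
            # marker line: close the current chunk and start a new one
            chunks.append(current)
            current = []
        else:
            current.append(line)
    chunks.append(current)  # rows after the last marker
    # parse each non-empty chunk into its own dataframe
    return [
        pd.read_csv(io.StringIO("\n".join(chunk)), header=None, names=list(names))
        for chunk in chunks
        if chunk
    ]


# assumed sample input; in practice `lines` could be an open file object
sample = """foo,1234
bar,4567
stuff,7894
NewEntry,
morestuff,1345"""

frames = split_on_marker(io.StringIO(sample))
```

Here frames[0] holds the rows before the NewEntry line and frames[1] the rows after it.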
Solution 2:
Using your example data, which I concatenated 3 times and loaded with the columns named 'a', 'b', 'c' for convenience, we find the indices where column 'a' is 'NewEntry' and then produce a list of tuples of these positions, stepwise, to mark the beginning and end of each range.
We can then iterate over this list of tuples, slice the original df, and append each slice to a list:
In [22]:
import io
import pandas as pd
t="""foo,1234,
bar,4567
stuff,7894
NewEntry,,
morestuff,1345"""
df = pd.read_csv(io.StringIO(t), header=None, names=['a','b','c'])
df = pd.concat([df]*3, ignore_index=True)
df
Out[22]:
            a     b   c
0         foo  1234 NaN
1         bar  4567 NaN
2       stuff  7894 NaN
3    NewEntry   NaN NaN
4   morestuff  1345 NaN
5         foo  1234 NaN
6         bar  4567 NaN
7       stuff  7894 NaN
8    NewEntry   NaN NaN
9   morestuff  1345 NaN
10        foo  1234 NaN
11        bar  4567 NaN
12      stuff  7894 NaN
13   NewEntry   NaN NaN
14  morestuff  1345 NaN

In [30]:
import itertools
# the marker string must match the data exactly ('NewEntry', no space)
idx = df[df['a'] == 'NewEntry'].index
idx_list = [(0, idx[0])]
idx_list = idx_list + list(zip(idx, idx[1:]))
idx_list
Out[30]:
[(0, 3), (3, 8), (8, 13)]

In [31]:
df_list = []
for i in idx_list:
    print(i)
    if i[0] == 0:
        # first range starts at the top of the frame
        df_list.append(df[i[0]:i[1]])
    else:
        # later ranges start on a marker row, so skip it
        df_list.append(df[i[0]+1:i[1]])
df_list
(0, 3)
(3, 8)
(8, 13)
Out[31]:
[       a     b   c
 0    foo  1234 NaN
 1    bar  4567 NaN
 2  stuff  7894 NaN,
            a     b   c
 4  morestuff  1345 NaN
 5        foo  1234 NaN
 6        bar  4567 NaN
 7      stuff  7894 NaN,
             a     b   c
 9   morestuff  1345 NaN
 10        foo  1234 NaN
 11        bar  4567 NaN
 12      stuff  7894 NaN]
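The list of ranges above stops at the last marker, so row 14 (the rows after the final NewEntry) does not appear in the output. If those trailing rows are wanted too, one possible variant is sketched below. It assumes the default integer index produced by ignore_index=True, so that index labels equal row positions, and the names starts and ends are introduced here only for illustration.

```python
import io

import pandas as pd

t = """foo,1234,
bar,4567
stuff,7894
NewEntry,,
morestuff,1345"""
df = pd.read_csv(io.StringIO(t), header=None, names=["a", "b", "c"])
df = pd.concat([df] * 3, ignore_index=True)

# positions of the marker rows (labels equal positions for the default index)
idx = df.index[df["a"] == "NewEntry"].tolist()

# each segment starts one row past a marker (or at 0)
# and ends at the next marker (or at the end of the frame)
starts = [0] + [i + 1 for i in idx]
ends = idx + [len(df)]

# skip empty segments, e.g. when two markers are adjacent
df_list = [df.iloc[s:e] for s, e in zip(starts, ends) if e > s]
```

With the sample data this gives four dataframes, the last one holding only the final morestuff row.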