
Process Subset Of Data Based On Variable Type In Python

I have the data below, which I store in a CSV file (df_sample.csv). I have the column names in a list called cols_list. df_data_sample:

df_data_sample = pd.DataFrame({

Solution 1:

For numeric columns it is simple:

num_cols = [k for k, v in attribute_dict.items() if v == 'NUM']
print(num_cols)
['ord_m1', 'rev_m1', 'equip_m1', 'oev_m1', 'irev_m1']

# usecols takes the list itself; wrapping it in another pair of
# brackets would pass a nested list, which read_csv rejects
df1 = pd.read_csv('df_seg_sample.csv', usecols=num_cols).fillna(0)

But the first part of the code is a performance problem, especially get_dummies called on 5 million rows:

df_target_attribute = pd.get_dummies(df_column[column], dummy_na=True, prefix=column)

Unfortunately, it is a problem to process get_dummies in chunks: each chunk may contain a different set of category values, so the chunks produce mismatched dummy columns.
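One workaround, if the full set of category values is known ahead of time, is to fix the categories before encoding, so every chunk yields the same dummy columns. A minimal sketch (the column name, chunk size, and category values here are illustrative, not from the original question):

import pandas as pd

# Assumed: the complete set of category values is known up front.
known_categories = ['A', 'B', 'C']

chunks = []
for chunk in pd.read_csv('df_seg_sample.csv', usecols=['cat_col'], chunksize=100_000):
    # Fixing the categories guarantees identical dummy columns per chunk,
    # so the encoded pieces can be concatenated safely.
    cat = pd.Categorical(chunk['cat_col'], categories=known_categories)
    chunks.append(pd.get_dummies(cat, dummy_na=True, prefix='cat_col'))

df_dummies = pd.concat(chunks, ignore_index=True)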

Solution 2:

There are three things I would advise you to do to speed up your computations:

  1. Take a look at pandas' HDF5 capabilities. HDF is a binary file format for fast reading and writing of data to disk (see the sketch after this list).
  2. I would read in bigger chunks (several columns) of your CSV file at once, depending on how much memory you have.
  3. There are many pandas operations you can apply to every column at once. For example, nunique() gives you the number of unique values per column, so you don't need unique().size. With these column-wise operations you can easily filter columns by selecting with a boolean vector. E.g.
df = df.loc[:, df.nunique() > 100]
# keep only the columns with more than 100 unique values
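For points 1 and 2, a minimal sketch of the HDF5 route (this assumes the PyTables package is installed and that dtypes are consistent across chunks; the file name, key, and chunk size are illustrative): convert the CSV once, in chunks, then read back only the columns you need.

import pandas as pd

# One-off conversion: stream the CSV into an HDF5 table in chunks,
# so the whole file never has to fit in memory at once.
with pd.HDFStore('df_seg_sample.h5', mode='w') as store:
    for chunk in pd.read_csv('df_seg_sample.csv', chunksize=500_000):
        store.append('data', chunk, index=False)

# Later reads are fast and can pull a column subset directly.
df1 = pd.read_hdf('df_seg_sample.h5', 'data', columns=num_cols)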

Also, this answer from the author of pandas on large data workflows might be interesting for you.
