How To Keep Track Of Columns After Encoding Categorical Variables?

June 22, 2022

I am wondering how I can keep track of the original columns of a dataset once I perform data preprocessing on it? In the below code df_columns would tell me that column 0 in df_arr

Solution 1:

Can you confirm if future data sets will continue to have the same column names? If I got your question correctly, all that you will need to do is save df_columns from the original data frame and use it to reindex your new dataframe.

new_df_reindexed = new_df[df_columns]
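As a minimal sketch of that idea, the snippet below saves the column list from a training frame and uses it to put a later frame into the same order. The frames and their column names (age, city, income) are made up for illustration; only df_columns, new_df and new_df_reindexed come from the answer above.

```python
import pandas as pd

# Hypothetical training data; column names are illustrative only.
train_df = pd.DataFrame(
    {"age": [25, 32], "city": ["Paris", "Oslo"], "income": [40, 55]}
)

# Save the original column order once, before any preprocessing.
df_columns = train_df.columns.tolist()

# A later data set with the same columns in a different order.
new_df = pd.DataFrame({"income": [61], "city": ["Oslo"], "age": [47]})

# Selecting with the saved list restores the original order.
new_df_reindexed = new_df[df_columns]
print(new_df_reindexed.columns.tolist())
```

Selecting with a list like this assumes every saved column exists in the new frame; a missing name raises a KeyError.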

To answer your other questions, you can one-hot encode your data using get_dummies() from pandas. Set drop_first=True to drop the first level of each encoded categorical column and avoid the dummy variable trap. Also, save the column list of the one-hot-encoded data frame so you can reuse it later.
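A short sketch of that encoding step, assuming a small made-up frame with one categorical column (color) and one numeric column (size). The name train_one_hot_encode_col_list follows the answer; everything else is hypothetical.

```python
import pandas as pd

# Hypothetical training data with one categorical column.
train_df = pd.DataFrame(
    {"color": ["red", "green", "blue", "green"], "size": [1, 2, 3, 4]}
)

# One-hot encode "color"; drop_first=True drops the first level
# in sorted order (here "blue"), leaving color_green and color_red.
train_encoded = pd.get_dummies(train_df, columns=["color"], drop_first=True)

# Save the encoded column list for later use on new data.
train_one_hot_encode_col_list = train_encoded.columns.tolist()
print(train_one_hot_encode_col_list)
```

Each dummy column name carries the original column name as a prefix, so the saved list also records which source column every encoded column came from.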

To ensure that your new / testing / holdout data set has the same column definition as the one used in model training:

First, use get_dummies() to one-hot encode the new data set.
Then use pandas reindex to bring the new dataframe into the same structure as the one used in model training - test_df_reindexed = test_df_onehotencode.reindex(columns=train_one_hot_encode_col_list, fill_value=0). Pass either columns= or axis="columns", not both; pandas raises a TypeError if both are given.
This creates dummy variable columns for categorical values that appear in the training data set but not in the new data set. Without fill_value=0, those columns would be filled with NaN.
The same call also removes any columns in the new data set that are not present in the training data set. Plain selection with test_df_onehotencode[train_one_hot_encode_col_list] only works when every training column already exists in the new data; otherwise it raises a KeyError.
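The steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the answerer's exact code: the data is made up, reindex is called with columns= only, and fill_value=0 is an addition here so that missing dummy columns hold 0 rather than NaN. drop_first is left off on purpose, because on a new data set with different levels it would drop a different level than it did in training.

```python
import pandas as pd

# Hypothetical training data; levels are blue, green, red.
train_df = pd.DataFrame({"color": ["red", "green", "blue"], "size": [1, 2, 3]})
train_encoded = pd.get_dummies(train_df, columns=["color"])
train_one_hot_encode_col_list = train_encoded.columns.tolist()

# Hypothetical new data: "yellow" is unseen, "red" and "blue" are absent.
test_df = pd.DataFrame({"color": ["green", "yellow"], "size": [5, 6]})
test_df_onehotencode = pd.get_dummies(test_df, columns=["color"])

# Align to the training layout: missing dummies are added as 0,
# and columns unknown to training (color_yellow) are dropped.
test_df_reindexed = test_df_onehotencode.reindex(
    columns=train_one_hot_encode_col_list, fill_value=0
)
print(test_df_reindexed.columns.tolist())
```

After the reindex, the new frame has exactly the training columns in the training order, so it can be passed to a model fitted on train_encoded.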

If you follow these steps, you can rely entirely on the saved list of training column names, and will not need to track column positions or categorical value definitions.

I would also advise you to read the following for further reference:
One-hot encoding in pandas - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
Column re-indexing - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html

