How To Keep Track Of Columns After Encoding Categorical Variables?
Solution 1:
Can you confirm whether future data sets will keep the same column names? If I understood your question correctly, all you need to do is save the column list of the original data frame (df_columns) and use it to reindex your new data frame:
new_df_reindexed = new_df[df_columns]
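As a minimal sketch of the line above, assuming the new data carries the same column names in a different order (the frames and column names here are made up for illustration):

```python
import pandas as pd

# Hypothetical original frame; the column names are illustrative only.
df = pd.DataFrame({"age": [25, 32], "city": ["Oslo", "Rome"], "income": [40, 55]})
df_columns = list(df.columns)  # save the column list once

# New data with the same columns, arriving in a different order.
new_df = pd.DataFrame({"income": [61], "city": ["Oslo"], "age": [47]})

# Selecting by the saved list restores the original column order.
new_df_reindexed = new_df[df_columns]
print(list(new_df_reindexed.columns))
```

Plain selection like this raises a KeyError if any saved column is missing from the new data, so it only fits the case where the column names really are unchanged.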
To answer your other questions, you can one-hot encode your data using get_dummies() from pandas. Set drop_first=True to drop the first level of each encoded categorical column and avoid the dummy variable trap. Also, save the column list of the one-hot-encoded data frame.
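A small sketch of that encoding step, assuming a toy training frame with one categorical column (train_df and its values are hypothetical):

```python
import pandas as pd

# Hypothetical training data with one categorical column.
train_df = pd.DataFrame({"size": [1, 2, 3], "color": ["red", "green", "blue"]})

# drop_first=True drops the first level of "color" (here "blue", the first in sorted order).
train_one_hot = pd.get_dummies(train_df, columns=["color"], drop_first=True)

# Save the encoded column list so later data sets can be aligned to it.
train_one_hot_encode_col_list = list(train_one_hot.columns)
print(train_one_hot_encode_col_list)
```

The saved list is an ordinary Python list, so it can be stored next to the trained model with whatever serialization you already use.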
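To ensure that your new / testing / holdout data set has the same column definition as the one used in model training:
- First, use get_dummies() to one-hot-encode the new data set.
- Use pandas reindex to bring the new data frame into the same structure as the one used in model training - df.reindex(columns=train_one_hot_encode_col_list, fill_value=0). Pass either columns= or axis="columns", not both; pandas raises a TypeError when both are given.
- The above creates dummy variable columns for categorical values that are present in the training data set but not in the new data set. With fill_value=0 these columns are filled with zeros; without it they are filled with NaN.
- Finally, select by the saved column list to remove any columns in the new data set that are not present in the training data set - test_df_reindexed = test_df_onehotencode[train_one_hot_encode_col_list]. Note that reindex already drops such columns, and that plain selection raises a KeyError if a listed column is missing, so this only works once every training column exists in the data frame.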
If you follow these steps, you can completely rely on the list of original column names, and will not need to track column positions or categorical value definitions.
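The steps above can be sketched end to end as follows. The data is invented, and encoding the new data without drop_first is a choice made for this sketch: with drop_first, the dropped level depends on which values happen to appear in the new data, which may differ from training.

```python
import pandas as pd

# Hypothetical training and new data; "purple" never appears in training,
# while "green" and "blue" never appear in the new data.
train_df = pd.DataFrame({"size": [1, 2, 3], "color": ["red", "green", "blue"]})
test_df = pd.DataFrame({"size": [4, 5], "color": ["red", "purple"]})

# Encode the training data and save its column list.
train_one_hot = pd.get_dummies(train_df, columns=["color"], drop_first=True)
train_one_hot_encode_col_list = list(train_one_hot.columns)

# Encode the new data with all levels, then align it to the training columns.
test_df_onehotencode = pd.get_dummies(test_df, columns=["color"])
test_df_reindexed = test_df_onehotencode.reindex(
    columns=train_one_hot_encode_col_list,
    fill_value=0,  # training-only columns get zeros instead of NaN
)

print(list(test_df_reindexed.columns))
```

Here reindex both adds the missing color_green column and drops the unseen color_purple column, so the result has exactly the training columns in the training order.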
I would also advise you to read the following for further reference:
One-hot encoding in pandas - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
Column re-indexing - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html