
How Do I Get Feature Importances For Decision Tree Pipeline That Has Preprocessing And Classification Steps?

I'm trying to fit a Decision Tree model on the UCI Adult dataset. I built the following pipeline to do so: nominal_features = ['workclass', 'education', 'marital-status', 'occupation',

Solution 1:

I am afraid you cannot get importances for your initial features here. Your decision tree knows nothing about them; the only features it sees, and can assign importances to, are the encoded ones.

You may want to try the permutation importance instead, which has several advantages over the tree-based feature importance; it is also easily applicable to pipelines - see Permutation importance using a Pipeline in SciKit-Learn.
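To illustrate, here is a minimal sketch of permutation importance applied to a whole pipeline. The column names and toy data are illustrative stand-ins for the asker's Adult-dataset setup, not their exact code; the key point is that `permutation_importance` shuffles the *original* columns and re-runs the full pipeline, so the scores are reported per original feature:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the Adult data: one nominal and one numeric column
X = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private",
                  "Self-emp", "Private", "State-gov"] * 10,
    "age": [25, 38, 28, 44, 18, 52] * 10,
})
y = [0, 1, 0, 1, 0, 1] * 10

preprocessor = ColumnTransformer(
    [("nominal", OneHotEncoder(handle_unknown="ignore"), ["workclass"])],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])
pipeline.fit(X, y)

# Each permutation goes through the preprocessing step again, so the
# importances line up with the original (pre-encoding) columns.
result = permutation_importance(pipeline, X, y, n_repeats=10, random_state=0)
for name, imp in zip(X.columns, result.importances_mean):
    print(f"{name}: {imp:.3f}")
```

Note that `result.importances_mean` has one entry per original column, not per one-hot-encoded feature.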

Solution 2:

Fundamentally, the importance of a data column can be obtained by summing the importances of all the features that are based on it. Identifying column-to-feature mappings could be a little difficult to do by hand, but you can always use automated tools for that.

For example, the SkLearn2PMML package can translate Scikit-Learn pipelines to PMML representation, and perform various analyses and transformations while doing so. The calculation of aggregate feature importances is well supported.

from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
  ("preprocessor", preprocessor),
  ("classifier", clf)
])
pipeline.fit(X, y)
# Re-map the dynamic attribute to a static pickleable attribute
clf.pmml_feature_importances_ = clf.feature_importances_

sklearn2pmml(pipeline, "PipelineWithImportances.pmml.xml")
