
How Do I Change - Using For Loops To Call Multiple Functions - Into - Using A Pipeline To Call A Class?

The basic requirement is: I get a dictionary of models from the user and a dictionary of their hyperparameters, and I produce a report. The current goal is binary classification.

Solution 1:

You can consider using map(); details here: https://www.geeksforgeeks.org/python-map-function/

Some programmers have the habit of avoiding raw loops - "A raw loop is any loop inside a function where the function serves purpose larger than the algorithm implemented by the loop". More details here: https://sean-parent.stlab.cc/presentations/2013-09-11-cpp-seasoning/cpp-seasoning.pdf

I think that's the reason you were asked to remove the for loop.
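For instance, a minimal sketch of replacing a raw for loop with map() in this setting (assuming a hypothetical models dict of name -> unfitted sklearn estimator and the usual X_train/X_test/y_train/y_test splits; fit_and_score is an illustrative helper, not part of the original code):

from sklearn.metrics import f1_score

def fit_and_score(item):
    # item is a (name, estimator) pair: fit on the training split,
    # then return the name together with the F1 score on the test split
    name, model = item
    model.fit(X_train, y_train)
    return name, f1_score(y_test, model.predict(X_test))

# map() replaces the raw "for name, model in models.items(): ..." loop
results = dict(map(fit_and_score, models.items()))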

Solution 2:

I have implemented a working solution. I should have worded my question better. I initially misunderstood how GridSearchCV and RandomizedSearchCV work internally: cv_results_ exposes the results for every parameter combination tried in the grid, whereas I had thought only the best estimator was available to us.

Using this, for each type of model I took the max rank_test_score and extracted the parameters making up that model. In this example, there are 4 models. I then ran each of those models, i.e. the best combination of parameters for each model, on my test data and computed the required scores. I think this solution can be extended to RandomizedSearchCV and many other options (see the sketch after the implementation below).
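To illustrate the point about cv_results_, a minimal sketch of inspecting it (assuming a GridSearchCV instance named grid that has already been fitted, like my_report.grid in the implementation below):

import pandas as pd

# cv_results_ holds one row per parameter combination tried,
# not just the single best estimator
cv_df = pd.DataFrame(grid.cv_results_)
print(cv_df[['params', 'mean_test_score', 'rank_test_score']])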

NOTE: This is just a trivial solution. A lot of modifications are still necessary, such as scaling the data for specific models, etc. It is only meant as a starting point that can be adapted to the user's needs.

Credits to this answer for the ClfSwitcher() class.

Following is the implementation of the class (suggestions for improvement are welcome).

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score, recall_score, precision_score
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator
import warnings
warnings.filterwarnings('ignore')

cancer = datasets.load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
target = df['target']
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='target', axis=1), target, test_size=0.4, random_state=13, stratify=target)

class ClfSwitcher(BaseEstimator):

    def __init__(self, model=RandomForestClassifier()):
        """
        A custom BaseEstimator that can switch between classifiers.
        :param model: sklearn estimator - the classifier to wrap
        """ 

        self.model = model


    def fit(self, X, y=None, **kwargs):
        self.model.fit(X, y)
        return self


    def predict(self, X, y=None):
        return self.model.predict(X)


    def predict_proba(self, X):
        return self.model.predict_proba(X)

    def score(self, X, y):
        return self.model.score(X, y)

class report(ClfSwitcher):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.grid = None
        self.full_report = None
        self.concise_report = None
        self.scoring_metrics = {
            'precision': precision_score,
            'recall': recall_score,
            'f1': f1_score,
            'roc_auc': roc_auc_score
        }


    def griddy(self, pipeLine, parameters, **kwargs):
        # honour a caller-supplied scoring metric, defaulting to accuracy
        self.grid = GridSearchCV(pipeLine, parameters, scoring=kwargs.get('scoring', 'accuracy'), n_jobs=-1)


    def fit_grid(self, X_train, y_train=None, **kwargs):
        self.grid.fit(X_train, y_train)

    def make_grid_report(self):
        self.full_report = pd.DataFrame(self.grid.cv_results_)

    @staticmethod
    def get_names(col):
        return col.__class__.__name__

    @staticmethod
    def calc_score(col, metric):
        # refit the chosen estimator on the training split and score it on the held-out test split
        return round(metric(y_test, col.fit(X_train, y_train).predict(X_test)), 4)


    def make_concise_report(self):
        self.concise_report = pd.DataFrame(self.grid.cv_results_)
        self.concise_report['model_names'] = self.concise_report['param_cst__model'].apply(self.get_names)
        self.concise_report = self.concise_report.sort_values(['model_names', 'rank_test_score'], ascending=[True, False]) \
                                                .groupby(['model_names']).head(1)[['param_cst__model', 'model_names']] \
                                                .reset_index(drop=True)

        for metric_name, metric_func in self.scoring_metrics.items():
            self.concise_report[metric_name] = self.concise_report['param_cst__model'].apply(self.calc_score, metric=metric_func)

        self.concise_report = self.concise_report[['model_names', 'precision', 'recall', 'f1', 'roc_auc', 'param_cst__model']]

pipeline = Pipeline([
    ('cst', ClfSwitcher()),
])

parameters = [
    {
        'cst__model': [RandomForestClassifier()],
        'cst__model__n_estimators': [10, 20],
        'cst__model__max_depth': [5, 10],
        'cst__model__criterion': ['gini', 'entropy']
    },
    {
        'cst__model': [SVC()],
        'cst__model__C': [10, 20],
        'cst__model__kernel': ['linear'],
        'cst__model__gamma': [0.0001, 0.001]
    },
    {
        'cst__model': [LogisticRegression(solver='liblinear')],  # liblinear supports both l1 and l2 penalties
        'cst__model__C': [13, 17],
        'cst__model__penalty': ['l1', 'l2']
    },
    {
        'cst__model': [GradientBoostingClassifier()],
        'cst__model__n_estimators': [10, 50],
        'cst__model__max_depth': [3, 5],
        'cst__model__min_samples_leaf': [1, 2]
    }
]

my_report = report()
my_report.griddy(pipeline, parameters, scoring='f1')
my_report.fit_grid(X_train, y_train)
my_report.make_concise_report()
my_report.concise_report

Output: the concise report as desired, one row per model type with its precision, recall, f1, and roc_auc scores.
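As a rough sketch of the RandomizedSearchCV extension mentioned above (assuming the same pipeline and parameters objects defined earlier; setting my_report.grid directly instead of via griddy(), with an arbitrary n_iter, purely for illustration):

from sklearn.model_selection import RandomizedSearchCV

my_report = report()
# sample a fixed number of parameter combinations instead of exhausting the grid
my_report.grid = RandomizedSearchCV(pipeline, parameters, n_iter=10,
                                    scoring='accuracy', n_jobs=-1, random_state=13)
my_report.fit_grid(X_train, y_train)
my_report.make_concise_report()
my_report.concise_report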

