The Machine Learning Package - SklearncomPYre

25 Apr 2019

Reading time ~6 minutes

Facilitating beautifully efficient comparisons of machine learning classifiers and regression models

Motivation

As a part of Master of Data Science cohort at The University of British Columbia I would try out different machine learning models on the different datasets almost every week. I would compare the test scores and give my judgement as to which model could be used for further analysis. After doing this for couple of months I realized it was high time to make a package out of it and make my life easy by removing the redundant task of splitting the data, fitting different models and making comparison plot of accuracies.
SklearncomPYre package provides functions to help make the early stages of model selection and exploration easier to cycle through and meaningfully compare.

About the Package

SklearncomPYre harnesses the power of scikit-learn, combining it with pandas dataframes and matplotlib plots for easy, breezy, and beautiful machine learning exploration. A Python package facilitating beautifully efficient comparisons of machine learning classifiers and regression models. The package comprises of 3 functions which together have the capabilities to split the X (predictors) and y (target) inputs in the Train, Test and Validation datasets based on the input proportions given by the user. The package takes the input dictionary of the models user wants to fit and score on the data and additionally saves the bar plot of the Accuracy and Time comparison of fit and predict of these models. The details of each of these functions is given below.

Summary

SklearncomPYre harnesses the power of scikit-learn, combining it with pandas dataframes and matplotlib plots for easy, breezy, and beautiful machine learning exploration.

Looking to do the same in R? Check out caretcompaR!

Function 1: `split()`

The function splits the training input samples X, and target values y (class labels in classification, real numbers in regression) into train, test and validation sets according to specified proportions.

Outputs four array like training, validation, test, and combined training and validation sets and four y arrays.

Inputs:

X data set, type: Array like
Y data set, type: Array like
proportion of training data , type: float
proportion of test data , type: float
proportion of validation data, type: float

Outputs:

X train set, type: Array like
y train, type: Array like
X validation set, type: Array like
y validation, type: Array like
X train and validation set, type: Array like
y train and validation, type: Array like
X test set, type: Array like
y test, type: Array like

Function 2: `train_test_acc_time()`

The purpose of this function is to compare different sklearn regressors or classifiers in terms of training and test accuracies, and the time it takes to fit and predict. The function inputs are dictionary of models, input train samples Xtrain(input features), input test samples Xtest, target train values ytrain and target test values ytest (continuous or categorical).

The function outputs a beautiful dataframe with training & test scores, model variance, and the time it takes to fit and predict using different models.

Inputs:

Dictionary of ML classifiers or regressors.
X train set, type: Array-like
Y train set, type: Array-like
X test set, type: Array-like
Y test set, type: Array-like

Outputs:

Dataframe with 7 columns: (1) regressor or classifier name, (2) training accuracy, (3) test accuracy, (4) model variance, (5) time it takes to fit, (6) time it takes to predict and (7) total time. The dataframe will be sorted by test score in descending order.

Function 3: `comparison_viz()`

The purpose of this function is to visualize the output of train_test_acc_time() for easy communication and interpretation. The user has the choice to visualize a comparison of accuracies or time. It takes in a dataframe with 7 attributes i.e. model name, training & test scores, model variance, and the time it takes to fit, predict and total time.

Outputs a beautiful matplotlib bar chart comparison of different models’ training and test scores or the time it takes to fit and predict.

Inputs:

Dataframe with 7 columns: (1) regressor or classifier name, (2) training accuracy, (3) test accuracy, (4) model variance, (5) time it takes to fit, (6) time it takes to predict and (7) total time. Type: pandas.Dataframe
Choice of accuracy or time, with the default being ‘accuracy’ if no string is given. Type: string

Outputs:

Bar chart of accuracies or time comparison by models saved to root directory. Type: png

Install

Pleas use the following command to install the package. :
pip install git+https://github.com/UBC-MDS/SklearncomPYre.git

Once installed, load the package using following commands :

from SklearncomPYre.train_test_acc_time import train_test_acc_time
from SklearncomPYre.comparison_viz import comparison_viz
from SklearncomPYre.split import split

Dependencies

Python==3.6.8
matplotlib==3.0.1
numpy==1.15.4
pandas==0.20.3
scikit-learn==0.20.2
scipy==1.2.0

How To Use

Here is an example of how you can use SklearncomPYre:

# Example usage

# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Importing SklearncomPYre
from SklearncomPYre.train_test_acc_time import train_test_acc_time
from SklearncomPYre.comparison_viz import comparison_viz
from SklearncomPYre.split import split

# Loading the handy iris dataset
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

# Setting up a dictionary of classifiers to test

dictionary = {
   'knn': KNeighborsClassifier(),
   'LogRegression':LogisticRegression() ,
   'RForest': RandomForestClassifier()}

# Let's start by using the SklearncomPYre function split().

# Splitting up datasets into 40% training, 20% vaildation, and 40% tests sets.

X_train, y_train, X_val, y_val, X_train_val, y_train_val, X_test, y_test = split(X,y,0.4,0.2,0.4)

#Now, let's train some models and compare them in a pandas dataframe by using train_test_acc_time().

result = train_test_acc_time(dictionary,X_train,y_train,X_val,y_val)
result

# Next, let's take a look at some some plots with comparison_viz()

#Our plots will be saved to the working directory.

comparison_viz(result, "accuracy")
comparison_viz(result, 'time')

Credits

Function concepts inspired by UBC MDS DSCI 573 lab instructor Varada Kolhatkar.
README formatting inspiration from ptoolkit.
Badges by Shields IO
Logo designed at Canva

Where does this package fit in?

This package provides functions to help make the early stages of model selection and exploration easier to cycle through and meaningfully compare.

Our idea for this package was to facilitate the comparison of machine learning classifiers and models. Our inspiration came from UBC MDS DSCI 573 lab assignments where we learned to combine python’s sci-kit learn with pandas in order to produce interpretable comparisons of train and test accuracies and time efficiencies across models.

We are not currently aware of any packages that combine sci-kit learn and pandas for efficient and interpretable model-to-model comparisons. We expect that this combination is used in practice and after having used it while learning machine learning techniques during our UBC MDS coursework, we thought it would be a good combination of tools to formally package together.

We are aware of a new package, sklearn-pandas that combines sci-kit learn and pandas powers but this new package is tailored towards providing full-cycle machine learning functionality (feature selection, transformations, inputting/outputting pandas dataframes, etc.) rather than focusing facilitating model-to-model comparisons via dataframes.

Creators	GitHub Page
Birinder Singh	Birinder Singh GitHub
Jes Simkin	Jes Simkin GitHub
Talha Siddiqui	Talha Siddiqui GitHub

License

MIT License

Contribute

Interested in contributing? See our Contributing Guidelines and Code of Conduct.