Hands-On Automated Machine Learning

By which method can logistic regression be implemented?

A logistic regression model can be created with scikit-learn's LogisticRegression class. We load the packages as we did previously for creating a linear regression model:

import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

We will use an HR department dataset that lists the employees who have attrited in the past along with the employees who are continuing in the job:

hr_data = pd.read_csv('data/hr.csv', header=0)
hr_data.head()
hr_data = hr_data.dropna()
print(hr_data.shape)
print(list(hr_data.columns))

The output of the preceding code is as follows:

The dataset has 14999 rows and 10 columns. The hr_data.columns attribute lists the names of the attributes. The salary attribute has three values: high, low, and medium. The sales attribute has seven values: IT, RandD, marketing, product_mng, sales, support, and technical. To use this discrete input data in the model, we need to convert it into numeric format. There are various ways to do so. One way is to dummy encode the values, which is also known as one-hot encoding. With this method, one dummy column is generated for each class of a categorical attribute.

For each dummy attribute, the presence of the class is represented by 1, and its absence is represented by 0.

Discrete data can be either nominal or ordinal. When the discrete values have a natural ordering, the data is termed ordinal. For example, categorical values such as high, medium, and low are ordered values. For these cases, label encoding is mostly used. When we cannot derive any relationship or order from the categorical or discrete values, the data is termed nominal. For example, colors such as red, yellow, and green have no order. For these cases, dummy encoding is a popular method.
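The difference between the two encodings can be illustrated with a minimal sketch. The toy DataFrame, the salary_order mapping, and the column names below are made up for illustration and are not taken from the HR dataset:

```python
import pandas as pd

# Toy data (hypothetical values, not taken from hr.csv)
toy = pd.DataFrame({
    "salary": ["low", "high", "medium", "low"],   # ordinal attribute
    "color": ["red", "green", "yellow", "red"],   # nominal attribute
})

# Label encoding with an explicit mapping preserves the natural order
salary_order = {"low": 0, "medium": 1, "high": 2}
toy["salary_encoded"] = toy["salary"].map(salary_order)

# Dummy (one-hot) encoding creates one indicator column per class
color_dummies = pd.get_dummies(toy["color"], prefix="color", dtype=int)

encoded = pd.concat([toy[["salary_encoded"]], color_dummies], axis=1)
print(encoded)
```

An explicit mapping is used here instead of an automatic label encoder because automatic encoders typically assign integers in alphabetical order, which would not match the low < medium < high ordering.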

The get_dummies method of pandas provides an easy interface for creating dummy variables in Python. The input for the function is the dataset and names of the attributes that are to be dummy encoded. In this case, we will be dummy encoding salary and sales attributes of the HR dataset:

data_trnsf = pd.get_dummies(hr_data, columns=['salary', 'sales'])
data_trnsf.columns

The output of the preceding code is as follows:

Now, the dataset is ready for modeling. The sales and salary attributes are successfully one-hot encoded. Next, as we are going to predict attrition, we use the left attribute as the target because it records whether an employee attrited or not. We drop the left column from the input predictors dataset, which is referred to as X in the code. The left attribute itself serves as the target, Y:

X = data_trnsf.drop('left', axis=1)
X.columns

The output of the preceding code is as follows:

We split the dataset into train and test sets with a ratio of 70:30. 70% of the data will be used to train the logistic regression model and the remaining 30% to evaluate the accuracy of the model:

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, data_trnsf.left, test_size=0.3, random_state=42)
print(X_train)
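The effect of the split can be checked with a minimal sketch on synthetic data. The frame X_demo, the target y_demo, and the 25% positive rate are assumptions made for illustration. The stratify argument is an optional addition that the code above does not use; it keeps the class proportions similar in both parts:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a dataset: 1,000 rows, roughly 25% positives (arbitrary)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series((rng.random(1000) < 0.25).astype(int), name="left")

# 70:30 split; stratify keeps the share of positives similar in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)      # row counts follow the 70:30 ratio
print(y_tr.mean(), y_te.mean())    # share of positives in each part
```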

As we execute the code snippet, four datasets are created. X_train and X_test are the train and test input predictor data. Y_train and Y_test are the train and test target data. Now, we will fit the model on the train data and evaluate the accuracy of the model on the test data. First, we create an instance of the LogisticRegression classifier class. Next, we fit the classifier on the training data:

attrition_classifier = LogisticRegression()
attrition_classifier.fit(X_train, Y_train)

Once the model is successfully created, we use the predict method on the test input predictor dataset to predict the corresponding target values (Y_pred):

Y_pred = attrition_classifier.predict(X_test)

We need to create a confusion matrix to evaluate a classifier. Most of the model evaluation metrics are based on the confusion matrix itself. There is a detailed discussion of the confusion matrix and other evaluation metrics right after this section. For now, let's consider the confusion matrix as a matrix of four values that gives us the counts of cases that were correctly and incorrectly predicted. The classifier's accuracy is calculated from the values in the confusion matrix. The accuracy of our classifier is 0.79 or 79%, which means 79% of cases were correctly predicted:

from sklearn.metrics import confusion_matrix
# Store the result under a new name so the imported function is not overwritten
cm = confusion_matrix(Y_test, Y_pred)
print(cm)

print('Accuracy of logistic regression model on test dataset: {:.2f}'.format(attrition_classifier.score(X_test, Y_test)))

The output of the preceding code is as follows:
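To see how an accuracy figure follows from the four values of a confusion matrix, here is a minimal sketch. The labels y_true and y_hat are made up for illustration and are unrelated to the HR data:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, scikit-learn lays the matrix out as [[tn, fp], [fn, tp]]
cm_demo = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm_demo.ravel()

# Accuracy is the share of correct predictions (the diagonal) among all cases
accuracy_from_cm = (tp + tn) / cm_demo.sum()

print(cm_demo)
print(accuracy_from_cm, accuracy_score(y_true, y_hat))
```

Both printed accuracy values agree, because the score method of a scikit-learn classifier reports the same quantity as accuracy_score.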

Sometimes, accuracy might not be a good measure to judge the performance of a model. For example, in the case of unbalanced datasets, the predictions might be biased towards the majority class. So, we need to look at other metrics, such as f1-score, area under curve (AUC), precision, and recall, that give a fairer judgment of the model. We can retrieve precision, recall, and f1-score for each class by importing classification_report from scikit-learn's metrics module:

from sklearn.metrics import classification_report
print(classification_report(Y_test, Y_pred))

The output of the preceding code is as follows:
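The point about unbalanced datasets can be made concrete with a small sketch. The 90:10 class ratio and the always-predict-the-majority classifier below are hypothetical and chosen only to make the effect obvious:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical imbalanced labels: 90 negatives and 10 positives
y_true_imb = np.array([0] * 90 + [1] * 10)
# A "classifier" that always predicts the majority class
y_majority = np.zeros(100, dtype=int)

acc = accuracy_score(y_true_imb, y_majority)
rec = recall_score(y_true_imb, y_majority)
# zero_division=0 reports 0 when no positive predictions exist at all
prec = precision_score(y_true_imb, y_majority, zero_division=0)
f1 = f1_score(y_true_imb, y_majority, zero_division=0)

print(acc, prec, rec, f1)
```

Accuracy looks high even though no positive case is ever found, which is why recall, precision, and f1-score are reported alongside it.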

The Receiver Operating Characteristic (ROC) curve is most commonly used to visualize the performance of a binary classifier. The AUC measure is the area under the ROC curve, and it provides a single number that summarizes the performance of the classifier based on the ROC curve. The following code snippet can be used to draw an ROC curve using Python:

from sklearn.metrics import roc_curve
from sklearn.metrics import auc

# Compute the false positive rate (fpr), true positive rate (tpr), thresholds, and ROC AUC (area under the curve)
fpr, tpr, thresholds = roc_curve(Y_test, Y_pred)
# Store the result under a new name so the imported auc function is not overwritten
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.plot(fpr, tpr, label='AUC = %0.2f' % roc_auc)
# Random prediction curve
plt.plot([0, 1], [0, 1], 'k--')
# Set the x limits
plt.xlim([0.0, 1.0])
# Set the y limits
plt.ylim([0.0, 1.0])
# Set the x label
plt.xlabel('False Positive Rate (FPR)')
# Set the y label
plt.ylabel('True Positive Rate (TPR)')
# Set the plot title
plt.title('Receiver Operating Characteristic (ROC) Curve')
# Location of the AUC legend
plt.legend(loc="right")
plt.show()

The output of the preceding code is as follows:
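The code above passes hard class predictions to roc_curve, so the curve has only one interior point. A common alternative, shown here as a sketch and not as the approach used in this section, is to pass the predicted probability of the positive class instead. The synthetic data from make_classification and all variable names are assumptions for illustration, so the resulting numbers do not correspond to the HR model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the HR dataset)
X_syn, y_syn = make_classification(n_samples=1000, n_features=8, random_state=42)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_a, y_a)

# Hard 0/1 labels allow only a single operating point between (0, 0) and (1, 1)
fpr_hard, tpr_hard, _ = roc_curve(y_b, clf.predict(X_b))

# Probabilities of the positive class give one point per usable threshold
scores = clf.predict_proba(X_b)[:, 1]
fpr_soft, tpr_soft, _ = roc_curve(y_b, scores)
auc_soft = roc_auc_score(y_b, scores)

print(len(fpr_hard), len(fpr_soft))
print(auc_soft)
```

With probabilities, the curve traces the trade-off across all thresholds, so the AUC reflects how well the model ranks cases and not only how it behaves at the default 0.5 cut-off.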

The AUC for our model is 0.63. We have already seen some of the metrics that are used to evaluate a classification model, and some of them may seem unfamiliar. So, let's understand the metrics before moving on to our discussion of classification algorithms.