
Multiple linear regression model

Multiple linear regression is a straightforward generalization of single-predictor models. In a multiple linear regression model, the dependent variable is related to two or more independent variables. To perform a multiple linear regression analysis, the scikit-learn library will be used. From the sklearn.linear_model module, the LinearRegression class performs an ordinary least squares linear regression.
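In its general form, the model expresses the dependent variable Y as a linear combination of n independent variables plus an error term:

Y = b0 + b1*X1 + b2*X2 + ... + bn*Xn + e

Here, b0 is the intercept, b1 through bn are the coefficients estimated by ordinary least squares, and e is the residual error.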

As usual, we will load the library through the following command:

from sklearn.linear_model import LinearRegression

Now, we can use the LinearRegression() function, as follows:

LModel = LinearRegression()
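The constructor also accepts a few optional parameters; for instance, fit_intercept (True by default) controls whether the intercept is estimated. As an illustration, and depending on the scikit-learn version installed, the same model can be created with the defaults spelled out explicitly:

LModel = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=None)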

To fit the linear model, the fit() function will be used:

LModel.fit(X_train, Y_train)
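The fit() method expects the training predictors (X_train) and targets (Y_train). These arrays were prepared earlier in the chapter; for reference, a split of this kind is typically produced with scikit-learn's train_test_split function. The following is a minimal sketch, assuming X and Y hold the full sets of predictors and targets; the 30 percent test ratio and the seed are illustrative:

from sklearn.model_selection import train_test_split

# Hold out 30% of the samples for testing; the ratio and seed are illustrative
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)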

In the training phase, we used only the data that was set aside for training. At this point, we can use the model to make predictions.

To do this, the predict() function is also available in the scikit-learn library:

Y_predLM = LModel.predict(X_test)
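Once the model has been fitted, the estimated intercept and coefficients (one for each independent variable) can also be inspected through the intercept_ and coef_ attributes of the fitted object:

# Intercept (b0) and coefficients (b1, ..., bn) estimated by least squares
print(LModel.intercept_)
print(LModel.coef_)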

Usually, a scatterplot is used to determine whether or not there is a relationship between variables. However, a scatterplot can also be used to analyze the performance of a linear model. By plotting the actual values on one axis and the predicted values on the other, it is possible to check how the data is arranged. To help with the analysis, it is possible to trace the bisector of the quadrant, that is, the line with the equation Y = X. Ideally, all observations would rest on this line; in practice, we can be satisfied if the data lies close to it, with about half of the points falling below the line and the other half above it. Points that deviate significantly from this line represent possible outliers.

To plot the two scatterplots, we will use the matplotlib library:

import matplotlib.pyplot as plt

plt.figure(1)

# Left panel: actual versus predicted values for the Keras model
plt.subplot(121)
plt.scatter(Y_test, Y_predKM)
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.title("Keras Neural Network Model")

# Right panel: actual versus predicted values for the linear regression model
plt.subplot(122)
plt.scatter(Y_test, Y_predLM)
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.title("SKLearn Linear Regression Model")
plt.show()
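To also trace the Y = X bisector mentioned previously, a reference line can be overlaid on each panel. A minimal sketch: adding the following two statements after each plt.scatter() call (before plt.show()) draws the bisector in red:

import numpy as np

# Y = X reference line spanning the range of the actual values
lims = [np.min(Y_test), np.max(Y_test)]
plt.plot(lims, lims, 'r--')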

In the following diagram, we can see two scatterplots:

Analyzing the preceding graphs, it is clear that the Keras model (on the left) returns better results. The points fit the line better (they are split roughly evenly above and below it), and there are fewer isolated points. To confirm this first intuition, we must also calculate the MSE for the linear regression model. To do this, the sklearn.metrics.mean_squared_error() function will be used. This function computes the MSE regression loss.
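As a reminder, the MSE is the average of the squared differences between the actual and predicted values:

MSE = (1/n) * Σ(Yi - Ŷi)^2

Here, Yi is the i-th actual value, Ŷi is the corresponding prediction, and n is the number of test samples.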

First, we have to import the function:

from sklearn.metrics import mean_squared_error

Then, we can compute the MSE, as follows:

mse = mean_squared_error(Y_test, Y_predLM)
print('Linear Regression Model')
print(mse)

The following result is returned:

Linear Regression Model
0.014089115439987464

Comparing this value (0.014089115439987464) with the one returned by the Keras model (0.0038815933421901066), we can state that the Keras model performs better, recording an error roughly 3.6 times lower than that of the linear regression model.
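As a cross-check, the same scikit-learn metric can be applied to the Keras predictions computed earlier, so that the two values are obtained in exactly the same way:

# Same metric applied to the Keras model's predictions
mse_km = mean_squared_error(Y_test, Y_predKM)
print('Keras Neural Network Model')
print(mse_km)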