
Naive Bayes classifier

Naive Bayes originally became popular as a method for spam detection. It is simple and fast to train. Naive Bayes assumes that the predictor variables are all independent of each other within each class (a bad assumption, but that is what makes it naive!). It also has the advantage that, in principle, it does not need to be retrained from scratch when new data is added, because the model is built from per-class summaries that can be updated. Naive Bayes has its roots in Bayes' theorem.
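The independence assumption can be written out as a formula. This is a sketch in notation introduced here for illustration and not taken from the text: C stands for the class (for example, the species) and x_1, ..., x_n for the predictor variables.

```latex
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

The classifier predicts the class for which the right-hand side is largest. The product form is where the "naive" assumption enters: each variable contributes its own factor, independently of the others.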

This simple example shows Naive Bayes in action. Using the Iris dataset, Naive Bayes will make a prediction for the fifth column using the first four columns as independent variables:

#use 1st 4 columns to predict the fifth
library(e1071)
iris.nb<-naiveBayes(iris[,1:4], iris[,5])
table(predict(iris.nb, iris[,1:4]), iris[,5])

The results of the table() function will go to the console. This output is the confusion matrix, with the predicted species in the rows and the actual species in the columns:

             
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47

The correct classification rate for the problem is 96%. This can be easily verified at the console by summing up the values corresponding with the correct classification counts (that is, row 1/column 1, row 2/column 2, and row 3/column 3) and then dividing by the total number of rows.

Here is how you would do it in the console using the R calculator:
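A minimal sketch of that calculation follows. The counts are copied by hand from the confusion matrix above, and the names cm and accuracy are illustrative choices, not part of the original code.

```r
# Manual arithmetic: correct predictions divided by total rows
(50 + 47 + 47) / 150

# The same result from a matrix holding the counts shown above
# (rows = predicted species, columns = actual species)
species <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(50,  0,  0,
                0, 47,  3,
                0,  3, 47),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = species, actual = species))

# Correct classifications sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

Both expressions give 0.96, which matches the 96% rate quoted above.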

The confusion matrix also tells you which classifications did not perform well. For example, if you sum the values of column 2, you can see that there is a total of 50 versicolor species. However, three of them were misclassified as virginica (row 3, column 2), and another three virginica flowers were misclassified as versicolor (row 2, column 3):

             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47
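The same column sums can be checked in code. This is a sketch only: the counts are typed in from the matrix above, and cm, actual_totals, and per_class_rate are hypothetical names used for illustration.

```r
# Counts copied from the confusion matrix above
# (rows = predicted species, columns = actual species)
species <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(50,  0,  0,
                0, 47,  3,
                0,  3, 47),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = species, actual = species))

# Total number of flowers of each actual species (column sums)
actual_totals <- colSums(cm)
actual_totals

# Share of each actual species that was predicted correctly
per_class_rate <- diag(cm) / actual_totals
per_class_rate
```

Each species has 50 flowers, and the per-species rates come out as 1.00 for setosa and 0.94 for both versicolor and virginica, so all of the errors come from confusing those two species.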

To identify which combinations were misclassified, we can write a little bit of code and examine the incorrect classification rows using a DataTable object. Using a DataTable object allows you to sort, search, and filter the data.

Store the predictions in a vector and merge them with the original data. Then, assign a Correct or Wrong flag to the data frame to designate whether the prediction was correct or not.

pred <- predict(iris.nb, iris[,1:4])
mrg <- cbind(pred,iris)
mrg$correct <- ifelse(mrg$pred==mrg$Species,"Correct","Wrong")

Load the DT library and specify that you want an interactive datatable on the merged data. You will also want some interactive filtering capabilities, so specify filter='top' as a parameter.

library(DT) 
datatable(mrg,filter='top')

The interactive data table will open in the RStudio viewer:

To find the misclassified species, type Wrong in the search box. The display will automatically update to show the incorrect predictions:
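If you are not working in RStudio, the same filter can be applied at the console. The following is a sketch that repeats the earlier steps so that it runs on its own; it assumes the e1071 package is installed, and wrong is an illustrative name that does not appear in the original code.

```r
library(e1071)

# Rebuild the model and the merged data frame from the steps above
iris.nb <- naiveBayes(iris[, 1:4], iris[, 5])
pred <- predict(iris.nb, iris[, 1:4])
mrg <- cbind(pred, iris)
mrg$correct <- ifelse(mrg$pred == mrg$Species, "Correct", "Wrong")

# Keep only the rows where the prediction did not match the species
wrong <- subset(mrg, correct == "Wrong")
wrong

# Cross-tabulate the misclassified rows by predicted and actual species
table(predicted = wrong$pred, actual = wrong$Species)
```

Given the confusion matrix shown earlier, this should list the versicolor and virginica rows that were confused with each other.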