# Naive Bayes Multi-Class Text Classification Example

## 1. Overview

Let’s first discuss what the Naive Bayes algorithm is. “Naive Bayes classifiers are a family of simple ‘probabilistic classifiers’ based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.”

In computer science and statistics, Naive Bayes is also known as `Simple Bayes` and `Independence Bayes`.
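For reference, the classification rule implied by the independence assumption can be written as follows. The notation is introduced here for illustration and is not part of the original article: $c$ is a class, and $x_1, \dots, x_n$ are the feature values of one document (for text, typically term counts or weights).

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The product over features is exactly where the "naive" assumption enters: each feature is treated as independent of the others once the class is known.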

In this article, we will implement news article classification using the `Multinomial Naive Bayes` algorithm.

## 2. Development Environment

Python : 3.6.5

IDE : PyCharm Community Edition

scikit-learn : 0.20.0

numpy : 1.15.3

matplotlib : 3.0.0

## 3. Steps of Text Classification

In this article, we use the 20 Newsgroups dataset and group the original twenty categories into five major categories (Mythology, Science, Sports, Technology, and Politics) to make the example easier to follow.

We use `sklearn.datasets` and a pandas DataFrame to load the file contents and categories for further processing. Here is the sample code.

```
"""Load text files with categories as subfolder names.

Individual samples are assumed to be files stored in a two-level folder
structure such as the following:

NewsArticles/
    Mythology/
        Mythology_file_1.txt
        Mythology_file_2.txt
        ...
        Mythology_file_100.txt
    Politics/
        Politics_file_101.txt
        ...
        Politics_file_200.txt
    Science/
        Science_file_201.txt
        ...
        Science_file_300.txt
    Sports/
        Sports_file_301.txt
        ...
        Sports_file_400.txt
    Technology/
        Technology_file_401.txt
        ...
        Technology_file_500.txt
    ..."""
import pandas as pd
import sklearn.datasets as skds
from pathlib import Path

# Source file directory
path_train = "G:\\DataSet\\DataSet\\NewsArticles"
# Index the files and their categories; contents are read in the loop below
files_train = skds.load_files(path_train, load_content=False)

label_index = files_train.target
label_names = files_train.target_names
labelled_files = files_train.filenames

data_tags = ["filename", "category", "content"]
data_list = []

# Read each file and store it as a (filename, category, content) record
for f, label in zip(labelled_files, label_index):
    data_list.append((f, label_names[label], Path(f).read_text()))

# The data is now available as a DataFrame with filename, category and content columns
data = pd.DataFrame.from_records(data_list, columns=data_tags)
```
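To make the folder convention concrete, here is a minimal standard-library sketch that produces the same kind of (filename, category, content) records from a folder tree. The helper name `load_labelled_files` and the two-file temporary corpus are hypothetical and exist only so the snippet runs anywhere; it is not the loader used in the article.

```python
import tempfile
from pathlib import Path


def load_labelled_files(root):
    """Return (filename, category, content) tuples, one per .txt file."""
    records = []
    # Each direct subfolder of `root` is treated as one category.
    for category_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for file_path in sorted(category_dir.glob("*.txt")):
            content = file_path.read_text(encoding="utf-8", errors="ignore")
            records.append((str(file_path), category_dir.name, content))
    return records


# Build a tiny throwaway corpus that follows the NewsArticles/<Category>/ layout.
with tempfile.TemporaryDirectory() as tmp:
    for category, text in [("Science", "atoms and cells"), ("Sports", "goals and runs")]:
        folder = Path(tmp) / category
        folder.mkdir()
        (folder / f"{category}_file_1.txt").write_text(text, encoding="utf-8")

    records = load_labelled_files(tmp)

categories = [category for _, category, _ in records]
print(categories)
```

The category label comes purely from the subfolder name, which is why the directory layout matters more than the file names.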

### 3.2 Split datasets into Train/Test

The next step is to split the dataset into two parts: a training set used to fit the model and a testing set used to check its performance. Here we split the data in an 80:20 ratio: 80% is the training set and the remaining 20% is the testing set.

Refer to the sample code below:

```
# Take 80% of the data for training and the remaining 20% for testing.
train_size = int(len(data) * .8)

train_posts = data['content'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]

test_posts = data['content'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]
print("Training Set Size "+str(len(train_posts)))
print("Testing Set Size "+str(len(test_posts)))
```
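Slicing by position only gives a fair split when the rows are already in random order. As an alternative sketch, scikit-learn's `train_test_split` can shuffle and keep the class proportions equal in both parts via `stratify`. The ten-row toy data below is made up for illustration and stands in for the article's `content` and `category` columns.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for data['content'] and data['category'].
posts = [f"sample text {i}" for i in range(10)]
tags = ["Science", "Sports"] * 5

# 80:20 split, shuffled, with each category represented proportionally.
train_posts, test_posts, train_tags, test_tags = train_test_split(
    posts,
    tags,
    test_size=0.2,
    stratify=tags,
    random_state=0,
)
print(len(train_posts), len(test_posts))
```

With `stratify`, a small category cannot end up entirely in the training part by accident, which keeps the per-class scores in the later report meaningful.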

### 3.3 PreProcess text

The preprocessing step includes removing common words (also called stop words) and punctuation marks, as well as stemming, lemmatization, etc.
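As a rough illustration of the first two operations, the sketch below lowercases the text, drops punctuation and digits, and removes stop words using only the standard library. The `STOP_WORDS` set is a deliberately tiny, hypothetical list, and stemming or lemmatization is left out because it would normally come from an NLP library.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens only, then remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


tokens = preprocess("The launch of the rocket is delayed, again!")
print(tokens)
```

In the rest of the article this work is delegated to the vectorizer through its `stop_words='english'` option, so a separate function like this is optional.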

After that, a weight is assigned to each term. The weight can be the term count or TF-IDF, which helps identify the importance of a term in a corpus. `CountVectorizer` and `TfidfVectorizer` calculate term counts and TF-IDF weights respectively. Here we use the `TfidfVectorizer` weighting method. Refer to the code below for a better understanding.

```
from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the vocabulary and IDF weights from the training posts only
vectorizer = TfidfVectorizer(stop_words='english', max_features=500)
X_train = vectorizer.fit_transform(train_posts)
# Reuse the fitted vectorizer so the test columns match the training columns
X_test = vectorizer.transform(test_posts)
print("Vocab Size : "+str(X_train.shape))
```
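The difference between `fit_transform` and `transform` is easier to see on a toy corpus. The sentences below are invented for illustration; the point of the sketch is that the vocabulary is learned from the training documents only, so the test matrix has the same columns and words never seen in training are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented mini-corpus, not taken from the article's dataset.
train_docs = [
    "the team won the match",
    "the election results were announced",
    "the team lost the final match",
]
test_docs = ["the match was postponed after the election"]

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
X_test = vectorizer.transform(test_docs)        # reuses them without refitting

print(sorted(vectorizer.vocabulary_))
print(X_train.shape, X_test.shape)
```

Fitting a second vectorizer on the test posts would produce a different column order, and the classifier's learned weights would no longer line up with the features.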

### 3.4 Features Selection

The next step is to select the top K features from our training corpus. Many methods are available for feature selection, such as chi-square, ANOVA, LDA, etc. Refer to feature selection for more details.
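The article does not show code for this step, so the following is only a sketch of how chi-square selection could look with scikit-learn's `SelectKBest`. The four documents, their labels, and `k=3` are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Invented mini-corpus with two categories.
docs = [
    "the team won the cup final",
    "the minister announced a new tax policy",
    "the coach praised the team after the match",
    "voters backed the tax policy in the election",
]
labels = ["Sports", "Politics", "Sports", "Politics"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Keep the 3 terms whose weights depend most strongly on the category.
selector = SelectKBest(chi2, k=3)
X_selected = selector.fit_transform(X, labels)

mask = selector.get_support()  # boolean mask over the original columns
selected_terms = [term for term, idx in vectorizer.vocabulary_.items() if mask[idx]]
print(X.shape, "->", X_selected.shape)
print(sorted(selected_terms))
```

As with the vectorizer, the selector would be fitted on the training matrix and then applied to the test matrix with `selector.transform`.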

### 3.5 Build Model

Now it’s time to train our multi-class news article classification model, as shown below.

```
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, train_tags)
```
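One optional design choice, not used in the article, is to chain the vectorizer and the classifier in a scikit-learn pipeline so that raw text goes in and a category comes out. The training sentences below are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented mini training set.
train_posts = [
    "the striker scored a late goal",
    "the keeper saved the penalty",
    "parliament passed the budget bill",
    "the senator won the election",
]
train_tags = ["Sports", "Sports", "Politics", "Politics"]

# The pipeline fits the vectorizer and the classifier in one call.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_posts, train_tags)

prediction = model.predict(["a goal from the penalty spot"])[0]
print(prediction)
```

Because the pipeline owns the vectorizer, there is no way to accidentally transform new text with a different vocabulary than the one used during training.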

### 3.6 Apply on test data sets

Once the model is built, the next step is to validate it and apply it to the test dataset.

```
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold, cross_val_score

k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
print("Cross validation start")
print(cross_val_score(clf, X_train, train_tags, cv=k_fold, n_jobs=1))
print("Cross validation end")
print("start applying on testing data set")
predictions = clf.predict(X_test)
print("Model applied on testing data set")
class_names = ["Mythology", "Politics", "Science", "Sports", "Technology"]
print("***********Classification Report***********")
# Pass the class list by keyword; newer scikit-learn versions require it
print(classification_report(test_tags, predictions, labels=class_names))
```
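`cross_val_score` returns one accuracy value per fold, and these are usually summarized as a mean and a spread. The short sketch below does that for the five fold scores reported in the Output section of this article, using only the standard library.

```python
from statistics import mean, stdev

# Fold accuracies copied from the Output section of this article.
fold_scores = [0.92037471, 0.92740047, 0.90140845, 0.93896714, 0.92488263]

mean_score = mean(fold_scores)
spread = stdev(fold_scores)  # sample standard deviation across the five folds
print(f"mean accuracy = {mean_score:.4f}, std = {spread:.4f}")
```

A small spread between folds suggests the accuracy does not depend much on which articles happen to land in the validation fold.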

### 3.7 Evaluate Model

The next step is to check how our model is performing. We can check the model's performance using the confusion matrix. In our example, we use `matplotlib` to plot the confusion matrix.

```
import itertools

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix


def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        # Convert counts to fractions of each true class (row)
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.figure()
    print("confusion matrix")
    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    # cm.shape is a (rows, columns) tuple, so index each dimension
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()


# Compute confusion matrix
cnf_matrix = confusion_matrix(test_tags, predictions)
# Plot Confusion Matrix
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix')
```
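The confusion matrix also contains everything needed to recompute the precision and recall figures of the classification report. The sketch below does this by hand for the matrix printed in the Output section of this article (rows are true classes, columns are predicted classes); the variable names are chosen here for illustration.

```python
class_names = ["Mythology", "Politics", "Science", "Sports", "Technology"]

# Confusion matrix as printed in the Output section (rows: true, columns: predicted).
cm = [
    [81, 4, 3, 0, 0],
    [1, 90, 1, 2, 2],
    [0, 0, 88, 2, 22],
    [1, 0, 0, 104, 0],
    [0, 0, 1, 1, 131],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal = correct predictions
accuracy = correct / total

precision = {}
recall = {}
for i, name in enumerate(class_names):
    row_total = sum(cm[i])                 # articles whose true class is `name`
    col_total = sum(row[i] for row in cm)  # articles predicted as `name`
    recall[name] = cm[i][i] / row_total
    precision[name] = cm[i][i] / col_total

for name in class_names:
    print(f"{name:<11} precision={precision[name]:.2f} recall={recall[name]:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reading the third row, 22 of the 112 Science articles were predicted as Technology, which is why Science has the lowest recall and Technology the lowest precision in the report.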

### 3.8 Save Model

`pickle` can be used to save a trained machine learning model to disk and load it again for future use.

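```
import pickle

# save the model to disk
modelFileName = 'finalized_model.sav'
print("Saving model")
with open(modelFileName, 'wb') as model_file:
    pickle.dump(clf, model_file)
print("model saved...")

# load previously saved model from disk
with open(modelFileName, 'rb') as model_file:
    loaded_model = pickle.load(model_file)
```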
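A saved classifier can only score new text if the same fitted vectorizer is available to turn that text into features. One possible approach, sketched below with invented training sentences, is to pickle both objects together; in-memory bytes stand in for a file so the snippet leaves nothing on disk.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented mini training set.
train_posts = [
    "the striker scored a late goal",
    "the keeper saved the penalty",
    "parliament passed the budget bill",
    "the senator won the election",
]
train_tags = ["Sports", "Sports", "Politics", "Politics"]

vectorizer = TfidfVectorizer(stop_words="english")
clf = MultinomialNB().fit(vectorizer.fit_transform(train_posts), train_tags)

# Serialize the vectorizer and the model as one object.
blob = pickle.dumps({"vectorizer": vectorizer, "model": clf})

# Later, possibly in another process: restore both and classify new text.
restored = pickle.loads(blob)
new_features = restored["vectorizer"].transform(["the keeper saved a goal"])
restored_prediction = restored["model"].predict(new_features)[0]
print(restored_prediction)
```

Pickle files should only be loaded from trusted sources, and they are generally expected to be read back with the same scikit-learn version that wrote them.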

## 4. Example

```
import itertools
import pickle
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.datasets as skds
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

"""Load text files with categories as subfolder names.

Individual samples are assumed to be files stored in a two-level folder
structure such as the following:

NewsArticles/
    Mythology/
        Mythology_file_1.txt
        Mythology_file_2.txt
        ...
        Mythology_file_100.txt
    Politics/
        Politics_file_101.txt
        ...
        Politics_file_200.txt
    Science/
        Science_file_201.txt
        ...
        Science_file_300.txt
    Sports/
        Sports_file_301.txt
        ...
        Sports_file_400.txt
    Technology/
        Technology_file_401.txt
        ...
        Technology_file_500.txt
    ..."""
# Source file directory
path_train = "G:\\DataSet\\DataSet\\NewsArticles"
# Index the files and their categories; contents are read in the loop below
files_train = skds.load_files(path_train, load_content=False)

label_index = files_train.target
label_names = files_train.target_names
labelled_files = files_train.filenames

data_tags = ["filename", "category", "content"]
data_list = []

# Read each file and store it as a (filename, category, content) record
for f, label in zip(labelled_files, label_index):
    data_list.append((f, label_names[label], Path(f).read_text()))

# The data is now available as a DataFrame with filename, category and content columns
data = pd.DataFrame.from_records(data_list, columns=data_tags)
print("Data loaded : "+str(len(data)))

# Take 80% of the data for training and the remaining 20% for testing.
train_size = int(len(data) * .8)

train_posts = data['content'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]

test_posts = data['content'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]
print("Training Set Size "+str(len(train_posts)))
print("Testing Set Size "+str(len(test_posts)))

# Alternative weighting: CountVectorizer(stop_words='english', max_features=500)
# https://www.opencodez.com/python/text-classification-using-keras.htm
vectorizer = TfidfVectorizer(stop_words='english', max_features=500)
X_train = vectorizer.fit_transform(train_posts)
# Reuse the fitted vectorizer so the test columns match the training columns
X_test = vectorizer.transform(test_posts)
print("Vocab Size : "+str(X_train.shape))

clf = MultinomialNB()
clf.fit(X_train, train_tags)

k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
print("Cross validation start")
print(cross_val_score(clf, X_train, train_tags, cv=k_fold, n_jobs=1))
print("Cross validation end")
print("start applying on testing data set")
predictions = clf.predict(X_test)
print("Model applied on testing data set")
class_names = ["Mythology", "Politics", "Science", "Sports", "Technology"]
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=class_names))


def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        # Convert counts to fractions of each true class (row)
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.figure()
    print("confusion matrix")
    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    # cm.shape is a (rows, columns) tuple, so index each dimension
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()


# Compute confusion matrix
cnf_matrix = confusion_matrix(test_tags, predictions)
# Plot Confusion Matrix
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix')

# save the model to disk
modelFileName = 'finalized_model.sav'
print("Saving model")
with open(modelFileName, 'wb') as model_file:
    pickle.dump(clf, model_file)
print("model saved...")

# load previously saved model from disk
with open(modelFileName, 'rb') as model_file:
    loaded_model = pickle.load(model_file)
predictions = loaded_model.predict(X_test)
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=class_names))
print("complete")
```

## 5. Output

```
Data loaded : 2666
Training Set Size 2132
Testing Set Size 534
Vocab Size : 2132
Cross validation start
[0.92037471 0.92740047 0.90140845 0.93896714 0.92488263]
Cross validation end
start applying on testing data set
Model applied on testing data set
***********Classification Report***********
              precision    recall  f1-score   support

   Mythology       0.98      0.92      0.95        88
    Politics       0.96      0.94      0.95        96
     Science       0.95      0.79      0.86       112
      Sports       0.95      0.99      0.97       105
  Technology       0.85      0.98      0.91       133

   micro avg       0.93      0.93      0.93       534
   macro avg       0.94      0.92      0.93       534
weighted avg       0.93      0.93      0.92       534

confusion matrix
[[ 81   4   3   0   0]
 [  1  90   1   2   2]
 [  0   0  88   2  22]
 [  1   0   0 104   0]
 [  0   0   1   1 131]]
Saving model
model saved...
***********Classification Report***********
              precision    recall  f1-score   support

   Mythology       0.98      0.92      0.95        88
    Politics       0.96      0.94      0.95        96
     Science       0.95      0.79      0.86       112
      Sports       0.95      0.99      0.97       105
  Technology       0.85      0.98      0.91       133

   micro avg       0.93      0.93      0.93       534
   macro avg       0.94      0.92      0.93       534
weighted avg       0.93      0.93      0.92       534

complete
```

## 6. Conclusion

In this article, we discussed multi-class classification (news article classification) using the Python scikit-learn library. We covered how to load and pre-process the data, build and evaluate a Naive Bayes model, and plot the confusion matrix using matplotlib, along with a complete example.

## 7. References

Refer to the link below for more details:

## 8. Source Code

Multinomial Naive Bayes Multi-Class Example

You can also download the source code of the Multinomial Naive Bayes multi-class example and other useful examples from our Git repository.