1. Overview

Let’s first discuss what the Naive Bayes algorithm is. “Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.”
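As a quick illustration of the underlying idea, Bayes’ theorem turns a likelihood and a prior into a posterior probability. The numbers below are purely made up for illustration:

```python
# Bayes' theorem: P(class | word) = P(word | class) * P(class) / P(word)
# Toy scenario: how likely is a document to be Sports given it contains "goal"?
p_sport = 0.4               # prior P(Sport)
p_other = 0.6               # prior P(not Sport)
p_goal_given_sport = 0.30   # likelihood P("goal" | Sport)
p_goal_given_other = 0.05   # likelihood P("goal" | not Sport)

# Evidence P("goal") via the law of total probability
p_goal = p_goal_given_sport * p_sport + p_goal_given_other * p_other

# Posterior P(Sport | "goal")
posterior = p_goal_given_sport * p_sport / p_goal
print(round(posterior, 3))  # 0.8
```

Seeing the word raises the probability of the Sports class from 0.4 to 0.8; Naive Bayes applies this update for every word in a document, assuming the words are independent given the class.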

In computer science and statistics, Naive Bayes is also called Simple Bayes or Independence Bayes.

In this article, we will implement news article classification using the Multinomial Naive Bayes algorithm.

2. Development Environment

Python : 3.6.5

IDE : PyCharm Community Edition

scikit-learn : 0.20.0

numpy : 1.15.3

matplotlib : 3.0.0

3. Steps of Text Classification

3.1 Load/Prepare Data sets

In this article, we have used the 20 Newsgroups dataset and grouped the original twenty categories into five major categories, namely Mythology, Science, Sports, Technology, and Politics, for easier understanding.

Here we have used sklearn.datasets and a pandas DataFrame to load the file contents and categories for further processing. Here is the sample code.

"""Load text files with categories as subfolder names.

    Individual samples are assumed to be files stored in a two-level folder
    structure such as the following:

        NewsArticles/
            Mythology/
                Mythology_file_1.txt
                Mythology_file_2.txt
                ...
                Mythology_file_100.txt
            Politics/
                Politics_file_101.txt
                ...
                Politics_file_200.txt
            Science/
                Science_file_201.txt
                ...
                Science_file_300.txt
            Sports/
                Sports_file_301.txt
                ...
                Sports_file_400.txt
            Technology/
                Technology_file_401.txt
                ...
                Technology_file_500.txt
                ..."""
import pandas as pd
from pathlib import Path
import sklearn.datasets as skds

# Source file directory
path_train = "G:\\DataSet\\DataSet\\NewsArticles"
files_train = skds.load_files(path_train, load_content=False)
label_index = files_train.target
label_names = files_train.target_names
labelled_files = files_train.filenames

data_tags = ["filename", "category", "content"]
data_list = []


# Read each file and pair its content with its category label
for i, f in enumerate(labelled_files):
    data_list.append((f, label_names[label_index[i]], Path(f).read_text()))

# Training data available as records of (filename, category, content)
data = pd.DataFrame.from_records(data_list, columns=data_tags)
print("Data loaded : "+str(len(data)))

3.2 Split datasets into Train/Test

The next step is to split the data into two parts: a training set to fit the model and a test set to check its performance. Here we split the data in an 80:20 ratio, so 80% is the training set and the remaining 20% is the test set.

Refer below sample code:

# Let's take 80% of the data for training and the remaining 20% for testing.
train_size = int(len(data) * .8)

train_posts = data['content'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]

test_posts = data['content'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]
print("Training Set Size "+str(len(train_posts)))
print("Testing Set Size "+str(len(test_posts)))
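The head-slice split above works here because `load_files` shuffles the files by default; scikit-learn’s `train_test_split` makes the shuffling (and class stratification) explicit. A minimal sketch on invented toy data:

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the article's posts/tags (illustrative only)
posts = [f"match report {i}" for i in range(50)] + [f"election news {i}" for i in range(50)]
tags = ["Sports"] * 50 + ["Politics"] * 50

# 80/20 split; stratify keeps the class ratio identical in both halves
train_posts, test_posts, train_tags, test_tags = train_test_split(
    posts, tags, test_size=0.2, random_state=42, stratify=tags)

print(len(train_posts), len(test_posts))  # 80 20
print(test_tags.count("Sports"))          # 10
```

Stratification matters when some categories are much smaller than others, because a plain random split can otherwise leave a class underrepresented in the test set.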

3.3 PreProcess text

The preprocessing step includes removal of common words (also called stop words) and punctuation marks, stemming, lemmatization, etc.
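As a tiny illustration of this kind of cleaning, here is a pure-Python sketch (the stop-word list is a made-up miniature; real pipelines use a full list or a library):

```python
import string

stop_words = {"the", "is", "a", "an", "of"}  # tiny illustrative stop list

def preprocess(text):
    # Lowercase, strip punctuation, then drop stop words
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    return [w for w in text.split() if w not in stop_words]

print(preprocess("The match, as expected, was a thriller!"))
# ['match', 'as', 'expected', 'was', 'thriller']
```

In practice the vectorizers below perform tokenization and stop-word removal for us via their `stop_words` parameter.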

After that, a weight is assigned to each term; the weight can be a raw term count or TF-IDF, which helps identify the importance of a term in the corpus. CountVectorizer and TfidfVectorizer compute term counts and TF-IDF weights respectively. Here we have used the TfidfVectorizer weighting method. Refer to the code below for better understanding.

vectorizer = TfidfVectorizer(stop_words='english', max_features=500)
X_train = vectorizer.fit_transform(train_posts)
# The test set must be transformed with the SAME fitted vectorizer
X_test = vectorizer.transform(test_posts)
print("Vocab Size : "+str(X_train.shape[1]))
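To see what the vectorizer actually produces, here is a tiny self-contained example on an invented three-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the team won the match",
        "parliament passed the budget",
        "the team lost the final match"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# X is a sparse matrix: rows = documents, columns = vocabulary terms
print(X.shape[0])                        # 3 documents
print(sorted(vectorizer.vocabulary_))    # stop words like "the" are gone
```

Each row holds the TF-IDF weight of every vocabulary term for that document; terms appearing in many documents receive lower weights than rare, discriminative terms.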

3.4 Features Selection

The next step is to select the top K features from our training corpus. Many methods are available for feature selection, such as chi-square, ANOVA, and LDA. Refer to the feature-selection documentation for more details.
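As one concrete option, scikit-learn’s `SelectKBest` with the chi-square score keeps the K terms most associated with the class labels. A minimal sketch on invented toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus and labels (illustrative only)
docs = ["goal scored in the match", "match ended in a draw",
        "minister spoke in parliament", "parliament passed a law"]
labels = ["Sports", "Sports", "Politics", "Politics"]

X = CountVectorizer(stop_words='english').fit_transform(docs)

# Keep the 4 terms with the highest chi-square score against the labels
selector = SelectKBest(chi2, k=4)
X_selected = selector.fit_transform(X, labels)

print(X_selected.shape)  # (4, 4): 4 documents, 4 surviving features
```

Chi-square works well with non-negative count features, which is exactly what text vectorizers produce.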

3.5 Build Model

Now it’s time to train our multi-class news article classification model as below.

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, train_tags)
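Putting vectorizer and classifier together on an invented toy corpus shows the whole fit/predict cycle end to end:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (illustrative only)
train_docs = ["goal scored late in the match", "the team won the cup",
              "parliament debated the new law", "the minister resigned today"]
train_labels = ["Sports", "Sports", "Politics", "Politics"]

vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(train_docs)

clf = MultinomialNB()
clf.fit(X, train_labels)

# New, unseen text must go through the SAME fitted vectorizer
new_doc = vec.transform(["the team scored a goal"])
print(clf.predict(new_doc))  # ['Sports']
```

Note that `transform` (not `fit_transform`) is used for new text, so the vocabulary learned at training time is reused unchanged.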

3.6 Apply on test data sets

Once the model is built, the next step is to validate it and apply it to the test data set.

k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
print("Cross validation start")
print(cross_val_score(clf, X_train, train_tags, cv=k_fold, n_jobs=1))
print("Cross validation end")
print("start applying on testing data set")
predictions = clf.predict(X_test)
print("Model applied on testing data set")
class_names = ["Mythology", "Politics", "Science", "Sports", "Technology"]
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=class_names))

3.7 Evaluate Model

The next step is to check how the model performs. We can inspect performance with a confusion matrix; in our example, we use matplotlib to plot it.

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.figure()
    print("confusion matrix")
    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()


# Compute confusion matrix
cnf_matrix = confusion_matrix(test_tags, predictions)
# Plot Confusion Matrix

plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix')

3.8 Save Model

Pickle can be used to save a trained machine learning model and load it again for future use.

# Save the model to disk
modelFileName = 'finalized_model.sav'
print("Saving model")
with open(modelFileName, 'wb') as f:
    pickle.dump(clf, f)
print("model saved...")

print("load previously saved model")
# Load the previously saved model from disk
with open(modelFileName, 'rb') as f:
    loaded_model = pickle.load(f)
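For scikit-learn estimators, the joblib library (installed alongside scikit-learn) is a common alternative to raw pickle, since it handles the large NumPy arrays inside fitted models more efficiently. A minimal sketch, using a tiny toy model just to have something to persist:

```python
import joblib
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Train a tiny model on toy count data (illustrative only)
clf = MultinomialNB()
X = np.array([[2, 0], [0, 3]])
y = ["Sports", "Politics"]
clf.fit(X, y)

# Dump/load round-trip
joblib.dump(clf, "model.joblib")
restored = joblib.load("model.joblib")
print(restored.predict([[4, 0]]))  # ['Sports']
```

Whichever format is used, remember to persist the fitted vectorizer alongside the classifier, since new text must be transformed with the same vocabulary at prediction time.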

4. Example

import pandas as pd
from pathlib import Path
import sklearn.datasets as skds
import pickle
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
import itertools
import numpy as np
import matplotlib.pyplot as plt

"""Load text files with categories as subfolder names.

    Individual samples are assumed to be files stored in a two-level folder
    structure such as the following:

        NewsArticles/
            Mythology/
                Mythology_file_1.txt
                Mythology_file_2.txt
                ...
                Mythology_file_100.txt
            Politics/
                Politics_file_101.txt
                ...
                Politics_file_200.txt
            Science/
                Science_file_201.txt
                ...
                Science_file_300.txt
            Sports/
                Sports_file_301.txt
                ...
                Sports_file_400.txt
            Technology/
                Technology_file_401.txt
                ...
                Technology_file_500.txt
                ..."""
# Source file directory
path_train = "G:\\DataSet\\DataSet\\NewsArticles"
files_train = skds.load_files(path_train, load_content=False)
label_index = files_train.target
label_names = files_train.target_names
labelled_files = files_train.filenames

data_tags = ["filename", "category", "content"]
data_list = []


# Read each file and pair its content with its category label
for i, f in enumerate(labelled_files):
    data_list.append((f, label_names[label_index[i]], Path(f).read_text()))

# Training data available as records of (filename, category, content)
data = pd.DataFrame.from_records(data_list, columns=data_tags)
print("Data loaded : "+str(len(data)))
# Let's take 80% of the data for training and the remaining 20% for testing.
train_size = int(len(data) * .8)

train_posts = data['content'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]

test_posts = data['content'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]
print("Training Set Size "+str(len(train_posts)))
print("Testing Set Size "+str(len(test_posts)))
vectorizer = TfidfVectorizer(stop_words='english', max_features=500)
X_train = vectorizer.fit_transform(train_posts)
# The test set must be transformed with the SAME fitted vectorizer
X_test = vectorizer.transform(test_posts)
print("Vocab Size : "+str(X_train.shape[1]))

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, train_tags)

k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
print("Cross validation start")
print(cross_val_score(clf, X_train, train_tags, cv=k_fold, n_jobs=1))
print("Cross validation end")
print("start applying on testing data set")
predictions = clf.predict(X_test)
print("Model applied on testing data set")
class_names = ["Mythology", "Politics", "Science", "Sports", "Technology"]
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=class_names))


def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.figure()
    print("confusion matrix")
    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()


# Compute confusion matrix
cnf_matrix = confusion_matrix(test_tags, predictions)
# Plot confusion matrix
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix')

# Save the model to disk
modelFileName = 'finalized_model.sav'
print("Saving model")
with open(modelFileName, 'wb') as f:
    pickle.dump(clf, f)
print("model saved...")

print("load previously saved model")
# Load the previously saved model from disk
with open(modelFileName, 'rb') as f:
    loaded_model = pickle.load(f)

print("Start prediction from loaded model")
predictions = loaded_model.predict(X_test)
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=class_names))
print("complete")

5. Output

Data loaded : 2666
Training Set Size 2132
Testing Set Size 534
Vocab Size : 500
Cross validation start
[0.92037471 0.92740047 0.90140845 0.93896714 0.92488263]
Cross validation end
start applying on testing data set
Model applied on testing data set
***********Classification Report***********
              precision    recall  f1-score   support

   Mythology       0.98      0.92      0.95        88
    Politics       0.96      0.94      0.95        96
     Science       0.95      0.79      0.86       112
      Sports       0.95      0.99      0.97       105
  Technology       0.85      0.98      0.91       133

   micro avg       0.93      0.93      0.93       534
   macro avg       0.94      0.92      0.93       534
weighted avg       0.93      0.93      0.92       534

confusion matrix
[[ 81   4   3   0   0]
 [  1  90   1   2   2]
 [  0   0  88   2  22]
 [  1   0   0 104   0]
 [  0   0   1   1 131]]
Saving model
model saved...
load previously saved model
Start prediction from loaded model
***********Classification Report***********
              precision    recall  f1-score   support

   Mythology       0.98      0.92      0.95        88
    Politics       0.96      0.94      0.95        96
     Science       0.95      0.79      0.86       112
      Sports       0.95      0.99      0.97       105
  Technology       0.85      0.98      0.91       133

   micro avg       0.93      0.93      0.93       534
   macro avg       0.94      0.92      0.93       534
weighted avg       0.93      0.93      0.92       534

complete

Naive Bayes Multi Class Text Classification Output

6. Conclusion

In this article, we have discussed multi-class text classification (news article classification) using the Python scikit-learn library: how to load data, pre-process it, build and evaluate a Naive Bayes model with a confusion matrix, and plot that confusion matrix using matplotlib, along with a complete example.

7. Source Code

Multinomial Naive Bayes Multi-Class Example

 
