

1. Overview
Let’s first discuss what the Naive Bayes algorithm is. “Naive Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.”
In computer science and statistics, Naive Bayes is also called Simple Bayes and Independence Bayes.
In this article, we will implement news article classification using the Multinomial Naive Bayes algorithm.
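As a rough sketch of what the independence assumption means for the multinomial variant, the model can be written as below. The symbols are notation chosen here for illustration and do not come from the article: x_i is the count of term i in a document, n is the vocabulary size, N_ci is the count of term i in the training documents of class c, N_c is the sum of N_ci over all terms, and alpha is a smoothing parameter (alpha = 1 gives Laplace smoothing).

```latex
\[
P(c \mid x) \;\propto\; P(c) \prod_{i=1}^{n} \theta_{ci}^{\,x_i},
\qquad
\hat{\theta}_{ci} = \frac{N_{ci} + \alpha}{N_c + \alpha n}
\]
\[
\hat{c} = \arg\max_{c} \Big[ \log P(c) + \sum_{i=1}^{n} x_i \log \hat{\theta}_{ci} \Big]
\]
```

The product over terms is where the "naive" assumption enters: each term contributes independently given the class. Working with logarithms turns the product into a sum and avoids numerical underflow on long documents.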
2. Development Environment
Python : 3.6.5
IDE : PyCharm Community Edition
scikit-learn : 0.20.0
numpy : 1.15.3
matplotlib : 3.0.0
3. Steps of Text Classification
3.1 Load/Prepare Data sets
In this article, we use the 20 Newsgroups data set and group its original twenty categories into five major categories (Mythology, Science, Sports, Technology, and Politics) to keep the example easy to follow.
We use sklearn.datasets and a pandas DataFrame to load the file contents and categories for further processing. Here is the sample code.
"""Load text files with categories as subfolder names.
Individual samples are assumed to be files stored in a two-level folder
structure such as the following:
    NewsArticles/
        Mythology/
            Mythology_file_1.txt
            Mythology_file_2.txt
            ...
            Mythology_file_100.txt
        Politics/
            Politics_file_101.txt
            ...
            Politics_file_200.txt
        Science/
            Science_file_201.txt
            ...
            Science_file_300.txt
        Sports/
            Sports_file_301.txt
            ...
            Sports_file_400.txt
        Technology/
            Technology_file_401.txt
            ...
            Technology_file_500.txt
..."""
import pandas as pd
from pathlib import Path
import sklearn.datasets as skds

# Source file directory
path_train = "G:\\DataSet\\DataSet\\NewsArticles"
# load_content=False returns only file names and labels; contents are read below
files_train = skds.load_files(path_train, load_content=False)
label_index = files_train.target
label_names = files_train.target_names
labelled_files = files_train.filenames
data_tags = ["filename", "category", "content"]
data_list = []
# Read each file and add (filename, category, content) to a list
for f, idx in zip(labelled_files, label_index):
    data_list.append((f, label_names[idx], Path(f).read_text()))
# We have the data available as a DataFrame: filename, category, content
data = pd.DataFrame.from_records(data_list, columns=data_tags)
print("Data loaded : "+str(len(data)))
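To try the folder convention without the original data set, the following sketch builds a tiny, made-up corpus in a temporary directory and reads it back using only the standard library. The helper name load_labelled_files and the sample sentences are hypothetical; the explicit encoding="utf-8" with errors="replace" is an assumption that may need adjusting for real newsgroup files.

```python
import tempfile
from pathlib import Path


def load_labelled_files(root):
    """Return (filename, category, content) tuples for root/<category>/<file>.txt."""
    records = []
    for category_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for file_path in sorted(category_dir.glob("*.txt")):
            # errors="replace" keeps reading when a byte is not valid UTF-8
            text = file_path.read_text(encoding="utf-8", errors="replace")
            records.append((str(file_path), category_dir.name, text))
    return records


# Build a tiny, made-up corpus so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "Science": ["The probe measured solar wind."],
        "Sports": ["The home team won the final.", "A record crowd watched."],
    }
    for category, texts in samples.items():
        folder = Path(tmp) / category
        folder.mkdir()
        for n, text in enumerate(texts, start=1):
            (folder / f"{category}_file_{n}.txt").write_text(text, encoding="utf-8")

    records = load_labelled_files(tmp)

categories = [category for _, category, _ in records]
print(len(records), categories)
```

The subfolder name doubles as the label, which is the same convention that load_files relies on.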
3.2 Split datasets into Train/Test
The next step is to split the data set into two parts: a training set and a testing set, which is used to check the model's performance. Here we split the data in an 80:20 ratio: 80% is the training set and the remaining 20% is the testing set. Because load_files shuffles the files by default, simple slicing gives a mixed sample of all categories.
Refer to the sample code below:
# Let's take 80% of the data for training and the remaining 20% for testing.
train_size = int(len(data) * .8)
train_posts = data['content'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]
test_posts = data['content'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]
print("Training Set Size "+str(len(train_posts)))
print("Testing Set Size "+str(len(test_posts)))
3.3 PreProcess text
The preprocessing step includes removing common words (also called stop words) and punctuation marks, stemming, lemmatization, and so on.
After that, a weight is assigned to each term. The weight can be a term count or a TF-IDF score, which helps identify the importance of a term in a corpus. CountVectorizer and TfidfVectorizer calculate term counts and TF-IDF weights respectively. Here we use TfidfVectorizer with English stop-word removal, keeping the 500 most frequent terms as features. Refer to the code below.
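# Fit the TF-IDF vocabulary on the training posts only, then reuse it for the test posts
vectorizer = TfidfVectorizer(stop_words='english', max_features=500)
X_train = vectorizer.fit_transform(train_posts)
X_test = vectorizer.transform(test_posts)
# Note: X_train.shape[0] is the number of training documents;
# the vocabulary size (number of features) is X_train.shape[1]
print("Vocab Size : "+str(X_train.shape[0]))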
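To make the TF-IDF weighting more concrete, here is a small from-scratch sketch on three made-up documents. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row, which is how scikit-learn documents the TfidfVectorizer defaults. Tokenisation here is a plain whitespace split, and the names docs, idf and tfidf_row are illustrative only.

```python
import math
from collections import Counter

# Three made-up documents, already lower-cased and stripped of stop words
docs = [
    "rocket launch orbit",
    "match goal goal",
    "rocket match",
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: rarer terms get a larger weight
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}


def tfidf_row(tokens):
    """Weight raw term counts by idf, then scale the row to unit length."""
    counts = Counter(tokens)
    weights = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]


matrix = [tfidf_row(tokens) for tokens in tokenized]
print(vocabulary)
for row in matrix:
    print([round(w, 3) for w in row])
```

Terms that occur in fewer documents (such as "launch") receive a higher idf than terms shared by several documents (such as "rocket"), which is the sense in which TF-IDF measures importance within the corpus.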
3.4 Features Selection
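The next step is to select the top K features from the training corpus. Many methods are available for feature selection, such as chi-square, ANOVA, and LDA. Refer to feature selection for more details. In this example, the max_features=500 setting of the vectorizer already limits the features to the 500 most frequent terms.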
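As an illustration of chi-square feature selection, the sketch below scores each term against one class using the textbook 2x2 contingency-table statistic and keeps the top K terms. It is a from-scratch example on made-up documents, not the method used in this article; scikit-learn offers SelectKBest and chi2 in sklearn.feature_selection for real use. The names chi_square and score_terms are hypothetical.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class without the term
    n00: documents outside the class without the term
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator if denominator else 0.0


def score_terms(docs, target_class):
    """Score every vocabulary term against target_class."""
    tokenized = [(set(text.split()), label) for text, label in docs]
    vocabulary = sorted(set().union(*(tokens for tokens, _ in tokenized)))
    scores = {}
    for term in vocabulary:
        n11 = sum(1 for tokens, label in tokenized if term in tokens and label == target_class)
        n10 = sum(1 for tokens, label in tokenized if term in tokens and label != target_class)
        n01 = sum(1 for tokens, label in tokenized if term not in tokens and label == target_class)
        n00 = sum(1 for tokens, label in tokenized if term not in tokens and label != target_class)
        scores[term] = chi_square(n11, n10, n01, n00)
    return scores


# Made-up labelled documents
docs = [
    ("rocket launch orbit", "Science"),
    ("rocket engine test", "Science"),
    ("match goal team", "Sports"),
    ("team wins match", "Sports"),
]

scores = score_terms(docs, "Science")
# Highest score first; ties are broken alphabetically
top_k = sorted(scores, key=lambda t: (-scores[t], t))[:3]
print(top_k)
```

Note that the statistic is high for terms strongly tied to the class in either direction: "match" and "team" score as highly as "rocket" here because their absence is just as informative about Science.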
3.5 Build Model
Now it’s time to train our multi-class news article classification model, as shown below.
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, train_tags)
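To show roughly what happens inside fit and predict, here is a from-scratch sketch of multinomial Naive Bayes with Laplace smoothing on raw term counts. It is an illustration under simplifying assumptions (whitespace tokens, made-up documents, terms unseen in training are skipped) and not scikit-learn's implementation; the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from token lists."""
    vocabulary = sorted({t for tokens in docs for t in tokens})
    class_doc_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        # Denominator: all term occurrences in the class plus alpha per vocabulary term
        total = sum(term_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            t: math.log((term_counts[c][t] + alpha) / total) for t in vocabulary
        }
    return log_prior, log_likelihood


def predict(tokens, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unseen terms are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in tokens if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Made-up training documents
train_docs = [
    "rocket launch orbit".split(),
    "rocket engine test".split(),
    "match goal team".split(),
    "team wins match".split(),
]
train_labels = ["Science", "Science", "Sports", "Sports"]

log_prior, log_likelihood = train_multinomial_nb(train_docs, train_labels)
print(predict("rocket orbit".split(), log_prior, log_likelihood))
print(predict("team goal".split(), log_prior, log_likelihood))
```

The smoothing term alpha keeps a term that never appeared in a class from forcing that class's probability to zero.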
3.6 Apply on test data sets
Once the model is built, the next step is to validate it with cross-validation and apply it to the test data set.
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
print("Cross validation start")
print(cross_val_score(clf, X_train, train_tags, cv=k_fold, n_jobs=1))
print("Cross validation end")
print("start applying on testing data set")
predictions = clf.predict(X_test)
print("Model applied on testing data set")
class_names = ["Mythology", "Politics", "Science", "Sports", "Technology"]
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=class_names))
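The idea behind 5-fold cross-validation can be sketched without any library: the training indices are shuffled once, cut into five folds, and each fold serves as the held-out set in turn. The generator below is a hypothetical helper for illustration. It gives the first n_samples % n_splits folds one extra sample, but its shuffle will not reproduce the exact folds that KFold(random_state=0) produces.

```python
import random


def k_fold_indices(n_samples, n_splits=5, shuffle=True, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # The first n_samples % n_splits folds get one extra sample
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 2132 is the training set size reported in the Output section
folds = list(k_fold_indices(2132, n_splits=5))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)
```

Every sample is held out exactly once, so the five scores printed by cross_val_score are each measured on data the model did not see during that round of training.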
3.7 Evaluate Model
The next step is to check how our model is performing. We can check the model performance using a confusion matrix. In our example, we have used matplotlib to plot the confusion matrix.
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        # Convert counts to fractions of each true class (row)
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.figure()
    print("confusion matrix")
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    # Write each cell value on top of the heat map
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()

# Compute confusion matrix (rows: true labels, columns: predicted labels)
cnf_matrix = confusion_matrix(test_tags, predictions, labels=class_names)
# Plot Confusion Matrix
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix')
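The confusion matrix and the classification report describe the same predictions. As a worked example, the sketch below takes the matrix printed in the Output section (rows are true labels, columns are predicted labels) and recomputes precision, recall and F1 for each class in plain Python. The helper name per_class_metrics is hypothetical.

```python
class_names = ["Mythology", "Politics", "Science", "Sports", "Technology"]
# Confusion matrix copied from the Output section
cm = [
    [81, 4, 3, 0, 0],
    [1, 90, 1, 2, 2],
    [0, 0, 88, 2, 22],
    [1, 0, 0, 104, 0],
    [0, 0, 1, 1, 131],
]


def per_class_metrics(cm):
    """Return a (precision, recall, f1) tuple for every class."""
    n = len(cm)
    metrics = []
    for k in range(n):
        true_positive = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics.append((precision, recall, f1))
    return metrics


metrics = per_class_metrics(cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
for name, (p, r, f1) in zip(class_names, metrics):
    print(f"{name:<12}{p:>8.2f}{r:>8.2f}{f1:>8.2f}")
print(f"accuracy {accuracy:.2f}")
```

The low recall for Science (88 of 112) comes from the 22 Science articles predicted as Technology, which is also what pulls Technology's precision down to 131 of 155.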
3.8 Save Model
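Python's pickle module can be used to save a trained model to disk and load it again for future use. To classify new text later, the fitted vectorizer needs to be saved as well, since the model only accepts the features it produces.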
# save the model to disk
modelFileName = 'finalized_model.sav'
print("Saving model")
with open(modelFileName, 'wb') as model_file:
    pickle.dump(clf, model_file)
print("model saved...")
print("load previously saved model")
# load previously saved model from disk
with open(modelFileName, 'rb') as model_file:
    loaded_model = pickle.load(model_file)
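The save and load round trip can be tried in isolation with the sketch below. It uses a plain dictionary as a stand-in for the trained classifier and its vocabulary (the values are made up) and writes to a temporary directory, so it runs without scikit-learn.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained model plus its fitted vocabulary (made-up values)
artifacts = {
    "model": {"class_log_prior": [-0.69, -0.69]},
    "vocabulary": {"rocket": 0, "team": 1},
}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "finalized_model.sav"
    # The with-blocks close the file even if dump or load raises
    with open(model_path, "wb") as model_file:
        pickle.dump(artifacts, model_file)
    with open(model_path, "rb") as model_file:
        restored = pickle.load(model_file)

print(restored == artifacts)
```

Two cautions apply to real models: unpickling can execute arbitrary code, so only load files from a trusted source, and a model pickled with one scikit-learn version is not guaranteed to load correctly in another.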
4. Example
import pandas as pd
from pathlib import Path
import sklearn.datasets as skds
import pickle
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
import itertools
import numpy as np
import matplotlib.pyplot as plt
"""Load text files with categories as subfolder names.
Individual samples are assumed to be files stored in a two-level folder
structure such as the following:
    NewsArticles/
        Mythology/
            Mythology_file_1.txt
            Mythology_file_2.txt
            ...
            Mythology_file_100.txt
        Politics/
            Politics_file_101.txt
            ...
            Politics_file_200.txt
        Science/
            Science_file_201.txt
            ...
            Science_file_300.txt
        Sports/
            Sports_file_301.txt
            ...
            Sports_file_400.txt
        Technology/
            Technology_file_401.txt
            ...
            Technology_file_500.txt
..."""
# Source file directory
path_train = "G:\\DataSet\\DataSet\\NewsArticles"
files_train = skds.load_files(path_train, load_content=False)
label_index = files_train.target
label_names = files_train.target_names
labelled_files = files_train.filenames
data_tags = ["filename", "category", "content"]
data_list = []
# Read each file and add (filename, category, content) to a list
for f, idx in zip(labelled_files, label_index):
    data_list.append((f, label_names[idx], Path(f).read_text()))
# We have the data available as a DataFrame: filename, category, content
data = pd.DataFrame.from_records(data_list, columns=data_tags)
print("Data loaded : "+str(len(data)))
# Let's take 80% of the data for training and the remaining 20% for testing.
train_size = int(len(data) * .8)
train_posts = data['content'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]
test_posts = data['content'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]
print("Training Set Size "+str(len(train_posts)))
print("Testing Set Size "+str(len(test_posts)))
#print(train_tags[:5])
# Fit the TF-IDF vocabulary on the training posts only, then reuse it for the test posts
vectorizer = TfidfVectorizer(stop_words='english', max_features=500)
X_train = vectorizer.fit_transform(train_posts)
X_test = vectorizer.transform(test_posts)
# Note: X_train.shape[0] is the number of training documents;
# the vocabulary size (number of features) is X_train.shape[1]
print("Vocab Size : "+str(X_train.shape[0]))
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, train_tags)
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
print("Cross validation start")
print(cross_val_score(clf, X_train, train_tags, cv=k_fold, n_jobs=1))
print("Cross validation end")
print("start applying on testing data set")
predictions = clf.predict(X_test)
print("Model applied on testing data set")
class_names = ["Mythology", "Politics", "Science", "Sports", "Technology"]
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=class_names))
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        # Convert counts to fractions of each true class (row)
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.figure()
    print("confusion matrix")
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    # Write each cell value on top of the heat map
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()

# Compute confusion matrix (rows: true labels, columns: predicted labels)
cnf_matrix = confusion_matrix(test_tags, predictions, labels=class_names)
# Plot Confusion Matrix
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix')
# save the model to disk
modelFileName = 'finalized_model.sav'
print("Saving model")
with open(modelFileName, 'wb') as model_file:
    pickle.dump(clf, model_file)
print("model saved...")
print("load previously saved model")
# load previously saved model from disk
with open(modelFileName, 'rb') as model_file:
    loaded_model = pickle.load(model_file)
print("Start prediction from loaded model")
# Predict with the reloaded model to confirm it behaves like the original
predictions = loaded_model.predict(X_test)
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=class_names))
print("complete")
5. Output
Data loaded : 2666
Training Set Size 2132
Testing Set Size 534
Vocab Size : 2132
Cross validation start
[0.92037471 0.92740047 0.90140845 0.93896714 0.92488263]
Cross validation end
start applying on testing data set
Model applied on testing data set
***********Classification Report***********
              precision    recall  f1-score   support
   Mythology       0.98      0.92      0.95        88
    Politics       0.96      0.94      0.95        96
     Science       0.95      0.79      0.86       112
      Sports       0.95      0.99      0.97       105
  Technology       0.85      0.98      0.91       133
   micro avg       0.93      0.93      0.93       534
   macro avg       0.94      0.92      0.93       534
weighted avg       0.93      0.93      0.92       534
confusion matrix
[[ 81   4   3   0   0]
 [  1  90   1   2   2]
 [  0   0  88   2  22]
 [  1   0   0 104   0]
 [  0   0   1   1 131]]
Saving model
model saved...
load previously saved model
Start prediction from loaded model
***********Classification Report***********
              precision    recall  f1-score   support
   Mythology       0.98      0.92      0.95        88
    Politics       0.96      0.94      0.95        96
     Science       0.95      0.79      0.86       112
      Sports       0.95      0.99      0.97       105
  Technology       0.85      0.98      0.91       133
   micro avg       0.93      0.93      0.93       534
   macro avg       0.94      0.92      0.93       534
weighted avg       0.93      0.93      0.92       534
complete

6. Conclusion
In this article, we discussed multi-class classification (news article classification) using the Python scikit-learn library: how to load data, pre-process it, build and evaluate a Naive Bayes model, and plot the confusion matrix using matplotlib, with a complete example.
7. References
Refer to the following for more details:
- scikit-learn
- NaiveBayes
- Data Frame
- Pickle
- Numpy
- Machine Learning General Steps
- NaiveBayes Text Classification
- Model Evaluation Using Confusion Matrix Example
8. Source Code
Multinomial Naive Bayes MultiClass Example
You can also download the source code of the Multinomial Naive Bayes multi-class example and other useful examples from our git repository.
