1. Overview

Let’s first discuss what the Naive Bayes algorithm is. “Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.”
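As a quick illustration of the underlying idea, Bayes’ theorem turns a likelihood and a prior into a posterior probability. The numbers below are purely made up for illustration:

```python
# Bayes' theorem: P(class | word) = P(word | class) * P(class) / P(word)
# Toy scenario: how likely is a document to be Sports given it contains "goal"?
p_sport = 0.4               # prior P(Sport)
p_other = 0.6               # prior P(not Sport)
p_goal_given_sport = 0.30   # likelihood P("goal" | Sport)
p_goal_given_other = 0.05   # likelihood P("goal" | not Sport)

# Evidence P("goal") via the law of total probability
p_goal = p_goal_given_sport * p_sport + p_goal_given_other * p_other

# Posterior P(Sport | "goal")
posterior = p_goal_given_sport * p_sport / p_goal
print(round(posterior, 3))  # 0.8
```

Seeing the word raises the probability of the Sports class from 0.4 to 0.8; Naive Bayes applies this update for every word in a document, assuming the words are independent given the class.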

In computer science and statistics, Naive Bayes is also called Simple Bayes or Independence Bayes.

In this article, we will implement news article classification using the Multinomial Naive Bayes algorithm.

2. Development Environment

Python : 3.6.5

IDE : PyCharm Community Edition

scikit-learn : 0.20.0

numpy : 1.15.3

matplotlib : 3.0.0

3. Steps of Text Classification

3.1 Load/Prepare Data sets

In this article, we have used the 20 Newsgroups dataset and grouped the original twenty categories into five major categories, namely Mythology, Science, Sports, Technology, and Politics, for easier understanding.

Here we have used sklearn.datasets and a pandas DataFrame to load the file contents and categories for further processing. Here is the sample code.

"""Load text files with categories as subfolder names.

    Individual samples are assumed to be files stored in a two-level folder
    structure such as the following:

        NewsArticles/
            Mythology/
                Mythology_file_1.txt
                Mythology_file_2.txt
                ...
                Mythology_file_100.txt
            Politics/
                Politics_file_101.txt
                ...
                Politics_file_200.txt
            Science/
                Science_file_201.txt
                ...
                Science_file_300.txt
            Sports/
                Sports_file_301.txt
                ...
                Sports_file_400.txt
            Technology/
                Technology_file_401.txt
                ...
                Technology_file_500.txt
                ..."""
import pandas as pd
from pathlib import Path
import sklearn.datasets as skds

# Source file directory
path_train = "G:\\DataSet\\DataSet\\NewsArticles"
files_train = skds.load_files(path_train, load_content=False)
label_index = files_train.target
label_names = files_train.target_names
labelled_files = files_train.filenames

data_tags = ["filename", "category", "content"]
data_list = []


# Read each file and pair its content with its category label
for i, f in enumerate(labelled_files):
    data_list.append((f, label_names[label_index[i]], Path(f).read_text()))

# Training data available as records of (filename, category, content)
data = pd.DataFrame.from_records(data_list, columns=data_tags)
print("Data loaded : "+str(len(data)))

3.2 Split datasets into Train/Test

The next step is to split the data into two parts: a training set to fit the model and a test set to check its performance. Here we split the data in an 80:20 ratio, so 80% is the training set and the remaining 20% is the test set.

Refer below sample code:

# Let's take 80% of the data for training and the remaining 20% for testing.
train_size = int(len(data) * .8)

train_posts = data['content'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]

test_posts = data['content'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]
print("Training Set Size "+str(len(train_posts)))
print("Testing Set Size "+str(len(test_posts)))
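The head-slice split above works here because `load_files` shuffles the files by default; scikit-learn’s `train_test_split` makes the shuffling (and class stratification) explicit. A minimal sketch on invented toy data:

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the article's posts/tags (illustrative only)
posts = [f"match report {i}" for i in range(50)] + [f"election news {i}" for i in range(50)]
tags = ["Sports"] * 50 + ["Politics"] * 50

# 80/20 split; stratify keeps the class ratio identical in both halves
train_posts, test_posts, train_tags, test_tags = train_test_split(
    posts, tags, test_size=0.2, random_state=42, stratify=tags)

print(len(train_posts), len(test_posts))  # 80 20
print(test_tags.count("Sports"))          # 10
```

Stratification matters when some categories are much smaller than others, because a plain random split can otherwise leave a class underrepresented in the test set.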

3.3 PreProcess text

The preprocessing step includes removal of common words (also called stop words) and punctuation marks, stemming, lemmatization, etc.
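As a tiny illustration of this kind of cleaning, here is a pure-Python sketch (the stop-word list is a made-up miniature; real pipelines use a full list or a library):

```python
import string

stop_words = {"the", "is", "a", "an", "of"}  # tiny illustrative stop list

def preprocess(text):
    # Lowercase, strip punctuation, then drop stop words
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    return [w for w in text.split() if w not in stop_words]

print(preprocess("The match, as expected, was a thriller!"))
# ['match', 'as', 'expected', 'was', 'thriller']
```

In practice the vectorizers below perform tokenization and stop-word removal for us via their `stop_words` parameter.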

After that, a weight is assigned to each term; the weight can be a raw term count or TF-IDF, which helps identify the importance of a term in the corpus. CountVectorizer and TfidfVectorizer compute term counts and TF-IDF weights respectively. Here we have used the TfidfVectorizer weighting method. Refer to the code below for better understanding.

vectorizer = TfidfVectorizer(stop_words='english', max_features=500)
X_train = vectorizer.fit_transform(train_posts)
# The test set must be transformed with the SAME fitted vectorizer
X_test = vectorizer.transform(test_posts)
print("Vocab Size : "+str(X_train.shape[1]))
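To see what the vectorizer actually produces, here is a tiny self-contained example on an invented three-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the team won the match",
        "parliament passed the budget",
        "the team lost the final match"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# X is a sparse matrix: rows = documents, columns = vocabulary terms
print(X.shape[0])                        # 3 documents
print(sorted(vectorizer.vocabulary_))    # stop words like "the" are gone
```

Each row holds the TF-IDF weight of every vocabulary term for that document; terms appearing in many documents receive lower weights than rare, discriminative terms.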

3.4 Features Selection

The next step is to select the top K features from our training corpus. Many methods are available for feature selection, such as chi-square, ANOVA, and LDA. Refer to the feature-selection documentation for more details.
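As one concrete option, scikit-learn’s `SelectKBest` with the chi-square score keeps the K terms most associated with the class labels. A minimal sketch on invented toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus and labels (illustrative only)
docs = ["goal scored in the match", "match ended in a draw",
        "minister spoke in parliament", "parliament passed a law"]
labels = ["Sports", "Sports", "Politics", "Politics"]

X = CountVectorizer(stop_words='english').fit_transform(docs)

# Keep the 4 terms with the highest chi-square score against the labels
selector = SelectKBest(chi2, k=4)
X_selected = selector.fit_transform(X, labels)

print(X_selected.shape)  # (4, 4): 4 documents, 4 surviving features
```

Chi-square works well with non-negative count features, which is exactly what text vectorizers produce.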

3.5 Build Model

Now it’s time to train our multi-class news article classification model as below.

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, train_tags)
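Putting vectorizer and classifier together on an invented toy corpus shows the whole fit/predict cycle end to end:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (illustrative only)
train_docs = ["goal scored late in the match", "the team won the cup",
              "parliament debated the new law", "the minister resigned today"]
train_labels = ["Sports", "Sports", "Politics", "Politics"]

vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(train_docs)

clf = MultinomialNB()
clf.fit(X, train_labels)

# New, unseen text must go through the SAME fitted vectorizer
new_doc = vec.transform(["the team scored a goal"])
print(clf.predict(new_doc))  # ['Sports']
```

Note that `transform` (not `fit_transform`) is used for new text, so the vocabulary learned at training time is reused unchanged.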

3.6 Apply on test data sets

Once the model is built, the next step is to validate it and apply it to the test data set.

k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
print("Cross validation start")
print(cross_val_score(clf, X_train, train_tags, cv=k_fold, n_jobs=1))
print("Cross validation end")
print("start applying on testing data set")
predictions = clf.predict(X_test)
print("Model applied on testing data set")
class_names = ["Mythology", "Politics", "Science", "Sports", "Technology"]
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=class_names))

3.7 Evaluate Model

The next step is to check how the model performs. We can inspect performance with a confusion matrix; in our example, we use matplotlib to plot it.

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.figure()
    print("confusion matrix")
    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()


# Compute confusion matrix
cnf_matrix = confusion_matrix(test_tags, predictions)
# Plot Confusion Matrix

plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix')

3.8 Save Model

Pickle can be used to save a trained machine learning model and load it again for future use.

# Save the model to disk
modelFileName = 'finalized_model.sav'
print("Saving model")
with open(modelFileName, 'wb') as f:
    pickle.dump(clf, f)
print("model saved...")

print("load previously saved model")
# Load the previously saved model from disk
with open(modelFileName, 'rb') as f:
    loaded_model = pickle.load(f)
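For scikit-learn estimators, the joblib library (installed alongside scikit-learn) is a common alternative to raw pickle, since it handles the large NumPy arrays inside fitted models more efficiently. A minimal sketch, using a tiny toy model just to have something to persist:

```python
import joblib
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Train a tiny model on toy count data (illustrative only)
clf = MultinomialNB()
X = np.array([[2, 0], [0, 3]])
y = ["Sports", "Politics"]
clf.fit(X, y)

# Dump/load round-trip
joblib.dump(clf, "model.joblib")
restored = joblib.load("model.joblib")
print(restored.predict([[4, 0]]))  # ['Sports']
```

Whichever format is used, remember to persist the fitted vectorizer alongside the classifier, since new text must be transformed with the same vocabulary at prediction time.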

4. Example

import pandas as pd
from pathlib import Path
import sklearn.datasets as skds
import pickle
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
import itertools
import numpy as np
import matplotlib.pyplot as plt

"""Load text files with categories as subfolder names.

    Individual samples are assumed to be files stored in a two-level folder
    structure such as the following:

        NewsArticles/
            Mythology/
                Mythology_file_1.txt
                Mythology_file_2.txt
                ...
                Mythology_file_100.txt
            Politics/
                Politics_file_101.txt
                ...
                Politics_file_200.txt
            Science/
                Science_file_201.txt
                ...
                Science_file_300.txt
            Sports/
                Sports_file_301.txt
                ...
                Sports_file_400.txt
            Technology/
                Technology_file_401.txt
                ...
                Technology_file_500.txt
                ..."""
# Source file directory
path_train = "G:\\DataSet\\DataSet\\NewsArticles"
files_train = skds.load_files(path_train, load_content=False)
label_index = files_train.target
label_names = files_train.target_names
labelled_files = files_train.filenames

data_tags = ["filename", "category", "content"]
data_list = []


# Read each file and pair its content with its category label
for i, f in enumerate(labelled_files):
    data_list.append((f, label_names[label_index[i]], Path(f).read_text()))

# Training data available as records of (filename, category, content)
data = pd.DataFrame.from_records(data_list, columns=data_tags)
print("Data loaded : "+str(len(data)))
# Let's take 80% of the data for training and the remaining 20% for testing.
train_size = int(len(data) * .8)

train_posts = data['content'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]

test_posts = data['content'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]
print("Training Set Size "+str(len(train_posts)))
print("Testing Set Size "+str(len(test_posts)))
vectorizer = TfidfVectorizer(stop_words='english', max_features=500)
X_train = vectorizer.fit_transform(train_posts)
# The test set must be transformed with the SAME fitted vectorizer
X_test = vectorizer.transform(test_posts)
print("Vocab Size : "+str(X_train.shape[1]))

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, train_tags)

k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
print("Cross validation start")
print(cross_val_score(clf, X_train, train_tags, cv=k_fold, n_jobs=1))
print("Cross validation end")
print("start applying on testing data set")
predictions = clf.predict(X_test)
print("Model applied on testing data set")
class_names = ["Mythology", "Politics", "Science", "Sports", "Technology"]
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=class_names))


def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.figure()
    print("confusion matrix")
    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()


# Compute confusion matrix
cnf_matrix = confusion_matrix(test_tags, predictions)
# Plot confusion matrix
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix')

# Save the model to disk
modelFileName = 'finalized_model.sav'
print("Saving model")
with open(modelFileName, 'wb') as f:
    pickle.dump(clf, f)
print("model saved...")

print("load previously saved model")
# Load the previously saved model from disk
with open(modelFileName, 'rb') as f:
    loaded_model = pickle.load(f)

print("Start prediction from loaded model")
predictions = loaded_model.predict(X_test)
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=class_names))
print("complete")

5. Output

Data loaded : 2666
Training Set Size 2132
Testing Set Size 534
Vocab Size : 500
Cross validation start
[0.92037471 0.92740047 0.90140845 0.93896714 0.92488263]
Cross validation end
start applying on testing data set
Model applied on testing data set
***********Classification Report***********
              precision    recall  f1-score   support

   Mythology       0.98      0.92      0.95        88
    Politics       0.96      0.94      0.95        96
     Science       0.95      0.79      0.86       112
      Sports       0.95      0.99      0.97       105
  Technology       0.85      0.98      0.91       133

   micro avg       0.93      0.93      0.93       534
   macro avg       0.94      0.92      0.93       534
weighted avg       0.93      0.93      0.92       534

confusion matrix
[[ 81   4   3   0   0]
 [  1  90   1   2   2]
 [  0   0  88   2  22]
 [  1   0   0 104   0]
 [  0   0   1   1 131]]
Saving model
model saved...
load previously saved model
Start prediction from loaded model
***********Classification Report***********
              precision    recall  f1-score   support

   Mythology       0.98      0.92      0.95        88
    Politics       0.96      0.94      0.95        96
     Science       0.95      0.79      0.86       112
      Sports       0.95      0.99      0.97       105
  Technology       0.85      0.98      0.91       133

   micro avg       0.93      0.93      0.93       534
   macro avg       0.94      0.92      0.93       534
weighted avg       0.93      0.93      0.92       534

complete

Naive Bayes Multi Class Text Classification Output

6. Conclusion

In this article, we have discussed multi-class text classification (news article classification) using the Python scikit-learn library: how to load data, pre-process it, build and evaluate a Naive Bayes model with a confusion matrix, and plot that confusion matrix using matplotlib, along with a complete example.

7. Source Code

Multinomial Naive Bayes Multi-Class Example

 
