

Table of Contents
1. Overview
Let’s first discuss what the Naive Bayes algorithm is. “Naive Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.”
In computer science and statistics, Naive Bayes is also called Simple Bayes or Independence Bayes.
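In plain terms, for a document represented by features (for example, word occurrences) x1, …, xn and a candidate class c, Bayes’ theorem combined with the naive independence assumption gives:

P(c | x1, …, xn) ∝ P(c) · P(x1 | c) · P(x2 | c) · … · P(xn | c)

The classifier predicts the class c that maximizes this product, with the per-class term probabilities estimated from the training data.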
In this article, we will implement news article classification using the Multinomial Naive Bayes algorithm.
2. Development Environment
Python : 3.6.5
IDE : PyCharm Community Edition
scikit-learn : 0.20.0
numpy : 1.15.3
matplotlib : 3.0.0
3. Steps of Text Classification
3.1 Load/Prepare Datasets
In this article, we use the 20 Newsgroups dataset, with the original twenty categories grouped into five major categories (Mythology, Science, Sports, Technology, and Politics) for easier understanding.
Here we use sklearn.datasets and a pandas DataFrame to load the file contents and categories for further processing. Here is the sample code.
"""Load text files with categories as subfolder names. Individual samples are assumed to be files stored a two levels folder structure such as the following: NewsArticles/ Mythology/ Mythology_file_1.txt Mythology_file_2.txt ... Mythology_file_100.txt Politics/ Politics_file_101.txt ... Politics_file_200.txt Science/ Science_file_201.txt ... Science_file_300.txt Sports/ Sports_file_301.txt ... Sports_file_400.txt Technology/ Technology_file_401.txt ... Technology_file_500.txt ...""" # Source file directory path_train = "G:\\DataSet\\DataSet\\NewsArticles" files_train = skds.load_files(path_train, load_content=False) label_index = files_train.target label_names = files_train.target_names labelled_files = files_train.filenames data_tags = ["filename", "category", "content"] data_list = [] # Read and add data from file to a list i = 0 for f in labelled_files: data_list.append((f, label_names[label_index[i]], Path(f).read_text())) i += 1 # We have training data available as dictionary filename, category, data data = pd.DataFrame.from_records(data_list, columns=data_tags) print("Data loaded : "+str(len(data)))
3.2 Split datasets into Train/Test
The next step is to split the dataset into two parts: a training set and a testing set, which is used to check the model's performance. Here we split the data in an 80:20 ratio: 80% becomes the training set and the remaining 20% is the testing set. Refer to the sample code below:
# Let's take 80% of the data for training and the remaining 20% for testing.
train_size = int(len(data) * .8)

train_posts = data['content'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]

test_posts = data['content'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]

print("Training Set Size " + str(len(train_posts)))
print("Testing Set Size " + str(len(test_posts)))
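As an aside, the same split can also be produced with scikit-learn's built-in train_test_split helper. A minimal sketch, equivalent in effect to the manual slicing above (the random_state value is an arbitrary choice for reproducibility):

from sklearn.model_selection import train_test_split

# 80:20 split of posts and their category labels in one call
train_posts, test_posts, train_tags, test_tags = train_test_split(
    data['content'], data['category'], test_size=0.2, random_state=42)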
3.3 Preprocess Text
The preprocessing step includes removal of common words (also called stop words) and punctuation marks, stemming, lemmatization, etc.
After that, a weight is assigned to each term; the weight can be a raw term count or a TF-IDF score, which helps identify the importance of a term in the corpus. CountVectorizer and TfidfVectorizer calculate term counts and TF-IDF scores respectively. Here we use the TfidfVectorizer weighting method. Refer to the code below for a better understanding.
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the TF-IDF vocabulary on the training posts, then transform the test
# posts with the same fitted vectorizer. A CountVectorizer with the same
# arguments could be used instead if raw term counts are preferred.
vectorizer = TfidfVectorizer(stop_words='english', max_features=500)
X_train = vectorizer.fit_transform(train_posts)
X_test = vectorizer.transform(test_posts)
print("Training documents : " + str(X_train.shape[0]))
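Note that fit_transform is called only on the training posts, while the test posts go through transform using the same fitted vectorizer; fitting the vocabulary on the test data would leak information about the test set into the model.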
3.4 Feature Selection
The next step is to select the top K features from our training corpus. There are many methods available for feature selection, such as chi-square, ANOVA, and LDA; refer to the feature selection reference for more details, and see the sketch below.
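As an illustration (this step is optional and not wired into the pipeline above), chi-square feature selection with scikit-learn's SelectKBest might look like the following; the value k=300 is an arbitrary choice for the sketch:

from sklearn.feature_selection import SelectKBest, chi2

# Keep the 300 terms whose occurrence is most dependent on the class label
selector = SelectKBest(chi2, k=300)
X_train_selected = selector.fit_transform(X_train, train_tags)
X_test_selected = selector.transform(X_test)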
3.5 Build Model
Now it’s time to train our multi-class news article classification model, as below.
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X_train, train_tags)
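MultinomialNB applies additive (Laplace/Lidstone) smoothing controlled by its alpha parameter, which defaults to 1.0; smoothing keeps the class-conditional probability of a term that never co-occurs with a class in training from collapsing to zero. A smaller value such as MultinomialNB(alpha=0.1) can be tried and tuned via cross-validation.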
3.6 Apply on Test Dataset
Once the model is built, the next step is to validate it and apply it to the target (test) dataset.
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import classification_report

# 5-fold cross-validation on the training set
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
print("Cross validation start")
print(cross_val_score(clf, X_train, train_tags, cv=k_fold, n_jobs=1))
print("Cross validation end")

print("start applying on testing data set")
predictions = clf.predict(X_test)
print("Model applied on testing data set")

class_names = ["Mythology", "Politics", "Science", "Sports", "Technology"]
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=class_names))
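cross_val_score returns one score per fold (accuracy by default for a classifier), so the array printed above contains five accuracy values, one per fold of the training data.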
3.7 Evaluate Model
The next step is to check how our model is performing. We can inspect the model's performance using a confusion matrix. In our example, we use matplotlib to plot the confusion matrix.
import itertools
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix


def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix', cmap=plt.cm.Blues):
    """Print and plot the confusion matrix.

    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        # Convert counts to per-class proportions
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.figure()
    print("confusion matrix")
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()


# Compute confusion matrix
cnf_matrix = confusion_matrix(test_tags, predictions)

# Plot confusion matrix
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
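Calling plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True) divides each row by its class total, so the cells show per-class proportions rather than raw counts, which makes classes with different support sizes easier to compare.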
3.8 Save Model
The pickle module is used to serialize a trained machine learning model to disk so it can be loaded and reused later.
import pickle

# Save the model to disk
modelFileName = 'finalized_model.sav'
print("Saving model")
with open(modelFileName, 'wb') as model_file:
    pickle.dump(clf, model_file)
print("model saved...")

print("load previously saved model")
# Load the previously saved model from disk
with open(modelFileName, 'rb') as model_file:
    loaded_model = pickle.load(model_file)
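For scikit-learn estimators, joblib is a commonly recommended alternative to pickle because it handles the large NumPy arrays inside fitted models more efficiently. A minimal sketch (with scikit-learn 0.20 the same functions are also available via `from sklearn.externals import joblib`):

from joblib import dump, load

# Persist and restore the classifier with joblib instead of pickle
dump(clf, 'finalized_model.joblib')
loaded_model = load('finalized_model.joblib')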
4. Example
import itertools
import pickle
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.datasets as skds
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

"""Load text files with categories as subfolder names.

Individual samples are assumed to be files stored in a two-level folder
structure such as the following:

    NewsArticles/
        Mythology/
            Mythology_file_1.txt
            ...
            Mythology_file_100.txt
        Politics/
            Politics_file_101.txt
            ...
            Politics_file_200.txt
        Science/
            Science_file_201.txt
            ...
            Science_file_300.txt
        Sports/
            Sports_file_301.txt
            ...
            Sports_file_400.txt
        Technology/
            Technology_file_401.txt
            ...
            Technology_file_500.txt
"""

# Source file directory
path_train = "G:\\DataSet\\DataSet\\NewsArticles"

files_train = skds.load_files(path_train, load_content=False)

label_index = files_train.target
label_names = files_train.target_names
labelled_files = files_train.filenames

data_tags = ["filename", "category", "content"]
data_list = []

# Read each file and collect (filename, category, content) tuples
for i, f in enumerate(labelled_files):
    data_list.append((f, label_names[label_index[i]], Path(f).read_text()))

# Training data is now available as a DataFrame: filename, category, content
data = pd.DataFrame.from_records(data_list, columns=data_tags)
print("Data loaded : " + str(len(data)))

# Let's take 80% of the data for training and the remaining 20% for testing.
train_size = int(len(data) * .8)

train_posts = data['content'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]

test_posts = data['content'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]

print("Training Set Size " + str(len(train_posts)))
print("Testing Set Size " + str(len(test_posts)))

# Fit the TF-IDF vocabulary on the training posts, then transform the test
# posts with the same fitted vectorizer. A CountVectorizer with the same
# arguments could be used instead if raw term counts are preferred.
vectorizer = TfidfVectorizer(stop_words='english', max_features=500)
X_train = vectorizer.fit_transform(train_posts)
X_test = vectorizer.transform(test_posts)
print("Training documents : " + str(X_train.shape[0]))

# Train the Multinomial Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, train_tags)

# 5-fold cross-validation on the training set
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
print("Cross validation start")
print(cross_val_score(clf, X_train, train_tags, cv=k_fold, n_jobs=1))
print("Cross validation end")

print("start applying on testing data set")
predictions = clf.predict(X_test)
print("Model applied on testing data set")

class_names = ["Mythology", "Politics", "Science", "Sports", "Technology"]
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=class_names))


def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix', cmap=plt.cm.Blues):
    """Print and plot the confusion matrix.

    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        # Convert counts to per-class proportions
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.figure()
    print("confusion matrix")
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()


# Compute confusion matrix
cnf_matrix = confusion_matrix(test_tags, predictions)

# Plot confusion matrix
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')

# Save the model to disk
modelFileName = 'finalized_model.sav'
print("Saving model")
with open(modelFileName, 'wb') as model_file:
    pickle.dump(clf, model_file)
print("model saved...")

print("load previously saved model")
# Load the previously saved model from disk
with open(modelFileName, 'rb') as model_file:
    loaded_model = pickle.load(model_file)

print("Start prediction from loaded model")
predictions = loaded_model.predict(X_test)  # use the loaded model, not clf
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=class_names))
print("complete")
5. Output
Data loaded : 2666
Training Set Size 2132
Testing Set Size 534
Training documents : 2132
Cross validation start
[0.92037471 0.92740047 0.90140845 0.93896714 0.92488263]
Cross validation end
start applying on testing data set
Model applied on testing data set
***********Classification Report***********
              precision    recall  f1-score   support

   Mythology       0.98      0.92      0.95        88
    Politics       0.96      0.94      0.95        96
     Science       0.95      0.79      0.86       112
      Sports       0.95      0.99      0.97       105
  Technology       0.85      0.98      0.91       133

   micro avg       0.93      0.93      0.93       534
   macro avg       0.94      0.92      0.93       534
weighted avg       0.93      0.93      0.92       534

confusion matrix
[[ 81   4   3   0   0]
 [  1  90   1   2   2]
 [  0   0  88   2  22]
 [  1   0   0 104   0]
 [  0   0   1   1 131]]
Saving model
model saved...
load previously saved model
Start prediction from loaded model
***********Classification Report***********
              precision    recall  f1-score   support

   Mythology       0.98      0.92      0.95        88
    Politics       0.96      0.94      0.95        96
     Science       0.95      0.79      0.86       112
      Sports       0.95      0.99      0.97       105
  Technology       0.85      0.98      0.91       133

   micro avg       0.93      0.93      0.93       534
   macro avg       0.94      0.92      0.93       534
weighted avg       0.93      0.93      0.92       534

complete
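Reading the report alongside the confusion matrix: overall accuracy is about 93%, and the weakest class is Science (recall 0.79), mostly because 22 Science articles were misclassified as Technology; those errors also pull Technology's precision down to 0.85. The identical second report confirms the reloaded model behaves exactly like the original.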
6. Conclusion
In this article, we discussed multi-class classification (news article classification) using the Python scikit-learn library: how to load data, preprocess it, build and evaluate a Naive Bayes model with a confusion matrix, and plot the confusion matrix using matplotlib, all with a complete example.
7. References
Refer to the links below for more details:
- scikit-learn
- NaiveBayes
- Data Frame
- Pickle
- Numpy
- Machine Learning General Steps
- NaiveBayes Text Classification
- Model Evaluation Using Confusion Matrix Example
8. Source Code
Multinomial Naive Bayes MultiClass Example
You can also download the source code of the Multinomial Naive Bayes multi-class example and other useful examples from our Git repository.