Naive Bayes Text Classification Example

Table of Contents

1. Overview
2. Development Environment
3. Steps of Text Classification
4. Example
5. Output
6. Conclusion
7. References
8. Source Code

1. Overview

Let’s first discuss what the Naive Bayes algorithm is. “Naive Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.”

In computer science and statistics, Naive Bayes is also called Simple Bayes and Independence Bayes.

In this article, we will implement an email spam detection system that identifies whether an email document is spam or ham. We will use the Multinomial Naive Bayes classifier (MultinomialNB) from scikit-learn to classify each email document.
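To make the independence assumption concrete, here is a minimal sketch that scores a message by hand, multiplying one probability per word as if the words were independent given the class. It is not part of the original example: the toy messages, the `score` helper, and the add-one smoothing are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical toy training data: (label, text) pairs.
training = [
    ("spam", "win money now"),
    ("spam", "win prize money"),
    ("ham", "meeting schedule today"),
    ("ham", "project meeting notes"),
]

# Count words per class and documents per class.
word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for label, text in training:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = {word for counts in word_counts.values() for word in counts}


def score(text, label):
    """Log of P(label) times the product of P(word | label), add-one smoothed."""
    total_words = sum(word_counts[label].values())
    log_prob = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in text.split():
        count = word_counts[label][word]
        log_prob += math.log((count + 1) / (total_words + len(vocabulary)))
    return log_prob


message = "win money today"
predicted = max(["spam", "ham"], key=lambda label: score(message, label))
print(predicted)
```

The class with the larger score wins. Working in log space avoids numeric underflow when many small probabilities are multiplied.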

2. Development Environment

Python : 3.6.5

IDE : PyCharm Community Edition

scikit-learn : 0.20.0

3. Steps of Text Classification

3.1 Load/Prepare Data sets

Here we use the Enron dataset to develop the email spam detection system. The first step is to load the content of each email and its category.

We use sklearn.datasets and a pandas DataFrame to load the file contents and categories for further processing. Here is the sample code.

import pandas as pd
from pathlib import Path
import sklearn.datasets as skds

"""Load text files with categories as subfolder names.
    Individual samples are assumed to be files stored in a two-level folder
    structure such as the following:
        EmailSpam/
            spam/
                file_1.txt
                file_2.txt
                ...
                file_42.txt
            ham/
                file_43.txt
                file_44.txt
                ..."""
# Source file directory (raw string, so backslashes are not treated as escapes)
path_train = r"E:\DataSet\EmailSpam"
# load_content=False returns only file names and labels; entries are shuffled by default
files_train = skds.load_files(path_train, load_content=False)
label_index = files_train.target
label_names = files_train.target_names
labelled_files = files_train.filenames
data_tags = ["filename", "category", "content"]
data_list = []
# Read each file and store (filename, category, content) in a list
for f, label in zip(labelled_files, label_index):
    data_list.append((f, label_names[label], Path(f).read_text()))
# Build a DataFrame with the columns filename, category, content
data = pd.DataFrame.from_records(data_list, columns=data_tags)
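As an alternative sketch, load_files can read and decode the file contents itself when load_content=True and an encoding are given. The example below builds a tiny throwaway corpus in a temporary directory so it runs without the Enron files. The folder contents, the `toy_data` name, and the choice of UTF-8 are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd
import sklearn.datasets as skds

# Build a tiny throwaway corpus so the sketch runs without the Enron files.
root = Path(tempfile.mkdtemp())
samples = {
    "spam": ["win money now", "cheap prize offer"],
    "ham": ["meeting at noon", "project status notes"],
}
for category, texts in samples.items():
    (root / category).mkdir()
    for n, text in enumerate(texts):
        (root / category / f"file_{n}.txt").write_text(text, encoding="utf-8")

# load_content=True reads the files; passing an encoding makes .data a list of str.
bunch = skds.load_files(str(root), load_content=True, encoding="utf-8",
                        decode_error="replace", random_state=0)
toy_data = pd.DataFrame({
    "filename": bunch.filenames,
    "category": [bunch.target_names[i] for i in bunch.target],
    "content": bunch.data,
})
print(toy_data["category"].value_counts())
```

Checking the category counts right after loading is a quick way to spot an imbalanced data set before training.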

3.2 Split datasets into Train/Test

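The next step is to split the data set into two parts: a training set used to fit the model and a testing set used to check the model's performance. Here we split the data in an 80:20 ratio: 80% is the training set and the remaining 20% is the testing set. Because load_files shuffles the files by default, slicing off the first 80% of the rows gives a mix of both categories.

Refer to the sample code below.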

# 80% data as training and remaining 20% for test.
train_size = int(len(data) * .8)
train_posts = data['content'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]
test_posts = data['content'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]
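The slicing above relies on the rows already being shuffled. As a hedged alternative, scikit-learn's train_test_split can shuffle and stratify in one call, so both parts keep the same spam/ham proportions. The `toy` frame below is a hypothetical stand-in for the `data` frame, and the variable names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the `data` frame built in step 3.1.
toy = pd.DataFrame({
    "content": [f"message {n}" for n in range(20)],
    "category": ["spam"] * 15 + ["ham"] * 5,
})

# stratify keeps the spam/ham proportions the same in both parts.
posts_train, posts_test, tags_train, tags_test = train_test_split(
    toy["content"], toy["category"],
    test_size=0.2, stratify=toy["category"], random_state=0,
)
print(len(posts_train), len(posts_test))
print(tags_test.value_counts())
```

Fixing random_state makes the split reproducible between runs.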

3.3 PreProcess text

Preprocessing includes removing common terms (also called stop words) and punctuation marks, as well as stemming, lemmatization, etc.

After that, a weight is assigned to each term. The weight can be the term count or TF-IDF, which helps identify the importance of a term in a corpus. CountVectorizer and TfidfVectorizer compute term counts and TF-IDF weights respectively.

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train_posts)
X_test = vectorizer.transform(test_posts)
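The code above uses term counts. For comparison, here is a minimal sketch of the TF-IDF variant mentioned in the text, using TfidfVectorizer on a hypothetical mini corpus that stands in for train_posts and test_posts. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus standing in for train_posts / test_posts.
toy_train = ["win money now", "meeting schedule today", "win a prize today"]
toy_test = ["money prize meeting"]

tfidf = TfidfVectorizer(stop_words="english", lowercase=True)
# fit_transform learns the vocabulary and IDF weights from the training text.
X_toy_train = tfidf.fit_transform(toy_train)
# transform reuses them, so the test data never influences the weights.
X_toy_test = tfidf.transform(toy_test)

print(sorted(tfidf.vocabulary_))
print(X_toy_train.shape, X_toy_test.shape)
```

Both matrices have the same number of columns because the test text is mapped onto the vocabulary learned from the training text.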

3.4 Features Selection

The next step is to select the top K features from the training corpus. Many methods are available for feature selection, such as chi-square, ANOVA, and LDA. Refer to feature selection for more details.
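The complete example in this article does not apply feature selection, so the following is only a sketch of how a chi-square selection step could look with scikit-learn's SelectKBest. The mini corpus, the value k=3, and the variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical mini corpus standing in for train_posts / train_tags.
toy_posts = ["win money now", "win prize money",
             "meeting schedule today", "project meeting notes"]
toy_tags = ["spam", "spam", "ham", "ham"]

counts = CountVectorizer()
X_counts = counts.fit_transform(toy_posts)

# Keep the k terms whose counts depend most strongly on the class label.
selector = SelectKBest(score_func=chi2, k=3)
X_selected = selector.fit_transform(X_counts, toy_tags)

# Map the boolean support mask back to terms, ordered by column index.
terms_by_column = sorted(counts.vocabulary_, key=counts.vocabulary_.get)
kept = [term for term, keep in zip(terms_by_column, selector.get_support()) if keep]
print(kept)
```

The same fitted selector would then be applied to the test matrix with selector.transform, just as the vectorizer is reused for the test text.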

3.5 Build Model

Now it’s time to train our email spam detection model, as shown below.

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, train_tags)
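Once the model is fitted, a new message has to pass through the same fitted vectorizer before it can be classified. The sketch below illustrates this on a hypothetical mini corpus. The `toy_` names and the sample message are assumptions, not part of the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini corpus standing in for train_posts / train_tags.
toy_posts = ["win money now", "win prize money",
             "meeting schedule today", "project meeting notes"]
toy_tags = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_clf = MultinomialNB(alpha=1.0)  # alpha=1.0 is add-one (Laplace) smoothing
toy_clf.fit(toy_vectorizer.fit_transform(toy_posts), toy_tags)

# A new message must be transformed with the already fitted vectorizer.
new_mail = ["win money today"]
X_new = toy_vectorizer.transform(new_mail)
probabilities = toy_clf.predict_proba(X_new)[0]

print(toy_clf.predict(X_new))
print(dict(zip(toy_clf.classes_, probabilities)))
```

predict_proba returns one probability per class in the order given by classes_, which helps when a decision threshold other than the default is needed.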

3.6 Apply on test data sets

from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import confusion_matrix, classification_report
# 5-fold cross-validation on the training set
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(clf, X_train, train_tags, cv=k_fold, n_jobs=1))
# Predict the held-out test set and compare with the true labels
predictions = clf.predict(X_test)
print(confusion_matrix(test_tags, predictions, labels=["ham", "spam"]))
print(classification_report(test_tags, predictions, labels=["ham", "spam"]))
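To show how the confusion matrix is read, here is a small sketch on hypothetical labels for six emails. Rows correspond to the true class and columns to the predicted class, both in the order passed as labels. The label lists are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six emails.
y_true = ["ham", "ham", "ham", "spam", "spam", "spam"]
y_pred = ["ham", "spam", "spam", "spam", "spam", "spam"]

# Rows are true classes, columns are predicted classes, both in `labels` order.
matrix = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
ham_as_ham, ham_as_spam = matrix[0]
spam_as_ham, spam_as_spam = matrix[1]

print(matrix)
print("ham recall:", ham_as_ham / (ham_as_ham + ham_as_spam))
```

For a spam filter, the ham-classified-as-spam cell is usually the most costly one, since it counts legitimate mail that would be hidden from the user.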

3.7 Save Model

Save the email spam detection model to the file system using pickle for later use.

import pickle
# save the model to disk
modelfilename = 'finalized_model.sav'
print("Saving model")
with open(modelfilename, 'wb') as model_file:
    pickle.dump(clf, model_file)
print("model saved...")
print("load previously saved model")
# load previously saved model from disk
with open(modelfilename, 'rb') as model_file:
    loaded_model = pickle.load(model_file)
print("Start predection")
# Predict with the restored model rather than the in-memory clf
predictions = loaded_model.predict(X_test)
print(classification_report(test_tags, predictions, labels=["ham", "spam"]))
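A saved classifier is only usable together with the vectorizer it was trained with, because new text must be mapped to the same columns. One possible approach, sketched here under assumptions, is to wrap both in a scikit-learn Pipeline and pickle that single object. The mini corpus and the names `spam_pipeline` and `restored` are illustrative.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical mini corpus standing in for train_posts / train_tags.
toy_posts = ["win money now", "win prize money",
             "meeting schedule today", "project meeting notes"]
toy_tags = ["spam", "spam", "ham", "ham"]

# The pipeline keeps the fitted vectorizer and classifier together.
spam_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
spam_pipeline.fit(toy_posts, toy_tags)

# Serialize to bytes here; pickle.dump with an open file works the same way.
blob = pickle.dumps(spam_pipeline)
restored = pickle.loads(blob)

# The restored pipeline accepts raw text directly.
print(restored.predict(["win prize now"]))
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.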

4. Example

Here is the complete example of Email spam text classification.

import pandas as pd
from pathlib import Path
import sklearn.datasets as skds
import pickle
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold, cross_val_score
"""Load text files with categories as subfolder names.
    Individual samples are assumed to be files stored in a two-level folder
    structure such as the following:
        EmailSpam/
            spam/
                file_1.txt
                file_2.txt
                ...
                file_42.txt
            ham/
                file_43.txt
                file_44.txt
                ..."""
# Source file directory (raw string, so backslashes are not treated as escapes)
path_train = r"E:\DataSet\EmailSpam"
# load_content=False returns only file names and labels; entries are shuffled by default
files_train = skds.load_files(path_train, load_content=False)
label_index = files_train.target
label_names = files_train.target_names
labelled_files = files_train.filenames
data_tags = ["filename", "category", "content"]
data_list = []
# Read each file and store (filename, category, content) in a list
for f, label in zip(labelled_files, label_index):
    data_list.append((f, label_names[label], Path(f).read_text()))
# Build a DataFrame with the columns filename, category, content
data = pd.DataFrame.from_records(data_list, columns=data_tags)
print("Data loaded : "+str(len(data)))
# lets take 80% data as training and remaining 20% for test.
train_size = int(len(data) * .8)
train_posts = data['content'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]
test_posts = data['content'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]
print("Training Set Size "+str(len(train_posts)))
print("Testing Set Size "+str(len(test_posts)))
#print(train_tags[:5])

# https://www.kaggle.com/adamschroeder/countvectorizer-tfidfvectorizer-predict-comments
# https://www.opencodez.com/python/text-classification-using-keras.htm
#vectorizer = TfidfVectorizer()
vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train_posts)
X_test = vectorizer.transform(test_posts)
# Show the sparse term-count row of the second training document
print("Term counts of one training document")
print(X_train[1])
#https://www.opencodez.com/python/text-classification-using-keras.htm
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, train_tags)
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
print("Cross validation start")
print(cross_val_score(clf, X_train, train_tags, cv=k_fold, n_jobs=1))
print("Cross validation end")
print("start applying on testing data set")
predictions = clf.predict(X_test)
print("Model applied on testing data set")
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=["ham", "spam"]))
# save the model to disk
modelfilename = 'finalized_model.sav'
print("Saving model")
with open(modelfilename, 'wb') as model_file:
    pickle.dump(clf, model_file)
print("model saved...")
print("load previously saved model")
# load previously saved model from disk
with open(modelfilename, 'rb') as model_file:
    loaded_model = pickle.load(model_file)
print("Start predection from loaded model")
# Predict with the restored model rather than the in-memory clf
predictions = loaded_model.predict(X_test)
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=["ham", "spam"]))
print("complete")

5. Output

Data loaded : 1478
Training Set Size 1034
Testing Set Size 444
Cross validation start
[1.         1.         0.99516908 0.99516908 0.99514563]
Cross validation end
start applying on testing data set
Model applied on testing data set
***********Classification Report***********
              precision    recall  f1-score   support
         ham       0.98      0.50      0.66       102
        spam       0.87      1.00      0.93       342
   micro avg       0.88      0.88      0.88       444
   macro avg       0.93      0.75      0.80       444
weighted avg       0.90      0.88      0.87       444
Saving model
model saved...
load previously saved model
Start predection from loaded model
***********Classification Report***********
              precision    recall  f1-score   support
         ham       0.98      0.50      0.66       102
        spam       0.87      1.00      0.93       342
   micro avg       0.88      0.88      0.88       444
   macro avg       0.93      0.75      0.80       444
weighted avg       0.90      0.88      0.87       444
complete
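In the report above, the ham recall of 0.50 means that roughly half of the 102 ham emails in the test set were labeled as spam, even though spam recall is 1.00. The f1-score column can be checked by hand as the harmonic mean of precision and recall. The short sketch below does this with the rounded values from the report. The `f1` helper is illustrative, and small differences from the report are expected because the inputs are already rounded.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the classification report above.
f1_ham = f1(0.98, 0.50)
f1_spam = f1(0.87, 1.00)
# The macro average is the unweighted mean over the two classes.
macro_f1 = (f1_ham + f1_spam) / 2

print(f"{f1_ham:.2f} {f1_spam:.2f} {macro_f1:.2f}")
```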

6. Conclusion

In this article, we discussed binary classification (email spam detection) using the Python scikit-learn library, including how to load data, preprocess it, and build and evaluate a Naive Bayes model.

7. References

Refer to the link below for more details:

8. Source Code

Multinomial Naive Bayes Text Classification

You can download the source code of Multinomial Naive Bayes Text Classification and other useful examples from our git repository.

Java Developer Zone

http://javadeveloperzone.com