1. Overview

Let’s first discuss what the Naive Bayes algorithm is. “Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.”

In computer science and statistics, Naive Bayes is also known as Simple Bayes or Independence Bayes.
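
Under the hood, the classifier is just Bayes’ theorem applied to a document’s terms. In standard notation (this is the general theorem, not anything specific to our dataset):

P(class | document) = P(document | class) * P(class) / P(document)

The “naive” part is the assumption that terms occur independently given the class, so P(document | class) becomes the product of the individual per-term probabilities. That assumption is rarely true for real text, but it makes training and prediction fast and works surprisingly well for spam filtering.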

In this article, we will implement an email spam detection system that identifies whether an email document is spam or ham. We will use the Multinomial Naive Bayes implementation from scikit-learn to classify email documents.

2. Development Environment

Python: 3.6.5

IDE: PyCharm Community Edition

scikit-learn: 0.20.0

3. Steps of Text Classification

3.1 Load/Prepare Datasets

Here we have used the Enron dataset to develop the email spam detection system. The first step is to load each email’s content and its category.

Here we have used sklearn.datasets and a pandas DataFrame to load the file contents and categories for further processing. Here is the sample code.

"""Load text files with categories as subfolder names.

    Individual samples are assumed to be files stored a two levels folder
    structure such as the following:

        EmailSpam/
            Spam/
                file_1.txt
                file_2.txt
                ...
                file_42.txt
            ham/
                file_43.txt
                file_44.txt
                ..."""
# Source file directory
path_train = "E:\DataSet\EmailSpam"
files_train = skds.load_files(path_train, load_content=False)
label_index = files_train.target
label_names = files_train.target_names
labelled_files = files_train.filenames

data_tags = ["filename", "category", "content"]
data_list = []

# Read and add data from file to a list
i = 0
for f in labelled_files:
    data_list.append((f, label_names[label_index[i]], Path(f).read_text()))
    i += 1

# We have training data available as dictionary filename, category, data
data = pd.DataFrame.from_records(data_list, columns=data_tags)
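
One practical note: Path.read_text() uses the platform default encoding, and raw email corpora such as Enron often contain non-UTF-8 bytes. If loading fails with a UnicodeDecodeError, passing an explicit encoding is a common workaround (a sketch, not part of the original listing):

# Tolerate non-UTF-8 bytes often found in raw email files
content = Path(f).read_text(encoding='latin-1')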

3.2 Split Dataset into Train/Test

The next step is to split the dataset into two parts: a training set and a testing set, the latter used to check our model’s performance. Here we split the data in an 80:20 ratio: 80% becomes the training set and the remaining 20% is our testing set.

Refer to the sample code below.

# 80% data as training and remaining 20% for test.
train_size = int(len(data) * .8)

train_posts = data['content'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]

test_posts = data['content'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]
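
As an alternative to manual slicing, scikit-learn’s train_test_split does the same job in one call, and its stratify option keeps the spam/ham proportions similar in both parts. A minimal sketch (the variable names mirror those above):

from sklearn.model_selection import train_test_split

# Stratified 80:20 split preserving the spam/ham ratio in both sets
train_posts, test_posts, train_tags, test_tags = train_test_split(
    data['content'], data['category'],
    test_size=0.2, stratify=data['category'], random_state=42)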

3.3 Preprocess Text

Preprocessing includes removal of common terms (also called stop words) and punctuation marks, stemming, lemmatization, and so on.

After that, a weight is assigned to each term; the weight can be a raw term count or a TF-IDF score, which helps identify the importance of a term in the corpus. CountVectorizer and TfidfVectorizer calculate term counts and TF-IDF scores respectively.

from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary from the training set and convert both sets to term-count vectors
vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train_posts)
X_test = vectorizer.transform(test_posts)
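
If you prefer TF-IDF weights over raw counts, swapping the vectorizer is a one-line change and the rest of the pipeline stays the same:

from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF downweights terms that appear in many documents across the corpus
vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train_posts)
X_test = vectorizer.transform(test_posts)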

3.4 Feature Selection

The next step is to select the top K features from our training corpus. There are many methods available for feature selection, such as chi-square, ANOVA, and LDA, as sketched below; refer to scikit-learn’s feature selection documentation for more details.
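
Here is a minimal sketch using SelectKBest with the chi-square score function; k=1000 is an arbitrary value chosen purely for illustration:

from sklearn.feature_selection import SelectKBest, chi2

# Keep the 1000 terms most strongly associated with the class labels
selector = SelectKBest(chi2, k=1000)
X_train_selected = selector.fit_transform(X_train, train_tags)
X_test_selected = selector.transform(X_test)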

3.5 Build Model

Now it’s time to train our email spam detection model, as shown below.

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, train_tags)
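
MultinomialNB applies additive (Laplace/Lidstone) smoothing controlled by its alpha parameter, which defaults to 1.0; this prevents zero probabilities for terms that never co-occur with a class in the training data. The call above is therefore equivalent to:

# alpha=1.0 is Laplace smoothing (the scikit-learn default)
clf = MultinomialNB(alpha=1.0)
clf.fit(X_train, train_tags)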

3.6 Apply on Test Dataset

from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import confusion_matrix, classification_report

# 5-fold cross-validation on the training set
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(clf, X_train, train_tags, cv=k_fold, n_jobs=1))

# Evaluate on the held-out test set
predictions = clf.predict(X_test)
print(confusion_matrix(test_tags, predictions, labels=["ham", "spam"]))
print(classification_report(test_tags, predictions, labels=["ham", "spam"]))

3.7 Save Model

We save our email spam detection model to the file system using pickle for later use.

import pickle

# Save the model to disk
modelfilename = 'finalized_model.sav'
print("Saving model")
pickle.dump(clf, open(modelfilename, 'wb'))
print("model saved...")

print("load previously saved model")
# Load the previously saved model from disk
loaded_model = pickle.load(open(modelfilename, 'rb'))
print("Start prediction")
predictions = loaded_model.predict(X_test)
print(classification_report(test_tags, predictions, labels=["ham", "spam"]))
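
To classify a brand-new email with the saved model, transform its text with the same fitted vectorizer before calling predict. A sketch with a made-up email (if you applied feature selection, run the same selector transform as well):

# Hypothetical unseen email; the vectorizer must be the one fitted on the training data
new_email = ["Congratulations! You have won a free prize, click here to claim it."]
X_new = vectorizer.transform(new_email)
print(loaded_model.predict(X_new))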

4. Example

Here is the complete example of email spam text classification.

import pandas as pd
from pathlib import Path
import sklearn.datasets as skds
import pickle
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score

"""Load text files with categories as subfolder names.

    Individual samples are assumed to be files stored in a two-level folder
    structure such as the following:

        EmailSpam/
            Spam/
                file_1.txt
                file_2.txt
                ...
                file_42.txt
            ham/
                file_43.txt
                file_44.txt
                ..."""
# Source file directory (raw string avoids backslash escape issues on Windows)
path_train = r"E:\DataSet\EmailSpam"
files_train = skds.load_files(path_train, load_content=False)
label_index = files_train.target
label_names = files_train.target_names
labelled_files = files_train.filenames

data_tags = ["filename", "category", "content"]
data_list = []

# Read each file and collect (filename, category, content) tuples
for i, f in enumerate(labelled_files):
    data_list.append((f, label_names[label_index[i]], Path(f).read_text()))

# Training data is now available as a DataFrame with filename, category and content columns
data = pd.DataFrame.from_records(data_list, columns=data_tags)
print("Data loaded : "+str(len(data)))
# Take 80% of the data for training and the remaining 20% for test.
train_size = int(len(data) * .8)

train_posts = data['content'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]

test_posts = data['content'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]
print("Training Set Size "+str(len(train_posts)))
print("Testing Set Size "+str(len(test_posts)))
# References:
# https://www.kaggle.com/adamschroeder/countvectorizer-tfidfvectorizer-predict-comments
# https://www.opencodez.com/python/text-classification-using-keras.htm

# Alternative: vectorizer = TfidfVectorizer(stop_words='english')
vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train_posts)
X_test = vectorizer.transform(test_posts)
print("Sample term-count vector for document 1")
print(X_train[1])
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, train_tags)

k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
print("Cross validation start")
print(cross_val_score(clf, X_train, train_tags, cv=k_fold, n_jobs=1))
print("Cross validation end")
print("start applying on testing data set")
predictions = clf.predict(X_test)
print("Model applied on testing data set")

print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=["ham", "spam"]))

# save the model to disk
modelfilename = 'finalized_model.sav'
print("Saving model")
pickle.dump(clf, open(modelfilename, 'wb'))
print("model saved...")

print("load previously saved model")
# load previously saved model from disk
loaded_model = pickle.load(open(modelfilename, 'rb'))
print("Start prediction from loaded model")
predictions = loaded_model.predict(X_test)
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=["ham", "spam"]))
print("complete")

5. Output

Data loaded : 1478
Training Set Size 1034
Testing Set Size 444

Cross validation start
[1.         1.         0.99516908 0.99516908 0.99514563]
Cross validation end
start applying on testing data set
Model applied on testing data set
***********Classification Report***********
              precision    recall  f1-score   support

         ham       0.98      0.50      0.66       102
        spam       0.87      1.00      0.93       342

   micro avg       0.88      0.88      0.88       444
   macro avg       0.93      0.75      0.80       444
weighted avg       0.90      0.88      0.87       444

Saving model
model saved...
load previously saved model
Start prediction from loaded model
***********Classification Report***********
              precision    recall  f1-score   support

         ham       0.98      0.50      0.66       102
        spam       0.87      1.00      0.93       342

   micro avg       0.88      0.88      0.88       444
   macro avg       0.93      0.75      0.80       444
weighted avg       0.90      0.88      0.87       444

complete

6. Conclusion

In this article, we discussed binary classification (email spam detection) using the Python scikit-learn library, covering how to load data, preprocess it, and build and evaluate a Naive Bayes model. Note from the classification report that recall on the ham class is only 0.50, meaning roughly half of the legitimate emails were flagged as spam; tuning the vectorizer, applying feature selection, or rebalancing the training data would be natural next steps.


7. Source Code

Multinomial Naive Bayes Text Classification
