

1. Overview
Let’s first discuss what the Naive Bayes algorithm is. “Naive Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.”
In computer science and statistics, Naive Bayes is also called Simple Bayes and Independence Bayes.
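As a brief sketch of what the independence assumption buys us, the rule can be written as follows. The symbols are introduced here only for illustration: c is a class (spam or ham) and x_1, ..., x_n are the features of one email.

```latex
P(c \mid x_1, \dots, x_n)
  = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}
  \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The first equality is Bayes’ theorem; the proportionality step is where the naive assumption is used, replacing the joint likelihood with a product of per-feature likelihoods. The predicted class is the one with the largest product.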
In this article, we will implement an email spam detection system that identifies whether an email document is spam or ham. We will use the Multinomial Naive Bayes classifier from scikit-learn to classify an email document.
2. Development Environment
Python : 3.6.5
IDE : PyCharm Community Edition
scikit-learn : 0.20.0
3. Steps of Text Classification
3.1 Load/Prepare Data sets
Here we use the Enron dataset to develop the email spam detection system. The first step is to load the content of each email file along with its category.
We use sklearn.datasets and a pandas DataFrame to load the file contents and categories for further processing. Here is the sample code.
import pandas as pd
from pathlib import Path
import sklearn.datasets as skds

"""Load text files with categories as subfolder names.
Individual samples are assumed to be files stored in a two-level
folder structure such as the following:
    EmailSpam/
        spam/
            file_1.txt
            file_2.txt
            ...
            file_42.txt
        ham/
            file_43.txt
            file_44.txt
            ...
"""
# Source file directory (raw string so the backslashes are kept literally)
path_train = r"E:\DataSet\EmailSpam"
files_train = skds.load_files(path_train, load_content=False)
label_index = files_train.target
label_names = files_train.target_names
labelled_files = files_train.filenames

data_tags = ["filename", "category", "content"]
data_list = []
# Read each file and record its name, category and content
for f, label in zip(labelled_files, label_index):
    data_list.append((f, label_names[label], Path(f).read_text()))

# The data is now available as a DataFrame: filename, category, content
data = pd.DataFrame.from_records(data_list, columns=data_tags)
3.2 Split datasets into Train/Test
The next step is to split the dataset into two parts: a training set used to fit the model and a test set used to check the model's performance. Here we split the data in an 80:20 ratio: 80% is the training set and the remaining 20% is the test set.
Refer to the sample code below.
# 80% of the data for training and the remaining 20% for testing.
train_size = int(len(data) * .8)

train_posts = data['content'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]

test_posts = data['content'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]
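As an alternative sketch of the same 80:20 split, scikit-learn's train_test_split can shuffle the rows and keep the ham/spam proportions equal in both parts. The toy DataFrame below is hypothetical and only mimics the columns of the article's data frame; stratification is a swapped-in technique, not something the code above does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with the same columns as the article's `data`.
toy = pd.DataFrame({
    "content": ["message {}".format(n) for n in range(10)],
    "category": ["ham"] * 5 + ["spam"] * 5,
})

# test_size=0.2 keeps the 80:20 ratio; stratify keeps the ham/spam balance
# the same in both parts; random_state makes the split repeatable.
train_x, test_x, train_y, test_y = train_test_split(
    toy["content"], toy["category"],
    test_size=0.2, stratify=toy["category"], random_state=0,
)

print(len(train_x), len(test_x))
print(test_y.value_counts().to_dict())
```

Stratifying matters most when one class is much rarer than the other, because a plain slice can leave the test set with very few examples of the rare class.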
3.3 PreProcess text
Preprocessing includes removing common terms (also called stop words) and punctuation marks, as well as stemming, lemmatization, etc.
After that, a weight is assigned to each term. The weight can be the term count or TF-IDF, which helps identify the importance of a term in a corpus. CountVectorizer and TfidfVectorizer compute the term count and the TF-IDF weight, respectively.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
# Learn the vocabulary from the training set only, then reuse it for the test set
X_train = vectorizer.fit_transform(train_posts)
X_test = vectorizer.transform(test_posts)
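To make the difference between the two weightings concrete, here is a minimal sketch on a hypothetical three-document corpus (the documents are made up for illustration). It prints the raw counts next to the TF-IDF weights for the same vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical three-document corpus.
docs = [
    "free prize free entry",
    "meeting about the project",
    "free project meeting",
]

count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(docs)

tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(docs)

# vocabulary_ maps each term to its column index in the matrices.
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(vocab)
print(counts.toarray())
print(tfidf.toarray().round(2))
```

With default settings, TfidfVectorizer scales each row to unit length and gives terms that occur in fewer documents a larger weight, so a term such as "prize" counts for more than "free", which appears in two of the three documents.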
3.4 Features Selection
The next step is to select the top K features from our training corpus. Many feature selection methods are available, such as chi-square, ANOVA, LDA, etc. Refer to feature selection for more details.
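The article does not show code for this step, so the following is only a sketch of one option: chi-square selection with scikit-learn's SelectKBest. The toy corpus, the labels, and the choice of k=4 are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy corpus and labels.
docs = [
    "free prize click now",
    "free offer click here",
    "project meeting at noon",
    "notes from the project meeting",
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Keep the k terms with the highest chi-squared score against the labels.
selector = SelectKBest(chi2, k=4)
X_selected = selector.fit_transform(X, labels)

# Map the boolean support mask back to the terms that were kept.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
kept = [t for t, keep in zip(terms, selector.get_support()) if keep]
print(X.shape, "->", X_selected.shape)
print(kept)
```

In a full workflow the selector would be fitted on the training matrix only, and the test matrix would be reduced with selector.transform so that both use the same columns.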
3.5 Build Model
Now it’s time to train our email spam detection model, as shown below.
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X_train, train_tags)
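To illustrate what the fitted classifier does with unseen text, here is a small self-contained sketch. The training sentences and the two new messages are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set.
train_docs = [
    "free prize click now",
    "free offer click here",
    "project meeting at noon",
    "notes from the project meeting",
]
train_labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
clf = MultinomialNB()  # alpha=1.0 (Laplace smoothing) by default
clf.fit(vectorizer.fit_transform(train_docs), train_labels)

# New text must go through the same fitted vectorizer before prediction.
new_docs = ["click for a free prize", "agenda for the project meeting"]
X_new = vectorizer.transform(new_docs)

print(clf.classes_)
print(clf.predict(X_new))
print(clf.predict_proba(X_new).round(3))
```

Words that were never seen during training (such as "agenda") are simply ignored by the vectorizer, and the smoothing parameter keeps a known word that never occurred in one class from driving that class's probability to zero.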
3.6 Apply on test data sets
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import confusion_matrix, classification_report

# 5-fold cross-validation on the training set
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(clf, X_train, train_tags, cv=k_fold, n_jobs=1))

# Predict on the held-out test set and report the results
predictions = clf.predict(X_test)
print(confusion_matrix(test_tags, predictions, labels=["ham", "spam"]))
print(classification_report(test_tags, predictions, labels=["ham", "spam"]))
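As a quick sketch of how to read these metrics, the example below builds a confusion matrix from ten hypothetical labels and derives precision and recall for the ham class from it. The label lists are invented for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels for ten emails.
y_true = ["ham", "ham", "ham", "ham",
          "spam", "spam", "spam", "spam", "spam", "spam"]
y_pred = ["ham", "ham", "spam", "spam",
          "spam", "spam", "spam", "spam", "spam", "ham"]

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
print(cm)

# Per-class precision and recall, treating "ham" as the class of interest.
ham_precision = precision_score(y_true, y_pred, pos_label="ham")
ham_recall = recall_score(y_true, y_pred, pos_label="ham")
print(ham_precision, ham_recall)
```

Here two of the four ham emails are predicted as spam, so ham recall is 0.5. For a spam filter this is the costly kind of error, because legitimate mail ends up flagged as spam.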
3.7 Save Model
Save our email spam detection model to the file system using pickle for later use.
import pickle

# Save the model to disk
modelfilename = 'finalized_model.sav'
print("Saving model")
with open(modelfilename, 'wb') as model_file:
    pickle.dump(clf, model_file)
print("model saved...")

# Load the previously saved model from disk
print("load previously saved model")
with open(modelfilename, 'rb') as model_file:
    loaded_model = pickle.load(model_file)

print("Start prediction")
# Predict with the loaded model (not the in-memory clf) to verify the round trip
predictions = loaded_model.predict(X_test)
print(classification_report(test_tags, predictions, labels=["ham", "spam"]))
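One thing to keep in mind is that the pickled classifier on its own cannot score raw text, because it also needs the fitted vectorizer. A possible way around this, sketched below under the assumption that a scikit-learn Pipeline suits the use case, is to pickle the vectorizer and the classifier together. The toy data and the file name spam_pipeline.pkl are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set.
train_docs = ["free prize click now", "free offer click here",
              "project meeting at noon", "notes from the project meeting"]
train_labels = ["spam", "spam", "ham", "ham"]

# The pipeline bundles the fitted vocabulary with the classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

# Write to a temporary directory so the sketch leaves no files behind in cwd.
model_path = os.path.join(tempfile.mkdtemp(), "spam_pipeline.pkl")
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)

with open(model_path, "rb") as fh:
    restored = pickle.load(fh)

# The restored pipeline accepts raw text directly.
print(restored.predict(["claim your free prize"]))
```

As with any pickle file, the saved model should only be loaded from a trusted source and with a compatible scikit-learn version.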
4. Example
Here is the complete example of Email spam text classification.
import pickle
from pathlib import Path

import pandas as pd
import sklearn.datasets as skds
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

"""Load text files with categories as subfolder names.
Individual samples are assumed to be files stored in a two-level
folder structure such as the following:
    EmailSpam/
        spam/
            file_1.txt
            file_2.txt
            ...
            file_42.txt
        ham/
            file_43.txt
            file_44.txt
            ...
"""
# Source file directory (raw string so the backslashes are kept literally)
path_train = r"E:\DataSet\EmailSpam"
files_train = skds.load_files(path_train, load_content=False)
label_index = files_train.target
label_names = files_train.target_names
labelled_files = files_train.filenames

data_tags = ["filename", "category", "content"]
data_list = []
# Read each file and record its name, category and content
for f, label in zip(labelled_files, label_index):
    data_list.append((f, label_names[label], Path(f).read_text()))

# The data is now available as a DataFrame: filename, category, content
data = pd.DataFrame.from_records(data_list, columns=data_tags)
print("Data loaded : " + str(len(data)))

# Take 80% of the data for training and the remaining 20% for testing.
train_size = int(len(data) * .8)

train_posts = data['content'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]

test_posts = data['content'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]

print("Training Set Size " + str(len(train_posts)))
print("Testing Set Size " + str(len(test_posts)))
# print(train_tags[:5])

# https://www.kaggle.com/adamschroeder/countvectorizer-tfidfvectorizer-predict-comments
# https://www.opencodez.com/python/text-classification-using-keras.htm
# vectorizer = TfidfVectorizer()
vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train_posts)
X_test = vectorizer.transform(test_posts)
# Show the term counts of one training document (a sparse row)
print("Term counts of one training document")
print(X_train[1])

clf = MultinomialNB()
clf.fit(X_train, train_tags)

k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
print("Cross validation start")
print(cross_val_score(clf, X_train, train_tags, cv=k_fold, n_jobs=1))
print("Cross validation end")

print("start applying on testing data set")
predictions = clf.predict(X_test)
print("Model applied on testing data set")
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=["ham", "spam"]))

# Save the model to disk
modelfilename = 'finalized_model.sav'
print("Saving model")
with open(modelfilename, 'wb') as model_file:
    pickle.dump(clf, model_file)
print("model saved...")

# Load the previously saved model from disk
print("load previously saved model")
with open(modelfilename, 'rb') as model_file:
    loaded_model = pickle.load(model_file)

print("Start predection from loaded model")
# Predict with the loaded model (not the in-memory clf) to verify the round trip
predictions = loaded_model.predict(X_test)
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=["ham", "spam"]))
print("complete")
5. Output
The output below is from a run in which the 1,478 emails were split into 1,034 training and 444 test documents (about 70:30).
Data loaded : 1478
Training Set Size 1034
Testing Set Size 444
Cross validation start
[1.         1.         0.99516908 0.99516908 0.99514563]
Cross validation end
start applying on testing data set
Model applied on testing data set
***********Classification Report***********
              precision    recall  f1-score   support

         ham       0.98      0.50      0.66       102
        spam       0.87      1.00      0.93       342

   micro avg       0.88      0.88      0.88       444
   macro avg       0.93      0.75      0.80       444
weighted avg       0.90      0.88      0.87       444

Saving model
model saved...
load previously saved model
Start predection from loaded model
***********Classification Report***********
              precision    recall  f1-score   support

         ham       0.98      0.50      0.66       102
        spam       0.87      1.00      0.93       342

   micro avg       0.88      0.88      0.88       444
   macro avg       0.93      0.75      0.80       444
weighted avg       0.90      0.88      0.87       444

complete
6. Conclusion
In this article, we discussed binary classification (email spam detection) using the Python scikit-learn library, including how to load data, preprocess it, and build and evaluate a Naive Bayes model.
7. Source Code
Multinomial Naive Bayes Text Classification
You can download the source code of Multinomial Naive Bayes Text Classification and other useful examples from our git repository.