 # Naive Bayes Text Classification Example

## 1. Overview

Let's first look at what the Naive Bayes algorithm is. "Naive Bayes classifiers are a family of simple 'probabilistic classifiers' based on applying Bayes' theorem with strong (naive) independence assumptions between the features."
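
To make the independence assumption concrete: for a class $c$ (here `spam` or `ham`) and a document described by features $x_1, \dots, x_n$ (symbols introduced here for illustration only), Bayes' theorem gives the first line below, and assuming the features are conditionally independent given the class gives the second:

$$
P(c \mid x_1, \dots, x_n) = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}
$$

$$
P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

The classifier then predicts the class for which the right-hand side of the second line is largest.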

In computer science and statistics, Naive Bayes is also called `Simple Bayes` or `Independence Bayes`.

In this article, we will implement an email spam detection system that identifies whether an email document is `spam` or `ham`. We will use the Multinomial Naive Bayes classifier (`MultinomialNB`) from scikit-learn to classify each email document.

## 2. Development Environment

Python : 3.6.5

IDE : PyCharm Community Edition

scikit-learn : 0.20.0

## 3. Steps of Text Classification

We use the Enron dataset to develop the email spam detection system.

### 3.1 Load Dataset

The first step is to load the content of each email together with its category. We use `sklearn.datasets` and a pandas DataFrame to load the file contents and categories for further processing. Here is the sample code.

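```python
"""Load text files with categories as subfolder names.

Individual samples are assumed to be files stored in a two-level folder
structure such as the following:

    EmailSpam/
        spam/
            file_1.txt
            file_2.txt
            ...
            file_42.txt
        ham/
            file_43.txt
            file_44.txt
            ...
"""
from pathlib import Path

import pandas as pd
import sklearn.datasets as skds

# Source file directory (raw string so the backslashes are kept literally)
path_train = r"E:\DataSet\EmailSpam"
# Scan the folder tree; load_content=False returns file names and labels only,
# the file contents are read in the loop below
files_train = skds.load_files(path_train, load_content=False)
label_index = files_train.target
label_names = files_train.target_names
labelled_files = files_train.filenames

data_tags = ["filename", "category", "content"]
data_list = []

# Read each file and store (filename, category, content)
for f, idx in zip(labelled_files, label_index):
    data_list.append((f, label_names[idx], Path(f).read_text()))

# The data is now available as a DataFrame with filename, category and content columns
data = pd.DataFrame.from_records(data_list, columns=data_tags)
```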
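
To show what the two-level folder layout means in practice, here is a minimal sketch that walks such a tree using only the standard library. The helper `load_labelled_files` and the temporary toy corpus are hypothetical names and data made up for this illustration; they are not part of scikit-learn or of the original project. Unlike `sklearn.datasets.load_files`, this sketch does not shuffle the records.

```python
import tempfile
from pathlib import Path


def load_labelled_files(root):
    """Return (filename, category, content) tuples from root/<category>/<file>."""
    records = []
    # Sorted iteration keeps the order deterministic across platforms.
    for category_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for file_path in sorted(p for p in category_dir.iterdir() if p.is_file()):
            # errors="replace" is an assumption: it avoids crashing on stray bytes.
            content = file_path.read_text(encoding="utf-8", errors="replace")
            records.append((str(file_path), category_dir.name, content))
    return records


# Build a tiny hypothetical corpus in a temporary directory to try the loader.
with tempfile.TemporaryDirectory() as tmp:
    for category, text in [("spam", "win a free prize now"), ("ham", "meeting at noon")]:
        folder = Path(tmp) / category
        folder.mkdir()
        (folder / "file_1.txt").write_text(text, encoding="utf-8")
    records = load_labelled_files(tmp)

categories = [category for _, category, _ in records]
print(categories)
```

The subfolder name becomes the category label, which is why the folder names need to match the labels used later when evaluating the model.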

### 3.2 Split datasets into Train/Test

The next step is to split the data set into two parts: a training set used to fit the model and a testing set used to check its performance. Here we split the data in an 80:20 ratio: the first 80% of the rows form the training set and the remaining 20% form the testing set.

Refer to the sample code below.

```python
# Use the first 80% of the data for training and the remaining 20% for testing.
train_size = int(len(data) * .8)

train_posts = data['content'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]

test_posts = data['content'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]
```
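
A slice-based split like the one above only gives representative sets when the rows are already in random order. As an alternative, the following sketch uses scikit-learn's `train_test_split` with `stratify`, which shuffles the data and keeps the spam/ham proportion the same in both parts. The toy `posts` and `tags` lists are made up for illustration and stand in for `data['content']` and `data['category']`.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for data['content'] and data['category'].
posts = ["free prize", "win cash", "cheap loans", "claim reward", "act now",
         "lunch today", "project update", "see you soon", "meeting notes", "call me"]
tags = ["spam"] * 5 + ["ham"] * 5

# test_size=0.2 reproduces the 80:20 ratio; random_state makes the split repeatable.
train_posts, test_posts, train_tags, test_tags = train_test_split(
    posts, tags, test_size=0.2, stratify=tags, random_state=0
)

print(len(train_posts), len(test_posts))
print(sorted(test_tags))
```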

### 3.3 Preprocess Text

Preprocessing includes removing common terms (also called stop words) and punctuation marks, as well as stemming, lemmatization, etc.

After that, a weight is assigned to each term. The weight can be the term count or the TF-IDF score, which helps identify the importance of a term in a corpus. `CountVectorizer` and `TfidfVectorizer` calculate term counts and TF-IDF weights respectively.

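```python
from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary on the training set only, dropping English stop words
vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train_posts)
# Reuse the same vocabulary for the test set
X_test = vectorizer.transform(test_posts)
```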
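
To illustrate the difference between a term count and a TF-IDF weight, here is a small hand-rolled sketch on a made-up three-document corpus. It uses the textbook formula (raw count times `log(N / df)`); this is an assumption chosen for clarity, and `TfidfVectorizer` by default applies a smoothed IDF and L2 normalisation, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus; real input would be the email bodies.
docs = ["free prize free", "meeting at noon", "free meeting"]
tokenized = [doc.split() for doc in docs]

# Term counts per document (the weight a count-based vectorizer would use).
counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)


def tf_idf(term, doc_index):
    """Textbook weight: raw term count times log(N / document frequency)."""
    return counts[doc_index][term] * math.log(n_docs / doc_freq[term])


print(counts[0]["free"])             # term count of "free" in the first document
print(round(tf_idf("free", 0), 3))   # "free" occurs in 2 of the 3 documents
print(round(tf_idf("prize", 0), 3))  # "prize" occurs in 1 of the 3 documents
```

Although "free" occurs twice in the first document and "prize" only once, "prize" gets the higher TF-IDF weight because it is rarer across the corpus.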

### 3.4 Feature Selection

The next step is to select the top K features from the training corpus. Many feature selection methods are available, such as chi-square, ANOVA, and LDA. Refer to material on feature selection for more details.
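
The article does not show code for this step, so the following is only a sketch of one possible approach: chi-square scoring with scikit-learn's `SelectKBest`. The toy `posts` and `tags` and the choice `k=2` are assumptions for illustration; on the real corpus `k` would be much larger and would need tuning.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data standing in for train_posts / train_tags.
posts = ["free prize now", "win free cash", "meeting at noon", "lunch meeting today"]
tags = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(posts)

# Keep the k terms with the highest chi-squared score; k=2 is an arbitrary choice.
selector = SelectKBest(chi2, k=2)
X_selected = selector.fit_transform(X, tags)

# Map the surviving column indices back to vocabulary terms.
index_to_term = {index: term for term, index in vectorizer.vocabulary_.items()}
selected_terms = sorted(index_to_term[i] for i in selector.get_support(indices=True))

print(X_selected.shape)
print(selected_terms)
```

If this step is used, the same fitted `selector` would also have to be applied to the test matrix with `selector.transform` before prediction.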

### 3.5 Build Model

Now it's time to train our email spam detection model, as shown below.

```python
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, train_tags)
```
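
To give an idea of what fitting a multinomial Naive Bayes model involves, here is a conceptual sketch in plain Python: it estimates a prior per class and a smoothed likelihood per term, then picks the class with the highest log-posterior. It is not scikit-learn's implementation; the toy training set and the helper names `log_posterior` and `predict` are made up for illustration, and `alpha=1.0` is chosen to mirror the default Laplace smoothing of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set of (text, label) pairs.
train = [
    ("free prize now", "spam"),
    ("win free cash", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting today", "ham"),
]

class_docs = Counter(label for _, label in train)  # number of documents per class
term_counts = defaultdict(Counter)                 # term counts per class
for text, label in train:
    term_counts[label].update(text.split())
vocabulary = {term for counts in term_counts.values() for term in counts}


def log_posterior(text, label, alpha=1.0):
    """Unnormalised log P(label | text) with add-alpha (Laplace) smoothing."""
    total_terms = sum(term_counts[label].values())
    score = math.log(class_docs[label] / len(train))  # log prior
    for term in text.split():
        if term not in vocabulary:
            continue  # terms never seen in training are ignored
        likelihood = (term_counts[label][term] + alpha) / (
            total_terms + alpha * len(vocabulary)
        )
        score += math.log(likelihood)
    return score


def predict(text):
    """Return the class with the highest log-posterior."""
    return max(class_docs, key=lambda label: log_posterior(text, label))


print(predict("free cash prize"))
print(predict("meeting today"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for long emails.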

### 3.6 Apply on test data sets

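We first estimate performance with 5-fold cross-validation on the training set, then predict on the held-out test set and print the confusion matrix and the classification report.

```python
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import KFold, cross_val_score

# 5-fold cross-validation on the training set
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(clf, X_train, train_tags, cv=k_fold, n_jobs=1))

# Predict on the held-out test set
predictions = clf.predict(X_test)

# Pass labels by keyword; newer scikit-learn releases no longer accept them positionally
print(confusion_matrix(test_tags, predictions, labels=["ham", "spam"]))
print(classification_report(test_tags, predictions, labels=["ham", "spam"]))
```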
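
To make the numbers in the confusion matrix and the classification report easier to read, the following sketch computes a confusion matrix, precision, and recall by hand. The `y_true` and `y_pred` lists are hypothetical labels for eight emails, not results from the Enron data.

```python
from collections import Counter

# Hypothetical true and predicted labels for eight emails.
y_true = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"]
y_pred = ["ham", "ham", "spam", "spam", "spam", "spam", "spam", "ham"]

labels = ["ham", "spam"]
pair_counts = Counter(zip(y_true, y_pred))

# Rows are true labels, columns are predicted labels.
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]


def precision(label):
    """Share of emails predicted as `label` that really are `label`."""
    predicted = sum(pair_counts[(t, label)] for t in labels)
    return pair_counts[(label, label)] / predicted


def recall(label):
    """Share of emails that really are `label` and were predicted as `label`."""
    actual = sum(pair_counts[(label, p)] for p in labels)
    return pair_counts[(label, label)] / actual


print(matrix)
for label in labels:
    print(label, round(precision(label), 2), round(recall(label), 2))
```

For a spam filter, low recall on `ham` is the costly case: it means legitimate emails are being filed as spam.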

### 3.7 Save Model

Save the email spam detection model to the file system using pickle for later use.

```python
import pickle

# Save the model to disk
modelfilename = 'finalized_model.sav'
print("Saving model")
with open(modelfilename, 'wb') as model_file:
    pickle.dump(clf, model_file)
print("model saved...")

# Load the previously saved model from disk
with open(modelfilename, 'rb') as model_file:
    loaded_model = pickle.load(model_file)

print("Start prediction")
predictions = loaded_model.predict(X_test)
print(classification_report(test_tags, predictions, labels=["ham", "spam"]))
```
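
One caveat: pickling only `clf` leaves out the fitted `CountVectorizer`, which is needed to turn a new email into the feature vector the model expects. A possible way around this, sketched below under the assumption that a scikit-learn `Pipeline` fits your setup, is to bundle both steps and pickle the pipeline as one object. The toy data is made up for illustration.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for train_posts / train_tags.
posts = ["free prize now", "win free cash", "meeting at noon", "lunch meeting today"]
tags = ["spam", "spam", "ham", "ham"]

# The pipeline keeps the fitted vocabulary and the classifier together.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(posts, tags)

# Round-trip through pickle in memory; writing to a file works the same way.
restored = pickle.loads(pickle.dumps(model))

# The restored pipeline accepts raw text directly.
prediction = restored.predict(["free cash prize"])[0]
print(prediction)
```

As with any pickle file, only load models from sources you trust, and expect to load them with the same scikit-learn version that saved them.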

## 4. Example

Here is the complete example of Email spam text classification.

```python
"""Load text files with categories as subfolder names.

Individual samples are assumed to be files stored in a two-level folder
structure such as the following:

    EmailSpam/
        spam/
            file_1.txt
            file_2.txt
            ...
            file_42.txt
        ham/
            file_43.txt
            file_44.txt
            ...
"""
import pickle
from pathlib import Path

import pandas as pd
import sklearn.datasets as skds
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Source file directory (raw string so the backslashes are kept literally)
path_train = r"E:\DataSet\EmailSpam"
# Scan the folder tree; load_content=False returns file names and labels only,
# the file contents are read in the loop below
files_train = skds.load_files(path_train, load_content=False)
label_index = files_train.target
label_names = files_train.target_names
labelled_files = files_train.filenames

data_tags = ["filename", "category", "content"]
data_list = []

# Read each file and store (filename, category, content)
for f, idx in zip(labelled_files, label_index):
    data_list.append((f, label_names[idx], Path(f).read_text()))
print("Data loaded : " + str(len(data_list)))

# The data is now available as a DataFrame with filename, category and content columns
data = pd.DataFrame.from_records(data_list, columns=data_tags)

# Take the first 80% of the data for training and the remaining 20% for testing.
train_size = int(len(data) * .8)

train_posts = data['content'][:train_size]
train_tags = data['category'][:train_size]
train_files_names = data['filename'][:train_size]

test_posts = data['content'][train_size:]
test_tags = data['category'][train_size:]
test_files_names = data['filename'][train_size:]
print("Training Set Size " + str(len(train_posts)))
print("Testing Set Size " + str(len(test_posts)))

# https://www.opencodez.com/python/text-classification-using-keras.htm
# Alternative weighting: vectorizer = TfidfVectorizer()
vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train_posts)
X_test = vectorizer.transform(test_posts)
print("Vocabulary size: " + str(len(vectorizer.vocabulary_)))

# Train the Multinomial Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, train_tags)

# 5-fold cross-validation on the training set
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
print("Cross validation start")
print(cross_val_score(clf, X_train, train_tags, cv=k_fold, n_jobs=1))
print("Cross validation end")

print("start applying on testing data set")
predictions = clf.predict(X_test)
print("Model applied on testing data set")

print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=["ham", "spam"]))

# Save the model to disk
modelfilename = 'finalized_model.sav'
print("Saving model")
with open(modelfilename, 'wb') as model_file:
    pickle.dump(clf, model_file)
print("model saved...")

# Load the previously saved model from disk and predict again
with open(modelfilename, 'rb') as model_file:
    loaded_model = pickle.load(model_file)
predictions = loaded_model.predict(X_test)
print("***********Classification Report***********")
print(classification_report(test_tags, predictions, labels=["ham", "spam"]))
print("complete")
```

## 5. Output

The output of one run is shown below. The set sizes it reports (1034 training and 444 testing documents out of 1478) correspond to a 70:30 split, so this particular run used a smaller training fraction than the `.8` shown in the listing. Note also the `ham` recall of 0.50: half of the legitimate emails in the test set were classified as spam.

```
Data loaded : 1478
Training Set Size 1034
Testing Set Size 444

Cross validation start
[1.         1.         0.99516908 0.99516908 0.99514563]
Cross validation end
start applying on testing data set
Model applied on testing data set
***********Classification Report***********
              precision    recall  f1-score   support

         ham       0.98      0.50      0.66       102
        spam       0.87      1.00      0.93       342

   micro avg       0.88      0.88      0.88       444
   macro avg       0.93      0.75      0.80       444
weighted avg       0.90      0.88      0.87       444

Saving model
model saved...
***********Classification Report***********
              precision    recall  f1-score   support

         ham       0.98      0.50      0.66       102
        spam       0.87      1.00      0.93       342

   micro avg       0.88      0.88      0.88       444
   macro avg       0.93      0.75      0.80       444
weighted avg       0.90      0.88      0.87       444

complete
```

## 6. Conclusion

In this article, we discussed binary classification (email spam detection) using the Python scikit-learn library, including how to load data, preprocess it, and build and evaluate a Naive Bayes model.

## 7. Source Code

You can download the source code of this Multinomial Naive Bayes text classification example, along with other useful examples, from our git repository.