What is spam? Why is spamming such a big concern?
The Internet is growing at a fast pace and has become huge over time. Almost every person who uses the internet has their own email address, or perhaps even more than one, and spamming has therefore become a great concern these days.
Spam emails are unwanted emails that we receive without our prior consent. They are a form of commercial advertising, since email is a very cost-effective medium for the sender, providing advertisers with an economically viable channel. If even a fraction of the recipients of a spam message purchase the advertised product, the spammers make money and the spam problem is perpetuated further.
At present, more than 95% of the email messages sent worldwide are believed to be spam, making spam-fighting tools increasingly important to all users of email.
There are many spam filters nowadays, built using different approaches to identify incoming email as spam. Commonly used approaches include whitelists/blacklists, keyword matching, postage, legislation, mail header analysis, and content scanning.
Here, I am going to discuss the Naive Bayes classifier for spam filtering.
Data Retrieval
First, the data from the emails is retrieved. After the data retrieval, the text of the emails is tokenized using a tokenizer.
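A minimal tokenizer can be sketched in plain Python as follows. The function name `tokenize` and the splitting rule are illustrative assumptions; a real filter might instead use a library tokenizer such as NLTK's `word_tokenize`.

```python
import re

def tokenize(text):
    # Lowercase the text and split on anything that is not a letter,
    # digit, or apostrophe; drop the empty strings the split produces.
    return [tok for tok in re.split(r"[^a-z0-9']+", text.lower()) if tok]

print(tokenize("WIN a FREE prize now!!!"))  # ['win', 'a', 'free', 'prize', 'now']
```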
After tokenization, we move on to the preprocessing step.
Preprocessing
Before constructing the Naive Bayes classifier's training and testing datasets, we need to perform certain preprocessing on the text of the emails.
The preprocessing techniques could be:
- Stop word removal
- Punctuations removal
- Stemming using Porter Stemmer or any other stemmer
- Lemmatization
- Normalization
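A sketch of the first two steps in plain Python is shown below. The tiny stop-word set and the `strip_suffix` helper are illustrative stand-ins only; a real pipeline would use NLTK's stop-word corpus and a proper stemmer such as `PorterStemmer`.

```python
# Tiny illustrative stop-word list; real pipelines use a full corpus.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in"}

def strip_suffix(word):
    # Crude suffix stripping as a stand-in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(tokens):
    # Drop stop words, then strip common suffixes from what remains.
    return [strip_suffix(t) for t in tokens if t not in STOP_WORDS]

print(preprocess(["the", "winning", "prizes", "are", "claimed"]))
# ['winn', 'prize', 'claim']
```

Note how crude the suffix stripping is ("winning" becomes "winn"); this is exactly why a real stemmer or a lemmatizer is preferred in practice.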
Splitting dataset into training and testing
Before applying the Naive Bayes classifier to the dataset of emails, containing both spam and non-spam, we need to split it into training and testing sets such that both spam and non-spam emails are present in each.
After splitting the dataset into training and testing, the prior probabilities of the classes and the likelihood probabilities of the words are calculated from the training dataset.
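One simple way to guarantee that both classes appear in each split is to shuffle and cut each class separately. The helper below, `stratified_split`, is an assumed name, not a library API; scikit-learn's `train_test_split` with `stratify=` achieves the same effect.

```python
import random

def stratified_split(emails, labels, test_frac=0.2, seed=0):
    # Split each class separately so both spam and non-spam
    # are represented in the training and testing sets.
    rng = random.Random(seed)
    train, test = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = max(1, int(len(idx) * test_frac))  # at least one test example
        test += [(emails[i], labels[i]) for i in idx[:cut]]
        train += [(emails[i], labels[i]) for i in idx[cut:]]
    return train, test

emails = [f"email-{i}" for i in range(10)]
labels = ["spam"] * 5 + ["non-spam"] * 5
train, test = stratified_split(emails, labels)
print(len(train), len(test))  # 8 2
```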
Applying Bayes Theorem in case of Spam Filtering
Bayes' theorem serves as the backbone of the Naive Bayes classifier, which is used to determine the probability that a given email is spam, given the words in that email. If S is the event of a given email being spam and w is a word in the email, the posterior probability of spam is:

Pr(S|w) = Pr(w|S)·Pr(S) / (Pr(w|S)·Pr(S) + Pr(w|S̄)·Pr(S̄))

Smoothing
The likelihood probability Pr(w|S) is estimated from the training data as the number of occurrences of the word w in spam emails divided by the total number of words in spam emails.

Since the testing data may contain words that are not present in the training data, Pr(w|S) would be zero for those words, so we need to apply some kind of smoothing, such as Laplace smoothing, also called add-one smoothing. Add-one smoothing adds one to the count of every word, including words never seen in the training data:

Pr(w|S) = (count(w, S) + 1) / (count(S) + |V|)

where count(w, S) is the number of occurrences of w in spam emails, count(S) is the total number of words in spam emails, and |V| is the size of the vocabulary.

Therefore, the probability of an unknown word is 1 / (count(S) + |V|).
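The add-one estimate can be sketched as follows; the function name `smoothed_likelihood`, the toy token list, and the vocabulary size are illustrative assumptions.

```python
from collections import Counter

def smoothed_likelihood(word, class_tokens, vocab_size):
    # Add-one (Laplace) smoothing: every word count is incremented by 1,
    # so unseen words get probability 1 / (total tokens + vocab size)
    # instead of zero.
    counts = Counter(class_tokens)
    return (counts[word] + 1) / (len(class_tokens) + vocab_size)

spam_tokens = ["free", "prize", "free", "win"]  # toy training tokens
vocab_size = 6  # assumed vocabulary size over the whole training set

print(smoothed_likelihood("free", spam_tokens, vocab_size))     # (2+1)/(4+6) = 0.3
print(smoothed_likelihood("lottery", spam_tokens, vocab_size))  # (0+1)/(4+6) = 0.1
```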
Predicting emails as spam
Now that the model has been trained, we can use it to predict whether the emails in the test data are spam or non-spam.
Since we are considering a bag-of-words model, the ordering of words is not important. To calculate the likelihood of all the words in an email given that the email is spam, we simply multiply the individual likelihoods of the words in the email.
Using the posteriors of both classes (spam and non-spam), each email is assigned to the class with the higher posterior probability.
If Likelihood(spam) * Prior(spam) > Likelihood(non-spam) * Prior(non-spam), the email is classified as spam; otherwise it is classified as non-spam.
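This decision rule can be sketched with toy numbers. Summing log-probabilities is equivalent to comparing the products above but avoids floating-point underflow on long emails; the likelihood table, the `<unk>` entry for unseen words, and the function name `classify` are illustrative assumptions.

```python
import math

# Toy smoothed likelihoods and priors (illustrative numbers only);
# "<unk>" holds the add-one probability assigned to unseen words.
likelihood = {
    "spam":     {"free": 0.3,  "prize": 0.2,  "meeting": 0.05, "<unk>": 0.05},
    "non-spam": {"free": 0.05, "prize": 0.05, "meeting": 0.3,  "<unk>": 0.05},
}
priors = {"spam": 0.5, "non-spam": 0.5}

def classify(tokens):
    # Score each class by log(prior) + sum of log(likelihood) and
    # return the class with the higher score.
    def score(cls):
        return math.log(priors[cls]) + sum(
            math.log(likelihood[cls].get(w, likelihood[cls]["<unk>"]))
            for w in tokens
        )
    return max(priors, key=score)

print(classify(["free", "prize"]))  # spam
print(classify(["meeting"]))        # non-spam
```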
Currently, we are considering unigrams, but the same can be done with n-grams.
After predicting the classes for test data, accuracy can be calculated by comparing the true labels with the predicted ones.
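Accuracy is simply the fraction of test emails whose predicted class matches the true one; `accuracy` below is an assumed helper name.

```python
def accuracy(true_labels, predicted_labels):
    # Count matching (true, predicted) pairs and divide by the total.
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

print(accuracy(["spam", "spam", "non-spam", "non-spam"],
               ["spam", "non-spam", "non-spam", "non-spam"]))  # 0.75
```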
I have now shown how to classify emails as spam or non-spam, and hence how we can filter spam out of our inboxes.