
Naive Bayes Classifier

Introduction

Naive Bayes is a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes falls under classification in the supervised learning setting and is used for modeling and predicting categorical variables. It is a very simple algorithm built around conditional probability and counting.

It is called "naive" because of its core assumption of conditional independence (i.e. all input features are independent of one another), which rarely holds true in the real world.

Applications of Naive Bayes Algorithms

  • Real time Prediction: Naive Bayes is an eager learning classifier and it is very fast, so it can be used for making predictions in real time.
  • Multi class Prediction: This algorithm is also well known for its multi class prediction capability; we can predict the probabilities of multiple classes of the target variable.
  • Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are widely used in text classification (thanks to good results on multi class problems and the independence assumption) and often achieve a higher success rate than other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiment).
  • Recommendation System: Naive Bayes classifiers and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.

Some real-world examples

  1. Mark an email as spam or not spam.
  2. Classify a news article as being about technology, politics, or sports.
  3. Check whether a piece of text expresses positive or negative emotions.
  4. Used in face recognition software.

Strengths of Naive Bayes

Even though the conditional independence assumption rarely holds true, NB models actually perform surprisingly well in practice, especially given how simple they are. They are easy to implement and they scale well with your data set.

Naive Bayes Pros:

  1. It is easy and fast to predict the class of a test data set. It also performs well in multi class prediction.
  2. When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and it needs less training data.
  3. It performs well with categorical input variables compared to numerical variable(s). For numerical variables, a normal distribution is assumed (the bell curve, which is a strong assumption).

Weaknesses of Naive Bayes

Due to their sheer simplicity, NB models are often beaten by models properly trained and tuned using other algorithms.

Naive Bayes Cons:

  1. If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it zero probability and will be unable to make a prediction. This is often known as the "zero frequency" problem. To solve it, we can use a smoothing technique; one of the simplest is Laplace estimation (see the sketch after this list).
  2. Naive Bayes is also known to be a bad estimator, so the probability outputs from predict_proba should not be taken too seriously.
  3. Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.
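
To make the first point concrete, here is a minimal sketch of Laplace (add-one) smoothing; the counts and the number of categories below are hypothetical:

# Hypothetical counts: a feature value that never co-occurs with class k in the training set
count_xi_and_k = 0    # records with this feature value AND y = k
count_k = 40          # records with y = k
n_values = 3          # number of distinct values this feature can take

# Unsmoothed estimate is zero, which zeroes out the whole product for class k
p_unsmoothed = count_xi_and_k / count_k

# Laplace (add-one) smoothing: pretend every feature value was seen once more per class
alpha = 1
p_smoothed = (count_xi_and_k + alpha) / (count_k + alpha * n_values)

print(p_unsmoothed, p_smoothed)   # 0.0 and roughly 0.023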

How Naive Bayes handles:

  • Missing data: Since all the variables are treated independently, removing one from the probability calculation produces the same result as if you had trained without that variable to begin with. It is therefore straightforward to build a Naive Bayes classifier that handles data with lots of null values.
  • Variance (overfitting): Overfitting in Naive Bayes classifiers is controlled by introducing priors; MAP estimation helps avoid overfitting and extreme probability values.
  • Bias: Naive Bayes does not care how erroneous the result might be; its weights are dictated by the empirical conditional probabilities of the features in the training set. Algorithms that explicitly minimize an error function, such as logistic regression, therefore tend to have lower bias than Naive Bayes, where "bias" means how much systematic error there is in the model.

Bayes' Theorem

Bayes' theorem is a famous equation that allows us to make predictions based on data. Here is the classic version of Bayes' theorem:

\[P(A|B) = \frac{P(A)\cdot P(B|A)}{P(B)}\]

This might be too abstract, so let us replace some of the variables to make it more concrete. In a Bayes classifier, we are interested in finding out the class (e.g. male or female, spam or ham) of an observation given the data:

\begin{equation} P(class|data) = \frac{P(class)\cdot P(data|class)}{P(data)}
\end{equation}

where:

  • $class$ is a particular class (e.g. admitted or not admitted),
  • $data$ is an observation’s data,
  • $P(class)$ is called the prior; it quantifies our uncertainty about the class,
  • $P(data)$ is called the marginal probability,
  • $P(data \mid class)$ is called the likelihood; the probability of the data under the given class (parameter), and
  • $P(class \mid data)$ is called the posterior, our updated belief.
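
As a quick worked example with made-up numbers: suppose 20% of e-mails are spam, the word "free" appears in 60% of spam e-mails, and it appears in 10% of non-spam e-mails. For an e-mail containing "free", Bayes' theorem gives

\[P(spam \mid free) = \frac{P(spam)\cdot P(free \mid spam)}{P(free)} = \frac{0.2 \cdot 0.6}{0.2 \cdot 0.6 + 0.8 \cdot 0.1} = \frac{0.12}{0.20} = 0.6\]

so seeing the word raises our belief that the e-mail is spam from the prior 0.2 to a posterior of 0.6.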

Naive Bayes

Consider data with features $X = \{x_1, \ldots, x_n\}$ and a label $y$ with $K$ classes such that $y \in \{1, \ldots, K\}$. Given a class variable $y = k$ and a dependent feature vector $x_1$ through $x_n$, we can rewrite the above expression as follows:

\[P(y=k | x_1, \ldots, x_n) = \frac{P(y=k)\cdot P(x_1, \ldots,x_n|y=k)}{P(x_1, \ldots, x_n)}\]

Using the naive independence assumption that: \(P(x_i|y=k, x_1, \ldots x_{i-1}, x_{i+1}, \ldots, x_n) = P(x_i|y=k)\)

It is clear that: \(P(x_1, \ldots, x_n|y)= P(x_1|y)\cdot P(x_2|y) \cdots P(x_n|y) = \prod_{i=1}^n P(x_i|y)\)

Thus the previous equation simplifies to: \(P(y=k | x_1, \ldots, x_n) = \frac{P(y=k)\cdot \prod_{i=1}^n P(x_i|y=k)}{P(x_1, \ldots, x_n)}\)

Since $P(x_1, \ldots, x_n)$ is constant given the input, we can use the following classification rule:

\[P(y=k | x_1, \ldots, x_n) \propto P(y=k)\cdot \prod_{i=1}^n P(x_i|y=k)\]

Training (learning parameters from data)

To estimate $P(y=k)$ and $P(x_i|y=k)$, Maximum A Posteriori (MAP) estimation can be used; with a flat prior this reduces to simple counting, where:

  1. $P(y=k)$ is simply the fraction of records with $y = k$ in the training set.
  2. $P(x_i \mid y=k)$ is the fraction of $y=k$ records that also have $x_i$ in the training set.
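
As an illustration, these two counting estimates can be computed with a couple of pandas one-liners; the tiny DataFrame below is made up and is not the admission data used later:

import pandas as pd

# Hypothetical training set: one categorical feature and a binary label
df = pd.DataFrame({
    'weather': ['sunny', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny'],
    'y':       [1,        0,       1,       1,          0,       0]
})

# 1. P(y=k): fraction of records with y = k
priors = df['y'].value_counts(normalize=True)

# 2. P(x_i | y=k): within each class, fraction of records with each feature value
likelihoods = df.groupby('y')['weather'].value_counts(normalize=True)

print(priors)
print(likelihoods)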

Prediction

To predict the label $\hat{y}$ of a new observation given all its $x_i$ values, the following prediction rule is employed:

\[\hat{y} = \arg\max_{k}\left[P(y=k)\cdot \prod_{i=1}^n P(x_i|y=k)\right]\]
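
A minimal sketch of this decision rule with hand-specified (hypothetical) estimates for two classes and two categorical features:

# Hypothetical estimated parameters
priors = {0: 0.5, 1: 0.5}
likelihoods = {   # likelihoods[k][feature][value] = P(x_i = value | y = k)
    0: {'x1': {'a': 0.7, 'b': 0.3}, 'x2': {'yes': 0.2, 'no': 0.8}},
    1: {'x1': {'a': 0.1, 'b': 0.9}, 'x2': {'yes': 0.6, 'no': 0.4}},
}
observation = {'x1': 'b', 'x2': 'yes'}

# Score each class with P(y=k) * prod_i P(x_i | y=k) and take the argmax
scores = {}
for k in priors:
    score = priors[k]
    for feature, value in observation.items():
        score *= likelihoods[k][feature][value]
    scores[k] = score

y_hat = max(scores, key=scores.get)
print(scores, y_hat)   # scores are roughly {0: 0.03, 1: 0.27}, so the predicted class is 1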

Types of Naive Bayes

The different naive Bayes classifiers differ mainly in the assumptions they make regarding the distribution of $P(x_i\mid y=k)$:

  • Gaussian: It is used for real-valued (continuous) features. It assumes the features follow a normal distribution.
  • Multinomial: It is used for discrete counts, for example word counts in a text classification problem.
  • Bernoulli: This model is useful if your feature vectors are binary (i.e. zeros and ones). One application would be text classification with a ‘bag of words’ model where the 1s and 0s mean “word occurs in the document” and “word does not occur in the document” respectively.
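
For reference, scikit-learn implements all three variants with the same fit/predict interface; below is a minimal usage sketch with tiny made-up arrays (assuming scikit-learn is installed):

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Tiny made-up inputs, one shape per variant
X_real   = np.array([[1.2, 3.4], [0.9, 2.8], [5.1, 7.2], [4.8, 6.9]])   # continuous features
X_counts = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 3]])       # e.g. word counts
X_binary = (X_counts > 0).astype(int)                                   # word occurs / does not occur
y = np.array([0, 0, 1, 1])

print(GaussianNB().fit(X_real, y).predict([[1.0, 3.0]]))
print(MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]]))
print(BernoulliNB().fit(X_binary, y).predict([[1, 0, 1]]))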

Let us consider Gaussian Naive Bayes

In the case of real-valued features, we can use the Gaussian distribution.

\[P(x_i |y=k) = \frac{1}{\sqrt{2 \pi \sigma _{ik}^2}} \cdot \exp{\left[-\frac{(x_i - \mu _{ik})^2}{2 \sigma _{ik}^2}\right]}\]

And this is called Gaussian Naive Bayes. To train a Gaussian Naive Bayes model:

  • For each value of $k$:
    • Estimate the prior $\pi _k = P(y=k)$
    • For each attribute $x_i$, estimate the class conditional mean $\mu _{ik}$ and variance $\sigma _{ik}^2$

Example

Let us consider an example application of Gaussian Naive Bayes: given a model trained on a portion of an admissions data set, we want to predict whether a student with given grades will be admitted or not.

import numpy as np
import pandas as pd
data = pd.read_csv('data/admission.csv', names = ["grade1", "grade2", "remark"])
data.tail()
       grade1     grade2  remark
95  83.489163  48.380286       1
96  42.261701  87.103851       1
97  99.315009  68.775409       1
98  55.340018  64.931938       1
99  74.775893  89.529813       1

Calculate Class Priors

# Number of students
n_admitted = data['remark'][data['remark'] == 1].count()
n_not_admitted = data['remark'][data['remark'] == 0].count()
total_class = data['remark'].count()
# Class probability
p_admitted = n_admitted/total_class
p_not_admitted = n_not_admitted/total_class
p_admitted

Calculate Likelihood

# Group the data by remark and calculate the mean of each feature
data_means = data.groupby('remark').mean()

# View the values
data_means
           grade1     grade2
remark
0       52.032301  54.620392
1       74.718923  73.956402
# calculate the variance of each feature
data_variance = data.groupby('remark').var()

# View the values
data_variance
            grade1      grade2
remark
0       307.969146  258.617568
1       222.380256  256.397065

Now we can create all the variables we need. The code below might look complex but all we are doing is creating a variable out of each cell in both of the tables above.

# Mean for admitted class
grade1_admit_mean = data_means['grade1'][data_means.index == 1].values[0]
grade2_admit_mean = data_means['grade2'][data_means.index == 1].values[0]

# variance for admitted class
grade1_admit_var = data_variance['grade1'][data_variance.index == 1].values[0]
grade2_admit_var = data_variance['grade2'][data_variance.index == 1].values[0]


# Mean for not admitted class
grade1_not_admit_mean = data_means['grade1'][data_means.index == 0].values[0]
grade2_not_admit_mean = data_means['grade2'][data_means.index == 0].values[0]

# variance for not admitted class
grade1_not_admit_var = data_variance['grade1'][data_variance.index == 0].values[0]
grade2_not_admit_var = data_variance['grade2'][data_variance.index == 0].values[0]


Finally, we need to create a function to calculate the probability density for each term of the likelihood:

def pdf_gaussian(x, mean, var):

    # Gaussian probability density of x given the class conditional mean and variance
    pdf = 1/np.sqrt(2*np.pi*var) * np.exp(-(x - mean)**2 / (2*var))

    return pdf
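
As a quick sanity check (assuming SciPy is available), this should match scipy.stats.norm.pdf, which takes a standard deviation rather than a variance:

from scipy.stats import norm

# Example: density of grade1 = 50 under the admitted-class Gaussian; the two printed values should agree
print(pdf_gaussian(50, grade1_admit_mean, grade1_admit_var))
print(norm.pdf(50, loc=grade1_admit_mean, scale=np.sqrt(grade1_admit_var)))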

Prediction

Now suppose we have a student with $grade1 = 50$ and $grade2 = 45$. Let us check whether this student will be admitted or not.

grade1 = 50
grade2 = 45

# Unnormalized posterior (the numerator of Bayes' theorem) for each class
posterior_admitted = p_admitted * pdf_gaussian(grade1, grade1_admit_mean, grade1_admit_var) * \
                     pdf_gaussian(grade2, grade2_admit_mean, grade2_admit_var)
posterior_not_admitted = p_not_admitted * pdf_gaussian(grade1, grade1_not_admit_mean, grade1_not_admit_var) * \
                         pdf_gaussian(grade2, grade2_not_admit_mean, grade2_not_admit_var)

print("Posterior probability for admitted class: {:.4e}".format(posterior_admitted))
print("Posterior probability for not-admitted class: {:.4e}".format(posterior_not_admitted))
Posterior probability for admitted class: 1.9733e-05
Posterior probability for not-admitted class: 1.8736e-04

Because the posterior for the not-admitted class is greater than that for the admitted class, we predict that the student is not admitted.
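
As a final sanity check (assuming scikit-learn is available), the same decision can be reproduced with sklearn's GaussianNB. Its variance estimates differ slightly from the pandas ones (maximum likelihood plus a small smoothing term), so the numbers will not match exactly, but the predicted class should still be not-admitted:

from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(data[['grade1', 'grade2']].values, data['remark'].values)

print(model.predict([[grade1, grade2]]))        # expected: [0], i.e. not admitted
print(model.predict_proba([[grade1, grade2]]))  # normalized posterior probabilities for classes 0 and 1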
