What is Naive Bayes?

In my blog post Bayes and Binomial Theorem, I talk about Bayes' theorem and how it is used to determine, or rather estimate, a conditional probability by turning the condition around.

P(A|B) = P(B|A) * P(A) / P(B)

In other words, you can use P(B|A), the prior probability P(A), and the marginal probability P(B) to calculate P(A|B). This is very powerful because we often have information on the former three probabilities but not on the latter.
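
As a quick, entirely made-up illustration: suppose 1% of all emails are spam (P(A)), 80% of spam emails contain the word 'offer' (P(B|A)), and 10% of all emails contain it (P(B)). Bayes' theorem then gives the probability that an email containing 'offer' is spam:

    # Hypothetical numbers, purely for illustration
    p_spam = 0.01              # P(A): prior probability of spam
    p_word_given_spam = 0.80   # P(B|A): probability of 'offer' given spam
    p_word = 0.10              # P(B): probability of 'offer' in any email

    p_spam_given_word = p_word_given_spam * p_spam / p_word
    print(p_spam_given_word)   # 0.08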

Naive Bayes is a classification algorithm that does this with the features of a dataset. In non-math words: we calculate the probability of belonging to class A given feature vector B by multiplying the proportion of feature vector B within class A by the proportion of class A in the population, and then dividing the whole thing by the proportion of vector B in the population. In principle this is a very straightforward calculation, but you can probably tell when it will be hard or impossible to do: if we have many features, it becomes more and more unlikely that a specific feature vector has been seen before. And if all or most feature vectors occur only once in a population, we cannot be confident of probability estimates based on them.
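
To make the counting concrete, here is a minimal sketch of this 'whole feature vector' approach, using a tiny invented dataset of binary feature pairs. With only two features it works, but the counts for any specific combination shrink rapidly as features are added:

    # A toy dataset of (feature vector, class label) pairs; the data is invented
    data = [
        ((1, 0), 'A'), ((1, 0), 'A'), ((1, 1), 'A'),
        ((0, 1), 'B'), ((0, 1), 'B'), ((1, 1), 'B'),
    ]

    def p_class_given_vector(x, c, data):
        # Estimate P(A|B) = P(B|A) * P(A) / P(B) purely by counting
        n = len(data)
        n_c = sum(1 for _, label in data if label == c)
        n_x_and_c = sum(1 for v, label in data if v == x and label == c)
        n_x = sum(1 for v, _ in data if v == x)
        return (n_x_and_c / n_c) * (n_c / n) / (n_x / n)

    print(p_class_given_vector((1, 1), 'A', data))  # 0.5: (1, 1) occurs once per class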

Naive Bayes solves this problem by making an assumption that is almost certainly incorrect, namely that the features are independent of each other. This is where the 'naive' in its name comes from. If we do this, we can treat each feature separately instead of treating the whole feature vector as one. In a large dataset, any particular value of a single feature is far more common than any specific combination of values for all features. In addition to this naive assumption, the denominator P(B) usually does not need to be calculated at all: it is the same for every class, and for classification purposes we do not need absolute probabilities; we just need to know which class gets the largest number out of our calculation.
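
Putting both simplifications together, a minimal Naive Bayes classifier for binary features could look like the sketch below (again with invented data). A real implementation would also add smoothing for feature values never seen in a class, and would sum log-probabilities instead of multiplying to avoid numeric underflow:

    from collections import defaultdict

    def train(data):
        # Count class frequencies and, per class, the frequency of each feature value
        class_counts = defaultdict(int)
        feature_counts = defaultdict(lambda: defaultdict(int))
        for x, c in data:
            class_counts[c] += 1
            for i, value in enumerate(x):
                feature_counts[c][(i, value)] += 1
        return class_counts, feature_counts

    def predict(x, class_counts, feature_counts):
        n = sum(class_counts.values())
        scores = {}
        for c, n_c in class_counts.items():
            # P(A) times the product of P(feature_i = value | A); P(B) is dropped
            score = n_c / n
            for i, value in enumerate(x):
                score *= feature_counts[c][(i, value)] / n_c
            scores[c] = score
        # The scores are not probabilities, but their ranking is all we need
        return max(scores, key=scores.get)

    data = [((1, 0), 'A'), ((1, 0), 'A'), ((1, 1), 'A'),
            ((0, 1), 'B'), ((0, 1), 'B'), ((1, 1), 'B')]
    class_counts, feature_counts = train(data)
    print(predict((1, 0), class_counts, feature_counts))  # 'A'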

It turns out, perhaps surprisingly given the assumption it makes, that Naive Bayes is often a very good classification algorithm, commonly used for problems like email spam detection. However, do not use it for estimating probabilities: the feature independence assumption makes it very inaccurate for those.
