FINAL PROJECT

TITLE

Classification of documents using Naive Bayes

ABSTRACT

This project identifies the class of a document based on a set of documents used for training. To be clear, a few documents will be given, each with its corresponding class. The Multinomial Naive Bayes algorithm analyzes these training documents. Naive Bayes relies on the conditional probabilities given by Bayes’ theorem. When a test document is given, the multinomial algorithm evaluates every word in the document with respect to each class and assigns each class a score. Finally, the class with the highest score is taken as the class of the test document.
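The scoring step described above could be sketched roughly as follows. This is a minimal illustration, not the final implementation: the toy documents, the function names (`train`, `score`, `classify`), the whitespace tokenization, the add-one smoothing, and the choice to skip words never seen in training are all my own assumptions.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count documents and words per class from (text, label) pairs."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        words = text.lower().split()  # assumption: simple whitespace tokens
        class_doc_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_doc_counts, word_counts, vocabulary


def score(text, model):
    """Return one log-score per class for a single test document."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    scores = {}
    for label, doc_count in class_doc_counts.items():
        total_words = sum(word_counts[label].values())
        log_score = math.log(doc_count / total_docs)  # log of the class prior
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # assumption: ignore words never seen in training
            # Add-one smoothing (an assumption) avoids log(0) for words
            # that appear in the vocabulary but not in this class.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            log_score += math.log(likelihood)
        scores[label] = log_score
    return scores


def classify(text, model):
    """Pick the class with the highest score."""
    scores = score(text, model)
    return max(scores, key=scores.get)


# Hypothetical toy training set, for illustration only.
training_docs = [
    ("the team won the match", "sports"),
    ("the player scored a goal", "sports"),
    ("bake the cake in the oven", "cooking"),
    ("stir the soup and add salt", "cooking"),
]
model = train(training_docs)
print(classify("the team scored a goal", model))  # prints: sports
```

Scores are summed as logarithms rather than multiplied as raw probabilities, since a product of many small probabilities can underflow on longer documents.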

RELATED DOMAIN OF STUDY

Bayes’ theorem gives the probability of an event based on prior knowledge of conditions related to that event. For example, if a disease is related to geographical location, then using Bayes’ theorem, a person’s geographical location can be used to assess the probability that they have the disease more accurately than an assessment made without knowledge of the person’s location.
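To make the disease example concrete, here is a small sketch of the calculation P(A|B) = P(B|A) * P(A) / P(B). All three input probabilities are made-up numbers chosen only for illustration, and the function name `posterior` is my own.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence


# Hypothetical numbers, for illustration only.
p_disease = 0.01               # P(disease) in the whole population
p_region_given_disease = 0.60  # P(lives in region | disease)
p_region = 0.20                # P(lives in region)

p_disease_given_region = posterior(p_disease, p_region_given_disease, p_region)
print(round(p_disease_given_region, 4))  # prints: 0.03
```

With these assumed numbers, knowing that a person lives in the region raises the estimated probability of disease from 1% to 3%.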

Multinomial Naive Bayes uses Bayes’ theorem to find the probability that a document belongs to a particular class or category. This probability is estimated from the set of documents previously used to train the algorithm.
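Written out, the usual textbook form of this decision rule looks roughly like the following. The symbols are my own notation: $c$ is a class, $w_1, \dots, w_n$ are the words of the test document, $N_c$ is the number of training documents in class $c$, $N$ is the total number of training documents, $\mathrm{count}(w, c)$ is how often word $w$ occurs in the training documents of class $c$, and $V$ is the vocabulary. The add-one smoothing in the last line is an assumption on my part, not something fixed by the project.

```latex
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right]

P(c) = \frac{N_c}{N}

P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}
```

The "naive" part is the assumption that the words are conditionally independent given the class, which is what allows the per-word terms to be summed.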

DATA SOURCES

The data sources for my project are documents containing text. I need two sets of documents: one set of categorized documents for training the algorithm, and another set of uncategorized documents for testing.

For example, have a look at my sample dataset:

Sample_DataSet
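One way the two document sets could be read from disk is sketched below. The folder layout is purely an assumption for illustration (training files under `<root>/<class_name>/*.txt`, test files directly under `<root>/*.txt`), and the function names are hypothetical. The actual sample dataset may be organized differently.

```python
from pathlib import Path


def load_training_documents(root):
    """Read <root>/<class_name>/*.txt into a list of (text, label) pairs."""
    pairs = []
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # only sub-folders are treated as classes
        for path in sorted(class_dir.glob("*.txt")):
            # The folder name is used as the class label (an assumption).
            pairs.append((path.read_text(encoding="utf-8"), class_dir.name))
    return pairs


def load_test_documents(root):
    """Read <root>/*.txt into a {file_name: text} mapping, without labels."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(Path(root).glob("*.txt"))
    }
```

The training loader returns labeled pairs because the algorithm needs the class of each document, while the test loader returns only file names and text, since those documents are uncategorized.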

REFERENCES

NLP Stanford

Wikipedia

Pillar Global

Sebastian Raschka

SCIKIT-Learn

SciKit Library

Basics of Naive Bayes Algorithm
