Text Mining Project

Text analysis is becoming a fundamental tool in data science because it turns unstructured text into machine-readable facts.

The goal of this text mining project is to accomplish three main tasks:

First Task - Data Cleaning and Pre-processing of Facebook comments:

Removing punctuation and stop words;
Tokenization of the text;
Extracting bi-grams;
Splitting the corpus into sentences;
Bag of words;
TF-IDF and document-term matrix;
Implementing the previous steps with pipelines.
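
The pre-processing steps listed above can be sketched in plain Python. This is a minimal illustration using only the standard library, not the project's actual implementation: the stop-word list, the function names and the sample comments are all assumptions made for the example, and the TF-IDF variant shown (term frequency times ln(N / df)) is only one of several common weightings.

```python
import math
import re
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real project would use a fuller one).
STOP_WORDS = {"the", "a", "an", "is", "are", "this", "and", "of", "to", "it", "i"}


def split_sentences(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase, remove punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def bigrams(tokens):
    """Return consecutive token pairs."""
    return list(zip(tokens, tokens[1:]))


def bag_of_words(docs_tokens):
    """Return (vocabulary, document-term count matrix)."""
    vocab = sorted({tok for doc in docs_tokens for tok in doc})
    matrix = []
    for doc in docs_tokens:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


def tf_idf(matrix):
    """Weight each count as (count / doc length) * ln(N / document frequency)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    weighted = []
    for row in matrix:
        total = sum(row) or 1  # avoid dividing by zero for empty documents
        weighted.append(
            [(row[j] / total) * math.log(n_docs / df[j]) for j in range(n_terms)]
        )
    return weighted


def pipeline(comments):
    """Chain the steps: tokenize -> bag of words -> TF-IDF."""
    docs_tokens = [tokenize(c) for c in comments]
    vocab, counts = bag_of_words(docs_tokens)
    return vocab, counts, tf_idf(counts)


# Hypothetical sample comments, for illustration only.
comments = ["Great photo, love it!", "Love this photo.", "Terrible service."]
vocab, counts, weights = pipeline(comments)
```

Terms that occur in every document receive a weight of zero under this weighting, which is why library implementations often use a smoothed IDF instead.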

Second Task - Classification, Clustering and Topic Modeling of SMS Messages (Spam Detection):

Classification with Logistic Regression;
K-means Clustering;
Topic modeling using LDA (Latent Dirichlet Allocation).
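
As an illustration of the classification step only, here is a minimal sketch of logistic regression on bag-of-words counts, trained with batch gradient descent in plain Python. It is not the project's implementation: the toy messages, the labels (1 = spam, 0 = ham), the learning rate and the function names are all assumptions chosen for the example. In practice a library implementation would normally be used, and the clustering and topic-modeling steps are not covered here.

```python
import math
from collections import Counter


def featurize(text, vocab):
    """Bag-of-words count vector for one message over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=200):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this example: p(spam) - label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(text, vocab, w, b):
    """Return 1 (spam) if the predicted probability is at least 0.5, else 0."""
    x = featurize(text, vocab)
    prob = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return 1 if prob >= 0.5 else 0


# Hypothetical toy data, for illustration only (1 = spam, 0 = ham).
messages = [
    ("win a free prize now", 1),
    ("free cash claim your prize", 1),
    ("urgent claim free money now", 1),
    ("are we still meeting for lunch", 0),
    ("see you at home tonight", 0),
    ("can you call me after lunch", 0),
]
vocab = sorted({word for text, _ in messages for word in text.split()})
X = [featurize(text, vocab) for text, _ in messages]
y = [label for _, label in messages]
w, b = train_logreg(X, y)
```

With the trained `w` and `b`, a new message can be scored with `predict("claim your free prize", vocab, w, b)`. Words outside the training vocabulary are simply ignored by `featurize`.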

Third Task - Summarization of a text:

Applying the TextRank algorithm to summarize a text from a WW2 textbook.
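
The idea behind extractive summarization with TextRank can be sketched as follows: sentences become nodes in a graph, edges are weighted by word overlap, and a PageRank-style iteration scores each sentence. This is a simplified sketch using only the standard library, not the project's code. The overlap measure, the damping factor of 0.85, the fixed iteration count and all function names are assumptions made for the example.

```python
import math
import re


def split_sentences(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    """Set of lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def similarity(a, b):
    """Shared-word count normalised by the log of both sentence lengths."""
    wa, wb = words(a), words(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with a weighted PageRank-style iteration."""
    n = len(sentences)
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge between j and i.
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores
    return scores


def summarize(text, n_sentences=2):
    """Return the top-scoring sentences in their original order."""
    sentences = split_sentences(text)
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return " ".join(sentences[i] for i in sorted(top[:n_sentences]))
```

Calling `summarize(text, n_sentences=3)` on a chapter's text would return its three highest-scoring sentences. Sentences that share no words with any other sentence receive only the baseline score of `1 - damping`, so they are the last to be selected.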