Text analysis is becoming a fundamental tool in data science, since parsing text is how machine-readable facts are extracted from it.
This text mining project has three main tasks:
- First Task - Data Cleaning and Pre-processing of Facebook comments:
  - Removing punctuation and stop words;
  - Tokenizing the text;
  - Extracting bi-grams;
  - Splitting the corpus into sentences;
  - Building a bag of words;
  - Computing TF-IDF and the document-term matrix;
  - Implementing the previous steps with pipelines.
- Second Task - Classification, Clustering and Topic Modeling of SMS messages (Spam Detection):
  - Classification with Logistic Regression;
  - K-means Clustering;
  - Topic Modeling using LDA (Latent Dirichlet Allocation).
- Third Task - Summarization of a text:
  - Application of the TextRank algorithm to summarize a text from a WW2 textbook.
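
To make the pre-processing steps of the first task concrete, here is a minimal sketch in plain Python (standard library only). It is not the project's actual implementation: the function names, the tiny `STOP_WORDS` set, the naive sentence splitter, and the `tf * log(N / df)` weighting are all assumptions chosen for illustration. A real pipeline would more likely rely on a dedicated NLP or machine learning library.

```python
import math
import re
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real projects use a fuller list).
STOP_WORDS = {"a", "an", "and", "is", "it", "of", "so", "the", "this", "to"}


def split_sentences(text):
    """Naively split text into sentences after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase, remove punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def bigrams(tokens):
    """Return consecutive token pairs."""
    return list(zip(tokens, tokens[1:]))


def bag_of_words(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    This is one common TF-IDF variant; libraries often apply extra smoothing.
    """
    bags = [bag_of_words(tokenize(doc)) for doc in documents]
    n_docs = len(bags)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights


if __name__ == "__main__":
    # Made-up comments, used only to exercise the functions.
    comments = [
        "Great product, love it!",
        "This is so bad. Never again.",
        "Love the great service.",
    ]
    for comment in comments:
        tokens = tokenize(comment)
        print(split_sentences(comment), tokens, bigrams(tokens))
    print(tf_idf(comments))
```

Terms that occur in fewer documents receive a higher weight than terms shared across many documents, which is the property the document-term matrix relies on for the later classification and clustering steps.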