Text Mining Project

Text analysis is becoming a fundamental tool in data science because it turns unstructured text into machine-readable facts.

The goal of this text mining project is to accomplish three main tasks:

  • First Task - Data Cleaning and Pre-processing on Facebook comments:
  1. Removing punctuation and stop words;
  2. Tokenization of the text;
  3. Bi-grams;
  4. Splitting the corpus into sentences;
  5. Bag of words;
  6. TF-IDF and document term matrix;
  7. Implementation with pipelines of the previous tasks.
  • Second Task - Classification, Clustering and Topic Modeling of SMS messages (Spam Detection):
  1. Classification with Logistic Regression;
  2. K-means Clustering;
  3. Topic Model using LDA (Latent Dirichlet Allocation).
  • Third Task - Summarization of a text:
  1. Application of the TextRank algorithm to summarize a text from a WW2 textbook.
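
The pre-processing steps of the First Task can be sketched in plain Python as follows. This is a minimal illustration using only the standard library, not the project's actual implementation: the stop-word list, the sample comments, and the function names (`tokenize`, `bigrams`, `bag_of_words`, `tf_idf`) are all assumptions made for the example, and the TF-IDF weighting shown (`tf * log(N / df)`) is just one common variant.

```python
import math
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real project would use a fuller one).
STOP_WORDS = {"the", "a", "is", "this", "and", "to", "of"}


def tokenize(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


def bag_of_words(tokens):
    """Count how often each token occurs in one document."""
    return Counter(tokens)


def tf_idf(docs_tokens):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs_tokens)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    weights = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty documents
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


# Made-up sample comments, standing in for the Facebook data.
comments = ["This is a great post!", "Great photo, great day.", "The post is late."]
tokenized = [tokenize(c) for c in comments]
weights = tf_idf(tokenized)
```

Here `tokenized[0]` holds the cleaned tokens of the first comment, `bigrams(tokenized[1])` yields its adjacent word pairs, and `weights[i]` maps each term of comment `i` to its TF-IDF weight. Terms that appear in fewer comments receive a higher weight than terms shared across many.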
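
The TextRank summarization of the Third Task can be sketched in a similar way. The code below is a simplified, standard-library-only variant and should not be read as the project's implementation: sentences are nodes, edge weights are a word-overlap similarity normalised by the log of the sentence lengths, and scores come from a fixed number of PageRank-style iterations. The sentence splitter, the sample text, the parameter defaults, and all function names are assumptions for illustration.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Shared-word count divided by the sum of the log sentence lengths."""
    ta, tb = re.findall(r"\w+", a.lower()), re.findall(r"\w+", b.lower())
    if not ta or not tb:
        return 0.0
    denom = math.log(len(ta)) + math.log(len(tb))
    if denom <= 0:  # both sentences have a single word
        return 0.0
    return len(set(ta) & set(tb)) / denom


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with a weighted PageRank-style update."""
    n = len(sentences)
    sim = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_weight = [sum(row) for row in sim]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping
            * sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, k=2):
    """Return the k highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


# Made-up sample text, standing in for the textbook passage.
sample = (
    "Text mining extracts facts from text. "
    "Summarization selects the key sentences from text. "
    "Cats sleep all day."
)
summary = summarize(sample, k=2)
```

In this sample the third sentence shares no words with the other two, so it receives only the baseline score and is left out of the two-sentence summary.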