
News Comparison with Real-time Data Crawling

Description: This is the final project for the data mining course [CSCI 4502]. It consists of two parts:

  1. data crawling (retrieve articles from specified websites)
  2. data analysis (apply several algorithms to analyze the relations between the crawled articles)

GitHub: https://github.com/LeyenQian/4502_project/tree/master

Run:

  1. Automation.py (requires at least Windows 7 SP1 x64 with the Visual C++ 2015 patch)
    a. crawls articles through Google Chrome
    b. stores the articles under the "result" directory as JSON files
    c. each file name is unique: the SHA-256 hash of the "article identity", i.e. the combination of the article name and link (see the naming sketch after this list)

  2. json_to_csv.py
    a. combines all articles of each news category into a single CSV file for further data analysis (see the CSV-merging sketch after this list)
    b. the CSV files are stored under the "result_csv" directory

  3. Analysis_frequent_itemset.ipynb & Analysis_k_means.ipynb (require a Jupyter environment)
    a. read the CSV files from the "result_csv" directory
    b. analyze the articles and generate graphs under the "plot" directory (see the condensed analysis sketch after this list)

  4. k_shingle.ipynb
    a. the test version of the data analysis, written before online news retrieval was added (see the k-shingling sketch after this list)
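
The SHA-256 naming scheme from step 1c can be sketched as follows. This is a minimal illustration, not the exact code in Automation.py; the save_article helper and the JSON field names are assumptions.

    # Sketch of the naming scheme in step 1c (helper and field names are assumed).
    import hashlib
    import json
    import os

    def save_article(title: str, link: str, body: str, result_dir: str = "result") -> str:
        """Store one article as JSON, named by the SHA-256 of its "article identity" (name + link)."""
        identity = title + link
        file_name = hashlib.sha256(identity.encode("utf-8")).hexdigest() + ".json"
        os.makedirs(result_dir, exist_ok=True)
        path = os.path.join(result_dir, file_name)
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"title": title, "link": link, "body": body}, f, ensure_ascii=False)
        return path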
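
What json_to_csv.py does in step 2 can be sketched like this, assuming one sub-directory per news category under "result" and the same JSON fields as above; the real layout and column names may differ.

    # Sketch of step 2: merge all JSON articles of one category into a single CSV
    # (directory layout and column names are assumptions).
    import csv
    import glob
    import json
    import os

    def category_to_csv(category: str, result_dir: str = "result", out_dir: str = "result_csv") -> None:
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, f"{category}.csv"), "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(["title", "link", "body"])
            for json_path in glob.glob(os.path.join(result_dir, category, "*.json")):
                with open(json_path, encoding="utf-8") as f:
                    article = json.load(f)
                writer.writerow([article.get("title"), article.get("link"), article.get("body")])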
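
A condensed sketch of the kind of analysis the two notebooks in step 3 perform, using the libraries listed under Dependencies; the CSV file name, tokenization, and parameters are assumptions rather than the notebooks' exact code.

    # Condensed analysis sketch (file name, tokenization and parameters are assumed).
    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    df = pd.read_csv("result_csv/politics.csv")  # one of the per-category CSV files

    # Frequent itemsets over the word sets of each article (Analysis_frequent_itemset.ipynb).
    transactions = [list(set(str(body).lower().split())) for body in df["body"]]
    encoder = TransactionEncoder()
    onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions), columns=encoder.columns_)
    itemsets = apriori(onehot, min_support=0.3, use_colnames=True)

    # K-means clustering over TF-IDF vectors of the articles (Analysis_k_means.ipynb).
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(df["body"].astype(str))
    labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(tfidf)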
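
k_shingle.ipynb is built around k-shingling; below is a generic sketch of the technique (the shingle length and the Jaccard comparison are assumptions about what the notebook does).

    # Generic k-shingling sketch: represent each document by its set of k-character
    # shingles and compare two documents by Jaccard similarity.
    def shingles(text: str, k: int = 5) -> set:
        text = " ".join(text.lower().split())  # normalize whitespace
        return {text[i:i + k] for i in range(len(text) - k + 1)}

    def jaccard(a: set, b: set) -> float:
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    doc1 = "Breaking news about the election results"
    doc2 = "Latest news about the election outcome"
    print(jaccard(shingles(doc1), shingles(doc2)))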

Dependencies (may need to be installed through the pip command):

  1. Python libraries for Analysis_frequent_itemset.ipynb and Analysis_k_means.ipynb
    a. pandas
    b. mlxtend
    c. matplotlib
    d. numpy
    e. sklearn

  2. Python libraries for Automation.py
    a. selenium
    b. pytest-shutil
    c. csv
    d. pywin32
    e. typing

  3. Browser
    a. a full Google Chrome installation with the binary at "C:\chrome\chrome.exe"
    b. ChromeDriver, which is included under the "Tools\Chrome_Driver" directory
    c. Chrome version "76.0.3809.100" and ChromeDriver version "76.0.3809.126" are required for stable operation (see the Selenium configuration sketch after this list)
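
With these paths, pointing Selenium at the fixed Chrome binary and the bundled ChromeDriver looks roughly like the sketch below (Selenium 3 style, matching the project's 2019-era Chrome/ChromeDriver versions; the driver file name inside "Tools\Chrome_Driver" is an assumption).

    # Sketch of wiring Selenium to the fixed Chrome binary and bundled ChromeDriver
    # (Selenium 3 style; the driver file name is assumed).
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.binary_location = r"C:\chrome\chrome.exe"

    driver = webdriver.Chrome(executable_path=r"Tools\Chrome_Driver\chromedriver.exe", options=options)
    driver.get("https://example.com")
    driver.quit()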
