Description: This is the final project for the data mining course [CSCI 4502]. It consists of 2 parts:
- data crawling (retrieve articles from specified websites)
- data analysis (use several algorithms to analyze the relations between data)

Link to GitHub: https://github.com/LeyenQian/4502_project/tree/master
Run:
- Automation.py (requires at least Windows 7 SP1 x64 with the Visual C++ 2015 patch)
  a. crawls articles through Google Chrome
  b. stores articles under the "result" directory as JSON files
  c. each file name is unique: it is the SHA-256 value of the "article identity", which is the combination of the article name and link
- json_to_csv.py
  a. combines all articles under each news category into a single CSV file for further data analysis
  b. the CSV files are stored under the "result_csv" directory
- Analysis_frequent_itemset.ipynb & Analysis_k_means.ipynb (require a Jupyter environment)
  a. read the CSV files from the "result_csv" directory
  b. analyze the articles and generate graphs under the "plot" directory
- k_shingle.ipynb
  a. a test version of the data analysis, written before online news retrieval started
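
The storage and conversion steps above can be sketched roughly as follows. This is a minimal illustration, not the code in Automation.py or json_to_csv.py: the function names, the field names ("title", "link", "category", "text"), the one-subdirectory-per-category layout, and the plain concatenation of name and link are all assumptions made for the example.

```python
import csv
import hashlib
import json
from pathlib import Path


def article_file_name(title: str, link: str) -> str:
    """Return a file name built from the SHA-256 of the article identity."""
    # Assumption: the identity is the article name directly followed by the link.
    identity = title + link
    return hashlib.sha256(identity.encode("utf-8")).hexdigest() + ".json"


def save_article(article: dict, result_dir: Path) -> Path:
    """Write one article as JSON under result_dir/<category>/<sha256>.json."""
    # Assumption: articles are grouped in one subdirectory per news category.
    category_dir = result_dir / article["category"]
    category_dir.mkdir(parents=True, exist_ok=True)
    path = category_dir / article_file_name(article["title"], article["link"])
    path.write_text(json.dumps(article, ensure_ascii=False), encoding="utf-8")
    return path


def combine_category(category_dir: Path, csv_dir: Path) -> Path:
    """Combine every JSON article of one category into a single CSV file."""
    csv_dir.mkdir(parents=True, exist_ok=True)
    csv_path = csv_dir / (category_dir.name + ".csv")
    fields = ["title", "link", "category", "text"]  # hypothetical columns
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for json_path in sorted(category_dir.glob("*.json")):
            writer.writerow(json.loads(json_path.read_text(encoding="utf-8")))
    return csv_path
```

Because the file name depends only on the name and link, crawling the same article twice overwrites the existing file instead of creating a duplicate.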
Dependencies: (may need to be installed with pip)
- Python libraries for Analysis_frequent_itemset.ipynb and Analysis_k_means.ipynb
  a. pandas
  b. mlxtend
  c. matplotlib
  d. numpy
  e. sklearn (the pip package name is scikit-learn)
- Python libraries for Automation.py
  a. selenium
  b. pytest-shutil
  c. csv (part of the Python standard library)
  d. pywin32
  e. typing (part of the standard library since Python 3.5)
- Browser
  a. The complete Google Chrome binary, with the executable at "C:\chrome\chrome.exe"
  b. ChromeDriver, which is included under the "Tools\Chrome_Driver" directory
  c. Chrome version "76.0.3809.100" and ChromeDriver version "76.0.3809.126" are required for stable running
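
As a rough aid for checking the browser requirement above, the sketch below compares a Chrome version string with a ChromeDriver version string. It assumes the usual ChromeDriver convention that the two are compatible when their first three components (major, minor, build) agree; the helper names are hypothetical and are not part of this project.

```python
def parse_version(version: str) -> tuple:
    """Split a dotted version such as '76.0.3809.100' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def builds_match(chrome_version: str, driver_version: str) -> bool:
    """Return True when major, minor, and build numbers are identical."""
    # Assumption: the fourth (patch) component is allowed to differ.
    return parse_version(chrome_version)[:3] == parse_version(driver_version)[:3]


if __name__ == "__main__":
    # The versions listed in this README share the 76.0.3809 build.
    print(builds_match("76.0.3809.100", "76.0.3809.126"))
```

A check like this only compares version strings; it does not verify that the executables at the listed paths actually report those versions.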