News Comparison with Real-time Data Crawling

Description: This is the final project for the data mining course (CSCI 4502). It consists of two parts:

1. Data crawling: retrieve articles from specified websites.
2. Data analysis: apply several algorithms to analyze the relationships among the articles.

Link to GitHub: https://github.com/LeyenQian/4502_project/tree/master

Run:

1. Automation.py (requires Windows 7 SP1 x64 or later with the Visual C++ 2015 patch)
   a. crawls articles through Google Chrome
   b. stores the articles under the "result" directory as JSON files
   c. each file name is unique: it is the SHA-256 hash of the "article identity", which is the combination of the article name and link
2. json_to_csv.py
   a. combines all articles in each news category into a single CSV file for further data analysis
   b. the CSV files are stored under the "result_csv" directory
3. Analysis_frequent_itemset.ipynb & Analysis_k_means.ipynb (require a Jupyter environment)
   a. read the CSV files from the "result_csv" directory
   b. analyze the articles and generate graphs under the "plot" directory
4. k_shingle.ipynb
   a. this is the test version of the data analysis, written before online news retrieval began
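
The storage and conversion steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. The field names `title`, `link`, and `text`, the function names, the plain concatenation of title and link before hashing, and the one-subdirectory-per-category layout are all assumptions made for the example.

```python
import csv
import hashlib
import json
from pathlib import Path


def article_filename(title: str, link: str) -> str:
    """Return a file name derived from the SHA-256 of the article identity.

    Assumption: the identity is the title concatenated with the link,
    encoded as UTF-8. The project may combine them differently.
    """
    identity = (title + link).encode("utf-8")
    return hashlib.sha256(identity).hexdigest() + ".json"


def save_article(article: dict, category_dir: Path) -> Path:
    """Write one article as JSON; the same title and link map to the same file."""
    category_dir.mkdir(parents=True, exist_ok=True)
    path = category_dir / article_filename(article["title"], article["link"])
    path.write_text(json.dumps(article, ensure_ascii=False), encoding="utf-8")
    return path


def combine_to_csv(category_dir: Path, csv_path: Path) -> int:
    """Merge every JSON article in category_dir into one CSV; return the row count."""
    fields = ["title", "link", "text"]  # hypothetical schema
    rows = [
        json.loads(p.read_text(encoding="utf-8"))
        for p in sorted(category_dir.glob("*.json"))
    ]
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        # Keys outside the assumed schema are dropped; missing keys become "".
        writer = csv.DictWriter(fh, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Because the file name depends only on the article identity, crawling the same article twice overwrites the existing file instead of creating a duplicate.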

Dependencies: (may need to be installed with pip)

1. Python libraries for Analysis_frequent_itemset.ipynb and Analysis_k_means.ipynb
   a. pandas
   b. mlxtend
   c. matplotlib
   d. numpy
   e. sklearn (installed from PyPI as scikit-learn)
2. Python libraries for Automation.py
   a. selenium
   b. pytest-shutil
   c. csv (part of the Python standard library)
   d. pywin32
   e. typing (in the standard library since Python 3.5)
3. Browser
   a. the complete Google Chrome binary distribution, with the executable at "C:\chrome\chrome.exe"
   b. ChromeDriver, which is included under the "Tools\Chrome_Driver" directory
   c. Chrome version "76.0.3809.100" and ChromeDriver version "76.0.3809.126" are required for stable operation
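
The version pairing above can be checked before a crawl starts. The sketch below assumes the usual ChromeDriver convention that a driver matches a Chrome build when the first three version components (major, minor, build) are equal. The function names are hypothetical and are not part of the project.

```python
def build_prefix(version: str) -> tuple:
    """Return (major, minor, build) from a dotted version string."""
    parts = version.strip().split(".")
    if len(parts) < 3:
        raise ValueError(f"unexpected version format: {version!r}")
    return tuple(int(p) for p in parts[:3])


def versions_compatible(chrome: str, driver: str) -> bool:
    """True when Chrome and ChromeDriver share major, minor, and build numbers."""
    return build_prefix(chrome) == build_prefix(driver)


# The pair named in this README differs only in the last (patch) component.
print(versions_compatible("76.0.3809.100", "76.0.3809.126"))
```

How the two version strings are obtained (for example, from the executables themselves) is left out here, since it depends on the local setup.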

