
[ACL 2024] This is the code repo for our ACL’24 paper "Cleaner Pretraining Corpus Curation with Neural Web Scraping".


MiraclePlus/NeuScraper

 
 


NeuScraper

Source code for our ACL'24 paper: Cleaner Pretraining Corpus Curation with Neural Web Scraping

If you find this work useful, please cite our paper and star the repository.

Quick Start

1️⃣ Download checkpoint for NeuScraper

git lfs install
git clone https://huggingface.co/OpenMatch/neuscraper-v1-clueweb
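If Git LFS is not active when the checkpoint repository is cloned, large files are checked out as small text pointer files rather than the real weights. The sketch below is one way to detect that case before loading the model. The helper name `is_lfs_pointer` is hypothetical and not part of NeuScraper; it relies only on the standard prefix that Git LFS writes at the top of a pointer file.

```python
# Prefix that Git LFS writes as the first line of every pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` looks like a Git LFS pointer.

    Hypothetical helper for illustration; not part of neu_scraper.
    """
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX
```

If `is_lfs_pointer('neuscraper-v1-clueweb/training_state_checkpoint.tar')` returns True, running `git lfs pull` inside the cloned checkpoint directory should fetch the actual file.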

2️⃣ Clone from git

git clone https://github.com/MiraclePlus/NeuScraper
cd NeuScraper

3️⃣ Environment

Install PyTorch first:

pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

Then install the remaining dependencies:

pip install -r requirements.txt

4️⃣ Install as a package

Install the neu_scraper package in editable mode:

pip install -e .

You can also build a wheel and install from it:

python setup.py bdist_wheel
pip install dist/neu_scraper-0.1-py3-none-any.whl

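5️⃣ Usage

from neu_scraper import predict
import requests

# Page to extract, and the checkpoint downloaded in step 1
url = 'https://blog.christianperone.com/2023/06/appreciating-llms-data-pipelines/'
model_path = '../neuscraper-v1-clueweb/training_state_checkpoint.tar'

# Fetch the raw HTML and fail early on HTTP errors
response = requests.get(url, timeout=30)
response.raise_for_status()
html = response.content.decode('utf-8')

# Run NeuScraper on the page and print the result
result = predict(html, url, model_path)
print(result)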
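The example above decodes the response body as UTF-8, which raises `UnicodeDecodeError` on pages served in another encoding. A minimal sketch of a more tolerant decoding step follows. The helper `decode_html` is hypothetical and not part of neu_scraper: it tries the declared charset first, then UTF-8, and finally falls back to UTF-8 with replacement characters.

```python
from typing import Optional


def decode_html(raw: bytes, declared_encoding: Optional[str] = None) -> str:
    """Decode raw HTML bytes without raising on unexpected encodings.

    Hypothetical helper for illustration; not part of neu_scraper.
    """
    # Try the charset the server declared (if any), then plain UTF-8.
    for encoding in (declared_encoding, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or undecodable bytes: try the next option.
            continue
    # Last resort: keep the text, marking undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")
```

With `requests`, this could be used as `html = decode_html(response.content, response.encoding)` before calling `predict(html, url, model_path)`.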

Citation

@inproceedings{xu2024cleaner,
  title={Cleaner Pretraining Corpus Curation with Neural Web Scraping},
  author={Xu, Zhipeng and Liu, Zhenghao and Yan, Yukun and Liu, Zhiyuan and Xiong, Chenyan and Yu, Ge},
  booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics},
  year={2024}
}

Contact Us

If you have questions, suggestions, or bug reports, please email us and we will do our best to help.

xuzhipeng@stumail.neu.edu.cn
