Merge branch 'IBM:dev' into dev-pankaj
pankajskku authored Oct 8, 2024
2 parents ba2d09d + e581989 commit bd08600
Showing 1 changed file with 23 additions and 2 deletions.
25 changes: 23 additions & 2 deletions README.md
@@ -30,7 +30,8 @@ The goal is to offer high-level APIs for developers to quickly get started in wo
- [Scaling transforms from laptop to cluster](#laptop_cluster)
- [Repository Use and Navigation](doc/repo.md)
- [How to Contribute](CONTRIBUTING.md)
- [Papers and Talks](#talks_papers)
- [Talks and Papers](#talks_papers)
- [Citations](#citations)

## &#x1F4D6; About <a name = "about"></a>

@@ -131,7 +132,7 @@ The matrix below shows the combination of modules and supported runtimes. Al
| **Data Ingestion** | | | | |
| [Code (from zip) to Parquet](transforms/code/code2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [PDF to Parquet](transforms/language/pdf2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [HTML to Parquet](transforms/universal/html2parquet/python/README.md) | :white_check_mark: | | | |
| [HTML to Parquet](transforms/language/html2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | |
| **Universal (Code & Language)** | | | | |
| [Exact dedup filter](transforms/universal/ededup/ray/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [Fuzzy dedup filter](transforms/universal/fdedup/ray/README.md) | | :white_check_mark: | | :white_check_mark: |
@@ -220,3 +221,23 @@ You can run transforms via docker image or using virtual environments. This [doc
5. Talk on "Hands on session for fine tuning LLMs" [Video](https://www.youtube.com/watch?v=VEHIA3E64DM)
6. Talk on "Build your own data preparation module using data-prep-kit" [Video](https://www.youtube.com/watch?v=0WUMG6HIgMg)

## Citations <a name = "citations"></a>

If you use Data Prep Kit in your research, please cite our paper:

```bibtex
@misc{wood2024dataprepkitgettingdataready,
  title={Data-Prep-Kit: getting your data ready for LLM application development},
  author={David Wood and Boris Lublinsky and Alexy Roytman and Shivdeep Singh
    and Abdulhamid Adebayo and Revital Eres and Mohammad Nassar and Hima Patel
    and Yousaf Shah and Constantin Adam and Petros Zerfos and Nirmit Desai
    and Daiki Tsuzuku and Takuya Goto and Michele Dolfi and Saptha Surendran
    and Paramesvaran Selvam and Sungeun An and Yuan Chi Chang and Dhiraj Joshi
    and Hajar Emami-Gohari and Xuan-Hong Dang and Yan Koyfman and Shahrokh Daijavad},
  year={2024},
  eprint={2409.18164},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2409.18164},
}
```
