bheu24-hf-cm4ds

BioHackathon Europe 2024 repository to explore Croissant ML metadata extraction from HuggingFace.

BioHackathon Proposal

Abstract: To advance the use of Machine Learning for the understanding of diseases and the conservation of biodiversity, it is important to promote FAIR AI-ready datasets, since data scientists and bioinformaticians spend an estimated 80% of their time finding and preparing data. Metadata descriptors for datasets are pivotal for the creation of Machine Learning models, as they facilitate the definition of strategies for data discovery, feature selection, data cleaning and data pre-processing.

Once a dataset is AI-ready, such metadata descriptors change with respect to the initial version of the raw data. What can we learn from the metadata of raw vs AI-ready datasets? What transformations from raw to AI-ready could be (semi)automated based on metadata descriptors? In this project, we will manually analyze and curate metadata descriptors before and after AI-readiness. Based on our analysis, we will identify dataset transformations that could be (semi)automated by software pipelines, with the aim of reducing the effort and time invested in data pre-processing for Machine Learning.

ML-ready datasets, whether by design or after pre-processing, can be enriched with metadata so they become FAIRer, with all the benefits that come from FAIR. Croissant ML is an extension of schema.org to better describe ML-ready datasets, released in early 2024 and already adopted by some ML-model platforms such as Hugging Face (see Croissant ML viewer documentation) and OpenML. However, as commonly happens with metadata, there are limits to the amount of metadata that can be automatically extracted. How much Croissant metadata can be programmatically extracted from ML-ready datasets? How could this automation be improved? In this project, we will explore answers to these two questions.
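As a starting point for the first question, the following minimal sketch shows one way to retrieve the Croissant JSON-LD that Hugging Face exposes for a dataset. The endpoint pattern, the helper names `croissant_url` and `fetch_croissant`, and the example dataset id are assumptions made for illustration and are not part of this repository; please check the Hugging Face documentation before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint pattern for the Hugging Face Croissant export.
HF_CROISSANT_URL = "https://huggingface.co/api/datasets/{repo_id}/croissant"


def croissant_url(repo_id: str) -> str:
    """Build the Croissant endpoint URL for a dataset id such as 'user/name'."""
    # Keep the '/' between user and dataset name, escape everything else.
    safe_id = urllib.parse.quote(repo_id, safe="/")
    return HF_CROISSANT_URL.format(repo_id=safe_id)


def fetch_croissant(repo_id: str, timeout: float = 30.0) -> dict:
    """Download and parse the Croissant JSON-LD for one dataset (needs network)."""
    with urllib.request.urlopen(croissant_url(repo_id), timeout=timeout) as response:
        return json.load(response)


# Example call (illustrative dataset id, requires network access):
# metadata = fetch_croissant("ylecun/mnist")
# print(sorted(metadata.keys()))
```

Gated or private datasets may require an authorization header, which this sketch omits.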

The results will later be integrated into a metadata-based reproducibility assessment cycle, part of the NFDI4DataScience project in Germany. To facilitate the work during the BioHackathon, we will focus on datasets in HuggingFace and/or datasets from the DOME registry, as this already indicates some level of metadata availability. The AI-ready metadata descriptors will use the Croissant schema proposed by MLCommons. This project will also take into account previous work done at the BioHackathon 2022 on metadata for synthetic data.

Note: we have modified the initial idea, as metadata created by metadata experts is time-consuming to produce and does not necessarily reflect the metadata level that is commonly available in ML-model platforms. We think that a programmatic approach will be of more benefit for the ultimate goal of improving the automatic extraction and enrichment of metadata for AI/ML-ready datasets.

Co-leads: Leyla Jael Castro, Nuria Queralt Rosinach

BioHackathon Plan

Depending on the participants' skills and time (some might be collaborating on another project), the plan below will be modified and fine-tuned.

Get familiar with Croissant ML specification, examples and Python library
Retrieve Croissant metadata for Hugging Face datasets
Create a Knowledge Graph from the harvested metadata
Analyze how much of the Croissant schema is covered
Discuss what sort of corpus would be necessary to increase coverage with ML approaches
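
The knowledge graph and coverage steps of the plan could be prototyped roughly as follows. This is a sketch under stated assumptions, not the project's implementation: `EXPECTED_PROPERTIES` is an illustrative selection of Croissant and schema.org property names rather than the authoritative specification, the `sample` record is made-up toy data, and plain Python tuples stand in for a real triple store.

```python
from collections import Counter

# Illustrative property list (an assumption, not the authoritative Croissant spec).
EXPECTED_PROPERTIES = (
    "name", "description", "license", "url", "creator",
    "datePublished", "keywords", "version", "distribution", "recordSet",
)


def present_properties(metadata, expected=EXPECTED_PROPERTIES):
    """Return the expected properties that have a non-empty value."""
    return [p for p in expected if metadata.get(p) not in (None, "", [], {})]


def coverage(metadata, expected=EXPECTED_PROPERTIES):
    """Fraction of expected properties that are filled in for one dataset."""
    return len(present_properties(metadata, expected)) / len(expected)


def property_counts(records, expected=EXPECTED_PROPERTIES):
    """Count, across many datasets, how often each expected property is filled in."""
    counts = Counter()
    for metadata in records:
        counts.update(present_properties(metadata, expected))
    return counts


def to_triples(metadata):
    """Flatten top-level literal values into (subject, predicate, object) tuples."""
    subject = metadata.get("url") or metadata.get("name", "unknown-dataset")
    triples = []
    for key, value in metadata.items():
        if key.startswith("@"):  # skip JSON-LD keywords such as @context and @type
            continue
        values = value if isinstance(value, list) else [value]
        for item in values:
            # Nested objects (e.g. creators, record sets) are ignored in this sketch.
            if isinstance(item, (str, int, float, bool)):
                triples.append((subject, key, item))
    return triples


# Made-up toy record, only to show the expected input shape.
sample = {
    "@type": "sc:Dataset",
    "name": "demo",
    "description": "A toy dataset description.",
    "license": "mit",
    "url": "https://example.org/demo",
    "keywords": ["biology", "images"],
    "recordSet": [],
}

print(f"coverage: {coverage(sample):.0%}")
print(to_triples(sample))
```

Counting an empty list as missing is a design choice of this sketch: a property that is present but carries no information should arguably not increase coverage. A fuller prototype would likely load the JSON-LD into a dedicated RDF library and also descend into nested objects.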

Participants

Dhwani Solanki
Rohitha Ravinder
Nelson Quiñones
