
datasets: add support for BIOfid dataset #1589

Merged on May 7, 2020 (2 commits)

@stefan-it commented May 7, 2020
Member

Hi,

this PR adds support for the BIOfid dataset:

The Specialized Information Service Biodiversity Research (BIOfid) has been launched to mobilize valuable biological data from printed literature hidden in German libraries for over the past 250 years. In this project, we annotate German texts converted by OCR from historical scientific literature on the biodiversity of plants, birds, moths and butterflies. Our work enables the automatic extraction of biological information previously buried in the mass of papers and volumes. For this purpose, we generated training data for the tasks of Named Entity Recognition (NER) and Taxa Recognition (TR) in biological documents. We use this data to train a number of leading machine learning tools and create a gold standard for TR in biodiversity literature. More specifically, we perform a practical analysis of our newly generated BIOfid dataset through various downstream-task evaluations and establish a new state of the art for TR with 80.23% F-score. In this sense, our paper lays the foundations for future work in the field of information extraction in biology texts.

More information can be found in the paper "BIOfid Dataset: Publishing a German Gold Standard for Named Entity Recognition in Historical Biodiversity Literature" by Ahmed et al.

Data is taken from their repository.

Usage

Here's an example of how to use the dataset in Flair:

import flair.datasets

biofid = flair.datasets.BIOFID()

print(biofid)

Some sanity checks:

tag_type = "ner"
tag_dictionary = biofid.make_tag_dictionary(tag_type=tag_type)

# Strip the two-character BIO prefix (e.g. "B-") to get the bare entity labels
labels = {item[2:] for item in tag_dictionary.get_items() if "-" in item}

assert len(labels) == 6  # As mentioned in the paper
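
The label check above works by dropping the prefix from each tag and collecting what remains. A minimal standalone sketch of that step follows; it does not use Flair, and the tag list is a made-up illustration (only TIME and TAXON are taken from this conversation), not the actual BIOfid tag dictionary. The helper name `entity_labels` is hypothetical.

```python
def entity_labels(tags):
    """Return the set of entity labels with their prefix (e.g. "B-") removed."""
    # Tags without a "-" (such as "O" or "<unk>") carry no entity label and are skipped.
    # Splitting on the first "-" also handles prefixes other than "B-"/"I-".
    return {tag.split("-", 1)[1] for tag in tags if "-" in tag}


# Hypothetical tags for illustration only; not the real BIOfid dictionary.
example_tags = ["O", "B-TAXON", "I-TAXON", "B-TIME", "I-TIME", "<unk>"]
print(sorted(entity_labels(example_tags)))
```

Splitting on the first hyphen rather than slicing with `[2:]` is a design choice of this sketch: it gives the same result for two-character prefixes and stays correct if a tagging scheme uses a different prefix length.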

However, the number of sentences differs slightly:

len_train = len(biofid.train)
len_dev = len(biofid.dev)
len_test = len(biofid.test)

print(len_train + len_dev + len_test)

This outputs a total of 15,836 sentences, whereas the paper reports 15,833.
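
A gap of a few sentences can come from how sentence boundaries are counted. As a rough sketch, assuming the data is stored in a CoNLL-style column format where blank lines separate sentences (an assumption, not something verified here for BIOfid), sentences could be counted independently of Flair as shown below. The function name `count_sentences` and the sample tokens are hypothetical and serve only to illustrate the counting convention; this is not a diagnosis of the actual difference.

```python
def count_sentences(conll_text):
    """Count blocks of consecutive non-blank lines in CoNLL-style text."""
    count = 0
    in_sentence = False
    for line in conll_text.splitlines():
        if line.strip():
            # The first non-blank line after a gap starts a new sentence.
            if not in_sentence:
                count += 1
                in_sentence = True
        else:
            # Any run of blank lines ends the current sentence exactly once.
            in_sentence = False
    return count


# Invented sample: two sentences separated by two blank lines.
sample = "Die O\nBuche B-TAXON\n\n\nIm O\nJahr O\n1900 B-TIME\n"
print(count_sentences(sample))
```

Running such a count over the raw train, dev and test files and comparing the result with `len(biofid.train) + len(biofid.dev) + len(biofid.test)` would show whether the reader and the raw files agree.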

@alanakbik
Collaborator

Cool dataset, thanks for integrating this! It even has some tags, like TIME and TAXON, that the other German NER datasets don't have. I hope we can get multi-task learning into Flair at some point so that we could, for instance, train NER models using all German datasets.

@alanakbik
Collaborator

@stefan-it the line from .sequence_labeling import BIOFID should be added to datasets/__init__.py to make the example code run.

@stefan-it
Member Author

@alanakbik Oh, forgot to push that import 😅 thanks for that hint, should be added now :)

@alanakbik
Collaborator

👍

@alanakbik merged commit 821de2e into master on May 7, 2020
@alanakbik deleted the add-biofid-dataset branch on May 18, 2020 11:34