datasets: add support for BIOfid dataset #1589

stefan-it · 2020-05-07T13:48:52Z

Hi,

this PR adds support for the BIOfid dataset:

The Specialized Information Service Biodiversity Research (BIOfid) has been launched to mobilize valuable biological data from printed literature hidden in German libraries for over the past 250 years. In this project, we annotate German texts converted by OCR from historical scientific literature on the biodiversity of plants, birds, moths and butterflies. Our work enables the automatic extraction of biological information previously buried in the mass of papers and volumes. For this purpose, we generated training data for the tasks of Named Entity Recognition (NER) and Taxa Recognition (TR) in biological documents. We use this data to train a number of leading machine learning tools and create a gold standard for TR in biodiversity literature. More specifically, we perform a practical analysis of our newly generated BIOfid dataset through various downstream-task evaluations and establish a new state of the art for TR with 80.23% F-score. In this sense, our paper lays the foundations for future work in the field of information extraction in biology texts.

More information can be found in the paper "BIOfid Dataset: Publishing a German Gold Standard for Named Entity Recognition in Historical Biodiversity Literature" from Ahmed et al.

Data is taken from their repository.

Usage

Here's an example of how to use the dataset in Flair:

import flair.datasets

biofid = flair.datasets.BIOFID()

print(biofid)

Some sanity checks:

tag_type = "ner"
tag_dictionary = biofid.make_tag_dictionary(tag_type=tag_type)

# Strip the "B-"/"I-" prefix; entries without a "-" (such as "O") are skipped
labels = {item[2:] for item in tag_dictionary.get_items() if "-" in item}

assert len(labels) == 6  # As mentioned in the paper
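
To make the label check above easier to follow without downloading the corpus, here is a minimal standalone sketch of the same prefix-stripping logic. The helper name `label_types` and the example tag list are illustrative assumptions, not part of Flair or the actual BIOfid tag dictionary.

```python
def label_types(tags):
    """Return the entity types found in BIO-style tags."""
    # Keep only items containing "-", which skips "O" and special entries
    # such as "<unk>", then drop the two-character "B-" or "I-" prefix.
    return {tag[2:] for tag in tags if "-" in tag}


# Illustrative tag list (assumption), not the real BIOfid tag dictionary.
example_tags = ["<unk>", "O", "B-TAXON", "I-TAXON", "B-TIME", "I-TIME"]
print(sorted(label_types(example_tags)))  # prints ['TAXON', 'TIME']
```

Note that this approach assumes every tagged item uses a two-character prefix followed by the label, so a label that itself contains a hyphen would still be kept intact after the prefix.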

However, the number of sentences differs slightly:

len_train = len(biofid.train)
len_dev = len(biofid.dev)
len_test = len(biofid.test)

print(len_train + len_dev + len_test)

This prints 15836, whereas the paper reports 15,833 sentences.
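
The comparison can also be written as a small helper that reports the gap explicitly. This is only a sketch: the function name `sentence_count_gap` is hypothetical, and the split sizes in the demo are toy values, not the real BIOfid split sizes.

```python
def sentence_count_gap(split_sizes, reported_total):
    """Return (observed_total, observed_total - reported_total)."""
    observed_total = sum(split_sizes.values())
    return observed_total, observed_total - reported_total


# Toy split sizes for illustration only (assumption). With Flair installed,
# one would instead pass:
# {"train": len(biofid.train), "dev": len(biofid.dev), "test": len(biofid.test)}
toy_sizes = {"train": 6, "dev": 2, "test": 2}
observed, gap = sentence_count_gap(toy_sizes, reported_total=9)
print(f"observed={observed}, gap={gap:+d}")  # prints observed=10, gap=+1
```

For the totals quoted in this PR, the gap is 15836 - 15833 = 3 sentences.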

alanakbik · 2020-05-07T14:28:16Z

Cool dataset, thanks for integrating this! It even has some tags, like TIME and TAXON, that the other German NER datasets don't have. I hope we can get multi-task learning into Flair at some point so that we could, for instance, train NER models using all German datasets.

alanakbik · 2020-05-07T14:31:17Z

@stefan-it the line from .sequence_labeling import BIOFID should be added to datasets/__init__.py to make the example code run.

stefan-it · 2020-05-07T14:39:14Z

@alanakbik Oh, forgot to push that import 😅 thanks for that hint, should be added now :)

alanakbik · 2020-05-07T15:17:23Z

👍

datasets: add support for BIOfid dataset

5751c49

datasets: import BIOFID dataset

0d8c12a

alanakbik merged commit 821de2e into master May 7, 2020

alanakbik deleted the add-biofid-dataset branch May 18, 2020 11:34
