BIOMEDICAL DATASETS: UnicodeDecodeError: Error loading several biomedical datasets #1874

datasri · 2020-09-21T17:02:14Z

Describe the bug

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5851: character maps to <undefined>

Loading several biomedical datasets fails with a Unicode decoding error. For example, I get the error shown in the traceback below when trying to load HUNER_CHEMICAL_CEMP().

To Reproduce
from flair.datasets import HUNER_CHEMICAL_CEMP
corpus = HUNER_CHEMICAL_CEMP()

Expected behavior
Corpus should load properly.

Screenshots

UnicodeDecodeError Traceback (most recent call last)
in
2 # # # 1. get all corpora for a specific entity type
3 # # from flair.models import SequenceTagger
----> 4 corpus = HUNER_CHEMICAL_CEMP()

~\AppData\Local\Continuum\anaconda3\envs\fgpu\lib\site-packages\flair\datasets\biomedical.py in __init__(self, *args, **kwargs)
3999
4000 def __init__(self, *args, **kwargs):
-> 4001 super().__init__(*args, **kwargs)
4002
4003 @staticmethod

~\AppData\Local\Continuum\anaconda3\envs\fgpu\lib\site-packages\flair\datasets\biomedical.py in __init__(self, base_path, in_memory, sentence_splitter)
522
523 writer = CoNLLWriter(sentence_splitter=self.sentence_splitter)
--> 524 internal_dataset = self.to_internal(data_folder)
525
526 train_data = self.get_subset(internal_dataset, "train", splits_dir)

~\AppData\Local\Continuum\anaconda3\envs\fgpu\lib\site-packages\flair\datasets\biomedical.py in to_internal(self, data_dir)
4009 train_text_file = train_folder / "chemdner_patents_train_text.txt"
4010 train_ann_file = train_folder / "chemdner_cemp_gold_standard_train.tsv"
-> 4011 train_data = CEMP.parse_input_file(train_text_file, train_ann_file)
4012
4013 dev_folder = CEMP.download_dev_corpus(data_dir)

~\AppData\Local\Continuum\anaconda3\envs\fgpu\lib\site-packages\flair\datasets\biomedical.py in parse_input_file(text_file, ann_file)
3956
3957 with open(str(text_file), "r") as text_reader:
-> 3958 for line in text_reader:
3959 if not line:
3960 continue

~\AppData\Local\Continuum\anaconda3\envs\fgpu\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5851: character maps to <undefined>
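
The traceback shows the file being opened without an explicit encoding, so Python falls back to the platform default, which the traceback identifies as cp1252. Byte 0x9d has no mapping in cp1252, but it does occur inside valid UTF-8 sequences. The following minimal sketch reproduces the same class of error without Flair. It assumes the dataset files are UTF-8 encoded, which the report does not state, and `try_decode` is a hypothetical helper used only for illustration.

```python
# The UTF-8 encoding of the right double quotation mark (U+201D)
# is the byte sequence E2 80 9D, so it ends in byte 0x9d.
raw = "\u201d".encode("utf-8")


def try_decode(data: bytes, encoding: str) -> str:
    """Hypothetical helper: return decoded text, or the error message on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        return f"failed: {err}"


# cp1252 has no character assigned to 0x9d, so decoding fails.
cp1252_result = try_decode(raw, "cp1252")
# Decoding the same bytes as UTF-8 succeeds.
utf8_result = try_decode(raw, "utf-8")

print(cp1252_result)
print(utf8_result)
```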

Environment (please complete the following information):

OS [Windows]:
Version [e.g. flair-0.6]:

Additional context

In fact, apart from HUNER_CHEMICAL_CDR(), HUNER_DISEASE_CDR(), HUNER_DISEASE_NCBI(), and HUNER_DISEASE_SCAI(), I get a similar Unicode error when loading all the other chemical datasets (HUNER_CHEMICAL_CEMP(), HUNER_CHEMICAL_CHEBI(), HUNER_CHEMICAL_CHEMDNER(), HUNER_CHEMICAL_SCAI()) and disease datasets (HUNER_DISEASE_MIRNA(), HUNER_DISEASE_VARIOME()).

A similar error also occurs when loading datasets as shown below:
from flair.datasets import CEMP
corpus = CEMP()

The dataset name can be replaced with any of the other datasets mentioned above, and a similar Unicode error occurs.

leonweber · 2020-09-22T12:41:50Z

Thanks for reporting this @datasri . I will look into this in the next couple of days.

datasri · 2020-09-22T12:53:56Z

Thanks @leonweber

leonweber · 2020-09-28T12:28:17Z

Hey @datasri, I have pushed changes to a new branch that fix the issue on my local Windows machine. Could you please try installing the fixes via pip install git+https://github.com/flairNLP/flair@GH-1874-Biomedical-Fix-Unicode-Windows and see whether this resolves your issue? You may have to clear the cached datasets in your Flair folder (usually /Users//.flair).
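
The linked pull request is titled "Explicit encodings for Windows Support". As a rough sketch of that general pattern, and not the actual code from the branch, a reader function can pass `encoding="utf-8"` to `open` so that the result no longer depends on the platform default. The helper name `read_lines`, the sample file, and the assumption that the data is UTF-8 are all illustrative.

```python
import tempfile
from pathlib import Path
from typing import List


def read_lines(text_file: Path) -> List[str]:
    """Hypothetical helper: read non-empty lines, always decoding as UTF-8."""
    lines = []
    # An explicit encoding avoids falling back to cp1252 on Windows.
    with open(str(text_file), "r", encoding="utf-8") as text_reader:
        for line in text_reader:
            line = line.strip()
            if line:
                lines.append(line)
    return lines


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    # The curly quotes encode to bytes (including 0x9d) that cp1252 cannot decode.
    sample.write_bytes("a \u201cquoted\u201d term\n\n".encode("utf-8"))
    lines = read_lines(sample)

print(lines)
```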

datasri · 2020-09-29T19:22:18Z

Thanks @leonweber . Now I am able to load all Biomedical datasets for disease and chemicals. Thank you very much for quickly resolving the issue.

datasri added the bug Something isn't working label Sep 21, 2020

datasri changed the title ~~BIOMEDICAL DATASETS: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5851: character maps to <undefined>~~ BIOMEDICAL DATASETS: UnicodeDecodeError: Error loading several biomedical datasets Sep 21, 2020

leonweber self-assigned this Sep 22, 2020

leonweber mentioned this issue Sep 30, 2020

Biomedical: Explicit encodings for Windows Support #1893

Merged

alanakbik closed this as completed in #1893 Oct 7, 2020

alanakbik added a commit that referenced this issue Oct 7, 2020

Merge pull request #1893 from flairNLP/GH-1874-Biomedical-Fix-Unicode…

41dbe1f

…-Windows Biomedical: Explicit encodings for Windows Support
