BIOMEDICAL DATASETS: UnicodeDecodeError: Error loading several biomedical datasets #1874

Closed
datasri opened this issue Sep 21, 2020 · 4 comments · Fixed by #1893
Labels: bug (Something isn't working)

datasri commented Sep 21, 2020

Describe the bug

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5851: character maps to <undefined>

This Unicode error is raised when loading several biomedical datasets. For example, I get it when trying to load HUNER_CHEMICAL_CEMP().

To Reproduce
from flair.datasets import HUNER_CHEMICAL_CEMP
corpus = HUNER_CHEMICAL_CEMP()

Expected behavior
Corpus should load properly.

Screenshots

UnicodeDecodeError Traceback (most recent call last)
in
2 # # # 1. get all corpora for a specific entity type
3 # # from flair.models import SequenceTagger
----> 4 corpus = HUNER_CHEMICAL_CEMP()

~\AppData\Local\Continuum\anaconda3\envs\fgpu\lib\site-packages\flair\datasets\biomedical.py in __init__(self, *args, **kwargs)
3999
4000 def __init__(self, *args, **kwargs):
-> 4001 super().__init__(*args, **kwargs)
4002
4003 @staticmethod

~\AppData\Local\Continuum\anaconda3\envs\fgpu\lib\site-packages\flair\datasets\biomedical.py in __init__(self, base_path, in_memory, sentence_splitter)
522
523 writer = CoNLLWriter(sentence_splitter=self.sentence_splitter)
--> 524 internal_dataset = self.to_internal(data_folder)
525
526 train_data = self.get_subset(internal_dataset, "train", splits_dir)

~\AppData\Local\Continuum\anaconda3\envs\fgpu\lib\site-packages\flair\datasets\biomedical.py in to_internal(self, data_dir)
4009 train_text_file = train_folder / "chemdner_patents_train_text.txt"
4010 train_ann_file = train_folder / "chemdner_cemp_gold_standard_train.tsv"
-> 4011 train_data = CEMP.parse_input_file(train_text_file, train_ann_file)
4012
4013 dev_folder = CEMP.download_dev_corpus(data_dir)

~\AppData\Local\Continuum\anaconda3\envs\fgpu\lib\site-packages\flair\datasets\biomedical.py in parse_input_file(text_file, ann_file)
3956
3957 with open(str(text_file), "r") as text_reader:
-> 3958 for line in text_reader:
3959 if not line:
3960 continue

~\AppData\Local\Continuum\anaconda3\envs\fgpu\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5851: character maps to <undefined>

Environment (please complete the following information):

  • OS: Windows
  • Version [e.g. flair-0.6]:

Additional context

In fact, except for HUNER_CHEMICAL_CDR(), HUNER_DISEASE_CDR(), HUNER_DISEASE_NCBI(), and HUNER_DISEASE_SCAI(), I get a similar Unicode error when loading all other chemical (HUNER_CHEMICAL_CEMP(), HUNER_CHEMICAL_CHEBI(), HUNER_CHEMICAL_CHEMDNER(), HUNER_CHEMICAL_SCAI()) and disease (HUNER_DISEASE_MIRNA(), HUNER_DISEASE_VARIOME()) datasets.

A similar error also occurs when loading datasets as shown below:
from flair.datasets import CEMP
corpus = CEMP()

The dataset name can be replaced with any of the other datasets mentioned above, and a similar Unicode error occurs.
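For readers hitting the same traceback: it ends in `encodings\cp1252.py`, which suggests the text file was opened without an explicit encoding and decoded with the Windows default code page. The sketch below reproduces that class of failure in isolation. It assumes the offending byte comes from UTF-8 encoded text; the character U+201D is only an illustrative guess, and `try_decode` is a hypothetical helper, not part of Flair.

```python
# U+201D (right double quotation mark) is one character whose UTF-8
# encoding ends in byte 0x9d. The real file may contain a different one;
# this choice is an assumption made for illustration.
raw = "\u201d".encode("utf-8")  # b'\xe2\x80\x9d'


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        return f"UnicodeDecodeError: {err}"


# cp1252 has no character assigned to byte 0x9d, so decoding fails.
cp1252_result = try_decode(raw, "cp1252")
# The same bytes decode cleanly when read as UTF-8.
utf8_result = try_decode(raw, "utf-8")

print(cp1252_result)
print(utf8_result)
```

The same bytes are valid under one codec and invalid under the other, which is why the datasets can load on one platform and fail on another.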

datasri added the bug (Something isn't working) label Sep 21, 2020
datasri changed the title from "BIOMEDICAL DATASETS: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5851: character maps to <undefined>" to "BIOMEDICAL DATASETS: UnicodeDecodeError: Error loading several biomedical datasets" Sep 21, 2020
leonweber (Collaborator) commented

Thanks for reporting this, @datasri. I will look into it in the next couple of days.

leonweber self-assigned this Sep 22, 2020

datasri commented Sep 22, 2020

Thanks @leonweber

leonweber (Collaborator) commented

Hey @datasri, I have pushed changes to a new branch that fix the issue on my local Windows machine. Could you please try installing the fix via pip install git+https://github.com/flairNLP/flair@GH-1874-Biomedical-Fix-Unicode-Windows and see whether it resolves your issue? You may have to clear the cached datasets in your Flair folder (usually /Users//.flair).
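Clearing a cached dataset by hand could look roughly like the sketch below. It is an illustration only: `clear_cached_dataset` is a hypothetical helper, and the `datasets/<dataset_name>` layout under the Flair folder is an assumption that should be checked against the actual folder before deleting anything.

```python
import shutil
from pathlib import Path


def clear_cached_dataset(cache_root: Path, dataset_name: str) -> bool:
    """Delete one cached dataset folder; return True if something was removed.

    Assumes cached datasets live under <cache_root>/datasets/<dataset_name>,
    which is a guess about the layout and not taken from this thread.
    """
    target = cache_root / "datasets" / dataset_name
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False


# Example call (not executed here). Both the folder location and the
# dataset folder name are assumptions:
# clear_cached_dataset(Path.home() / ".flair", "huner_chemical_cemp")
```

Removing only the affected dataset folder keeps other downloads intact, and the dataset should be fetched and converted again on the next load.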


datasri commented Sep 29, 2020

Thanks, @leonweber. I am now able to load all biomedical datasets for diseases and chemicals. Thank you very much for resolving the issue so quickly.

alanakbik added a commit that referenced this issue Oct 7, 2020
…-Windows

Biomedical: Explicit encodings for Windows Support
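The commit title points at the general remedy: pass an explicit encoding to `open()` so the result does not depend on the platform default. The following is a minimal sketch of that pattern, not the code from the pull request. `read_nonempty_lines` is a hypothetical helper, and UTF-8 is assumed to be the encoding of the dataset files.

```python
import tempfile
from pathlib import Path
from typing import List


def read_nonempty_lines(text_file: Path, encoding: str = "utf-8") -> List[str]:
    """Read a text file with an explicit encoding and skip blank lines.

    Without the encoding argument, open() uses the locale's preferred
    encoding, which is typically cp1252 on Western-European Windows setups.
    """
    with open(str(text_file), "r", encoding=encoding) as text_reader:
        return [line.rstrip("\n") for line in text_reader if line.strip()]


# Small demonstration on a temporary file containing non-ASCII quotes.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes(
        "first \u201cquoted\u201d line\n\nsecond line\n".encode("utf-8")
    )
    lines = read_nonempty_lines(sample)

print(lines)
```

Making the encoding a parameter with a default keeps call sites short while still allowing a different codec for files known to use one.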