UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5851: character maps to <undefined>
This is a Unicode error that occurs when loading biomedical datasets. I get the following error when trying to load HUNER_CHEMICAL_CEMP().
To Reproduce
from flair.datasets import HUNER_CHEMICAL_CEMP
corpus = HUNER_CHEMICAL_CEMP()
Expected behavior
Corpus should load properly.
Screenshots
UnicodeDecodeError Traceback (most recent call last)
in
2 # # # 1. get all corpora for a specific entity type
3 # # from flair.models import SequenceTagger
----> 4 corpus = HUNER_CHEMICAL_CEMP()
~\AppData\Local\Continuum\anaconda3\envs\fgpu\lib\site-packages\flair\datasets\biomedical.py in parse_input_file(text_file, ann_file)
3956
3957 with open(str(text_file), "r") as text_reader:
-> 3958 for line in text_reader:
3959 if not line:
3960 continue
~\AppData\Local\Continuum\anaconda3\envs\fgpu\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5851: character maps to <undefined>
Environment (please complete the following information):
OS [Windows]:
Version [e.g. flair-0.6]:
Additional context
In fact, except for HUNER_CHEMICAL_CDR(), HUNER_DISEASE_CDR(), HUNER_DISEASE_NCBI() and HUNER_DISEASE_SCAI(),
I get a similar Unicode error when loading all other chemical (HUNER_CHEMICAL_CEMP(), HUNER_CHEMICAL_CHEBI(), HUNER_CHEMICAL_CHEMDNER(), HUNER_CHEMICAL_SCAI()) and disease (HUNER_DISEASE_MIRNA(), HUNER_DISEASE_VARIOME()) datasets.
A similar error also occurs when loading the datasets directly, as shown below.
from flair.datasets import CEMP
corpus = CEMP()
The dataset name can be replaced with any of the other datasets mentioned above, and a similar Unicode error occurs.
datasri changed the title from "BIOMEDICAL DATASETS: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5851: character maps to <undefined>" to "BIOMEDICAL DATASETS: UnicodeDecodeError: Error loading several biomedical datasets" on Sep 21, 2020
Hey @datasri, I have pushed changes to a new branch that fix the issue on my local Windows machine. Could you please try installing the fix via pip install git+https://github.com/flairNLP/flair@GH-1874-Biomedical-Fix-Unicode-Windows and check whether this resolves your issue? You may also have to clear the cached datasets in your Flair folder (usually /Users//.flair).
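For context, the traceback points at open(str(text_file), "r") being called without an encoding argument, so on Windows the file is decoded with the locale default (cp1252), which has no character assigned to byte 0x9d. The snippet below is only a minimal sketch of that failure mode and of the usual remedy of passing encoding="utf-8" explicitly. It does not reproduce the code on the fix branch, and the sample text and file name are made up for illustration. It assumes the corpus files are UTF-8, where 0x9d can appear as the last byte of a right double quotation mark (U+201D, encoded as e2 80 9d).

```python
import os
import tempfile

# Hypothetical sample line; U+201D is encoded in UTF-8 as the bytes e2 80 9d.
sample = "a \u201cquoted\u201d chemical name\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # made-up file name
    with open(path, "wb") as f:
        f.write(sample.encode("utf-8"))

    # Simulate the Windows default by requesting cp1252 explicitly.
    # Byte 0x9d is undefined in cp1252, so decoding raises UnicodeDecodeError.
    try:
        with open(path, "r", encoding="cp1252") as reader:
            reader.read()
        cp1252_failed = False
    except UnicodeDecodeError as err:
        cp1252_failed = True
        failing_byte = err.object[err.start]

    # Naming the encoding makes the read independent of the platform default.
    with open(path, "r", encoding="utf-8") as reader:
        decoded = reader.read()

print(cp1252_failed, decoded == sample)
```

If the loaders that still fail open their files the same way, adding encoding="utf-8" to those open() calls would be the analogous change, again assuming the downloaded files really are UTF-8.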
Describe the bug
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5851: character maps to <undefined>
This is a Unicode error that occurs when loading biomedical datasets. I get the following error when trying to load HUNER_CHEMICAL_CEMP().
To Reproduce
from flair.datasets import HUNER_CHEMICAL_CEMP
corpus = HUNER_CHEMICAL_CEMP()
Expected behavior
Corpus should load properly.
Screenshots
UnicodeDecodeError Traceback (most recent call last)
in
2 # # # 1. get all corpora for a specific entity type
3 # # from flair.models import SequenceTagger
----> 4 corpus = HUNER_CHEMICAL_CEMP()
~\AppData\Local\Continuum\anaconda3\envs\fgpu\lib\site-packages\flair\datasets\biomedical.py in __init__(self, *args, **kwargs)
3999
4000 def __init__(self, *args, **kwargs):
-> 4001 super().__init__(*args, **kwargs)
4002
4003 @staticmethod
~\AppData\Local\Continuum\anaconda3\envs\fgpu\lib\site-packages\flair\datasets\biomedical.py in __init__(self, base_path, in_memory, sentence_splitter)
522
523 writer = CoNLLWriter(sentence_splitter=self.sentence_splitter)
--> 524 internal_dataset = self.to_internal(data_folder)
525
526 train_data = self.get_subset(internal_dataset, "train", splits_dir)
~\AppData\Local\Continuum\anaconda3\envs\fgpu\lib\site-packages\flair\datasets\biomedical.py in to_internal(self, data_dir)
4009 train_text_file = train_folder / "chemdner_patents_train_text.txt"
4010 train_ann_file = train_folder / "chemdner_cemp_gold_standard_train.tsv"
-> 4011 train_data = CEMP.parse_input_file(train_text_file, train_ann_file)
4012
4013 dev_folder = CEMP.download_dev_corpus(data_dir)
~\AppData\Local\Continuum\anaconda3\envs\fgpu\lib\site-packages\flair\datasets\biomedical.py in parse_input_file(text_file, ann_file)
3956
3957 with open(str(text_file), "r") as text_reader:
-> 3958 for line in text_reader:
3959 if not line:
3960 continue
~\AppData\Local\Continuum\anaconda3\envs\fgpu\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5851: character maps to <undefined>
Environment (please complete the following information):
Additional context
In fact, except for HUNER_CHEMICAL_CDR(), HUNER_DISEASE_CDR(), HUNER_DISEASE_NCBI() and HUNER_DISEASE_SCAI(),
I get a similar Unicode error when loading all other chemical (HUNER_CHEMICAL_CEMP(), HUNER_CHEMICAL_CHEBI(), HUNER_CHEMICAL_CHEMDNER(), HUNER_CHEMICAL_SCAI()) and disease (HUNER_DISEASE_MIRNA(), HUNER_DISEASE_VARIOME()) datasets.
A similar error also occurs when loading the datasets directly, as shown below.
from flair.datasets import CEMP
corpus = CEMP()
The dataset name can be replaced with any of the other datasets mentioned above, and a similar Unicode error occurs.