Fix tokenizer insert empty token to sentence object #1226

eurekaqq · 2019-10-20T02:24:03Z

Fixes issue #1188.
The issue concerns the tokenizer from the segtok library.
segtok splits contractions of the form word + n't (e.g. don't, didn't) into the word and n't.
However, if the user has already split the contraction, the tokenizer inserts an empty token between the word and n't.
Example:
"do n't" -> "do", "", "n't"
"did n't" -> "did", "", "n't"

To Reproduce

from flair.data import segtok_tokenizer

text = r'do n’t'
tokens = segtok_tokenizer(text)
print(tokens)  # [Token: do, Token: , Token: n’t]

This commit fixes the problem.
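The diff itself is not reproduced in this conversation, so the following is only a minimal sketch of the kind of guard that keeps empty tokens out of a sentence. It assumes the tokenizer's intermediate output is a plain list of strings; `drop_empty_tokens` is a hypothetical helper introduced for illustration and is not part of flair or segtok.

```python
from typing import List


def drop_empty_tokens(tokens: List[str]) -> List[str]:
    """Return the tokens with empty or whitespace-only strings removed.

    Hypothetical helper for illustration; not the actual change in this PR.
    """
    return [tok for tok in tokens if tok.strip()]


# Token strings as reported above for the pre-split input "do n't".
raw_tokens = ["do", "", "n't"]
cleaned = drop_empty_tokens(raw_tokens)
print(cleaned)  # ['do', "n't"]
```

Filtering after tokenization leaves the contraction handling itself untouched, so input that has not been pre-split should be unaffected by a guard like this.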

alanakbik · 2019-10-20T14:32:03Z

Thanks! The Travis error is from a different commit, now fixed, so will merge!

Fix tokenizer insert empty token to sentence

a9e661f

alanakbik merged commit acf8133 into flairNLP:master Oct 20, 2019
