
Fix tokenizer insert empty token to sentence object #1226

Merged — 1 commit merged into flairNLP:master on Oct 20, 2019

Conversation

eurekaqq
Contributor

Fix issue #1188
This issue concerns the tokenizer from the segtok library.
segtok splits contractions of the form someword + n't (e.g. don't, didn't) into two tokens: someword and n't.
However, if the user has already split the contraction in the input text, an empty token is inserted between someword and n't.
Example:
"do n't" -> "do", "", "n't"
"did n't" -> "did", "", "n't"

To Reproduce

from flair.data import segtok_tokenizer

text = r'do n’t'
tokens = segtok_tokenizer(text)
print(tokens)  # [Token: do, Token: , Token: n’t]  (note the empty middle token)

This commit fixes the problem.
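The underlying idea of the fix can be sketched without Flair at all: after splitting contraction suffixes, simply drop any empty strings before building the token list. The sketch below is a hypothetical, simplified tokenizer (it is not Flair's or segtok's actual code) that mimics segtok's someword + n't splitting and shows how filtering empties prevents the bug.

```python
import re

def tokenize(text):
    # Hypothetical sketch: split "n't"-style contraction suffixes the way
    # segtok does, using either a straight or curly apostrophe.
    parts = re.split(r"(n[’']t)|\s+", text)
    # re.split leaves None for non-participating capture groups and empty
    # strings between adjacent separators; filtering them out is what keeps
    # an empty token from reaching the sentence object.
    return [p for p in parts if p]

print(tokenize("do n’t"))  # ['do', 'n’t'] — no empty token in the middle
print(tokenize("don’t"))   # ['do', 'n’t']
```

Without the filter on the last line, the pre-split input "do n’t" would yield an empty string between the two real tokens, which is exactly the behavior reported in issue #1188.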

@alanakbik
Collaborator

Thanks! The Travis error is from a different commit, now fixed, so I will merge!

@alanakbik alanakbik merged commit acf8133 into flairNLP:master Oct 20, 2019