How to use pretokenized sequences? #1963
**Proposal** (https://github.com/flairNLP/flair/blob/master/flair/data.py#L570):

```python
if text is not None:
    text = self._restore_windows_1252_characters(text)
    [self.add_token(token) for token in tokenizer.tokenize(text)]
elif is_pretokenized and isinstance(text, collections.abc.Iterable):
    [self.add_token(token) for token in text]
    # or: [self.add_token(self._restore_windows_1252_characters(token)) for token in text]
```

I can make a PR if you want.
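For illustration, the proposed branching can be sketched with a tiny stand-in class. This is not flair's real `Sentence`: `str.split()` stands in for `tokenizer.tokenize`, and the first condition is narrowed to `isinstance(text, str)` so that the pretokenized branch is actually reachable for list input.

```python
import collections.abc

class MiniSentence:
    """Stand-in sketch of the proposed Sentence branching (not flair's class)."""

    def __init__(self, text, is_pretokenized=False):
        self.tokens = []
        if isinstance(text, str):
            # Plain string: tokenize it (str.split stands in for the tokenizer).
            for token in text.split():
                self.tokens.append(token)
        elif is_pretokenized and isinstance(text, collections.abc.Iterable):
            # Pretokenized input: take the tokens as given, no re-tokenization.
            for token in text:
                self.tokens.append(token)

print(MiniSentence("The grass is green .").tokens)
print(MiniSentence(["The", "grass", "is", "green", "."], is_pretokenized=True).tokens)
```

Both calls yield the same token sequence, which is the point of the proposal: a pretokenized list bypasses the tokenizer entirely.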
Hello @ulf1, thanks for adding this! There is, however, already a way to do this using whitespace tokenization:

```python
# use whitespace tokenized string and no tokenizer
sentence = Sentence('The grass is green .', use_tokenizer=False)
print(sentence)
```

You can create a whitespace tokenized string from a list of strings like this:

```python
pretokenized_sequence = ['The', 'grass', 'is', 'green', '.']
whitespace_tokenized_sentence = ' '.join(pretokenized_sequence)
print(whitespace_tokenized_sentence)
```

I would prefer this approach over the suggested PR since it will introduce new constructor elements to the `Sentence` class.
@alanakbik Your suggestion will likely produce errors in my use case: my preprocessing could produce two different sequence lengths for the same sentence.
Could you specify, perhaps with an example?
When using two different tokenizers, it's very likely that we end up with different results in some edge cases. In one of my use cases I will receive pretokenized sentences (and some other annotations), and I have no idea how these tokens were generated.
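A plain-Python sketch of that failure mode (no flair needed): the whitespace-join workaround silently changes the sequence length as soon as one upstream token itself contains a space. The token `"New York"` below is a hypothetical example of such upstream tokenizer output.

```python
# Hypothetical upstream tokenizer output in which one token contains a space.
pretokenized = ["New York", "is", "big", "."]

# The suggested workaround: join with spaces, then let whitespace
# tokenization split the string again.
joined = " ".join(pretokenized)
resplit = joined.split(" ")

print(len(pretokenized))  # 4 tokens going in
print(len(resplit))       # 5 tokens coming out: "New York" was split in two
```

The re-split sequence no longer aligns with any token-level annotations attached to the original sequence, which is exactly the length-mismatch concern raised above.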
@alanakbik

> ... Also the PR would change the signature of the text parameter to `Union[str, List[str]]` where the second only is valid if `is_pretokenized` is set. This could potentially confuse people. ...

That's correct.
…-class fix #1963 pretokenized sequences as input for flair.data.Sentence
Hello, I want to use pretokenized sequences (I don't want to use a different tokenizer, e.g. docs).
Made up example:
Is it possible to achieve something like this?
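For readers without flair installed, the whitespace workaround from the maintainer's reply above reduces to a plain string round trip; only the `Sentence(..., use_tokenizer=False)` call (shown as a comment) is flair-specific, the join/split is standard Python.

```python
# The pretokenized input from the thread's example.
pretokenized_sequence = ['The', 'grass', 'is', 'green', '.']

# Join into a whitespace-tokenized string; flair would then be given this
# string, e.g.:
#   sentence = Sentence(whitespace_tokenized_sentence, use_tokenizer=False)
whitespace_tokenized_sentence = ' '.join(pretokenized_sequence)
print(whitespace_tokenized_sentence)  # The grass is green .

# Round trip: splitting recovers the original tokens. This is only safe
# when no token itself contains whitespace.
assert whitespace_tokenized_sentence.split(' ') == pretokenized_sequence
```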