How to use pretokenized sequences? #1963

Closed
ulf1 opened this issue Nov 12, 2020 · 6 comments
Labels
question Further information is requested

Comments

ulf1 (Contributor) commented Nov 12, 2020

Hello, I want to use pretokenized sequences (I don't want to use a different tokenizer, e.g. the ones described in the docs).

Made up example:

pretokenized_sequence = ['The', 'grass', 'is', 'green', '.']
my_sentence_object = Sentence(pretokenized_sequence, is_pretokenized=True)

Is it possible to achieve something like this?

ulf1 added the question label Nov 12, 2020
ulf1 (Contributor, Author) commented Nov 12, 2020

Proposal

https://github.com/flairNLP/flair/blob/master/flair/data.py#L570

# check the pretokenized case first: a token list is also "not None",
# so the plain-text branch would otherwise swallow it
if is_pretokenized and isinstance(text, collections.abc.Iterable):
    [self.add_token(token) for token in text]
    # or: [self.add_token(self._restore_windows_1252_characters(token)) for token in text]
elif text is not None:
    text = self._restore_windows_1252_characters(text)
    [self.add_token(token) for token in tokenizer.tokenize(text)]

I can make a PR if you want.
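
One subtlety worth flagging (a plain-Python aside, not part of the proposal; the looks_pretokenized helper below is hypothetical): a str is itself an Iterable, so the isinstance check above cannot on its own distinguish raw text from a token list. That is why the flag, or an explicit list-of-strings check, is needed:

import collections.abc

print(isinstance('The grass is green .', collections.abc.Iterable))  # True: a str iterates over its characters
print(isinstance(['The', 'grass'], collections.abc.Iterable))        # True as well

# a stricter check that needs no flag (hypothetical helper, not in flair):
def looks_pretokenized(text):
    return isinstance(text, list) and all(isinstance(t, str) for t in text)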

alanakbik (Collaborator) commented

Hello @ulf1, thanks for the suggestion! There is, however, already a way to do this using whitespace tokenization:

# use whitespace tokenized string and no tokenizer
sentence = Sentence('The grass is green .', use_tokenizer=False)

print(sentence)

You can create a whitespace-tokenized string from a list of strings like this:

pretokenized_sequence = ['The', 'grass', 'is', 'green', '.']

whitespace_tokenized_sentence = ' '.join(pretokenized_sequence)

print(whitespace_tokenized_sentence)

I would prefer this approach over the suggested PR, since the PR would introduce new constructor parameters to the Sentence object, and we want to keep those as few as possible so the object stays easy to understand. The PR would also change the signature of the text parameter to Union[str, List[str]], where the second type is only valid if is_pretokenized is set; this could confuse people. So hopefully the existing way works for you!
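
Putting the two snippets together, a self-contained sketch of this workaround:

from flair.data import Sentence

pretokenized_sequence = ['The', 'grass', 'is', 'green', '.']

# re-join with single spaces, then let Sentence split on whitespace
whitespace_tokenized_sentence = ' '.join(pretokenized_sequence)
sentence = Sentence(whitespace_tokenized_sentence, use_tokenizer=False)

print(sentence)  # a Sentence with 5 tokens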

ulf1 (Contributor, Author) commented Nov 12, 2020

@alanakbik Your suggestion will likely produce errors in my use case, i.e. my preprocessing could produce two different sequence lengths for the same sentence.

alanakbik (Collaborator) commented

Could you specify, perhaps with an example?

ulf1 (Contributor, Author) commented Nov 12, 2020

raw = 'here  is a subtle typo'  # note the double space: the "subtle typo"

  • Tokenizer A does raw.split(" ") and produces tokenized = ['here', '', 'is', 'a', 'subtle', 'typo']
  • Tokenizer B uses a clever regex rule and produces tokenized = ['here', 'is', 'a', 'subtle', 'typo']

When using two different tokenizers, it's very likely that we end up with different results in some edge cases. In one of my use cases I will get pretokenized sentences (and some other annotations), and I have no idea how those tokens were generated.
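
A small plain-Python sketch of that failure mode (a stand-in for the join-then-retokenize round trip, assuming the whitespace tokenizer collapses runs of spaces the way str.split() with no argument does):

# tokens as delivered upstream (tokenizer A above: naive split over a double space)
tokens_a = ['here', '', 'is', 'a', 'subtle', 'typo']

joined = ' '.join(tokens_a)   # 'here  is a subtle typo' (double space survives)
resplit = joined.split()      # whitespace re-tokenization drops the empty token

print(len(tokens_a))  # 6
print(len(resplit))   # 5 -> token-level annotations no longer line up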

ulf1 (Contributor, Author) commented Nov 12, 2020

@alanakbik "... Also the PR would change the signature of the text parameter to Union[str, List[str]] where the second only is valid if is_pretokenized is set. This could potentially confuse people. ..."

That's correct. is_pretokenized is not needed at all. I removed it.

alanakbik added a commit that referenced this issue Nov 23, 2020: fix #1963 pretokenized sequences as input for flair.data.Sentence
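
For reference, a minimal sketch of the usage this change enables (an assumption based on the discussion above: the merged signature accepts Union[str, List[str]] directly, with no is_pretokenized flag):

from flair.data import Sentence

# pass the token list straight to the constructor
pretokenized_sequence = ['The', 'grass', 'is', 'green', '.']
sentence = Sentence(pretokenized_sequence)

print(sentence)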