How to use pretokenized sequences? #1963

Closed
ulf1 opened this issue Nov 12, 2020 · 6 comments
Labels
question Further information is requested

Comments

ulf1 (Contributor) commented Nov 12, 2020

Hello, I want to use pretokenized sequences (I don't want to use a different tokenizer, e.g. the ones described in the docs).

Made up example:

pretokenized_sequence = ['The', 'grass', 'is', 'green', '.']
my_sentence_object = Sentence(pretokenized_sequence, is_pretokenized=True)

Is it possible to achieve something like this?

ulf1 added the question label Nov 12, 2020
ulf1 (Contributor, Author) commented Nov 12, 2020

Proposal

https://github.com/flairNLP/flair/blob/master/flair/data.py#L570

# check the pretokenized case first: a token list is also "not None",
# so the plain-text branch would otherwise swallow it
if is_pretokenized and isinstance(text, collections.abc.Iterable):
    [self.add_token(token) for token in text]
    # or: [self.add_token(self._restore_windows_1252_characters(token)) for token in text]
elif text is not None:
    text = self._restore_windows_1252_characters(text)
    [self.add_token(token) for token in tokenizer.tokenize(text)]

I can make a PR if you want.
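
One subtlety worth flagging (a plain-Python aside, not part of the proposal; the looks_pretokenized helper below is hypothetical): a str is itself an Iterable, so the isinstance check above cannot on its own distinguish raw text from a token list. That is why the flag, or an explicit list-of-strings check, is needed:

import collections.abc

print(isinstance('The grass is green .', collections.abc.Iterable))  # True: a str iterates over its characters
print(isinstance(['The', 'grass'], collections.abc.Iterable))        # True as well

# a stricter check that needs no flag (hypothetical helper, not in flair):
def looks_pretokenized(text):
    return isinstance(text, list) and all(isinstance(t, str) for t in text)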

alanakbik (Collaborator) commented

Hello @ulf1, thanks for the suggestion! There is, however, already a way to do this using whitespace tokenization:

# use whitespace tokenized string and no tokenizer
sentence = Sentence('The grass is green .', use_tokenizer=False)

print(sentence)

You can create a whitespace-tokenized string from a list of strings like this:

pretokenized_sequence = ['The', 'grass', 'is', 'green', '.']

whitespace_tokenized_sentence = ' '.join(pretokenized_sequence)

print(whitespace_tokenized_sentence)

I would prefer this approach over the suggested PR, since the PR would introduce new constructor parameters to the Sentence object, and we want to keep those as few as possible so the object stays easy to understand. The PR would also change the signature of the text parameter to Union[str, List[str]], where the second type is only valid if is_pretokenized is set; this could confuse people. So hopefully the existing way works for you!
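
Putting the two snippets together, a self-contained sketch of this workaround:

from flair.data import Sentence

pretokenized_sequence = ['The', 'grass', 'is', 'green', '.']

# re-join with single spaces, then let Sentence split on whitespace
whitespace_tokenized_sentence = ' '.join(pretokenized_sequence)
sentence = Sentence(whitespace_tokenized_sentence, use_tokenizer=False)

print(sentence)  # a Sentence with 5 tokens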

ulf1 (Contributor, Author) commented Nov 12, 2020

@alanakbik Your suggestion will likely produce errors in my use case, i.e. my preprocessing could produce two different sequence lengths for the same sentence.

alanakbik (Collaborator) commented

Could you specify, perhaps with an example?

ulf1 (Contributor, Author) commented Nov 12, 2020

raw = 'here  is a subtle typo'  # note the double space: the "subtle typo"

  • Tokenizer A does raw.split(" ") and produces tokenized = ['here', '', 'is', 'a', 'subtle', 'typo']
  • Tokenizer B uses a clever regex rule and produces tokenized = ['here', 'is', 'a', 'subtle', 'typo']

When using two different tokenizers, it's very likely that we end up with different results in some edge cases. In one of my use cases I will get pretokenized sentences (and some other annotations), and I have no idea how those tokens were generated.
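
A small plain-Python sketch of that failure mode (a stand-in for the join-then-retokenize round trip, assuming the whitespace tokenizer collapses runs of spaces the way str.split() with no argument does):

# tokens as delivered upstream (tokenizer A above: naive split over a double space)
tokens_a = ['here', '', 'is', 'a', 'subtle', 'typo']

joined = ' '.join(tokens_a)   # 'here  is a subtle typo' (double space survives)
resplit = joined.split()      # whitespace re-tokenization drops the empty token

print(len(tokens_a))  # 6
print(len(resplit))   # 5 -> token-level annotations no longer line up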

ulf1 (Contributor, Author) commented Nov 12, 2020

@alanakbik "... Also the PR would change the signature of the text parameter to Union[str, List[str]] where the second only is valid if is_pretokenized is set. This could potentially confuse people. ..."

That's correct. is_pretokenized is not needed at all. I removed it.

alanakbik added a commit that referenced this issue Nov 23, 2020: fix #1963 pretokenized sequences as input for flair.data.Sentence
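
For reference, a minimal sketch of the usage this change enables (an assumption based on the discussion above: the merged signature accepts Union[str, List[str]] directly, with no is_pretokenized flag):

from flair.data import Sentence

# pass the token list straight to the constructor
pretokenized_sequence = ['The', 'grass', 'is', 'green', '.']
sentence = Sentence(pretokenized_sequence)

print(sentence)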