
fix #1963 pretokenized sequences as input for flair.data.Sentence #1965

Merged 7 commits on Nov 23, 2020

Changes from 3 commits
15 changes: 11 additions & 4 deletions flair/data.py
@@ -6,6 +6,7 @@

from collections import Counter
from collections import defaultdict
from collections.abc import Iterable

from deprecated import deprecated
from flair.file_utils import Tqdm
@@ -525,19 +526,21 @@ class Sentence(DataPoint):

def __init__(
self,
text: str = None,
text: Union[str, List[str]] = None,
use_tokenizer: Union[bool, Tokenizer] = True,
is_pretokenized: bool = False,
language_code: str = None,
start_position: int = None
):
"""
Class to hold all meta related to a text (tokens, predictions, language code, ...)
:param text: original string
:param text: original string (sentence), or a list of string tokens (words)
Review comment (Contributor Author): How to make it more clear? Proposed wording:

:param text: original string (Alternatively, `text` can be a token list, i.e. a list of strings. In this case there is no tokenization by flair and the parameter `use_tokenizer` is ignored.)

:param use_tokenizer: a custom tokenizer (default is :class:`SpaceTokenizer`)
more advanced options are :class:`SegTokTokenizer` to use segtok or :class:`SpacyTokenizer`
to use Spacy library if available). Check the implementations of abstract class Tokenizer or
implement your own subclass (if you need it). If instead of providing a Tokenizer, this parameter
is just set to True (deprecated), :class:`SegtokTokenizer` will be used.
:param is_pretokenized: Flag if "text" is expected to be a list of tokens
:param language_code: Language of the sentence
:param start_position: Start char offset of the sentence in the superordinate document
"""
@@ -568,8 +571,12 @@ def __init__(

# if text is passed, instantiate sentence with tokens (words)
if text is not None:
text = self._restore_windows_1252_characters(text)
[self.add_token(token) for token in tokenizer.tokenize(text)]
if is_pretokenized and isinstance(text, Iterable):
[self.add_token(self._restore_windows_1252_characters(token))
for token in text]
else:
text = self._restore_windows_1252_characters(text)
[self.add_token(token) for token in tokenizer.tokenize(text)]

# log a warning if the dataset is empty
if text == "":
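The control flow this PR adds to `Sentence.__init__` can be sketched standalone (plain Python, not flair itself — `simple_tokenize` is a stand-in for flair's `Tokenizer`, and the extra `str` guard is an assumption added here because strings are themselves `Iterable`):

```python
from collections.abc import Iterable
from typing import List, Union

def simple_tokenize(text: str) -> List[str]:
    # Stand-in for flair's Tokenizer.tokenize: whitespace split only.
    return text.split()

def build_tokens(text: Union[str, List[str]],
                 is_pretokenized: bool = False) -> List[str]:
    """Mirrors the branching added to Sentence.__init__: a pretokenized
    token list bypasses the tokenizer entirely."""
    if is_pretokenized and isinstance(text, Iterable) and not isinstance(text, str):
        return list(text)          # tokens are taken verbatim
    return simple_tokenize(text)   # otherwise tokenize the raw string

print(build_tokens("The grass is green ."))
# ['The', 'grass', 'is', 'green', '.']
print(build_tokens(["The", "grass", "is", "green", "."], is_pretokenized=True))
# ['The', 'grass', 'is', 'green', '.']
```

The key property: with `is_pretokenized=True` the caller's token boundaries are preserved exactly, whereas the string path delegates boundary decisions to the tokenizer.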
2 changes: 1 addition & 1 deletion requirements.txt
@@ -16,7 +16,7 @@ tabulate
langdetect
lxml
ftfy
sentencepiece!=0.1.92
sentencepiece==0.1.91
konoha<5.0.0,>=4.0.0
janome
gdown
16 changes: 16 additions & 0 deletions resources/docs/TUTORIAL_1_BASICS.md
@@ -109,6 +109,22 @@ You can write your own tokenization routine. Check the code of `flair.data.Token
(e.g. `flair.tokenization.SegtokTokenizer` or `flair.tokenization.SpacyTokenizer`) to get an idea of how to add
your own tokenization method.

### Using pretokenized sequences
You can pass a pretokenized sequence as a list of words, e.g.

```python
from flair.data import Sentence
my_sent = Sentence(['The', 'grass', 'is', 'green', '.'], is_pretokenized=True)
print(my_sent)
```

This should print:

```console
Sentence: "The grass is green." [− Tokens: 5]
```


## Adding Labels

In Flair, any data point can be labeled. For instance, you can label a word or label a sentence:
7 changes: 7 additions & 0 deletions tests/test_data.py
@@ -866,3 +866,10 @@ def test_sentence_to_dict():
assert "Facebook, Inc." == dict["entities"][0]["text"]
assert "Google" == dict["entities"][1]["text"]
assert 0 == len(dict["labels"])


def test_pretokenized():
pretoks = ['The', 'grass', 'is', 'green', '.']
sent = Sentence(pretoks, is_pretokenized=True)
for i, token in enumerate(sent):
assert token.text == pretoks[i]