Refactor evaluation #1671

alanakbik · 2020-06-06T19:41:45Z

This PR makes a number of refactorings to the evaluation routines in Flair. In short: whenever possible, we now use the evaluation methods of sklearn instead of our own implementations, which kept causing issues. This applies to text classification and (most) sequence tagging.

A notable exception is "span-F1", which is used to evaluate NER. Because there is no good way of counting true negatives for spans, we keep our own implementation here. After this PR, that implementation should exactly mirror the original conlleval script of the CoNLL-02 challenge. In addition, an output file is now automatically generated that can be used directly with the conlleval script.
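To illustrate why span-F1 needs no true negatives, here is a minimal sketch of span-level scoring over BIO tags. It is not the code from this PR: the helper names `bio_to_spans` and `span_f1` are hypothetical, and it assumes plain BIO tagging, where an `I-` tag with no matching open span starts a new span.

```python
from typing import List, Set, Tuple

# (start token index, end token index exclusive, span type)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect (start, end, type) spans from one BIO tag sequence."""
    spans: Set[Span] = set()
    start, span_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" closes a trailing span
        prefix, _, label = tag.partition("-")
        # A span continues only on an I- tag of the same type as the open span.
        if prefix == "I" and start is not None and label == span_type:
            continue
        if start is not None:
            spans.add((start, i, span_type))
            start, span_type = None, None
        if prefix in ("B", "I"):  # assumption: a stray I- tag opens a new span
            start, span_type = i, label
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over exactly matching spans; TN never appear."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans, pred_spans = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        tp += len(gold_spans & pred_spans)  # same boundaries and same type
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
precision, recall, f1 = span_f1(gold, pred)
```

Only spans that were predicted or annotated are counted, so there is no natural set of "negative" spans to tally. This is why a token-level sklearn report does not apply directly to this metric.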

In more detail, this PR makes the following changes:

Span is now a list of Token and can now be iterated like a sentence
flair.DataLoader is now used throughout
The evaluate() interface in the Model base class is changed so that it no longer requires a data loader, but can run over either a list of Sentence objects or a Dataset
SequenceTagger.evaluate() now explicitly distinguishes between F1 and Span-F1. In the latter case, no TN are counted and a non-sklearn implementation is used (closes #1663, "Getting always same number of TN and TP for multilabel classification").
An unrelated serialization error is fixed in DocumentPoolEmbeddings

In the evaluate() method of the SequenceTagger and TextClassifier, we now explicitly call the .predict() method. To enable this, we made some changes to the predict() interface: you can now optionally specify the "label name" of the predicted label:

from flair.data import Sentence
from flair.models import SequenceTagger

sentence = Sentence('I love Berlin')

tagger = SequenceTagger.load('ner')

# specify label name to be 'conll03_ner'
tagger.predict(sentence, label_name='conll03_ner')

print(sentence)

This may be useful if you have multiple NER taggers and wish to tag the same sentence with each of them: the label name lets you tell apart the tags produced by each tagger. Note also that the predict() method no longer accepts a string; you must now pass a Sentence.

This PR also makes it possible to set seeds when loading and downsampling corpora, so that the sample is always the same:

# set a random seed
import random
from flair.datasets import SENTEVAL_MR

random.seed(4)

# load and downsample corpus
corpus = SENTEVAL_MR(filter_if_longer_than=50).downsample(0.1)

# print first sentence of dev and test 
print(corpus.dev[0])
print(corpus.test[0])
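The idea behind seeded downsampling can be sketched without Flair. The snippet below is an illustration under stated assumptions, not the PR's implementation: the `downsample` helper and its `seed` parameter are hypothetical, and it uses a local `random.Random` instance rather than the global seed shown above.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def downsample(items: Sequence[T], fraction: float, seed: Optional[int] = None) -> List[T]:
    """Keep roughly `fraction` of the items; the same seed gives the same subset."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    k = round(len(items) * fraction)
    # Sample indices, then sort them so the kept items stay in their original order.
    indices = sorted(rng.sample(range(len(items)), k))
    return [items[i] for i in indices]


sentences = [f"sentence {i}" for i in range(100)]
first = downsample(sentences, 0.1, seed=4)
second = downsample(sentences, 0.1, seed=4)
```

With the same seed, `first` and `second` contain the same ten sentences, which is the property that makes experiments on a downsampled corpus repeatable.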

alanakbik added 21 commits June 5, 2020 12:36

Change model interface

7dafbe6

Make distinction between F1 and span-F1

4d5a0f2

Merge branch 'master' into refactor_evaluation

6733d2c

Slim down logging

1f2948a

fix serialization issue

3be395f

refactor evaluation

f1e38e9

refactor evaluation interface

8aa1cf6

Switch evaluation to sklearn

a2e2a9c

adapt to new evaluate() interface

dab4b40

Refactor sampling logic so that random seed can be passed

ce357a4

Refactor evaluation

11292f9

Enable deterministic downsampling of corpus

fd25658

Add fbeta support

4d4bd15

display false negatives

df09f3d

Add get_names() to embedding interface

a6e13e6

allow to specify which embeddings to use

1384b76

slim down predict() interface and allow to set label_name

73c5e11

Remove tokenization option

1bb1de8

Fix unit tests for new predict() interface

520b507

Call .predict() in evaluate()

23c7a71

Remove commented out code

c2eef90

alanakbik mentioned this pull request Jun 7, 2020

Getting always same number of TN and TP for multilabel classification #1663

Closed

alanakbik merged commit 6b605a2 into master Jun 8, 2020

alanakbik deleted the refactor_evaluation branch June 8, 2020 13:34
