
Refactor evaluation #1671

Merged
merged 21 commits on Jun 8, 2020

Conversation

@alanakbik (Collaborator) commented Jun 6, 2020

This PR makes a number of refactorings to the evaluation routines in Flair. In short: wherever possible, we now use the evaluation methods of sklearn instead of our own implementations, which were a recurring source of bugs. This applies to text classification and (most) sequence tagging.
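As a rough sketch (not the exact code in this PR) of the kind of sklearn call now used for text classification metrics; the label values below are toy examples:

from sklearn.metrics import accuracy_score, classification_report

# gold and predicted labels collected during evaluation (toy values)
y_true = ['POSITIVE', 'NEGATIVE', 'POSITIVE', 'NEGATIVE']
y_pred = ['POSITIVE', 'NEGATIVE', 'NEGATIVE', 'NEGATIVE']

# per-class precision/recall/F1 plus micro and macro averages
print(classification_report(y_true, y_pred, digits=4))
print('accuracy:', accuracy_score(y_true, y_pred))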

A notable exception is "span-F1", which is used to evaluate NER: since there is no good way of counting true negatives over spans, sklearn's metrics do not apply. After this PR, our implementation should exactly mirror the original conlleval script of the CoNLL-02 shared task. In addition to using our reimplementation, an output file is now automatically generated that can be used directly with the conlleval script.
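For illustration, such a file lists one token per line with the gold tag and the predicted tag in the last two columns, separated by whitespace, with blank lines between sentences; this is the column layout conlleval assumes. A toy example (tags are illustrative):

George B-PER B-PER
Washington I-PER I-PER
went O O
to O O
Washington B-LOC O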

In more detail, this PR makes the following changes:

  • Span is now a list of Token and can be iterated like a Sentence (see the sketch after this list)
  • flair.DataLoader is now used throughout
  • The evaluate() interface in the Model base class is changed so that it no longer requires a DataLoader; it can now run either over a list of Sentence or a Dataset (also sketched below)
  • SequenceTagger.evaluate() now explicitly distinguishes between F1 and span-F1. In the latter case, no true negatives are counted (closes #1663: "Getting always same number of TN and TP for multilabel classification") and a non-sklearn implementation is used.
  • An unrelated serialization error in DocumentPoolEmbeddings is fixed
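A minimal sketch of the Span and evaluate() changes (assuming the standard Flair API; the (Result, loss) return signature of evaluate() is an assumption here):

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load('ner')

sentence = Sentence('George Washington went to Washington')
tagger.predict(sentence)

# a Span is now a list of Token and can be iterated like a Sentence
for span in sentence.get_spans('ner'):
    for token in span:
        print(token.text)

# evaluate() can now run directly over a list of Sentence (or a Dataset);
# in practice you would pass gold-labeled data such as corpus.test
# (return signature assumed: a Result object and a loss value)
result, loss = tagger.evaluate([sentence])
print(result.main_score)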

In the evaluate() method of the SequenceTagger and TextClassifier, we now explicitly call the .predict() method. To enable this, we made some changes to the predict() interface: you can now optionally specify the "label name" under which the predicted labels are stored:

from flair.data import Sentence
from flair.models import SequenceTagger

sentence = Sentence('I love Berlin')

tagger = SequenceTagger.load('ner')

# specify label name to be 'conll03_ner'
tagger.predict(sentence, label_name='conll03_ner')

print(sentence)

This is useful if you have multiple NER taggers and wish to tag the same sentence with each of them: the label names let you distinguish which tagger produced which tags (see the sketch below). Note also that it is no longer possible to pass a plain string to predict(); you must now pass a Sentence.
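A minimal sketch of the multi-tagger use case (the second model name 'ner-ontonotes' and both label names are illustrative, assuming the standard Flair API):

from flair.data import Sentence
from flair.models import SequenceTagger

sentence = Sentence('I love Berlin')

# load two different NER models (model names are illustrative)
tagger_a = SequenceTagger.load('ner')
tagger_b = SequenceTagger.load('ner-ontonotes')

# tag the same sentence with both, under distinct label names
tagger_a.predict(sentence, label_name='conll03_ner')
tagger_b.predict(sentence, label_name='ontonotes_ner')

# retrieve each tagger's predictions separately via the label name
print(sentence.get_spans('conll03_ner'))
print(sentence.get_spans('ontonotes_ner'))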

This PR also makes it possible to set seeds when loading and downsampling corpora, so that the sample is always the same:

# set a random seed
import random
random.seed(4)

# load and downsample corpus (import assumed to be from flair.datasets)
from flair.datasets import SENTEVAL_MR
corpus = SENTEVAL_MR(filter_if_longer_than=50).downsample(0.1)

# print first sentence of dev and test 
print(corpus.dev[0])
print(corpus.test[0])
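(Seeding works because downsampling draws its sample through Python's random module; with the same seed, the dev and test splits printed above come out identical on every run.)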
