Train & Test Loss scales #670

Closed
amitbcp opened this issue Apr 17, 2019 · 5 comments · Fixed by #747

amitbcp commented Apr 17, 2019

Describe the bug
I am training a text classification model as described in the tutorial. On plotting the training process, I see that the accuracy and F1-score plots converge and stabilize over 50 epochs, whereas the training and test loss show a wide gap even though they stabilize.
Going over loss.tsv, it seemed that the test loss was not scaled the way the training loss was.
On manually scaling the test loss, the new graph made sense.
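
For illustration, a minimal sketch of such a manual rescaling; the loss.tsv column names (EPOCH, TRAIN_LOSS, TEST_LOSS) and the assumption that the test loss is a sum over mini-batches are guesses here and may need adjusting to the actual file:

# Sketch: put the test loss on the same per-batch scale as the training loss.
import pandas as pd
import matplotlib.pyplot as plt

history = pd.read_csv("loss.tsv", sep="\t")

N_TEST_BATCHES = 100  # replace with the actual number of test mini-batches
history["TEST_LOSS_SCALED"] = history["TEST_LOSS"] / N_TEST_BATCHES

plt.plot(history["EPOCH"], history["TRAIN_LOSS"], label="train loss")
plt.plot(history["EPOCH"], history["TEST_LOSS_SCALED"], label="test loss (rescaled)")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()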

On inspecting the Flair code (https://github.com/zalandoresearch/flair/blob/master/flair/trainers/trainer.py, line 198), this seems to explain the issue: the training loss is scaled there, while the test loss apparently is not.
[screenshot: flair_code]

Can you please validate or clarify this?

Expected behavior
The graphs from Flair:
[plot: sample_imdb_training]

The data used is a sample of the IMDB dataset.

The code follows the text classification tutorial:
https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md

amitbcp added the bug label Apr 17, 2019
alanakbik (Collaborator) commented:

Hello @amitbcp, thank you for reporting this! You are right, it is inconsistent, also between the SequenceTagger and the TextClassifier. I'll put in a PR that cleans this up!

yzhaoinuw (Contributor) commented Jul 6, 2019

Hi, I trained a SequenceTagger model for 10 epochs on an NER dataset with BIO tags. The hyper-parameter setup is the following:

from flair.embeddings import ELMoEmbeddings
from flair.models import SequenceTagger

# tag_dictionary and tag_type were built from the corpus beforehand
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=ELMoEmbeddings('pubmed'),
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)

#%%
# initialize trainer
from flair.trainers import ModelTrainer
from flair.training_utils import EvaluationMetric

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# start training
trainer.train('../model/vd_ner_hidden256_mini32.flair',
              EvaluationMetric.MICRO_F1_SCORE,
              learning_rate=0.05,
              mini_batch_size=32,
              max_epochs=20,
              checkpoint=True)

After each epoch, the loss on the training set is significantly higher than the loss on the dev set or test set. For example, in the log file:
2019-07-02 19:53:53,813 EPOCH 1 done: loss 3.1292 - lr 0.0500 - bad epochs 0
2019-07-02 19:56:14,378 DEV : loss 1.1580216884613037 - score 0.9409
2019-07-02 19:57:25,740 TEST : loss 1.1297632455825806 - score 0.9419

2019-07-02 20:04:34,588 EPOCH 2 done: loss 1.5655 - lr 0.0500 - bad epochs 0
2019-07-02 20:06:55,173 DEV : loss 0.9526862502098083 - score 0.9497
2019-07-02 20:08:05,667 TEST : loss 0.9231987595558167 - score 0.9506

2019-07-02 20:15:10,766 EPOCH 3 done: loss 1.2434 - lr 0.0500 - bad epochs 0
2019-07-02 20:17:30,138 DEV : loss 0.7518299221992493 - score 0.9603
2019-07-02 20:18:40,065 TEST : loss 0.7275686264038086 - score 0.9623

2019-07-02 20:25:44,550 EPOCH 4 done: loss 1.0927 - lr 0.0500 - bad epochs 0
2019-07-02 20:28:04,731 DEV : loss 0.6411412358283997 - score 0.9641
2019-07-02 20:29:13,444 TEST : loss 0.6374855041503906 - score 0.9648

.
.
.

2019-07-02 21:18:31,623 EPOCH 9 done: loss 0.7606 - lr 0.0500 - bad epochs 0
2019-07-02 21:20:50,077 DEV : loss 0.43434903025627136 - score 0.9761
2019-07-02 21:21:59,644 TEST : loss 0.42688044905662537 - score 0.9776

2019-07-02 21:29:00,119 EPOCH 10 done: loss 0.7143 - lr 0.0500 - bad epochs 0
2019-07-02 21:31:18,944 DEV : loss 0.4277861714363098 - score 0.9762
2019-07-02 21:32:28,363 TEST : loss 0.4218243956565857 - score 0.9784

I skipped some lines from my log file, but in every epoch the training loss is significantly higher than the dev and test loss. I used a 7-2-1 train-dev-test split and made sure that the split was random. At first I suspected that the loss was not normalized, but then I saw this issue that @amitbcp opened, and it seems the loss has since been normalized. So what could have caused the significant difference between the training loss and the dev/test loss?
Could it have something to do with dropout? I saw a possible explanation in this post saying that maybe dropout is only activated during training but not during prediction. Is this the case?
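
For reference, here is a minimal PyTorch sketch, independent of Flair, showing that dropout layers are active in train() mode but disabled in eval() mode; the toy model and data are made up for illustration:

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 3),
)
criterion = nn.CrossEntropyLoss()

x = torch.randn(128, 20)
y = torch.randint(0, 3, (128,))

model.train()                     # dropout active: units are randomly zeroed
loss_train_mode = criterion(model(x), y).item()

model.eval()                      # dropout disabled: the full network is used
with torch.no_grad():
    loss_eval_mode = criterion(model(x), y).item()

print(loss_train_mode, loss_eval_mode)  # the two losses differ for the same batch

If the dev/test evaluation runs with the model in eval() mode, as evaluation code usually does, the reported training loss is measured with dropout noise while the dev/test loss is not, which by itself could open such a gap.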

Thank you!

alanakbik (Collaborator) commented:

I've looked a bit into this and unfortunately I don't think it has to do with dropout. Rather, this seems to stem from the way we batch the data to compute the Viterbi loss during testing, which seems to be somewhat inaccurate and does not correlate with the F1 score (i.e. when the dev loss starts climbing, F1 also gets better, which intuitively seems wrong). We could switch to other implementations of the Viterbi loss, but the ones I've tested were slower, so I'm not sure if we want to make this trade-off. Any help from the community would be appreciated here :)

yzhaoinuw (Contributor) commented Jul 8, 2019

@alanakbik Hi Alan, thank you for the reply. I saw that for the SequenceTagger class there are two types of dropout that default to non-zero values, namely word_dropout=0.05 and locked_dropout=0.5. I tested my hypothesis that one of these dropouts caused the gap between the training loss and the dev/test loss that I previously saw in each epoch. So I trained a SequenceTagger model with both dropouts set to 0, keeping everything else the same, using the following code:
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=ELMoEmbeddings('pubmed'),
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True,
                                        locked_dropout=0.0,
                                        word_dropout=0.0)
Then, from the log file, I noticed that the gap between the training loss and the dev/test loss disappeared after epoch 1, and the training loss in fact becomes significantly lower than the dev/test loss later on. The lines from the log file are:

2019-07-06 18:53:12,364 EPOCH 1 done: loss 1.7662 - lr 0.0500 - bad epochs 0
2019-07-06 18:55:28,026 DEV : loss 0.7476624846458435 - score 0.961
2019-07-06 18:56:36,483 TEST : loss 0.7422851324081421 - score 0.9606

2019-07-06 19:04:24,118 EPOCH 2 done: loss 0.6406 - lr 0.0500 - bad epochs 0
2019-07-06 19:06:39,654 DEV : loss 0.5652707815170288 - score 0.9709
2019-07-06 19:07:46,186 TEST : loss 0.5547394156455994 - score 0.9713

2019-07-06 19:15:30,800 EPOCH 3 done: loss 0.3925 - lr 0.0500 - bad epochs 0
2019-07-06 19:17:45,920 DEV : loss 0.39891719818115234 - score 0.979
2019-07-06 19:18:52,329 TEST : loss 0.3932212293148041 - score 0.9797

2019-07-06 19:26:36,812 EPOCH 4 done: loss 0.2437 - lr 0.0500 - bad epochs 0
2019-07-06 19:28:52,055 DEV : loss 0.3676077723503113 - score 0.9823
2019-07-06 19:29:59,331 TEST : loss 0.3704891502857208 - score 0.9818

.
.
.

2019-07-06 20:22:07,215 EPOCH 9 done: loss 0.0404 - lr 0.0500 - bad epochs 0
2019-07-06 20:24:22,235 DEV : loss 0.31244173645973206 - score 0.9884
2019-07-06 20:25:29,638 TEST : loss 0.32593005895614624 - score 0.9881

2019-07-06 20:33:11,261 EPOCH 10 done: loss 0.0300 - lr 0.0500 - bad epochs 0
2019-07-06 20:35:26,659 DEV : loss 0.2994307279586792 - score 0.9894
2019-07-06 20:36:34,007 TEST : loss 0.32296326756477356 - score 0.9888

You can see that from epoch 2 on, the training loss gets very close to the dev/test loss and then drops below it. At the end of epoch 10, it is almost ten times lower than the dev/test loss.
I added more lines from the log file of my previous model to my earlier post to better show the contrast. The contrast between the previous setup (with default word_dropout and locked_dropout) and the current setup (all dropouts set to 0) is huge. Given this result, do you think the dropouts also play a role in the gap between the training and dev/test loss, in addition to what you suggested, i.e. the way you "batch the data to compute the viterbi loss during testing"?
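
For reference, here is a rough, Flair-independent sketch of the two mechanisms as commonly described: word dropout zeroes entire token embeddings, while locked (variational) dropout reuses one mask across all time steps. Flair's own WordDropout and LockedDropout implementations may differ in detail.

# Illustrative only; not Flair's actual classes.
import torch

def word_dropout(embeddings: torch.Tensor, p: float = 0.05) -> torch.Tensor:
    # embeddings: (batch, seq_len, dim); drop whole tokens with probability p
    keep = (torch.rand(embeddings.shape[:2], device=embeddings.device) > p).float()
    return embeddings * keep.unsqueeze(-1)

def locked_dropout(embeddings: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    # one Bernoulli mask per (batch, dim), shared across the sequence dimension
    mask = (torch.rand(embeddings.size(0), 1, embeddings.size(2),
                       device=embeddings.device) > p).float() / (1 - p)
    return embeddings * mask

torch.manual_seed(1)
x = torch.randn(2, 5, 8)                         # 2 sentences, 5 tokens, dim 8
print(word_dropout(x, p=0.3).abs().sum(dim=-1))  # dropped tokens show up as 0
print(locked_dropout(x)[0, :, 0])                # a feature is kept or dropped at every step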

alanakbik (Collaborator) commented:

Hello @yzhaoinuw, thanks for sharing these results! Yes, I think you're right: it may be because of the dropout that is activated during training but deactivated during testing. Perhaps especially the word dropout is causing these divergences (I have to test to check).
To explain epoch 1: we currently compute the training loss as the average loss over all mini-batches during an epoch of training. Since the early mini-batches of an untrained model have a really high loss, the average in epochs 1 and 2 comes out much higher than the dev loss.
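
A purely illustrative sketch of that epoch-1 effect, with made-up numbers:

# The reported training loss for an epoch is the average over all mini-batches
# seen while the model was still improving, whereas the dev loss is computed
# once with the weights at the end of the epoch.
batch_losses = [9.0, 6.0, 4.0, 2.5, 1.8, 1.4, 1.2, 1.1]  # loss drops fast within epoch 1

epoch_train_loss = sum(batch_losses) / len(batch_losses)
end_of_epoch_dev_loss = 1.15  # hypothetical value measured with the final weights

print(epoch_train_loss)       # 3.375 -- well above ...
print(end_of_epoch_dev_loss)  # ... the dev loss, even before any dropout effect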
