Train & Test Loss scales #670
Comments
Hello @amitbcp thank you for reporting this! You are right, it is inconsistent, also between the […]
Hi, I trained a SequenceTagger model for 10 epochs on a NER dataset with BIO tags. The hyper-parameter setup is the following: […]

After each epoch, the loss on the training set is significantly higher than that on the dev set or test set. For example, in the log file:

2019-07-02 20:04:34,588 EPOCH 2 done: loss 1.5655 - lr 0.0500 - bad epochs 0
2019-07-02 20:15:10,766 EPOCH 3 done: loss 1.2434 - lr 0.0500 - bad epochs 0
2019-07-02 20:25:44,550 EPOCH 4 done: loss 1.0927 - lr 0.0500 - bad epochs 0
...
2019-07-02 21:18:31,623 EPOCH 9 done: loss 0.7606 - lr 0.0500 - bad epochs 0
2019-07-02 21:29:00,119 EPOCH 10 done: loss 0.7143 - lr 0.0500 - bad epochs 0

I skipped some lines from my log file, but in every epoch the training loss is significantly higher than the dev and test loss. I used a 7-2-1 train-dev-test split and made sure the split was random. At first I suspected that the loss was not normalized, but then I saw this issue that @amitbcp opened, and it seems the loss has since been normalized. So what could have caused the significant difference between the training loss and the dev/test loss? Thank you!
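For context, here is a hypothetical reconstruction of such a setup (flair ~0.4-era API). Only the learning rate (0.05, per "lr 0.0500" in the log) and the 10 epochs come from the comment; the corpus paths, embeddings, and remaining parameters are assumptions:

```python
# Hypothetical reconstruction of the reported setup, not the poster's exact code.
from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# BIO-tagged NER data in CoNLL column format (paths are placeholders)
columns = {0: 'text', 1: 'ner'}
corpus: Corpus = ColumnCorpus('data/', columns,
                              train_file='train.txt',
                              dev_file='dev.txt',
                              test_file='test.txt')

tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')

tagger = SequenceTagger(hidden_size=256,
                        embeddings=WordEmbeddings('glove'),  # assumed
                        tag_dictionary=tag_dictionary,
                        tag_type='ner',
                        use_crf=True)

trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/example-ner',
              learning_rate=0.05,   # matches "lr 0.0500" in the log
              mini_batch_size=32,   # assumed
              max_epochs=10)
```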
I've looked into this a bit and unfortunately I don't think it has to do with dropout. Rather, it seems to stem from the way we batch the data to compute the Viterbi loss during testing, which seems somewhat inaccurate and does not correlate with the F1 score (i.e. when dev loss starts climbing, F1 still gets better, which intuitively seems wrong). We could switch to other implementations of the Viterbi loss, but the ones I've tested were slower, so I'm not sure we want to make that tradeoff. Any help from the community would be appreciated here :)
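As background on why a Viterbi-based evaluation loss can behave oddly (a toy illustration, not flair's implementation): the CRF training loss aggregates over all tag sequences with a logsumexp, while Viterbi decoding scores only the single best sequence with a max, so a number derived from Viterbi scores measures something different from the trained objective:

```python
# Toy linear-chain CRF scoring (not flair's code).
# emissions: (seq_len, num_tags), transitions: (num_tags, num_tags).
import torch

def forward_log_partition(emissions, transitions):
    # log-sum over ALL tag sequences (what the training NLL uses)
    alpha = emissions[0]                                   # (num_tags,)
    for t in range(1, emissions.size(0)):
        # alpha[j] = logsumexp_i(alpha[i] + transitions[i, j]) + emissions[t, j]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    return torch.logsumexp(alpha, dim=0)

def viterbi_score(emissions, transitions):
    # score of the single BEST tag sequence (what decoding uses)
    alpha = emissions[0]
    for t in range(1, emissions.size(0)):
        alpha = (alpha.unsqueeze(1) + transitions).max(dim=0).values + emissions[t]
    return alpha.max()

emissions = torch.randn(5, 4)
transitions = torch.randn(4, 4)
print(forward_log_partition(emissions, transitions))  # always >= viterbi_score
print(viterbi_score(emissions, transitions))
```

The training NLL is the gold-path score minus the log-partition; swapping the logsumexp for a max changes the quantity being tracked, which is one way an evaluation-time "loss" can decouple from F1.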
@alanakbik Hi Alan, thank you for the reply. I saw that for the SequenceTagger class, there are two types of dropout that default to non-zero values, namely word_dropout=0.05 and locked_dropout=0.5. I tested my hypothesis that one of the dropouts caused the gap between the training loss and the dev/test loss that I saw in each epoch previously. So I trained a SequenceTagger model that sets both dropouts to 0 while keeping everything else the same (a reconstructed sketch of such a setup follows this comment):

2019-07-06 18:53:12,364 EPOCH 1 done: loss 1.7662 - lr 0.0500 - bad epochs 0
2019-07-06 19:04:24,118 EPOCH 2 done: loss 0.6406 - lr 0.0500 - bad epochs 0
2019-07-06 19:15:30,800 EPOCH 3 done: loss 0.3925 - lr 0.0500 - bad epochs 0
2019-07-06 19:26:36,812 EPOCH 4 done: loss 0.2437 - lr 0.0500 - bad epochs 0
...
2019-07-06 20:22:07,215 EPOCH 9 done: loss 0.0404 - lr 0.0500 - bad epochs 0
2019-07-06 20:33:11,261 EPOCH 10 done: loss 0.0300 - lr 0.0500 - bad epochs 0

You can see that from epoch 2 on, the training loss gets very close to the dev/test loss and then drops below it. By the end of epoch 10 it is almost ten times smaller than the dev/test loss.
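The code block itself did not survive in this thread; a minimal sketch of what such a zero-dropout configuration might look like, reusing the assumptions from the earlier sketch:

```python
# Hypothetical reconstruction of the zero-dropout variant described above.
# Only word_dropout=0.0 and locked_dropout=0.0 come from the comment; the
# embeddings and tag dictionary are assumptions carried over from above.
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger

tagger = SequenceTagger(hidden_size=256,
                        embeddings=WordEmbeddings('glove'),  # assumed
                        tag_dictionary=tag_dictionary,       # e.g. corpus.make_tag_dictionary(tag_type='ner')
                        tag_type='ner',
                        use_crf=True,
                        word_dropout=0.0,     # default is 0.05
                        locked_dropout=0.0)   # default is 0.5
```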
Hello @yzhaoinuw thanks for sharing these results! Yes, I think you're right: it may be because of the dropout that is active during training but deactivated during testing. Perhaps especially the word dropout is causing these divergences (I'd have to test to check).
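For background, simplified versions of the two mechanisms (not flair's exact modules): word dropout zeroes whole token embeddings, locked (variational) dropout samples one mask and reuses it across all timesteps, and both are active only in training mode, which by itself pushes the training loss above the eval-time loss:

```python
# Simplified sketches of the two dropout variants (not flair's exact code).
import torch
import torch.nn as nn

class WordDropout(nn.Module):
    """Zeroes the entire embedding of a token with probability p (train only)."""
    def __init__(self, p: float = 0.05):
        super().__init__()
        self.p = p

    def forward(self, x):  # x: (batch, seq_len, embedding_dim)
        if not self.training or self.p == 0.0:
            return x
        # one keep/drop decision per token, broadcast over the embedding dim
        mask = torch.bernoulli(x.new_full(x.shape[:2] + (1,), 1 - self.p))
        return x * mask

class LockedDropout(nn.Module):
    """Samples one dropout mask and reuses it at every timestep (train only)."""
    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p

    def forward(self, x):  # x: (batch, seq_len, embedding_dim)
        if not self.training or self.p == 0.0:
            return x
        mask = torch.bernoulli(x.new_full((x.size(0), 1, x.size(2)), 1 - self.p))
        return x * mask / (1 - self.p)
```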
Describe the bug
I am training a text classification model as described in the tutorial. On plotting the training process, I see that the accuracy & F1 score plots converge and stabilize over 50 epochs, whereas the training and test loss show a wide gap even though they stabilize.
Going over loss.tsv, it seemed that the test loss was not scaled the way the training loss was. After manually scaling the test loss, the resulting graph made sense.
On inspecting the flair code (https://github.com/zalandoresearch/flair/blob/master/flair/trainers/trainer.py, line 198), this seems to be the cause: the training loss has been scaled while the test loss has not.
Can you please validate or clarify this?
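If that diagnosis is right, a post-hoc workaround is to rescale the stored test loss before plotting. A hypothetical sketch; the loss.tsv column names and the batch count are assumptions, not verified against flair's output format:

```python
# Hypothetical post-hoc rescaling of the test loss from loss.tsv so that it is
# comparable with the batch-averaged training loss. Column names are assumptions.
import pandas as pd

df = pd.read_csv('loss.tsv', sep='\t')

# the reported train loss is (roughly) the batch-loss sum divided by the number
# of batches; divide the summed test loss by the test batch count to match it
num_test_batches = 32  # placeholder: ceil(len(test_set) / mini_batch_size)
df['TEST_LOSS_SCALED'] = df['TEST_LOSS'] / num_test_batches

ax = df.plot(x='EPOCH', y=['TRAIN_LOSS', 'TEST_LOSS_SCALED'])
ax.figure.savefig('loss_rescaled.png')
```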
Expected behavior
The graphs from Flair: [loss and score plots omitted]
The data used is a sample IMDB data set.
The code follows the text classification tutorial:
https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md
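For reference, a condensed sketch along the lines of that tutorial (flair ~0.4-era API; the data path, hyper-parameters, and output directory are placeholders, not the reporter's exact code), including the plotting step that produces the graphs from loss.tsv:

```python
# Sketch of the tutorial's text classification training loop plus plotting.
from flair.datasets import ClassificationCorpus
from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
from flair.visual.training_curves import Plotter

# FastText-format data: train/dev/test files with __label__ prefixes (path assumed)
corpus = ClassificationCorpus('data/imdb/')
label_dictionary = corpus.make_label_dictionary()

document_embeddings = DocumentRNNEmbeddings([WordEmbeddings('glove')],
                                            hidden_size=512)

classifier = TextClassifier(document_embeddings,
                            label_dictionary=label_dictionary)

trainer = ModelTrainer(classifier, corpus)
trainer.train('resources/classifiers/imdb',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=50)

# plot the curves recorded during training
plotter = Plotter()
plotter.plot_training_curves('resources/classifiers/imdb/loss.tsv')
```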