Is the corpus object reusable across ModelTrainer instances? #1604
Comments
Could you create a minimal example script to reproduce?
Please run the attached script. Three log files will be generated: the initial training + evaluation, then a reload of the checkpoint + evaluation repeated two times. This happens on other datasets as well, but only with PooledFlairEmbedding. Now, I understand some variability is expected, but this is huge. Interestingly, if I restart the process I get deterministic results. I would appreciate either an explanation or, preferably, a fix, as this is causing a lot of problems in reproducing my results.
Hi @sankaran45, thanks for reporting this and preparing the script to reproduce. This is a bug connected to serialization of the pooled flair embeddings: the "memory" of stored embeddings is not handled correctly when the model is serialized and reloaded. Strangely, this error happens only if you serialize. Anyway, I've pushed a branch that fixes this by always de-serializing to CPU and then moving to GPU. This fixes your script on my setup.
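The approach described above (always de-serialize to CPU, then move to GPU) can be sketched roughly as follows. This is an illustration only, not flair's actual implementation; the function name and the flat `{name: tensor}` checkpoint layout are hypothetical:

```python
import torch

def load_to_cpu_then_move(path, device="cpu"):
    # De-serialize everything onto the CPU first, so no stored state
    # (e.g. the pooled embeddings' "memory") is restored onto a stale
    # GPU context, then move each tensor to the target device explicitly.
    state = torch.load(path, map_location="cpu")
    return {name: tensor.to(device) for name, tensor in state.items()}
```

With `map_location="cpu"`, the serialized tensors are always rebuilt on the host regardless of which device they were saved from, and the subsequent `.to(device)` makes the placement explicit and repeatable.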
I have three checkpoint files generated from a training run that uses the PooledFlair embedding, say chk10.pt, chk20.pt, chk30.pt.
I finalize using the following code in a for loop to get the F1 predictions out:

```python
trainer: ModelTrainer = ModelTrainer.load_checkpoint(chkfile, corpus)
trainer.train('.', checkpoint=False, train_with_dev=True, max_epochs=epochs)
```

I set `epochs` to the value at which the checkpoint was generated (10, 20, 30, etc.), so the call typically goes straight to creating the final model and emitting the predictions.
This works perfectly fine the first time through the loop, after which the predictions are quite wrong. If, instead of looping, I run it just once by restarting the process, I get the values I expect. This behavior happens only with PooledFlairEmbedding; the same program runs fine with ElmoEmbedding and BertEmbedding.
So my question is: why is this the case? Is it because I create the corpus object outside the for loop and keep reusing it across different ModelTrainer instances?
It happens quite regularly for me. If needed, I can make a small program and share it.
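On the corpus question: a plain corpus object carries no model state, but any shared mutable cache reused across trainer instances would produce exactly this "first run fine, later runs wrong" pattern. A toy illustration in plain Python (all names are hypothetical stand-ins, not flair internals):

```python
class PooledMemory:
    """Toy stand-in for a pooled-embedding 'memory' that persists across runs."""

    def __init__(self):
        self.counts = {}

    def embed(self, token):
        # Each call updates the pooled state, so repeated evaluations
        # over the same data no longer see the same embedding values.
        self.counts[token] = self.counts.get(token, 0) + 1
        return self.counts[token]

def evaluate(memory, data):
    # Stand-in for one trainer run that embeds every token in the data.
    return [memory.embed(t) for t in data]

memory = PooledMemory()                       # created once, like module-level state
run1 = evaluate(memory, ["a", "b"])           # [1, 1] -- first run is clean
run2 = evaluate(memory, ["a", "b"])           # [2, 2] -- state leaked from run1
fresh = evaluate(PooledMemory(), ["a", "b"])  # [1, 1] -- like restarting the process
```

The second run through the loop sees state accumulated by the first, while restarting the process resets it, matching the symptom described above.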