
Is corpus object reusable across ModelTrainer instances? #1604

Closed
sankaran45 opened this issue May 12, 2020 · 3 comments · Fixed by #1679
Labels: question (Further information is requested)

Comments

sankaran45 commented May 12, 2020

I have three checkpoint files generated from a training run that uses PooledFlairEmbeddings, say chk10.pt, chk20.pt, and chk30.pt.

To get the F1 predictions out, I finalize each checkpoint with the following code in a for loop:

```python
trainer: ModelTrainer = ModelTrainer.load_checkpoint(chkfile, corpus)
trainer.train('.', checkpoint=False, train_with_dev=True, max_epochs=epochs)
```

I set epochs to the value at which the checkpoint was generated (10, 20, 30, and so on), so training typically goes straight to creating the final model and emitting the predictions.
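
Roughly, the full loop looks like this (a sketch, not my exact script; the corpus construction, column format, and checkpoint/epoch pairs are placeholders):

```python
from flair.datasets import ColumnCorpus
from flair.trainers import ModelTrainer

# The corpus is built once, outside the loop, and reused across trainers
# (placeholder NER-style column format).
corpus = ColumnCorpus('data/', {0: 'text', 1: 'ner'})

for chkfile, epochs in [('chk10.pt', 10), ('chk20.pt', 20), ('chk30.pt', 30)]:
    # Reload each checkpoint with the *same* corpus object.
    trainer: ModelTrainer = ModelTrainer.load_checkpoint(chkfile, corpus)
    # max_epochs equals the epoch the checkpoint was saved at, so training
    # goes straight to creating the final model and emitting predictions.
    trainer.train('.', checkpoint=False, train_with_dev=True, max_epochs=epochs)
```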

This works perfectly fine the first time through the loop, but after that the predictions are quite wrong. If, instead of looping, I run it just once and restart the process each time, I get the values I expect. This behavior happens only with PooledFlairEmbeddings; the same program runs just fine with ELMoEmbeddings and BertEmbeddings.

So my question is: why is this the case? Is it because I create the corpus object outside the for loop and keep reusing it across different ModelTrainer instances?

It happens quite regularly for me. If needed, I can make a small program and share it.

sankaran45 added the question label on May 12, 2020
alanakbik (Collaborator) commented

Could you create a minimal example script to reproduce?


sankaran45 commented May 26, 2020

Please run the attached script. Three log files will be generated: the initial training + evaluation, then the checkpoint reload + evaluation repeated two times.

This happens on other datasets too, but only with PooledFlairEmbeddings. Now, I understand some variability is expected, but this is huge. Interestingly, if I restart the process I get deterministic results.

I would appreciate either an explanation or, preferably, a fix, as it's causing a lot of problems for me in reproducing my results.

upload.zip

alanakbik (Collaborator) commented

Hi @sankaran45, thanks for reporting this and preparing the script to reproduce. This is a bug connected to the serialization of the pooled flair embeddings. The "memory" of stored embeddings that PooledFlairEmbeddings keep was de-serialized differently from how it was serialized. So when the final model was loaded during training, the memory was scrambled, causing the big difference in prediction accuracy.

Strangely, this error happens only if you serialize PooledFlairEmbeddings as part of a model, not if you serialize them by themselves. I think this may have something to do with the memory being kept in CPU memory while the model (if you train on GPU) is on CUDA, with de-serialization mapping everything to CUDA. This also did not happen before, so I wonder whether torch's serialization logic has changed.
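
As a hypothetical illustration of that suspected mechanism (not flair's actual code): a CPU-resident memory saved next to CUDA weights gets relocated when the checkpoint is loaded with map_location pointing at the GPU.

```python
import torch

if torch.cuda.is_available():
    memory = {'word': torch.zeros(3)}               # meant to stay on CPU
    weights = {'w': torch.zeros(3, device='cuda')}  # model weights on GPU
    torch.save({'memory': memory, 'weights': weights}, 'chk.pt')

    # Loading with map_location='cuda' moves *everything*, including the
    # CPU-resident memory, onto the GPU.
    state = torch.load('chk.pt', map_location='cuda')
    print(state['memory']['word'].device)  # cuda:0, not cpu
```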

Anyway, I've pushed a branch that fixes this by always de-serializing to CPU and then moving to GPU. This fixes your script on my setup.
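
In general terms, the fix follows the standard torch pattern sketched below (a sketch with a stand-in model and hypothetical file name, not the actual patch):

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.nn.Linear(4, 2)  # stand-in for the tagger
torch.save(model.state_dict(), 'final-model.pt')

# 1) Always de-serialize to CPU first, so nothing is silently mapped to CUDA ...
state = torch.load('final-model.pt', map_location='cpu')
model.load_state_dict(state)
# 2) ... and only then move the model to the target device explicitly.
model.to(device)
```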
