
Is corpus object reusable across ModelTrainer instances? #1604

Closed
sankaran45 opened this issue May 12, 2020 · 3 comments · Fixed by #1679
Labels: question (Further information is requested)

Comments

sankaran45 commented May 12, 2020

I have three checkpoint files generated from a training run that uses PooledFlairEmbeddings, say chk10.pt, chk20.pt, and chk30.pt.

To get the F1 predictions out, I finalize each checkpoint with the following code in a for loop:

```python
trainer: ModelTrainer = ModelTrainer.load_checkpoint(chkfile, corpus)
trainer.train('.', checkpoint=False, train_with_dev=True, max_epochs=epochs)
```

I set epochs to the value at which the checkpoint was generated (10, 20, 30, and so on), so training typically goes straight to creating the final model and emitting the predictions.
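
Roughly, the full loop looks like this (a sketch, not my exact script; the corpus construction, column format, and checkpoint/epoch pairs are placeholders):

```python
from flair.datasets import ColumnCorpus
from flair.trainers import ModelTrainer

# The corpus is built once, outside the loop, and reused across trainers
# (placeholder NER-style column format).
corpus = ColumnCorpus('data/', {0: 'text', 1: 'ner'})

for chkfile, epochs in [('chk10.pt', 10), ('chk20.pt', 20), ('chk30.pt', 30)]:
    # Reload each checkpoint with the *same* corpus object.
    trainer: ModelTrainer = ModelTrainer.load_checkpoint(chkfile, corpus)
    # max_epochs equals the epoch the checkpoint was saved at, so training
    # goes straight to creating the final model and emitting predictions.
    trainer.train('.', checkpoint=False, train_with_dev=True, max_epochs=epochs)
```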

This works perfectly fine the first time through the loop, but after that the predictions are quite wrong. If, instead of looping, I run it just once and restart the process each time, I get the values I expect. This behavior happens only with PooledFlairEmbeddings; the same program runs just fine with ELMoEmbeddings and BertEmbeddings.

So my question is: why is this the case? Is it because I create the corpus object outside the for loop and keep reusing it across different ModelTrainer instances?

It happens quite regularly for me. If needed, I can make a small program and share it.

sankaran45 added the question label on May 12, 2020
alanakbik (Collaborator) commented

Could you create a minimal example script to reproduce?


sankaran45 commented May 26, 2020

Please run the attached script. Three log files will be generated: the initial training + evaluation, then the checkpoint reload + evaluation repeated two times.

This happens on other datasets too, but only with PooledFlairEmbeddings. Now, I understand some variability is expected, but this is huge. Interestingly, if I restart the process I get deterministic results.

I would appreciate either an explanation or, preferably, a fix, as it's causing a lot of problems for me in reproducing my results.

upload.zip

alanakbik (Collaborator) commented

Hi @sankaran45, thanks for reporting this and preparing the script to reproduce. This is a bug connected to the serialization of the pooled flair embeddings. The "memory" of stored embeddings that PooledFlairEmbeddings keep was de-serialized differently from how it was serialized. So when the final model was loaded during training, the memory was scrambled, causing the big difference in prediction accuracy.

Strangely, this error happens only if you serialize PooledFlairEmbeddings as part of a model, not if you serialize them by themselves. I think this may have something to do with the memory being kept in CPU memory while the model (if you train on GPU) is on CUDA, with de-serialization mapping everything to CUDA. This also did not happen before, so I wonder whether torch's serialization logic has changed.
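
As a hypothetical illustration of that suspected mechanism (not flair's actual code): a CPU-resident memory saved next to CUDA weights gets relocated when the checkpoint is loaded with map_location pointing at the GPU.

```python
import torch

if torch.cuda.is_available():
    memory = {'word': torch.zeros(3)}               # meant to stay on CPU
    weights = {'w': torch.zeros(3, device='cuda')}  # model weights on GPU
    torch.save({'memory': memory, 'weights': weights}, 'chk.pt')

    # Loading with map_location='cuda' moves *everything*, including the
    # CPU-resident memory, onto the GPU.
    state = torch.load('chk.pt', map_location='cuda')
    print(state['memory']['word'].device)  # cuda:0, not cpu
```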

Anyway, I've pushed a branch that fixes this by always de-serializing to CPU and then moving to GPU. This fixes your script on my setup.
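
In general terms, the fix follows the standard torch pattern sketched below (a sketch with a stand-in model and hypothetical file name, not the actual patch):

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.nn.Linear(4, 2)  # stand-in for the tagger
torch.save(model.state_dict(), 'final-model.pt')

# 1) Always de-serialize to CPU first, so nothing is silently mapped to CUDA ...
state = torch.load('final-model.pt', map_location='cpu')
model.load_state_dict(state)
# 2) ... and only then move the model to the target device explicitly.
model.to(device)
```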
