lazy loading (generators) functionality #595
Conversation
Thank you very much for adding this - it addresses a major pain point for many users. I started testing and noticed one thing: the following line throws an error if you train with Also, the intermediate logging is now turned off. Perhaps we could find a solution in which we count mini-batches during the first epoch and then do our modulo logging at each 10% step from epoch 2?
Oh I am so excited to see this!!!! Thank you so much @mfojtak!
@Hellisotherpeople - thank you!
I'm going to try testing this tonight - any quick tutorial for how to utilize this for sequence labeling with OOM datasets vs the given documentation? I'll try to figure it out, but even a trivial code example may help me.
Yes, please test it for performance and memory usage. You can use any corpus included in the library. They all now load data lazily.
```diff
-if not test_mode:
-    random.shuffle(train_data)
+#if not test_mode:
+#    random.shuffle(train_data)
```
We shuffle the training data at each epoch so that the model does not overfit to a specific order of data points - this often improves model quality by quite a bit. Is there a way to shuffle the data at each epoch with the iterators?
We might implement a "lazy" sentence which would point into the text file. But a list of such sentences might still be an out-of-memory structure. Another option is to prefetch a batch of samples and randomly shuffle within that batch (this is how it works in AllenNLP).
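The buffer-shuffle idea mentioned above can be sketched in plain Python. This is a hypothetical illustration, not flair's or AllenNLP's actual implementation; the function name `buffered_shuffle` and its parameters are made up for this example. Only `buffer_size` items are ever held in memory, so the input can be an arbitrarily large stream:

```python
import random
from itertools import islice

def buffered_shuffle(iterable, buffer_size=1000, seed=None):
    """Yield items in approximately random order using a fixed-size buffer.

    Fill a buffer of `buffer_size` items, then for each incoming item yield
    a random buffered item and put the new item in its slot. Finally flush
    the remaining buffer in shuffled order.
    """
    rng = random.Random(seed)
    it = iter(iterable)
    buffer = list(islice(it, buffer_size))
    for item in it:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    rng.shuffle(buffer)
    yield from buffer

# Usage: shuffle a stream of sentences without materializing the full list.
stream = (f"sentence-{i}" for i in range(10))
shuffled = list(buffered_shuffle(stream, buffer_size=4, seed=0))
```

The trade-off is that the shuffle is only local: items can move at most roughly `buffer_size` positions, so a larger buffer gives a more thorough shuffle at the cost of more memory.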
I pulled this and it's now causing a cuDNN error:
Generators are a great improvement. But there are a few limitations (shuffling), so I think PyTorch data loaders would be more powerful (and would also support multi-GPU in the future) :)
@Hellisotherpeople - could you please provide the code which caused the error?
I took a closer look at torchtext and it looks like it loads the entire dataset when shuffling.
I need this feature. If I compile and install your version from https://github.com/mfojtak/flair now, will that feature be enabled?
Hello @312shan, as of version 0.4.2 we now use PyTorch DataLoaders for this, so the feature is already part of Flair.
This pull request implements lazy loading for Corpus to fix memory issues.
The main idea is that the Corpus properties `train`, `dev` and `test` are now methods returning `Iterable[Sentence]`.
The TaggedCorpus constructor now works with Lists (as before) but also with generator functions and lambdas.
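The constructor behavior described above can be sketched as follows. This is a hypothetical simplification (the class name `TaggedCorpusSketch` and its internals are assumptions, not flair's real code): a concrete list is returned as-is, while a generator function or lambda is invoked to produce a fresh iterator on every access, so each epoch can re-stream the data from disk.

```python
from typing import Callable, Iterable, List, Union

Sentence = str  # stand-in for flair's Sentence type

class TaggedCorpusSketch:
    """Sketch of a corpus whose splits may be lists or lazy generators."""

    def __init__(self,
                 train: Union[List[Sentence], Callable[[], Iterable[Sentence]]],
                 dev=None,
                 test=None):
        self._train = train
        self._dev = dev
        self._test = test

    def train(self) -> Iterable[Sentence]:
        # A callable (generator function or lambda) is invoked each time,
        # yielding a fresh iterator; a plain list is returned unchanged.
        return self._train() if callable(self._train) else self._train

# Works with a list, as before...
eager = TaggedCorpusSketch(train=["s1", "s2"])
assert list(eager.train()) == ["s1", "s2"]

# ...or with a lambda wrapping a generator, re-iterable on every epoch.
lazy = TaggedCorpusSketch(train=lambda: (f"s{i}" for i in range(2)))
assert list(lazy.train()) == list(lazy.train()) == ["s0", "s1"]
```

Passing a callable rather than a generator object is what makes multiple epochs possible: a bare generator would be exhausted after the first pass.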