
lazy loading (generators) functionality #595

Closed
wants to merge 9 commits

Conversation

@mfojtak commented Mar 6, 2019

This pull request implements lazy loading for Corpus to fix memory issues.
The main idea is that the Corpus properties train, dev, and test are now methods returning Iterable[Sentence]:

def train(self) -> Iterable[Sentence]:

The TaggedCorpus constructor now works with Lists (as before) but also with generator functions and lambdas.
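
For illustration, here is a minimal sketch of passing generator functions to the constructor (the file names and the read_split helper are hypothetical; only the ability to pass generator functions comes from this PR):

```python
from typing import Iterable

from flair.data import Sentence, TaggedCorpus

# Hypothetical helper: returns a generator function that streams one
# Sentence per line from a (possibly huge) file, never holding it in memory.
def read_split(path: str):
    def generator() -> Iterable[Sentence]:
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield Sentence(line.strip())
    return generator

# Lists still work as before, but generator functions do too:
corpus = TaggedCorpus(
    train=read_split("train.txt"),
    dev=read_split("dev.txt"),
    test=read_split("test.txt"),
)
```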

@alanakbik (Collaborator)

Thank you very much for adding this - it addresses a major pain point for many users.

Started testing and noted one thing: The following line throws an error if you train with train_with_dev=True:

https://github.com/zalandoresearch/flair/blob/0a7ac3e5931163a96b16fa047d7abcb9aed161c6/flair/trainers/trainer.py#L114

Also, the intermediate logging is now turned off. Perhaps we could find a solution in which we count mini-batches during the first epoch and then do our modulo logging at each 10% step from epoch 2?

@Hellisotherpeople

Oh I am so excited to see this!!!! Thank you so much @mfojtak!

@mfojtak (Author) commented Mar 7, 2019

@Hellisotherpeople - thank you!
@alanakbik - you are right. There are some pieces of code which need to know the size of the data. There are already functions like make_tag_dictionary that iterate through the whole dataset prior to training, and these could be used for it too; however, that requires deeper refactoring. The Corpus itself should not care whether it will be used for tags or labels - that should be the trainer's concern. In the meantime, I will probably do what you suggest and enable those pieces of code from the second epoch onward (see the sketch below). Also, data_fetcher.py is becoming a bit of a nightmare; it should be split into a separate class per corpus.
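
A minimal sketch of that workaround (all names here are illustrative, not the actual trainer code): the total number of mini-batches is unknown during the first pass over a generator, so we count them in epoch 1 and enable the 10%-step modulo logging from epoch 2 onward.

```python
import logging

log = logging.getLogger("flair")

def train_loop(batch_iterator, train_step, max_epochs: int):
    total_batches = None  # unknown until one full pass over the generator
    for epoch in range(max_epochs):
        seen = 0
        for batch in batch_iterator():
            train_step(batch)
            seen += 1
            # from epoch 2 onward, log at every 10% step
            if total_batches is not None and seen % max(1, total_batches // 10) == 0:
                log.info(f"epoch {epoch + 1} - batch {seen}/{total_batches}")
        total_batches = seen  # known after the first epoch
```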

@Hellisotherpeople

I'm going to try testing this tonight - is there any quick tutorial on how to use this for sequence labeling with out-of-memory (OOM) datasets, beyond the existing documentation? I'll try to figure it out, but even a trivial code example would help me.

@mfojtak (Author) commented Mar 12, 2019

> I'm going to try testing this tonight - is there any quick tutorial on how to use this for sequence labeling with out-of-memory (OOM) datasets, beyond the existing documentation? I'll try to figure it out, but even a trivial code example would help me.

Yes, please test it for performance and memory usage. You can use any corpus included in the library - they all now load data lazily. A trivial example is sketched below.
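
A trivial sketch, assuming this branch's API (where train is a method returning an iterable; the corpus choice is arbitrary):

```python
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask

# any corpus included in the library - data is now loaded lazily
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)

# on this branch, train is a method returning Iterable[Sentence]
for sentence in corpus.train():
    print(sentence)
    break
```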

```diff
-    if not test_mode:
-        random.shuffle(train_data)
+    #if not test_mode:
+    #    random.shuffle(train_data)
```
Collaborator
We shuffle the training data at each epoch so that the model does not overfit to a specific order of data points - this often improves model quality by quite a bit. Is there a way to shuffle the data at each epoch with the iterators?

@mfojtak (Author) commented Mar 13, 2019

We might implement a "lazy" sentence which would point into the text file, but a list of such sentences might still be an OOM structure. Another option is to prefetch a batch of samples and shuffle randomly within that batch (this is how it works in AllenNLP) - see the sketch below.
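
A minimal sketch of that prefetch-and-shuffle idea (buffer size and all names are illustrative):

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def buffered_shuffle(items: Iterable[T], buffer_size: int = 10000) -> Iterator[T]:
    """Approximate shuffle: fill a fixed-size buffer, shuffle it, emit, repeat."""
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            random.shuffle(buffer)
            yield from buffer
            buffer.clear()
    random.shuffle(buffer)  # flush the remainder
    yield from buffer
```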

@Hellisotherpeople commented Mar 13, 2019

I pulled this and it's now causing a cuDNN error. I did some work to verify that my GPU is detectable, and the older version of flair seems to work, so I think something in this change is causing my issue:


  File "load_model_flair.py", line 23, in <module>
    tagger = SequenceTagger(hidden_size=256, embeddings = stacked_embeddings, tag_dictionary=tag_dictionary, tag_type=tag_type, use_crf=True)
  File "/home/lain/flair/flair/models/sequence_tagger_model.py", line 153, in __init__
    self.to(flair.device)
  File "/home/lain/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/home/lain/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/lain/.local/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 117, in _apply
    self.flatten_parameters()
  File "/home/lain/.local/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 113, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

@stefan-it (Member)

Generators are a great improvement, but there are a few limitations (shuffling), so I think PyTorch data loaders would be more powerful (and would also support multi-GPU in the future) :)

@mfojtak (Author) commented Mar 14, 2019

@Hellisotherpeople - could you please provide the code which caused the error?

@mfojtak (Author) commented Mar 14, 2019

> Generators are a great improvement, but there are a few limitations (shuffling), so I think PyTorch data loaders would be more powerful (and would also support multi-GPU in the future) :)

I took a closer look at torchtext and it looks like it loads the entire dataset when shuffling.
Besides, I think that shuffling should be a trainer's concern. We could have a shuffler class that accepts an iterator and does the job, and that class could have multiple implementations: brute force (load the entire dataset and shuffle), or load smaller "batches" and shuffle within them. Also, if the generator returns "lazy sentences", then even the brute-force shuffler would nicely shuffle extremely large datasets, since a lazy sentence is simply a pointer to the sentence within the dataset - see the sketch below.
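
A minimal sketch of the lazy-sentence idea (file layout and class names are hypothetical): each object stores only a file path and a byte offset, so even for an extremely large dataset the list of pointers stays small and random.shuffle can shuffle all of it.

```python
import random

class LazySentence:
    """Pointer to one sentence: a file path plus a byte offset."""

    def __init__(self, path: str, offset: int):
        self.path = path
        self.offset = offset

    def load(self) -> str:
        # parse the sentence text only when it is actually needed
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            return f.readline().decode("utf-8").strip()

def index_sentences(path: str) -> list:
    """One cheap pass over the file to record the offset of every line."""
    pointers, offset = [], 0
    with open(path, "rb") as f:
        for line in f:
            pointers.append(LazySentence(path, offset))
            offset += len(line)
    return pointers

pointers = index_sentences("train.txt")  # hypothetical one-sentence-per-line file
random.shuffle(pointers)                 # brute-force shuffle of pointers only
```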

@312shan commented Jul 31, 2019

> This pull request implements lazy loading for Corpus to fix memory issues.
> The main idea is that the Corpus properties train, dev, and test are now methods returning Iterable[Sentence]:
>
> def train(self) -> Iterable[Sentence]:
>
> The TaggedCorpus constructor now works with Lists (as before) but also with generator functions and lambdas.

I need this feature. If I compile and install your version from https://github.com/mfojtak/flair, is that feature enabled?

@alanakbik (Collaborator)

Hello @312shan, as of version 0.4.2 we are now using PyTorch DataLoaders to do this, so the feature is already part of Flair. The .train of a Corpus is now a Dataset that you can iterate through using a DataLoader - see the sketch below.
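
A minimal sketch of that usage (the corpus choice and collate_fn are just for illustration; any flair.datasets corpus should work):

```python
from torch.utils.data import DataLoader
from flair.datasets import UD_ENGLISH

corpus = UD_ENGLISH()

# corpus.train is a Dataset; collate each batch into a plain list of Sentences
loader = DataLoader(corpus.train, batch_size=32, shuffle=True, collate_fn=list)

for batch in loader:
    print(len(batch))  # batch is a list of Sentence objects
    break
```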

@mfojtak closed this Aug 3, 2019