
lazy loading (generators) functionality #595

Closed
wants to merge 9 commits

Conversation

@mfojtak commented Mar 6, 2019

This pull request implements lazy loading for Corpus to fix memory issues.
The main idea is that the Corpus properties train, dev, and test are now methods returning Iterable[Sentence]:

def train(self) -> Iterable[Sentence]:

The TaggedCorpus constructor now works with Lists (as before) but also with generator functions and lambdas.
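
For illustration, here is a minimal sketch of passing generator functions to the constructor (the file names and the read_split helper are hypothetical; only the ability to pass generator functions comes from this PR):

```python
from typing import Iterable

from flair.data import Sentence, TaggedCorpus

# Hypothetical helper: returns a generator function that streams one
# Sentence per line from a (possibly huge) file, never holding it in memory.
def read_split(path: str):
    def generator() -> Iterable[Sentence]:
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield Sentence(line.strip())
    return generator

# Lists still work as before, but generator functions do too:
corpus = TaggedCorpus(
    train=read_split("train.txt"),
    dev=read_split("dev.txt"),
    test=read_split("test.txt"),
)
```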

@alanakbik (Collaborator)

Thank you very much for adding this - it addresses a major pain point for many users.

Started testing and noted one thing: The following line throws an error if you train with train_with_dev=True:

https://github.com/zalandoresearch/flair/blob/0a7ac3e5931163a96b16fa047d7abcb9aed161c6/flair/trainers/trainer.py#L114

Also, the intermediate logging is now turned off. Perhaps we could find a solution in which we count mini-batches during the first epoch and then do our modulo logging at each 10% step from epoch 2?

@Hellisotherpeople

Oh I am so excited to see this!!!! Thank you so much @mfojtak!

@mfojtak (Author) commented Mar 7, 2019

@Hellisotherpeople - thank you!
@alanakbik - you are right. There are some pieces of code which need to know the size of the data. There are already functions like make_tag_dictionary that iterate through the whole dataset prior to training, and these could be used for it too; however, that requires deeper refactoring. The Corpus itself should not care whether it will be used for tags or labels - that should be the trainer's concern. In the meantime, I will probably do what you suggest and enable those pieces of code from the second epoch onward (see the sketch below). Also, data_fetcher.py is becoming a bit of a nightmare; it should be split into a separate class per corpus.
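
A minimal sketch of that workaround (all names here are illustrative, not the actual trainer code): the total number of mini-batches is unknown during the first pass over a generator, so we count them in epoch 1 and enable the 10%-step modulo logging from epoch 2 onward.

```python
import logging

log = logging.getLogger("flair")

def train_loop(batch_iterator, train_step, max_epochs: int):
    total_batches = None  # unknown until one full pass over the generator
    for epoch in range(max_epochs):
        seen = 0
        for batch in batch_iterator():
            train_step(batch)
            seen += 1
            # from epoch 2 onward, log at every 10% step
            if total_batches is not None and seen % max(1, total_batches // 10) == 0:
                log.info(f"epoch {epoch + 1} - batch {seen}/{total_batches}")
        total_batches = seen  # known after the first epoch
```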

@Hellisotherpeople

I'm going to try testing this tonight - is there any quick tutorial on how to use this for sequence labeling with out-of-memory (OOM) datasets, beyond the existing documentation? I'll try to figure it out, but even a trivial code example would help me.

@mfojtak (Author) commented Mar 12, 2019

> I'm going to try testing this tonight - is there any quick tutorial on how to use this for sequence labeling with out-of-memory (OOM) datasets, beyond the existing documentation? I'll try to figure it out, but even a trivial code example would help me.

Yes, please test it for performance and memory usage. You can use any corpus included in the library - they all now load data lazily. A trivial example is sketched below.
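
A trivial sketch, assuming this branch's API (where train is a method returning an iterable; the corpus choice is arbitrary):

```python
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask

# any corpus included in the library - data is now loaded lazily
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)

# on this branch, train is a method returning Iterable[Sentence]
for sentence in corpus.train():
    print(sentence)
    break
```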

```diff
-    if not test_mode:
-        random.shuffle(train_data)
+    #if not test_mode:
+    #    random.shuffle(train_data)
```
Collaborator
We shuffle the training data at each epoch so that the model does not overfit to a specific order of data points - this often improves model quality by quite a bit. Is there a way to shuffle the data at each epoch with the iterators?

@mfojtak (Author) commented Mar 13, 2019

We might implement a "lazy" sentence which would point into the text file, but a list of such sentences might still be an OOM structure. Another option is to prefetch a batch of samples and shuffle randomly within that batch (this is how it works in AllenNLP) - see the sketch below.
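
A minimal sketch of that prefetch-and-shuffle idea (buffer size and all names are illustrative):

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def buffered_shuffle(items: Iterable[T], buffer_size: int = 10000) -> Iterator[T]:
    """Approximate shuffle: fill a fixed-size buffer, shuffle it, emit, repeat."""
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            random.shuffle(buffer)
            yield from buffer
            buffer.clear()
    random.shuffle(buffer)  # flush the remainder
    yield from buffer
```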

@Hellisotherpeople commented Mar 13, 2019

I pulled this and it's now causing a cuDNN error. I did some work to verify that my GPU is detectable, and the older version of flair seems to work, so I think something in this change is causing my issue:


  File "load_model_flair.py", line 23, in <module>
    tagger = SequenceTagger(hidden_size=256, embeddings = stacked_embeddings, tag_dictionary=tag_dictionary, tag_type=tag_type, use_crf=True)
  File "/home/lain/flair/flair/models/sequence_tagger_model.py", line 153, in __init__
    self.to(flair.device)
  File "/home/lain/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/home/lain/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/lain/.local/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 117, in _apply
    self.flatten_parameters()
  File "/home/lain/.local/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 113, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

@stefan-it (Member)

Generators are a great improvement, but there are a few limitations (shuffling), so I think PyTorch data loaders would be more powerful (and would also support multi-GPU in the future) :)

@mfojtak (Author) commented Mar 14, 2019

@Hellisotherpeople - could you please provide the code which caused the error?

@mfojtak (Author) commented Mar 14, 2019

> Generators are a great improvement, but there are a few limitations (shuffling), so I think PyTorch data loaders would be more powerful (and would also support multi-GPU in the future) :)

I took a closer look at torchtext and it looks like it loads the entire dataset when shuffling.
Besides, I think that shuffling should be a trainer's concern. We could have a shuffler class that accepts an iterator and does the job, and that class could have multiple implementations: brute force (load the entire dataset and shuffle), or load smaller "batches" and shuffle within them. Also, if the generator returns "lazy sentences", then even the brute-force shuffler would nicely shuffle extremely large datasets, since a lazy sentence is simply a pointer to the sentence within the dataset - see the sketch below.
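
A minimal sketch of the lazy-sentence idea (file layout and class names are hypothetical): each object stores only a file path and a byte offset, so even for an extremely large dataset the list of pointers stays small and random.shuffle can shuffle all of it.

```python
import random

class LazySentence:
    """Pointer to one sentence: a file path plus a byte offset."""

    def __init__(self, path: str, offset: int):
        self.path = path
        self.offset = offset

    def load(self) -> str:
        # parse the sentence text only when it is actually needed
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            return f.readline().decode("utf-8").strip()

def index_sentences(path: str) -> list:
    """One cheap pass over the file to record the offset of every line."""
    pointers, offset = [], 0
    with open(path, "rb") as f:
        for line in f:
            pointers.append(LazySentence(path, offset))
            offset += len(line)
    return pointers

pointers = index_sentences("train.txt")  # hypothetical one-sentence-per-line file
random.shuffle(pointers)                 # brute-force shuffle of pointers only
```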

@312shan commented Jul 31, 2019

> This pull request implements lazy loading for Corpus to fix memory issues.
> The main idea is that the Corpus properties train, dev, and test are now methods returning Iterable[Sentence]:
>
> def train(self) -> Iterable[Sentence]:
>
> The TaggedCorpus constructor now works with Lists (as before) but also with generator functions and lambdas.

I need this feature. If I compile and install your version from https://github.com/mfojtak/flair, is that feature enabled?

@alanakbik (Collaborator)

Hello @312shan, as of version 0.4.2 we are now using PyTorch DataLoaders to do this, so the feature is already part of Flair. The .train of a Corpus is now a Dataset that you can iterate through using a DataLoader - see the sketch below.
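
A minimal sketch of that usage (the corpus choice and collate_fn are just for illustration; any flair.datasets corpus should work):

```python
from torch.utils.data import DataLoader
from flair.datasets import UD_ENGLISH

corpus = UD_ENGLISH()

# corpus.train is a Dataset; collate each batch into a plain list of Sentences
loader = DataLoader(corpus.train, batch_size=32, shuffle=True, collate_fn=list)

for batch in loader:
    print(len(batch))  # batch is a list of Sentence objects
    break
```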

@mfojtak closed this Aug 3, 2019