feat: Optimized data loading for PyTorch #362

Merged: 4 commits merged into main from loader-optim on Jul 7, 2021

fg-mindee (Contributor) commented on Jul 6, 2021

Following up on #190, this PR introduces the following modifications:

  • added automatic selection of the number of workers for PyTorch training
  • switched the image reading backend from torchvision to PIL followed by tensor conversion
  • updated the training scripts accordingly
  • updated the Pillow requirement because of an issue between Pillow 8.3 and NumPy (python-pillow/Pillow#5571)
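
As an illustration of the first bullet, here is a minimal sketch of how a default worker count could be derived when the user does not pass one. The helper name `pick_num_workers`, the cap of 16 and the fallback rule are assumptions made for this example; they are not taken from the PR diff.

```python
import os
from typing import Optional


def pick_num_workers(requested: Optional[int] = None, cap: int = 16) -> int:
    """Return the requested worker count, or a CPU-based default if none is given.

    Hypothetical helper: the name, the cap and the fallback rule are
    illustrative assumptions, not the logic merged in this PR.
    """
    if requested is not None:
        if requested < 0:
            raise ValueError("requested must be >= 0")
        # An explicit value (including 0, i.e. loading in the main process) wins
        return requested
    # os.cpu_count() may return None when the count cannot be determined
    available = os.cpu_count() or 1
    return min(cap, available)
```

The result would typically be passed as the `num_workers` argument of `torch.utils.data.DataLoader`, for example `DataLoader(train_set, batch_size=2, num_workers=pick_num_workers(None))`.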

Iterating over FUNSD with the dataloader is 25%+ faster with the same number of workers, and 3x faster than with the original default number of workers.
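
A speedup of this kind can be checked by timing one full pass over the loader before and after the change. The sketch below is only one way to do it: `time_one_pass` and `speedup` are hypothetical helpers written for this example, and they work on any iterable, so a `DataLoader` can be passed directly.

```python
import time
from typing import Iterable


def time_one_pass(loader: Iterable) -> float:
    """Return the wall-clock seconds needed to iterate once over `loader`."""
    start = time.perf_counter()
    for _ in loader:
        pass  # only data loading is measured, no model step
    return time.perf_counter() - start


def speedup(baseline_s: float, optimized_s: float) -> float:
    """Return baseline / optimized, so 3.0 means the pass is 3x faster."""
    if optimized_s <= 0:
        raise ValueError("optimized_s must be positive")
    return baseline_s / optimized_s
```

Timings vary between runs, so averaging several passes and discarding the first one (which includes worker start-up) gives a steadier number.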

Closes #190

Any feedback is welcome!

fg-mindee added the labels type: enhancement, critical, module: datasets, ext: references, framework: pytorch on Jul 6, 2021
fg-mindee added this to the 0.3.1 milestone on Jul 6, 2021
fg-mindee self-assigned this on Jul 6, 2021

codecov bot commented Jul 6, 2021

Codecov Report

Merging #362 (cde22e5) into main (db1e034) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #362      +/-   ##
==========================================
+ Coverage   96.11%   96.15%   +0.03%     
==========================================
  Files          83       83              
  Lines        3426     3460      +34     
==========================================
+ Hits         3293     3327      +34     
  Misses        133      133              
Flag         Coverage Δ
unittests    96.15% <100.00%> (+0.03%) ⬆️

Impacted Files                               Coverage Δ
doctr/datasets/datasets/pytorch.py           100.00% <100.00%> (ø)
doctr/models/_utils.py                       93.10% <0.00%> (-0.84%) ⬇️
doctr/utils/geometry.py                      100.00% <0.00%> (ø)
doctr/transforms/functional/pytorch.py       100.00% <0.00%> (ø)
doctr/transforms/functional/tensorflow.py    100.00% <0.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

torch.backends.cudnn.benchmark = True

st = time.time()
val_set = DetectionDataset(
    img_folder=os.path.join(args.data_path, 'val'),
    label_folder=os.path.join(args.data_path, 'val_labels'),
    sample_transforms=Compose([
        Lambda(lambda x: x / 255),
Collaborator

Why did you remove the normalization here?

Contributor Author

The to_tensor call in the dataset's __getitem__ already does that :)

@@ -137,7 +138,6 @@ def main(args):
    img_folder=os.path.join(args.data_path, 'train'),
    label_folder=os.path.join(args.data_path, 'train_labels'),
    sample_transforms=Compose([
        Lambda(lambda x: x / 255),
Collaborator

same here

Contributor Author

cf. previous comment

fg-mindee merged commit e429ec0 into main on Jul 7, 2021
fg-mindee deleted the loader-optim branch on July 7, 2021 at 07:44
Development

Successfully merging this pull request may close these issues.

[datasets] Optimize dataloaders for faster iterations
2 participants