Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fairseq-TPU #1282

Closed
Eric-Wallace opened this issue Oct 22, 2019 · 9 comments
Closed

Fairseq-TPU #1282

Eric-Wallace opened this issue Oct 22, 2019 · 9 comments

Comments

@Eric-Wallace
Copy link

Eric-Wallace commented Oct 22, 2019

(More of a discussion than an issue)

Hi, I am investigating training Fairseq models using TPUs. I have followed the tutorial here https://cloud.google.com/tpu/docs/tutorials/transformer-pytorch, which worked nicely for training an NMT model. The tutorial uses a developmental branch of fairseq with TPU support developed by @taylanbil
https://github.com/pytorch-tpu/fairseq.

Currently, I am wondering what the plan is for merging this developmental branch into fairseq master? In particular, my use case is training models based on RoBERTa on cloud TPUs. The TPU branch is decently out of date with fairseq master, so the recent features are not present. It is also unclear which parts of fairseq (e.g., models, losses, tasks, etc.) are supported under pytorch-tpu/fairseq.

@taylanbil
Copy link
Contributor

Thanks @Eric-Wallace ! Roberta is definitely in our plans. Unfortunately, when we created the fork, Roberta was not added to Fairseq just yet.

It is in our plans to be up to date with Fairseq's latest and greatest functionalities, however, it's also a balancing act between making new models work and making the existing models optimized in terms of performance. That said, we would definitely appreciate any help from the community since our bandwidth is limited :)

@Eric-Wallace
Copy link
Author

Great, thanks! How difficult do you estimate it to be for me to just try to merge with upstream Fairseq and run RoBERTa on TPU? Would it require significant changes to the input padding and data loading?

@taylanbil
Copy link
Contributor

Those changes (input padding etc) already exist in our tpu branch currently. There might be merge conflicts that need resolution. That can get pretty ugly, or maybe easy, it depends.

The real issue is, there was a time when I was developing against the master branch and there was a commit which caused pretty big performance regression for us. In order to continue with our experiments, we had to cut a branch before that commit and keep going. A lot has happened since then, so it may be the case that the underlying cause is fixed, I'm not sure what the issue was exactly. It may require some debugging.

@Eric-Wallace
Copy link
Author

Cool, I will try to merge with upstream and see if it reasonable. What are the different branches, xla, tpu and tpu-r0.5? Which should I use?

@taylanbil
Copy link
Contributor

taylanbil commented Oct 22, 2019

I would use tpu. xla is deprecated, and tpu-r0.5 is not an active branch, it's the stable branch for the relase 0.5.

Big thanks for taking this on! Feel free to be in touch if you have questions / comments.

@Eric-Wallace
Copy link
Author

Update, we got NMT running in the latest Fairseq master. Was basically resolving some simple merge conflicts. It is indeed about 25-50% slower than using the tpu-r0.5 branch, so the problem persists. I will continue looking into this.

@taylanbil
Copy link
Contributor

Thanks Eric!

@Eric-Wallace
Copy link
Author

@taylanbil do you have any idea about what date the commit happened that caused the slowdown?

@taylanbil
Copy link
Contributor

fyi @Eric-Wallace : pytorch-tpu#19 enables roberta on the tpu branch.

facebook-github-bot pushed a commit that referenced this issue Sep 22, 2020
Summary:
Gate psutil import to make tests pass

Pull Request resolved: fairinternal/fairseq-py#1282

Reviewed By: tangyuq

Differential Revision: D23822037

Pulled By: myleott

fbshipit-source-id: c652c7931147ecd377d78322840e343c55cb85a2
sshleifer pushed a commit that referenced this issue Apr 7, 2021
Summary:
Gate psutil import to make tests pass

Pull Request resolved: fairinternal/fairseq-py#1282

Reviewed By: tangyuq

Differential Revision: D23822037

Pulled By: myleott

fbshipit-source-id: c652c7931147ecd377d78322840e343c55cb85a2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants