
What's the latest version of PyTorch supported? #52

Open
SimonZsx opened this issue Jun 16, 2020 · 10 comments

Comments

@SimonZsx

Hi, what's the latest stable PyTorch release that is supported? Which version is pre_hook_pytorch_latest.patch for? Thanks in advance for your reply.

@deepakn94
Collaborator

I have been using the NVIDIA container image for PyTorch, release 19.09. It looks like that corresponds to PyTorch version 1.2.0. I have not tried this with later versions.

@SimonZsx
Author

Just an update on this issue: I checked the PyTorch releases, and PipeDream works up through NVIDIA PyTorch release 20.01, which corresponds to PyTorch 1.4.0. Starting with NVIDIA release 20.02, a runtime error occurs (#31 has the same error log as this issue). I tried to locate the cause; it seems to come from the in-place version-checking behavior introduced in PyTorch release 1.5.0. The problem shows up when the second-to-last stage tries to start its backward pass: if load_old_params() is called before backward(), the error appears.
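For illustration, here is a minimal standalone sketch (not PipeDream's actual code) of the kind of in-place weight restore that trips this check: overwriting a tensor that autograd saved for the backward pass bumps its version counter, so backward() raises the version-mismatch error.

import torch

# Toy example only: `w` stands in for a weight that gets restored in place.
w = torch.randn(3, requires_grad=True)
loss = (w * w).sum()        # autograd saves `w` for the backward pass

stashed_w = torch.randn(3)  # pretend these are the stashed (old) weights
with torch.no_grad():
    w.copy_(stashed_w)      # in-place restore bumps w's version counter

loss.backward()             # RuntimeError: "... modified by an inplace operation ..."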

@deepakn94
Collaborator

Thanks for doing this! This is helpful! I will look into this in the next couple of days!

@SimonZsx
Author

SimonZsx commented Jun 26, 2020

I temporarily got PipeDream running on the latest PyTorch by eliminating the version check in unpack() in torch/csrc/autograd/saved_variable.cpp; the runtime errors seem to come from this version checking (a really dirty solution). I have not fully understood PipeDream's manipulation of the back-propagated gradients, but my guess is that the error comes from one extra in-place operation on the tensors passed between stages. I hope this helps you solve the problem.


Variable SavedVariable::unpack(std::shared_ptr<Node> saved_for) const {
  if (!data_.defined()) {
    if (!was_default_constructed_) {
      throw std::runtime_error(ERR_BACKWARD_TWICE);
    }
    return Variable();
  }

  auto grad_fn = is_inplace_view_ ? weak_grad_fn_.lock() : grad_fn_;
  if (has_grad_fn_ && !grad_fn) {
    if (!saved_for) {
      // If saving the grad_fn would create a circular reference, then it must
      // be passed in to the unpack function.
      throw std::runtime_error("No grad_fn for non-leaf saved variable");
    }
    grad_fn = std::move(saved_for);
  }
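  // The block below is the in-place version check discussed above; the
  // "dirty" workaround removes it and recompiles PyTorch.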
  if (saved_version_ != version_counter_.current_version()) {
    std::stringstream message;
    message << "one of the variables needed for gradient computation has been "
        "modified by an inplace operation: [" << data_.toString() << " "
        << data_.sizes() << "]";
    if (grad_fn) {
        message << ", which is output " << output_nr_
            << " of " << grad_fn->name() << ",";
    }
    message << " is at version " << version_counter_.current_version()
        << "; expected version " << saved_version_ << " instead.";
    if (!AnomalyMode::is_enabled()) {
        message << " Hint: enable anomaly detection to find the operation "
            "that failed to compute its gradient, with torch.autograd."
            "set_detect_anomaly(True).";
    }
    else {
        message << " Hint: the backtrace further above shows the operation "
            "that failed to compute its gradient. The variable in question "
            "was changed in there or anywhere later. Good luck!";
    }
    throw std::runtime_error(message.str());
  }
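  // ... remainder of unpack() unchanged ...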

@BestSonny

BestSonny commented Jun 27, 2020

@SimonZsx Have you tried commenting out the version-checking code in PyTorch to see whether it works?

@fkh12345

@deepakn94 @SimonZsx Sorry to bother you. I'm reproducing PipeDream's training process and hope to deploy it on torch >= 1.5.0. May I ask whether there are any solutions at the moment?

@jglicat

jglicat commented Dec 20, 2021

> @deepakn94 @SimonZsx Sorry to bother you. I'm reproducing PipeDream's training process and hope to deploy it on torch >= 1.5.0. May I ask whether there are any solutions at the moment?

I have the same question. Have you made any progress? Maybe we can discuss it.

@SimonZsx
Author

Hi, the comment-out-and-recompile solution works, but it's kind of dirty. The problem can also be avoided by not using weight stashing: the check in question verifies the versions of tensors saved for gradient computation, and weight stashing's in-place restore breaks that check.

One of my colleagues says the version counter can be set manually to avoid this error, but I have not verified that yet; just a small hint for you.
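As a toy illustration of that point (again, not PipeDream's runtime code): if no stashed weights are restored in place before backward(), the tensors saved for backward keep the version autograd recorded and the check passes.

import torch

w = torch.randn(3, requires_grad=True)
loss = (w * w).sum()        # `w` is saved for backward at its current version

# No in-place restore before backward() (i.e., weight stashing skipped),
# so the saved version still matches and unpack() does not raise.
loss.backward()

stashed_w = torch.randn(3)
with torch.no_grad():
    w.copy_(stashed_w)      # modifying `w` only after backward() is harmless here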

@fkh12345

> Hi, the comment-out-and-recompile solution works, but it's kind of dirty. The problem can also be avoided by not using weight stashing: the check in question verifies the versions of tensors saved for gradient computation, and weight stashing's in-place restore breaks that check.
>
> One of my colleagues says the version counter can be set manually to avoid this error, but I have not verified that yet; just a small hint for you.

Thanks!

@leiguan1210

> > @deepakn94 @SimonZsx Sorry to bother you. I'm reproducing PipeDream's training process and hope to deploy it on torch >= 1.5.0. May I ask whether there are any solutions at the moment?
>
> I have the same question. Have you made any progress? Maybe we can discuss it.

Same here. Have you made any progress?
