This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

bugfix for parallel rand generator on multi-gpu #9300

Merged

cjolivier01 merged 1 commit into apache:master from yzhliu's fix-rand-multi-gpu branch on Jan 4, 2018

Conversation

@yzhliu (Member) commented Jan 4, 2018

Description

The GPU rand state was not allocated on the right device, which can cause an 'illegal memory access' error when running with multiple GPUs.

It is line 281 in src/resource.cc that causes the problem.

Can we test on multi-GPU? That is, can I assume our CI instances have multiple GPUs enabled?
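
A rough reproduction sketch, purely illustrative and not part of this PR (the exact operator that triggers it may differ): drawing random numbers on a device other than gpu(0) exercises the per-GPU rand state described above.

    # Hypothetical repro sketch (assumes a machine with at least two GPUs).
    # Before the fix, sampling on a non-default device could surface the
    # 'illegal memory access' described in this PR.
    import mxnet as mx

    x = mx.nd.random.uniform(0, 1, shape=(1000,), ctx=mx.gpu(1))
    x.wait_to_read()  # force the computation to complete on gpu(1)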

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with it

Changes

  • Fix GPU rand state allocation problem
  • Readability: rename NativeRandom -> ParallelRandom

@yzhliu added the Bug label on Jan 4, 2018
@yzhliu requested review from piiswrong and szha on January 4, 2018 02:07
src/resource.cc (context for the review comment below):

        : ctx(ctx), sampler(ncopy), resource(ncopy), curr_ptr(0) {
      for (size_t i = 0; i < sampler.size(); ++i) {
        const uint32_t seed = ctx.dev_id + i * kMaxNumGPUs + global_seed * kRandMagic;
        resource[i].var = Engine::Get()->NewVariable();
        common::random::RandGenerator<xpu> *r = new common::random::RandGenerator<xpu>();
        common::random::RandGenerator<xpu>::AllocState(r);
@yzhliu (Member Author)

This is the line that caused the bug (the AllocState call above).

@cjolivier01 (Member)

Fix verified

@marcoabreu (Contributor)

Feel free to create a test case as CI has two GPUs available per slave

@cjolivier01 (Member)

Current test machines in CI have at least two GPUs. However, you will need to detect the GPU count in the unit test, because when someone runs the unit tests on their local machine (a Mac, for instance), it will likely have only one GPU.

@cjolivier01 merged commit 661bcef into apache:master on Jan 4, 2018
@cjolivier01 (Member)

For example, run nvidia-smi -L and count the number of returned lines that start with "GPU [0-9]+", or something like that.
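
A rough sketch of that suggestion (the helper name is made up for illustration; this is not code from the PR):

    # Hypothetical helper: count GPUs by parsing `nvidia-smi -L`, which prints
    # one "GPU <n>: <name> (UUID: ...)" line per device.
    import re
    import subprocess

    def gpu_count_from_nvidia_smi():
        try:
            out = subprocess.check_output(["nvidia-smi", "-L"]).decode("utf-8")
        except (OSError, subprocess.CalledProcessError):
            return 0  # nvidia-smi missing or failing: assume no GPUs
        return sum(1 for line in out.splitlines() if re.match(r"GPU [0-9]+", line))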

@marcoabreu (Contributor)

Relying on nvidia-smi could cause issues, as its behaviour on Windows and Unix tends to differ. It should be possible to get the number of available GPUs from within MXNet, right?

@yzhliu deleted the fix-rand-multi-gpu branch on January 4, 2018 18:06
@cjolivier01 (Member)

I am not aware of an API call for that from Python; one would need to be created, which would also force the CUDA library to init.
Do you know that the outputs of the same command actually differ, or are you theorizing? I don't care how it's done, really -- just trying to suggest the approach with the least lifting.

@marcoabreu (Contributor)

The CUDA library is going to be needed anyway, right? So that shouldn't be an issue. Couldn't this be achieved by trying to address the GPUs in slots 0, 1, 2, etc. and counting how often that fails?

The problem is that nvidia-smi on Windows may or may not be present. While working on CI for Windows, I've found that this tool can live at different locations that cannot be retrieved programmatically; the workaround was to use a static path for CI.
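
A minimal sketch of that slot-probing idea (the helper name and probe shape are illustrative, not from the PR):

    # Hypothetical helper: probe GPU slots 0, 1, 2, ... by attempting a tiny
    # allocation on each device and stopping at the first failure.
    import mxnet as mx

    def gpu_count_by_probing(max_devices=16):
        count = 0
        for i in range(max_devices):
            try:
                mx.nd.zeros((1,), ctx=mx.gpu(i)).asnumpy()  # sync forces the allocation
            except mx.MXNetError:
                break
            count += 1
        return count

A multi-GPU unit test could then skip itself when such a helper reports fewer than two devices.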

yzhliu added a commit to yzhliu/mxnet that referenced this pull request Jan 4, 2018
yuxiangw pushed a commit to yuxiangw/incubator-mxnet that referenced this pull request Jan 25, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018