
[WIP][RFC] initial support for PTX generation #34195

Closed
wants to merge 2 commits

Conversation

@japaric (Member) commented Jun 10, 2016

Do not merge. RFC pending

this PR adds two targets:

  • nvptx-unknown-unknown (32-bit machine model)
  • nvptx64-unknown-unknown (64-bit machine model)

that can be used to generate PTX code from Rust source code:

$ rustc --target nvptx64-unknown-unknown --emit=asm foo.rs
$ head foo.s
//
// Generated by LLVM NVPTX Back-End
//

.version 3.2
.target sm_20
.address_size 64

(..)

this PR also adds new intrinsics that are equivalent to the following
CUDA variables/functions:

  • threadIdx.{x,y,z}
  • blockIdx.{x,y,z}
  • blockDim.{x,y,z}
  • gridDim.{x,y,z}
  • __syncthreads
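
Since the declarations are omitted from the example below, here is a minimal sketch of what declaring these intrinsics could look like, assuming they are exposed like ordinary rustc intrinsics (the exact names and signatures are assumptions, inferred from the calls in the example):

``` rust
#![feature(intrinsics)]

extern "rust-intrinsic" {
    // CUDA threadIdx.x / blockIdx.x / blockDim.x
    fn thread_idx_x() -> i32;
    fn block_idx_x() -> i32;
    fn block_dim_x() -> i32;
    // CUDA __syncthreads()
    fn syncthreads();
    // pre-existing rustc intrinsics used by the example
    fn overflowing_add<T>(a: T, b: T) -> T;
    fn overflowing_mul<T>(a: T, b: T) -> T;
    fn offset<T>(dst: *const T, offset: isize) -> *const T;
}
```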

this PR has been tested by writing a kernel that memcpys a chunk of
memory to another:

#![no_core]

#[no_mangle]
pub fn memcpy_(src: *const f32, dst: *mut f32, n: isize) {
    unsafe {
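        // global thread index: blockIdx.x * blockDim.x + threadIdx.x (wrapping arithmetic)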
        let i = overflowing_add(overflowing_mul(block_idx_x(), block_dim_x()), thread_idx_x()) as isize;

        if i < n {
            *(offset(dst, i) as *mut f32) = *offset(src, i)
        }
    }
}

// undeclared functions are intrinsics
// omitted: lang items

which translates to:

//
// Generated by LLVM NVPTX Back-End
//

.version 3.2
.target sm_20
.address_size 64

    // .globl   memcpy_

.visible .func memcpy_(
    .param .b64 memcpy__param_0,
    .param .b64 memcpy__param_1,
    .param .b64 memcpy__param_2
)
{
    .reg .pred  %p<2>;
    .reg .s32   %r<6>;
    .reg .s64   %rd<8>;

    mov.u32     %r1, %ctaid.x;
    ld.param.u64    %rd5, [memcpy__param_2];
    mov.u32     %r2, %ntid.x;
    mov.u32     %r3, %tid.x;
    mad.lo.s32  %r4, %r2, %r1, %r3;
    cvt.s64.s32 %rd6, %r4;
    setp.ge.s64 %p1, %rd6, %rd5;
    @%p1 bra    LBB0_2;
    ld.param.u64    %rd3, [memcpy__param_0];
    ld.param.u64    %rd4, [memcpy__param_1];
    mul.wide.s32    %rd7, %r4, 4;
    add.s64     %rd1, %rd3, %rd7;
    add.s64     %rd2, %rd4, %rd7;
    ld.u32  %r5, [%rd1];
    st.u32  [%rd2], %r5;
LBB0_2:
    ret;
}

however, this PTX code can't be directly used in a CUDA program because
the memcpy_ function is marked as a "device function" (.func memcpy_). Device functions can only be called from other GPU code. To
be usable from a CUDA program, memcpy_ should be marked as a "kernel
function" (.entry memcpy_):

        // .globl       memcpy_

-.visible .func memcpy_(
+.visible .entry memcpy_(
        .param .b64 memcpy__param_0,
        .param .b64 memcpy__param_1,
        .param .b64 memcpy__param_2

After patching the generated PTX code, the kernel became callable from a
CUDA program.
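
For context on "callable from a CUDA program": the patched PTX is loaded at runtime through the CUDA driver API. A minimal sketch of that host side, assuming hand-written FFI bindings (the binding declarations, the kernel.ptx file name, and the assert_eq!-based error handling are illustrative, not part of this PR):

``` rust
use std::ffi::CString;
use std::os::raw::{c_char, c_int, c_uint, c_void};

// Hand-written bindings to a few CUDA driver API entry points. The real API
// uses opaque handle types (CUcontext, CUmodule, CUfunction); plain pointers
// stand in for them here. `cuCtxCreate` is exported as `cuCtxCreate_v2`.
#[link(name = "cuda")]
extern "C" {
    fn cuInit(flags: c_uint) -> c_int;
    fn cuDeviceGet(device: *mut c_int, ordinal: c_int) -> c_int;
    fn cuCtxCreate_v2(ctx: *mut *mut c_void, flags: c_uint, dev: c_int) -> c_int;
    fn cuModuleLoadData(module: *mut *mut c_void, image: *const c_void) -> c_int;
    fn cuModuleGetFunction(func: *mut *mut c_void,
                           module: *mut c_void,
                           name: *const c_char) -> c_int;
}

fn main() {
    // The patched PTX module, embedded as a NUL-terminated string.
    let ptx = CString::new(include_str!("kernel.ptx")).unwrap();
    let name = CString::new("memcpy_").unwrap();

    unsafe {
        assert_eq!(cuInit(0), 0);
        let mut dev = 0;
        assert_eq!(cuDeviceGet(&mut dev, 0), 0);
        let mut ctx = std::ptr::null_mut();
        assert_eq!(cuCtxCreate_v2(&mut ctx, 0, dev), 0);
        let mut module = std::ptr::null_mut();
        assert_eq!(cuModuleLoadData(&mut module, ptx.as_ptr() as *const c_void), 0);
        // cuModuleGetFunction only finds kernel (.entry) functions, which is
        // why the .func -> .entry patch above is necessary.
        let mut func = std::ptr::null_mut();
        assert_eq!(cuModuleGetFunction(&mut func, module, name.as_ptr()), 0);
        // ...allocate device buffers and launch with cuLaunchKernel...
    }
}
```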

unresolved questions

  • we need to provide a way to differentiate functions that will be
    translated to "kernel functions" [1] from the ones that will be
    translated to "device functions". CUDA uses the __global__ and
    __device__ attributes for this.
  • we need to provide a way to let the user choose on which memory region
    [2] variables should be placed. CUDA exposes the __shared__ and
    __constant__ attributes for this.

FIXMEs

  • pointer arguments in kernel and device functions should be marked with
    the addrspace(1) attribute in LLVM IR.
  • compiling a rlib produces an empty archive (no PTX in it)

[1]: http://llvm.org/docs/NVPTXUsage.html#kernel-metadata
[2]: http://llvm.org/docs/NVPTXUsage.html#id10

cc @brson @alexcrichton @eddyb

@rust-highfive (Collaborator) commented:

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @arielb1 (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

japaric pushed a commit to japaric-archived/cuda that referenced this pull request Jun 10, 2016
@japaric (Member, Author) commented Jun 10, 2016

I have published my test code (CUDA program) here.

@hanna-kruppe (Contributor) commented Jun 10, 2016

It seems to me that the obvious solution for the unresolved questions (marking kernel functions, specifying memory regions) is attributes:

#[cuda_constant]
static X: i32 = 0; // not const since rustc would inline it then

#[cuda_kernel]
fn foo() {
    #[cuda_shared]
    let x: [i32; BLOCK_SIZE] = mem::uninitialized();
}

One wrinkle with the memory region attribute is that you can't apply attributes to function arguments (but I think you can't specify storage for those anyway?).

However, that doesn't solve the issue of describing address spaces of pointers, e.g. a pointer into shared memory. That's a thing that exists, right?

@eddyb (Member) commented Jun 10, 2016

For functions, maybe using extern "kernel" or extern "cuda-kernel" would be better?
Especially if it affects function pointers, too.

@japaric For rlibs, can't you/aren't you forced to use LTO, i.e. produce PTX from LLVM bitcode for the final executable?
OTOH, there is precedent for storing non-binary object files (emscripten with JS), so keeping the PTX in .o files and concatenating it to "link" could work.

@japaric (Member, Author) commented Jun 11, 2016

Update:

  • Actually rlibs work fine. I was sort of expecting them to contain PTX but they don't. They probably contain llvm bitcode and/or rust metadata. I'm not sure... but they work 😄.
  • I can't compile core to PTX, I get an LLVM error:
LLVM ERROR: Cannot select: t4: f32 = fcopysign ConstantFP:f32<1.000000e+00>, t2
  t3: f32 = ConstantFP<1.000000e+00>
  t2: f32,ch = CopyFromReg t0, Register:f32 %vreg3
    t1: f32 = Register %vreg3
In function: _ZN4core3f3250_$LT$impl$u20$core..num..Float$u20$for$u20$f32$GT$6signum17h5c5199388a8660d1E

@eddyb

I'm going to look next into making these targets produce .ptx files when compiling executables (like the asmjs target that produces .js files) but there won't be a link step for this target. Is there any precedent for that? I hope I won't have to do too much plumbing to omit the link step.

BTW, is there any way to "link" in an external "bitcode library"? The cuda SDK provides this libdevice bitcode library that ships with implementations of several math functions. It would be nice to provide a device crate that exposes those routines as Rust functions and takes care of linking in the bitcode before the PTX translation.

@rkruppe

I think even the obvious solutions would need to go through an RFC to land as an experimental feature, whereas just landing the targets as this PR does can be done without an RFC.

but I think you can't specify storage for those anyway?

I think so, yes.

However, that doesn't solve the issue of describing address spaces of pointers, e.g. a pointer into shared memory. That's a thing that exists, right?

Yes, but AFAIK cuda doesn't distinguish them at the type level; they are all *T, not * __shared__ T. The generated PTX probably does track their address spaces.

@bors (Contributor) commented Jun 11, 2016

☔ The latest upstream changes (presumably #34208) made this pull request unmergeable. Please resolve the merge conflicts.

@@ -63,7 +63,7 @@ pub fn llvm(build: &Build, target: &str) {
.out_dir(&dst)
.profile(if build.config.llvm_optimize {"Release"} else {"Debug"})
.define("LLVM_ENABLE_ASSERTIONS", assertions)
.define("LLVM_TARGETS_TO_BUILD", "X86;ARM;AArch64;Mips;PowerPC")
.define("LLVM_TARGETS_TO_BUILD", "X86;ARM;AArch64;Mips;PowerPC;NVPTX")
A member commented on this diff:

Out of curiosity, and just to confirm, this doesn't increase our binary sizes by like 100M or compile times by like 10 minutes, right? That is, this should in theory be a relatively small backend?

@japaric (Member, Author) replied:

I don't have any precise measurement but:

# Nightly of 2016-06-08
$ du -h $(rustc --print sysroot)/lib
28K     /home/japaric/.multirust/toolchains/nightly-2016-06-08-x86_64-unknown-linux-gnu/lib/rustlib/etc
147M    /home/japaric/.multirust/toolchains/nightly-2016-06-08-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib
147M    /home/japaric/.multirust/toolchains/nightly-2016-06-08-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu
147M    /home/japaric/.multirust/toolchains/nightly-2016-06-08-x86_64-unknown-linux-gnu/lib/rustlib
254M    /home/japaric/.multirust/toolchains/nightly-2016-06-08-x86_64-unknown-linux-gnu/lib

# This PR on top of that nightly (Note: stage1)
$ du -h build/x86_64-unknown-linux-gnu/stage1/lib
155M    build/x86_64-unknown-linux-gnu/stage1/lib/rustlib/x86_64-unknown-linux-gnu/lib
155M    build/x86_64-unknown-linux-gnu/stage1/lib/rustlib/x86_64-unknown-linux-gnu
155M    build/x86_64-unknown-linux-gnu/stage1/lib/rustlib
253M    build/x86_64-unknown-linux-gnu/stage1/lib

Also the LLVM static libraries are rather small:

$ ls -hs build/x86_64-unknown-linux-gnu/llvm/lib/*PTX*
160K build/x86_64-unknown-linux-gnu/llvm/lib/libLLVMNVPTXAsmPrinter.a
2.3M build/x86_64-unknown-linux-gnu/llvm/lib/libLLVMNVPTXCodeGen.a
384K build/x86_64-unknown-linux-gnu/llvm/lib/libLLVMNVPTXDesc.a
8.0K build/x86_64-unknown-linux-gnu/llvm/lib/libLLVMNVPTXInfo.a

No data on compile times.

Jorge Aparicio added 2 commits June 16, 2016 19:17
@japaric (Member, Author) commented Jun 17, 2016

We'll be iterating on this feature in the rust-on-gpu repo. We'll send an RFC when we get a better idea of what language changes are required to provide a sensible Rust-on-GPU experience. I'm going to close this until then.

japaric closed this Jun 17, 2016
japaric mentioned this pull request Dec 22, 2016
alexcrichton added a commit to alexcrichton/rust that referenced this pull request Dec 30, 2016
PTX support, take 2

- You can generate PTX using `--emit=asm` and the right (custom) target, which
  you can then run on an NVIDIA GPU.

- You can compile `core` to PTX. [Xargo] also works and it can compile some
  other crates like `collections` (but I doubt all of those make sense on a GPU)

[Xargo]: https://github.com/japaric/xargo

- You can create "global" functions, which can be "called" by the host, using
  the `"ptx-kernel"` ABI, e.g. `extern "ptx-kernel" fn kernel() { .. }`. Every
  other function is a "device" function and can only be called by the GPU.

- Intrinsics like `__syncthreads()` and `blockIdx.x` are available as
  `"platform-intrinsics"`. These intrinsics are *not* in the `core` crate but
  any Rust user can create "bindings" to them using an `extern
  "platform-intrinsic"` block. See the example at the end.

- Trying to emit PTX with `-g` (debuginfo) produces an LLVM error. But I don't
  think PTX can contain debuginfo anyway, so `-g` should be ignored and a
  warning should be printed ("`-g` doesn't work with this target" or something).

- "Single source" support. You *can't* write a single source file that contains
  both host and device code. I think that should be possible to implement that
  outside the compiler using compiler plugins / build scripts.

- The equivalent of CUDA `__shared__`, which is used to declare memory that's
  shared between the threads of the same block. This could be implemented using
  attributes: `#[shared] static mut SCRATCH_MEMORY: [f32; 64]`, but hasn't been
  implemented yet.

- Built-in targets. This PR doesn't add targets to the compiler just yet but one
  can create custom targets to be able to emit PTX code (see the example at the
  end). The idea is to have people experiment with this feature before
  committing to it (built-in targets are "insta-stable")

- All functions must be "inlined". IOW, the `.rlib` must always contain the LLVM
  bitcode of all the functions of the crate it was produced from. Otherwise, you
  end up with "undefined references" in the final PTX code, but you won't get
  *any* linker error because no linker is involved. IOW, you'll hit a runtime
  error when loading the PTX into the GPU. The workaround is to use `#[inline]`
  on non-generic functions and to never use `#[inline(never)]` (see the sketch
  after this list), but this may not always be possible because e.g. you could
  be relying on third party code.

- Should `--emit=asm` generate a `.ptx` file instead of a `.s` file?
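
A sketch of the `#[inline]` workaround described in the inlining item above (the crate layout and function names are made up for illustration):

``` rust
// Device-side helper crate (`no_std`); hypothetical names.
#![no_std]

// Without `#[inline]`, calls to this non-generic function from another crate
// can end up as undefined references in the final PTX, because no linker ever
// runs to resolve them.
#[inline]
pub fn saxpy_element(a: f32, x: f32, y: f32) -> f32 {
    a * x + y
}

// Generic functions need no attribute: they are instantiated (and therefore
// translated) in the calling crate anyway.
pub fn twice<T: core::ops::Add<Output = T> + Copy>(x: T) -> T {
    x + x
}
```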

TL;DR Use Xargo to turn a crate into a PTX module (a `.s` file). Then pass that
PTX module, as a string, to the GPU and run it.

The full code is in [this repository]. This section gives an overview of how to
run Rust code on an NVIDIA GPU.

[this repository]: https://github.com/japaric/cuda

- Create a custom target. Here's the 64-bit NVPTX target (NOTE: the comments
  make this invalid JSON; remove them before you use this file):

``` js
// nvptx64-nvidia-cuda.json
{
  "arch": "nvptx64",  // matches LLVM
  "cpu": "sm_20",  // "oldest" compute capability supported by LLVM
  "data-layout": "e-i64:64-v16:16-v32:32-n16:32:64",
  "llvm-target": "nvptx64-nvidia-cuda",
  "max-atomic-width": 0,  // LLVM errors with any other value :-(
  "os": "cuda",  // matches LLVM
  "panic-strategy": "abort",
  "target-endian": "little",
  "target-pointer-width": "64",
  "target-vendor": "nvidia",  // matches LLVM -- not required
}
```

(There's a 32-bit target specification in the linked repository)

- Write a kernel

``` rust

extern "platform-intrinsic" {
    fn nvptx_block_dim_x() -> i32;
    fn nvptx_block_idx_x() -> i32;
    fn nvptx_thread_idx_x() -> i32;
}

/// Copies an array of `n` floating point numbers from `src` to `dst`
#[no_mangle]
pub unsafe extern "ptx-kernel" fn memcpy(dst: *mut f32,
                                         src: *const f32,
                                         n: usize) {
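    // global thread index: blockDim.x * blockIdx.x + threadIdx.x, in wrapping arithmetic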
    let i = (nvptx_block_dim_x() as isize)
        .wrapping_mul(nvptx_block_idx_x() as isize)
        .wrapping_add(nvptx_thread_idx_x() as isize);

    if (i as usize) < n {
        *dst.offset(i) = *src.offset(i);
    }
}
```

- Emit PTX code

```
$ xargo rustc --target nvptx64-nvidia-cuda --release -- --emit=asm
   Compiling core v0.0.0 (file://..)
   (..)
   Compiling nvptx-builtins v0.1.0 (https://github.com/japaric/nvptx-builtins)
   Compiling kernel v0.1.0

$ cat target/nvptx64-nvidia-cuda/release/deps/kernel-*.s
//
// Generated by LLVM NVPTX Back-End
//

.version 3.2
.target sm_20
.address_size 64

        // .globl       memcpy

.visible .entry memcpy(
        .param .u64 memcpy_param_0,
        .param .u64 memcpy_param_1,
        .param .u64 memcpy_param_2
)
{
        .reg .pred      %p<2>;
        .reg .s32       %r<5>;
        .reg .s64       %rd<12>;

        ld.param.u64    %rd7, [memcpy_param_2];
        mov.u32 %r1, %ntid.x;
        mov.u32 %r2, %ctaid.x;
        mul.wide.s32    %rd8, %r2, %r1;
        mov.u32 %r3, %tid.x;
        cvt.s64.s32     %rd9, %r3;
        add.s64         %rd10, %rd9, %rd8;
        setp.ge.u64     %p1, %rd10, %rd7;
        @%p1 bra        LBB0_2;
        ld.param.u64    %rd3, [memcpy_param_0];
        ld.param.u64    %rd4, [memcpy_param_1];
        cvta.to.global.u64      %rd5, %rd4;
        cvta.to.global.u64      %rd6, %rd3;
        shl.b64         %rd11, %rd10, 2;
        add.s64         %rd1, %rd6, %rd11;
        add.s64         %rd2, %rd5, %rd11;
        ld.global.u32   %r4, [%rd2];
        st.global.u32   [%rd1], %r4;
LBB0_2:
        ret;
}
```

- Run it on the GPU

``` rust
// `kernel.ptx` is the `*.s` file we got in the previous step
const KERNEL: &'static str = include_str!("kernel.ptx");

driver::initialize()?;

let device = Device(0)?;
let ctx = device.create_context()?;
let module = ctx.load_module(KERNEL)?;
let kernel = module.function("memcpy")?;

let h_a: Vec<f32> = /* create some random data */;
let h_b = vec![0.; N];

let d_a = driver::allocate(bytes)?;
let d_b = driver::allocate(bytes)?;

// Copy from host to GPU
driver::copy(h_a, d_a)?;

// Run `memcpy` on the GPU
kernel.launch(d_b, d_a, N)?;

// Copy from GPU to host
driver::copy(d_b, h_b)?;

// Verify
assert_eq!(h_a, h_b);

// `d_a`, `d_b`, `h_a`, `h_b` are dropped/freed here
```
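
One detail `kernel.launch` hides is the launch configuration: the host must launch enough blocks of threads that `blockDim.x * blockIdx.x + threadIdx.x` covers all `N` elements. A sketch of that arithmetic (the block size and helper name are illustrative assumptions):

``` rust
const THREADS_PER_BLOCK: usize = 128;

/// Returns (blocks, threads per block) for a kernel over `n` elements.
fn launch_config(n: usize) -> (usize, usize) {
    // Round up so the grid covers every element; the kernel's
    // `if (i as usize) < n` guard masks off the excess threads.
    ((n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK)
}
```

For `N` = 1000, for example, this gives 8 blocks of 128 threads; the bounds check in the kernel disables the 24 surplus threads.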

---

cc @alexcrichton @brson @rkruppe

> What has changed since rust-lang#34195?

- `core` can now be compiled into PTX, which makes it very easy to turn `no_std`
  crates into "kernels" with the help of Xargo.

- There's now a way, the `"ptx-kernel"` ABI, to generate "global" functions. The
  old PR required a manual step (it was a hack) to "convert" "device" functions
  into "global" functions. (Only "global" functions can be launched by the host.)

- Everything is unstable. There are no "insta-stable" built-in targets this
  time (\*). Users have to use a custom target to experiment with this feature.
  Also, PTX intrinsics, like `__syncthreads` and `blockIdx.x`, are now
  implemented as `"platform-intrinsics"` so they no longer live in the `core`
  crate.

(\*) I'd actually like to have in-tree targets because that makes this target
more discoverable, removes the need to lug around .json files, etc.

However, bundling a target with the compiler immediately puts it on the path
towards stabilization, which gives us just two cycles to find and fix any
problems with the target specification. After that, it becomes hard to tweak
the specification because that could be a breaking change.

A possible solution could be "unstable built-in targets". Basically, to use an
unstable target, you'll have to also pass `-Z unstable-options` to the compiler.
And unstable targets, being unstable, wouldn't be available on stable.

> Why should this be merged?

- To let people experiment with the feature out of tree. Having easy access to
  the feature (in every nightly) allows this. I also think that, as it is, it
  should be possible to start prototyping type-safe single source support using
  build scripts, macros and/or plugins.

- It's a straightforward implementation. No different than adding support for
  any other architecture.