
Can't profile benchmark with ncu, nsys #183

Closed
WyldeCat opened this issue Oct 29, 2023 · 5 comments
Labels: triaged (Issue has been triaged by maintainers)

Comments

@WyldeCat

WyldeCat commented Oct 29, 2023

I tried to profile the llama 7b benchmark but failed to obtain a report.

root@nf5688m7-release:/code/tensorrt_llm/benchmarks/python# ncu --target-processes all python benchmark.py  -m llama_7b  --mode plugin --batch_size "64" --input_output_len "128,128" --enable_fp8 --fp8_kv_cache
==PROF== Connected to process 36055 (/usr/bin/nvidia-smi)
==PROF== Disconnected from process 36055
==PROF== Target process 36054 terminated before first instrumented API call.
==PROF== Connected to process 35989 (/usr/bin/python3.10)
==ERROR== ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM
/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py:658: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
  torch.nested.nested_tensor(split_ids_list,
==PROF== Target process 41832 terminated before first instrumented API call.
[BENCHMARK] model_name llama_7b world_size 1 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 32000 precision float16 batch_size 64 input_length 128 output_length 128 gpu_peak_mem(gb) 21.35 build_time(s) 367.44 tokens_per_sec 260.23 percentile95(ms) 31915.803 percentile99(ms) 31915.803 latency(ms) 31479.389 compute_cap sm90
==PROF== Target process 41835 terminated before first instrumented API call.
==PROF== Disconnected from process 35989
==WARNING== No kernels were profiled.

When using nsys, the following error occurs

root@nf5688m7-release:/code/tensorrt_llm/benchmarks/python# /opt/nvidia/nsight-systems/2023.3.1/bin/nsys profile python benchmark.py  -m llama_7b  --mode plugin  --batch_size "64" --input_output_len "128,128" --enable_fp8 --fp8_kv_cache
[10/29/2023-07:07:28] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::62] Error Code 1: Cuda Runtime (unspecified launch failure)
[10/29/2023-07:07:28] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::55] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:279] Error 719 destroying stream '0x560e6297eb80'.)
[10/29/2023-07:07:28] [TRT] [E] 9: Skipping tactic0x0000000000000000 due to exception unspecified launch failure
[10/29/2023-07:07:31] [TRT] [E] 1: [defaultAllocator.cpp::allocate::20] Error Code 1: Cuda Runtime (unspecified launch failure)
[10/29/2023-07:07:31] [TRT] [E] 9: Skipping tactic0x0000000000000000 due to exception [tunable_graph.cpp:create:114] autotuning: User allocator error allocating 54002688-byte buffer
[10/29/2023-07:07:31] [TRT] [E] 9: Skipping tactic0x0000000000000000 due to exception Assertion engine failed.
[10/29/2023-07:07:31] [TRT] [E] 10: Could not find any implementation for node {ForeignNode[LLaMAForCausalLM/layers/0/attention/dense/CONSTANT_1...LLaMAForCausalLM/layers/1/attention/qkv/MATRIX_MULTIPLY_0]}.
[10/29/2023-07:07:31] [TRT] [E] 1: [cudaResources.cpp::~ScopedCudaStream::47] Error Code 1: Cuda Runtime (unspecified launch failure)
[10/29/2023-07:07:31] [TRT] [E] 10: [optimizer.cpp::computeCosts::4051] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[LLaMAForCausalLM/layers/0/attention/dense/CONSTANT_1...LLaMAForCausalLM/layers/1/attention/qkv/MATRIX_MULTIPLY_0]}.)
[10/29/2023-07:07:31] [TRT-LLM] [E] Engine building failed, please check the error log.
Traceback (most recent call last):
  File "/code/tensorrt_llm/benchmarks/python/benchmark.py", line 322, in <module>
    main(args)
  File "/code/tensorrt_llm/benchmarks/python/benchmark.py", line 219, in main
    benchmarker = GPTBenchmark(
  File "/code/tensorrt_llm/benchmarks/python/gpt_benchmark.py", line 144, in __init__
    assert engine_buffer is not None
AssertionError
Generating '/tmp/nsys-report-17ae.qdstrm'
[1/1] [========================100%] report2.nsys-rep
Generated:
    /code/tensorrt_llm/benchmarks/python/report2.nsys-rep

What do I need to do to get reports?

WyldeCat changed the title from "No kernels profiled with ncu" to "Can't profile benchmark with ncu, nsys" on Oct 29, 2023
@juney-nvidia
Collaborator

juney-nvidia commented Oct 29, 2023

@WyldeCat you can follow the guide in the documentation link you posted by passing --cap-add=SYS_ADMIN when you start the Docker container, something like:

docker run --cap-add=SYS_ADMIN ...
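
For example, a fuller invocation could look like the sketch below; the image name and source mount are placeholders rather than values from this thread, and the part that matters for the profilers is --cap-add=SYS_ADMIN (plus the usual GPU access flags):

docker run --rm -it --gpus all --cap-add=SYS_ADMIN \
    -v /path/to/TensorRT-LLM:/code/tensorrt_llm \
    <your_tensorrt_llm_image> bash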

juney-nvidia self-assigned this on Oct 29, 2023
juney-nvidia added the triaged label on Oct 29, 2023
@WyldeCat
Author

WyldeCat commented Oct 29, 2023


@juney-nvidia Thanks, it is now possible to use the ncu profiler, but nsys still doesn't work. Is there any way to use the nsys profiler?

@juney-nvidia
Collaborator

juney-nvidia commented Oct 29, 2023


Have you tried running with a smaller batch size and smaller input/output lengths to see whether the issue still exists? And what hardware are you using?

June

@WyldeCat
Author


@juney-nvidia
I've tried batch size 1 and the problem still exists. I'm using an H100 80GB.
I found that running the benchmark with an explicit engine directory makes nsys work
(by passing the --engine_dir my_engine_dir argument to the command).
So I think building the engine on the fly before benchmarking is what causes the nsys error.

Because running benchmarks with an on-the-fly engine build is much more convenient, it would be great if there were a way to use it with nsys.
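
For reference, a minimal sketch of that workaround, assuming an engine has already been built into my_engine_dir (the placeholder path from the comment above) and reusing the benchmark flags from the original report; -o is optional and only names the output report file:

/opt/nvidia/nsight-systems/2023.3.1/bin/nsys profile -o llama_7b_report python benchmark.py -m llama_7b --mode plugin --batch_size "64" --input_output_len "128,128" --enable_fp8 --fp8_kv_cache --engine_dir my_engine_dir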

@jdemouth-nvidia
Collaborator

Thanks for the feedback @WyldeCat. Profiling the on-the-fly build would also mean that you "pollute" your NSYS trace with a lot of extra kernel launches that are not relevant to your application (all the auto-tuning done by TensorRT), and you end up with a much bigger NSYS output file. I'm pretty sure it would make the analysis of the NSYS report a lot harder. For now, I'm going to close the issue, but feel free to open a "feature request" if you think this is really a needed feature.
