Question on bad perf when concurrent copy on single GPU #237

Open
Zhaojp-Frank opened this issue Oct 19, 2022 · 9 comments

@Zhaojp-Frank commented Oct 19, 2022

We observe bad latency when running concurrent copy_to_mapping on a single GPU, and we would like to understand the cause (is this a known limitation?) before we dive in.

  • Env: x86 8163, 2 sockets (AVX2 supported), NVIDIA T4 * 2, PCIe 3.0 x16, CUDA driver 450.82, latest gdrcopy (2022.10)
  • Tests: 2 processes (bound to different cores) concurrently running tests/copylat on GPU 0; each process of course allocates different host and device memory addresses.
  • Result: e.g. at 32 KB, gdr_copy_to_mapping in each process averages 6.2 us, vs. 3.2 us when run as a single process. Similar problem at other block sizes (such as 2 KB ~ 256 KB; I only focus on small blocks).
    By the way, if the 2 processes target different GPUs, performance is fine.

Question 1: What is the major cause of such big contention or performance degradation with concurrent gdr_copy_to_mapping? Considering that 32 KB is not that large, I don't think PCIe bandwidth is saturated.

Question 2: Is there any plan, or is it possible, to optimize concurrent gdr_copy_to_mapping?

Thanks for any feedback.

@pakmarkthub (Collaborator)

Hi @Zhaojp-Frank ,

I would like to know more about your setup before we dive deeper. Some of these questions are just to make sure that we have eliminated external factors.

  1. You said that you bound the 2 processes to different cores. Were they on the same or different CPU sockets?
  2. Did you also bind the host memory to the core -- e.g., by using numactl -l? (See the example at the end of this comment.)
  3. How did you make sure that gdr_copy_to_mapping of both processes ran concurrently? Starting both processes at the same time does not always mean they will reach the test section at the same time.

> each process gdr_copy_to_mapping gets avg 6.2usec

  1. Does this number come from averaging the latency of both processes? Or did Process A show 6.2 us and Process B also show 6.2 us?
  2. How many iterations did you run?
  3. Did you set the GPU clocks to max? Did you also lock the CPU clock?
  4. Can you provide the PCIe topology?
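
For example, binding both the core and the memory of each process explicitly could look like this (a sketch; the core and NUMA node IDs below are placeholders for your topology):

CUDA_VISIBLE_DEVICES=0 numactl --physcpubind=2 --membind=0 ./copylat -w 10000 -r 0 -s 32768 &
CUDA_VISIBLE_DEVICES=0 numactl --physcpubind=3 --membind=0 ./copylat -w 10000 -r 0 -s 32768 &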

@Zhaojp-Frank (Author)

In general, do you think this is abnormal (not by design)?
I think you can quickly reproduce it by simply modifying the test to use a fixed size (e.g., 32 KB/64 KB) and disabling the gdr_copy_from_mapping test; see the attached diff for reference. Then launch two processes in the background:
CUDA_VISIBLE_DEVICES=0 numactl -l ./copylat -w 10000 -r 0 -s 65536 &
CUDA_VISIBLE_DEVICES=1 numactl -l ./copylat -w 10000 -r 0 -s 65536 &

More info:

  1. I tried different cores from the same socket (e.g., cores 0 and 1) and from different sockets (e.g., core 0 and core 48); there was no obvious difference. There are 48 cores * 2 sockets.
  2. Tried with numactl -l; no big difference.
  3. Yes, I just run the processes in the background with a large enough iteration count, like 10000.
    What's more, the same setting gets pretty good performance when targeting different GPUs (0 and 1), so whether they are exactly concurrent does not seem to be a big deal.
  4. The latter: each process reports similar performance at ~6.2 us (averaged over the iterations within that process).
  5. Set to 10000 iterations.
  6. No, I have not set/reset any GPU/CPU clocks.
  7. nvidia-smi topo -m output:
    $ nvidia-smi topo -m
           GPU0   GPU1   CPU Affinity   NUMA Affinity
    GPU0    X     SYS    0-95           N/A
    GPU1   SYS     X     0-95           N/A

$lspci -tv|grep -i nvidia
-+-[0000:d7]-+-00.0-[d8]----00.0 NVIDIA Corporation Device 1eb8
+-[0000:5d]-+-00.0-[5e]----00.0 NVIDIA Corporation Device 1eb8

diff
--- copylat-orig.cpp	2022-10-17 22:37:29.944080142 +0800
+++ copylat-simple.cpp	2022-10-20 15:07:44.764855950 +0800
@@ -253,11 +253,10 @@ int main(int argc, char *argv[])
     // gdr_copy_to_mapping benchmark
     cout << endl;
     cout << "gdr_copy_to_mapping num iters for each size: " << num_write_iters << endl;
-    cout << "WARNING: Measuring the API invocation overhead as observed by the CPU. Data might not be ordered all the way to the GPU internal visibility." << endl;
     // For more information, see
     // https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#sync-behavior
     printf("Test \t\t\t Size(B) \t Avg.Time(us)\n");
-    copy_size = 1;
+    copy_size = size;
     while (copy_size <= size) {
         int iter = 0;
         clock_gettime(MYCLOCK, &beg);
@@ -276,6 +275,7 @@ int main(int argc, char *argv[])
     MB();

     // gdr_copy_from_mapping benchmark
+    /*
     cout << endl;
     cout << "gdr_copy_from_mapping num iters for each size: " << num_read_iters << endl;
     printf("Test \t\t\t Size(B) \t Avg.Time(us)\n");
@@ -290,6 +290,7 @@ int main(int argc, char *argv[])
         printf("gdr_copy_from_mapping \t %8zu \t %11.4f\n", copy_size, lat_us);
         copy_size <<= 1;
     }
+    */

@Zhaojp-Frank (Author)

For size = 64 KB:

If the processes target two different GPUs, each process reports ~6.4 us.
If both processes run with CUDA_VISIBLE_DEVICES=0, each process reports ~12.6 us.

CUDA_VISIBLE_DEVICES=0 numactl -l ./copylat -w 10000 -r 0 -s 65536 &
CUDA_VISIBLE_DEVICES=1 numactl -l ./copylat -w 10000 -r 0 -s 65536 &

@pakmarkthub (Collaborator)

GDRCopy, by design, is for low-latency CPU-GPU communication at small message sizes. It uses the CPU to drive the communication -- as opposed to cudaMemcpy, which uses the GPU copy engine. On many systems, GDRCopy cannot reach the peak BW while cudaMemcpy can. To understand what GDRCopy can deliver on your system, I suggest that you run copybw at various message sizes and plot the BW graph. You might find out that you have already reached the BW limit at that message size.
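
For example, a sweep along these lines would do (assuming copybw accepts the same -s size flag as copylat; check ./copybw -h for the exact options in your build):

for s in 2048 4096 8192 16384 32768 65536 131072 262144; do CUDA_VISIBLE_DEVICES=0 numactl -l ./copybw -s $s; done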

On your system, write combining (WC) is likely enabled. WC uses the CPU WC buffer to absorb small messages and flushes out one large PCIe packet. This helps with the performance. However, the WC buffer size and how the buffer is shared across cores depend on the CPU.

Putting a process on a far socket can increase the latency. This is because the transactions need to be forwarded through the CPU-CPU link. And that can also cause interference with transactions that originate from the near socket.

I recommend setting the GPU clocks (SM and memory) to max. Otherwise, the GPU internal subsystem may operate at a lower frequency, which delays the response time. Setting the CPU clock to max is also recommended because the CPU is driving the communication. I don't think this is the root cause, however; using the default clocks should not cause the latency to double.
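
For reference, something along these lines should work; the clock values below are placeholders, so query the supported values first, and note that the available nvidia-smi options depend on the driver version:

nvidia-smi -q -d SUPPORTED_CLOCKS                 # list supported memory/SM clock pairs
sudo nvidia-smi -ac <memory_clock>,<sm_clock>     # pin application clocks to the max supported pair
sudo nvidia-smi -lgc <sm_clock>,<sm_clock>        # on newer drivers, lock the graphics clock range
sudo cpupower frequency-set -g performance        # keep the CPU at its highest frequency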

@Zhaojp-Frank (Author) commented Oct 20, 2022

Thanks for sharing your insight. Indeed, I care about latency rather than BW in this test case.

Your comment on WC makes great sense. It is indeed enabled (shown in the mapping info output).

I just want to validate the impact of WC on latency, so do you know how to disable the WC effect, e.g., on a specific device range, or by using other AVX instructions (rather than the stream* ones)?

@pakmarkthub (Collaborator)

WC mapping is enabled in the gdrdrv driver. You can comment out these lines to disable it (https://github.com/NVIDIA/gdrcopy/blob/master/src/gdrdrv/gdrdrv.c#L1190-L1197). The default on x86 should then be an uncached (UC) mapping. You will probably see higher latency with UC at the sizes that you mentioned.
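
For anyone curious, the general pattern behind that choice looks roughly like this in a kernel driver's mmap path (an illustrative C sketch, not the actual gdrdrv code; the function name and the pfn parameter are hypothetical -- see the lines linked above for the real implementation):

/* Illustrative sketch only -- not the actual gdrdrv source. */
#include <linux/mm.h>

static int example_mmap_bar(struct vm_area_struct *vma, unsigned long pfn)
{
    /* Write-combining: stores are absorbed by the CPU's WC buffers and flushed
     * as larger PCIe writes -- what gdrdrv enables by default on x86. */
    vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);

    /* Uncached fallback (what you get when the WC path is disabled): every
     * store goes out to the device individually, hence much higher latency.
     * vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
     */

    return io_remap_pfn_range(vma, vma->vm_start, pfn,
                              vma->vm_end - vma->vm_start, vma->vm_page_prot);
}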

@Zhaojp-Frank (Author)

Well, if I comment out the WC enabling, the latency is terrible regardless of whether I run one process or two: 220+ us.
It doesn't resolve the contention problem; it makes things worse.

I'm wondering if there are other clues for improving concurrent gdr_copy_to_mapping.

@pakmarkthub (Collaborator)

Have you already measured the BW? If you are limited by the BW, there is not much we can do. As mentioned, the peak BW GDRCopy can achieve may be lower than the PCIe BW on your system.

You may be able to get a bit more performance by playing with the copy algorithm. Depending on the system (CPU, topology, and other factors), changing the algorithm from AVX to something else might help. But I don't expect it to completely solve your problem of seeing roughly double the latency with two processes.
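
As one way to experiment with that, here is a minimal sketch (not part of the gdrcopy test suite) that times gdr_copy_to_mapping against a plain memcpy into the same BAR mapping, using only the public gdrapi.h calls; error checking is omitted and the cudaMalloc allocation is assumed to be GPU-page aligned, which the real tests handle explicitly:

// compare_copy.cpp -- illustrative sketch, not part of the gdrcopy test suite.
// Times gdr_copy_to_mapping vs. a plain memcpy into the same BAR mapping.
#include <cstdio>
#include <cstring>
#include <cstdint>
#include <time.h>
#include <cuda_runtime.h>
#include <gdrapi.h>

static double elapsed_us(const timespec &a, const timespec &b)
{
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) * 1e-3;
}

int main()
{
    const size_t size  = 64 * 1024;   // 64 KB, matching the numbers in this thread
    const int    iters = 10000;

    void *d_ptr = nullptr;
    cudaMalloc(&d_ptr, size);         // assumed GPU-page (64 KiB) aligned here

    gdr_t g = gdr_open();
    gdr_mh_t mh;
    gdr_pin_buffer(g, (unsigned long)(uintptr_t)d_ptr, size, 0, 0, &mh);

    void *map_ptr = nullptr;
    gdr_map(g, mh, &map_ptr, size);

    gdr_info_t info;
    gdr_get_info(g, mh, &info);
    // The mapping starts at the pinned (page-aligned) address; offset into it.
    char *bar_ptr = (char *)map_ptr + ((uintptr_t)d_ptr - info.va);

    char *h_buf = new char[size];
    memset(h_buf, 0xa5, size);
    timespec beg, end;

    // 1) The copy path GDRCopy picks internally (SSE/AVX non-temporal stores where available).
    clock_gettime(CLOCK_MONOTONIC, &beg);
    for (int i = 0; i < iters; ++i)
        gdr_copy_to_mapping(mh, bar_ptr, h_buf, size);
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("gdr_copy_to_mapping \t %8zu \t %11.4f\n", size, elapsed_us(beg, end) / iters);

    // 2) Plain memcpy into the WC mapping, as a point of comparison.
    clock_gettime(CLOCK_MONOTONIC, &beg);
    for (int i = 0; i < iters; ++i)
        memcpy(bar_ptr, h_buf, size);
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("plain memcpy        \t %8zu \t %11.4f\n", size, elapsed_us(beg, end) / iters);

    delete[] h_buf;
    gdr_unmap(g, mh, map_ptr, size);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    cudaFree(d_ptr);
    return 0;
}

Build it against the gdrcopy headers and library (e.g., g++ compare_copy.cpp -lgdrapi -lcudart) and run two instances the same way as copylat; whether a different copy strategy helps under contention depends entirely on how the CPU's WC buffers are shared between cores.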

@Zhaojp-Frank (Author)

OK, I'll measure the BW as well and post it later.
