-
What is your question?
-
Our results in the README include IO and network bottlenecks; they reflect wall-clock time. The benchmarks ran on 2 DGX-2 machines, so they have very fast NVLink, and each GPU shares a 100 Gb/sec RoCE connection with one other GPU. Additionally, the files are stored locally as Parquet on the DGX-2 RAID. The shuffle was done using UCX in this scenario, so we used both NVLink and RoCE to send the shuffle data; to some degree we are minimizing the IO/network bottleneck and trying to feed the GPUs as much as we can. We ran at 10 TB with generously sized partitions (32 shuffle partitions most of the time), not the 200-partition default in Spark.
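For context, here is a minimal sketch of what that setup looks like as Spark configuration. Treat the exact keys and values as illustrative: the RapidsShuffleManager class name is pinned to the Spark version you run, and the plugin's shuffle config keys have changed across releases, so check the spark-rapids docs for your version.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative configuration only; app name and shuffle manager class are
// placeholders, not the exact settings used for the README numbers.
val spark = SparkSession.builder()
  .appName("gpu-benchmark") // hypothetical name
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  // 32 shuffle partitions instead of Spark's 200-partition default
  .config("spark.sql.shuffle.partitions", "32")
  // UCX-based shuffle so transfers can ride NVLink/RoCE
  .config("spark.shuffle.manager",
    "com.nvidia.spark.rapids.spark301.RapidsShuffleManager") // version-specific class
  .config("spark.rapids.shuffle.transport.enabled", "true")  // key name as of older releases
  .getOrCreate()
```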
We chose the queries based on shuffle size and on what the GPU supported at the time. By picking queries that do a lot of shuffle we could stress the IO components we are developing, and with most of each query covered on the GPU we could show a more accurate picture of where we are heading as we add more coverage. Our coverage has improved significantly since then, and we are continuously looking at other benchmarks (like TPC-DS), so I don't think these 4 queries are the end game in any way; they are just queries we could use to showcase our plugin.
I am a bit surprised by this. More cores should mean more scheduling opportunities, so your tasks should get CPU time sooner rather than being queued in waves as they would in a more restrictive environment. That said, it could be that you are over-committing the CPU and are now seeing negative effects. Some information on what got slower would help. Did you see the shuffle reads taking longer, for example?
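As a rough way to reason about over-commitment, here is a back-of-the-envelope sketch; all the numbers are made up for illustration, not taken from your setup:

```scala
// Concurrent task slots are executors * (spark.executor.cores / spark.task.cpus).
// If that exceeds the hardware threads actually available, tasks start
// competing for CPU time instead of getting it sooner. Values are illustrative.
val executors        = 2    // hypothetical: one executor per node
val executorCores    = 112  // spark.executor.cores, pushed past the node's threads
val taskCpus         = 1    // spark.task.cpus (Spark's default)
val hwThreadsPerNode = 96   // threads the OS actually exposes per node

val taskSlots = executors * (executorCores / taskCpus)
val hwThreads = executors * hwThreadsPerNode
if (taskSlots > hwThreads)
  println(s"over-committed: $taskSlots task slots on $hwThreads hardware threads")
```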
Does this say: with fewer executor cores, IO stopped being as much of an overhead? One potential thing that could be happening here is more pipelining of IO and compute, since there are fewer shuffle iterators trying to fetch (this goes along with the previous comment on cores above 96 having negative effects). The Spark stage page in the UI has a "Timeline" view that may help show what tasks were doing over time. But I have the same question as before: what part of the query got faster?
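Beyond eyeballing the Timeline, the Spark monitoring REST API can pull per-stage task metric distributions (executor run time, shuffle read metrics, and so on) so the two runs can be compared numerically. A sketch, assuming the default driver UI port of 4040 and placeholder app/stage IDs:

```scala
import scala.io.Source

// Fetch the task metric distribution for one stage attempt from the Spark
// monitoring REST API. The host, app id, and stage id below are placeholders.
val appId   = "app-20200101000000-0000" // hypothetical
val stageId = 3                          // hypothetical
val url = s"http://localhost:4040/api/v1/applications/$appId" +
          s"/stages/$stageId/0/taskSummary?quantiles=0.05,0.5,0.95"

val json = Source.fromURL(url).mkString
println(json) // quantiles for executorRunTime, shuffle read metrics, etc.
```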
-
Closing, please reopen if you still have questions.