-
What is your question?
-
Our results in the README include IO and network bottlenecks; they reflect wall-clock time. The benchmarks ran on 2 DGX-2 machines, so they have very fast NVLink, and each GPU shares a 100 Gb/sec RoCE connection with one other GPU. Additionally, the files are stored locally as Parquet on the DGX-2 RAID. The shuffle was done using UCX in this scenario, so we used both NVLink and RoCE to send the shuffle data; to some degree we are minimizing the IO/network bottleneck and trying to feed the GPUs as much as we can. We ran at 10 TB with generously sized partitions (32 shuffle partitions most of the time), not the 200-partition default in Spark.
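For context, here is a minimal sketch of what that setup looks like as Spark configuration. Treat the exact keys and values as illustrative: the RapidsShuffleManager class name is pinned to the Spark version you run, and the plugin's shuffle config keys have changed across releases, so check the spark-rapids docs for your version.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative configuration only; app name and shuffle manager class are
// placeholders, not the exact settings used for the README numbers.
val spark = SparkSession.builder()
  .appName("gpu-benchmark") // hypothetical name
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  // 32 shuffle partitions instead of Spark's 200-partition default
  .config("spark.sql.shuffle.partitions", "32")
  // UCX-based shuffle so transfers can ride NVLink/RoCE
  .config("spark.shuffle.manager",
    "com.nvidia.spark.rapids.spark301.RapidsShuffleManager") // version-specific class
  .config("spark.rapids.shuffle.transport.enabled", "true")  // key name as of older releases
  .getOrCreate()
```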
We chose the queries based on shuffle size and on what the GPU supported at the time. By picking queries that do a lot of shuffle we could stress the IO components we are developing, and with most of each query covered on the GPU we could show a more accurate picture of where we are heading as we add more coverage. Our coverage has improved significantly since then, and we are continuously looking at other benchmarks (like TPC-DS), so I don't think these 4 queries are the end game in any way; they are just queries we could use to showcase our plugin.
I am a bit surprised by this. More cores should mean more scheduling opportunities, so your tasks should get CPU time sooner rather than being queued in waves as they would in a more restrictive environment. That said, it could be that you are over-committing the CPU and are now seeing negative effects. Some information on what got slower would help. Did you see the shuffle reads taking longer, for example?
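As a rough way to reason about over-commitment, here is a back-of-the-envelope sketch; all the numbers are made up for illustration, not taken from your setup:

```scala
// Concurrent task slots are executors * (spark.executor.cores / spark.task.cpus).
// If that exceeds the hardware threads actually available, tasks start
// competing for CPU time instead of getting it sooner. Values are illustrative.
val executors        = 2    // hypothetical: one executor per node
val executorCores    = 112  // spark.executor.cores, pushed past the node's threads
val taskCpus         = 1    // spark.task.cpus (Spark's default)
val hwThreadsPerNode = 96   // threads the OS actually exposes per node

val taskSlots = executors * (executorCores / taskCpus)
val hwThreads = executors * hwThreadsPerNode
if (taskSlots > hwThreads)
  println(s"over-committed: $taskSlots task slots on $hwThreads hardware threads")
```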
Does this say: with fewer executor cores, IO stopped being as much of an overhead? One potential thing that could be happening here is more pipelining of IO and compute, since there are fewer shuffle iterators trying to fetch (this goes along with the previous comment on cores above 96 having negative effects). The Spark stage page in the UI has a "Timeline" view that may help show what tasks were doing over time. But I have the same question as before: what part of the query got faster?
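Beyond eyeballing the Timeline, the Spark monitoring REST API can pull per-stage task metric distributions (executor run time, shuffle read metrics, and so on) so the two runs can be compared numerically. A sketch, assuming the default driver UI port of 4040 and placeholder app/stage IDs:

```scala
import scala.io.Source

// Fetch the task metric distribution for one stage attempt from the Spark
// monitoring REST API. The host, app id, and stage id below are placeholders.
val appId   = "app-20200101000000-0000" // hypothetical
val stageId = 3                          // hypothetical
val url = s"http://localhost:4040/api/v1/applications/$appId" +
          s"/stages/$stageId/0/taskSummary?quantiles=0.05,0.5,0.95"

val json = Source.fromURL(url).mkString
println(json) // quantiles for executorRunTime, shuffle read metrics, etc.
```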
-
Closing, please reopen if you still have questions.