[QST] Setting Exclusive_Mode on GPUs making it unavailable for ML workloads #5338
-
What is your question?
Interestingly, this happens even if there is no active Spark job running on the cluster, which should ideally mean the GPUs are available to be used by other processes?
-
Are you using GPU scheduling on the YARN setup?
-
@tgravescs Yes, I'm using GPU scheduling on the YARN setup by setting the following YARN configs: …
And setting all GPUs with … Here is the output of …
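For context, a minimal sketch of what Spark-side GPU scheduling on YARN typically looks like; the amounts and the discovery-script path below are assumptions for illustration, not necessarily what was used here:

```python
from pyspark.sql import SparkSession

# Hedged sketch: typical Spark-on-YARN GPU scheduling configs.
spark = (
    SparkSession.builder
    .appName("gpu-scheduling-on-yarn")
    .master("yarn")
    # Ask YARN for 1 GPU per executor and expose 1 GPU per task.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    # Script that reports which GPU addresses were assigned to the executor
    # (path is an assumption; Spark ships an example getGpusResources.sh).
    .config("spark.executor.resource.gpu.discoveryScript",
            "/opt/sparkRapidsPlugin/getGpusResources.sh")
    .getOrCreate()
)
```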
-
No, something else is using your GPUs. If it's not scheduling them through YARN, then they will obviously conflict, which is why you get an error about the CUDA devices not being available. I suggest you look at what those processes are and whether they are using GPU scheduling in YARN as well.
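As a hedged aid for that check, a small Python sketch using the NVML bindings (assumes the nvidia-ml-py package is installed; the thread itself did not prescribe a tool):

```python
# Sketch: list the processes holding each GPU and its compute mode,
# to identify what is conflicting with Spark.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    # 3 == NVML_COMPUTEMODE_EXCLUSIVE_PROCESS
    mode = pynvml.nvmlDeviceGetComputeMode(handle)
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    print(f"GPU {i}: compute mode={mode}")
    for p in procs:
        print(f"  pid={p.pid} usedGpuMemory={p.usedGpuMemory}")
pynvml.nvmlShutdown()
```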
-
@tgravescs I checked with @Niharikadutta today and we suspect those 4 Java processes are Spark executors. Since they are currently using YARN without isolation mode, all 4 GPUs are in exclusive mode. If all Spark executors keep running even after the Spark job finishes, is there any way to run other ML jobs on the GPUs? If @Niharikadutta switches from "YARN without Isolation" to "YARN 3.1.3 with Isolation and GPU Scheduling Enabled" mode, then of course there is no need to enable the GPUs in exclusive mode.
-
I don't understand the situation:
What exactly do you mean by this? Is the Spark application finishing, or is this trying to run ML jobs within the same application? I.e., one application uses the SQL plugin to run ETL jobs and then afterwards the same application runs the ML job. Are these ML jobs Python?
-
If the answer to the above is yes, i.e. it's all in the same application and you have separate Java and Python processes, then you couldn't do it in process-exclusive mode, but if you are using YARN with isolation and GPU scheduling, it would allow both to use the GPU. You would need to split the GPU memory across them, though; otherwise the RAPIDS plugin will use all of the GPU memory.
-
@Niharikadutta @tgravescs As discussed today, if the ML job (Horovod on Spark) needs to run after the ETL portion has finished, we may need to split the GPU memory into two parts: one for the ETL job (Spark executor), one for the Horovod job (which is a separate Python process using the GPU). Say for a T4 with 16GB of GPU memory, we can allocate 6GB to the Spark executor and 10GB to other stuff (including ML, the Horovod job in this case). The 10GB can be reserved by using the parameter spark.rapids.memory.gpu.reserve. To check whether the Horovod issue is caused by insufficient GPU memory, we can run one test with spark.rapids.memory.gpu.reserve set to…
If the issue still exists, then we know this issue is not related to GPU memory allocation.
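A minimal PySpark sketch of that split, assuming a 16GB T4; the allocFraction value and the 10GB reserve are illustrative only, since the exact test value was left open above:

```python
from pyspark.sql import SparkSession

# Illustrative sketch: pool only ~6GB of a 16GB T4 for the RAPIDS plugin
# and keep ~10GB free for the separate Horovod python process.
spark = (
    SparkSession.builder
    .appName("etl-with-gpu-headroom")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    # Fraction of GPU memory the plugin may allocate at startup (~6/16).
    .config("spark.rapids.memory.gpu.allocFraction", "0.375")
    # Bytes the plugin leaves unallocated for other processes (~10GB).
    .config("spark.rapids.memory.gpu.reserve", str(10 * 1024**3))
    .getOrCreate()
)
```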
-
Another option I can think of is to:
-
You can't really restart the executors in Spark. You could potentially use the stage-level scheduling feature (http://spark.apache.org/docs/latest/configuration.html#stage-level-scheduling-overview), but it requires dynamic allocation. Essentially, with that you would get one set of executors to do the ETL with the spark-rapids plugin, and then you would specify a different profile for the ML side, which would tell Spark to get different executors to run the ML on. Note that with dynamic allocation, the ETL executors would just idle-timeout when not in use.
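A hedged PySpark sketch of the stage-level scheduling idea (requires spark.dynamicAllocation.enabled=true; the resource amounts, discovery-script path, and the etl_df name are assumptions for illustration):

```python
from pyspark.resource import (
    ExecutorResourceRequests,
    ResourceProfileBuilder,
    TaskResourceRequests,
)

# Executor/task requests for the ML stage; amounts are illustrative.
ereq = (
    ExecutorResourceRequests()
    .cores(4)
    .resource("gpu", 1,
              discoveryScript="/opt/sparkRapidsPlugin/getGpusResources.sh")
)
treq = TaskResourceRequests().cpus(1).resource("gpu", 1)
ml_profile = ResourceProfileBuilder().require(ereq).require(treq).build

# Attaching the profile to the ML stage's RDD makes Spark acquire a new
# set of executors for it; the ETL executors idle-timeout when unused.
ml_rdd = etl_df.rdd.withResources(ml_profile)
```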
-
Yes @tgravescs. Here "restart the Spark executors" actually means restarting the Spark jobs in the same pool.
-
Closing this issue since this is by design.