[QST] Setting Exclusive_Mode on GPUs making it unavailable for ML workloads #5338
-
What is your question?
Interestingly, this happens even if there is no active Spark job running on the cluster, which should ideally mean the GPUs are available to be used by other processes?
-
Are you using GPU scheduling on the YARN setup?
-
@tgravescs Yes, I'm using GPU scheduling on the YARN setup by setting the following YARN configs: …
And setting all GPUs with … Here is the output of …
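For context, a minimal sketch of what Spark-side GPU scheduling on YARN typically looks like; the amounts and the discovery-script path below are assumptions for illustration, not necessarily what was used here:

```python
from pyspark.sql import SparkSession

# Hedged sketch: typical Spark-on-YARN GPU scheduling configs.
spark = (
    SparkSession.builder
    .appName("gpu-scheduling-on-yarn")
    .master("yarn")
    # Ask YARN for 1 GPU per executor and expose 1 GPU per task.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    # Script that reports which GPU addresses were assigned to the executor
    # (path is an assumption; Spark ships an example getGpusResources.sh).
    .config("spark.executor.resource.gpu.discoveryScript",
            "/opt/sparkRapidsPlugin/getGpusResources.sh")
    .getOrCreate()
)
```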
-
No, something else is using your GPUs. If it's not scheduling them through YARN, then they will obviously conflict, which is why you get an error about the CUDA devices not being available. I suggest you look at what those processes are and whether they are using GPU scheduling in YARN as well.
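As a hedged aid for that check, a small Python sketch using the NVML bindings (assumes the nvidia-ml-py package is installed; the thread itself did not prescribe a tool):

```python
# Sketch: list the processes holding each GPU and its compute mode,
# to identify what is conflicting with Spark.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    # 3 == NVML_COMPUTEMODE_EXCLUSIVE_PROCESS
    mode = pynvml.nvmlDeviceGetComputeMode(handle)
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    print(f"GPU {i}: compute mode={mode}")
    for p in procs:
        print(f"  pid={p.pid} usedGpuMemory={p.usedGpuMemory}")
pynvml.nvmlShutdown()
```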
-
@tgravescs I checked with @Niharikadutta today and we suspect those 4 Java processes are Spark executors. Since they are currently using YARN without isolation mode, all 4 GPUs are in exclusive mode. If all Spark executors keep running even after the Spark job finishes, is there any way to run other ML jobs on the GPUs? If @Niharikadutta switches from "YARN without Isolation" to "YARN 3.1.3 with Isolation and GPU Scheduling Enabled" mode, then of course there is no need to enable the GPUs in exclusive mode.
-
I don't understand the situation:
What exactly do you mean by this? Is the Spark application finishing, or is this trying to run ML jobs within the same application? I.e., one application uses the SQL plugin to run ETL jobs and then afterwards the same application runs the ML job. Are these ML jobs Python?
-
If the answer to the above is yes, i.e. it's all in the same application and you have separate Java and Python processes, then you couldn't do it in process-exclusive mode, but if you are using YARN with isolation and GPU scheduling, it would allow both to use the GPU. You would need to split the GPU memory across them, though; otherwise the RAPIDS plugin will use all of the GPU memory.
-
@Niharikadutta @tgravescs As discussed today, if the ML job (Horovod on Spark) needs to run after the ETL portion has finished, we may need to split the GPU memory into two parts: one for the ETL job (Spark executor), one for the Horovod job (which is a separate Python process using the GPU). Say for a T4 with 16GB of GPU memory, we can allocate 6GB to the Spark executor and 10GB to other stuff (including ML, the Horovod job in this case). The 10GB can be reserved by using the parameter spark.rapids.memory.gpu.reserve. To check whether the Horovod issue is caused by insufficient GPU memory, we can run one test with spark.rapids.memory.gpu.reserve set to…
If the issue still exists, then we know this issue is not related to GPU memory allocation.
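A minimal PySpark sketch of that split, assuming a 16GB T4; the allocFraction value and the 10GB reserve are illustrative only, since the exact test value was left open above:

```python
from pyspark.sql import SparkSession

# Illustrative sketch: pool only ~6GB of a 16GB T4 for the RAPIDS plugin
# and keep ~10GB free for the separate Horovod python process.
spark = (
    SparkSession.builder
    .appName("etl-with-gpu-headroom")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    # Fraction of GPU memory the plugin may allocate at startup (~6/16).
    .config("spark.rapids.memory.gpu.allocFraction", "0.375")
    # Bytes the plugin leaves unallocated for other processes (~10GB).
    .config("spark.rapids.memory.gpu.reserve", str(10 * 1024**3))
    .getOrCreate()
)
```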
-
Another option I can think of is to:
-
You can't really restart the executors in Spark. You could potentially use the stage-level scheduling feature (http://spark.apache.org/docs/latest/configuration.html#stage-level-scheduling-overview), but it requires dynamic allocation. Essentially, with that you would get one set of executors to do the ETL with the spark-rapids plugin, and then you would specify a different profile for the ML side, which would tell Spark to get different executors to run the ML on. Note that with dynamic allocation, the ETL executors would just idle-timeout when not in use.
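A hedged PySpark sketch of the stage-level scheduling idea (requires spark.dynamicAllocation.enabled=true; the resource amounts, discovery-script path, and the etl_df name are assumptions for illustration):

```python
from pyspark.resource import (
    ExecutorResourceRequests,
    ResourceProfileBuilder,
    TaskResourceRequests,
)

# Executor/task requests for the ML stage; amounts are illustrative.
ereq = (
    ExecutorResourceRequests()
    .cores(4)
    .resource("gpu", 1,
              discoveryScript="/opt/sparkRapidsPlugin/getGpusResources.sh")
)
treq = TaskResourceRequests().cpus(1).resource("gpu", 1)
ml_profile = ResourceProfileBuilder().require(ereq).require(treq).build

# Attaching the profile to the ML stage's RDD makes Spark acquire a new
# set of executors for it; the ETL executors idle-timeout when unused.
ml_rdd = etl_df.rdd.withResources(ml_profile)
```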
-
Yes @tgravescs. Here "restart the Spark executors" actually means restarting the Spark jobs in the same pool.
-
Closing this issue since this is by design.