Inquiry and Suggestions Regarding OpenBLAS Code Flow with OpenMP #4418

shivammonaka opened this issue Jan 5, 2024 · 8 comments

@shivammonaka
Contributor

Hello,

I've been delving into the OpenBLAS codebase, specifically the gemm_driver function in level3_thread.c with the USE_OPENMP=1 flag enabled. I've come across a section of the code where a lock is taken in the Pthreads and Win32 backends before the blas queue is initialized and assigned, and the lock is released only after the exec_blas call has completed.

My current understanding is:

  1. In the case of Pthreads, the thread pool is initialized (POSIX threads), and when a BLAS call is made the pool is used to execute it, after which the threads go back to sleep. The use of locks ensures that only one exec_blas call can execute at a time.
  2. If I use a thread pool with No_of_threads < nthreads (the number of queue partitions defined in level3_thread.c), I encounter a deadlock inside the inner_threads function.
  3. There is significant busy-waiting (NOP instructions) inside the inner_thread function for thread synchronization.

I have a few questions and would like to get suggestions from the community:

  1. I noticed that OpenMP locking mechanisms like omp_set_lock are not used; instead, busy-waiting is implemented inside the exec_blas function in blas_server_omp.c using max_parallel_number (a rough sketch of the pattern I mean is at the end of this comment). Could you please provide insight into this choice?

  2. The inner_thread function appears to deadlock when used with fewer threads. I tested this in OpenMP by creating a parallel region inside exec_blas at line 437 of blas_server_omp.c with a fixed number of threads less than nthreads. Could you shed some light on the reasons behind this behavior?

  3. Considering the busy-waiting in the code, has there been any consideration of putting threads to sleep or employing other synchronization methods to improve efficiency?

I would appreciate clarification on these points.
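
For reference, the busy-wait in question 1 has roughly the following shape (a simplified illustrative sketch using C11 atomics, not a verbatim copy of blas_server_omp.c): each concurrent exec_blas call spins until it can claim one of the NUM_PARALLEL buffer slots.

#include <stdatomic.h>
#include <stdbool.h>

#define NUM_PARALLEL 2                       /* number of shared buffer slots */
static atomic_bool buffer_inuse[NUM_PARALLEL];

/* Spin until a free slot is claimed; there is no blocking primitive here,
   so a waiting caller keeps burning its core. */
static int claim_buffer(void) {
    for (;;) {
        for (int i = 0; i < NUM_PARALLEL; i++) {
            bool expected = false;
            if (atomic_compare_exchange_weak(&buffer_inuse[i], &expected, true))
                return i;                    /* slot i claimed */
        }
    }
}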

@martin-frbg
Collaborator

Regarding 1: the code involving max_parallel_number is only a stopgap "solution" (from #1536) for the case where an OpenMP program makes two or more concurrent calls to BLAS functions and the calls to exec_blas would end up using the same queue buffers. For calls from within an OpenMP parallel region, OpenBLAS will currently use a single thread only (see the omp_in_parallel() conditional in common_thread.h). The OpenMP code in OpenBLAS for the most part dates back to the original GotoBLAS2 of 15 years ago, so it tends not to use more recent OpenMP features. I am currently trying to unravel all this, but it is turning into a bigger task than I anticipated...
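
The nested-call fallback mentioned above is conceptually just a check like the following (an illustrative sketch, not the literal code in common_thread.h):

#include <omp.h>

static int blas_threads_to_use(int configured) {
    /* Nested call from inside an existing parallel region: run on one thread. */
    return omp_in_parallel() ? 1 : configured;
}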

Regarding 2: unfortunately it is not quite clear to me what you are trying to achieve by using fewer threads inside the level3 code; possibly you are accidentally creating a situation where part of the code still expects all threads to be in use? At least I am not aware of deadlocks in "normal" operation.

Regarding 3: the pthreads code already uses a configurable THREAD_TIMEOUT for putting idle threads to sleep; the OpenMP server code appears to rely on the OMP_WAIT_POLICY of the OpenMP implementation to do the right thing. The busy-waiting inside inner_thread is a feature inherited from GotoBLAS2 (for which almost no documentation of implementation details exists).

@shivammonaka
Contributor Author

Hi @martin-frbg

  1. I have seen issue #1536, and I see that OpenBLAS falls back to single-threaded execution when it detects that it is inside an OpenMP parallel region. But instead of spinning in a while(true) loop to acquire a memory region, wouldn't it be better to use omp_locks or a conditional wait policy there? Suppose we have a 32-core machine with 32 parallel instances and NUM_PARALLEL=1: only one of them gets access, while the other 31 instances busy-wait in a while loop and waste the other cores. (See the omp_set_lock sketch after this list.)

  2. I tried limiting the number of OpenMP threads at line 438 in blas_server_omp.c and it gets stuck. The value of num was supposed to be 32, so if I use any value < 32, execution hangs.
    [screenshot]
    It is getting stuck inside the inner_threads function in level3_thread.c.
    [screenshot]
    I suspect it is waiting for all threads to arrive before moving forward, but since the number of threads is lower than the num variable in blas_server_omp.c, it gets stuck.

  3. Similar to (1), I can see that YIELDING is used inside the inner_threads function, which internally is just a series of NOP instructions, i.e. busy-waiting. Shouldn't we use some form of conditional waiting instead? I think OpenMP's taskyield and task priorities could be used in place of YIELDING, with some architectural changes.
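
The alternative suggested in (1) would look roughly like the following (an illustrative sketch with made-up names, not the actual OpenBLAS queue code): a lock around the shared buffer lets waiting callers block instead of spinning.

#include <omp.h>

static omp_lock_t buffer_lock;               /* guards the shared queue/buffer */

void buffer_lock_init(void) { omp_init_lock(&buffer_lock); }

void run_blas_job(void (*job)(void *), void *arg) {
    omp_set_lock(&buffer_lock);              /* blocks instead of busy-waiting */
    job(arg);
    omp_unset_lock(&buffer_lock);
}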

@martin-frbg
Collaborator

It is entirely possible that you know more about this topic than me, and as mentioned the core parts here stem from a time when large, fast multiprocessor machines barely existed. (The pthreads build was not thread-safe at all back then; I suspect OpenMP was mostly used as an opaque way of getting around that.)
There have been a few discussions about doing away with the ugly YIELDING in the past, but nothing conclusive came out of them.

@martin-frbg
Collaborator

FYI, it appears that taskyield is an empty stub in GCC's libgomp. The situation appears to be better with LLVM's libomp, at least from about 2020 onwards, but I have not checked in depth whether it is implemented as anything fundamentally better than the current YIELDING.

@martin-frbg
Collaborator

While I'm still looking into this, unfortunately I have not seen any gains from your suggestions, though perhaps I am doing something wrong. Did you try with omp_set_lock() yourself?

The busy-waiting is a complicated topic; we've discussed and experimented with various alternatives over the years (e.g. #1051) without conclusive results, and I notice the BLIS project is stuck in the same predicament (e.g. their issue #604).

Lastly, if you want to restrict the number of threads for a given BLAS call, you'd need to do that at the interface stage (or by invoking openblas_set_num_threads() in the calling program) to keep the thread count information consistent across the program and avoid deadlocks from waiting on phantom threads.
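
A minimal usage sketch of that approach (assuming OpenBLAS's cblas.h, which also declares openblas_set_num_threads()):

#include <cblas.h>   /* OpenBLAS header; declares openblas_set_num_threads() */

void configure_blas_threads(void) {
    /* Keep the thread count consistent program-wide instead of trying to
       change it inside the level3 code. */
    openblas_set_num_threads(4);
    /* ... subsequent BLAS calls (e.g. cblas_dgemm) will use at most 4 threads ... */
}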

@stevengj
Contributor

stevengj commented May 7, 2024

I'm told that this busy-wait/spin-lock is also causing an issue with #4577 where they are trying to plug in alternative threading backends like TBB.

Fundamentally, if you are using a backend like TBB or Cilk or partr, you give it tasks that could be executed in parallel, but you have no control over whether they actually are executed in parallel. So you cannot use a spin-lock in the tasks that assumes they are running in parallel, and you cannot use openblas_set_num_threads to control the number of threads that will actually be used — instead, openblas_set_num_threads in this context will just set the number of potentially parallel tasks that a dgemm call will be broken into.

That means that, in that context, they will need to replace the spin-lock with something else, either:

  • expose (e.g. via another callback) the backend (e.g. TBB) task-group-wait synchronization (or whatever the appropriate API is), and call it instead of a spin-lock
  • or refactor the inner-thread function into multiple calls to the threading-backend callback, which already waits for all tasks to complete, corresponding to the multiple stages that you currently spin-lock between.

For performance, it might be possible to pass additional hints to the runtime scheduler of TBB (or whatever) hinting that certain tasks should be scheduled onto certain threads. But fundamentally, you might lose some parallel performance as a tradeoff for attaining better composability.

@martin-frbg
Collaborator

I guess we could create a separate "spin-lock" function that would have to be overridden in the callback from #4577. Not sure if it is possible to determine at runtime what the threading backend is (or even just if it is TBB), but there is probably no way to come up with a backend-agnostic synchronization mechanism(?).

@stevengj
Contributor

stevengj commented May 7, 2024

You could have a function waitgroup(groupinfo) that (crudely) looks something like

if (waitgroup_callback != NULL) {
    waitgroup_callback(groupinfo);
}
else {
    /* default: busy-wait as before */
}

where there would be some API to set waitgroup_callback, and groupinfo would be some data structure with information about the group of tasks you are waiting on (which could be set by the threads callback and passed into the work function via an extra argument).
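
The registration side could be as simple as the following sketch (the names openblas_set_waitgroup_callback and struct waitgroup_info are hypothetical, not an existing OpenBLAS API):

#include <stddef.h>

/* Hypothetical opaque handle describing the group of tasks being waited on. */
typedef struct waitgroup_info waitgroup_info;
typedef void (*waitgroup_callback_t)(waitgroup_info *group);

static waitgroup_callback_t waitgroup_callback = NULL;

/* Hypothetical registration API: a threading backend (e.g. TBB via #4577)
   supplies its own task-group wait to be used instead of the busy-wait. */
void openblas_set_waitgroup_callback(waitgroup_callback_t cb) {
    waitgroup_callback = cb;
}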
