
OpenMP locks instead of busy-waiting with NUM_PARALLEL #4503

Merged (2 commits, Mar 14, 2024)

Conversation

shivammonaka (Contributor)

Brief description: This pull request proposes replacing the MAX_PARALLEL_NUMBER/NUM_PARALLEL feature with an OpenMP locking mechanism, giving a more refined implementation that follows the OpenMP specification and is consistent with the Pthreads and Win32 designs.

In the current codebase, the exec_blas function in the OpenMP backend busy-waits, using the max_parallel_number feature to allow multiple instances of OpenBLAS to execute in parallel. With our changes:

  1. We introduce OpenMP locks during the initialization of the queue. We have removed the existing in-house spin lock and use omp_lock instead. This design is consistent with the existing Pthreads and Win32 threading backends.

  2. This adjustment ensures that concurrent calls to BLAS functions do not utilize the same BLAS thread buffers.

  3. OpenMP locks (currently implemented as spin locks) have room for improvement in the future. Any performance gain in the OpenMP runtime will carry over to the new implementation without requiring manual changes.

Kindly note:

  1. We have eliminated the NUM_PARALLEL flag and its associated code. NUM_PARALLEL was an add-on to the existing OPENMP_NUM_THREADS and requires recompiling OpenBLAS to use.

  2. There is no performance impact with the default setting NUM_PARALLEL=1. We use the same logic, but in a more refined manner that is aligned with the OpenMP specification and leaves room for future improvement. A minimal sketch of the locking pattern is shown below.
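For illustration, here is a minimal sketch of the serialization pattern described above; it is not the exact OpenBLAS source, and the function and variable names are hypothetical:

```cpp
// Sketch only: an omp_lock_t serializes concurrent callers so that no two
// BLAS calls use the same thread buffers at once. Names are hypothetical.
#include <omp.h>

static omp_lock_t exec_lock;

void backend_init(void)
{
  omp_init_lock(&exec_lock);    // once, during queue initialization
}

void exec_blas_serialized(void (*kernel)(void *), void *args)
{
  omp_set_lock(&exec_lock);     // competing callers wait here
  kernel(args);                 // sole user of the shared buffers
  omp_unset_lock(&exec_lock);
}
```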

For further details and discussions on this issue, please refer to: #4418

@shivammonaka (Contributor, Author)

Hi @martin-frbg. Could you review the PR and share your insights?

@martin-frbg (Collaborator)

Thank you, I see what you mean now (and why my own experiments did not look promising). Two quick comments. First, I would prefer not to have an assert in the code, if it is at all possible to fall back to some safe value instead; by its very nature, OpenBLAS may be somewhere deep in the library stack of a program, and blowing up a calculation on a "bad" BLAS call, when the end user may not even be aware that they use BLAS, is not a good idea. Second, I think I would like to preserve NUM_PARALLEL at least for now - unless you are sure that it won't be needed to address the (fairly unusual) case of multiple independent OpenMP runtimes calling into OpenBLAS (as supported by legate/legion, issue #2146).
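For what it's worth, a minimal sketch of the suggested fallback (clamping to a safe value rather than asserting); the function name and bounds are hypothetical, not OpenBLAS identifiers:

```cpp
// Sketch only: sanitize a caller-supplied thread count instead of assert()ing,
// so a "bad" BLAS call degrades gracefully deep inside a library stack.
int sanitize_num_threads(int requested, int hw_max)
{
  if (requested < 1) return 1;          // fall back instead of aborting
  if (requested > hw_max) return hw_max;
  return requested;
}
```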

@shivammonaka (Contributor, Author)

shivammonaka commented Feb 15, 2024

Thank you for your comments, @martin-frbg.

  1. I agree with you on removing the assert. I will remove it in the next commit.

  2. There are two ways to handle multiple independent OpenMP runtimes calling into OpenBLAS:

Multiple buffers (existing): We keep multiple buffers and no lock on gemm_driver, so whenever multiple independent OpenBLAS calls arrive, each gets its own buffers (limited by NUM_PARALLEL) and the remaining calls busy-wait in the while loop inside exec_blas.

Single buffer with locks (proposed in this PR, and currently used by the Pthreads and Win32 backends): The idea is to take a lock in the initial phase of the OpenBLAS call, inside the gemm_driver function. When multiple independent OpenBLAS calls come in, only one is granted the lock and the others sleep or busy-wait (depending on the inner implementation of the locking mechanism).

How the test was performed:

  1. I had a test.cpp file which called OpenBLAS inside an omp parallel region.
  2. I enabled nested threading in the test file.
  3. I disabled the omp_in_parallel check in common_thread.h so that the single-threaded path is not forced when OpenMP parallelism is detected.

So having a single buffer with a lock is fine, as only one OpenBLAS call is served at a time. It showed consistent results in my testing with multiple OpenBLAS executions in parallel.

Edit: I have attached a screenshot of the test.cpp.

[screenshot of test.cpp]
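Since the attachment is only a screenshot, here is a minimal sketch of a test along the lines described above; the matrix size and thread count are illustrative assumptions, not the original values:

```cpp
// Sketch only: several independent callers invoke a BLAS routine from inside
// an OpenMP parallel region, with nested parallelism enabled.
#include <cblas.h>
#include <omp.h>
#include <vector>

int main()
{
  omp_set_max_active_levels(2);          // enable nested threading
  const int n = 512;
  #pragma omp parallel num_threads(4)    // four concurrent BLAS callers
  {
    std::vector<double> a(n * n, 1.0), b(n * n, 1.0), c(n * n, 0.0);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a.data(), n, b.data(), n, 0.0, c.data(), n);
  }
  return 0;
}
```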

@martin-frbg (Collaborator)

Thank you - I agree that having a single set of buffers and locking out any competing caller will work; I am only wondering whether better performance could be achieved (in the future) by keeping NUM_PARALLEL as well and having that many buffers for parallel callers. The only drawback is that there would probably need to be more duplication of the queue control structures, so that a caller cannot end up reusing idle threads configured for some other parallel process. (This is a problem for both pthreads and OpenMP I think, and likely win32 threads as well - at this point I only suggest not throwing out the NUM_PARALLEL option and its associated code, when we may need it again somewhat later.)

@shivammonaka (Contributor, Author)

shivammonaka commented Feb 19, 2024

Hi @martin-frbg

I would like your views on some points before moving forward with the modifications.

(Future performance gain with NUM_PARALLEL): I want to know your thoughts on how NUM_PARALLEL could be used to achieve better OpenBLAS efficiency, and about your future plans for it. I didn't want to remove an existing feature, but I failed to find any path to future improvement using it.

  1. Currently, the Pthreads and Win32 backends allow only a single BLAS call at a time. Is there any reason for this (hardware-related, cache, or anything else)?
  2. Are you going to add a NUM_PARALLEL-type feature to the Pthreads and Win32 backends as well?

(How to proceed): There are two ways we can have NUM_PARALLEL with OpenMP locks:

  1. OpenMP locks coexisting with NUM_PARALLEL: We can have a different lock for each NUM_PARALLEL region (see the sketch after this list).
  2. OpenMP locks as a special case: We can use the OpenMP-lock flow when NUM_PARALLEL = 1, since it behaves exactly the same as the existing design, and keep the existing flow with locks disabled when NUM_PARALLEL != 1.
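A minimal sketch of option 1, assuming a small fixed NUM_PARALLEL; the names are hypothetical, not OpenBLAS identifiers:

```cpp
// Sketch only: one lock (and one buffer set) per parallel slot. A caller
// spins over the slots with non-blocking attempts until it owns one.
#include <omp.h>

constexpr int NUM_PARALLEL = 2;
static omp_lock_t slot_lock[NUM_PARALLEL];

void init_slots()
{
  for (int i = 0; i < NUM_PARALLEL; ++i)
    omp_init_lock(&slot_lock[i]);
}

int acquire_slot()
{
  for (;;)                                // keep trying until a slot frees up
    for (int i = 0; i < NUM_PARALLEL; ++i)
      if (omp_test_lock(&slot_lock[i]))   // non-blocking attempt
        return i;                         // caller now owns buffer set i
}

void release_slot(int i)
{
  omp_unset_lock(&slot_lock[i]);
}
```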

Please let me know which design is in line with your future plans for NUM_PARALLEL.

@martin-frbg (Collaborator)

Sorry for the delay - some other things going on in real life - what I'm thinking about is OpenMP locks coexisting with NUM_PARALLEL (for some small number of NUM_PARALLEL, like 2 or 3) if that works out to be practical.

I think the limitation to one caller is purely historical - most of the "threading infrastructure" part of OpenBLAS is still the same as the original GotoBLAS from close to twenty years ago, written from scratch by a single developer and appropriate for the hardware of the time. (It wasn't until about six years ago that it became apparent that the pthreads build was not thread-safe at all even for a single caller.)

@shivammonaka (Contributor, Author)

Hi @martin-frbg

  1. My understanding is that you want to test locks with some small number of parallel regions, like 2 or 3, but what should the flow be if the value of NUM_PARALLEL is, say, 16? One thing we can do is have the same number of static locks as NUM_PARALLEL, with each thread using the buffer associated with the lock it acquired.

  2. I think we can allow multiple OpenBLAS calls to execute at a time in Pthreads and Win32 as well, but executing them with the current inner_threads might have some issues: (1) I am not sure whether the current inner_threads only utilizes the cache efficiently when one BLAS call executes at a time, i.e. whether allowing more concurrent BLAS calls would lead to more cache misses. (2) We might need to do something about YIELDING in inner_threads (change it from spinning to an actual thread yield).

  3. Lastly, I want to point out that I think some performance could be unlocked if we had an actual thread-yield and sleep mechanism inside OpenMP; currently the OpenMP locks spin. Do you have any workaround for this?

@martin-frbg (Collaborator)

  1. Yes, the same number of locks (and probably just forbid NUM_PARALLEL greater than 2 or 3) - but I'm not yet sure how much duplication is needed elsewhere; it could be that it quickly gets too expensive in terms of memory.
  2. It probably depends on the size of the hardware; some big NUMA server may be better suited for this, but I'm just exploring... YIELDING is already asm("nop;nop") on some platforms, which should make it behave better. The only reason it is still sched_yield or equivalent on others is that it seemed good enough whenever someone looked into it.
  3. No idea yet for replacing the spin locks; I looked into it but ran out of time, and maybe I am simply not good enough at OpenMP coding. On the other hand, I get the impression that the default OMP_WAIT_POLICY is implementation-dependent, and some platforms could gain from expressly setting it to "PASSIVE".

@shivammonaka (Contributor, Author)

Hi @martin-frbg

As discussed, I have made some changes. OpenBLAS now controls parallel execution inside OpenMP with a single lock; I have tried to explain the lock logic in the comments.

I have tested it and it seems to be working fine. Please review the changes and let me know if any modification is required.

@martin-frbg (Collaborator)

Thank you - unfortunately it looks as if I am not getting much of anything done today, so I've only had a quick look so far

@shivammonaka (Contributor, Author)

Hi @martin-frbg
Any comments/suggestions on the PR?

@martin-frbg (Collaborator)

sorry I'm currently sick, hope to get to this later in the week

@shivammonaka (Contributor, Author)

Oh, take your time. I hope you get well soon.

@shivammonaka (Contributor, Author)

Hi @martin-frbg
Any comments/suggestions on the PR?

@martin-frbg (Collaborator)

Looks good (as far as I can tell today), thank you very much. I intend to merge it tomorrow.

@martin-frbg martin-frbg added this to the 0.3.27 milestone Mar 14, 2024
@martin-frbg martin-frbg merged commit 66bde62 into OpenMathLib:develop Mar 14, 2024
65 of 69 checks passed