
OpenMP locks instead of busy-waiting with NUM_PARALLEL #4503

Merged (2 commits, Mar 14, 2024)

Conversation

shivammonaka (Contributor)

Brief description: This pull request proposes replacing the MAX_PARALLEL_NUMBER/NUM_PARALLEL feature with an OpenMP locking mechanism, giving a more refined implementation that follows the OpenMP specification and is consistent with the Pthreads and Win32 designs.

In the current codebase, the exec_blas function in the OpenMP backend busy-waits, using the max_parallel_number feature to allow multiple instances of OpenBLAS to execute in parallel. With our changes:

  1. We introduce OpenMP locks during the initialization of the queue. We have removed the existing in-house spin lock and use omp_lock instead. This design is consistent with the existing Pthreads and Win32 threading backends.

  2. This adjustment ensures that concurrent calls to BLAS functions do not utilize the same BLAS thread buffers.

  3. OpenMP locks (currently implemented as spin locks) have room for improvement in the future. Any performance gain in the OpenMP runtime will carry over to the new implementation without requiring manual changes.

Kindly note:

  1. We have eliminated the NUM_PARALLEL flag and its associated code. NUM_PARALLEL was an add-on to the existing OPENMP_NUM_THREADS and requires recompiling OpenBLAS to use.

  2. There is no performance impact with the default setting NUM_PARALLEL=1. We use the same logic, but in a more refined manner that is aligned with the OpenMP specification and leaves room for future improvement. A minimal sketch of the locking pattern is shown below.
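For illustration, here is a minimal sketch of the serialization pattern described above; it is not the exact OpenBLAS source, and the function and variable names are hypothetical:

```cpp
// Sketch only: an omp_lock_t serializes concurrent callers so that no two
// BLAS calls use the same thread buffers at once. Names are hypothetical.
#include <omp.h>

static omp_lock_t exec_lock;

void backend_init(void)
{
  omp_init_lock(&exec_lock);    // once, during queue initialization
}

void exec_blas_serialized(void (*kernel)(void *), void *args)
{
  omp_set_lock(&exec_lock);     // competing callers wait here
  kernel(args);                 // sole user of the shared buffers
  omp_unset_lock(&exec_lock);
}
```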

For further details and discussions on this issue, please refer to: #4418

@shivammonaka (Contributor, Author)

Hi @martin-frbg. Could you review the PR and share your insights?

@martin-frbg (Collaborator)

Thank you, I see what you mean now (and why my own experiments did not look promising). Two quick comments. First, I would prefer not to have an assert in the code, if it is at all possible to fall back to some safe value instead; by its very nature, OpenBLAS may be somewhere deep in the library stack of a program, and blowing up a calculation on a "bad" BLAS call, when the end user may not even be aware that they use BLAS, is not a good idea. Second, I think I would like to preserve NUM_PARALLEL at least for now - unless you are sure that it won't be needed to address the (fairly unusual) case of multiple independent OpenMP runtimes calling into OpenBLAS (as supported by legate/legion, issue #2146).
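For what it's worth, a minimal sketch of the suggested fallback (clamping to a safe value rather than asserting); the function name and bounds are hypothetical, not OpenBLAS identifiers:

```cpp
// Sketch only: sanitize a caller-supplied thread count instead of assert()ing,
// so a "bad" BLAS call degrades gracefully deep inside a library stack.
int sanitize_num_threads(int requested, int hw_max)
{
  if (requested < 1) return 1;          // fall back instead of aborting
  if (requested > hw_max) return hw_max;
  return requested;
}
```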

@shivammonaka (Contributor, Author)

shivammonaka commented Feb 15, 2024

Thank you for your comments, @martin-frbg.

  1. I agree with you on removing the assert. I will remove it in the next commit.

  2. There are two ways to handle multiple independent OpenMP runtimes calling into OpenBLAS:

Multiple buffers (existing): We keep multiple buffers and no lock on gemm_driver, so whenever multiple independent OpenBLAS calls arrive, each gets its own buffers (limited by NUM_PARALLEL) and the remaining calls busy-wait in the while loop inside exec_blas.

Single buffer with locks (proposed in this PR, and currently used by the Pthreads and Win32 backends): The idea is to take a lock in the initial phase of the OpenBLAS call, inside the gemm_driver function. When multiple independent OpenBLAS calls come in, only one is granted the lock and the others sleep or busy-wait (depending on the inner implementation of the locking mechanism).

How the test was performed:

  1. I had a test.cpp file which called OpenBLAS inside an omp parallel region.
  2. I enabled nested threading in the test file.
  3. I disabled the omp_in_parallel check in common_thread.h so that the single-threaded path is not forced when OpenMP parallelism is detected.

So having a single buffer with a lock is fine, as only one OpenBLAS call is served at a time. It showed consistent results in my testing with multiple OpenBLAS executions in parallel.

Edit: I have attached a screenshot of the test.cpp.

[screenshot of test.cpp]
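Since the attachment is only a screenshot, here is a minimal sketch of a test along the lines described above; the matrix size and thread count are illustrative assumptions, not the original values:

```cpp
// Sketch only: several independent callers invoke a BLAS routine from inside
// an OpenMP parallel region, with nested parallelism enabled.
#include <cblas.h>
#include <omp.h>
#include <vector>

int main()
{
  omp_set_max_active_levels(2);          // enable nested threading
  const int n = 512;
  #pragma omp parallel num_threads(4)    // four concurrent BLAS callers
  {
    std::vector<double> a(n * n, 1.0), b(n * n, 1.0), c(n * n, 0.0);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a.data(), n, b.data(), n, 0.0, c.data(), n);
  }
  return 0;
}
```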

@martin-frbg (Collaborator)

Thank you - I agree that having a single set of buffers and locking out any competing caller will work; I am only wondering whether better performance could be achieved (in the future) by keeping NUM_PARALLEL as well and having that many buffers for parallel callers. The only drawback is that there would probably need to be more duplication of the queue control structures, so that a caller cannot end up reusing idle threads configured for some other parallel process. (This is a problem for both pthreads and OpenMP I think, and likely win32 threads as well - at this point I only suggest not throwing out the NUM_PARALLEL option and its associated code, when we may need it again somewhat later.)

@shivammonaka (Contributor, Author)

shivammonaka commented Feb 19, 2024

Hi @martin-frbg

I would like your views on some points before moving forward with the modifications.

(Future performance gain with NUM_PARALLEL): I want to know your thoughts on how NUM_PARALLEL could be used to achieve better OpenBLAS efficiency, and about your future plans for it. I didn't want to remove an existing feature, but I failed to find any path to future improvement using it.

  1. Currently, the Pthreads and Win32 backends allow only a single BLAS call at a time. Is there any reason for this (hardware-related, cache, or anything else)?
  2. Are you going to add a NUM_PARALLEL-type feature to the Pthreads and Win32 backends as well?

(How to proceed): There are two ways we can have NUM_PARALLEL with OpenMP locks:

  1. OpenMP locks coexisting with NUM_PARALLEL: We can have a different lock for each NUM_PARALLEL region (see the sketch after this list).
  2. OpenMP locks as a special case: We can use the OpenMP-lock flow when NUM_PARALLEL = 1, since it behaves exactly the same as the existing design, and keep the existing flow with locks disabled when NUM_PARALLEL != 1.
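A minimal sketch of option 1, assuming a small fixed NUM_PARALLEL; the names are hypothetical, not OpenBLAS identifiers:

```cpp
// Sketch only: one lock (and one buffer set) per parallel slot. A caller
// spins over the slots with non-blocking attempts until it owns one.
#include <omp.h>

constexpr int NUM_PARALLEL = 2;
static omp_lock_t slot_lock[NUM_PARALLEL];

void init_slots()
{
  for (int i = 0; i < NUM_PARALLEL; ++i)
    omp_init_lock(&slot_lock[i]);
}

int acquire_slot()
{
  for (;;)                                // keep trying until a slot frees up
    for (int i = 0; i < NUM_PARALLEL; ++i)
      if (omp_test_lock(&slot_lock[i]))   // non-blocking attempt
        return i;                         // caller now owns buffer set i
}

void release_slot(int i)
{
  omp_unset_lock(&slot_lock[i]);
}
```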

Please let me know which design is in line with your future plans for NUM_PARALLEL.

@martin-frbg (Collaborator)

Sorry for the delay - some other things going on in real life - what I'm thinking about is OpenMP locks coexisting with NUM_PARALLEL (for some small number of NUM_PARALLEL, like 2 or 3) if that works out to be practical.

I think the limitation to one caller is purely historical - most of the "threading infrastructure" part of OpenBLAS is still the same as the original GotoBLAS from close to twenty years ago, written from scratch by a single developer and appropriate for the hardware of the time. (It wasn't until about six years ago that it became apparent that the pthreads build was not thread-safe at all even for a single caller.)

@shivammonaka (Contributor, Author)

Hi @martin-frbg

  1. My understanding is that you want to test locks with some small number of parallel regions, like 2 or 3, but what should the flow be if the value of NUM_PARALLEL is, say, 16? One thing we can do is have the same number of static locks as NUM_PARALLEL, with each thread using the buffer associated with the lock it acquired.

  2. I think we can allow multiple OpenBLAS calls to execute at a time in Pthreads and Win32 as well, but executing them with the current inner_threads might have some issues: (1) I am not sure whether the current inner_threads only utilizes the cache efficiently when one BLAS call executes at a time, i.e. whether allowing more concurrent BLAS calls would lead to more cache misses. (2) We might need to do something about YIELDING in inner_threads (change it from spinning to an actual thread yield).

  3. Lastly, I want to point out that I think some performance could be unlocked if we had an actual thread-yield and sleep mechanism inside OpenMP; currently the OpenMP locks spin. Do you have any workaround for this?

@martin-frbg (Collaborator)

  1. Yes, the same number of locks (and probably just forbid NUM_PARALLEL greater than 2 or 3) - but I'm not yet sure how much duplication is needed elsewhere; it could be that it quickly gets too expensive in terms of memory.
  2. It probably depends on the size of the hardware; some big NUMA server may be better suited for this, but I'm just exploring... YIELDING is already asm("nop;nop") on some platforms, which should make it behave better. The only reason it is still sched_yield or equivalent on others is that it seemed good enough whenever someone looked into it.
  3. No idea yet for replacing the spin locks; I looked into it but ran out of time, and maybe I am simply not good enough at OpenMP coding. On the other hand, I get the impression that the default OMP_WAIT_POLICY is implementation-dependent, and some platforms could gain from expressly setting it to "PASSIVE".

@shivammonaka (Contributor, Author)

Hi @martin-frbg

As discussed, I have made some changes. OpenBLAS now controls parallel execution inside OpenMP with a single lock; I have tried to explain the lock logic in the comments.

I have tested it and it seems to be working fine. Please review the changes and let me know if any modification is required.

@martin-frbg (Collaborator)

Thank you - unfortunately it looks as if I am not getting much of anything done today, so I've only had a quick look so far

@shivammonaka (Contributor, Author)

Hi @martin-frbg
Any comments/suggestions on the PR?

@martin-frbg (Collaborator)

sorry I'm currently sick, hope to get to this later in the week

@shivammonaka (Contributor, Author)

Oh, take your time. I hope you get well soon.

@shivammonaka (Contributor, Author)

Hi @martin-frbg
Any comments/suggestions on the PR?

@martin-frbg (Collaborator)

Looks good (as far as I can tell today), thank you very much. I intend to merge it tomorrow.

@martin-frbg martin-frbg added this to the 0.3.27 milestone Mar 14, 2024
@martin-frbg martin-frbg merged commit 66bde62 into OpenMathLib:develop Mar 14, 2024
65 of 69 checks passed