TOOLS/PERF: Changes to enable the rocm perf modules #4349

Closed
wants to merge 9 commits into from

Contributor

@paklui paklui commented Oct 29, 2019

What

This PR includes the changes needed to build the ROCm perf modules, so that ucx_perftest can run on ROCm devices.

Why?

ROCm support in ucx_perftest is currently not enabled; this change enables it. It does not change the existing ROCm functionality in UCX.

@swx-jenkins3
Collaborator

Can one of the admins verify this patch?

@dmitrygx
Member

ok to test

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/12895/ for details (Mellanox internal link).

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/12896/ for details (Mellanox internal link).

@mellanox-github
Contributor

Mellanox CI: FAILED on 8 of 29 workers

Note: the logs will be deleted after 05-Nov-2019

Agent/Stage Status
_main ❓ ABORTED
hpc-arm-cavium-jenkins_W0 ❓ ABORTED
hpc-arm-cavium-jenkins_W1 ❓ ABORTED
hpc-arm-cavium-jenkins_W2 ❓ ABORTED
hpc-arm-cavium-jenkins_W3 ❓ ABORTED
hpc-test-node-legacy_W0 ❓ ABORTED
hpc-test-node-legacy_W1 ❓ ABORTED
hpc-test-node-legacy_W2 ❓ ABORTED
hpc-test-node-legacy_W3 ❓ ABORTED
hpc-test-node-new_W0 ❓ ABORTED
hpc-test-node-new_W1 ❓ ABORTED
hpc-test-node-new_W2 ❓ ABORTED
hpc-test-node-new_W3 ❓ ABORTED
hpc-arm-hwi-jenkins_W0 ❌ FAILURE
hpc-arm-hwi-jenkins_W1 ❌ FAILURE
hpc-arm-hwi-jenkins_W2 ❌ FAILURE
hpc-arm-hwi-jenkins_W3 ❌ FAILURE
hpc-test-node-gpu_W0 ❌ FAILURE
hpc-test-node-gpu_W1 ❌ FAILURE
hpc-test-node-gpu_W2 ❌ FAILURE
hpc-test-node-gpu_W3 ❌ FAILURE
hpc-test-althca_W0 ✔️ SUCCESS
hpc-test-althca_W1 ✔️ SUCCESS
hpc-test-althca_W2 ✔️ SUCCESS
hpc-test-althca_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@mellanox-github
Contributor

Mellanox CI: FAILED on 4 of 25 workers

Note: the logs will be deleted after 05-Nov-2019

Agent/Stage Status
_main ❓ ABORTED
hpc-arm-cavium-jenkins_W0 ❓ ABORTED
hpc-arm-cavium-jenkins_W1 ❓ ABORTED
hpc-arm-cavium-jenkins_W2 ❓ ABORTED
hpc-arm-cavium-jenkins_W3 ❓ ABORTED
hpc-test-node-legacy_W0 ❓ ABORTED
hpc-test-node-legacy_W1 ❓ ABORTED
hpc-test-node-legacy_W2 ❓ ABORTED
hpc-test-node-legacy_W3 ❓ ABORTED
hpc-test-node-new_W0 ❓ ABORTED
hpc-test-node-new_W1 ❓ ABORTED
hpc-test-node-new_W2 ❓ ABORTED
hpc-test-node-new_W3 ❓ ABORTED
r-vmb-ppc-jenkins_W1 ❓ ABORTED
r-vmb-ppc-jenkins_W2 ❓ ABORTED
hpc-test-node-gpu_W0 ❌ FAILURE
hpc-test-node-gpu_W1 ❌ FAILURE
hpc-test-node-gpu_W2 ❌ FAILURE
hpc-test-node-gpu_W3 ❌ FAILURE
hpc-test-althca_W0 ✔️ SUCCESS
hpc-test-althca_W1 ✔️ SUCCESS
hpc-test-althca_W2 ✔️ SUCCESS
hpc-test-althca_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@paklui paklui changed the title from "Changes to enable the rocm perf modules" to "TOOLS/PERF: Changes to enable the rocm perf modules" Nov 11, 2019

@mellanox-github
Contributor

Mellanox CI: FAILED on 5 of 25 workers

Note: the logs will be deleted after 18-Nov-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-test-node-gpu_W0 ❌ FAILURE
hpc-test-node-gpu_W1 ❌ FAILURE
hpc-test-node-gpu_W2 ❌ FAILURE
hpc-test-node-gpu_W3 ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@mellanox-github
Contributor

Mellanox CI: FAILED on 5 of 25 workers

Note: the logs will be deleted after 18-Nov-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-test-node-gpu_W0 ❌ FAILURE
hpc-test-node-gpu_W1 ❌ FAILURE
hpc-test-node-gpu_W2 ❌ FAILURE
hpc-test-node-gpu_W3 ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

# This is the 1st commit message:

Changes to enable the rocm perf modules

# This is the commit message #2:

UCT: rdmacm on iface: null check for the UNREACHABLE event's priv_data

# This is the commit message #3:

UCT: rdmacm on iface - remove redundant static assert.

# This is the commit message #4:

GTEST/UCP: Fix clang 9 warnings for OpenMP loop variable initialization

# This is the commit message #5:

UCT/IB/MLX5: Modify QP via DEVX

# This is the commit message #6:

MEM/TYPE-DETECT: use const pointer

- use const void* buffers for memory type detection

# This is the commit message #7:

UCT.H: fixed typo in description

# This is the commit message #8:

UCM/UTIL: Add error prints to mmap() failures

# This is the commit message #9:

TOOLS/PROFILE: Support custom pager

# This is the commit message #10:

UCT/RCX: MP XRQ support

# This is the commit message #11:

TOOLS/TESTS/BUILD-PGI: fixed compilation flags

- fixed compilation flags detection for PGI compiler
- optimized ICC flags settings

# This is the commit message #12:

UCP/CORE, GTEST/UCP: Add tests for zero-thresholds + fixes for RNDV thresholds

# This is the commit message #13:

GTEST/UCP: Remove duplicate test

# This is the commit message #14:

GTEST/UCP: Initialize m_env inside ctor

# This is the commit message #15:

UCS/MEMTYPE_CACHE, GTEST/UCS: Align regions by Page Table alignment

# This is the commit message #16:

TOOLS/PERF: Print an error if memory type is unsupported

# This is the commit message #17:

UCT/IB: Add multi-threaded MR handling

Performance on Intel E5-2697A, 32 cores:

Without patch:
100G shmem_init:      25.32 secs
     shmem_finalize:   3.26 secs
200G shmem_init:     106.29 secs
     shmem_finalize:   9.55 secs

With patch:
100G shmem_init:       1.63 secs
     shmem_finalize:   0.44 secs
200G shmem_init:       3.01 secs
     shmem_finalize:   0.85 secs

# This is the commit message #18:

UCS/ASYNC: Support sync-removing a handler from its callback

Allow removing the handler from its callback in a synchronous way. This
means the function will return when the only remaining reference to the
handler is of the current callback.

# This is the commit message #19:

EXAMPLES/JAVA: Sleep for 3 seconds after endpoint destroy

Give grace period for other side to disconnect before worker is
destroyed. Need to remove this when UCP close protocol is implemented.

Fixes test failures with "Retry count exceeded".

# This is the commit message #20:

UCS/ASYNC: Code review fixes and reword comment

# This is the commit message #21:

UCS/STRING: Add ucs_snprintf_safe()

# This is the commit message #22:

UCT/TCP, UCS/SYS: Repeat connection establishment when dropped connections are detected

# This is the commit message #23:

UCT/TCP: Fix review comments

# This is the commit message #24:

TEST/GTEST: Print parameter name for every running test

# This is the commit message #25:

TEST/GTEST: reduce execution time of stress sockaddr testing under valgrind

# This is the commit message #26:

UCP/WIREUP: remove ucp_ep_params from ucp_wireup_ep_t

And store all required info in ucp_ep_init_flags

# This is the commit message #27:

TEST/GTEST: Small enhancements

# This is the commit message #28:

TEST/UCT: Extend MM tests

# This is the commit message #29:

UCT/MM: Fix md_attr::cap.max_reg for xpmem

# This is the commit message #30:

GTEST/UCS/UCT: Enable sockaddr testing with an IPv6 address.

Don't use functions that support only IPv4.
ioctl is legacy and won't return IPv6.
https://stackoverflow.com/questions/20743709/get-ipv6-addresses-in-linux-using-ioctl
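
For illustration only: one common way to list IPv6 addresses without the legacy ioctl interface is getifaddrs(). The sketch below counts the IPv6 addresses on local interfaces; the helper name count_ipv6_addrs and its verbose flag are hypothetical, and whether the commit itself uses getifaddrs() is an assumption.

```c
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical helper, not part of UCX: count the IPv6 addresses assigned
 * to local interfaces, optionally printing each one.
 * Returns -1 if getifaddrs() fails. */
static int count_ipv6_addrs(int verbose)
{
    struct ifaddrs *ifa_list, *ifa;
    char buf[INET6_ADDRSTRLEN];
    int count = 0;

    if (getifaddrs(&ifa_list) != 0) {
        return -1;
    }

    for (ifa = ifa_list; ifa != NULL; ifa = ifa->ifa_next) {
        /* ifa_addr may be NULL for an interface without an address */
        if ((ifa->ifa_addr == NULL) ||
            (ifa->ifa_addr->sa_family != AF_INET6)) {
            continue;
        }

        if (verbose) {
            const struct sockaddr_in6 *sin6 =
                (const struct sockaddr_in6*)ifa->ifa_addr;
            if (inet_ntop(AF_INET6, &sin6->sin6_addr, buf,
                          sizeof(buf)) != NULL) {
                printf("%s: %s\n", ifa->ifa_name, buf);
            }
        }
        ++count;
    }

    freeifaddrs(ifa_list);
    return count;
}
```

Calling count_ipv6_addrs(1) from a test setup routine would print each interface name with its IPv6 address; a result of 0 means no IPv6 address is configured, in which case an IPv6 sockaddr test would presumably be skipped.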

# This is the commit message #31:

GTEST: sockaddr: fix testing for ipv6

# This is the commit message #32:

UCS/ASYNC/TEST: Fix sync handler remove

In order to support removing async handler while its callback is called,
we must make sure the call happens from the same thread. Otherwise, the
sync-remove operation does not really guarantee sync, and subsequent
file operations (such as accept() in sockcm) will fail because fd is
already closed.

Refactor polling on missed events to avoid deadlock - we can assume the
polled handlers come from the same async context we are polling on (because
of the way they are added to the miss queue).

Add a test to make sure the async handler is not called after it's
sync-removed.

# This is the commit message #33:

UCS/DEBUG: Do not use undefined signal names, and terminate the array.

On FreeBSD SIGSYS is 12, and SIGPIPE is 13.  On the other hand, there
is no SIGSTKFLT and SIGPWR.

Signed-off-by: Konstantin Belousov <konstantinb@mellanox.com>
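
As a rough illustration of the portability issue described in this commit message, the sketch below builds a signal-name table keyed by the signal macros rather than by hard-coded numbers, compiles out signals that a platform does not define, and ends with an explicit terminator entry. The names signal_entry_t, signal_table and signal_name are hypothetical; this is not the UCX implementation.

```c
#include <signal.h>
#include <stddef.h>

/* Hypothetical table, not the UCX source. Signal numbers differ between
 * systems (the commit notes SIGSYS is 12 and SIGPIPE is 13 on FreeBSD),
 * so each entry uses the macro, and optional signals are guarded. */
typedef struct {
    int        signo;
    const char *name;
} signal_entry_t;

static const signal_entry_t signal_table[] = {
    {SIGINT,  "SIGINT"},
    {SIGSEGV, "SIGSEGV"},
#ifdef SIGPIPE
    {SIGPIPE, "SIGPIPE"},
#endif
#ifdef SIGSYS
    {SIGSYS, "SIGSYS"},
#endif
#ifdef SIGSTKFLT
    {SIGSTKFLT, "SIGSTKFLT"},   /* not available on FreeBSD */
#endif
#ifdef SIGPWR
    {SIGPWR, "SIGPWR"},         /* not available on FreeBSD */
#endif
    {0, NULL}                   /* terminator: lookups stop here */
};

/* Return the name for signo, or "UNKNOWN" if it is not in the table. */
static const char *signal_name(int signo)
{
    const signal_entry_t *entry;

    for (entry = signal_table; entry->name != NULL; ++entry) {
        if (entry->signo == signo) {
            return entry->name;
        }
    }
    return "UNKNOWN";
}
```

The terminator entry lets the lookup loop stop without knowing the table length, which varies with the set of signals the platform defines.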

# This is the commit message #34:

UCP/WIREUP/TEST: Fix asymmetric endpoints connection with p2p transport

# This is the commit message #35:

UCP/ADDRESS: Code review fixes

# This is the commit message #36:

TEST/MPI: Remove accidentally added file

# This is the commit message #37:

UCS/BACKTRACE: re-implemented backtrace output

- added backtrace output to invalid AM ID handler
- optimization for ucs_mmap_free: now ucs_mmap_free gets
  original object size used to allocate

# This is the commit message #38:

UCM/CUDA: Skip ucm_cuda_set_ptr_attr when pointer is null

# This is the commit message #39:

AZP: edit existing release if exist

# This is the commit message #40:

TEST/ASYNC: Add retries to test_async.ctx_event_block

# This is the commit message #41:

GTEST/UCT/ROCM: Enable ROCm unit tests in the gtest framework

# This is the commit message #42:

UCM/CUDA: fix ucm_cudafree_dispatch_events assertion

Only call assert in ucm_cudafree_dispatch_events if cuMemGetAddressRange() call
succeeds and downgrade warning to debug message when it fails.

# This is the commit message #43:

ARCH/ARM: Barrier code update.

The change is driven by recent clarifications in linux kernel code:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f

Performance testing shows 8% uplift for short messages using InfiniBand.

Signed-off-by: Pavel Shamis (Pasha) <pasharesearch@gmail.com>

# This is the commit message #44:

UCT/CUDA_IPC: Fix peer_accessible check

# This is the commit message #45:

RC/IFACE: fixed potential leak on failed rc iface init

- added jump label for more accurate error handling

…me system and explicitly define the include files, and changes to enable the rocm perf modules

@mellanox-github
Contributor

Mellanox CI: FAILED on 5 of 25 workers

Note: the logs will be deleted after 19-Nov-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-test-node-gpu_W0 ❌ FAILURE
hpc-test-node-gpu_W1 ❌ FAILURE
hpc-test-node-gpu_W2 ❌ FAILURE
hpc-test-node-gpu_W3 ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@mellanox-github
Contributor

Mellanox CI: FAILED on 5 of 25 workers

Note: the logs will be deleted after 19-Nov-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-test-node-gpu_W0 ❌ FAILURE
hpc-test-node-gpu_W1 ❌ FAILURE
hpc-test-node-gpu_W2 ❌ FAILURE
hpc-test-node-gpu_W3 ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@mellanox-github
Contributor

Mellanox CI: FAILED on 5 of 25 workers

Note: the logs will be deleted after 19-Nov-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-test-node-gpu_W0 ❌ FAILURE
hpc-test-node-gpu_W1 ❌ FAILURE
hpc-test-node-gpu_W2 ❌ FAILURE
hpc-test-node-gpu_W3 ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@paklui
Contributor Author

paklui commented Nov 12, 2019

Closing this PR to fix the commit title.
