GTEST: fixes to the uct sockaddr tests #4520

Merged 1 commit on Dec 4, 2019

@alinask alinask commented Nov 28, 2019

No description provided.

@alinask
Contributor Author

alinask commented Nov 28, 2019

fixes for #4331

@mellanox-github
Contributor

Mellanox CI: FAILED on 2 of 25 workers (click for details)

Note: the logs will be deleted after 05-Dec-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-test-node-legacy_W2 ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@alinask
Contributor Author

alinask commented Nov 28, 2019

the failures are #4512 and OOM

bot:pipe:retest
bot:mlx:retest

@yosefe
Contributor

yosefe commented Nov 30, 2019

bot:retest

@yosefe
Contributor

yosefe commented Nov 30, 2019

are all the counters incremented only by one thread?

@mellanox-github
Contributor

Mellanox CI: FAILED on 4 of 25 workers (click for details)

Note: the logs will be deleted after 07-Dec-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-test-node-legacy_W0 ❌ FAILURE
hpc-test-node-legacy_W2 ❌ FAILURE
hpc-test-node-legacy_W3 ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@alinask
Contributor Author

alinask commented Dec 1, 2019

> are all the counters incremented only by one thread?

In rdmacm, all counters except err_count (in the iface tests) are incremented by the async thread.
err_count is incremented in the main thread, where error handling is done.

@@ -428,6 +433,7 @@ class test_uct_cm_sockaddr : public uct_test {

self->m_cm_state |= TEST_CM_STATE_CONNECT_REQUESTED;
self->m_server_recv_req_cnt++;
ucs_memory_cpu_store_fence();

queue push should be before counter increment
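
To illustrate the ordering requested here, a minimal sketch in standard C++ follows. It assumes `std::atomic_thread_fence(std::memory_order_release)` as a portable stand-in for `ucs_memory_cpu_store_fence()`; the `conn_req_box` type and its members are hypothetical and are not taken from the test. The request is pushed first and the counter is incremented last, so a thread that sees the new counter value can rely on the queue entry being there.

```cpp
#include <atomic>
#include <queue>

// Hypothetical stand-in for the test's request queue plus its counter.
struct conn_req_box {
    std::queue<int>  delayed_reqs;     // payload, written by the callback thread
    std::atomic<int> recv_req_cnt{0};  // counter polled by the waiting thread

    // Producer side: publish the payload before bumping the counter.
    void on_conn_request(int req) {
        delayed_reqs.push(req);                                // 1. queue push
        std::atomic_thread_fence(std::memory_order_release);   // 2. store fence
        recv_req_cnt.fetch_add(1, std::memory_order_relaxed);  // 3. counter increment
    }
};
```

If the increment came first, a waiting thread could observe the counter change and read the queue before the push became visible.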

self->server_recv_req++;
}

static ucs_status_t err_handler(void *arg, uct_ep_h ep, ucs_status_t status)
{
test_uct_sockaddr *self = reinterpret_cast<test_uct_sockaddr*>(arg);

ucs_memory_cpu_store_fence();

remove, make err_count atomic instead
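
A minimal sketch of what an atomic error counter could look like in standard C++. Using `std::atomic` is an assumption; the actual change may rely on a different atomic primitive. `sockaddr_test_state` and `err_handler_sketch` are hypothetical names, and the UCX-specific callback parameters are left out.

```cpp
#include <atomic>

// Hypothetical stand-in for the test fixture; only the counter is modeled.
struct sockaddr_test_state {
    std::atomic<unsigned> err_count{0};
};

// Mirrors the shape of a C-style callback that receives the fixture as `arg`.
void err_handler_sketch(void* arg) {
    sockaddr_test_state* self = static_cast<sockaddr_test_state*>(arg);
    // Atomic read-modify-write: the increment is race-free from any thread,
    // so no separate store fence is placed around it.
    self->err_count.fetch_add(1);
}
```

The test body would then read the counter with `err_count.load()` instead of relying on a fence placed before a plain increment.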

@@ -445,6 +451,7 @@ class test_uct_cm_sockaddr : public uct_test {
server_connect_cb(uct_ep_h ep, void *arg, ucs_status_t status) {
test_uct_cm_sockaddr *self = reinterpret_cast<test_uct_cm_sockaddr *>(arg);
self->m_cm_state |= TEST_CM_STATE_SERVER_CONNECTED;
ucs_memory_cpu_store_fence();

probably could be removed

@@ -456,6 +463,7 @@ class test_uct_cm_sockaddr : public uct_test {
self->m_server->disconnect(ep);
}
self->m_cm_state |= TEST_CM_STATE_SERVER_DISCONNECTED;
ucs_memory_cpu_store_fence();

removed

@@ -477,6 +485,7 @@ class test_uct_cm_sockaddr : public uct_test {
EXPECT_EQ(entity::server_priv_data,
std::string(static_cast<const char *>(remote_data->conn_priv_data)));
self->m_cm_state |= TEST_CM_STATE_CLIENT_CONNECTED;
ucs_memory_cpu_store_fence();

removed

@@ -495,6 +504,7 @@ class test_uct_cm_sockaddr : public uct_test {
}

self->m_cm_state |= TEST_CM_STATE_CLIENT_DISCONNECTED;
ucs_memory_cpu_store_fence();

remove

ucs::test_time_multiplier();

ucs_memory_cpu_load_fence();
while ((m_delayed_conn_reqs.size() == 0) && (ucs_get_time() < deadline)) {

while should wait for counter (conn requests) and then load_fence() and access the queue
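
The wait pattern described in this comment can be sketched in standard C++ as follows. This is an illustration under stated assumptions, not the test's code: `std::atomic_thread_fence` stands in for the UCS load and store fences, `std::chrono::steady_clock` stands in for `ucs_get_time()`, and `delayed_reqs_state`, `wait_for_request` and `run_handoff` are hypothetical names. The sketch covers a single producer that publishes a single request.

```cpp
#include <atomic>
#include <chrono>
#include <queue>
#include <thread>

// Hypothetical stand-in for the fixture state shared between threads.
struct delayed_reqs_state {
    std::queue<int>  delayed_reqs;     // written only by the producer thread
    std::atomic<int> recv_req_cnt{0};  // incremented after the push
};

// Consumer (plays the role of the main thread): poll the counter until the
// deadline, then issue a load fence, and only then touch the queue.
// Returns true and stores the first request in `out` on success.
bool wait_for_request(delayed_reqs_state& st,
                      std::chrono::milliseconds timeout, int& out) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (st.recv_req_cnt.load(std::memory_order_relaxed) == 0) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // timed out; the queue is never read
        }
        std::this_thread::yield();
    }
    std::atomic_thread_fence(std::memory_order_acquire);  // load fence
    out = st.delayed_reqs.front();
    return true;
}

// Runs one producer thread against the waiting consumer and returns the
// request value observed by the consumer, or -1 on timeout.
int run_handoff(int req) {
    delayed_reqs_state st;
    std::thread producer([&st, req] {
        st.delayed_reqs.push(req);                               // publish payload
        std::atomic_thread_fence(std::memory_order_release);     // store fence
        st.recv_req_cnt.fetch_add(1, std::memory_order_relaxed); // then signal
    });
    int seen = -1;
    const bool ok = wait_for_request(st, std::chrono::seconds(5), seen);
    producer.join();
    return ok ? seen : -1;
}
```

Polling the counter rather than the queue size keeps the waiting thread away from the non-atomic container until the producer has signaled that the push is complete.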

@yosefe
Contributor

yosefe commented Dec 1, 2019

bot:retest

@mellanox-github
Contributor

Mellanox CI: PASSED on 25 workers (click for details)

Note: the logs will be deleted after 08-Dec-2019

Agent/Stage Status
_main ✔️ SUCCESS
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@mellanox-github
Contributor

Mellanox CI: FAILED on 4 of 25 workers (click for details)

Note: the logs will be deleted after 08-Dec-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-test-node-gpu_W1 ❌ FAILURE
hpc-test-node-legacy_W1 ❌ FAILURE
hpc-test-node-new_W1 ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@amaslenn
Contributor

amaslenn commented Dec 2, 2019

From AZP testing (Tests althca on worker 1):

  CXX      uct/ib/gtest-test_sockaddr.o
In file included from /opt/azure/agent-03/AZP_WORKSPACE/1/s/contrib/../test/gtest/common/test_helpers.h:11,
                 from /opt/azure/agent-03/AZP_WORKSPACE/1/s/contrib/../test/gtest/common/test.h:10,
                 from /opt/azure/agent-03/AZP_WORKSPACE/1/s/contrib/../test/gtest/uct/ib/test_sockaddr.cc:7:
/opt/azure/agent-03/AZP_WORKSPACE/1/s/contrib/../test/gtest/common/gtest.h: In instantiation of ‘testing::AssertionResult testing::internal::CmpHelperEQ(const char*, const char*, const T1&, const T2&) [with T1 = int; T2 = volatile unsigned int]’:
/opt/azure/agent-03/AZP_WORKSPACE/1/s/contrib/../test/gtest/common/gtest.h:18938:23:   required from ‘static testing::AssertionResult testing::internal::EqHelper<true>::Compare(const char*, const char*, const T1&, const T2&, typename testing::internal::EnableIf<(! testing::internal::is_pointer<T2>::value)>::type*) [with T1 = int; T2 = volatile unsigned int; typename testing::internal::EnableIf<(! testing::internal::is_pointer<T2>::value)>::type = void]’
/opt/azure/agent-03/AZP_WORKSPACE/1/s/contrib/../test/gtest/uct/ib/test_sockaddr.cc:170:5:   required from here
/opt/azure/agent-03/AZP_WORKSPACE/1/s/contrib/../test/gtest/common/gtest.h:18862:16: error: comparison of integer expressions of different signedness: ‘const int’ and ‘const volatile unsigned int’ [-Werror=sign-compare]
   if (expected == actual) {
       ~~~~~~~~~^~~~~~~~~
/opt/azure/agent-03/AZP_WORKSPACE/1/s/contrib/../test/gtest/common/gtest.h: In instantiation of ‘testing::AssertionResult testing::internal::CmpHelperEQ(const char*, const char*, const T1&, const T2&) [with T1 = long unsigned int; T2 = volatile int]’:
/opt/azure/agent-03/AZP_WORKSPACE/1/s/contrib/../test/gtest/common/gtest.h:18898:23:   required from ‘static testing::AssertionResult testing::internal::EqHelper<lhs_is_null_literal>::Compare(const char*, const char*, const T1&, const T2&) [with T1 = long unsigned int; T2 = volatile int; bool lhs_is_null_literal = false]’
/opt/azure/agent-03/AZP_WORKSPACE/1/s/contrib/../test/gtest/uct/ib/test_sockaddr.cc:555:9:   required from here
/opt/azure/agent-03/AZP_WORKSPACE/1/s/contrib/../test/gtest/common/gtest.h:18862:16: error: comparison of integer expressions of different signedness: ‘const long unsigned int’ and ‘const volatile int’ [-Werror=sign-compare]
cc1plus: all warnings being treated as errors
make[3]: *** [uct/ib/gtest-test_sockaddr.o] Error 1
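
Both errors come from comparing values of different signedness inside gtest's comparison template. A minimal sketch of the pattern and one possible way to avoid it follows; `equal_like_gtest` is a hypothetical stand-in for `CmpHelperEQ`, the variable names are illustrative, and the fix actually applied in this PR may differ.

```cpp
#include <cstddef>
#include <queue>

// Hypothetical stand-in for gtest's CmpHelperEQ: the comparison happens
// inside a template, so each argument keeps its own integer type.
template <typename T1, typename T2>
bool equal_like_gtest(const T1& expected, const T2& actual) {
    return expected == actual;
}

// Illustrative counters with the types named in the compiler output.
volatile unsigned int err_count    = 0;
volatile int          recv_req_cnt = 0;

bool counters_match(const std::queue<int>& reqs) {
    // equal_like_gtest(0, err_count) would compare int with unsigned int
    // and trip -Werror=sign-compare; an unsigned literal keeps the signs equal.
    const bool no_errors = equal_like_gtest(0u, err_count);

    // reqs.size() is unsigned (size_t) while recv_req_cnt is int;
    // an explicit cast makes the conversion visible and matches the types.
    const bool same_count =
        equal_like_gtest(reqs.size(), static_cast<std::size_t>(recv_req_cnt));

    return no_errors && same_count;
}
```

Declaring the counter with the same signedness as the value it is compared against would avoid the cast altogether.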

@mellanox-github
Contributor

Mellanox CI: PASSED on 25 workers (click for details)

Note: the logs will be deleted after 09-Dec-2019

Agent/Stage Status
_main ✔️ SUCCESS
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@alinask
Contributor Author

alinask commented Dec 3, 2019

@yosefe is this good to go?

Contributor

@yosefe yosefe left a comment

squash

+ sockaddr over cm - fix the delayed server response test -
  wait for the server's recv_req_cnt increment.
  checking the queue's size right after connect() returns isn't
  correct since connect() may return before the request was added to the
  queue.
+ add fences in the tests to make it work correctly on hosts with a weak
  memory model.

@alinask alinask force-pushed the topic/gtest-sockaddr-add-fence branch from c68c5a3 to 337b26c on December 3, 2019 12:12

@mellanox-github
Contributor

Mellanox CI: PASSED on 25 workers (click for details)

Note: the logs will be deleted after 10-Dec-2019

Agent/Stage Status
_main ✔️ SUCCESS
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@yosefe
Contributor

yosefe commented Dec 4, 2019

bot:pipe:retest
