UCT: fix hang when using polling fd #1492 #1561

Closed
wants to merge 3 commits into from

evgeny-leksikov
Contributor

Fixes #1492

@mellanox-github
Contributor

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/1787/ for details.

@mellanox-github
Contributor

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/1788/ for details.

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3724/ for details (Mellanox internal link).

@mellanox-github
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3725/ for details (Mellanox internal link).

@evgeny-leksikov
Contributor Author

@yosefe @brminich pls review

@yosefe
Contributor

yosefe commented Jun 2, 2017

@evgeny-leksikov why do we need to re-arm inside progress if the user has to explicitly call arm after progress anyway?
E.g. the usage flow is:

  1. progress
  2. arm
  3. epoll_wait/...

so how does adding "arm" after (1) help, given that we already have arm in (2)?

@evgeny-leksikov
Contributor Author

@yosefe because arm did not re-arm if there were any events. Please take a look now.

@yosefe
Contributor

yosefe commented Jun 4, 2017

@evgeny-leksikov the user should call arm again whenever it returns BUSY, until it returns UCS_OK, which means the CQ is really re-armed.
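
The retry rule above can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyWorker`, `post`, `prepare_to_wait`, and the `OK`/`BUSY` constants are hypothetical names invented here, not UCX APIs, and the model assumes that arm refuses to arm while completions are still queued.

```python
from collections import deque

# Hypothetical stand-ins for the status values named in the thread.
OK, BUSY = "UCS_OK", "BUSY"


class ToyWorker:
    """Hypothetical model of a transport with a completion queue and a polling fd."""

    def __init__(self):
        self.cq = deque()      # completions not yet polled
        self.armed = False     # True once arm() has succeeded
        self.signaled = False  # stands in for the polling fd becoming readable

    def post(self, event):
        """A completion arrives; it signals the fd only if the worker is armed."""
        self.cq.append(event)
        if self.armed:
            self.armed = False
            self.signaled = True

    def progress(self):
        """Poll at most one completion and return how many were handled."""
        if self.cq:
            self.cq.popleft()
            return 1
        return 0

    def arm(self):
        """Assumption: arming is refused while completions are pending."""
        if self.cq:
            return BUSY
        self.armed = True
        return OK


def prepare_to_wait(worker):
    """Alternate progress and arm until arm reports OK; return events handled."""
    handled = 0
    while True:
        handled += worker.progress()
        if worker.arm() == OK:
            return handled
```

In this model the caller would block on the fd (for example with epoll_wait) only after `prepare_to_wait` returns, since that is the point at which nothing is queued and the worker is armed.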

@yosefe
Contributor

yosefe commented Jun 4, 2017

@MattBBaker can you please post ORNL failure?

@evgeny-leksikov
Contributor Author

@yosefe yes, but some events can arrive between the BUSY result and the next re-arm without being polled from the CQ by progress, and those events are missed. So we have to keep the transport armed all the time.

@mellanox-github
Contributor

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/1794/ for details.

@yosefe
Contributor

yosefe commented Jun 4, 2017

@evgeny-leksikov so you mean that if there are unpolled CQEs, and then arm is done, no event is generated?

@evgeny-leksikov
Contributor Author

yes
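
The behaviour agreed on in this exchange (arming with unpolled CQEs generates no event) can be illustrated with a toy edge-triggered queue. Everything here is a hypothetical sketch: `ToyCQ`, `arm_unconditionally`, and `would_hang` are invented names, and the model only assumes that a wakeup fires for a completion that arrives while the queue is armed.

```python
class ToyCQ:
    """Hypothetical edge-triggered completion queue: only new arrivals signal."""

    def __init__(self):
        self.pending = 0       # completions not yet polled
        self.armed = False     # True while a notification is requested
        self.signaled = False  # stands in for the polling fd becoming readable

    def complete(self):
        """A completion arrives; it raises a wakeup only on an armed queue."""
        self.pending += 1
        if self.armed:
            self.armed = False
            self.signaled = True

    def arm_unconditionally(self):
        """Arm without checking the queue; existing completions trigger nothing."""
        self.armed = True

    def poll(self):
        """Drain the queue and return how many completions were taken."""
        n, self.pending = self.pending, 0
        return n


def would_hang(cq):
    """True if a waiter blocking now would sleep with work still queued."""
    return cq.pending > 0 and not cq.signaled
```

Under these assumptions, a completion that lands just before the arm is invisible to the waiter until something polls again, which is why an arm call that reports BUSY, or a poll after arming, is needed before blocking.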

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3733/ for details (Mellanox internal link).

@evgeny-leksikov
Contributor Author

bot:mlx:retest

@yosefe yosefe added the Bugfix label Jun 4, 2017
@yosefe
Contributor

yosefe commented Jun 4, 2017

@evgeny-leksikov what if the event arrives just before arming is done, while the CQ is not armed yet? For example, the very first time?

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3735/ for details (Mellanox internal link).

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3738/ for details (Mellanox internal link).

@MattBBaker
Contributor

@yosefe Looks like a build environment failure.

@MattBBaker
Contributor

bot:ornl:retest

@shamisp
Contributor

shamisp commented Jun 6, 2017

probably should go to v1.2 as well

@yosefe yosefe added the WIP-DNM Work in progress / Do not review label Jun 6, 2017
@evgeny-leksikov
Contributor Author

bot:mlx:retest

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3788/ for details (Mellanox internal link).

@yosefe
Contributor

yosefe commented Jun 19, 2017

the reported issue is not a bug

@yosefe yosefe closed this Jun 19, 2017
@evgeny-leksikov evgeny-leksikov deleted the hang_poll_fd branch June 25, 2017 05:16
Labels
Bugfix WIP-DNM Work in progress / Do not review
5 participants