Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SWDEV-300279][MSRCHA-140] Batchnorm backward convergence fix #1116

Merged
merged 1 commit into from
Aug 27, 2021

Conversation

muralinr
Copy link
Contributor

This fix will address batchnorm backward convergence issue for batch sizes greater 768.

@codecov

This comment has been minimized.

@@ -865,7 +865,7 @@ void BatchNormBackward(Handle& handle,
// N*H*W < 32M and H*W > 1024, use batchnorm variant#1 implementation which parallelize
// work groups over channels and loop through NHW.
//*************************************************************************************************
if((in_nhw < (32 * 1024 * 1024) && in_cstride > 1024) || (n > 768))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why batch sizes greater 768 will have problem parallelize work groups over channels and loop through NHW? Just curious how this has fixed the converge issue. Thanks1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should always check HW (in_cstride) which is missing in this case. We were using batchnorm variant#1 for batch sizes > 768 which is having issues for these batch sizes when HW is small. I have corrected it to use correct variant. This bug might have been introduced inadvertently. I fixed it.

@junliume junliume changed the title Batchnorm backward convergence fix [SWDEV-300279] Batchnorm backward convergence fix Aug 26, 2021
@junliume
Copy link
Collaborator

Ping @ce1adon for review. One line change only. Thanks!

@junliume junliume changed the title [SWDEV-300279] Batchnorm backward convergence fix [SWDEV-300279][MSRCHA-140] Batchnorm backward convergence fix Aug 27, 2021
Copy link
Collaborator

@junliume junliume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI has passed

@junliume junliume merged commit 25b5736 into develop Aug 27, 2021
@junliume junliume deleted the batchnorm_bwd_spatial_converge branch August 27, 2021 01:09
Copy link
Contributor

@atamazov atamazov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add a regression test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants