Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add failure notification for orchagent #1665

Merged
merged 9 commits into from
Mar 31, 2021
Merged

Conversation

shi-su
Copy link
Contributor

@shi-su shi-su commented Mar 8, 2021

What I did
Add SAI failure handling functions in aclorch, bufferorch, copporch, dtelorch, fdborch, fgnhgorch, intfsorch, mirrororch, natorch, policerorch, macsecorch, portsorch, qosorch, sfloworch, switchorch, tunneldecaporch, vrforch.

Why I did it
Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures.

How I verified it

Details if related

@lgtm-com
Copy link

lgtm-com bot commented Mar 8, 2021

This pull request fixes 2 alerts when merging 5de44a6 into 03a0e21 - view on LGTM.com

fixed alerts:

  • 2 for FIXME comment

@lgtm-com
Copy link

lgtm-com bot commented Mar 10, 2021

This pull request fixes 2 alerts when merging eaed258 into 9e30abb - view on LGTM.com

fixed alerts:

  • 2 for FIXME comment

@shi-su shi-su changed the base branch from master to bmtor March 12, 2021 18:54
@shi-su shi-su changed the base branch from bmtor to master March 12, 2021 18:54
@lgtm-com
Copy link

lgtm-com bot commented Mar 12, 2021

This pull request fixes 2 alerts when merging 625dfbb into 1951365 - view on LGTM.com

fixed alerts:

  • 2 for FIXME comment

@shi-su shi-su changed the title Add failure notification for multiple orch Add failure notification for orchagent Mar 15, 2021
@shi-su shi-su requested a review from qiluo-msft March 15, 2021 16:53
@@ -3596,6 +3596,7 @@ sai_status_t AclOrch::createDTelWatchListTables()
if (status != SAI_STATUS_SUCCESS)
{
SWSS_LOG_ERROR("Failed to create table %s", flowWLTable.description.c_str());
handleSaiCreateStatus(SAI_API_ACL, status);
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handleSaiCreateStatus [](start = 8, length = 21)

handleSaiCreateStatus [](start = 8, length = 21)

Check the return value? If successfully handled, you treat it as success? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few SAI calls that do not take any action in the case of SAI failures. For these ones, since our goal is to make it on-par with async mode at this stage, I am not checking the return value for now so that it has the same behavior as before.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's assume this function magically retry and fix the issue, so the return value is task_success. You should return SAI_STATUS_SUCCESS after this function.


In reply to: 595609272 [](ancestors = 595609272)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, updated the logic to check if the return value is task_success, and proceed when it is.

@@ -125,7 +125,7 @@ void CoppOrch::initDefaultHostIntfTable()
if (status != SAI_STATUS_SUCCESS)
{
SWSS_LOG_ERROR("Failed to create default host interface table, rv:%d", status);
throw "CoppOrch initialization failure";
handleSaiCreateStatus(SAI_API_HOSTIF, status);
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handleSaiCreateStatus [](start = 8, length = 21)

handleSaiCreateStatus [](start = 8, length = 21)

Do you want to keep old behavior even with your new handle function? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in the old behavior (async mode), this throw should not even be called since syncd should already trigger an swss restart before it reaches this throw. To mimic the old behavior close enough, I think a method could be running this handling function before this throw and keep this throw regardless of the return value of the handling function. I updated the code as such.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also need to check return value and conditional throw.


In reply to: 595610233 [](ancestors = 595610233)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the logic to check return value before throwing.

@@ -489,6 +489,7 @@ task_process_status CoppOrch::processCoppRule(Consumer& consumer)
if (sai_status != SAI_STATUS_SUCCESS)
{
SWSS_LOG_ERROR("Failed to apply attribute:%d to trap group:%" PRIx64 ", name:%s, error:%d\n", trap_gr_attr.id, m_trap_group_map[trap_group_name], trap_group_name.c_str(), sai_status);
handleSaiSetStatus(SAI_API_HOSTIF, sai_status);
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handleSaiSetStatus [](start = 20, length = 18)

handleSaiSetStatus [](start = 20, length = 18)

Check the return value? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few SAI calls that do not take any action in the case of SAI failures. For these ones, since our goal is to make it on-par with async mode at this stage, I am not checking the return value for now so that it has the same behavior as before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check for the return value of the handling function and return only when it is not task_success

}
else
{
it++;
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it++; [](start = 24, length = 5)

I think you change the behavior. Old code will always erase(it) no matter success or not. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the handling function always exits orchagent to achieve on-par handling with async mode. So this change will not be activated at the moment and the behavior should be the same as async mode. Once we have a better logic for the failure handling functions, we can fully control whether it should retry or not.

@@ -1097,7 +1098,7 @@ bool FdbOrch::addFdbEntry(const FdbEntry& entry, const string& port_name,
SWSS_LOG_ERROR("Failed to create %s FDB %s in %s on %s, rv:%d",
fdbData.type.c_str(), entry.mac.to_string().c_str(),
vlan.m_alias.c_str(), port_name.c_str(), status);
return false; //FIXME: it should be based on status. Some could be retried, some not
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIXME [](start = 28, length = 5)

Did you truly fixed the FIXME? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not, added the FIXME back.

@@ -1212,7 +1213,7 @@ bool FdbOrch::removeFdbEntry(const FdbEntry& entry, FdbOrigin origin)
{
SWSS_LOG_ERROR("FdbOrch RemoveFDBEntry: Failed to remove FDB entry. mac=%s, bv_id=0x%" PRIx64,
entry.mac.to_string().c_str(), entry.bv_id);
return true; //FIXME: it should be based on status. Some could be retried. some not
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIXME [](start = 23, length = 5)

The same. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the FIXME back.

@@ -1056,7 +1061,7 @@ bool IntfsOrch::addRouterIntfs(sai_object_id_t vrf_id, Port &port)
{
SWSS_LOG_ERROR("Failed to create router interface %s, rv:%d",
port.m_alias.c_str(), status);
throw runtime_error("Failed to create router interface.");
return handleSaiCreateStatus(SAI_API_ROUTER_INTERFACE, status);
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handleSaiCreateStatus [](start = 15, length = 21)

handleSaiCreateStatus [](start = 15, length = 21)

To exactly mimic old behavior, you can derive the handleSaiCreateStatus function in this class, and throw inside the function. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in the old behavior (async mode), this throw should not even be called since syncd should already trigger an swss restart before it reaches this throw. To mimic the old behavior close enough, I think a method could be running this handling function before this throw and keep this throw regardless of the return value of the handling function. I updated the code as such.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check for the return value of the handling function and throw exception only when it did not successfully handle the failure.

@shi-su shi-su requested a review from qiluo-msft March 24, 2021 17:25
shi-su added a commit to shi-su/sonic-swss that referenced this pull request Aug 17, 2021
What I did
Add SAI failure handling functions in aclorch, bufferorch, copporch, dtelorch, fdborch, fgnhgorch, intfsorch, mirrororch, natorch, policerorch, macsecorch, portsorch, qosorch, sfloworch, switchorch, tunneldecaporch, vrforch.

Why I did it
Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures.
shi-su added a commit to shi-su/sonic-swss that referenced this pull request Aug 17, 2021
What I did
Add SAI failure handling functions in aclorch, bufferorch, copporch, dtelorch, fdborch, fgnhgorch, intfsorch, mirrororch, natorch, policerorch, macsecorch, portsorch, qosorch, sfloworch, switchorch, tunneldecaporch, vrforch.

Why I did it
Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures.
shi-su added a commit that referenced this pull request Aug 18, 2021
What I did
Backport SAI failure handling related commits into the 202012 branch. The following is a list of backported commits:

941875a Deactivate mirror session only when session status is true in updateLagMember (#1666)
be12482 Ignore ALREADY_EXIST error in FDB creation (#1815)
c9c1aa2 Add failure handling for SAI get operations (#1768)
47b4276 [BufferOrch] Don't call SAI API for BUFFER_POOL/PROFILE handling in case the op is DEL and the SAI OID is NULL (#1786)
db9238f Add failure notification for orchagent (#1665)
fc8e43f [synchronous mode] Add failure notification for SAI failures in synchronous mode (#1596)

Why I did it
202012 image needs to include failure handling mechanism for enough notification in the presence of SAI failures.
raphaelt-nvidia pushed a commit to raphaelt-nvidia/sonic-swss that referenced this pull request Oct 5, 2021
What I did
Add SAI failure handling functions in aclorch, bufferorch, copporch, dtelorch, fdborch, fgnhgorch, intfsorch, mirrororch, natorch, policerorch, macsecorch, portsorch, qosorch, sfloworch, switchorch, tunneldecaporch, vrforch.

Why I did it
Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants