Fix port up/bfd sessions bringup notification delay issue. #3269
base: master
Conversation
@liuh-80 I built an image with this change and tested it. For the first boot after installation on a single linecard, all ports come up in 8 minutes and all 34k routes are also installed. For subsequent reboots of a single linecard, it takes about 7 minutes for all links to come up and the 34k routes to be installed. It seems this change addresses the issue. We need to do more testing to verify that, including OC testing.
@bocon13, we found a performance issue caused by your PR, can you review this fix? Performance issue: sonic-net/sonic-buildimage#19569
@liuh-80, we need some UTs to prevent such a regression.
Thanks @liuh-80! Just curious, what is the difference between the Consumer popping notifications once vs. popping until the entry size is 0?
Will add a sonic-swss test case to prevent this issue from happening again.
Popping until empty makes the Consumer pop all notifications belonging to the current consumer, so higher-priority notifications will be blocked.
/AzurePipelines run Azure.sonic-swss
Azure Pipelines successfully started running 1 pipeline(s).
@liuh-80, we ran OC with this fix and noticed orchagent crashes on all test beds. I am analyzing the syslog files and will update my findings soon.
The following tests in the pc suite seem to be triggering the crashes. orchagent tries to remove the neighbor and nexthop; either the meta layer or SAI is not in sync with orchagent and returns an error, which causes orchagent to exit.
@mint570 for viz
With this PR, each Consumer will pop at most 128 notifications each time, which means orchagent will check and handle high-priority notifications more frequently.
Does this mean the bulk APIs (SAI) that orchagent invokes for routes etc. will now have a limit of 128 entries?
Yes, it will have a 128-entry limit.
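The capped pop described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual orchagent/swss-common API: the function name `popBatch` and the `std::deque<std::string>` queue type are invented for the example; only the `CONSUMER_POP_MAX_BATCH_COUNT` constant name comes from the PR diff.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Illustrative cap, matching the 128-entry limit discussed in the thread.
constexpr size_t CONSUMER_POP_MAX_BATCH_COUNT = 128;

// Move at most CONSUMER_POP_MAX_BATCH_COUNT entries from `pending` to `out`,
// returning how many were taken. Returning to the caller between batches is
// what lets the select loop service higher-priority consumers in between.
size_t popBatch(std::deque<std::string> &pending, std::deque<std::string> &out)
{
    size_t taken = 0;
    while (!pending.empty() && taken < CONSUMER_POP_MAX_BATCH_COUNT)
    {
        out.push_back(pending.front());
        pending.pop_front();
        ++taken;
    }
    return taken;
}
```

With 300 pending entries, this yields batches of 128, 128, and 44 across three calls instead of one unbounded drain.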
```cpp
    std::deque<KeyOpFieldsValuesTuple> entries;
    table->pops(entries);
    update_size = addToSync(entries);
} while (update_size != 0);
```
Multiple test cases failed because they did not get the expected data within 20 seconds.
Possible reasons:
- Some types of notifications depend on each other, so all entries need to be popped in the same loop.
- This PR makes orchagent process some data more slowly than before.
Will try a different solution that only limits route processing, in another POC PR:
[POC] Improve routeorch to stop process routes when high priority notification coming. #3278
I think it's a timing issue. For example, the validation of this PR had many test case failures, but after I increased the wait_for_n_keys timeout, many test cases passed. However, this change does impact performance: after this change every doTask() call can only handle 128 entries, so some scenarios take longer. I'm trying to change only RouteOrch to improve performance.
orchagent/orch.cpp
Outdated
```diff
@@ -789,7 +787,7 @@ void Orch::addConsumer(DBConnector *db, string tableName, int pri)
     }
     else
     {
-        addExecutor(new Consumer(new ConsumerStateTable(db, tableName, gBatchSize, pri), this, tableName));
+        addExecutor(new Consumer(new ConsumerStateTable(db, tableName, gBatchSize * CONSUMER_POP_MAX_BATCH_COUNT, pri), this, tableName));
```
This change will cause an issue:
sonic-net/sonic-buildimage@286ec3e
A background-running Lua script may make redis-server quite busy if the batch size is 8192.
If the handling time exceeds the default 5s, redis-server will not respond to other processes, which will cause syncd to crash.
That's also part of the reason why a loop was added in execute().
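As a back-of-the-envelope illustration of the constraint above (all numbers and the per-key cost are assumptions for the sketch, not measurements): a pops() Lua script runs atomically inside redis-server, so one call over a huge batch can exceed the busy-script window, while a loop of small batches keeps each call short and lets redis-server answer other clients in between.

```cpp
#include <cassert>
#include <cstddef>

// Redis refuses other commands while a Lua script runs longer than
// lua-time-limit (default 5000 ms); this toy model just multiplies an
// assumed per-key cost by the batch size to compare the two strategies.
constexpr double kBusyThresholdMs = 5000.0;

// Hypothetical cost of one pops() call over `keys` entries.
double popCallCostMs(std::size_t keys, double costPerKeyMs)
{
    return static_cast<double>(keys) * costPerKeyMs;
}
```

Under an assumed 1 ms/key cost, one 8192-key call blows past the 5s budget while each 128-key call stays far under it, which is the trade-off the comment describes.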
```cpp
auto table = static_cast<swss::ConsumerTableBase *>(getSelectable());
int batch_count = 0;
size_t update_size = 0;
```
Will move this loop to the swss-common ConsumerStateTable::pops() method.
Fix port up/bfd sessions bringup notification delay issue.
Why I did it
Fix following issue:
sonic-net/sonic-buildimage#19569
Work item tracking
How I did it
Revert the change in Consumer::execute() that was introduced by this commit:
9258978#diff-96451cb89f907afccbd39ddadb6d30aa21fe6fbd01b1cbaf6362078b926f1f08
That commit added a while loop:

```cpp
do
{
    std::deque<KeyOpFieldsValuesTuple> entries;
    table->pops(entries);
    update_size = addToSync(entries);
} while (update_size != 0);
```
The addToSync method returns the number of entries, so this loop keeps popping until the table is empty.
This means that if there is a massive number of route notifications, other high-priority notifications, for example port-up notifications, will be blocked until all route notifications have been handled.
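The blocking effect can be shown with a purely hypothetical simulation (this is not orchagent code; `simulate`, `routeQ`, and `portQ` are invented names). The route consumer's fd fires first, so it is serviced first; after each execute() returns, the select loop prefers the higher-priority port consumer. `cap == 0` models the old unbounded drain loop, a positive `cap` models the batched fix.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Returns the order in which notifications get processed.
std::vector<std::string> simulate(std::size_t routeCount, std::size_t cap)
{
    std::deque<std::string> routeQ(routeCount, "route");
    std::deque<std::string> portQ(1, "port-up");
    std::vector<std::string> order;
    bool routeFirst = true;  // route consumer was selected first
    while (!routeQ.empty() || !portQ.empty())
    {
        // After the first turn, re-select and prefer the port consumer.
        std::deque<std::string> &q =
            (routeFirst && !routeQ.empty()) ? routeQ
            : !portQ.empty()                ? portQ
                                            : routeQ;
        std::size_t n = 0;
        while (!q.empty() && (cap == 0 || n < cap))
        {
            order.push_back(q.front());
            q.pop_front();
            ++n;
        }
        routeFirst = false;
    }
    return order;
}
```

With 300 pending routes, the unbounded drain processes the port-up event only after all 300 routes; with a 128-entry cap it is handled after a single batch.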
How to verify it
Pass all UTs.
Manually verified the issue is fixed.
Description for the changelog
Fix port up/bfd sessions bringup notification delay issue.