
Fix port up/bfd sessions bringup notification delay issue. #3269

Open
wants to merge 24 commits into base: master
Conversation

liuh-80
Contributor

@liuh-80 liuh-80 commented Aug 28, 2024

Fix port up/bfd sessions bringup notification delay issue.

Why I did it

Fixes the following issue:
sonic-net/sonic-buildimage#19569

Work item tracking
  • Microsoft ADO: 29192284

How I did it

Revert the change in Consumer::execute() that was introduced by this commit:
9258978#diff-96451cb89f907afccbd39ddadb6d30aa21fe6fbd01b1cbaf6362078b926f1f08

The change in that commit added a do/while loop:

```cpp
do
{
    std::deque<KeyOpFieldsValuesTuple> entries;
    table->pops(entries);
    update_size = addToSync(entries);
} while (update_size != 0);
```

The addToSync() method returns the number of entries it processed, so this loop only exits once the table is drained.
This means that when there is a flood of route notifications, higher-priority notifications (for example, a port up notification) are blocked until all route notifications have been handled.
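To illustrate the reverted behavior, here is a minimal sketch (SketchConsumer is a hypothetical stand-in, not the real sonic-swss Consumer): each execute() call pops at most one bounded batch and then returns, so control goes back to the orchagent select loop, which can service a higher-priority consumer before the next route batch.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical stand-in for a swss Consumer, for illustration only.
struct SketchConsumer
{
    std::deque<std::string> pending;  // notifications waiting in the table

    // Reverted behavior: pop ONE bounded batch per execute() call and
    // return, instead of looping until the table is drained. Remaining
    // entries stay queued; the select loop decides who runs next.
    size_t execute(size_t batchSize)
    {
        size_t popped = 0;
        while (popped < batchSize && !pending.empty())
        {
            pending.pop_front();
            ++popped;
        }
        return popped;  // number of entries handled this round
    }
};
```

With a backlog of 1000 route notifications and a batch size of 128, one call handles 128 entries and leaves 872 queued, instead of draining all 1000 before yielding.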

How to verify it

Pass all UT.
Manually verify issue fixed.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

  • SONiC.master-20030.629638-f370e2fa8

Description for the changelog

Fix port up/bfd sessions bringup notification delay issue.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@mlok-nokia

@liuh-80 I built an image with this change and tested it. For the first boot-up after installation on a single linecard, all ports come up in 8 minutes and all 34k routes are installed. For subsequent reboots of a single linecard, it takes about 7 minutes for all links to come up and the 34k routes to be installed. It seems this change addresses the issue. We need to do more testing to verify that, including the OC testing.

@liuh-80
Contributor Author

liuh-80 commented Aug 29, 2024

@bocon13, we found a performance issue caused by your PR; can you review this fix?

Performance issue: sonic-net/sonic-buildimage#19569
PR that caused the performance issue: #1992

@liuh-80 liuh-80 changed the title [POC] verify route performance issue Fix port up/bfd sessions bringup notification delay issue. Aug 30, 2024
@liuh-80 liuh-80 marked this pull request as ready for review August 30, 2024 06:20
@liuh-80 liuh-80 requested a review from prsunny as a code owner August 30, 2024 06:20
@lguohan
Contributor

lguohan commented Aug 30, 2024

@liuh-80, we need some UT to prevent such a regression.

@wenyiz2021

thanks @liuh-80! Just curious: what is the difference between the Consumer popping notifications once vs. popping until the number of entries is 0?

qiluo-msft
qiluo-msft previously approved these changes Aug 31, 2024
@liuh-80
Contributor Author

liuh-80 commented Sep 2, 2024

Will add a sonic-swss test case to prevent this issue from happening again.

@liuh-80
Contributor Author

liuh-80 commented Sep 2, 2024

> thanks @liuh-80! Just curious: what is the difference between the Consumer popping notifications once vs. popping until the number of entries is 0?

This will make the Consumer pop all notifications belonging to the current consumer, so higher-priority notifications will be blocked.

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-swss


Azure Pipelines successfully started running 1 pipeline(s).

@siqbal1986 siqbal1986 self-requested a review September 3, 2024 03:28
@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-swss


Azure Pipelines successfully started running 1 pipeline(s).

@saksarav-nokia
Contributor

@liuh-80, we ran OC with this fix and noticed orchagent crashes in all test beds. Analyzing the syslog files now; I will update with my findings soon.

@saksarav-nokia
Contributor

The following tests in the pc suite seem to trigger the crashes: orchagent tries to remove the neighbor and nexthop; either the meta layer or SAI is not in sync with orchagent and returns an error, which causes orchagent to exit.
test_po_update_io_no_loss
test_po_update

@prsunny
Collaborator

prsunny commented Sep 3, 2024

@mint570 for viz

@liuh-80
Contributor Author

liuh-80 commented Sep 3, 2024

> > > thanks @liuh-80! Just curious: what is the difference between the Consumer popping notifications once vs. popping until the number of entries is 0?
> >
> > This will make the Consumer pop all notifications belonging to the current consumer, so higher-priority notifications will be blocked.
>
> Thanks! Then does the current design in this PR ensure the priority tasks are notified, or does it only notify the first-come notification?

With this PR, each Consumer pops at most 128 notifications at a time, which means orchagent will check and handle high-priority notifications more frequently.

@prsunny
Collaborator

prsunny commented Sep 4, 2024

> > > > thanks @liuh-80! Just curious: what is the difference between the Consumer popping notifications once vs. popping until the number of entries is 0?
> > >
> > > This will make the Consumer pop all notifications belonging to the current consumer, so higher-priority notifications will be blocked.
> >
> > Thanks! Then does the current design in this PR ensure the priority tasks are notified, or does it only notify the first-come notification?
>
> With this PR, each Consumer pops at most 128 notifications at a time, which means orchagent will check and handle high-priority notifications more frequently.

Does this mean the SAI bulk APIs that orchagent invokes for routes etc. will now have a limit of 128 entries?

@liuh-80
Contributor Author

liuh-80 commented Sep 4, 2024

> > > > > thanks @liuh-80! Just curious: what is the difference between the Consumer popping notifications once vs. popping until the number of entries is 0?
> > > >
> > > > This will make the Consumer pop all notifications belonging to the current consumer, so higher-priority notifications will be blocked.
> > >
> > > Thanks! Then does the current design in this PR ensure the priority tasks are notified, or does it only notify the first-come notification?
> >
> > With this PR, each Consumer pops at most 128 notifications at a time, which means orchagent will check and handle high-priority notifications more frequently.
>
> Does this mean the SAI bulk APIs that orchagent invokes for routes etc. will now have a limit of 128 entries?

Yes, it will have a 128-entry limit.
This is an orchagent design issue.

```cpp
    std::deque<KeyOpFieldsValuesTuple> entries;
    table->pops(entries);
    update_size = addToSync(entries);
} while (update_size != 0);
```
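As a back-of-the-envelope illustration of the 128-entry limit discussed above (bulkCallsNeeded is a hypothetical helper, not part of orchagent): when each doTask() batch carries at most 128 entries, a flood of N routes is spread across ceil(N / 128) rounds, each eligible for one bounded bulk operation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: with per-batch size capped at batchLimit entries,
// how many rounds does it take to push routeCount routes through?
size_t bulkCallsNeeded(size_t routeCount, size_t batchLimit = 128)
{
    // Ceiling division: e.g. 1000 routes at 128 per batch -> 8 rounds.
    return (routeCount + batchLimit - 1) / batchLimit;
}
```

The trade-off discussed in this thread follows directly: smaller batches mean more rounds (and more chances to service high-priority notifications in between), while larger batches mean fewer, longer rounds.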
Contributor Author

@liuh-80 liuh-80 Sep 4, 2024

Multiple test cases failed because they did not get the expected data within 20 seconds.
Possible reasons:

  1. Some types of notifications depend on each other, so all entries need to be popped in the same loop.
  2. This PR makes orchagent process some data more slowly than before.
     Will try a different solution that only limits route processing, in another POC PR:
     [POC] Improve routeorch to stop processing routes when a high-priority notification arrives. #3278

@liuh-80
Contributor Author

liuh-80 commented Sep 4, 2024

> The following tests in the pc suite seem to trigger the crashes. orchagent tries to remove the neighbor and nexthop; either the meta layer or SAI is not in sync with orchagent and returns an error, which causes orchagent to exit. test_po_update_io_no_loss test_po_update

I think it's a timing issue. For example, the validation of this PR had lots of test case failures, but after I increased the wait_for_n_keys timeout, many test cases passed.

However, this change does impact performance: after this change, every doTask() call can only handle 128 entries, so some scenarios take longer.

I'm trying to change only RouteOrch to improve performance.

```diff
@@ -789,7 +787,7 @@ void Orch::addConsumer(DBConnector *db, string tableName, int pri)
 }
 else
 {
-    addExecutor(new Consumer(new ConsumerStateTable(db, tableName, gBatchSize, pri), this, tableName));
+    addExecutor(new Consumer(new ConsumerStateTable(db, tableName, gBatchSize * CONSUMER_POP_MAX_BATCH_COUNT, pri), this, tableName));
```
Contributor Author

This change will cause an issue:

sonic-net/sonic-buildimage@286ec3e

A Lua script running in the background may keep redis-server quite busy if the batch size is 8192.
If the handling time exceeds the default 5 s, redis-server will not respond to other processes, which will cause syncd to crash.

So that's also part of the reason why a loop was added in execute().

```cpp
auto table = static_cast<swss::ConsumerTableBase *>(getSelectable());
int batch_count = 0;
size_t update_size = 0;
```
Contributor Author

Will move this loop into the swss-common ConsumerStateTable::pops() method:

sonic-net/sonic-swss-common#916
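The idea of moving the batching into pops() can be sketched as follows (MockConsumerStateTable and POP_BATCH_SIZE are illustrative names, not the real swss-common API): the bound lives inside pops(), so every caller gets bounded batches without needing its own loop.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Illustrative mock, for sketching the proposed behavior only.
class MockConsumerStateTable
{
public:
    explicit MockConsumerStateTable(std::deque<std::string> backlog)
        : m_backlog(std::move(backlog)) {}

    // Pop at most POP_BATCH_SIZE entries per call. The caller no longer
    // loops until the table is empty, so control returns to the select
    // loop between batches.
    void pops(std::deque<std::string> &entries)
    {
        static const size_t POP_BATCH_SIZE = 128;  // hypothetical limit
        entries.clear();
        while (entries.size() < POP_BATCH_SIZE && !m_backlog.empty())
        {
            entries.push_back(m_backlog.front());
            m_backlog.pop_front();
        }
    }

    size_t backlogSize() const { return m_backlog.size(); }

private:
    std::deque<std::string> m_backlog;
};
```

Putting the limit in one place also keeps the per-call Redis work bounded, which speaks to the earlier concern about long-running Lua scripts when the batch size is very large.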
