
Fix port up/bfd sessions bringup notification delay issue. #3269

Open
wants to merge 24 commits into base: master
Conversation

liuh-80
Contributor

@liuh-80 liuh-80 commented Aug 28, 2024

Fix port up/bfd sessions bringup notification delay issue.

Why I did it

Fixes the following issue:
sonic-net/sonic-buildimage#19569

Work item tracking
  • Microsoft ADO: 29192284

How I did it

Revert the change in Consumer::execute() that was introduced by this commit:
9258978#diff-96451cb89f907afccbd39ddadb6d30aa21fe6fbd01b1cbaf6362078b926f1f08

The change in that commit added a do/while loop:

```cpp
do
{
    std::deque<KeyOpFieldsValuesTuple> entries;
    table->pops(entries);
    update_size = addToSync(entries);
} while (update_size != 0);
```

The addToSync() method returns the number of entries it processed, so this loop only exits once the table is drained.
This means that when there is a flood of route notifications, higher-priority notifications (for example, a port up notification) are blocked until all route notifications have been handled.
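To illustrate the reverted behavior, here is a minimal sketch (SketchConsumer is a hypothetical stand-in, not the real sonic-swss Consumer): each execute() call pops at most one bounded batch and then returns, so control goes back to the orchagent select loop, which can service a higher-priority consumer before the next route batch.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical stand-in for a swss Consumer, for illustration only.
struct SketchConsumer
{
    std::deque<std::string> pending;  // notifications waiting in the table

    // Reverted behavior: pop ONE bounded batch per execute() call and
    // return, instead of looping until the table is drained. Remaining
    // entries stay queued; the select loop decides who runs next.
    size_t execute(size_t batchSize)
    {
        size_t popped = 0;
        while (popped < batchSize && !pending.empty())
        {
            pending.pop_front();
            ++popped;
        }
        return popped;  // number of entries handled this round
    }
};
```

With a backlog of 1000 route notifications and a batch size of 128, one call handles 128 entries and leaves 872 queued, instead of draining all 1000 before yielding.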

How to verify it

Pass all UT.
Manually verify issue fixed.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

  • SONiC.master-20030.629638-f370e2fa8

Description for the changelog

Fix port up/bfd sessions bringup notification delay issue.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@mlok-nokia

@liuh-80 I built an image with this change and tested it. For the first boot-up after installation on a single linecard, all ports come up in 8 minutes and all 34k routes are installed. For subsequent reboots of a single linecard, it takes about 7 minutes for all links to come up and the 34k routes to be installed. It seems this change addresses the issue. We need to do more testing to verify that, including the OC testing.

@liuh-80
Contributor Author

liuh-80 commented Aug 29, 2024

@bocon13, we found a performance issue caused by your PR; can you review this fix?

Performance issue: sonic-net/sonic-buildimage#19569
PR that caused the performance issue: #1992

@liuh-80 liuh-80 changed the title [POC] verify route performance issue Fix port up/bfd sessions bringup notification delay issue. Aug 30, 2024
@liuh-80 liuh-80 marked this pull request as ready for review August 30, 2024 06:20
@liuh-80 liuh-80 requested a review from prsunny as a code owner August 30, 2024 06:20
@lguohan
Contributor

lguohan commented Aug 30, 2024

@liuh-80, we need some UT to prevent such a regression.

@wenyiz2021

thanks @liuh-80! Just curious: what is the difference between the Consumer popping notifications once vs. popping until the number of entries is 0?

qiluo-msft
qiluo-msft previously approved these changes Aug 31, 2024
@liuh-80
Contributor Author

liuh-80 commented Sep 2, 2024

Will add a sonic-swss test case to prevent this issue from happening again.

@liuh-80
Contributor Author

liuh-80 commented Sep 2, 2024

> thanks @liuh-80! Just curious: what is the difference between the Consumer popping notifications once vs. popping until the number of entries is 0?

This will make the Consumer pop all notifications belonging to the current consumer, so higher-priority notifications will be blocked.

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-swss


Azure Pipelines successfully started running 1 pipeline(s).

@siqbal1986 siqbal1986 self-requested a review September 3, 2024 03:28
@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-swss


Azure Pipelines successfully started running 1 pipeline(s).

@saksarav-nokia
Contributor

@liuh-80, we ran OC with this fix and noticed orchagent crashes in all test beds. Analyzing the syslog files now; I will update with my findings soon.

@saksarav-nokia
Contributor

The following tests in the pc suite seem to trigger the crashes: orchagent tries to remove the neighbor and nexthop; either the meta layer or SAI is not in sync with orchagent and returns an error, which causes orchagent to exit.
test_po_update_io_no_loss
test_po_update

@prsunny
Collaborator

prsunny commented Sep 3, 2024

@mint570 for viz

@liuh-80
Contributor Author

liuh-80 commented Sep 3, 2024

> > > thanks @liuh-80! Just curious: what is the difference between the Consumer popping notifications once vs. popping until the number of entries is 0?
> >
> > This will make the Consumer pop all notifications belonging to the current consumer, so higher-priority notifications will be blocked.
>
> Thanks! Then does the current design in this PR ensure the priority tasks are notified, or does it only notify the first-come notification?

With this PR, each Consumer pops at most 128 notifications at a time, which means orchagent will check and handle high-priority notifications more frequently.

@prsunny
Collaborator

prsunny commented Sep 4, 2024

> > > > thanks @liuh-80! Just curious: what is the difference between the Consumer popping notifications once vs. popping until the number of entries is 0?
> > >
> > > This will make the Consumer pop all notifications belonging to the current consumer, so higher-priority notifications will be blocked.
> >
> > Thanks! Then does the current design in this PR ensure the priority tasks are notified, or does it only notify the first-come notification?
>
> With this PR, each Consumer pops at most 128 notifications at a time, which means orchagent will check and handle high-priority notifications more frequently.

Does this mean the SAI bulk APIs that orchagent invokes for routes etc. will now have a limit of 128 entries?

@liuh-80
Contributor Author

liuh-80 commented Sep 4, 2024

> > > > > thanks @liuh-80! Just curious: what is the difference between the Consumer popping notifications once vs. popping until the number of entries is 0?
> > > >
> > > > This will make the Consumer pop all notifications belonging to the current consumer, so higher-priority notifications will be blocked.
> > >
> > > Thanks! Then does the current design in this PR ensure the priority tasks are notified, or does it only notify the first-come notification?
> >
> > With this PR, each Consumer pops at most 128 notifications at a time, which means orchagent will check and handle high-priority notifications more frequently.
>
> Does this mean the SAI bulk APIs that orchagent invokes for routes etc. will now have a limit of 128 entries?

Yes, it will have a 128-entry limit.
This is an orchagent design issue.

```cpp
    std::deque<KeyOpFieldsValuesTuple> entries;
    table->pops(entries);
    update_size = addToSync(entries);
} while (update_size != 0);
```
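As a back-of-the-envelope illustration of the 128-entry limit discussed above (bulkCallsNeeded is a hypothetical helper, not part of orchagent): when each doTask() batch carries at most 128 entries, a flood of N routes is spread across ceil(N / 128) rounds, each eligible for one bounded bulk operation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: with per-batch size capped at batchLimit entries,
// how many rounds does it take to push routeCount routes through?
size_t bulkCallsNeeded(size_t routeCount, size_t batchLimit = 128)
{
    // Ceiling division: e.g. 1000 routes at 128 per batch -> 8 rounds.
    return (routeCount + batchLimit - 1) / batchLimit;
}
```

The trade-off discussed in this thread follows directly: smaller batches mean more rounds (and more chances to service high-priority notifications in between), while larger batches mean fewer, longer rounds.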
Contributor Author

@liuh-80 liuh-80 Sep 4, 2024

Multiple test cases failed because they did not get the expected data within 20 seconds.
Possible reasons:

  1. Some types of notifications depend on each other, so all entries need to be popped in the same loop.
  2. This PR makes orchagent process some data more slowly than before.
     Will try a different solution that only limits route processing, in another POC PR:
     [POC] Improve routeorch to stop processing routes when a high-priority notification arrives. #3278

@liuh-80
Contributor Author

liuh-80 commented Sep 4, 2024

> The following tests in the pc suite seem to trigger the crashes. orchagent tries to remove the neighbor and nexthop; either the meta layer or SAI is not in sync with orchagent and returns an error, which causes orchagent to exit. test_po_update_io_no_loss test_po_update

I think it's a timing issue. For example, the validation of this PR had lots of test case failures, but after I increased the wait_for_n_keys timeout, many test cases passed.

However, this change does impact performance: after this change, every doTask() call can only handle 128 entries, so some scenarios take longer.

I'm trying to change only RouteOrch to improve performance.

```diff
@@ -789,7 +787,7 @@ void Orch::addConsumer(DBConnector *db, string tableName, int pri)
 }
 else
 {
-    addExecutor(new Consumer(new ConsumerStateTable(db, tableName, gBatchSize, pri), this, tableName));
+    addExecutor(new Consumer(new ConsumerStateTable(db, tableName, gBatchSize * CONSUMER_POP_MAX_BATCH_COUNT, pri), this, tableName));
```
Contributor Author

This change will cause an issue:

sonic-net/sonic-buildimage@286ec3e

A Lua script running in the background may keep redis-server quite busy if the batch size is 8192.
If the handling time exceeds the default 5 s, redis-server will not respond to other processes, which will cause syncd to crash.

So that's also part of the reason why a loop was added in execute().

```cpp
auto table = static_cast<swss::ConsumerTableBase *>(getSelectable());
int batch_count = 0;
size_t update_size = 0;
```
Contributor Author

Will move this loop into the swss-common ConsumerStateTable::pops() method:

sonic-net/sonic-swss-common#916
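The idea of moving the batching into pops() can be sketched as follows (MockConsumerStateTable and POP_BATCH_SIZE are illustrative names, not the real swss-common API): the bound lives inside pops(), so every caller gets bounded batches without needing its own loop.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Illustrative mock, for sketching the proposed behavior only.
class MockConsumerStateTable
{
public:
    explicit MockConsumerStateTable(std::deque<std::string> backlog)
        : m_backlog(std::move(backlog)) {}

    // Pop at most POP_BATCH_SIZE entries per call. The caller no longer
    // loops until the table is empty, so control returns to the select
    // loop between batches.
    void pops(std::deque<std::string> &entries)
    {
        static const size_t POP_BATCH_SIZE = 128;  // hypothetical limit
        entries.clear();
        while (entries.size() < POP_BATCH_SIZE && !m_backlog.empty())
        {
            entries.push_back(m_backlog.front());
            m_backlog.pop_front();
        }
    }

    size_t backlogSize() const { return m_backlog.size(); }

private:
    std::deque<std::string> m_backlog;
};
```

Putting the limit in one place also keeps the per-call Redis work bounded, which speaks to the earlier concern about long-running Lua scripts when the batch size is very large.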
