[Auto Techsupport] Event driven Techsupport Changes #8670

Merged on Nov 16, 2021 (36 commits)

vivekrnv commented Sep 3, 2021

Why I did it

Changes required for feature "Event Driven TechSupport Invocation & CoreDump Mgmt". HLD

Requires: sonic-net/sonic-utilities#1796.
Merging in any order would be fine.

Summary of the changes:

  • Added the YANG models for the new tables introduced as part of this feature.
  • Enhanced init_cfg.json with the required default config.
  • Added a compile-time flag that enables or disables the config required for this feature in init_cfg.json.
  • Enhanced the supervisor-proc-exit-listener script to populate <feature>:<critical_proc> = <comm>:<pid> info in the STATE_DB when it observes a process exit notification for a critical process running inside the docker.

How I did it

How to verify it

#### No core files or techsupport dumps initially
admin@sonic:~$ ls /var/core/
admin@sonic:~$ ls /var/dump

#### Verify auto-techsupport status

admin@sonic:~$ show auto-techsupport global
STATE      RATE LIMIT INTERVAL    MAX TECHSUPPORT SIZE    MAX CORE SIZE  SINCE
-------  ---------------------  ----------------------  ---------------  ----------
enabled                    180                      10                5  2 days ago

admin@sonic:~$ show auto-techsupport-feature 
FEATURE NAME    STATE       RATE LIMIT INTERVAL
--------------  --------  ---------------------
bgp             enabled                     600
database        enabled                     600
dhcp_relay      enabled                     600
lldp            enabled                     600
macsec          enabled                     600
mgmt-framework  enabled                     600
nat             enabled                     600
pmon            enabled                     600
radv            enabled                     600
restapi         disabled                    800
sflow           enabled                     600
snmp            enabled                     600
swss            disabled                    800
syncd           enabled                     600
teamd           enabled                     600
telemetry       enabled                     600

#### Kill a critical process to trigger a coredump
admin@sonic:~$ docker exec -it snmp ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          23  1.6  0.5 138316 42296 pts/0    Sl   19:09   0:15 python3 -m sonic_ax_impl
...............

admin@sonic:~$ docker exec -it snmp kill -11 23

#### Coredump is created
admin@sonic:~$ ls /var/core/
python3.1629401152.23.core.gz 

#### Techsupport Dump creation in progress
admin@sonic:~$ ls /var/dump/
sonic_dump_sonic_20210819_192558  sonic_dump_sonic_20210819_192558.tar

admin@sonic:~$ ps -aux | grep coredump_gen
root       17823  0.1  0.2  30960 16736 ?        S    19:25   0:00 python3 /usr/local/bin/coredump_gen_handler.py python3.1629401152.23.core.gz

#### Wait until the techsupport dump execution has finished
admin@sonic:~$ ls /var/dump/
sonic_dump_sonic_20210819_192558.tar.gz


admin@sonic:~$ show auto-techsupport history

TECHSUPPORT DUMP                          TRIGGERED BY    CORE DUMP
----------------------------------------  --------------  -----------------------------
sonic_dump_sonic_device_20210819_192558  snmp-subagent   python3.1629401152.23.core.gz 
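
The core file name in the transcript above appears to encode the process name, a Unix timestamp, and the PID. The sketch below splits such a name into its parts. The `<comm>.<timestamp>.<pid>.core.gz` layout is inferred from the sample name only, and `parse_core_name` and `CoreInfo` are hypothetical names, not part of the feature's code.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple, Optional

# Assumed layout, inferred from the sample: <comm>.<epoch seconds>.<pid>.core.gz
CORE_NAME_RE = re.compile(r"^(?P<comm>.+)\.(?P<ts>\d+)\.(?P<pid>\d+)\.core\.gz$")


class CoreInfo(NamedTuple):
    comm: str          # executable name of the crashed process
    created: datetime  # timestamp embedded in the file name, read as UTC
    pid: int           # PID of the crashed process


def parse_core_name(name: str) -> Optional[CoreInfo]:
    """Return the parts of a core file name, or None if it does not match."""
    match = CORE_NAME_RE.match(name.strip())
    if match is None:
        return None
    created = datetime.fromtimestamp(int(match.group("ts")), tz=timezone.utc)
    return CoreInfo(match.group("comm"), created, int(match.group("pid")))


info = parse_core_name("python3.1629401152.23.core.gz")
print(info.comm, info.pid, info.created.strftime("%Y%m%d_%H%M%S"))
```

Read as UTC, the embedded timestamp falls a few seconds before the time stamped into the techsupport dump name, which is consistent with the dump being triggered by this core file.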

This feature required changes to the supervisor-proc-exit-listener script.
The changes are backward compatible with Python 2. I tested them on dockers running Python 3 (snmp) and Python 2 (restapi).

admin@sonic:~$ docker exec -it snmp ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.5  0.3  35676 24316 pts/0    Ss+  21:43   0:00 /usr/bin/pyth
root           9  0.1  0.3  43608 24888 pts/0    S    21:43   0:00 python3 /usr/
root          17  0.0  0.0 225856  5616 pts/0    Sl   21:43   0:00 /usr/sbin/rsy
Debian-+      21  0.9  0.1  32932 12524 pts/0    S    21:43   0:00 /usr/sbin/snm
root          23  8.4  0.4 133116 36764 pts/0    Sl   21:43   0:06 python3 -m so
root          26  0.0  0.0  11248  3036 pts/1    Rs+  21:44   0:00 ps -aux
admin@r-lionfish-16:~$ docker exec -it restapi ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.5  0.2  59112 21552 pts/0    Ss+  21:43   0:00 /usr/bin/pyth
root           9  0.1  0.2  84932 22932 pts/0    S    21:43   0:00 python2 /usr/
root          14  0.0  0.0 262988  3596 pts/0    Sl   21:43   0:00 /usr/sbin/rsy
root          20  0.0  0.0  17964  2948 pts/0    S    21:43   0:00 bash /usr/bin
root          31  0.0  0.0   4188   660 pts/0    S    21:44   0:00 sleep 60
root          32  0.0  0.0  36636  2840 pts/1    Rs+  21:44   0:00 ps -aux
admin@sonic:~$ docker exec -it restapi kill -11 20
admin@sonic:~$ docker exec -it snmp kill -11 23
admin@sonic:~$ redis-cli -n 6 hgetall "AUTO_TECHSUPPORT|FEATURE_PROC_INFO"
1) "restapi;restapi"
2) "20;bash"
3) "snmp;snmp-subagent"
4) "23;python3"
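
The `hgetall` reply above is a flat list that alternates field and value. In this sample each field looks like `<feature>;<critical_proc>` and each value like `<pid>;<comm>`. The sketch below turns such a reply into structured records. The separator and ordering are read off the sample output only, and `parse_feature_proc_info` and `ProcExit` are hypothetical names used for illustration.

```python
from typing import List, NamedTuple


class ProcExit(NamedTuple):
    feature: str        # docker/feature name, e.g. "snmp"
    critical_proc: str  # supervisor program name, e.g. "snmp-subagent"
    pid: int            # PID of the process that exited
    comm: str           # executable name of the process that exited


def parse_feature_proc_info(flat_reply: List[str]) -> List[ProcExit]:
    """Pair up a flat [field, value, field, value, ...] reply and split each part."""
    if len(flat_reply) % 2 != 0:
        raise ValueError("expected an even number of items (field/value pairs)")
    records = []
    for field, value in zip(flat_reply[0::2], flat_reply[1::2]):
        # Assumption: ';' separates the two halves, as in the sample output.
        feature, critical_proc = field.split(";", 1)
        pid, comm = value.split(";", 1)
        records.append(ProcExit(feature, critical_proc, int(pid), comm))
    return records


sample = ["restapi;restapi", "20;bash", "snmp;snmp-subagent", "23;python3"]
records = parse_feature_proc_info(sample)
for record in records:
    print(record)
```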

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106

Description for the changelog

vivekrnv and others added 29 commits August 10, 2021 22:05
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@vivekrnv
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

ganglyu previously approved these changes Oct 17, 2021

vivekrnv commented Nov 5, 2021

/azpw run

When the limit is crossed, the older core files are incrementally deleted
*/
description "Max Limit in percentage for the cumulative size of ts dumps. No cleanup is performed if the value isn't configured or is 0.0";
type decimal-repr;
Contributor

The value is between (0, 100) here, but the decimal-repr range is 0.0 to 99.99. Can you clarify?

Contributor Author

It was intended to be between 0.0 and 99.99. In the comment ("A value between (0,100) should be specified, Upto two decimal places will be used in the calculation"), I used open brackets to signify that the value lies between 0 and 100, and the next sentence says that up to two decimal places are used. So I think this is clear, but let me know if I should rewrite the comments to make them clearer.

Contributor

Open brackets exclude 0 and 100, don't they? Please update the comment.

Contributor Author

Updated the comment
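
The range question in this thread can be made concrete with a small validator. This is only a sketch of the rules as discussed here: a value from 0.0 to 99.99, at most two decimal places used, and 0.0 meaning no cleanup. Truncating extra digits rather than rejecting them is an assumption, and `parse_limit` and `cleanup_enabled` are hypothetical helpers, not the YANG `decimal-repr` type itself.

```python
from decimal import Decimal, InvalidOperation, ROUND_DOWN

LOWER = Decimal("0.0")    # inclusive; 0.0 means "no cleanup" per the description
UPPER = Decimal("99.99")  # inclusive upper bound quoted in the review


def parse_limit(text: str) -> Decimal:
    """Parse a percentage limit and keep only two decimal places."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        raise ValueError("not a decimal number: %r" % text)
    if not value.is_finite():
        raise ValueError("limit must be a finite number")
    if value < LOWER or value > UPPER:
        raise ValueError("limit must be within 0.0..99.99")
    # Assumption: digits beyond the second decimal place are dropped, not rounded.
    return value.quantize(Decimal("0.01"), rounding=ROUND_DOWN)


def cleanup_enabled(limit: Decimal) -> bool:
    """Cleanup only runs for a configured, non-zero limit."""
    return limit > 0


for raw in ("10", "10.567", "0.0"):
    limit = parse_limit(raw)
    print(raw, "->", limit, "cleanup" if cleanup_enabled(limit) else "no cleanup")
```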

When the limit is crossed, the older core files are deleted
*/
description "Max Limit in percentage for the cumulative size of core dumps. No cleanup is performed if the value isn't configured or is 0.0";
type decimal-repr;
Contributor

range 0.0..99.99?

Contributor Author

Addressed in the comment above.
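
The descriptions quoted in this review say that older files are deleted once the cumulative size crosses a percentage limit. A minimal sketch of that policy follows, assuming the percentage is taken against the total size of the hosting filesystem and that "older" means the smallest modification time. `DumpFile` and `select_for_deletion` are hypothetical names, and this is not the code that was merged.

```python
from typing import List, NamedTuple


class DumpFile(NamedTuple):
    name: str
    size: int     # size in bytes
    mtime: float  # modification time, seconds since the epoch


def select_for_deletion(files: List[DumpFile],
                        fs_total_bytes: int,
                        limit_percent: float) -> List[str]:
    """Pick the oldest files to delete until the total size fits the limit."""
    if limit_percent <= 0:
        return []  # limit not configured or 0.0: no cleanup
    budget = fs_total_bytes * limit_percent / 100.0
    total = sum(f.size for f in files)
    doomed = []
    for f in sorted(files, key=lambda item: item.mtime):  # oldest first
        if total <= budget:
            break
        doomed.append(f.name)
        total -= f.size
    return doomed


files = [
    DumpFile("a.core.gz", 40, 100.0),
    DumpFile("b.core.gz", 30, 200.0),
    DumpFile("c.core.gz", 20, 300.0),
]
print(select_for_deletion(files, fs_total_bytes=1000, limit_percent=5))
```

The function only selects names and leaves the actual file removal to the caller, which keeps the policy easy to test without touching the filesystem.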

@azure-pipelines
You have several pipelines (over 10) configured to build pull requests in this repository. Specify which pipelines you would like to run by using /azp run [pipelines] command. You can specify multiple pipelines using a comma separated list.

@qiluo-msft qiluo-msft merged commit ff32ac3 into sonic-net:master Nov 16, 2021
5 participants