
Implement NTP source info messages to the controller #3968

Merged: 9 commits merged into lf-edge:master on Jun 25, 2024.

Conversation

@rouming (Contributor) commented on Jun 10, 2024

This implements NTP sources info messages (eve-api/proto/info/ntpsources.proto), which are sent periodically (once every 10 minutes) to the controller.

NTP sources data is fetched from the chronyd running on EVE over a unix domain socket, with the help of the vendored API from the github.com/facebook/time/ntp/chrony package. Currently this API supports only monitoring (read-only), not control. In the future this can be extended, so that nim can have full control over chronyd and update its servers and pools.
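For illustration only (not the PR's actual code): a minimal sketch of reading NTP sources through the upstream github.com/facebook/time/ntp/chrony monitoring API. The PR talks to chronyd over its unix domain socket; this sketch assumes the localhost UDP command port 127.0.0.1:323 instead, because the unix socket requires a specially bound client socket.

```go
// Minimal sketch: query chronyd for its NTP sources via the chrony monitoring API.
package main

import (
	"fmt"
	"log"
	"net"

	"github.com/facebook/time/ntp/chrony"
)

func main() {
	// Assumption: chronyd's command port on localhost (the PR uses the unix socket).
	conn, err := net.Dial("udp", "127.0.0.1:323")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := chrony.Client{Connection: conn, Sequence: 1}

	// Ask how many sources chronyd currently knows about.
	resp, err := client.Communicate(chrony.NewSourcesPacket())
	if err != nil {
		log.Fatal(err)
	}
	nSources := resp.(*chrony.ReplySources).NSources

	// Fetch per-source data: state, mode, stratum, reachability, etc.
	for i := 0; i < nSources; i++ {
		resp, err := client.Communicate(chrony.NewSourceDataPacket(int32(i)))
		if err != nil {
			log.Fatal(err)
		}
		src := resp.(*chrony.ReplySourceData)
		fmt.Printf("source %s state=%v mode=%v stratum=%d reach=%o\n",
			src.IPAddr, src.State, src.Mode, src.Stratum, src.Reachability)
	}
}
```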

CC: @rene

@rouming changed the title from "[WIP] Implement NTP source info messages to the controller" to "Implement NTP source info messages to the controller" on Jun 11, 2024.
@milan-zededa (Contributor) left a comment

Could you please mention in EVE docs:

  • that we use chrony
  • how we pick NTP server (first from the list obtained via DHCP or static config - unless we improve this to apply all)
  • what is the default NTP if none configured
  • NTP status information that EVE publishes (what info is available, how often it is published and how to change the interval)

I have started something here, but separate md file for ntp could also make sense.

Review comments were left on pkg/pillar/cmd/zedagent/handlentp.go and pkg/pillar/scripts/device-steps.sh (outdated, resolved).
@rouming force-pushed the chrony-ntp-sources branch 2 times, most recently from 4582cf4 to b8578cd, on June 12, 2024 at 14:55.
@rouming (Contributor, Author) commented on Jun 12, 2024

Could you please mention in EVE docs:

  • that we use chrony

Mentioned

  • how we pick NTP server (first from the list obtained via DHCP or static config - unless we improve this to apply all)
  • what is the default NTP if none configured

Those two are already covered by you.

  • NTP status information that EVE publishes (what info is available, how often it is published and how to change the interval)

Done.

I have started something here, but separate md file for ntp could also make sense.

Meh, separate file is too much, I do not know what to write there except those 4 lines I added.

@rouming force-pushed the chrony-ntp-sources branch 2 times, most recently from e16425e to 0119979, on June 13, 2024 at 08:16.
@eriknordmark (Contributor) left a comment

A periodic message makes sense for a LOC/LPS (and I don't understand why that isn't included - we shouldn't have it provide less info than the controller).

But for the info sent to the controller I don't think periodic updates make sense. That should be done when there is a significant change. 10 minutes is too slow when there are changes, and too frequent when there are no significant changes.

Review comments were left on pkg/pillar/cmd/zedagent/handlentp.go and pkg/pillar/rootfs/etc/chrony/chrony.conf (resolved).
@rouming (Contributor, Author) commented on Jun 13, 2024

A periodic message makes sense for a LOC/LPS (and I don't understand why that isn't included - we shouldn't have it provide less info than the controller).

Not sure what you mean by "that isn't included" - the LOC is updated periodically by the triggerPublishAllInfo() call. Is that what you mean?

But for the info sent to the controller I don't think periodic updates make sense. That should be done when there is a significant change.

My main question is how many bytes we need to save here. E.g. we do periodic updates for the location (with a bigger interval, though). Is it that bad to send ~100 bytes every 10 minutes to update NTP sources?

The other observation is that NTP sources update very frequently (the server set changes, as do reachability and other fields and metrics). So a simple byte comparison between the previous buffer and the one we retrieve from chrony will almost always differ, and with this simple algorithm we would likely update the controller within 1 minute. A smarter algorithm (a "significant change", as you said) leaves me confused: what exactly is a "significant change" for the end user? The set of servers? A status/mode change?

@eriknordmark (Contributor):

My main question is how many bytes we need to save here. E.g. we do periodic updates for the location (with a bigger interval, though). Is it that bad to send ~100 bytes every 10 minutes to update NTP sources?

The point is that when there is an issue with connectivity (Ethernet not connected, firewall not letting outbound NTP through) and the user wants to fix that and check that it worked, then a 10 minute wait to see whether it worked is not acceptable. And I would argue that even a 1 minute wait is painful.

But 10,000 devices reporting periodically every 1 minute or less would be a huge load on the controller.

Hence the suggestion that the reporting should be done when there is a significant change and not periodically.

The other observation is that NTP sources update very frequently (the server set changes, as do reachability and other fields and metrics). So a simple byte comparison between the previous buffer and the one we retrieve from chrony will almost always differ, and with this simple algorithm we would likely update the controller within 1 minute. A smarter algorithm (a "significant change", as you said) leaves me confused: what exactly is a "significant change" for the end user? The set of servers? A status/mode change?

That is why I said "significant change" - when the set of synchronized servers changes. Most important is when the set of synchronized servers goes between zero and non-zero, but even if the count of synchronized servers stays the same (but an IP changes), or the count changes in general, it would make sense to send an update.
(Plus send one on boot to make sure there is an initial set of servers to compare with.)

I don't know if there are some high-level status/mode changes which the user would care about having reported. Perhaps we can ask for feedback on that later?
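To make the discussed notion of a "significant change" concrete, here is a hypothetical sketch (not the code that was eventually merged) that treats a change in the set of synchronized sources, including the zero/non-zero transition, as significant. The ntpPeer type and its fields are illustrative only.

```go
// Hypothetical sketch of the "significant change" check discussed above:
// an update is considered significant when the set of synchronized NTP
// sources changes (including the zero <-> non-zero transition). The
// ntpPeer type and its fields are illustrative, not EVE's actual types.
type ntpPeer struct {
	Addr         string
	Synchronized bool
}

// syncedSet collects the addresses of all currently synchronized peers.
func syncedSet(peers []ntpPeer) map[string]bool {
	set := make(map[string]bool)
	for _, p := range peers {
		if p.Synchronized {
			set[p.Addr] = true
		}
	}
	return set
}

// significantChange reports whether the set of synchronized sources differs
// between the previously reported peers and the current ones.
func significantChange(prev, cur []ntpPeer) bool {
	prevSet, curSet := syncedSet(prev), syncedSet(cur)
	if len(prevSet) != len(curSet) {
		return true
	}
	for addr := range curSet {
		if !prevSet[addr] {
			return true
		}
	}
	return false
}
```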

@rouming (Contributor, Author) commented on Jun 18, 2024

Differences from the previous version:

  • Trigger NTP source updates by two events:
    • Do a force send every 10 minutes. This is needed just to make sure no updates are lost due to controller downtime. We still send NTP updates recurrently in order not to stall the send queue.
    • NTP source updates are also sent if the list of NTP peers or NTP peer fields, such as mode and state, have changed. This guarantees we see updates more frequently.
  • Introduce a ForceSend flag for the destination bitset to control the operation from the sender.

@rouming (Contributor, Author) commented on Jun 21, 2024

@eriknordmark, could you please take a look?

@milan-zededa (Contributor) commented on Jun 21, 2024

@rouming Will this also trigger publishing of NTP info: https://github.com/lf-edge/eve/blob/master/pkg/pillar/cmd/zedagent/zedagent.go#L2182 ? (this code executes when network connectivity that previously didn't work is restored)
I guess it does because you are also using the deferred queue, right?

@rouming (Contributor, Author) commented on Jun 21, 2024

@rouming Will this also trigger publishing of NTP info: https://github.com/lf-edge/eve/blob/master/pkg/pillar/cmd/zedagent/zedagent.go#L2182 ? (this code executes when network connectivity that previously didn't work is restored) I guess it does because you are also using the deferred queue, right?

If I understand correctly, this will eventually lead to this timer kick:
zedcloudCtx.DeferredEventCtx.KickTimer()

which kicks the stalled queue (stalled because of missing connectivity, for example) to repeat the send. It does not kick any of the higher-level calls, like the NTP sources send; it only kicks the queue, which is already populated with stalled requests.

@milan-zededa (Contributor):

which is already populated with stalled requests

Which will include NTP info messages if there was any NTP change during broken controller connectivity, correct?

@eriknordmark (Contributor):

Differences from the previous version:

  • Trigger NTP source updates by two events:

    • Do a force send every 10 minutes. This is needed just to make sure no updates are lost due to controller downtime. We still send NTP updates recurrently in order not to stall the send queue.

"Recurrently"? I can't quite understand what you mean by stalling the queue. I assume the same key is used with the deferred logic, meaning that there is at most one NTP info message queued (and subsequent messages replace a message which has not yet been sent).

And note that if the controller is down the deferred logic will retry (until it gets a 200 response), so strictly speaking you shouldn't need a periodic message here any more than for e.g. device, app, or volume info messages. But there is no harm in sending occasionally so folks can see the skew/delta type NTP info.

  • NTP source updates are also sent if the list of NTP peers or NTP peer fields, such as mode and state, have changed. This guarantees we see updates more frequently.
  • Introduce a ForceSend flag for the destination bitset to control the operation from the sender.

@rouming (Contributor, Author) commented on Jun 21, 2024

Your understanding is correct: in case of lost connectivity the request will be the only one from NTP in the queue. We need to update NTP sources recurrently because we need to reflect updates of the other fields in the NTP peer structure (we can't guarantee the other fields are up-to-date by only checking for a change in the set of peers or states). So we need to do that recurrently with low frequency, putting this request into the recurrent queue, not the event queue.

Comment on lines 156 to 157
// Even for the controller destination we can't stall the queue on error,
// set @forcePeriodic to true and let the timer repeat the send in a while
Reviewer (Contributor):

Doesn't this mean that if the network is not up (or that there is a timeout), you will wait for the full 10 minutes until you try again?

Why can't you have the queue retry on error for the triggered updates (and have the periodic updates not retry)?
How will the queue "stall" in a way which will be more upsetting than e.g., a device INFO message being queued and retried?
(You might want to think about the relative priority of the new NTP INFO message if you didn't already.)

Author (Contributor):

you will wait for the full 10 minutes until you try again?

Yes, if the network is not up then we have to wait another 10 minutes. That is exactly how all other info messages function (with a smaller interval, though). So I do not quite understand why NTP status (which has a purely informative role on the controller) should be an exception here.

Why can't you have the queue retry on error for the triggered updates (and have the periodic updates not retry)?

We don't have any callback mechanism or any error checks and retries. My impression is that this particular problem is not worth any over-complication; at least that is what I understood from discussing this with Daniel.

How will the queue "stall" in a way which will be more upsetting than e.g., a device INFO message being queued and retried?

Not upsetting at all, but it abuses the semantics chosen a long time ago of having two queues: one for events and another for recurring calls. If you are satisfied with making NTP status an exception and pushing it to the event queue rather than the recurring queue (although we are forced to send it recurrently every 10 min), and we can eventually merge this PR, I can cover the code with a fat comment explaining this and change the boolean flag.

Reviewer (Contributor):

Yes, if the network is not up then we have to wait another 10 minutes. That is exactly how all other info messages function (with a smaller interval, though). So I do not quite understand why NTP status (which has a purely informative role on the controller) should be an exception here.

I see you changed this to pass forcePeriodic=false like for the other info messages, which is what I was asking for.
I see that zedagent calls zedcloudCtx.DeferredEventCtx.KickTimer() when it thinks it makes sense to try to send from the queue. In addition there is a binary exponential timer (from 1 to 15 minutes) in deferred.go. So this would normally be retried relatively quickly if the network was not connected or there was a short outage.

Commit messages:

Use chronyd for time synchronization instead of busybox ntpd,
because of the rich set of features it has and the Go
implementation of the monitoring API, which we can use for getting
all the useful information for sharing NTP status with the
controller.

The only difference between how ntpd and chronyd are started is
the way an NTP server/peer is passed to the daemon. In the case of
chronyd the chrony client (chronyc) should be used.

Also, the default pool.ntp.org is now explicitly passed to
chronyd as a 'pool', while other NTP servers which come from nim
are passed as 'server'.

In this change all common code was moved to functions, namely
single_ntp_sync(), start_ntp_daemon() and restart_ntp_daemon().
This reduces code duplication.

Also, chrony requires parsing the configuration and passing it as
a single line to the chronyd running in the `-q` client mode
(see `single_ntp_sync()`).

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
The ZInfoNTPSources info message will be published by EVE.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Nothing fancy in this PR, just vendor updates.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
chrony API will be used later in this series in order
to get NTP sources data from chronyd.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Nothing to look at, just vendor files.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
…rred

For a while now we have supported a few destinations: Controller,
LOC, LPS. There are a few calls which send messages to all
destinations simultaneously. The semantics of the `SetDeferred`
function is to replace a deferred item if the key is found. The
`queueInfoToDest()` can add two items to the same
`DeferredPeriodicCtx` context with different URLs. The original
behavior of the `SetDeferred` call is to replace the first item
with the second item by the key, which will be the same. This
leads to a missing packet to one of the URLs if Controller and
LOC are both enabled.

This patch adds a `URL` check along with the `key` check when a
deferred item is added, which lets items with different URLs
co-exist in the same queue.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
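A rough illustration of the behavior described above, using hypothetical types rather than pillar's actual deferred-queue implementation: an existing item is replaced only when both the key and the destination URL match.

```go
// Illustrative sketch only (hypothetical types, not pillar's actual deferred
// queue): when adding a deferred item, an existing entry is replaced only if
// both the key and the destination URL match, so items for the Controller and
// the LOC with the same key can co-exist in the queue.
type deferredItem struct {
	key string
	url string
	buf []byte
}

type deferredQueue struct {
	items []deferredItem
}

// setDeferred replaces an existing item with the same key AND URL;
// otherwise it appends a new one.
func (q *deferredQueue) setDeferred(item deferredItem) {
	for i := range q.items {
		if q.items[i].key == item.key && q.items[i].url == item.url {
			q.items[i] = item // refresh the payload for this destination
			return
		}
	}
	q.items = append(q.items, item)
}
```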
The flag acts as a hint for request handlers and tells them that
the request should be sent in any case. In the next patches the
NTP sources reports will be added; they will be sent to the
controller only if the NTP peers have changed, so a comparison
will take place. The LOC destination is special and should bypass
any checks, and its request should be forcibly sent.

This scenario is the main user of the `ForceSend` flag.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
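A hypothetical sketch of how such a flag can live in a destination bitset (the actual pillar constants and types differ): ForceSend tells the handler to skip the change comparison, which is what the LOC destination needs.

```go
// Hypothetical sketch of a destination bitset carrying a ForceSend hint
// (the real pillar constants and types differ). The Controller/LPS/LOC bits
// select where an info message goes; forceSend tells the handler to skip
// the "did anything change?" comparison.
type destinationBitset uint32

const (
	controllerDest destinationBitset = 1 << iota
	lpsDest
	locDest
	forceSend
)

// shouldCompareBeforeSend returns false when the sender explicitly
// forces the send, e.g. for the LOC destination.
func shouldCompareBeforeSend(dest destinationBitset) bool {
	return dest&forceSend == 0
}
```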
This implements NTP sources info messages
(eve-api/proto/info/ntpsources.proto), which are sent to the
controller periodically (at least once per 10 min).

NTP sources data is fetched from the chronyd running on EVE over
a unix domain socket, with the help of the vendored API from the
github.com/facebook/time/ntp/chrony package. Currently this API
supports only monitoring (read-only), not control. In the future
this can be extended, so that `nim` can have full control over
chronyd and update its servers and pools.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Documentation bits. Never harms.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
@rouming (Contributor, Author) commented on Jun 24, 2024

Differences from the previous version:

  • Pass the configuration from the config to chrony in client -q mode as a command line, to allow chrony's initial large adjustment.
  • NTP source requests are queued with forcePeriodic cleared.

@eriknordmark (Contributor) left a comment

LGTM

@milan-zededa (Contributor):

I also have these failures in my PR:

        FAIL: ../eclient/testdata/metadata.txt:26: command failure

and

        Bad port ''
        [exit status 1]
        FAIL: ../eclient/testdata/port_forward.txt:48: command failure

Something very recently merged to master must have created this. I will investigate.

@eriknordmark (Contributor):

I also have these failures in my PR:

        FAIL: ../eclient/testdata/metadata.txt:26: command failure

and

        Bad port ''
        [exit status 1]
        FAIL: ../eclient/testdata/port_forward.txt:48: command failure

Something very recently merged to master must have created this. I will investigate.

@milan-zededa I will proceed to merge this and your PR today, but can you please let us know whether you think these test failures indicate product issues (as opposed to test issues).

@eriknordmark merged commit 98701ed into lf-edge:master on Jun 25, 2024.
29 of 35 checks passed.
@milan-zededa (Contributor):

I also have these failures in my PR:

        FAIL: ../eclient/testdata/metadata.txt:26: command failure

and

        Bad port ''
        [exit status 1]
        FAIL: ../eclient/testdata/port_forward.txt:48: command failure

Something very recently merged to master must have created this. I will investigate.

@milan-zededa I will proceed to merge this and your PR today, but can you please let us know whether you think these test failures indicate product issues (as opposed to test issues).

Yes, it is some regression in the product. I can reproduce these two test failures locally and was able to narrow it down to this commit: bd68e04.
But at the moment I don't know which change in the commit is causing the failures.

@uncleDecart (Contributor):

@milan-zededa I'll check it out

@milan-zededa (Contributor):

@milan-zededa I'll check it out

It panics here: https://github.com/lf-edge/eve/blob/master/pkg/pillar/cmd/msrv/pubsub.go#L46
I don't see srv.deviceNetworkStatus being set anywhere. It seems you are missing a subscription for DeviceNetworkStatus...
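For context, the usual pillar pattern for adding such a subscription (to DeviceNetworkStatus, which is published by nim) looks roughly like the sketch below; the handler names and the msrv context wiring are assumptions, not the actual fix in #3998.

```go
// Rough sketch (not the actual fix in lf-edge/eve#3998) of the usual pillar
// pattern for subscribing to DeviceNetworkStatus so that
// srv.deviceNetworkStatus gets populated before it is dereferenced.
//
// Assumed imports (pillar module context):
//   "github.com/lf-edge/eve/pkg/pillar/pubsub"
//   "github.com/lf-edge/eve/pkg/pillar/types"
subDNS, err := ps.NewSubscription(pubsub.SubscriptionOptions{
	AgentName:     "nim",                       // publisher of DeviceNetworkStatus
	MyAgentName:   agentName,                   // this agent (msrv)
	TopicImpl:     types.DeviceNetworkStatus{}, // topic to subscribe to
	Activate:      false,
	Ctx:           srv,
	CreateHandler: handleDNSCreate, // assumed handlers that store the status on srv
	ModifyHandler: handleDNSModify,
	WarningTime:   warningTime,
	ErrorTime:     errorTime,
})
if err != nil {
	return err
}
srv.subDeviceNetworkStatus = subDNS
subDNS.Activate()
```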

@uncleDecart (Contributor):

@milan-zededa see #3998
