Keepalive bugfixes and unify timers strategies between client and server #2760

davidfiala · 2024-05-28T21:56:51Z

Improve server-side keepalives and possibly resolve a bug where keepalive errors were not being treated as errors.

- Bugfix: Ensure that if session.ping returns false, we correctly fail the keepalive and the connection.
- Bugfix: Ensure that if the interval between keepalives is shorter than the prior keepalive's timeout, we do not overwrite the reference to the prior timeout. Before this change, we could in theory have prevented a valid keepalive timeout from clearing itself because its object reference was replaced with a newer timeout. This rewrite keeps every timeout as a local (vs. shared state per session). Even if the timeout outlives the session, we still guard against errors by checking that the parent interval is not falsy. I reckon this could result in a short-term memory leak per session, bounded by a maximum of keepaliveTimeoutMs. On the other hand, even with that potential for a short reference hold, I think the implementation proposed here is more correct. One alternative would be to keep a list of pending timeouts, which is complex for a rare situation that will resolve itself anyhow when keepaliveTimeoutMs is reached. So I'm going with the cheaper, more common case.
- Bugfix: Keepalive intervals were previously being cleared with an incorrect clearTimeout. Not sure whether this was causing interval leaks in some Node.js implementations. (v20.13.1 seems to accept this mismatch without issue.)
- Rename variables for clarity, to prevent future bugs like swapping clearInterval and clearTimeout.
- Trace on GOAWAY frames and ensure we drop the connection.
- The implementation is repeated in two places, per the warning in Serverside keepalive error detection and cleanups #2756 (comment).
- This commit supersedes the prior PR, Serverside keepalive error detection and cleanups #2756, which was based on an out-of-date master branch.
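The per-ping timeout idea described in the list above can be sketched roughly as follows. This is an illustration, not the code from the PR: `PingSession`, `startKeepalive`, and the parameter names are hypothetical, and the `ping` signature is only modeled on the boolean return value mentioned above. Note that later comments in this thread move to a different design with a single timer.

```typescript
// Hypothetical minimal session shape, assumed for this sketch only.
interface PingSession {
  destroyed: boolean;
  // Returns false when the ping could not be sent at all.
  ping(callback: (err: Error | null) => void): boolean;
  destroy(): void;
}

function startKeepalive(
  session: PingSession,
  keepaliveTimeMs: number,
  keepaliveTimeoutMs: number
): { stop(): void } {
  let interval: ReturnType<typeof setInterval> | null = setInterval(() => {
    // Each ping owns its timeout handle in a local variable, so a later
    // ping can never overwrite the handle of an earlier one.
    const timeout = setTimeout(() => {
      // The timeout may outlive the keepalive; do nothing once stopped.
      if (interval !== null) {
        stop();
        session.destroy();
      }
    }, keepaliveTimeoutMs);

    const sent = session.ping(err => {
      clearTimeout(timeout);
      if (err) {
        stop();
        session.destroy();
      }
    });

    // A false return value is treated as a keepalive failure.
    if (!sent) {
      clearTimeout(timeout);
      stop();
      session.destroy();
    }
  }, keepaliveTimeMs);

  function stop(): void {
    if (interval !== null) {
      // Interval handles are cleared with clearInterval, not clearTimeout.
      clearInterval(interval);
      interval = null;
    }
  }

  return { stop };
}
```

Because every timeout is captured by its own closure, acknowledging one ping clears only that ping's timeout, even when several pings are outstanding.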

Aims to help resolve #2734

davidfiala · 2024-05-28T22:09:05Z

I realize that a second solution to my reference conundrum is to use a WeakRef to hold the references to the session and the channelz tracer in the timeout lambda, which might reduce the size of the short-term memory leak. I'm not sure if the rest of the code base is overly cautious for these types of things. If you'd prefer that I track all outstanding timeouts or use a WeakRef, please let me know. Happy to adjust.

murgatroid99

All per-line comments apply to both implementations.

channelzTrace.addTrace isn't a replacement for trace. channelzTrace.addTrace only traces to channelz, so that will only be seen by a channelz client. You still want to call trace to log to the console or whatever log file.

Also, keepalive logs may be verbose compared to other server logs, so I think it would be a good idea to have a separate tracer, as the client does. It's probably fine to use the same keepalive tracer name that the client uses.
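A rough sketch of the point made above: a helper that writes each keepalive message to the regular tracer and, when channelz is in use, to the channelz trace as well. The names `ChannelzTraceLike`, `createKeepaliveTrace`, and `logTrace` are hypothetical stand-ins rather than the grpc-js internals; only the `keepalive` tracer name comes from this discussion.

```typescript
// Hypothetical stand-in for a channelz trace object; assumed for this sketch.
type Severity = 'CT_INFO' | 'CT_WARNING' | 'CT_ERROR';

interface ChannelzTraceLike {
  addTrace(severity: Severity, description: string): void;
}

// Tracer name shared by client and server, as suggested above.
const KEEPALIVE_TRACER_NAME = 'keepalive';

function createKeepaliveTrace(
  // Stand-in for the console or log-file tracer.
  logTrace: (tracerName: string, text: string) => void,
  // Present only when channelz is enabled.
  channelzTrace?: ChannelzTraceLike
): (text: string, severity?: Severity) => void {
  return (text: string, severity: Severity = 'CT_INFO') => {
    // Always log to the regular tracer so the message is visible in logs.
    logTrace(KEEPALIVE_TRACER_NAME, text);
    // Additionally record it for channelz clients, if available.
    channelzTrace?.addTrace(severity, text);
  };
}
```

Keeping both calls behind one helper makes it harder to trace to channelz while forgetting the regular log, which is the mistake the review comment points out.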

packages/grpc-js/src/server.ts

davidfiala · 2024-05-29T05:27:42Z

Thanks for your prompt help and code review! I appreciate all the pointers, especially given that this is my first pass through this part of your code base.

Addressed your comments as best I could.
- Unified the tracer name and messages between client and server. The channelz version of the server traces to both channelz and the keepalive tracer. I kept the original version of the text for channelz.
- Unified server and client to both start the ping timer only after a ping response. There are never PINGs in parallel now. I believe this matches your request.
- As part of unifying the code, I wanted to simplify the conditions to check, and thus went with a single timer reference that is used for both types of timeouts (waiting to send a ping and waiting for a ping to time out). I think this makes the code easier to reason about.
- Unified server to match client: if the ping time is negative, pinging is turned off.
- Attempted to make the server ping code feel the same as the client's original code style: the server code mimics the client by using functions for things like starting a ping, starting a timer, and clearing a timer. I had to use lambdas (vs. class methods in the client), though, given that my context scope was a function rather than a class. I rather like it this way now that I see the finished product.
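The single-timer approach in the list above might look roughly like this. It is a sketch under assumptions, not the merged code: `PingSession`, `runKeepalive`, and the helper names are hypothetical, and the session shape is reduced to the few members the sketch needs.

```typescript
// Hypothetical minimal session shape, assumed for this sketch only.
interface PingSession {
  destroyed: boolean;
  ping(callback: (err: Error | null) => void): boolean;
  close(): void;
}

function runKeepalive(
  session: PingSession,
  keepaliveTimeMs: number,
  keepaliveTimeoutMs: number
): () => void {
  // One handle serves both phases: waiting to send a ping,
  // and waiting for the outstanding ping to be acknowledged.
  let timer: ReturnType<typeof setTimeout> | null = null;

  const clearTimer = (): void => {
    if (timer !== null) {
      clearTimeout(timer);
      timer = null;
    }
  };

  const fail = (): void => {
    clearTimer();
    session.close();
  };

  const sendPing = (): void => {
    if (session.destroyed) return;
    // Phase 2: wait for the ack, or fail when the timeout elapses.
    timer = setTimeout(() => {
      timer = null;
      fail();
    }, keepaliveTimeoutMs);
    let sent = false;
    try {
      sent = session.ping(err => {
        clearTimer();
        if (err) fail();
        else scheduleNextPing(); // only now is the next ping scheduled
      });
    } catch {
      sent = false;
    }
    if (!sent) fail();
  };

  const scheduleNextPing = (): void => {
    // A negative keepalive time turns pinging off.
    if (keepaliveTimeMs < 0 || session.destroyed) return;
    // Phase 1: wait before sending the next ping.
    timer = setTimeout(() => {
      timer = null;
      sendPing();
    }, keepaliveTimeMs);
  };

  scheduleNextPing();
  return clearTimer;
}
```

Since the next ping is scheduled only from the acknowledgement callback, at most one timer is pending at any moment, so a single reference is enough.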

Question:

Would you like me to trace to channelz when a ping is outgoing? What about when a ping is received successfully? At present, only bad outcomes go to channelz. IIRC, the original code didn't log good outcomes to channelz.

Thanks again for the help.

murgatroid99

I think the channelz traces are fine the way they are.

packages/grpc-js/src/transport.ts

davidfiala · 2024-06-05T01:14:02Z

Thank you for the review.

I've rewritten to make the server code act more like the client code did for the timers.

I think the only dangling issue for us to align on is the private keepaliveDisabled = false; comment. I know it looks verbose, but unless sure otherwise, I'd prefer to lean against any risk of races. Let me know what you think.

davidfiala · 2024-06-06T02:10:12Z

Per request, keepaliveDisabled state management is removed.

davidfiala · 2024-06-10T17:14:42Z

I've been successfully running the PR branch for many days now on a staging environment with bidi streaming RPCs that sit idle for long periods.

By tweaking the keepalive values on both client and server to be either under or above 10 minutes, I've been able to reproduce the keepalive both (a) working to keep the connection alive without errors, and (b) catching connections that were dropped by intermediate firewalls after a period of idle timeout

...specifically including GCP's 10 minute idle limit firewall ;)

https://cloud.google.com/compute/docs/troubleshooting/general-tips#idle-connections

I've used GRPC_TRACE=transport,keepalive GRPC_VERBOSITY=DEBUG to monitor and verify activity under differing configurations to ensure that the keepalives were what was responsible for both (a) keeping alive and (b) detecting silently dropped TCP connections, per above.
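As an illustration of the kind of configuration being varied here, the sketch below defines keepalive channel options and checks them against an idle limit. The option keys `grpc.keepalive_time_ms` and `grpc.keepalive_timeout_ms` are grpc-js channel options; the specific values, the `ASSUMED_IDLE_LIMIT_MS` constant, and the `pingsBeforeIdleLimit` helper are hypothetical and chosen only for the example.

```typescript
type KeepaliveOptions = {
  'grpc.keepalive_time_ms': number;
  'grpc.keepalive_timeout_ms': number;
};

// Assumed idle limit of 10 minutes, matching the firewall behavior described above.
const ASSUMED_IDLE_LIMIT_MS = 10 * 60 * 1000;

// Illustrative values: ping every 5 minutes, wait 20 seconds for the ack.
const keepaliveOptions: KeepaliveOptions = {
  'grpc.keepalive_time_ms': 5 * 60 * 1000,
  'grpc.keepalive_timeout_ms': 20 * 1000,
};

// Hypothetical helper: true when pings are enabled and sent often enough
// that an idle connection sees traffic before the idle limit is reached.
function pingsBeforeIdleLimit(
  options: KeepaliveOptions,
  idleLimitMs: number = ASSUMED_IDLE_LIMIT_MS
): boolean {
  const keepaliveTimeMs = options['grpc.keepalive_time_ms'];
  return keepaliveTimeMs > 0 && keepaliveTimeMs < idleLimitMs;
}
```

An options object like this would be passed to the grpc-js server constructor and as the options argument of a client, and the process started with GRPC_TRACE=transport,keepalive GRPC_VERBOSITY=DEBUG to watch the pings in the logs.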

davidfiala · 2024-06-17T23:46:29Z

Hi gRPC team: Gentle bump. Let me know if I can provide anything else.

murgatroid99 · 2024-06-18T18:24:50Z

Sorry about the delay, I had to focus on GHSA-7v5v-9h63-cj86. I can get this out soon.

murgatroid99 · 2024-06-24T22:24:04Z

This is now out in version 1.10.10.

davidfiala mentioned this pull request May 28, 2024

Serverside keepalive error detection and cleanups #2756

Closed

remove comment

334f0dc

davidfiala mentioned this pull request May 28, 2024

grpc-js servers not sending keepalives #2734

Closed

murgatroid99 requested changes May 28, 2024

unify server and client keepalive matching comments and discussion on first round of review from grpc#2760

d799a7a

davidfiala added 5 commits May 28, 2024 22:32

add keepalive server trace back in to match channelz vs non-channelz trace behavior

577b4b4

return imports back to original order

7883164

another missing trace message for parity

19cdc12

resolve hoisting

bed5e85

hoist in second location

d325b5f

davidfiala changed the title ~~Serverside keepalive error detection and cleanups and GOAWAY handling~~ Keepalive bugfixes and unify timers strategies between client and server May 29, 2024

Based on grpc#2139 I wrapped http2session.ping in a try-catch block again

a77d94f

murgatroid99 requested changes May 29, 2024

remove keepaliveDisabled from server.ts. rename keepaliveTimer.

c2da436

per discussion, avoid tracking keepalive disabled state and instead depend on whether the session is destroyed

3c5ab22

ensure that client keepalive timers are always cleared when they trigger. this is a necessary change to fit with having removed keepaliveDisabled boolean. manually inspected test logs for both server.ts and transport.ts to verify both types of keepalives are operating correctly.

98cd87f

murgatroid99 added the kokoro:run label Jun 7, 2024

kokoro-team removed the kokoro:run label Jun 7, 2024

murgatroid99 merged commit 5c0226d into grpc:@grpc/grpc-js@1.10.x Jun 18, 2024
4 of 5 checks passed

