Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add syslog batching implementation #491

Merged
merged 14 commits into from
Oct 3, 2024
Merged

Conversation

nicklas-dohrn
Copy link
Contributor

@nicklas-dohrn nicklas-dohrn commented Feb 13, 2024

Description

This is our proposal to implement syslog batching for sending logs via https.
it includes a switch between the normal syslog one log per request mode via a syslog query parameter.
This can be done with the https-batch:

https-batch://<your-drain-url>/syslog

If you enable the syslog batching behaviour, it will currently write syslogbatches, where single messages are newline delimited (\n).
Currently, the batch sizes are hardwired to be around 256kb, which is already sufficient for speeding up throughput by a factor of 10x at least.
making it configurable would be an option, but I did not see the need so far.
please let me know what you think of the current approach.

Copy link

linux-foundation-easycla bot commented Feb 13, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@nicklas-dohrn nicklas-dohrn marked this pull request as ready for review April 4, 2024 05:19
@nicklas-dohrn nicklas-dohrn requested a review from a team as a code owner April 4, 2024 05:19
Copy link
Member

@ctlong ctlong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, it looks fine to me. I don't seem where it adds the newline character to delimit between syslog lines though...

Would love to see a demo at the next ARP WG meeting.

(I sort of disregarded that this was a POC at points in time and some of my comments are more implementation-focused, sorry about that 😅 )

src/pkg/egress/syslog/https.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/https.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/https.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/https.go Outdated Show resolved Hide resolved
@nicklas-dohrn
Copy link
Contributor Author

The newline is already part of the syslog messages, so these are added already by a method beforhand (linked for anyone curious):
msg := appendNewline(removeNulls(env.GetLog().Payload))

this is true for all possible syslog messages, so I do not even need to add this, which is really convenient.

@nicklas-dohrn
Copy link
Contributor Author

Adressed all the comments and additions by @ctlong above, if sufficient, please close the threads :)

Copy link
Member

@ctlong ctlong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've still some specific concerns, which I've left as comments in this review.

In general, the implementation looks fine, though I'm not sure that I understand the necessity of the new TriggerTimer struct.

src/pkg/egress/syslog/triggerTimer.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/https.go Outdated Show resolved Hide resolved
@ctlong
Copy link
Member

ctlong commented Apr 17, 2024

@nicklas-dohrn can you please sign the CLA. We can't merge this unless you've done so.

@chombium
Copy link
Contributor

@ctlong I will take care about the CLA. @nicklas-dohrn has to be added to one of our GitHub orgs.

Copy link
Member

@ctlong ctlong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually, I think this proof of concept is correct. Implementation-wise the timer still has some issues.

Once those are fixed, I would suggest rebasing this off #573 and testing the two changes together to see if it achieves the throughput you want. Then we're all ready for a real implementation (with tests).

🙏 Could you please also update the PR description, thanks.

src/pkg/egress/syslog/https_batch.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/https_batch.go Show resolved Hide resolved
src/pkg/egress/syslog/https_batch.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/https_batch.go Show resolved Hide resolved
src/pkg/egress/syslog/https_batch.go Show resolved Hide resolved
src/pkg/egress/syslog/https_batch.go Show resolved Hide resolved
acrmp
acrmp previously requested changes Jun 22, 2024
src/pkg/egress/syslog/https_batch.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/https_batch.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/https_batch.go Show resolved Hide resolved
src/pkg/egress/syslog/https_batch_test.go Show resolved Hide resolved
src/pkg/egress/syslog/https_batch_test.go Show resolved Hide resolved
src/pkg/egress/syslog/https_batch_test.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/syslog_connector.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/https_batch.go Outdated Show resolved Hide resolved
@nicklas-dohrn
Copy link
Contributor Author

I reimplemented the changes using a similar approach to what @ctlong proposed.
This will make the change way more concise, but also gets rid of all -race conflicts shown by go test -race.
There are some race complaints left, but these only refer to the implementation of the tests themselves, which inherently are data-races by design.
@ctlong and @acrmp, If your concerns above are adressed, please close everything that is done, so we can keep this organised.

@nicklas-dohrn nicklas-dohrn requested a review from acrmp June 26, 2024 04:57
@nicklas-dohrn nicklas-dohrn changed the title Add syslog batching poc implementation Add syslog batching implementation Aug 5, 2024
@nicklas-dohrn
Copy link
Contributor Author

I did some elaborate testing on the current and new approach for syslog-batching, sending from our dev cf landscape with 4 diego cells and 4 loggregator agents to a cls instance.
I also tested #573 (HTTPS drains reuse and release fasthttp),
there were no mayor improvements of throughput to be seen, cpu consumption of both the new and old version is minimal due to being network bounded.
There are some good news, concerning batching, that it will considerably speed up throughput, and also reduce drops.
(see attached table for information)
The concurrent refers to a version of the https drain, which I modified with a go routine to allow usage of more than one cpu.
The current approach would increase the throughput considerably, but this results in a new issue:
If multiple applications on one diego cell would bind to a cls instance, they would share the throughput constraints of one cpu (I tested that this is indeed the behaviour).
This leads to bottleneck issues on bigger cf landscapes, as these use way bigger diego cells, consequently using bigger loggregator instances with more cpu cores, which does not scale for this approach at all.
@ctlong and @acrmp,
I would like to hear your thoughts on this issue, and how we can proceed making this one work at scale.
image

@juergen-walter
Copy link

@ctlong and @acrmp have many customers suffering and complaining about log drops. We would highly appreciate if this PR could be finalized/merged in in a timely manner. Appreciate your efforts so far, best regards.

Copy link
Contributor

@chombium chombium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally it looks fine, I found two little things.

I will wait on @ctlong for his review

src/pkg/egress/syslog/https_batch_test.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/https_test.go Outdated Show resolved Hide resolved
src/pkg/egress/syslog/https.go Show resolved Hide resolved
src/pkg/egress/syslog/https.go Outdated Show resolved Hide resolved
This is a new approach to switch between http and http batching.
It only is different in this regard from the previous attempts,
and only contains refactorings besides this change.
@nicklas-dohrn
Copy link
Contributor Author

I rebased off the current main of loggregator agent, and adopted all changes to work with the https batching.

@ctlong
Copy link
Member

ctlong commented Sep 12, 2024

@nicklas-dohrn the tests and linting are still failing.

Also, I'm not sure what's going on with the history of this branch but the full PR does not reflect the latest commit you added to adopt the changes I'd suggested. Like if you go to Files Changed tab or pull the target branch they're not there 🤔

@nicklas-dohrn
Copy link
Contributor Author

Will try getting it to work now.

@nicklas-dohrn
Copy link
Contributor Author

I got all the tests to work again, was only due to changing settings for testing purposes, and simply forgot to change them back.

@nicklas-dohrn
Copy link
Contributor Author

nicklas-dohrn commented Sep 16, 2024

There is a real issue turned up by the linting errors:
The write function for the batching implementation cannot return an error because of two issues with the changed logic:

  • returning a send error for async sends would need a callback function.
  • even if the returning would work, the error would be for the batch and not the single message to which the error would be returned, breaking the retry logic.

This is unsolvable in the current architecture, as the inversion of batch creation and retry logic would be the way better approach. This is difficult due to the creation of batching in front of unmarshalling the envelopes.
This also turned up an issue with the current retry implementation, I raised it in a different issue to keep topics on track

@ctlong what is your opinion on this issue?

P.S.: I could disable it by appending //nolint:errcheck, but this seems not like a not needed error check, as it triggers retry attempts.

@ctlong
Copy link
Member

ctlong commented Sep 17, 2024

I see what you mean and I agree that finding a way to return an error does seems like something that should be added eventually. However, how badly do you want retries and error logs on write failures? If you want to disable this error with a TODO to come back and refactor it, I'm willing to approve that to get this change through since you've been waiting a long time for it.

I don't think the "right" fix is very straightforward unfortunately. It seems like this writer should either have its own retry and logging functionality, or else inline batch writes similar to how you had them before. The former approach would be more complex while the latter approach comes with the obvious downside of envelopes potentially never being sent if there isn't a constant stream of them. What do you think? Maybe you have a better idea?

@nicklas-dohrn
Copy link
Contributor Author

For now, I would go to implementing the retry directly as a temporary within the http batching code, omitting the error reporting to the retry writer for now, essentially making it inert.
That would lead to the wanted behaviour without making to many changes in other parts of the program.
I still need to try out the effects of retries, as there might be some issues with the current implementation on error cases for throughput.

@nicklas-dohrn
Copy link
Contributor Author

nicklas-dohrn commented Sep 26, 2024

Hey @ctlong,
I finally was able to create a final draft, so I would like to merge the current state, with any remarks you have to be implemented added for sure.

I opened an issue for discussion on the state of the retry stack, so this issue gets discoupled from that, as it is a different issue to solve:
#613

@ctlong
Copy link
Member

ctlong commented Sep 30, 2024

@nicklas-dohrn this PR is still failing linting and unit tests

@nicklas-dohrn
Copy link
Contributor Author

will fix that, was thinking that was fixed by the additions I made.

@ameowlia ameowlia removed the request for review from acrmp October 3, 2024 14:34
@ctlong ctlong dismissed stale reviews from chombium and acrmp October 3, 2024 16:42

Addressed

@ctlong ctlong assigned ctlong and unassigned nicklas-dohrn Oct 3, 2024
@ctlong ctlong merged commit 75f9a92 into cloudfoundry:main Oct 3, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

5 participants