feat: Add new Drain tokenizer that splits on most punctuation #13143

Merged · 6 commits into main · Jun 7, 2024

Conversation

@benclive (Contributor) commented Jun 5, 2024

What this PR does / why we need it:

  • Implement a new Tokenizer that splits log lines on most punctuation; `-` characters are treated as part of a single token.
  • Add a new feature to the Tokenizer interface: an opaque state object can be used by the Tokenizer to tokenize a line and later rejoin the results. Here I'm returning an array of token indexes that indicate where to put spaces when joining the string (see the sketch after this list).
  • Implement a new deduplicatePlaceholders method which operates on a string instead of tokens. The token-based method stopped working with the state objects, since the space indexes no longer lined up with the tokens, and I couldn't think of an efficient way to handle this at the token level.
  • Take a look at the tests to see the new output: it generally produces ~10% fewer patterns for a given stream, and they tend to be higher quality (subjectively).
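
For illustration, here is a minimal sketch of that state contract (hypothetical names and state shape, not the PR's actual code; tokenization is simplified to splitting on spaces so the focus stays on how the state is threaded from Tokenize to Join):

```go
package drain

import "strings"

// tokenizeSketch returns the tokens plus an opaque state: here, the indexes
// of tokens that should be followed by a space when the line is rejoined.
func tokenizeSketch(line string) ([]string, interface{}) {
	tokens := strings.Fields(line) // simplified; the PR splits on punctuation too
	spacesAfter := make([]int, 0, len(tokens))
	for i := 0; i < len(tokens)-1; i++ {
		spacesAfter = append(spacesAfter, i) // every gap between Fields tokens was whitespace
	}
	return tokens, spacesAfter
}

// joinSketch reassembles the tokens, consulting the state to decide where
// the original spaces belong.
func joinSketch(tokens []string, state interface{}) string {
	spacesAfter, _ := state.([]int)
	next := 0
	var b strings.Builder
	for i, t := range tokens {
		b.WriteString(t)
		if next < len(spacesAfter) && spacesAfter[next] == i {
			b.WriteByte(' ')
			next++
		}
	}
	return b.String()
}
```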

Performance-wise, this PR uses ~50% more CPU than the previous Drain, but makes far fewer allocations (so hopefully less GC pressure). I will continue with perf optimizations in a separate PR to try to improve this.

Data 1:
Benchmark for using the new "punctuation" tokenizer vs the old "splitting" tokenizer:

$ benchstat drain-splitting-tokenizer.txt drain-punctuation-tokenizer.txt
goos: darwin
goarch: arm64
pkg: github.com/grafana/loki/v3/pkg/pattern/drain
                                                               │ drain-splitting-tokenizer.txt │   drain-punctuation-tokenizer.txt    │
                                                               │            sec/op             │    sec/op     vs base                │
Drain_TrainExtractsPatterns/testdata/agent-logfmt.txt-14                           1.650m ± 2%    2.531m ± 3%  +53.45% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/ingester-logfmt.txt-14                        115.3µ ± 1%    178.1µ ± 1%  +54.48% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/drone-json.txt-14                             255.8µ ± 0%    415.4µ ± 1%  +62.40% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/distributor-logfmt.txt-14                     4.992m ± 1%    7.667m ± 1%  +53.60% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/journald.txt-14                               1.952m ± 4%    2.746m ± 1%  +40.67% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/kafka.txt-14                                  911.3µ ± 1%   1559.7µ ± 2%  +71.15% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/kubernetes.txt-14                             2.090m ± 0%    2.218m ± 3%   +6.14% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/vault.txt-14                                  798.0µ ± 1%   1468.7µ ± 2%  +84.05% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/calico.txt-14                                 1.633m ± 1%    2.172m ± 2%  +32.99% (p=0.000 n=10)
geomean                                                                            1.018m         1.521m       +49.36%

                                                               │ drain-splitting-tokenizer.txt │    drain-punctuation-tokenizer.txt     │
                                                               │             B/op              │     B/op       vs base                 │
Drain_TrainExtractsPatterns/testdata/agent-logfmt.txt-14                          2.046Mi ± 0%    5.766Mi ± 0%  +181.79% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/ingester-logfmt.txt-14                       163.0Ki ± 0%    380.0Ki ± 0%  +133.16% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/drone-json.txt-14                            294.1Ki ± 0%    844.6Ki ± 0%  +187.15% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/distributor-logfmt.txt-14                    6.498Mi ± 0%   16.697Mi ± 0%  +156.96% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/journald.txt-14                              2.843Mi ± 0%    5.662Mi ± 0%   +99.14% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/kafka.txt-14                                 1.106Mi ± 0%    3.180Mi ± 0%  +187.58% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/kubernetes.txt-14                            2.967Mi ± 0%    4.750Mi ± 0%   +60.12% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/vault.txt-14                                 970.6Ki ± 0%   3503.3Ki ± 0%  +260.93% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/calico.txt-14                                2.380Mi ± 0%    4.449Mi ± 0%   +86.93% (p=0.000 n=10)
geomean                                                                           1.327Mi         3.231Mi       +143.41%

                                                               │ drain-splitting-tokenizer.txt │   drain-punctuation-tokenizer.txt   │
                                                               │           allocs/op           │  allocs/op   vs base                │
Drain_TrainExtractsPatterns/testdata/agent-logfmt.txt-14                          16.447k ± 0%   6.180k ± 0%  -62.42% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/ingester-logfmt.txt-14                        1456.0 ± 0%    675.0 ± 0%  -53.64% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/drone-json.txt-14                             3.577k ± 0%   1.299k ± 0%  -63.68% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/distributor-logfmt.txt-14                     65.27k ± 0%   20.46k ± 0%  -68.66% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/journald.txt-14                               17.73k ± 0%   10.46k ± 0%  -41.02% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/kafka.txt-14                                  9.925k ± 0%   5.119k ± 0%  -48.42% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/kubernetes.txt-14                            13.601k ± 0%   6.744k ± 0%  -50.42% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/vault.txt-14                                 11.047k ± 0%   4.169k ± 0%  -62.26% (p=0.000 n=10)
Drain_TrainExtractsPatterns/testdata/calico.txt-14                                14.991k ± 0%   8.462k ± 0%  -43.55% (p=0.000 n=10)
geomean                                                                            10.92k        4.823k       -55.85%

Data 2:
Benchmark for my custom deduplicatePlaceholders vs a solution using regexp.MustCompile("<_>+").ReplaceAllLiteralString:
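
For context, a loop-based sketch of the idea behind deduplicatePlaceholders (an assumption about the intent, not the PR's exact code): collapse runs of adjacent `<_>` placeholders in a single pass instead of invoking the regexp engine.

```go
package drain

import "strings"

const placeholder = "<_>"

// deduplicatePlaceholdersSketch collapses runs of adjacent placeholders
// into one, in a single pass over the string.
func deduplicatePlaceholdersSketch(line string) string {
	// Fast path: no adjacent pair means there is nothing to collapse.
	if !strings.Contains(line, placeholder+placeholder) {
		return line
	}
	var b strings.Builder
	b.Grow(len(line))
	for i := 0; i < len(line); {
		if strings.HasPrefix(line[i:], placeholder) {
			b.WriteString(placeholder)
			i += len(placeholder)
			for strings.HasPrefix(line[i:], placeholder) { // skip the repeats
				i += len(placeholder)
			}
		} else {
			b.WriteByte(line[i])
			i++
		}
	}
	return b.String()
}
```

A fast path like this would explain the near-parity on the cases below that barely change, while the single pass would account for the large wins on placeholder-heavy inputs.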

$ benchstat dedup-regex.txt dedup-loops.txt
goos: darwin
goarch: arm64
                 │ dedup-regex.txt │           dedup-loops.txt           │
                 │     sec/op      │   sec/op     vs base                │
Dedup/Dedup_0-14       1.838n ± 2%   1.880n ± 0%   +2.29% (p=0.001 n=10)
Dedup/Dedup_1-14      142.60n ± 0%   15.63n ± 1%  -89.04% (p=0.000 n=10)
Dedup/Dedup_2-14      1716.0n ± 0%   142.4n ± 0%  -91.70% (p=0.000 n=10)
Dedup/Dedup_3-14       4.567n ± 0%   5.012n ± 0%   +9.75% (p=0.000 n=10)
Dedup/Dedup_4-14       197.3n ± 1%   193.1n ± 2%   -2.13% (p=0.000 n=10)
Dedup/Dedup_5-14       4.567n ± 0%   5.067n ± 0%  +10.95% (p=0.000 n=10)
Dedup/Dedup_6-14       3.490n ± 0%   3.759n ± 0%   +7.71% (p=0.000 n=10)
Dedup/Dedup_7-14      195.78µ ± 0%   10.40µ ± 6%  -94.69% (p=0.000 n=10)
Dedup/Dedup_8-14      176.15n ± 1%   23.87n ± 0%  -86.45% (p=0.000 n=10)
geomean                84.63n        29.91n       -64.66%

                 │ dedup-regex.txt │              dedup-loops.txt              │
                 │      B/op       │     B/op       vs base                    │
Dedup/Dedup_0-14      0.000 ± 0%        0.000 ± 0%          ~ (p=1.000 n=10) ¹
Dedup/Dedup_1-14      32.00 ± 0%        16.00 ± 0%    -50.00% (p=0.000 n=10)
Dedup/Dedup_2-14      32.00 ± 0%       320.00 ± 0%   +900.00% (p=0.000 n=10)
Dedup/Dedup_3-14      0.000 ± 0%        0.000 ± 0%          ~ (p=1.000 n=10) ¹
Dedup/Dedup_4-14      0.000 ± 0%        0.000 ± 0%          ~ (p=1.000 n=10) ¹
Dedup/Dedup_5-14      0.000 ± 0%        0.000 ± 0%          ~ (p=1.000 n=10) ¹
Dedup/Dedup_6-14      0.000 ± 0%        0.000 ± 0%          ~ (p=1.000 n=10) ¹
Dedup/Dedup_7-14    1.421Ki ± 0%     32.000Ki ± 0%  +2152.10% (p=0.000 n=10)
Dedup/Dedup_8-14      56.00 ± 0%        24.00 ± 0%    -57.14% (p=0.000 n=10)
geomean                          ²                    +53.84%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                 │ dedup-regex.txt │           dedup-loops.txt            │
                 │    allocs/op    │ allocs/op   vs base                  │
Dedup/Dedup_0-14      0.000 ± 0%     0.000 ± 0%        ~ (p=1.000 n=10) ¹
Dedup/Dedup_1-14      3.000 ± 0%     1.000 ± 0%  -66.67% (p=0.000 n=10)
Dedup/Dedup_2-14      3.000 ± 0%     1.000 ± 0%  -66.67% (p=0.000 n=10)
Dedup/Dedup_3-14      0.000 ± 0%     0.000 ± 0%        ~ (p=1.000 n=10) ¹
Dedup/Dedup_4-14      0.000 ± 0%     0.000 ± 0%        ~ (p=1.000 n=10) ¹
Dedup/Dedup_5-14      0.000 ± 0%     0.000 ± 0%        ~ (p=1.000 n=10) ¹
Dedup/Dedup_6-14      0.000 ± 0%     0.000 ± 0%        ~ (p=1.000 n=10) ¹
Dedup/Dedup_7-14      9.000 ± 0%     1.000 ± 0%  -88.89% (p=0.000 n=10)
Dedup/Dedup_8-14      4.000 ± 0%     1.000 ± 0%  -75.00% (p=0.000 n=10)
geomean                          ²               -47.39%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

@benclive benclive requested a review from a team as a code owner June 5, 2024 15:42
@benclive benclive requested a review from cyriltovena June 5, 2024 15:42
```diff
@@ -0,0 +1 @@
+package output
```

Contributor:

??

```diff
@@ -139,7 +141,7 @@ func DefaultConfig() *Config {
 	// MaxClusterDepth and SimTh, the less the chance that there will be
 	// "similar" clusters, but the greater the footprint.
 	SimTh:       0.3,
-	MaxChildren: 100,
+	MaxChildren: 15,
```

Contributor:

is that better?


```diff
 type LineTokenizer interface {
-	Tokenize(line string) []string
-	Join(tokens []string) string
+	Tokenize(line string) ([]string, interface{})
```

Contributor:

I wonder if generics would work here, just a thought. I know interfaces have a cost when casting, for instance.
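
For what it's worth, a hypothetical generic variant (an assumption, not code from this PR) would let each tokenizer declare its own state type S and avoid the interface{} boxing and assertion costs:

```go
// S is the tokenizer-specific state threaded from Tokenize to Join.
type LineTokenizer[S any] interface {
	Tokenize(line string) ([]string, S)
	Join(tokens []string, state S) string
}
```

The trade-off is that everything holding a LineTokenizer would then need the type parameter too, which may be why the interface{} version is the simpler fit here.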


```go
func (p *punctuationTokenizer) Tokenize(line string) ([]string, interface{}) {
	tokens := make([]string, len(line))                  // Maximum size: every character is punctuation
	spacesAfter := make([]int, strings.Count(line, " ")) // Could be a bitmap, but it's not worth it for a few bytes.
```

Contributor:

You might want to use a pool for this one. Prometheus has a good sync.Pool that works in buckets.
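
As a sketch of that bucketed-pool idea (hypothetical code, loosely in the spirit of Prometheus's util/pool package, not part of this PR):

```go
package drain

import "sync"

// bucketedPool keeps one sync.Pool per power-of-two capacity, so a Get for
// a given size is served from the smallest bucket that fits.
type bucketedPool struct {
	sizes   []int
	buckets []*sync.Pool
}

func newBucketedPool(minSize, maxSize int) *bucketedPool {
	p := &bucketedPool{}
	for sz := minSize; sz <= maxSize; sz *= 2 {
		size := sz // capture for the closure below
		p.sizes = append(p.sizes, size)
		p.buckets = append(p.buckets, &sync.Pool{
			New: func() interface{} { return make([]string, 0, size) },
		})
	}
	return p
}

// Get returns a token slice with capacity >= n, falling back to a fresh
// allocation when n exceeds the largest bucket.
func (p *bucketedPool) Get(n int) []string {
	for i, sz := range p.sizes {
		if n <= sz {
			return p.buckets[i].Get().([]string)[:0]
		}
	}
	return make([]string, 0, n)
}

// Put returns a slice to the bucket that exactly matches its capacity, if any.
func (p *bucketedPool) Put(s []string) {
	for i, sz := range p.sizes {
		if cap(s) == sz {
			p.buckets[i].Put(s[:0]) // reslice to zero length so stale tokens aren't reused
			return
		}
	}
}
```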

@cyriltovena (Contributor) left a comment:

LGTM

Let's try it!

@cyriltovena cyriltovena merged commit 6a0fdd0 into main Jun 7, 2024
59 checks passed
@cyriltovena cyriltovena deleted the add-new-tokenizer-that-splits-aggressively branch June 7, 2024 10:59