Support customizing the rebalance threshold #422

Open

cenkalti opened this issue Mar 14, 2023 · 17 comments

Comments

@cenkalti
Member

bbolt/node.go

Lines 420 to 424 in 838586a

// Ignore if node is above threshold (25%) and has enough keys.
var threshold = n.bucket.tx.db.pageSize / 4
if n.size() > threshold && len(n.inodes) > n.minKeys() {
	return
}

How was this number chosen?

What happens if we make it configurable and increase or decrease it?

@ahrtr
Member

ahrtr commented Mar 15, 2023

The intention might have been to set it to 50% * DefaultFillPercent (50% * 0.5 = 25%)? cc @benbjohnson

@ahrtr
Member

ahrtr commented Apr 26, 2023

I think we can add a FillPercent field (defaulting to 0.5) to both Options and DB. It would be the default value for Bucket.FillPercent; users could still set a different value after opening a bucket if they want.

// FillPercent sets the threshold for filling nodes when they split. By default,
// the bucket will fill to 50%, but it can be useful to increase this
// amount if you know that your write workloads are mostly append-only.
FillPercent float64

The rebalance threshold should then be 50% * FillPercent, instead of the hard-coded value:

bbolt/node.go

Line 375 in bd7d6e9

var threshold = n.bucket.tx.db.pageSize / 4
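
A rough sketch of that derivation (this is not existing bbolt code; the function is made up purely for illustration):

	// rebalanceThresholdFor derives the rebalance threshold from the bucket's
	// FillPercent, i.e. 50% * FillPercent of the page size. With the default
	// FillPercent of 0.5 this reproduces today's pageSize/4 threshold.
	func rebalanceThresholdFor(pageSize int, fillPercent float64) int {
		return int(float64(pageSize) * 0.5 * fillPercent)
	}

For example, rebalanceThresholdFor(4096, 0.5) returns 1024, which matches the current pageSize/4.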

@cenkalti
Member Author

Can we make the rebalance threshold configurable, defaulting to 25% (to keep it backwards compatible)?

It can be set on the Bucket, similar to the FillPercent field.

type Bucket struct {

	// Sets the threshold for filling nodes when they split. By default,
	// the bucket will fill to 50% but it can be useful to increase this
	// amount if you know that your write workloads are mostly append-only.
	//
	// This is non-persisted across transactions so it must be set in every Tx.
	FillPercent float64

	// new field
	RebalancePercent float64
}

I think this is a simpler change to make.
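
For reference, a minimal sketch of how the rebalance check could consume such a field (RebalancePercent is hypothetical; the fallback keeps today's behaviour when it is unset):

	// rebalanceThreshold computes the per-bucket rebalance threshold, falling
	// back to the current hard-coded 25% of the page size when the hypothetical
	// RebalancePercent field is left at its zero value.
	func rebalanceThreshold(pageSize int, rebalancePercent float64) int {
		if rebalancePercent <= 0 {
			rebalancePercent = 0.25 // equivalent to today's pageSize / 4
		}
		return int(float64(pageSize) * rebalancePercent)
	}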

@ahrtr
Member

ahrtr commented Apr 26, 2023

Can we make the rebalance threshold configurable, defaulting to 25% (to keep it backwards compatible)?

More flexibility isn't necessarily better. I think we'd better not expose a RebalancePercent parameter. Usually a B-tree is kept balanced after deletion by merging or redistributing keys among siblings to maintain the d-key minimum for non-root nodes, where d is half the maximum number of keys.

Similarly, I think it makes sense to set the rebalancePercent to 0.5 * FillPercent.

@cenkalti
Member Author

cenkalti commented May 8, 2023

I am trying to maximize page utilization for etcd's use case.

Compaction deletes revisions from the "key" bucket, but page nodes do not get rebalanced until the amount of data in a node drops below 25% of the page size.

Compaction deletes revisions up to a given revision number, but skips a revision if it is the only revision of a key.

Because revision numbers are monotonically increasing, there is no point in having free space in any page other than the last page at the end of the B+ tree.

If a page is empty, it gets removed from the tree and added to the freelist. However, these half-filled pages could stay forever and waste disk and memory space.

If my understanding is correct and we can solve this issue, we can eliminate the need to defrag completely.

@cenkalti cenkalti changed the title from "Question: Why rebalance page threshold is 25%?" to "Why rebalance page threshold is 25%?" on May 9, 2023
@cenkalti
Member Author

Write keys to etcd, filling roughly 3.5 pages. For the KV size below, one page takes up to ~60 items.

import etcd3

client = etcd3.client()

key = "mykey"
val = "." * 10

# put a few unique keys
for i in range(30):
    client.put(key=key+str(i), value=val)

# overwrite the same key
for i in range(180):
    client.put(key=key, value=val)

After writing data, pages are:

% bbolt pages ~/projects/etcd/default.etcd/member/snap/db
ID       TYPE       ITEMS  OVRFLW
======== ========== ====== ======
0        meta       0
1        meta       0
2        leaf       62
3        leaf       61
4        leaf       63
5        leaf       24
6        branch     4
7        leaf       10
8        free
9        free
10       free

Pages of the “key” bucket:

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=7>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=11>
Txn ID:     12
Checksum:   6e26f9e5b4162263

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 1
Page ID:    1
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=10>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=11>
Txn ID:     11
Checksum:   03270b449e68cf59

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 7
Page ID:    7
Page Type:  leaf
Total Size: 4096 bytes
Overflow pages: 0
Consumed Size: 368
Item Count: 10

"alarm": <pgid=0,seq=0>
"auth": <pgid=0,seq=0>
"authRoles": <pgid=0,seq=0>
"authUsers": <pgid=0,seq=0>
"cluster": <pgid=0,seq=0>
"key": <pgid=6,seq=0>
"lease": <pgid=0,seq=0>
"members": <pgid=0,seq=0>
"members_removed": <pgid=0,seq=0>
"meta": <pgid=0,seq=0>

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 6
Page ID:    6
Page Type:  branch
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 4

00000000000000025f0000000000000000: <pgid=2>
00000000000000405f0000000000000000: <pgid=4>
000000000000007f5f0000000000000000: <pgid=3>
00000000000000bc5f0000000000000000: <pgid=5>

Pages 2, 4, 3, and 5 have 62, 63, 61, and 24 items respectively. The first 30 keys are unique; the remaining 180 puts are revisions of the same key.

Get latest revision:

% bin/etcdctl get '/' -w json | jq .header.revision
211

Compact to the middle of the 3rd page:

% etcdctl compact 150
compacted revision 150

After compaction, the “key” bucket contains these pages:

% bbolt pages ~/projects/etcd/default.etcd/member/snap/db
ID       TYPE       ITEMS  OVRFLW
======== ========== ====== ======
0        meta       0
1        meta       0
2        free
3        free
4        free
5        leaf       24
6        leaf       10
7        leaf       30
8        free
9        leaf       38
10       branch     3
11       free
% bbolt page ~/projects/etcd/default.etcd/member/snap/db 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=11>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=12>
Txn ID:     18
Checksum:   13f6eecf472396c6

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 1
Page ID:    1
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=6>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=12>
Txn ID:     19
Checksum:   1ef4a4bd5af5d6fa

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 6
Page ID:    6
Page Type:  leaf
Total Size: 4096 bytes
Overflow pages: 0
Consumed Size: 369
Item Count: 10

"alarm": <pgid=0,seq=0>
"auth": <pgid=0,seq=0>
"authRoles": <pgid=0,seq=0>
"authUsers": <pgid=0,seq=0>
"cluster": <pgid=0,seq=0>
"key": <pgid=10,seq=0>
"lease": <pgid=0,seq=0>
"members": <pgid=0,seq=0>
"members_removed": <pgid=0,seq=0>
"meta": <pgid=0,seq=0>

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 10
Page ID:    10
Page Type:  branch
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 3

00000000000000025f0000000000000000: <pgid=7>
00000000000000965f0000000000000000: <pgid=9>
00000000000000bc5f0000000000000000: <pgid=5>

Pages 7, 9, and 5 have 30, 38, and 24 items respectively.

         1st page   2nd page   3rd page   4th page
Before   62         63         61         24
After    30         deleted    38         24

The 30 unique keys in the 1st page will never get deleted if they don’t get a new revision, because etcd never deletes the latest revision of a key.

Since revision numbers are monotonically increasing, the items in the first page will occupy a full page forever unless that page's utilization drops below 25%, which is not a configurable value.

@cenkalti cenkalti self-assigned this May 19, 2023
@ahrtr
Member

ahrtr commented May 20, 2023

The compacted/removed keys (revisions in etcd) may be scattered across the db pages, but the key point is that etcd always adds new K/V pairs at the end of the last page. So for etcd, we should set a high value for both FillPercent (e.g. 1.0) and RebalancePercent (e.g. 0.9); this way, we can eventually get rid of defragmentation in etcd 5.0+. Note we would still need defragmentation in the transition stage, because some middle pages will never be rebalanced if nothing is deleted from them.

The downside is that there will be more rebalance operations, and accordingly more page writing & syncing, in each etcd compaction operation. So we need to evaluate the performance impact.

The other concern is that there are other buckets (e.g. lease, auth*) which are not append-only. Since K8s doesn't use auth, that isn't a problem. I am not sure about the exact usage of lease in K8s, e.g. how often leases are created & deleted. Please also get this clarified.

Note: it only makes sense to set a high value for FillPercent and RebalancePercent for append-only use cases (specifically, only adding K/V at the tail), e.g. etcd.
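
If both knobs existed, an append-only user such as etcd might configure them in every write transaction, roughly like this (RebalancePercent is still a hypothetical field; FillPercent is existing bbolt API, but it is not persisted and must be set per transaction):

	import bolt "go.etcd.io/bbolt"

	func compactRevisions(db *bolt.DB) error {
		return db.Update(func(tx *bolt.Tx) error {
			b := tx.Bucket([]byte("key"))
			if b == nil {
				return nil
			}
			// Pack appended pages as full as possible (existing bbolt field).
			b.FillPercent = 1.0
			// Hypothetical field from this issue: rebalance aggressively so pages
			// emptied by compaction get merged instead of lingering half-full.
			b.RebalancePercent = 0.9
			// ... delete compacted revisions here ...
			return nil
		})
	}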

@ahrtr ahrtr changed the title from "Why rebalance page threshold is 25%?" to "Support customizing the rebalance threshold" on May 20, 2023
@ahrtr
Member

ahrtr commented May 20, 2023

Note it still makes sense to set the rebalancePercent to 0.5 * FillPercent by default, as mentioned in my previous comment #422 (comment). We just need to support customizing it so that users (e.g. etcd) can set a different value based on their scenario.

@cenkalti
Member Author

After #518 is merged, users are able to set the rebalance percent to values between 0.0 and 0.5.

However, it's still not possible to set it to values larger than 0.5.

Can we consider again introducing a new field that is set to 0.25 by default?

type Bucket struct {
	FillPercent float64

	// new field
	RebalancePercent float64
}

@ahrtr
Member

ahrtr commented May 31, 2023

If you read my comment above and the new title of this issue, "Support customizing the rebalance threshold", you will see that I am not against the proposal. But please think about the concern raised in the comment above.

@cenkalti
Member Author

cenkalti commented Jun 1, 2023

Cool. I'll try to come up with a way to evaluate the performance impact.

@ahrtr
Member

ahrtr commented May 9, 2024

@cenkalti are you still working on this? Please feel free to let me know if you don't have the bandwidth; I am happy to take over.

@cenkalti
Member Author

cenkalti commented May 9, 2024

No, I am not working on this anymore.

@cenkalti cenkalti removed their assignment May 9, 2024
@ahrtr ahrtr self-assigned this May 9, 2024
@ahrtr
Member

ahrtr commented May 20, 2024

It seems that it isn't that simple. Setting a bigger RebalancePercent triggers rebalancing more easily, and accordingly a node may be merged with its next or previous node more easily. But the merged node may be split again if we don't set a reasonable combination of RebalancePercent and FillPercent:

bbolt/node.go

Lines 230 to 246 in 7eb39a6

// Ignore the split if the page doesn't have at least enough nodes for
// two pages or if the nodes can fit in a single page.
if len(n.inodes) <= (common.MinKeysPerPage*2) || n.sizeLessThan(pageSize) {
	return n, nil
}

// Determine the threshold before starting a new node.
var fillPercent = n.bucket.FillPercent
if fillPercent < minFillPercent {
	fillPercent = minFillPercent
} else if fillPercent > maxFillPercent {
	fillPercent = maxFillPercent
}
threshold := int(float64(pageSize) * fillPercent)

// Determine split position and sizes of the two pages.
splitIndex, _ := n.splitIndex(threshold)

The default values for FillPercent and RebalancePercent are 50% and 25% respectively, so a merged node/page is at most 75% of pageSize and will never be split again in the current transaction by default. But if we set a bigger RebalancePercent, a merged node may be split again.
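
Back-of-the-envelope arithmetic for that interaction (illustrative numbers only, not actual bbolt logic):

	package main

	import "fmt"

	func main() {
		const pageSize = 4096.0
		// Defaults: a node is merged only once it drops below 25% of a page, so
		// the merged result stays around 0.25 + 0.50 = 0.75 of a page and is
		// never re-split in the same transaction (a split also requires the node
		// to exceed a full page).
		fmt.Println(0.25*pageSize + 0.50*pageSize) // 3072: fits in one page
		// With RebalancePercent = 0.9, the node being merged and its sibling can
		// both be close to a full page, so the merged node can exceed pageSize
		// and be split again unless FillPercent is raised to match.
		fmt.Println(0.90*pageSize + 0.90*pageSize) // 7372.8: no longer fits, splits again
	}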

@tjungblu
Contributor

What do you guys want to optimize for? Tighter packing of pages? Fewer memory allocations?

@ahrtr
Member

ahrtr commented May 21, 2024

eventually get rid of defragmentation

The goal was to eventually get rid of defragmentation, see #422 (comment).

@tjungblu
Contributor

Ah, I see. So we have different thresholds, we wanted to make them configurable, and now you've found a new parameter that needs tweaking alongside them :)

@github-actions github-actions bot added the stale label Aug 20, 2024
@github-actions github-actions bot closed this as not planned Sep 10, 2024
@ahrtr ahrtr reopened this Sep 10, 2024
@github-actions github-actions bot removed the stale label Sep 11, 2024