KeyDB segmentation faults on 6.3.4 #7238

Closed
2 tasks done
StefanB7 opened this issue Dec 8, 2023 · 1 comment · Fixed by #7236
Labels
bug Something isn't working

StefanB7 commented Dec 8, 2023

Actions before raising this issue

  • I searched the existing issues and did not find anything similar.
  • I read/searched the docs

Steps to Reproduce

  1. Run the newest version of CVAT (which uses KeyDB 6.3.4 as cvat_redis).
  2. Keep the server running for a while (we usually saw the problem after 5-6 hours of uptime).

We noticed abnormally high CPU usage by the KeyDB instance (the process name was keydb-bgsave). It used 100% of one core, whereas it used far less before we upgraded to the latest CVAT version, 2.9.1. On further investigation we found that the KeyDB instance was segfaulting. The dmesg output of the server contained entries such as:

[240164.761699] keydb-server[783976]: segfault at 0 ip 00007fe5228ecb8a sp 00007fe5228ecb90 error 4
[240164.761706] Code: 60 cb 8e 22 e5 7f 00 00 68 cb 8e 22 e5 7f 00 00 70 cb 8e 22 e5 7f 00 00 78 cb 8e 22 e5 7f 00 00 80 cb 8e 22 e5 7f 00 00 88 cb <8e> 22 e5 7f 00 00 90 cb 8e 22 e5 7f 00 00 98 cb 8e 22 e5 7f 00 00
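
For anyone who wants to check their own host for the same symptom, here is a minimal sketch that pulls segfault entries out of dmesg text. The regular expression is modelled only on the line shown above, and the names `SEGFAULT_RE` and `parse_segfaults` are made up for this example; other kernels may format the line differently.

```python
import re

# Pattern modelled on the segfault line quoted in this report (an assumption,
# not a kernel-wide guarantee about the message format).
SEGFAULT_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+"
    r"(?P<process>[^\[\s]+)\[(?P<pid>\d+)\]: "
    r"segfault at (?P<address>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<error>\d+)"
)


def parse_segfaults(dmesg_text):
    """Return one dict per segfault line found in the given dmesg text."""
    events = []
    for match in SEGFAULT_RE.finditer(dmesg_text):
        events.append({
            "uptime_s": float(match.group("uptime")),
            "process": match.group("process"),
            "pid": int(match.group("pid")),
            "address": int(match.group("address"), 16),
            "error": int(match.group("error")),
        })
    return events


# The sample is the first dmesg line from this report.
sample = (
    "[240164.761699] keydb-server[783976]: segfault at 0 "
    "ip 00007fe5228ecb8a sp 00007fe5228ecb90 error 4\n"
)
for event in parse_segfaults(sample):
    print(event["process"], event["pid"], hex(event["address"]), event["error"])
```

Feeding it the output of `dmesg` and filtering on `event["process"]` should show whether keydb-server is the process that is crashing.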

The cvat_redis container also produced a lot of log output, such as:

------ MODULES INFO OUTPUT ------

------ FAST MEMORY TEST ------
*** Preparing to test memory region 556d493b7000 (2334720 bytes)
*** Preparing to test memory region 7fe50a0df000 (26738688 bytes)
*** Preparing to test memory region 7fe50ba5f000 (22020096 bytes)
*** Preparing to test memory region 7fe50fc80000 (295174144 bytes)
*** Preparing to test memory region 7fe521774000 (4718592 bytes)
*** Preparing to test memory region 7fe521bf5000 (8388608 bytes)
*** Preparing to test memory region 7fe5223f6000 (8388608 bytes)
*** Preparing to test memory region 7fe522bf7000 (8388608 bytes)
*** Preparing to test memory region 7fe5233f8000 (8388608 bytes)
*** Preparing to test memory region 7fe523bf9000 (8388608 bytes)
*** Preparing to test memory region 7fe5243fa000 (8388608 bytes)
*** Preparing to test memory region 7fe524bfb000 (8388608 bytes)
*** Preparing to test memory region 7fe526d23000 (60817408 bytes)
*** Preparing to test memory region 7fe52d7fb000 (8388608 bytes)
*** Preparing to test memory region 7fe52dffc000 (8388608 bytes)
*** Preparing to test memory region 7fe52e7fd000 (8388608 bytes)
*** Preparing to test memory region 7fe52effe000 (8388608 bytes)
*** Preparing to test memory region 7fe52f800000 (20971520 bytes)
*** Preparing to test memory region 7fe530c00000 (29360128 bytes)
*** Preparing to test memory region 7fe532800000 (2097152 bytes)
*** Preparing to test memory region 7fe532bff000 (4194304 bytes)
*** Preparing to test memory region 7fe533000000 (8388608 bytes)
*** Preparing to test memory region 7fe533800000 (2097152 bytes)
*** Preparing to test memory region 7fe533bfe000 (8388608 bytes)
*** Preparing to test memory region 7fe5343fe000 (2097152 bytes)
*** Preparing to test memory region 7fe534600000 (26214400 bytes)
*** Preparing to test memory region 7fe535f00000 (30408704 bytes)
*** Preparing to test memory region 7fe537dfb000 (2097152 bytes)
*** Preparing to test memory region 7fe537ffc000 (8388608 bytes)
*** Preparing to test memory region 7fe5387fd000 (8388608 bytes)
*** Preparing to test memory region 7fe538ffe000 (8388608 bytes)
*** Preparing to test memory region 7fe5397ff000 (8388608 bytes)
*** Preparing to test memory region 7fe53a000000 (8388608 bytes)
*** Preparing to test memory region 7fe53a800000 (2097152 bytes)
*** Preparing to test memory region 7fe53ac00000 (8388608 bytes)
*** Preparing to test memory region 7fe53b400000 (8388608 bytes)
*** Preparing to test memory region 7fe53bc70000 (24576 bytes)
*** Preparing to test memory region 7fe53bd9f000 (45056 bytes)
*** Preparing to test memory region 7fe53bddd000 (32768 bytes)
*** Preparing to test memory region 7fe53bf5b000 (4096 bytes)
*** Preparing to test memory region 7fe53bf98000 (8192 bytes)
*** Preparing to test memory region 7fe53bff6000 (4096 bytes)
*** Preparing to test memory region 7fe53c12f000 (4096 bytes)
*** Preparing to test memory region 7fe53c13c000 (8192 bytes)
*** Preparing to test memory region 7fe53c316000 (8192 bytes)
*** Preparing to test memory region 7fe53c332000 (8192 bytes)
*** Preparing to test memory region 7fe53c37a000 (4096 bytes)
*** Preparing to test memory region 7fe53c458000 (8192 bytes)
*** Preparing to test memory region 7fe53c723000 (8192 bytes)
*** Preparing to test memory region 7fe53c8b5000 (8192 bytes)
*** Preparing to test memory region 7fe53c91c000 (8192 bytes)
*** Preparing to test memory region 7fe53ca0c000 (8192 bytes)
*** Preparing to test memory region 7fe53caff000 (8192 bytes)
*** Preparing to test memory region 7fe53cdcc000 (16384 bytes)
*** Preparing to test memory region 7fe53cdef000 (16384 bytes)
*** Preparing to test memory region 7fe53ce0e000 (8192 bytes)
*** Preparing to test memory region 7fe53d13e000 (12288 bytes)
*** Preparing to test memory region 7fe53d413000 (16384 bytes)
*** Preparing to test memory region 7fe53d4cc000 (8192 bytes)
*** Preparing to test memory region 7fe53d500000 (4096 bytes)
.O.O.O.O.O.O.1:556:M 07 Dec 2023 08:13:22.003 # Background saving cancelled
1:556:M 07 Dec 2023 08:13:22.104 * 1 changes in 900 seconds. Saving...
1:556:M 07 Dec 2023 08:13:22.110 * Background saving started by pid 12783
1:556:M 07 Dec 2023 08:13:22.110 * Background saving started


=== KEYDB BUG REPORT START: Cut & paste starting from here ===
12783:556:C 07 Dec 2023 08:13:23.029 # === ASSERTION FAILED ===
12783:556:C 07 Dec 2023 08:13:23.029 # ==> rdb.cpp:1372 'ckeysExpired == db->expireSize()' is not true

------ STACK TRACE ------

Backtrace:
keydb-rdb-bgsave *:6379(rdbSaveRio(_rio*, redisDbPersistentDataSnapshot const**, int*, int, rdbSaveInfo*)+0x54d) [0x556d48cab06d]
keydb-rdb-bgsave *:6379(rdbSaveFile(char*, redisDbPersistentDataSnapshot const**, rdbSaveInfo*)+0xc7) [0x556d48cab167]
keydb-rdb-bgsave *:6379(rdbSave(redisDbPersistentDataSnapshot const**, rdbSaveInfo*)+0x6e) [0x556d48cab49e]
keydb-rdb-bgsave *:6379(rdbSaveBackgroundFork(rdbSaveInfo*)+0x104) [0x556d48cabb64]
keydb-rdb-bgsave *:6379(launchRdbSaveThread(unsigned long&, rdbSaveInfo*)+0x45) [0x556d48cabc25]
keydb-rdb-bgsave *:6379(rdbSaveBackground(rdbSaveInfo*)+0x88) [0x556d48cac058]
keydb-rdb-bgsave *:6379(serverCron(aeEventLoop*, long long, void*)+0xcc0) [0x556d48d04e50]
keydb-rdb-bgsave *:6379(aeProcessEvents+0x224) [0x556d48d0b5a4]
keydb-rdb-bgsave *:6379(aeMain+0x3a) [0x556d48d0fcda]
keydb-rdb-bgsave *:6379(workerThreadMain(void*)+0x7e) [0x556d48cf7ade]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fe53cdd8609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fe53ccfd133]

------ INFO OUTPUT ------
# Server
redis_version:6.3.4
redis_git_sha1:7e7e5e57
redis_git_dirty:1
redis_build_id:bb8fc59400781b64
redis_mode:standalone
os:Linux 5.15.0-89-generic x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:9.4.0
process_id:12783
process_supervised:no
run_id:34a268b72ead250968fc5019c10fd16b119cf1b0
tcp_port:6379
server_time_usec:1701936802104528
uptime_in_seconds:23957
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:7437986
executable:/data/keydb-server
config_file:/etc/keydb/keydb.conf
availability_zone:
features:cluster_mget

The crash seems to occur each time KeyDB was testing the memory. Another error we got:

*** Preparing to test memory region 7fe53d13e000 (12288 bytes)
*** Preparing to test memory region 7fe53d413000 (16384 bytes)
*** Preparing to test memory region 7fe53d4cc000 (8192 bytes)
*** Preparing to test memory region 7fe53d500000 (4096 bytes)
.O.O.O.O.O.12175:556:C 07 Dec 2023 07:39:30.364 # KeyDB 6.3.4 crashed by signal: 11, si_code: 1
12175:556:C 07 Dec 2023 07:39:30.364 # Accessing address: 0xffffffffe5229ee9
12175:556:C 07 Dec 2023 07:39:30.364 # Crashed running the instruction at: 0x7fe5229ee98f

------ STACK TRACE ------
EIP:
[0x7fe5229ee98f]

Backtrace:
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420) [0x7fe53cde4420]
[0x7fe5229ee98f]
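
The two failure modes above (a failed assertion and a crash by signal) can be counted in saved container logs with a short script. This is only a sketch: the patterns are copied from the excerpts in this report, `summarize_crashes` is a hypothetical helper name, and other KeyDB versions may word these messages differently.

```python
import re

# Markers taken from the log excerpts in this report (assumed wording).
ASSERT_RE = re.compile(r"# ==> (?P<location>\S+:\d+) '(?P<expr>.*)' is not true")
SIGNAL_RE = re.compile(
    r"# KeyDB (?P<version>\S+) crashed by signal: (?P<signal>\d+)"
)


def summarize_crashes(log_text):
    """Collect failed assertions and signal crashes from KeyDB log text."""
    summary = {"assertions": [], "signals": []}
    for line in log_text.splitlines():
        assertion = ASSERT_RE.search(line)
        if assertion:
            summary["assertions"].append(
                (assertion.group("location"), assertion.group("expr"))
            )
        crash = SIGNAL_RE.search(line)
        if crash:
            summary["signals"].append(
                (crash.group("version"), int(crash.group("signal")))
            )
    return summary


# Two lines quoted from the logs above.
sample_log = (
    "12783:556:C 07 Dec 2023 08:13:23.029 # ==> rdb.cpp:1372 "
    "'ckeysExpired == db->expireSize()' is not true\n"
    ".O.O.O.O.O.12175:556:C 07 Dec 2023 07:39:30.364 # KeyDB 6.3.4 "
    "crashed by signal: 11, si_code: 1\n"
)
print(summarize_crashes(sample_log))
```

The text to scan could come from `docker logs cvat_redis`, saved to a file and read in as a string.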

The problem is hard to reproduce on demand, but with KeyDB 6.3.4 it occurred within a few hours every time, even when we cleared the cvat_cache_db Docker volume before each run.

Expected Behavior

KeyDB should not crash and should not consume excessive CPU cycles.
Note: The CVAT system as a whole continued to perform as expected; CVAT remained usable.

Possible Solution

We migrated back to KeyDB v6.3.2 and the issue seems to have gone away, so it was likely caused by changes between KeyDB 6.3.2 and 6.3.4.
A workaround for now is to change the KeyDB version used in docker-compose.yml from 6.3.4 back to 6.3.2.
Excerpt of docker-compose.yml:

  cvat_redis:
    container_name: cvat_redis
    image: eqalpha/keydb:x86_64_v6.3.2 #changed from eqalpha/keydb:x86_64_v6.3.4
    restart: always
    command:
      [
        'keydb-server',
        '/etc/keydb/keydb.conf',
        '--storage-provider',
        'flash',
        '/data/flash',
        '--maxmemory',
        '5G',
        '--maxmemory-policy',
        'allkeys-lfu',
      ]
    volumes:
      - cvat_cache_db:/data
    networks:
      - cvat

KeyDB was upgraded to 6.3.4 in #7118. There already seems to be a PR out to fix this: #7236.
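
To confirm which KeyDB image a deployment is pinned to, something like the following sketch may help. It assumes the `image:` line sits on a single line in the form shown in the excerpt above, and it uses a regular expression rather than a real YAML parser; `keydb_tags`, `uses_affected_version` and `AFFECTED_VERSION` are names invented for this example.

```python
import re

# Matches an image line like the one in the docker-compose.yml excerpt.
# Trailing "#..." comments on the same line are not part of the tag.
IMAGE_RE = re.compile(
    r"^\s*image:\s*eqalpha/keydb:(?P<tag>[^\s#]+)", re.MULTILINE
)

# The version this report associates with the crashes.
AFFECTED_VERSION = "6.3.4"


def keydb_tags(compose_text):
    """Return every eqalpha/keydb image tag found in the compose text."""
    return [match.group("tag") for match in IMAGE_RE.finditer(compose_text)]


def uses_affected_version(compose_text):
    """True if any KeyDB image tag ends with the affected version."""
    suffix = "v" + AFFECTED_VERSION
    return any(tag.endswith(suffix) for tag in keydb_tags(compose_text))


compose_excerpt = """
  cvat_redis:
    container_name: cvat_redis
    image: eqalpha/keydb:x86_64_v6.3.2 #changed from eqalpha/keydb:x86_64_v6.3.4
    restart: always
"""
print(keydb_tags(compose_excerpt), uses_affected_version(compose_excerpt))
```

Because the pattern is anchored on `image:` at the start of a line, the old tag mentioned in the trailing comment is ignored.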

Context

No response

Environment

No response

StefanB7 added the bug label Dec 8, 2023

bsekachev (Member) commented

Yep, there was a lot of pain with the new KeyDB version :(

azhavoro pushed a commit that referenced this issue Dec 11, 2023
### Motivation and context
That KeyDB version turned out rather unstable in practice, with multiple
crashes and freezes observed in production.

Resolved #7238

This reverts commit 118cc72.

### How has this been tested?

### Checklist
- [x] I submit my changes into the `develop` branch
- [ ] I have created a changelog fragment
- ~~[ ] I have updated the documentation accordingly~~
- ~~[ ] I have added tests to cover my changes~~
- ~~[ ] I have linked related issues (see [GitHub docs](https://help.github.com/en/github/managing-your-work-on-github/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword))~~
- ~~[ ] I have increased versions of npm packages if it is necessary ([cvat-canvas](https://github.com/opencv/cvat/tree/develop/cvat-canvas#versioning), [cvat-core](https://github.com/opencv/cvat/tree/develop/cvat-core#versioning), [cvat-data](https://github.com/opencv/cvat/tree/develop/cvat-data#versioning) and [cvat-ui](https://github.com/opencv/cvat/tree/develop/cvat-ui#versioning))~~

### License

- [x] I submit _my code changes_ under the same [MIT License](https://github.com/opencv/cvat/blob/develop/LICENSE) that covers the project.
  Feel free to contact the maintainers if that's a concern.
amjadsaadeh pushed a commit to amjadsaadeh/cvat that referenced this issue Dec 14, 2023