This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Move getEvent cache to externalCache #9379

Closed
Half-Shot wants to merge 47 commits from the hs/hacked-together-event-cache branch

Conversation

Half-Shot
Collaborator

@Half-Shot Half-Shot commented Feb 10, 2021

This PR moves the getEvent cache to the external cache class (backed by Redis). We could potentially have a two-level cache where the process still has a smaller LRU, but this PR at least ensures there is one cache for events for all workers :)

This PR implements Redis as a level 2 event cache, where the existing event cache serves as a level 1 cache.
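For a rough idea of what the two-level lookup described above could look like, here is a sketch; the method names and signatures only loosely approximate Synapse's internals and are not the exact code in this PR:

```python
# Illustrative sketch only: L1 is the existing in-process LRU event cache,
# L2 is the shared Redis-backed external cache. Exact APIs are approximate.
async def _get_event_from_caches(self, event_id):
    # L1: the per-worker in-memory cache (size controlled by event_cache_size).
    entry = self._get_event_cache.get((event_id,), None)
    if entry is not None:
        return entry

    # L2: the external cache shared by all workers, if Redis is configured.
    if self._external_cache.is_enabled():
        cached = await self._external_cache.get("getEvent", event_id)
        if cached is not None:
            event = hydrate_event(cached)  # hypothetical deserialiser
            self._get_event_cache.set((event_id,), event)
            return event

    # Otherwise fall through to fetching from the database.
    return None
```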

@Half-Shot Half-Shot force-pushed the hs/hacked-together-event-cache branch from 3aad82c to 53e424c on February 10, 2021 at 21:40
@@ -491,9 +492,12 @@ async def _get_events_from_cache_or_db(self, event_ids, allow_rejected=False):
        return event_entry_map

    def _invalidate_get_event_cache(self, event_id):
        if self._external_cache.is_enabled():
            self._external_cache.delete("getEvent", event_id)
Member

you have to await a coroutine for it to actually do anything.
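A minimal illustration of the point (assuming `delete` here is a coroutine function, as it is on the external cache class):

```python
# Calling a coroutine function bare only creates a coroutine object; nothing runs:
self._external_cache.delete("getEvent", event_id)        # no effect

# It only takes effect once awaited (or scheduled, e.g. as a background process):
await self._external_cache.delete("getEvent", event_id)  # actually deletes the entry
```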

Collaborator Author

I've since made this a background process, as we depend on this function to be sync all over the place. I am a bit worried that doing this might break areas of Synapse that rely on this cache to be accurate though?

Collaborator Author

Following on from #9379 (comment), we should probably use two caches.

Member

@richvdh richvdh left a comment

This looks exciting!

I suspect a two-layer cache will be essential though: we end up fetching some events a lot of times, and you don't necessarily want to be deserialising them every time (not least because that could mean you end up with many copies of the same event in memory at the same time, which could backfire)

also, you might want to consider storing redactions separately to events, so that the entries are immutable and you don't have to worry about cache invalidation.
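One way to picture that suggestion (the cache names and helpers below are hypothetical, not code from this PR): keep the unredacted event body write-once under one key, and record redactions under a separate key, so neither entry ever needs invalidating.

```python
# Hypothetical sketch of immutable external-cache entries.
EVENT_CACHE = "getEvent"                 # unredacted event dict, never mutated
REDACTION_CACHE = "getEventRedaction"    # event_id -> the redaction that hit it

async def _get_event_dict_from_external_cache(self, event_id):
    event_dict = await self._external_cache.get(EVENT_CACHE, event_id)
    if event_dict is None:
        return None

    redacted_by = await self._external_cache.get(REDACTION_CACHE, event_id)
    if redacted_by is not None:
        # Apply the redaction at read time rather than rewriting the cached entry.
        event_dict = prune_event_dict(event_dict)  # illustrative pruning helper

    return event_dict
```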

@Half-Shot
Collaborator Author

> I suspect a two-layer cache will be essential though: we end up fetching some events a lot of times, and you don't necessarily want to be deserialising them every time (not least because that could mean you end up with many copies of the same event in memory at the same time, which could backfire)

I've ended up changing the logic so we use the existing memory cache first, and fall back to the external cache. The idea is that if you want to save on worker memory, you tune down your event_cache_size (which controls your L1 cache) and Redis can act as an L2 cache. If you don't have Redis support, you only get the L1 cache.

> also, you might want to consider storing redactions separately to events, so that the entries are immutable and you don't have to worry about cache invalidation.

Ah, so the reason we invalidate the cache is to then populate it with another entry that includes the redacted_event? If so, then two caches definitely make more sense to me. That would also solve the raciness of an async delete.

Configurable cache size / limit

I did a bit of homework on this. It sounds like the way most people use Redis is that they set its memory limit, and applications basically fill the cache while Redis handles evicting entries to keep it under the memory limit. With this in mind, rather than adding complicated logic to Synapse to keep the Redis cache a safe size, we should probably ask administrators to configure their Redis cache to evict on an LFU (least-frequently-used) policy, and to set a sensible memory limit.

For reference, I left half-shot.uk running for a bit overnight and Redis has grown to 1.2GB, bearing in mind that I've not set up any cache eviction settings on it yet.
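The Redis-side configuration being described would look something like the following redis.conf snippet (values are illustrative, not recommendations from this PR):

```
# redis.conf (illustrative values)
maxmemory 2gb
# Evict least-frequently-used keys across the whole keyspace once the limit is hit.
maxmemory-policy allkeys-lfu
```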

@Half-Shot Half-Shot marked this pull request as ready for review February 11, 2021 11:46
@Half-Shot
Collaborator Author

Half-Shot commented Feb 11, 2021

Having spoken with @richvdh in #synapse-dev, we'd like to avoid cache invalidation so I've hunted down the areas where we invalidate:

Usages of EventsWorkerStore._invalidate_get_event_cache:

  • PersistEventsStore._update_room_depths_txn (can we just mutate the event in the cache?)
  • Store._store_redaction (store the redacted event in a separate cache?)
  • CacheInvalidationWorkerStore._invalidate_caches_for_event
    • Called for backfilling ("We fetched some old events and either we had never seen that event before or it went from being an outlier to not.")
      Do we need to invalidate?
    • Called for new events ("We received a new event, or an event went from being an outlier to not.")
      Do we need to invalidate?

Probably need to pick the brain of @erikjohnston here a bit to see what he thinks.

Comment on lines +497 to +506
        if self._external_cache.is_enabled():
            # XXX: Is there danger in doing this?
            # We could hold a set of recently evicted keys in memory if
            # we need this to be synchronous?
            run_as_background_process(
                "getEvent_external_cache_delete",
                self._external_cache.delete,
                GET_EVENT_CACHE_NAME,
                event_id,
            )
Collaborator Author

Danger danger

Member

You probably know this already, but this is absolutely not safe: if we modify the cached event (for example due to redaction), and then try to fetch the event again, there's a good chance of getting the unmodified version back from the cache.

Collaborator Author

Yup, this is horrible. Even if we maintained a separate redaction cache, the set operation is async. I'm trying to think of ways to make modifying the cache here synchronous.

Member

I don't really follow what you mean by "the set operation". You mean, storing the redaction in the cache? I don't think that matters - persisting the redaction to the database is also async.

Member

To invalidate the Redis cache I think we need to track "in flight" invalidations, so that any query for an entry in the process of being invalidated just returns None. We'll also need to do this for any set operation (for mutable caches) to ensure that we don't get stale results.

When combining with database caches extra care needs to be taken to ensure that streaming the cache invalidation happens after we've deleted stuff from Redis. This is probably going to be a bit fiddly to get right, especially for event caches that get invalidated due to the events stream. I think it can be done by doing the await cache.delete in the context manager that assigns new stream IDs for events (as we won't stream invalidation until after).

We also then need to ensure that we handle the case where a worker re-inserts stale data between the Redis cache being cleared and the DB invalidation being streamed to the workers. Re-invalidating the Redis cache on the workers when it receives the invalidation doesn't solve this, as it still allows workers to (briefly) clobber the Redis cache with stale data, which will be visible to workers that already invalidated the cache (and may then get put back into the in-memory cache?). I don't know quite how to fix that, other than introducing sequence numbers so that "newer" versions of events don't get clobbered by older versions.

TBH I think that we should try hard to avoid this and just have immutable caches (or mutable caches only in Redis). That probably means just storing the unredacted event dict (and possibly internal metadata?), and have everything else be queried from the DB and cached solely in memory. Hopefully the mutable bits are small and so we can cache a lot more in memory, which should hopefully still give us some benefit.
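A rough sketch of the "in-flight invalidations" idea (entirely illustrative; the wrapper class and its method names are invented for this example, not code from this PR):

```python
from typing import Optional, Set, Tuple

class InvalidationAwareExternalCache:
    """Illustrative wrapper: keys with an in-flight delete read as cache misses."""

    def __init__(self, external_cache):
        self._external_cache = external_cache
        self._in_flight_deletes: Set[Tuple[str, str]] = set()

    async def get(self, cache_name: str, key: str) -> Optional[dict]:
        if (cache_name, key) in self._in_flight_deletes:
            # The entry is being invalidated: treat it as a miss so callers
            # fall back to the database rather than seeing stale data.
            return None
        return await self._external_cache.get(cache_name, key)

    async def delete(self, cache_name: str, key: str) -> None:
        self._in_flight_deletes.add((cache_name, key))
        try:
            await self._external_cache.delete(cache_name, key)
        finally:
            self._in_flight_deletes.discard((cache_name, key))
```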

Member

I still agree with the general sentiment of "let's do everything we can to avoid cache invalidation in Redis", but something I don't quite follow here:

> To invalidate the Redis cache I think we need to track "in flight" invalidations

In principle, why can't we just treat the Redis cache invalidation as part of the async "persist event" operation, provided we do the invalidation of the in-memory-cache (and streaming that invalidation to workers) after the redis invalidation completes? Any operation which tries to read the event between the start of the "persist event" and the in-memory invalidation might get an old copy of the event, but isn't that true anyway?

Also:

> (or mutable caches only in Redis).

is that really what you meant to type?

Member

> To invalidate the Redis cache I think we need to track "in flight" invalidations

> In principle, why can't we just treat the Redis cache invalidation as part of the async "persist event" operation, provided we do the invalidation of the in-memory-cache (and streaming that invalidation to workers) after the redis invalidation completes? Any operation which tries to read the event between the start of the "persist event" and the in-memory invalidation might get an old copy of the event, but isn't that true anyway?

I think that's fine, so long as we do wait for the Redis invalidation (rather than doing that as a background task). For the current database caches we try really hard to immediately invalidate caches after we commit to the DB, to avoid having inconsistencies due to having a mix of live and stale data. That may be less of a concern with things we persist in Redis, and for the event cache it's probably not a problem. I'm not sure I'm comfortable with saying it's true generally though.

> (or mutable caches only in Redis).

> is that really what you meant to type?

I mean I think it's fine to cache things in Redis that are mutable if we don't also cache things in memory (and so avoid the need to coordinate the invalidation between the Redis and in-memory caches).
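A hedged sketch of the ordering being discussed here (the names below only loosely approximate Synapse's internals and are not taken from this PR):

```python
# Illustrative ordering only: wait for the Redis delete before performing and
# streaming the local invalidation, so workers can't re-read stale data from
# Redis after they have been told to invalidate.
async def _invalidate_event_everywhere(self, event_id: str) -> None:
    # 1. Remove the entry from the shared Redis cache and wait for completion.
    if self._external_cache.is_enabled():
        await self._external_cache.delete("getEvent", event_id)

    # 2. Only then drop the local in-memory entry and stream the invalidation
    #    to the other workers (hypothetical helper name).
    self._get_event_cache.invalidate((event_id,))
    await self._stream_invalidation_to_workers("getEvent", (event_id,))
```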

Member

> For the current database caches we try really hard to immediately invalidate caches after we commit to the DB

I don't really follow this. There's inherently a window between the commit and the cache invalidation: there is no "immediately". You might say we'll be widening that window, which is true, but if we have code that depends on that race not happening it might be better to find it now than to wait for the 1-in-a-million time when it bites...

None of this changes the fact that I think an architecture that avoids Redis cache invalidation will be much easier to reason about (and probably more performant)!

@@ -62,7 +62,8 @@
EVENT_QUEUE_THREADS = 3 # Max number of threads that will fetch events
Collaborator Author

Seeing a lot of tests fail due to timeouts over replication. I am not having much luck understanding why, and I'm not able to reproduce the problem in a live environment.

@Half-Shot Half-Shot force-pushed the hs/hacked-together-event-cache branch from 520d18f to 2de6060 on February 16, 2021 at 15:46
synapse/replication/tcp/external_cache.py (outdated review thread, resolved)
synapse/storage/databases/main/events_worker.py (outdated review thread, resolved)

@erikjohnston
Member

Looks like this PR has a bunch of spurious commits? Looks like a rebase that included a merge commit.

This change:
- Adds dedicated functions to dehydrate and hydrate frozen events in the cache
- Resets the expiry time on events
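A hedged sketch of the "dehydrate/hydrate" shape this commit describes (the helper names are invented for this example, and the Synapse event APIs used are approximate):

```python
import json

from synapse.api.room_versions import KNOWN_ROOM_VERSIONS
from synapse.events import make_event_from_dict

def dehydrate_event(event) -> str:
    # Flatten a frozen event into a JSON blob that can be stored in Redis.
    return json.dumps(
        {
            "event_dict": event.get_dict(),
            "internal_metadata": event.internal_metadata.get_dict(),
            "room_version": event.room_version.identifier,
        }
    )

def hydrate_event(blob: str):
    # Rebuild the frozen event from the cached JSON blob.
    data = json.loads(blob)
    room_version = KNOWN_ROOM_VERSIONS[data["room_version"]]
    return make_event_from_dict(
        data["event_dict"],
        room_version,
        internal_metadata_dict=data["internal_metadata"],
    )
```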
@Half-Shot Half-Shot force-pushed the hs/hacked-together-event-cache branch from 2964339 to f55d926 on February 22, 2021 at 18:16
@Half-Shot
Collaborator Author

> Looks like this PR has a bunch of spurious commits? Looks like a rebase that included a merge commit.

Yeah, I hadn't noticed that until it was too late. I've corrected the branch now.

@erikjohnston
Member

@Half-Shot what's the state of this, out of interest?

@erikjohnston erikjohnston marked this pull request as draft June 16, 2021 10:56
@richvdh
Member

richvdh commented Nov 26, 2021

this seems to be bitrotting and unloved, so I'm going to close it.

@richvdh richvdh closed this Nov 26, 2021
@Fizzadar Fizzadar mentioned this pull request Jul 11, 2022
richvdh pushed a commit that referenced this pull request Jul 15, 2022
Some experimental prep work to enable external event caching based on #9379 & #12955. Doesn't actually move the cache at all, just lays the groundwork for async implemented caches.

Signed off by Nick @ Beeper (@Fizzadar)
Fizzadar added a commit to beeper/synapse-legacy-fork that referenced this pull request Aug 23, 2022
Some experimental prep work to enable external event caching based on matrix-org#9379 & matrix-org#12955. Doesn't actually move the cache at all, just lays the groundwork for async implemented caches.

Signed off by Nick @ Beeper (@Fizzadar)