This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Sync stream is either missing echo of events or is sending them earlier than anticipated #7206

Closed
turt2live opened this issue Apr 2, 2020 · 10 comments
Labels: z-bug (Deprecated Label)

@turt2live
Member

On March 31st (roughly) there was an uptick in "message stuck as latest message" reports against the Riot clients (riot-web's upstream issue for this and other cases is element-hq/element-web#10032).

The cause appears to be that Synapse sometimes does not send events down /sync, even for events we've sent ourselves. There are also some reports of events simply not being received on the other end, which could be related. Most reporters were on matrix.org, and the reports seemed to stop once matrix.org rolled back.

Reproduction steps appear to be time-dependent, and possibly only useful for matrix.org-sized homeservers:

  1. Send an event
  2. Start a /sync request
  3. Send another event
  4. Receive event ID for first event (/send/m.room.message completes)
  5. Receive event down /sync
  6. Send yet another event
  7. Start a /sync request
  8. Receive event ID for second and third events
  9. Receive the third event down /sync, but not the second (sometimes the other way around)
  10. Never receive the second event through /sync, making it stuck on Riot

Most easily reproduced during high traffic volumes on matrix.org, though one case was also reproduced at ~22:00 UTC yesterday (but never again that night).
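
For concreteness, a rough sketch of the client-side flow from the steps above, written directly against the Matrix client-server API. The homeserver URL, access token and room ID are placeholders, and this is only the shape of the race rather than a guaranteed repro:

import uuid
import requests

HOMESERVER = "https://matrix.example.org"   # placeholder homeserver
ACCESS_TOKEN = "MDAx..."                    # placeholder access token
ROOM_ID = "!someroom:example.org"           # placeholder room

def send_message(body):
    # PUT /rooms/{roomId}/send/m.room.message/{txnId} returns the event ID
    txn_id = str(uuid.uuid4())
    url = f"{HOMESERVER}/_matrix/client/r0/rooms/{ROOM_ID}/send/m.room.message/{txn_id}"
    resp = requests.put(url, params={"access_token": ACCESS_TOKEN},
                        json={"msgtype": "m.text", "body": body})
    resp.raise_for_status()
    return resp.json()["event_id"]

def sync(since=None, timeout_ms=30000):
    # GET /sync; returns the next_batch token plus the event IDs seen in our room's timeline
    params = {"access_token": ACCESS_TOKEN, "timeout": timeout_ms}
    if since:
        params["since"] = since
    resp = requests.get(f"{HOMESERVER}/_matrix/client/r0/sync", params=params)
    resp.raise_for_status()
    body = resp.json()
    timeline = body.get("rooms", {}).get("join", {}).get(ROOM_ID, {}).get("timeline", {})
    return body["next_batch"], {ev["event_id"] for ev in timeline.get("events", [])}

# steps 1-5: send an event and sync until its echo arrives
first = send_message("first")
since, seen = sync()

# steps 6-8: send two more events in quick succession
second = send_message("second")
third = send_message("third")

# steps 9-10: keep syncing; in the buggy case the third event turns up but the second never does
for _ in range(10):
    since, events = sync(since)
    seen |= events
    print("second seen:", second in seen, "third seen:", third in seen)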

My theory, at least, is that the request hits a busy synchrotron which advances the stream token past the second event to the third event, so the second is never delivered at all.

@erikjohnston's conclusions appear to be that synapse might be sending the event down /sync ahead of the /send/m.room.message request completing?

@turt2live
Member Author

This rageshake is my ~22:00 UTC attempt: https://github.com/matrix-org/riot-web-rageshakes/issues/2501

(I think; I might have gotten the timestamp wrong.)

@richvdh
Member

richvdh commented Apr 2, 2020

@erikjohnston's conclusions appear to be that synapse might be sending the event down /sync ahead of the /send/m.room.message request completing?

this is valid behaviour though.

@turt2live
Member Author

this is valid behaviour though.

Yup, and it seems like Riot is okay with this. My attempt revealed no syncs with the event ID contained within, aside from a sync which had the second user's read receipt for the event. The event ID did not appear early or late.

@richvdh richvdh self-assigned this Apr 2, 2020
@richvdh
Member

richvdh commented Apr 2, 2020

well there are a boatload of errors in the logs which suggest that some replication updates aren't being processed correctly. They include things like:

Traceback (most recent call last):
  File "/home/synapse/src/synapse/metrics/background_process_metrics.py", line 213, in run
    return (yield result)
  File "/home/synapse/env-py37/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/synapse/src/synapse/replication/tcp/protocol.py", line 254, in handle_command
    await handler(cmd)
  File "/home/synapse/src/synapse/replication/tcp/protocol.py", line 650, in on_POSITION
    [stream.parse_row(update[1]) for update in updates],
  File "/home/synapse/src/synapse/replication/tcp/protocol.py", line 650, in <listcomp>
    [stream.parse_row(update[1]) for update in updates],
IndexError: string index out of range

AttributeError: 'GenericWorkerSlavedStore' object has no attribute 'get_all_push_rule_updates'

Traceback (most recent call last):
  File "/home/synapse/src/synapse/metrics/background_process_metrics.py", line 213, in run
    return (yield result)
  File "/home/synapse/env-py37/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/synapse/src/synapse/replication/tcp/protocol.py", line 254, in handle_command
    await handler(cmd)
  File "/home/synapse/src/synapse/replication/tcp/protocol.py", line 668, in on_POSITION
    await self.handler.on_rdata(cmd.stream_name, rows[-1].token, rows)
AttributeError: 'DeviceListsStreamRow' object has no attribute 'token'

(also PushersStreamRow, EventsStreamRow, ToDeviceStreamRow, probably others.)

My inclination would be to fix those, then redeploy and see if the problem is magically cured.
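
For what it's worth, the second flavour of traceback above is the shape you get when the catch-up path assumes every parsed row carries its own stream token. A simplified illustration (not the actual Synapse classes):

from collections import namedtuple

# hypothetical row type: carries row data only, no per-row stream token
DeviceListsStreamRow = namedtuple("DeviceListsStreamRow", ("user_id", "destination"))

def on_position(stream_name, rows):
    # mirrors `await self.handler.on_rdata(cmd.stream_name, rows[-1].token, rows)`
    token = rows[-1].token   # AttributeError: 'DeviceListsStreamRow' object has no attribute 'token'
    print("caught up", stream_name, "to", token)

on_position("device_lists", [DeviceListsStreamRow("@alice:example.org", "example.com")])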

turt2live added a commit to matrix-org/matrix-js-sdk that referenced this issue Apr 3, 2020
For matrix-org/synapse#7206

Intended to be removed (or possibly converted into a proper off-by-default setting) once the issue concludes.

There's a bunch of null checking through defaults here because this is a very critical part of the whole stack. We default to writing errors to the console, but do not break the app for failing to log info about the sync.

This may have adverse effects on performance for large accounts - use `localStorage.setItem("mx_skip_sync_logging", "true")` to disable the worst parts.
@turt2live
Member Author

Logging for riot-web added here: matrix-org/matrix-js-sdk#1300

Might help narrow it down, maybe.

@richvdh
Member

richvdh commented Apr 8, 2020

For the record, we rolled back matrix.org's synapse deployment on 2nd April while we investigated these problems.

As of 11:06 UTC this morning, we have redeployed matrix.org with current develop. We've fixed the obvious exceptions being spewed by the replication connections, so I am optimistic that this problem has gone away; however, we never really got to the bottom of what was going on, so it's entirely possible there are still problems.

In other words: feedback welcome as to whether this bug seems to be fixed or not.

@richvdh
Member

richvdh commented Apr 8, 2020

... and the bug is still biting, so rolled back again as of 16:34 UTC

@richvdh
Member

richvdh commented Apr 15, 2020

Right; I have reproduced this and tracked it down. It happens after the replication connection disconnects. After a disconnection, we perform a catchup, potentially processing multiple events as having the same stream id. StreamChangeCache assumes that there will only be one entity changing on each stream id, which is no longer true; updates are therefore getting lost.
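
To make the failure mode concrete, here is a minimal sketch of the bad assumption (not the real StreamChangeCache, which is rather more involved): if the cache keys a single entity per stream position, the second change processed at the same position silently replaces the first, and its update is lost.

# Minimal sketch of the bug, not the real StreamChangeCache: one entity per
# stream position means a second change at the same position silently
# replaces the first.
class NaiveStreamChangeCache:
    def __init__(self):
        self._pos_to_entity = {}   # stream_pos -> single entity (the bad assumption)

    def entity_has_changed(self, entity, stream_pos):
        self._pos_to_entity[stream_pos] = entity

    def get_entities_changed(self, since_pos):
        return {e for pos, e in self._pos_to_entity.items() if pos > since_pos}

cache = NaiveStreamChangeCache()
# normal operation: each event gets its own stream id
cache.entity_has_changed("!room1:example.org", 101)
# after a replication disconnect, catch-up processes several events as if
# they all happened at the same stream id
cache.entity_has_changed("!room2:example.org", 102)
cache.entity_has_changed("!room3:example.org", 102)   # overwrites room2's change

print(cache.get_entities_changed(100))
# {'!room1:example.org', '!room3:example.org'} -- room2's update has been lost,
# so a /sync from position 100 never learns that room2 changed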

@richvdh
Member

richvdh commented Apr 16, 2020

Steps to reproduce:

  • Stop the synchrotron process (eg, send it a SIGSTOP; blocking the replication traffic with iptables or similar might also work)
  • While it is stopped, send a few messages
  • Wait until the master has got bored and dropped the replication connection
  • Send the synchrotron a SIGCONT
  • Observe stuck messages
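
A rough automation of those steps, with placeholder values for the synchrotron's pid and the client credentials; the two-minute sleep is only a guess at how long the master takes to give up on the replication connection:

import os
import signal
import time
import uuid
import requests

SYNCHROTRON_PID = 12345                     # placeholder worker pid
HOMESERVER = "https://matrix.example.org"   # placeholder homeserver
ACCESS_TOKEN = "MDAx..."                    # placeholder access token
ROOM_ID = "!someroom:example.org"           # placeholder room

def send_message(body):
    url = (f"{HOMESERVER}/_matrix/client/r0/rooms/{ROOM_ID}"
           f"/send/m.room.message/{uuid.uuid4()}")
    resp = requests.put(url, params={"access_token": ACCESS_TOKEN},
                        json={"msgtype": "m.text", "body": body})
    resp.raise_for_status()
    return resp.json()["event_id"]

os.kill(SYNCHROTRON_PID, signal.SIGSTOP)    # freeze the synchrotron
for i in range(3):
    print("sent", send_message(f"message {i} while the synchrotron is stopped"))
time.sleep(120)                             # wait for the master to drop the replication connection
os.kill(SYNCHROTRON_PID, signal.SIGCONT)    # wake it up, then watch /sync for the stuck messages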

richvdh added a commit that referenced this issue Apr 20, 2020
Other parts of the code (such as the StreamChangeCache) assume that there will
not be multiple changes with the same stream id.

This code was introduced in #7024, and I hope this fixes #7206.
@richvdh
Member

richvdh commented Apr 20, 2020

hopefully fixed by #7286

@richvdh richvdh closed this as completed Apr 20, 2020
richvdh added a commit that referenced this issue Apr 22, 2020
…7303)

First some background: StreamChangeCache is used to keep track of what "entities" have 
changed since a given stream ID. So for example, we might use it to keep track of when the last
to-device message for a given user was received [1], and hence whether we need to pull any to-device messages from the database on a sync [2].

Now, it turns out that StreamChangeCache didn't support more than one thing being changed at
a given stream_id (this was part of the problem with #7206). However, it's entirely valid to send
to-device messages to more than one user at a time.

As it turns out, this did in fact work, because *some* methods of StreamChangeCache coped
ok with having multiple things changing on the same stream ID, and it seems we never actually
use the methods which don't work on the stream change caches where we allow multiple
changes at the same stream ID. But that feels horribly fragile, hence: let's update
StreamChangeCache to properly support this, and add some typing and some more tests while
we're at it.

[1]: https://github.com/matrix-org/synapse/blob/release-v1.12.3/synapse/storage/data_stores/main/deviceinbox.py#L301
[2]: https://github.com/matrix-org/synapse/blob/release-v1.12.3/synapse/storage/data_stores/main/deviceinbox.py#L47-L51
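
The shape of the fix, as a hedged sketch rather than the actual Synapse implementation: track a set of entities per stream position (plus a reverse map for per-entity queries), so that several changes at one stream id are all preserved.

# Sketch of the fixed behaviour, not the actual Synapse StreamChangeCache:
# allow any number of entities to change at the same stream position.
from collections import defaultdict

class MultiEntityStreamChangeCache:
    def __init__(self):
        self._pos_to_entities = defaultdict(set)   # stream_pos -> {entities}
        self._entity_to_pos = {}                   # entity -> latest stream_pos

    def entity_has_changed(self, entity, stream_pos):
        old_pos = self._entity_to_pos.get(entity)
        if old_pos is not None and old_pos != stream_pos:
            self._pos_to_entities[old_pos].discard(entity)
        self._pos_to_entities[stream_pos].add(entity)
        self._entity_to_pos[entity] = stream_pos

    def has_entity_changed(self, entity, since_pos):
        pos = self._entity_to_pos.get(entity)
        return pos is not None and pos > since_pos

    def get_entities_changed(self, since_pos):
        return {
            entity
            for pos, entities in self._pos_to_entities.items()
            if pos > since_pos
            for entity in entities
        }

cache = MultiEntityStreamChangeCache()
cache.entity_has_changed("@alice:example.org", 7)
cache.entity_has_changed("@bob:example.org", 7)    # same stream id, both kept
print(cache.get_entities_changed(6))               # both users reported as changed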