This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Sync stream is either missing echo of events or is sending them earlier than anticipated #7206

Closed
turt2live opened this issue Apr 2, 2020 · 10 comments
Labels: z-bug (Deprecated Label)

@turt2live
Member

On March 31st (roughly) there was an uptick in "message stuck as latest message" reports against the Riot clients (riot-web's upstream issue for this and other cases is element-hq/element-web#10032).

The cause appears to be that Synapse sometimes does not send events down /sync, even for events we've sent ourselves. There are also some reports of events simply not being received on the other end, which could be related. Most reporters were on matrix.org, and the reports seemed to stop once matrix.org rolled back.

Reproduction steps appear to be time-dependent, and possibly only useful for matrix.org-sized homeservers:

  1. Send an event
  2. Start a /sync request
  3. Send another event
  4. Receive event ID for first event (/send/m.room.message completes)
  5. Receive event down /sync
  6. Send yet another event
  7. Start a /sync request
  8. Receive event ID for second and third events
  9. Receive the third event down /sync, but not the second (sometimes the other way around)
  10. Never receive the second event through /sync, making it stuck on Riot

Most easily reproduced during high traffic volumes on matrix.org, though one case was also reproduced at ~22:00 UTC yesterday (but never again that night).
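
For concreteness, a rough sketch of the client-side flow from the steps above, written directly against the Matrix client-server API. The homeserver URL, access token and room ID are placeholders, and this is only the shape of the race rather than a guaranteed repro:

import uuid
import requests

HOMESERVER = "https://matrix.example.org"   # placeholder homeserver
ACCESS_TOKEN = "MDAx..."                    # placeholder access token
ROOM_ID = "!someroom:example.org"           # placeholder room

def send_message(body):
    # PUT /rooms/{roomId}/send/m.room.message/{txnId} returns the event ID
    txn_id = str(uuid.uuid4())
    url = f"{HOMESERVER}/_matrix/client/r0/rooms/{ROOM_ID}/send/m.room.message/{txn_id}"
    resp = requests.put(url, params={"access_token": ACCESS_TOKEN},
                        json={"msgtype": "m.text", "body": body})
    resp.raise_for_status()
    return resp.json()["event_id"]

def sync(since=None, timeout_ms=30000):
    # GET /sync; returns the next_batch token plus the event IDs seen in our room's timeline
    params = {"access_token": ACCESS_TOKEN, "timeout": timeout_ms}
    if since:
        params["since"] = since
    resp = requests.get(f"{HOMESERVER}/_matrix/client/r0/sync", params=params)
    resp.raise_for_status()
    body = resp.json()
    timeline = body.get("rooms", {}).get("join", {}).get(ROOM_ID, {}).get("timeline", {})
    return body["next_batch"], {ev["event_id"] for ev in timeline.get("events", [])}

# steps 1-5: send an event and sync until its echo arrives
first = send_message("first")
since, seen = sync()

# steps 6-8: send two more events in quick succession
second = send_message("second")
third = send_message("third")

# steps 9-10: keep syncing; in the buggy case the third event turns up but the second never does
for _ in range(10):
    since, events = sync(since)
    seen |= events
    print("second seen:", second in seen, "third seen:", third in seen)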

My theory, at least, is that the request hits a busy synchrotron which advances the stream token past the second event to the third event, so the second is never delivered at all.

@erikjohnston's conclusions appear to be that synapse might be sending the event down /sync ahead of the /send/m.room.message request completing?

@turt2live
Member Author

This rageshake is my ~22:00 UTC attempt: https://github.com/matrix-org/riot-web-rageshakes/issues/2501

(I think; I might have gotten the timestamp wrong.)

@richvdh
Member

richvdh commented Apr 2, 2020

@erikjohnston's conclusions appear to be that synapse might be sending the event down /sync ahead of the /send/m.room.message request completing?

this is valid behaviour though.

@turt2live
Member Author

this is valid behaviour though.

Yup, and it seems like Riot is okay with this. My attempt revealed no syncs with the event ID contained within, aside from a sync which had the second user's read receipt for the event. The event ID did not appear early or late.

@richvdh richvdh self-assigned this Apr 2, 2020
@richvdh
Member

richvdh commented Apr 2, 2020

well there are a boatload of errors in the logs which suggest that some replication updates aren't being processed correctly. They include things like:

Traceback (most recent call last):
  File "/home/synapse/src/synapse/metrics/background_process_metrics.py", line 213, in run
    return (yield result)
  File "/home/synapse/env-py37/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/synapse/src/synapse/replication/tcp/protocol.py", line 254, in handle_command
    await handler(cmd)
  File "/home/synapse/src/synapse/replication/tcp/protocol.py", line 650, in on_POSITION
    [stream.parse_row(update[1]) for update in updates],
  File "/home/synapse/src/synapse/replication/tcp/protocol.py", line 650, in <listcomp>
    [stream.parse_row(update[1]) for update in updates],
IndexError: string index out of range

AttributeError: 'GenericWorkerSlavedStore' object has no attribute 'get_all_push_rule_updates'

Traceback (most recent call last):
  File "/home/synapse/src/synapse/metrics/background_process_metrics.py", line 213, in run
    return (yield result)
  File "/home/synapse/env-py37/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/synapse/src/synapse/replication/tcp/protocol.py", line 254, in handle_command
    await handler(cmd)
  File "/home/synapse/src/synapse/replication/tcp/protocol.py", line 668, in on_POSITION
    await self.handler.on_rdata(cmd.stream_name, rows[-1].token, rows)
AttributeError: 'DeviceListsStreamRow' object has no attribute 'token'

(also PushersStreamRow, EventsStreamRow, ToDeviceStreamRow, probably others.)

My inclination would be to fix those, then redeploy and see if the problem is magically cured.
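
For what it's worth, the second flavour of traceback above is the shape you get when the catch-up path assumes every parsed row carries its own stream token. A simplified illustration (not the actual Synapse classes):

from collections import namedtuple

# hypothetical row type: carries row data only, no per-row stream token
DeviceListsStreamRow = namedtuple("DeviceListsStreamRow", ("user_id", "destination"))

def on_position(stream_name, rows):
    # mirrors `await self.handler.on_rdata(cmd.stream_name, rows[-1].token, rows)`
    token = rows[-1].token   # AttributeError: 'DeviceListsStreamRow' object has no attribute 'token'
    print("caught up", stream_name, "to", token)

on_position("device_lists", [DeviceListsStreamRow("@alice:example.org", "example.com")])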

turt2live added a commit to matrix-org/matrix-js-sdk that referenced this issue Apr 3, 2020
For matrix-org/synapse#7206

Intended to be removed (or possibly converted into a proper off-by-default setting) once the issue concludes.

There's a bunch of null checking through defaults here because this is a very critical part of the whole stack. We default to writing errors to the console, but do not break the app for failing to log info about the sync.

This may have adverse effects on performance for large accounts - use `localStorage.setItem("mx_skip_sync_logging", "true")` to disable the worst parts.
@turt2live
Member Author

Logging for riot-web added here: matrix-org/matrix-js-sdk#1300

Might help narrow it down, maybe.

@richvdh
Member

richvdh commented Apr 8, 2020

For the record, we rolled back matrix.org's synapse deployment on 2nd April while we investigated these problems.

As of 11:06 UTC this morning, we have redeployed matrix.org with current develop. We've fixed the obvious exceptions being spewed by the replication connections, so I am optimistic that this problem has gone away; however, we never really got to the bottom of what was going on, so it's entirely possible there are still problems.

In other words: feedback welcome as to whether this bug seems to be fixed or not.

@richvdh
Member

richvdh commented Apr 8, 2020

... and the bug is still biting, so rolled back again as of 16:34 UTC

@richvdh
Member

richvdh commented Apr 15, 2020

Right; I have reproduced this and tracked it down. It happens after the replication connection disconnects. After a disconnection, we perform a catchup, potentially processing multiple events as having the same stream id. StreamChangeCache assumes that there will only be one entity changing on each stream id, which is no longer true; updates are therefore getting lost.
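
To make the failure mode concrete, here is a minimal sketch of the bad assumption (not the real StreamChangeCache, which is rather more involved): if the cache keys a single entity per stream position, the second change processed at the same position silently replaces the first, and its update is lost.

# Minimal sketch of the bug, not the real StreamChangeCache: one entity per
# stream position means a second change at the same position silently
# replaces the first.
class NaiveStreamChangeCache:
    def __init__(self):
        self._pos_to_entity = {}   # stream_pos -> single entity (the bad assumption)

    def entity_has_changed(self, entity, stream_pos):
        self._pos_to_entity[stream_pos] = entity

    def get_entities_changed(self, since_pos):
        return {e for pos, e in self._pos_to_entity.items() if pos > since_pos}

cache = NaiveStreamChangeCache()
# normal operation: each event gets its own stream id
cache.entity_has_changed("!room1:example.org", 101)
# after a replication disconnect, catch-up processes several events as if
# they all happened at the same stream id
cache.entity_has_changed("!room2:example.org", 102)
cache.entity_has_changed("!room3:example.org", 102)   # overwrites room2's change

print(cache.get_entities_changed(100))
# {'!room1:example.org', '!room3:example.org'} -- room2's update has been lost,
# so a /sync from position 100 never learns that room2 changed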

@richvdh
Member

richvdh commented Apr 16, 2020

Steps to reproduce:

  • Stop the synchrotron process (eg, send it a SIGSTOP; blocking the replication traffic with iptables or similar might also work)
  • While it is stopped, send a few messages
  • Wait until the master has got bored and dropped the replication connection
  • Send the synchrotron a SIGCONT
  • Observe stuck messages
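
A rough automation of those steps, with placeholder values for the synchrotron's pid and the client credentials; the two-minute sleep is only a guess at how long the master takes to give up on the replication connection:

import os
import signal
import time
import uuid
import requests

SYNCHROTRON_PID = 12345                     # placeholder worker pid
HOMESERVER = "https://matrix.example.org"   # placeholder homeserver
ACCESS_TOKEN = "MDAx..."                    # placeholder access token
ROOM_ID = "!someroom:example.org"           # placeholder room

def send_message(body):
    url = (f"{HOMESERVER}/_matrix/client/r0/rooms/{ROOM_ID}"
           f"/send/m.room.message/{uuid.uuid4()}")
    resp = requests.put(url, params={"access_token": ACCESS_TOKEN},
                        json={"msgtype": "m.text", "body": body})
    resp.raise_for_status()
    return resp.json()["event_id"]

os.kill(SYNCHROTRON_PID, signal.SIGSTOP)    # freeze the synchrotron
for i in range(3):
    print("sent", send_message(f"message {i} while the synchrotron is stopped"))
time.sleep(120)                             # wait for the master to drop the replication connection
os.kill(SYNCHROTRON_PID, signal.SIGCONT)    # wake it up, then watch /sync for the stuck messages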

richvdh added a commit that referenced this issue Apr 20, 2020
Other parts of the code (such as the StreamChangeCache) assume that there will
not be multiple changes with the same stream id.

This code was introduced in #7024, and I hope this fixes #7206.
@richvdh
Member

richvdh commented Apr 20, 2020

hopefully fixed by #7286

@richvdh richvdh closed this as completed Apr 20, 2020
richvdh added a commit that referenced this issue Apr 22, 2020
…7303)

First some background: StreamChangeCache is used to keep track of what "entities" have 
changed since a given stream ID. So for example, we might use it to keep track of when the last
to-device message for a given user was received [1], and hence whether we need to pull any to-device messages from the database on a sync [2].

Now, it turns out that StreamChangeCache didn't support more than one thing being changed at
a given stream_id (this was part of the problem with #7206). However, it's entirely valid to send
to-device messages to more than one user at a time.

As it turns out, this did in fact work, because *some* methods of StreamChangeCache coped
ok with having multiple things changing on the same stream ID, and it seems we never actually
use the methods which don't work on the stream change caches where we allow multiple
changes at the same stream ID. But that feels horribly fragile, hence: let's update
StreamChangeCache to properly support this, and add some typing and some more tests while
we're at it.

[1]: https://github.com/matrix-org/synapse/blob/release-v1.12.3/synapse/storage/data_stores/main/deviceinbox.py#L301
[2]: https://github.com/matrix-org/synapse/blob/release-v1.12.3/synapse/storage/data_stores/main/deviceinbox.py#L47-L51
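
The shape of the fix, as a hedged sketch rather than the actual Synapse implementation: track a set of entities per stream position (plus a reverse map for per-entity queries), so that several changes at one stream id are all preserved.

# Sketch of the fixed behaviour, not the actual Synapse StreamChangeCache:
# allow any number of entities to change at the same stream position.
from collections import defaultdict

class MultiEntityStreamChangeCache:
    def __init__(self):
        self._pos_to_entities = defaultdict(set)   # stream_pos -> {entities}
        self._entity_to_pos = {}                   # entity -> latest stream_pos

    def entity_has_changed(self, entity, stream_pos):
        old_pos = self._entity_to_pos.get(entity)
        if old_pos is not None and old_pos != stream_pos:
            self._pos_to_entities[old_pos].discard(entity)
        self._pos_to_entities[stream_pos].add(entity)
        self._entity_to_pos[entity] = stream_pos

    def has_entity_changed(self, entity, since_pos):
        pos = self._entity_to_pos.get(entity)
        return pos is not None and pos > since_pos

    def get_entities_changed(self, since_pos):
        return {
            entity
            for pos, entities in self._pos_to_entities.items()
            if pos > since_pos
            for entity in entities
        }

cache = MultiEntityStreamChangeCache()
cache.entity_has_changed("@alice:example.org", 7)
cache.entity_has_changed("@bob:example.org", 7)    # same stream id, both kept
print(cache.get_entities_changed(6))               # both users reported as changed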