This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

GenericWorkerReplicationHandler has no attribute 'send_federation_ack' #7535

Closed
heftig opened this issue May 19, 2020 · 5 comments · Fixed by #7564
Assignees
Labels
A-Workers Problems related to running Synapse in Worker Mode (or replication) z-bug (Deprecated Label)

Comments

heftig commented May 19, 2020

Description

I'm trying to use the federation_sender worker. I made these changes to the homeserver config:

# Worker config
worker_app: synapse.app.homeserver
worker_log_config: "/etc/synapse/log_config.yaml"
worker_replication_host: 127.0.0.1
worker_replication_port: 9092
worker_replication_http_port: 9093

# Delegate to federation_sender worker
send_federation: false

# Delegate to appservice worker
notify_appservices: false
 
listeners:
  - port: 9092
    type: replication
    bind_addresses: ['127.0.0.1']

  - port: 9093
    type: http
    bind_addresses: ['127.0.0.1']
    resources:
      - names: [replication]
        compress: false

The log shows this error, frequently:

synapse.app.generic_worker: [replication-RDATA-federation-3176] Error updating federation stream position
Traceback (most recent call last):
  File "/var/lib/synapse/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/synapse/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/synapse/venv/lib/python3.8/site-packages/synapse/app/generic_worker.py", line 851, in update_token
    self.replication_client.send_federation_ack(
AttributeError: 'GenericWorkerReplicationHandler' object has no attribute 'send_federation_ack'

Version information

  • Homeserver: matrix.archlinux.org
  • Version: 1.13.0
  • Install method: pip
  • Platform: Arch Linux

heftig commented May 19, 2020

@erikjohnston I think that #7185 could have caused this.

erikjohnston (Member) commented:

Huh. This is on the full 1.13 release? I've seen it before, but didn't see it on matrix.org so assumed everything was fine, at least where the release was concerned.

The net effect of this is that master will slowly leak some memory btw.


heftig commented May 19, 2020

> Huh. This is on the full 1.13 release? I've seen it before, but didn't see it on matrix.org so assumed everything was fine, at least where the release was concerned.
>
> The net effect of this is that master will slowly leak some memory btw.

Yes, when I deployed the config to set up workers, the pip update also pulled in the new 1.13.0.


heftig commented May 19, 2020

This seems to solve it. Feels like a hack, though.

diff --git i/synapse/app/generic_worker.py w/synapse/app/generic_worker.py
index 667ad2042..d36699b00 100644
--- i/synapse/app/generic_worker.py
+++ w/synapse/app/generic_worker.py
@@ -765,6 +765,7 @@ class FederationSenderHandler(object):
         self._is_mine_id = hs.is_mine_id
         self.federation_sender = hs.get_federation_sender()
         self.replication_client = replication_client
+        self.hs = hs
 
         self.federation_position = self.store.federation_out_pos_startup
         self._fed_position_linearizer = Linearizer(name="_fed_position_linearizer")
@@ -848,7 +849,7 @@ class FederationSenderHandler(object):
 
                     # We ACK this token over replication so that the master can drop
                     # its in memory queues
-                    self.replication_client.send_federation_ack(
+                    self.hs.get_tcp_replication().send_federation_ack(
                         self.federation_position
                     )
                     self._last_ack = self.federation_position


richvdh commented May 21, 2020

I've investigated this a bit more.

@heftig's workaround is valid. It's a little hacky, but the problem is that doing it the non-hacky way introduces a dependency cycle: hs.get_tcp_replication() builds a ReplicationCommandHandler, which calls hs.get_replication_data_handler(), which builds a GenericWorkerReplicationHandler, which builds a FederationSenderHandler ... which therefore can't call hs.get_tcp_replication() in its constructor. The other places that we do this sort of thing (for example, sending INVALIDATE_CACHE or REMOVE_PUSHER commands) do it exactly the way @heftig suggests.
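For illustration, the cycle looks roughly like this (a simplified, hypothetical sketch of the construction order, not Synapse's actual code):

# Simplified sketch of the dependency cycle described above. Each get_*
# call here eagerly constructs its object, so following the chain from
# get_tcp_replication() loops back into get_tcp_replication().

class FederationSenderHandler:
    def __init__(self, hs):
        # the problematic step: the ReplicationCommandHandler that would
        # be returned is still in the middle of being constructed
        self.replication_client = hs.get_tcp_replication()

class GenericWorkerReplicationHandler:
    def __init__(self, hs):
        self.send_handler = FederationSenderHandler(hs)

class ReplicationCommandHandler:
    def __init__(self, hs):
        self.data_handler = hs.get_replication_data_handler()

class HomeServer:
    def get_tcp_replication(self):
        return ReplicationCommandHandler(self)        # builds data handler

    def get_replication_data_handler(self):
        return GenericWorkerReplicationHandler(self)  # builds sender handler

# HomeServer().get_tcp_replication() would recurse forever here; the cycle
# has to be broken somewhere, which is why FederationSenderHandler can't
# call hs.get_tcp_replication() in its constructor.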

A cleaner fix might be to invert the dependency between ReplicationCommandHandler and replication_data_handler by having the latter register itself with the ReplicationCommandHandler; however, that can be done separately.
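A hedged sketch of what that inversion might look like (names and wiring are hypothetical, not the eventual fix):

# Hypothetical sketch: build both handlers independently, then have the
# data handler register itself with the command handler. Neither
# constructor needs the other, so the cycle from the previous sketch
# disappears.

class ReplicationCommandHandler:
    def __init__(self):
        self._data_handler = None

    def register_data_handler(self, handler):
        self._data_handler = handler
        handler.command_handler = self  # back-reference handed over late

    def send_federation_ack(self, token):
        print(f"FEDERATION_ACK {token}")  # stand-in for the real command

class GenericWorkerReplicationHandler:
    def __init__(self):
        self.command_handler = None  # filled in by register_data_handler()

# wiring at startup, with no constructor-time cycle:
command_handler = ReplicationCommandHandler()
data_handler = GenericWorkerReplicationHandler()
command_handler.register_data_handler(data_handler)
data_handler.command_handler.send_federation_ack(10000)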

As for why this isn't happening on matrix.org:

The federation_stream_position database table stores the most recent stream id for the federation stream which has been successfully processed by the federation_sender worker.

Let's say that we process up to stream id 10000; then the whole system is restarted. The intention with the federation stream is that the stream ids restart on each restart of the master, so the stream ids now go back to zero. At this point, in the following code, _last_ack == 10000 and federation_position is small, so the body of the if never runs and therefore never raises an exception (a toy numeric illustration follows the snippet).

if self._last_ack < self.federation_position:
    await self.store.update_federation_out_pos(
        "federation", self.federation_position
    )

    # We ACK this token over replication so that the master can drop
    # its in memory queues
    self.replication_client.send_federation_ack(
        self.federation_position
    )
    self._last_ack = self.federation_position
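As a concrete toy example of that scenario (numbers invented to match the description above):

# Toy illustration of the matrix.org scenario: the position persisted
# before the restart is far ahead of the freshly-reset stream ids, so
# the condition above is never true and send_federation_ack() is never
# reached (and therefore never raises).
_last_ack = 10000        # persisted before the restart
federation_position = 3  # stream ids restarted from zero on the master

if _last_ack < federation_position:
    print("would persist the position and send FEDERATION_ACK")
else:
    print("skipped: no exception, but no FEDERATION_ACK either")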

Other than the lack of exceptions, I think the effects are:

  • FEDERATION_ACKs aren't being sent to the master, which isn't too critical since the master implicitly clears the pending traffic after 5 minutes anyway.
  • when the federation_sender worker is restarted or its replication connection drops, it might drop traffic (because it thinks it has processed everything up to stream id 10000). Again, this isn't too critical, because the only things the federation stream is used for are presence and typing.

It looks like the if condition on line 844 should simply be removed.
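Put together with @heftig's workaround, the body above would then look roughly like this (a sketch of the proposed change as a method-body fragment, not necessarily the actual patch that landed in #7564):

# Sketch: persist and ack unconditionally (condition removed), and fetch
# the command handler via hs rather than through the stale
# self.replication_client reference.
await self.store.update_federation_out_pos(
    "federation", self.federation_position
)

# We ACK this token over replication so that the master can drop
# its in memory queues
self.hs.get_tcp_replication().send_federation_ack(
    self.federation_position
)
self._last_ack = self.federation_position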

In summary: we should do what @heftig said, and remove the condition at line 844. But also: I'm annoyed that this didn't get picked up by any tests. It actually looks like federation acks have been broken on matrix.org for months: we should have been able to detect both problems if we had checks that the federation_sender worker was sending back FEDERATION_ACKs.

Given the (lack of) severity of the symptoms here, I don't feel like this is worth a patch release to fix.

richvdh added the A-Workers label on Feb 16, 2022