Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Improve type hints for data store usage #11165

Open
squahtx opened this issue Oct 22, 2021 · 9 comments
Open

Improve type hints for data store usage #11165

squahtx opened this issue Oct 22, 2021 · 9 comments
Labels
T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. Z-Dev-Wishlist Makes developers' lives better, but doesn't have direct user impact

Comments

@squahtx
Copy link
Contributor

squahtx commented Oct 22, 2021

Everywhere in Synapse, HomeServer.get_datastore() is annotated as the DataStore god class, which is incorrect for workers and allowed #11163 to slip into matrix.org, causing an outage. It would be a good idea for data store consumers (servlets, handlers, the module API, etc) to declare which aspects of the data store they need, and have CI check that we don't pass them a data store that's missing the required interfaces.

There are three data store classes to consider:

  • DataStore, which is the data store class used on the main process
  • GenericWorkerSlavedStore, which is the data store class used on worker processes
  • AdminCmdSlavedStore, which is the data store class used when running admin commands(?)

DataStore and GenericWorkerSlavedStore overlap but aren't subtypes of each other. AdminCmdSlavedStore is a subset of GenericWorkerSlavedStore functionality-wise, but not through inheritance.

It's been suggested by @reivilibre that we define two or three types for use in type hints:

  • WorkerStore
  • MainStore (a subtype of WorkerStore?)
  • EitherStore = Union[WorkerStore, MainStore], if it turns out that WorkerStore contains functionality not in MainStore

These don't have to be concrete classes and could be Protocols if needed.

We could have more granular store requirements, but it wouldn't catch any extra bugs.

The code is currently structured like this in a lot of places:

class GroupsLocalWorkerHandler:
    def __init__(self, hs: "HomeServer"):
        self.store = hs.get_datastore()

Proposal 1: HomeServer[WorkerStore]

We could add a covariant type parameter to HomeServer to have data store consumers declare which type of data store they need:

class GroupsLocalWorkerHandler:
    def __init__(self, hs: "HomeServer[WorkerStore]"):
        self.store = hs.get_datastore()

HomeServer[WorkerStore] would mean "a homeserver with a data store that supports at least the capabilities of a WorkerStore".

We'd have to do this refactor across the entire codebase at once, otherwise the type hints for data stores would be implicitly degraded to Any everywhere.

Proposal 2: get_worker_datastore()

@clokep suggests that we add new get_*_datastore() methods to HomeServer that raise errors when called on the wrong process:

class GroupsLocalWorkerHandler:
    def __init__(self, hs: "HomeServer"):
        self.store = hs.get_worker_datastore()

mypy would not flag up issues, but any CI stage which spins up workers would fail.

@squahtx squahtx added T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. labels Oct 22, 2021
@DMRobertson
Copy link
Contributor

* `AdminCmdSlavedStore`, which is the data store class used when running admin commands(?)

Yes, see python -m synapse.app.admin_cmd --help for gory details.

@DMRobertson
Copy link
Contributor

Are proposals 1 and 2 mutually exclusive? I.e. could we have runtime and build-time checks? I think any costs would be one-off setup costs at the point on instantiating the different handlers.

@reivilibre
Copy link
Contributor

re: Proposal 1

These don't have to be concrete classes and could be Protocols if needed.

I think concrete classes are nicer; it feels a bit like an abuse of Protocols (which are more intended for structural types, which this case kind of isn't: we know which types we're talking about) and I think IDEs would give a better experience with concrete classes.

We'd have to do this refactor across the entire codebase at once, otherwise the type hints for data stores would be implicitly degraded to Any everywhere.

We can make the type variable used in the generic have a bound to avoid this, but ideally we'd just do it all at once anyway.
(Disallowing implicit Any generics does sound nice now though :))

re: Proposal 2

mypy would not flag up issues, but any CI stage which spins up workers would fail.

I'm not so much a fan of this, since it seems like we should be able to do it statically rather than relying on our tests triggering get_*_datastore() in a worker context.

@reivilibre
Copy link
Contributor

Are proposals 1 and 2 mutually exclusive? I.e. could we have runtime and build-time checks?

I'm struggling to imagine what you mean here (and if we can check it statically, I don't see a need for dynamic checks on top, but maybe I misunderstood)

@clokep
Copy link
Member

clokep commented Oct 22, 2021

Just to throw another point out here -- the storage classes frequently call self.foo where foo is on a different class, this makes mypy very angry. I don't think this is a matter of "just" find a good way to describe this class, but we really need some refactoring here for the datastores to know who has what method and who can call it, etc.

@DMRobertson
Copy link
Contributor

Just to throw another point out here -- the storage classes frequently call self.foo where foo is on a different class, this makes mypy very angry. I don't think this is a matter of "just" find a good way to describe this class, but we really need some refactoring here for the datastores to know who has what method and who can call it, etc.

Previous discussion, as I remember it:

  • We can make the storage classes explicitly inherit from the classes they depend on
  • Still have a big inheritance web, but at least it's explicit and something that mypy et al can reason about
  • @richvdh: can we do better, e.g. with composition rather than inheritance?

@DMRobertson
Copy link
Contributor

Are proposals 1 and 2 mutually exclusive? I.e. could we have runtime and build-time checks?

I'm struggling to imagine what you mean here (and if we can check it statically, I don't see a need for dynamic checks on top, but maybe I misunderstood)

Broadly agree in principle. But we don't yet run mypy across the whole source tree, and I wouldn't want us to lull ourselves into a false sense of typechecked security. In contrast, I think it's relatively easy to get runtime checks that we can be confident in.

@reivilibre
Copy link
Contributor

I don't know if this suggestion was brought up before, and although verbose, I thought it might be worth a mention:

Instead of inheritance, each store is a field in the main store.
Each store has a reference to the main store (fed during its constructor, which is typed) and is instantiated in the main store's constructor.

We have abstract classes to describe anything that should be common across multiple stores, otherwise the individual stores will depend on a more specific type; stores that work anywhere accept an AbstractStore. Stores that are worker-specific accept a WorkerStore. (example names; maybe want to mention they're the main store).

This is a bit of a sketch, but I haven't looked too deeply at whether this will fit in or not:

class AbstractStore(abc.ABC): # I don't know if this is better as a Protocol or using abstract fields
    # I want to write:
    wombat_store: WombatStore
    
    # but I think only this way works?: (someone please correct me!)
    @property
    @abstractmethod
    def wombat_store() -> WombatStore:
        ...
    
    @property
    @abstractmethod
    def spqr_store() -> AbstractSpqrStore:
        ...
        
  
class AbstractSpqrStore(abc.ABC):  # this is a store common to both worker and main process, but with different implementations
    @abstractmethod
    def get_spqr_by_blahblah(blah: int) -> str: # as expected, list out the signatures of all the methods we would expect to be common.
        ...
        
        
class WorkerSpqrStore(AbstractSpqrStore):  # implementation of the worker side. You can imagine there'd be a different one for the main process.
    def __init__(self, store: AbstractStore) -> None:
        self.store = store
        
    def get_spqr_by_blahblah(blah: int) -> str:
        return self.store.db_pool.simple_select_one(...)


class WombatStore():  # this store is agnostic about main process/worker process
    def __init__(self, store: AbstractStore) -> None:
        self.store = store
    
    def get_wombat_blah_blah(x: int) -> str:
        self.store.db_pool.simple_select_one(...)
        
    def get_wombat_with_spqr(x: int) -> Tuple[str, str]:
        # well you get the point, don't forget the runInteraction magic
        return self.store.spqr_store.get_spqr_by_blahblah(...)
    
class WorkerStore(AbstractStore):  # this is the top-level store for workers.
    def __init__(self, hs) -> None:
        self.wombat_store = WombatStore(self)
        self.spqr_store = WorkerSpqrStore(self)
        self.db_pool = hs.db_pool # don't look at me as though I know what I'm
        # doing, but I guess something like this happens somewhere

In a way, the top-level 'main store' acts as a typing trampoline for all the little stores.
We would want to take care that the stores don't access each other in the constructor — if we needed such a thing for some reason then maybe we want a hook that happens after all the stores have been constructed. (Maybe the __init__ method can be implemented in a superclass and made final so there's no way you can mess with it, eliminating the chance of this happening.)
I imagine others have already thought of this kind of approach and come to a dead end (since I don't think it's a genius idea), but I would be interested in having that written down here if so :).

@clokep
Copy link
Member

clokep commented Nov 15, 2021

Instead of inheritance, each store is a field in the main store.
Each store has a reference to the main store (fed during its constructor, which is typed) and is instantiated in the main store's constructor.

This sounds similar to what we do for the RootConfig object.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. Z-Dev-Wishlist Makes developers' lives better, but doesn't have direct user impact
Projects
None yet
Development

No branches or pull requests

5 participants