CU-346mpwz Improving memory usage of MedCAT models #323

Merged 48 commits on Jul 6, 2023
1f796fa
CU-863gntc58 Add parent to child relationship getter to UMLS preproce…
mart-r May 10, 2023
47215e9
CU-863gntc58 Only use ISA relationships
mart-r May 10, 2023
9d04fbf
Make sure parents do not have themselves as children
mart-r May 10, 2023
69abf16
CU-863gntc58 Only keep preferred names
mart-r May 10, 2023
21aec90
CU-346mpwz Add memory optimiser for CDB
mart-r Jun 5, 2023
b795a86
CU-346mpwz Add name2<stuff> to memory optimiser for CDB
mart-r Jun 5, 2023
9a76a27
CU-346mpwz Add keys/items/values views to memory optimiser fake dicts
mart-r Jun 5, 2023
35a9858
CU-346mpwz Fix keys/items/values views in memory optimiser fake dicts
mart-r Jun 5, 2023
acca90b
CU-346mpwz Add option to optimise or not cui and/or name based dicts …
mart-r Jun 5, 2023
8635c19
CU-346mpwz Make default memory optimiser omit name2... optimising; ad…
mart-r Jun 5, 2023
48bee48
CU-346mpwz Remove unused/legacy code from memory optimiser
mart-r Jun 5, 2023
5a39b2a
CU-346mpwz Add tests for memory optimiser
mart-r Jun 5, 2023
9999d73
CU-346mpwz Add tests memory optimised CDB
mart-r Jun 5, 2023
0bbeb2f
CU-346mpwz Make dict names available within memory optimiser
mart-r Jun 5, 2023
df98418
CU-346mpwz Add separate tests for memory optimised CDB
mart-r Jun 5, 2023
f5df964
CU-346mpwz Remove unused imports in memory optimiser
mart-r Jun 5, 2023
f2f0b35
CU-346mpwz Move some encoding and decoding stuff within serialisation …
mart-r Jun 6, 2023
7e4259b
CU-346mpwz Add tests for encoding/decoding stuff
mart-r Jun 6, 2023
c448b52
CU-346mpwz Add encoding/decoding for delegating dict as well as postp…
mart-r Jun 6, 2023
c191a40
CU-346mpwz Fix decision upon JSON deserialisation of CDB when loading…
mart-r Jun 6, 2023
7ad50ad
CU-346mpwz Adapt serialisation tests to the potential one2many mappings
mart-r Jun 6, 2023
b6d99e1
CU-346mpwz Add tests for memory optimisation, including JSON serialis…
mart-r Jun 6, 2023
eb569d5
CU-346mpwz Remove debug print statements
mart-r Jun 6, 2023
a2cfe73
CU-346mpwz Remove debug methods from tests
mart-r Jun 6, 2023
bc79082
CU-346mpwz Fix method signatures in encoding/decoding methods
mart-r Jun 6, 2023
3b7c44f
CU-346mpwz Fix typing issue in serialiser when passing encoder
mart-r Jun 6, 2023
48e0dac
CU-346mpwz Relax typing restrictions for umls preprocessing / parent2…
mart-r Jun 6, 2023
05638be
CU-346mpwz Remove some debug variables
mart-r Jun 6, 2023
d9842e0
Merge branch 'master' of https://github.com/CogStack/MedCAT into cui2…
mart-r Jun 6, 2023
82c1f54
CU-346mpwz Fix remnant merge conflict
mart-r Jun 6, 2023
f6af4a0
CU-346mpwz Add item removal and popping to delegating dict
mart-r Jun 7, 2023
144bbdb
CU-346mpwz Add item removal and popping tests to delegating dict
mart-r Jun 7, 2023
366c487
CU-346mpwz Add item adding/setting tests to delegating dict
mart-r Jun 7, 2023
7273c53
CU-346mpwz Fix typing issue (List vs list)
mart-r Jun 7, 2023
579b59d
CU-346mpwz Add possibility of memory-optimising for snames as well
mart-r Jul 3, 2023
efd06c5
CU-346mpwz Add comment regarding memory-optimising for filtering by C…
mart-r Jul 3, 2023
4405246
CU-346mpwz Add sname based memory optimisation tests
mart-r Jul 3, 2023
d19633e
CU-346mpwz Add json serialisation capabilities to snames delegation
mart-r Jul 3, 2023
6690f15
CU-346mpwz Make sname optimisation default for memory optimisation
mart-r Jul 3, 2023
bf5f1e3
CU-346mpwz Fix typo in serialisation tests
mart-r Jul 5, 2023
0f984b0
CU-346mpwz Add variable to keep track of current memory optimisation …
mart-r Jul 6, 2023
f75a1d0
CU-346mpwz Add default cui2snames to sname optimisations; make sure s…
mart-r Jul 6, 2023
d73b4e0
CU-346mpwz Add method to undo CDB memory optimisation
mart-r Jul 6, 2023
4fdfbfc
CU-346mpwz Add tests for undoing CDB memory optimisation
mart-r Jul 6, 2023
22eb13d
CU-346mpwz Clear memory optimised parts if/when undoing optimisations
mart-r Jul 6, 2023
1f60f02
CU-346mpwz Remove accidentally added file/module
mart-r Jul 6, 2023
57ceef1
CU-346mpwz Add more straight forward optimisation part names; Fix mem…
mart-r Jul 6, 2023
9bc8905
CU-346mpwz Add further tests for memory optimisation (dirty state, ch…
mart-r Jul 6, 2023
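Several of the commits above refer to "fake dicts" and a "delegating dict" with keys/items/values views and item removal/popping. The following is a minimal sketch of that general idea, not the PR's actual implementation: the class name, constructor parameters, and the use of `None` as an "empty slot" marker are all assumptions made for illustration. Several dict-like views share one underlying dict whose values are small lists, so each key is stored once instead of once per mapping.

```python
from collections.abc import MutableMapping
from typing import Any, Dict, Iterator, List


class DelegatingDict(MutableMapping):
    """Dict-like view over one slot of a shared per-key list (illustrative only)."""

    def __init__(self, delegate: Dict[str, List[Any]], slot: int, nr_of_slots: int) -> None:
        self._delegate = delegate        # shared storage: key -> [slot 0 value, slot 1 value, ...]
        self._slot = slot                # position in each list that this view owns
        self._nr_of_slots = nr_of_slots  # length of each per-key list

    def __getitem__(self, key: str) -> Any:
        value = self._delegate[key][self._slot]
        if value is None:                # assumption: None marks "no value in this view"
            raise KeyError(key)
        return value

    def __setitem__(self, key: str, value: Any) -> None:
        row = self._delegate.setdefault(key, [None] * self._nr_of_slots)
        row[self._slot] = value

    def __delitem__(self, key: str) -> None:
        if key not in self:
            raise KeyError(key)
        self._delegate[key][self._slot] = None  # other views keep their values

    def __iter__(self) -> Iterator[str]:
        return (k for k, row in self._delegate.items() if row[self._slot] is not None)

    def __len__(self) -> int:
        return sum(1 for _ in self)


# Two views backed by the same storage (the CUI and type ID are made up).
shared: Dict[str, List[Any]] = {}
cui2names = DelegatingDict(shared, slot=0, nr_of_slots=2)
cui2type_ids = DelegatingDict(shared, slot=1, nr_of_slots=2)
cui2names["C0000001"] = {"fever"}
cui2type_ids["C0000001"] = {"T184"}
```

Subclassing `MutableMapping` supplies `keys()`, `items()`, `values()`, `pop()` and `in` checks from the five methods defined here. One limitation of this sketch is that `None` cannot be stored as a real value.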
5 changes: 3 additions & 2 deletions medcat/cat.py
@@ -40,7 +40,7 @@
 from medcat.vocab import Vocab
 from medcat.utils.decorators import deprecated
 from medcat.ner.transformers_ner import TransformersNER
-from medcat.utils.saving.serializer import SPECIALITY_NAMES
+from medcat.utils.saving.serializer import SPECIALITY_NAMES, ONE2MANY


 logger = logging.getLogger(__name__) # separate logger from the package-level one
@@ -353,7 +353,8 @@ def load_model_pack(cls,

         # Load the CDB
         cdb_path = os.path.join(model_pack_path, "cdb.dat")
-        has_jsons = len(glob.glob(os.path.join(model_pack_path, '*.json'))) >= len(SPECIALITY_NAMES)
+        nr_of_jsons_expected = len(SPECIALITY_NAMES) - len(ONE2MANY)
+        has_jsons = len(glob.glob(os.path.join(model_pack_path, '*.json'))) >= nr_of_jsons_expected
         json_path = model_pack_path if has_jsons else None
         logger.info('Loading model pack with %s', 'JSON format' if json_path else 'dill format')
         cdb = CDB.load(cdb_path, json_path)
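The change above lowers the number of `*.json` files required before a model pack is treated as JSON-serialised. A small standalone sketch of that check follows. It assumes `SPECIALITY_NAMES` and `ONE2MANY` are both collections of names, and the concrete names used here are hypothetical. The function `find_json_path` is also hypothetical and is not part of MedCAT.

```python
import glob
import os
import tempfile
from typing import Collection, Optional


def find_json_path(model_pack_path: str,
                   speciality_names: Collection[str],
                   one2many: Collection[str]) -> Optional[str]:
    """Return the pack path if enough JSON files are present, else None."""
    nr_of_jsons_expected = len(speciality_names) - len(one2many)
    nr_of_jsons_found = len(glob.glob(os.path.join(model_pack_path, "*.json")))
    return model_pack_path if nr_of_jsons_found >= nr_of_jsons_expected else None


# Hypothetical name collections, for illustration only.
SPECIALITY_NAMES = ("cui2names", "cui2snames", "cui2many")
ONE2MANY = ("cui2many",)

with tempfile.TemporaryDirectory() as pack_dir:
    # One JSON file: fewer than the 3 - 1 = 2 expected, so fall back to dill.
    with open(os.path.join(pack_dir, "cui2names.json"), "w") as f:
        f.write("{}")
    too_few = find_json_path(pack_dir, SPECIALITY_NAMES, ONE2MANY)

    # Two JSON files: enough, even though no one-to-many file was written.
    with open(os.path.join(pack_dir, "cui2snames.json"), "w") as f:
        f.write("{}")
    enough_matches = find_json_path(pack_dir, SPECIALITY_NAMES, ONE2MANY) == pack_dir
```

Under the previous check (`>= len(SPECIALITY_NAMES)`), the second case would have required three files and would have been loaded as dill format instead.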
17 changes: 14 additions & 3 deletions medcat/cdb.py
@@ -95,6 +95,7 @@ def __init__(self, config: Union[Config, None] = None) -> None:
         self._optim_params = None
         self.is_dirty = False
         self._hash: Optional[str] = None
+        self._memory_optimised_parts: Set[str] = set()

     def get_name(self, cui: str) -> str:
         """Returns preferred name if it exists, otherwise it will return
@@ -180,9 +181,13 @@ def remove_cui(self, cui: str) -> None:
         for name, cuis2status in self.name2cuis2status.items():
             if cui in cuis2status:
                 del cuis2status[cui]
-        self.snames = set()
-        for cuis in self.cui2snames.values():
-            self.snames |= cuis
+        if isinstance(self.snames, set):
+            # if this is a memory optimised CDB, this won't be a set
+            # but it also won't need to be changed since it
+            # relies directly on cui2snames
+            self.snames = set()
+            for cuis in self.cui2snames.values():
+                self.snames |= cuis
         self.name2count_train = {name: len(cuis) for name, cuis in self.name2cuis.items()}
         self.is_dirty = True
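
The `isinstance` guard above can be illustrated with a standalone sketch. It assumes, as the code comment suggests, that a memory-optimised `snames` is an object that answers membership queries from `cui2snames` rather than holding its own copy. The class `DelegatingValueSet` and the helper `rebuild_snames` are hypothetical names, not taken from the PR.

```python
from typing import Dict, Set, Union


class DelegatingValueSet:
    """Hypothetical stand-in for an optimised snames: reads through to cui2snames."""

    def __init__(self, cui2snames: Dict[str, Set[str]]) -> None:
        self._cui2snames = cui2snames

    def __contains__(self, name: object) -> bool:
        return any(name in snames for snames in self._cui2snames.values())


def rebuild_snames(snames: Union[Set[str], DelegatingValueSet],
                   cui2snames: Dict[str, Set[str]]) -> Union[Set[str], DelegatingValueSet]:
    """Mirror of the guarded block: only a plain set needs rebuilding."""
    if isinstance(snames, set):
        rebuilt: Set[str] = set()
        for names in cui2snames.values():
            rebuilt |= names
        return rebuilt
    # The delegating object already reflects the current cui2snames.
    return snames


cui2snames = {"C1": {"a", "b"}, "C2": {"b", "c"}}
plain = {"a", "b", "c"}
view = DelegatingValueSet(cui2snames)

del cui2snames["C2"]                              # simulate removing a CUI
plain_after = rebuild_snames(plain, cui2snames)   # recomputed from scratch
view_after = rebuild_snames(view, cui2snames)     # returned unchanged
```

The plain set has to be recomputed because it is a separate copy, whereas the delegating object sees the removal immediately.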

@@ -540,6 +545,10 @@ def filter_by_cui(self, cuis_to_keep: Union[List[str], Set[str]]) -> None:
         This also will not remove any data from cdb.addl_info - as this field can contain data of
         unknown structure.

+        As a side note, if the CDB has been memory-optimised, filtering will undo this memory optimisation.
+        This is because the dicts being involved will be rewritten.
+        However, the memory optimisation can be performed again afterwards.
+
         Args:
             cuis_to_keep (List[str]):
                 CUIs that will be kept, the rest will be removed (not completely, look above).
@@ -603,6 +612,8 @@ def filter_by_cui(self, cuis_to_keep: Union[List[str], Set[str]]) -> None:
         self.cui2type_ids = new_cui2type_ids
         self.cui2preferred_name = new_cui2preferred_name
         self.is_dirty = True
+        # reset memory optimisation state
+        self._memory_optimised_parts.clear()

     def make_stats(self):
         stats = {}
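
The state reset in `filter_by_cui` can be sketched with a toy class. This is only an illustration of the bookkeeping described in the docstring (filtering rewrites the dicts, so the optimisation record is cleared and optimisation can be re-applied). `ToyCDB`, `toy_optimise`, and the part name `"CUIS"` are hypothetical and do not reflect the real optimiser API.

```python
from typing import Dict, Set


class ToyCDB:
    """Toy stand-in for a CDB that tracks which parts are memory-optimised."""

    def __init__(self) -> None:
        self.cui2names: Dict[str, Set[str]] = {}
        self.is_dirty = False
        self._memory_optimised_parts: Set[str] = set()

    def filter_by_cui(self, cuis_to_keep: Set[str]) -> None:
        # Rebuilding the dict replaces whatever optimised structure was in place.
        self.cui2names = {cui: names for cui, names in self.cui2names.items()
                          if cui in cuis_to_keep}
        self.is_dirty = True
        # Reset memory optimisation state, as in the diff above.
        self._memory_optimised_parts.clear()


def toy_optimise(cdb: ToyCDB, part: str = "CUIS") -> None:
    """Hypothetical optimiser: here it only records that a part was optimised."""
    cdb._memory_optimised_parts.add(part)


cdb = ToyCDB()
cdb.cui2names = {"C1": {"a"}, "C2": {"b"}}
toy_optimise(cdb)
parts_before = set(cdb._memory_optimised_parts)

cdb.filter_by_cui({"C1"})
parts_after_filter = set(cdb._memory_optimised_parts)

toy_optimise(cdb)  # optimisation can be performed again afterwards
parts_after_reoptimise = set(cdb._memory_optimised_parts)
```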