v1.7.1 Release PR (#331)

* CU-8677ge6j8 Version identification and updating (#313) * Expose example model card version in metadata test * Add version detection along with tests * Move to a more comprehensive version string parser (regex) * Add more comprehensive versioning tests * Move MedCAT unzip to a separate method * Separate getting semantic version from string * Add new CDB with version information and use that with versioning tests * Add methods to get version info from CDB dump and model pack zip/folder * Exposing CDB file name and adding custom dev patch version support * Fix config.linking.filters.cuis - from empty dict to empty set * Add logging to versioning * Fix f-strings instead of (intended) r-strings * Add creating model pack archive to versioning CDB fix * Fix logger initialising * Making versioning a runnable module that allows fixing the config * Add docstrings to CLI methods * CU-8677ge6j8 Make explicit check regards to empty dict when fixing config * CU-8677ge6j8 Add tests regarding versioning changes * CU-8677ge6j8 Add missing return type hint * CU-8677ge6j8 Simplify action handling for CLI input * CU-8677ge6j8 Simplifying archive making method * Pin down transformers for the de-identification model (#314) * NO-TICKET pin down transformers for the de-id model * Added function to remove CUI from cdb (#316) * Added function to remove CUI from cdb * Unit test for remove_cui * CU-862jjprjw Fix github actions failures (#317) * Added function to remove CUI from cdb --------- Co-authored-by: antsh3k <antshek@hotmail.com> * CU-862jr8wkk Pin pydantic dependency to avoid conflicts with v2.0 (#318) * Bump django from 3.2.18 to 3.2.19 in /webapp/webapp Bumps [django](https://github.com/django/django) from 3.2.18 to 3.2.19. - [Commits](django/django@3.2.18...3.2.19) --- updated-dependencies: - dependency-name: django dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <support@github.com> * CU-863gntc58 Umlspt2ch (#322) * CU-863gntc58 Add parent to child relationship getter to UMLS preprocessing * CU-863gntc58 Only use ISA relationships * Make sure parents do not have themselves as children * CU-863gntc58 Only keep preferred names * CU-863gntc58 Fix typing issues * CU-863gntc58 Fix child-parent relationships being saved instea * Better system for avoiding parent-child being the same * Fix for Issue 325 (#326) * Issue-325 Add check for old/new spacy; fix code for nested entities * Issue-325 Fix a typing issue * Issue-325 Improve nested entity extraction in _doc_to_out; add type hint for individual entities * Issue-325 Remove unneccessary whitespace * Issue-325 Move spacy version detection from cat to utils.helpers * CU-86783u6d9 Add wrapper to simplify De-ID model usage (#324) * CU-2wgnqg5 Add javadoc to a method * CU-2wgnqg5 Fix issues with typing * CU-2wgnqg5 Add (potential) progress bar to regression testing * CU-2wgnqg5 Add runnable regression checker with command line arguments * CU-2wgnqg5 Add better help message for a CLI argument * CU-2wgnqg5 Fix import to use proper namespace * CU-2wgnqg5 Add parent-child functionality for filters * CU-2wgnqg5 Add cui and children option to the config example * Revert "CU-2wgnqg5 Fix import to use proper namespace" This reverts commit 882be44. 
* CU-2wgnqg5 Add default / empty children to translation layer * CU-2wgnqg5 Remove use of deprecated warning method * CU-2wgnqg5 Add new default test case that checks for 'heart rate' and its children 4 deep * CU-2wgnqg5 Remove unneccessary TODO comment * CU-2wgnqg5 Add possibility of using result reporting for regression checks * CU-2wgnqg5 Fix issue with delegations not shown for reports * CU-2wgnqg5 Add possibility of using reports for CLI regression testing * CU-2wgnqg5 Fix minor typing issues * CU-2wgnqg5 Fix typo in default regression config * CU-2wgnqg5 Make sure imports work both when running directly as well as when using as part of the project * CU-2wgnqg5 Add a new test case with the ANY strategy * CU-2wgnqg5 Fixing imports so that absolute imports are used * CU-2wgnqg5 Add new package to setup.py * CU-2wgnqg5 Fix typing issues * CU-2wgnqg5 Fix report output formating * CU-2vzhd93 Remove logging tutorials (move to MedCATtutorials) * CU-2wgnqg5 Move to a simpler filter design * CU-2wgnqg5 Add (optional) per-phrase results to results/reporting * CU-2wgnqg5 Add per-phrase information toggle to CLI * CU-2wgnqg5 Fix method signature changes between inherited classes * CU-2q50k3c: add contact email address. 
* added latest release news / accepted paper * Update README.md * CU-2zj4czk Move to a class based linking filter approach * CU-2zj4czk Move to identifier based linking filter access * CU-2zj4czk Use MCT filters when training supervised * New UMLS Full Model * CU-2zj4czk Make sure excluded CUIs are always specified (even if by an empty set) * CU-2zj4czk Add possibility of creating a copy of linking filters * CU-2zj4czk Use copies of linking.filters in train_supervised and _print_stats * CU-2zj4czk Add linking.filters merging functionality * CU-2zj4czk Add parameter to retain MCT filters within train_supervised * CU-2zj4czk Rename filters variable within print_stats method for better consistency and readability * CU-2zj4czk Consolidate some duplicate code between train_supervised and _print_stats * CU-2zj4czk Fix multi-project detection * CU-2zj4czk Fix linking filter merging * CU-2zj4czk Add tests for retaining filters from MCT along with a test-trainer export * CU-2zj4czk Remove debug print outputs from some tests * CU-2wgnqg5 Separate some of the regression code into different modules * Add URL of paper for Dutch model (#275) * CU-2wgnqg5 Add serialisation code along with tests * CU-2wgnqg5 Fix regression checker and case serialisation and add tests * CU-2wgnqg5 Add conversion code from MCT export to regression YAML along with tests * CU-2wgnqg5 Fix minor import and typing issues * CU-2wgnqg5 Add runnable to convert from MedCATtrainer to regression YAML * CU-2wgnqg5 Add for number of cases read from MCT export * CU-2wgnqg5 Add context selectors for conversion from MCT * CU-2wgnqg5 Add use of context selector to converter * CU-2wgnqg5 Add use of context selector to runnable * CU-2wgnqg5 Fix issue with typing * CU-2wgnqg5 Add regression case based progress bar in case the total of sub-cases is unknown * CU-2wgnqg5 Make sure (and test) that only 1 replacement '%s' is in each phrase for regression tests * CU-2wgnqg5 Add test cases for '%' replacement in context and 
some minor optimisation * CU-2wgnqg5 Add option to not show empty cases in report * CU-2wgnqg5 Fix verbose output mode/logging * CU-2wgnqg5 Fix name clashes in test cases * CU-2wgnqg5 Make conversion filter for both CUI and NAME * CU-2wgnqg5 Use different approach for generating targets for regression cases * CU-2wgnqg5 Add warning when no parent-child information is present (but continue to run) * Fix issue with typing * Add TODO comment regarding more comprehensive reporting * Fix whitespace issue * CU-2wgnqg5 Translation layer now able to confirm if a set of CUIs has a parent or child of a specified one * CU-2wgnqg5 Add reasons for failure of a regression case * CU-2wgnqg5 Make hiding failures a possibility from the CLI * CU-2wgnqg5 Use better report output for failures with summary * CU-2wgnqg5 Fix typing issues * CU-2wgnqg5 Add description to failed cases where applicable * CU-2wgnqg5 Fix successes not being reported on * CU-2wgnqg5 Rename some fail reasons for better readability * CU-2wgnqg5 Add test cases for specifeid CUI and name if/when none are found from the CDB * CU-2wgnqg5 Add extra information (names) in case of failure becasue name not in CDB * CU-2wgnqg5 Make converter consolidate different test cases with identical filters (CUI and name) into one with multiple phrases * CU-2wgnqg5 Remove use of TargetInfo and using a tuple instead * CU-2wgnqg5 Fix remnant targetinfo * CU-2wgnqg5 Fix remnant targetinfo stuff * CU-2wgnqg5 Fix remnant targetinfo in docstrings * CU-2wgnqg5 Fix missing argumnet in docstrings * CU-2wgnqg5 Allow only reports in regression checker * CU-2wgnqg5 Add medcat.utils.regression level parent logger * CU-2wgnqg5 Use medcat.utils.regression parent logger for verbose output in regression checker * CU-2wgnqg5 Move from logger.warn to logger.warning * CU-2wgnqg5 Fix issue with wrong targets being generated * CU-2wgnqg5 Fix checking tests * CU-2wgnqg5 Add dunder init to test (utils) packages to make the tests within discoverable * 
CU-2wgnqg5 Fix serialisation tests (add missing argument) * CU-2wgnqg5 Fix regression results tests (change method owner) * CU-2wgnqg5 Fix regression results tests (make names ordered) * CU-2wgnqg5 Remove unnecessary print output in test * CU-2wgnqg5 Update conversion code to not use target info * CU-2wgnqg5 Attempt to fix automated build on github actions (bin sklearn version) * CU-2wgnqg5 Move from sklearn to scikit-learn dependency * CU-2wgnqg5 Separate some code in converting, add docs * CU-2wgnqg5 Make yaml dumping save for yaml representation of regression checker * CU-2wgnqg5 Add initial editing code with some simple tests * CU-2wgnqg5 Add possibility for combinations to ignore identicals * CU-2wgnqg5 Add docs to the editing/combining methods * CU-2wgnqg5 Add runnable python file for combining different regression YAMLs * CU-2wgnqg5 Minor codebase improvements * CU-2wgnqg5 Make FailReasons serializable * CU-2wgnqg5 Add json output to regression checking * Make stats reporting not have np.nan values on empty train count (#277) * CU-327vb66 make stats reporting not have np.nan values on empty train count * CU-327vb66 start using scikit-learn instead of deprecated sklearn * Bump django from 3.2.15 to 3.2.16 in /webapp/webapp Bumps [django](https://github.com/django/django) from 3.2.15 to 3.2.16. - [Release notes](https://github.com/django/django/releases) - [Commits](django/django@3.2.15...3.2.16) --- updated-dependencies: - dependency-name: django dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * Update ReadMe.md to show Licence change Updated News Section * CU-2wgnqg5 Add docstring to fail descriptor getter method * CU-2wgnqg5 Removed handled TODO * CU-33g09h4 Make strides towards PEP 257. 
Make all docstrings use triple double quotes; remove preceding whitespace from docstrings; remove raw-string docstrings where applicable; remove empty docstrings * CU-2zj4czk Add documentation regarding config.linking.filters * CU-2zj4czk Add test for leakage of extra_cui_filters * CU-33g09h4 Remove leftover whitespace from start of docstring * include joblib dep * CU-2zj4czk Add parameter to retain extra_cui_filters (instead of MCT filters). Make sure tests pass. * CU-33g09h4 Some docstring unification for config(s) * CU-33g09h4 Some docstring unification for pipe, meta_cat and vocab * CU-33g09h4 Some docstring unification for cdb * CU-33g09h4 Some docstring unification for cdb maker * CU-33g09h4 Some docstring unification for cdb and maker (Return: to Returns:) * CU-33g09h4 Some docstring unification for cat * CU-33g09h4 Fix typo in docstring * CU-33g09h4 Some docstring unification for utils * CU-33g09h4 Some docstring unification for tokenizers * CU-33g09h4 Some docstring unification for preprocessors * CU-33g09h4 Some docstring unification for NER parts * CU-33g09h4 Some docstring unification for NEO parts * CU-33g09h4 Some docstring unification for linking parts * CU-33g09h4 Some docstring unification for cogstack connection part * CU-33g09h4 Remove some leftover backticks from docstring types * CU-33g09h4 Remove some leftover 'Return:' -> 'Returns:' changes * CU-33g09h4 Fix typo in a return type name * CU-384mewq match post release branches in the production workflow (#283) * CU-346mpxm Add new JSON based (faster) serialization for CDB along with tests * CU-346mpxm Add new package to setup.py; add logger and docstrings to serializer; remove dead code and comments * CU-346mpxm Remove leftover codel; Fix type safety regarding optinal json path * CU-346mpxm Add logging on writing to serializer * CU-346mpxm Add logging on reading to serializer * CU-346mpxm Make deserializing consistent with previous CDB deserialising * CU-346mpxm Add JSON serialisation to CDB * 
CU-346mpxm Remove issue with circular imports * CU-346mpxm Make sure json files end with .json * CU-346mpxm Add json type format to modelpack creation * CU-346mpxm Add tests for json format modelpack creation * CU-346mpxm Add logging output to model pack creation and loading * CU-346mpxm Add model pack converter / runnable * Update README.md * CU-862hyd5wx Unify rosalind/vocab downloading in tests, identify and fail meaningfully in case of 503 * CU-862hyd5wx Remove unused imports in tests due to last commit * CU-862hyd5wx Add possibility of generating and using a simply vocab when Rosalind is down * CU-862hyd5wx Fix small typo in tests * Loosen dependency restrictions (#289) Signed-off-by: zethson <lukas.heumos@posteo.net> Signed-off-by: zethson <lukas.heumos@posteo.net> * bug found in snomed2OPCS func * markdown improvements * Mapping icd10 and opcs complete * get all children func added * pep8 fixes * Update README.md * Add confusion matrix to meta model evaluation * CU-862j0jcdu / CU-862j0jd2n Cdb json (#295) * CU-862j0jcdu Rename format parameter in model creation to specify it only applys to the CDB * CU-862j0jd2n Add addl_info to be JSON serialised when required * CU-862j0jd2n Add addl_info to docstring of CDB serializer * CU-38g55wn / CU-39cmv82 Support for python3.11 (and 3.10) (#285) * CU-38g55wn Move dependencies to (hopefully) support python 3.11 on Ubuntu * CU-38g55wn Attempt to fix dependencies for github dependency (gensim) * CU-38g55wn Attempt to fix dependencies for github dependency (gensim) x2 * CU-38g55wn Attempt to fix dependencies for github dependency (gensim) x3 * CU-38g55wn Attempt to fix dependencies for github dependency (gensim) x4 * CU-38g55wn Attempt to fix dependencies for github dependency (gensim) x5 - fix missing comma * CU-38g55wn Remove errorenous package from setup.py * CU-38g55wn Bump spacy version so as to (hopefully) fix pydantic issues * CU-38g55wn Bump spacy en_core_web_md version so as to (hopefully) fix requirements issues 
* CU-38g55wn Fix test typo that was fixed on newere en_core_web_md * CU-38g55wn Fix small issue in NER test * CU-38g55wn Fix small issue with NER test (int conversion) * CU-38g55wn Mark some places as ignore where newer mypy complains * CU-38g55wn Bump mypy dev requirement version * CU-38g55wn Add python 3.11 and 3.10 to workflow * CU-38g55wn Trying to install gensim over https rather tha ssh * CU-38g55wn Make python versions strings in GH worfklow so 3.10 doesn't get 'rounded' to 3.10 when read * CU-38g55wn Remove python 3.7 from workflow since it's not compatible with required versions of numpy and scipy * CU-38g55wn Universally fixing NER test regarding the 'movar~viruse' -> 'movar~virus' thing * CU-38g55wn Bump gensim version to 4.3.0 - the first to support 3.11 * CU-862hyd5wx Unify rosalind/vocab downloading in tests, identify and fail meaningfully in case of 503 * CU-862hyd5wx Remove unused imports in tests due to last commit * CU-862hyd5wx Add possibility of generating and using a simply vocab when Rosalind is down * CU-862hyd5wx Remove python 3.7 and add 3.10/3.11 to classifiers * CU-862hyd5wx Reorder python versions in GitHub workflow * CU-862hyd5wx Attempt to fix GHA by importing unittest.mock explicitly * CU-39cmvru Faster hashing (#286) * CU-39cmvru Add marking of CDB dirty if/when concepts change. Avoid calculating its hash separately if it hasn't been dirtied. Add tests to verify behaviour. * CU-39cmvru Add possibility to force recalculation of hash for CDB (inlcuding when getting hash for CAT) * CU-39cmvru Add possibility to force recalculation of hash for CDB through modelcat creation (new parameter, propageting through _versioning) * CU-39cmvru Remove previous hash from influencing hashing of CDB to produce consistent hash on every recalculation Add tests to make sure that is the case on the CDB level as well as the CAT/modelpack level. 
* CU-39cmvru Add logging around the (re)calclulation of the CDB hash * CU-39cmvru Fix typo in log message * CU-39cmvru Add test to make sure the CDB hash is saved to disk and loaded from disk * CU-39cmvru Add possibility to calculate hash upon saving of CDB if/when the hash is unknown (i.e when saving outside a model pack) * CU-39cmvru Add CDB dirty flag to all other methods that modify the CDB * Change confusion matrix to DF and add labels * Fix model config * CU-86777ey74 No elastic dependency (#298) * Removed elastic dependency * CU-86777ey74 Remove module that depends on elastic (cogstack/cogstack_conn) * CU-86777ey74 Remove medcat.cogstack package from setup.py packages * Docstring updated to google-style docstring * CU-2e77a2k Remove unused utility modules * CU-2e77a2k Remove deprecated utils * Bump django from 3.2.16 to 3.2.17 in /webapp/webapp Bumps [django](https://github.com/django/django) from 3.2.16 to 3.2.17. - [Release notes](https://github.com/django/django/releases) - [Commits](django/django@3.2.16...3.2.17) --- updated-dependencies: - dependency-name: django dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <support@github.com> * CU-33g0f3w Read the docs build failures (#306) * CU-33g0f3w Pin aiohttp dependency version for docs * CU-33g0f3w Pin aiohttp dependency version for docs (#303) * CU-33g0f3w Pin aiohttp dependency version for docs in setup.py * Read the docs build failures (#304) * CU-33g0f3w Pin aiohttp dependency version for docs * CU-33g0f3w Pin aiohttp dependency version for docs in setup.py * CU-33g0f3w Pin blis dependency version for docs in setup.py * Add options for loading meta models and additional NERs (#300) * CU-8677aud63 add options for loading meta models and addl NERs * CU-8677aud63 reduce memory usage during test * Style fix * NO-TICKET reduce the false positives on pushing to test pypi (#307) * CU-862j5by9q Regression touchup - metadata and ability to split suites into categories (#301) * CU-862j5by9q Add metadata to regression suite, loaded from model card if/when specified. A model can be specified upon creation to get the model card from. 
* CU-862j5by9q Remove f-string from string with no placeholders * CU-862j5by9q Make regression case hashable * CU-862j5by9q Add category separation to regression test suite along with automated tests and test example * CU-862j5by9q Add missing docstringgs to category separation * CU-862j5by9q Add saving to category separator and a convenience method for separation based on regression test YAML file and categories YAML file * CU-862j5by9q Add missing docstrings to new methods * CU-862j5by9q Fix typo in class name * CU-862j5by9q Fix saving issue for separation results * CU-862j5by9q Add runnable category separator * CU-862j5by9q Separate some file location constants in separation tests * CU-862j5by9q Add test for separation that checks that no information gets lost (in the specific situation) * CU-862j5by9q Add an anything-goes category description * CU-862j5by9q Fix anything-goes option * CU-862j5by9q Add tests for anything-goes category description * CU-862j5by9q Add possibility of using an overflow category when separating regression suite * CU-862j5by9q Add use of the overflow category to the runnable * CU-862j5by9q Fix linting and typing issues * CU-862j5by9q Add test for each individual separated suite * CU-862j5by9q Fix minor abstract class issues * CU-862j5by9q Rename categoryseparation module as category_separation * CU-862j5by9q Add docstrings to category_separator * CU-8677craqe make transformer_ner continue processing other entities after the first non-matching * Bump django from 3.2.17 to 3.2.18 in /webapp/webapp Bumps [django](https://github.com/django/django) from 3.2.17 to 3.2.18. - [Release notes](https://github.com/django/django/releases) - [Commits](django/django@3.2.17...3.2.18) --- updated-dependencies: - dependency-name: django dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <support@github.com> * CU-862j7b9jc Mypy full release - 1.0.0 (#308) * CU-862j7b9jc Add abstract base class to regression converting strategy where necessary * CU-862j7b9jc Bump mypy to version 1.0.0 * CU-862j7b9jc Mypy abc hotfix (#311) * CU-862j7b9jc Fix issue with duplicate imports * CU-862j7b9jc Fix issue with no whitespace after keyword (E275) * CU-862j7b9jc Remove unnecessary brackets from if statement * CU-8677ge6j8 Version identification and updating (#313) * Expose example model card version in metadata test * Add version detection along with tests * Move to a more comprehensive version string parser (regex) * Add more comprehensive versioning tests * Move MedCAT unzip to a separate method * Separate getting semantic version from string * Add new CDB with version information and use that with versioning tests * Add methods to get version info from CDB dump and model pack zip/folder * Exposing CDB file name and adding custom dev patch version support * Fix config.linking.filters.cuis - from empty dict to empty set * Add logging to versioning * Fix f-strings instead of (intended) r-strings * Add creating model pack archive to versioning CDB fix * Fix logger initialising * Making versioning a runnable module that allows fixing the config * Add docstrings to CLI methods * CU-8677ge6j8 Make explicit check regards to empty dict when fixing config * CU-8677ge6j8 Add tests regarding versioning changes * CU-8677ge6j8 Add missing return type hint * CU-8677ge6j8 Simplify action handling for CLI input * CU-8677ge6j8 Simplifying archive making method * Pin down transformers for the de-identification model (#314) * NO-TICKET pin down transformers for the de-id model * Added function to remove CUI from cdb (#316) * Added function to remove CUI from cdb * Unit test for remove_cui * CU-862jjprjw Fix github actions failures (#317) * Added function to remove CUI from cdb --------- Co-authored-by: antsh3k <antshek@hotmail.com> * CU-862jr8wkk Pin 
pydantic dependency to avoid conflicts with v2.0 (#318) * Bump django from 3.2.18 to 3.2.19 in /webapp/webapp Bumps [django](https://github.com/django/django) from 3.2.18 to 3.2.19. - [Commits](django/django@3.2.18...3.2.19) --- updated-dependencies: - dependency-name: django dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * CU-863gntc58 Umlspt2ch (#322) * CU-863gntc58 Add parent to child relationship getter to UMLS preprocessing * CU-863gntc58 Only use ISA relationships * Make sure parents do not have themselves as children * CU-863gntc58 Only keep preferred names * CU-863gntc58 Fix typing issues * CU-863gntc58 Fix child-parent relationships being saved instea * Better system for avoiding parent-child being the same * CU-86783u6d9 Add wrapper to simplify De-ID model usage * CU-86783u6d9 Add wrapper to simplify De-ID model usage * CU-86783u6d9 Fix typoe (nod vs not) * CU-86783u6d9 Fix typo in docstring * CU-86783u6d9 Change loading method name to match CAT * CU-86783u6d9 Separate NER model from DeID model * Better separation of NER models from DeID models * CU-86783u6d9 Move deid method from helpers module to deid model and deprecated the use of the wrappers in the helpers module * Fix imports in deid model * Fix deid training method return value * CU-86783u6d9 Fix dunder call defaults for redaction * CU-86783u6d9 Add a few simple tests for the DeID model * CU-86783u6d9 Add redaction test for the DeID model * CU-86783u6d9 Add remove senitive data * CU-86783u6d9 Fix deid model validation * CU-86783u6d9 Add ChatGPT generated DeId trian data * CU-86783u6d9 Add Warning regarding deid training data * CU-86783u6d9 Fix model issue with multiple NER models * CU-86783u6d9 Fix merge conflict in docstring * CU-86783u6d9 Try and fix keyword argument duplication * CU-86783u6d9 Ignore mypy where needed * CU-86783u6d9 Fix issue with NER model being returned when loading a DeID model * CU-86783u6d9 Remove unused import * CU-86783u6d9 
Update training data with some more examples * CU-86783u6d9 Add type hints and doc string to deid method * CU-86783u6d9 Add comment regarding deid_text method being outside the model class * CU-86783u6d9 Add missing return type * CU-86783u6d9 Expose get_entities in NER model * CU-86783u6d9 Expose dunder call in NER model * CU-86783u6d9 Remove dunder call in override in deid model * CU-86783u6d9 Fix deid model tests * CU-86783u6d9 Fix a few typos in docstrings * CU-86783u6d9 Fix a method name in docstrings --------- Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: zethson <lukas.heumos@posteo.net> Co-authored-by: tomolopolis <tsearle88@gmail.com> Co-authored-by: Zeljko <w.kraljevic@gmail.com> Co-authored-by: Sander Tan <s.c.tan-3@umcutrecht.nl> Co-authored-by: Xi Bai <82581439+baixiac@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Anthony Shek <55877857+antsh3k@users.noreply.github.com> Co-authored-by: Lukas Heumos <lukas.heumos@posteo.net> Co-authored-by: antsh3k <antshek@hotmail.com> Co-authored-by: James Brandreth <james.brandreth@gmail.com> Co-authored-by: Xi Bai <baixiac@gmail.com> * CU-862k1tt90 Fix circular imports by moving raw deid method back to helpers module (#328) * CU-862k1tt90 Fix circular imports by moving raw deid method back to helpers module * CU-862k1tt90 Fix missing import regarding deid * CU-862k1tt90 Remove unnecessary newline * Cu 863h30jyb separate train from data load (#329) * CU-863h30jyb Deprecated train_supervised method in favour of train_supervised_from_json method * CU-863h30jyb Shuffle around docstrings for supoervised training methods * CU-863h30jyb Create new train_supervised_raw method for raw data based training * CU-863h30jyb In MetaCat deprecate train method and replace with train_from_json method * CU-863h30jyb In MetaCat add train_raw method and move most of the training logic into that one * CU-863h30jyb Fix type hint * 
CU-86785yhfk Add method to populate cui2snames with data from cui2names (#327) * CU-86785yhfk Add method to populate cui2snames with data from cui2names * CU-86785yhfk Add test for cui2sname population method * Bump django from 3.2.19 to 3.2.20 in /webapp/webapp Bumps [django](https://github.com/django/django) from 3.2.19 to 3.2.20. - [Commits](django/django@3.2.19...3.2.20) --- updated-dependencies: - dependency-name: django dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * CU-346mpwz Improving memory usage of MedCAT models (#323) * CU-863gntc58 Add parent to child relationship getter to UMLS preprocessing * CU-863gntc58 Only use ISA relationships * Make sure parents do not have themselves as children * CU-863gntc58 Only keep preferred names * CU-346mpwz Add memory optimiser for CDB * CU-346mpwz Add name2<stuff> to memory optimiser for CDB * CU-346mpwz Add keys/items/values views to memory optimiser fake dicts * CU-346mpwz Fix keys/items/values views in memory optimiser fake dicts * CU-346mpwz Add option to optimise or not cui and/or name based dicts in memory optimiser * CU-346mpwz Make default memory optimiser omit name2... 
optimising; add comment regarding this in docstring * CU-346mpwz Remove unused/legacy code from memory optimiser * CU-346mpwz Add tests for memory optimiser * CU-346mpwz Add tests memory optimised CDB * CU-346mpwz Make dict names available within memory optimiser * CU-346mpwz Add separate tests for memory optimised CDB * CU-346mpwz Remove unused imports in memory optimiser * CU-346mpwz Move some encoding and decoing stuff within serialisation to their own module * CU-346mpwz Add tests for encoding/decoding stuff * CU-346mpwz Add encoding/decoding for delegating dict as well as postprocessing for delegation linking with json serialisation * CU-346mpwz Fix decision upon JSON deserialisation of CDB when loading model pack * CU-346mpwz Adapt serialisation tests to the potential one2many mappings * CU-346mpwz Add tests for memory optimisation, including JSON serialisation ones * CU-346mpwz Remove debug print statements * CU-346mpwz Remove debug methods from tests * CU-346mpwz Fix method signatures in encoding/decoding methods * CU-346mpwz Fix typing issue in serialiser when passing encoder * CU-346mpwz Relax typing restrictions for umls preprocessing / parent2child mapping * CU-346mpwz Remove some debug variables * CU-346mpwz Fix remnant merge conflict * CU-346mpwz Add item removal and popping to delegating dict * CU-346mpwz Add item removal and popping tests to delegating dict * CU-346mpwz Add item adding/setting tests to delegating dict * CU-346mpwz Fix typing issue (List vs list) * CU-346mpwz Add possibility of memory-optimising for snames as well * CU-346mpwz Add comment regarding memory-optimising for filtering by CUI to CDB * CU-346mpwz Add sname based memory optimisation tests * CU-346mpwz Add json serialisation capabilities to snames delegation * CU-346mpwz Make sname optimisation default for memory optimisation * CU-346mpwz Fix typo in serialisation tests * CU-346mpwz Add variable to keep track of current memory optimisation info to CDB * CU-346mpwz Add default 
cui2snames to sname optimisations; make sure sname optimisation dirties the CDB * CU-346mpwz Add method to undo CDB memory optimisation * CU-346mpwz Add tests for undoing CDB memory optimisation * CU-346mpwz Clear memory optimised parts if/when undoing optimisations * CU-346mpwz Remove accidentally added file/module * CU-346mpwz Add more straight forward optimisation part names; Fix memory optimisation part clearing * CU-346mpwz Add further tests for memory optimisation (dirty state, checking optimised parts) --------- Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: zethson <lukas.heumos@posteo.net> Co-authored-by: Xi Bai <82581439+baixiac@users.noreply.github.com> Co-authored-by: Anthony Shek <55877857+antsh3k@users.noreply.github.com> Co-authored-by: antsh3k <antshek@hotmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: tomolopolis <tsearle88@gmail.com> Co-authored-by: Zeljko <w.kraljevic@gmail.com> Co-authored-by: Sander Tan <s.c.tan-3@umcutrecht.nl> Co-authored-by: Lukas Heumos <lukas.heumos@posteo.net> Co-authored-by: James Brandreth <james.brandreth@gmail.com> Co-authored-by: Xi Bai <baixiac@gmail.com>
CogStack · Jul 6, 2023 · commit 3396c4d · 1 parent 4af3b14
Showing 26 changed files with 2,299 additions and 85 deletions.
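
The `medcat/cat.py` diff below factors the model pack unzip step out of `load_model_pack` into a new `attempt_unpack` classmethod. As a rough standalone sketch of that behaviour (not the MedCAT code itself), the snippet derives the target folder by stripping `.zip` from the file name and skips extraction when that folder already exists. The function names `derive_model_pack_path`, `unpack_if_needed` and `demo`, and the file names used in the demo, are hypothetical and exist only for illustration.

```python
import os
import shutil
import tempfile


def derive_model_pack_path(zip_path: str) -> str:
    """Return the folder path next to the ZIP, named after it without '.zip'."""
    base_dir = os.path.dirname(zip_path)
    foldername = os.path.basename(zip_path).replace(".zip", "")
    return os.path.join(base_dir, foldername)


def unpack_if_needed(zip_path: str) -> str:
    """Unpack the archive only if the target folder does not exist yet."""
    model_pack_path = derive_model_pack_path(zip_path)
    if not os.path.exists(model_pack_path):
        shutil.unpack_archive(zip_path, extract_dir=model_pack_path)
    return model_pack_path


def demo():
    """Build a throwaway archive and unpack it twice (hypothetical file names)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src")
        os.makedirs(src)
        with open(os.path.join(src, "cdb.dat"), "w") as f:
            f.write("placeholder")
        # make_archive appends '.zip' and returns the full archive path
        zip_path = shutil.make_archive(
            os.path.join(tmp, "example_pack"), "zip", root_dir=src
        )
        first = unpack_if_needed(zip_path)
        # A file written into the unpacked folder survives the second call,
        # because the existing folder is reused rather than re-extracted.
        marker = os.path.join(first, "marker.txt")
        with open(marker, "w") as f:
            f.write("kept")
        second = unpack_if_needed(zip_path)
        return first == second, os.path.exists(marker), sorted(os.listdir(first))
```

One consequence of this design, visible in the sketch, is that a stale or partially extracted folder is reused as-is; the ZIP is only consulted when no folder of that name exists.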
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
@@ -42,7 +42,7 @@ jobs:
       github.ref == 'refs/heads/master' &&
       github.event_name == 'push' &&
       startsWith(github.ref, 'refs/tags') != true
-    runs-on: ubuntu-18.04
+    runs-on: ubuntu-20.04
     concurrency: publish-to-test-pypi
     needs: [build]
 

diff --git a/.github/workflows/production.yml b/.github/workflows/production.yml
@@ -8,7 +8,7 @@ on:
 
 jobs:
   build-n-publish-to-pypi:
-    runs-on: ubuntu-18.04
+    runs-on: ubuntu-20.04
     concurrency: build-n-publish-to-pypi
     if: github.repository == 'CogStack/MedCAT'
 

diff --git a/examples/cdb_new.dat b/examples/cdb_new.dat
diff --git a/medcat/cat.py b/medcat/cat.py
@@ -28,7 +28,7 @@
 from medcat.utils.data_utils import make_mc_train_test, get_false_positives
 from medcat.utils.normalizers import BasicSpellChecker
 from medcat.utils.checkpoint import Checkpoint, CheckpointConfig, CheckpointManager
-from medcat.utils.helpers import tkns_from_doc, get_important_config_parameters
+from medcat.utils.helpers import tkns_from_doc, get_important_config_parameters, has_new_spacy
 from medcat.utils.hasher import Hasher
 from medcat.ner.vocab_based_ner import NER
 from medcat.linking.context_based_linker import Linker
@@ -40,12 +40,15 @@
 from medcat.vocab import Vocab
 from medcat.utils.decorators import deprecated
 from medcat.ner.transformers_ner import TransformersNER
-from medcat.utils.saving.serializer import SPECIALITY_NAMES
+from medcat.utils.saving.serializer import SPECIALITY_NAMES, ONE2MANY
 
 
 logger = logging.getLogger(__name__) # separate logger from the package-level one
 
 
+HAS_NEW_SPACY = has_new_spacy()
+
+
 class CAT(object):
     """The main MedCAT class used to annotate documents, it is built on top of spaCy
     and works as a spaCy pipeline. Creates an instance of a spaCy pipeline that can
@@ -299,6 +302,31 @@ def create_model_pack(self, save_dir_path: str, model_pack_name: str = DEFAULT_M
         logger.info(self.get_model_card()) # Print the model card
         return model_pack_name
 
+    @classmethod
+    def attempt_unpack(cls, zip_path: str) -> str:
+        """Attempt to unpack the zip to a folder and get the model pack path.
+
+        If the folder already exists, no unpacking is done.
+
+        Args:
+            zip_path (str): The ZIP path
+
+        Returns:
+            str: The model pack path
+        """
+        base_dir = os.path.dirname(zip_path)
+        filename = os.path.basename(zip_path)
+
+        foldername = filename.replace(".zip", '')
+
+        model_pack_path = os.path.join(base_dir, foldername)
+        if os.path.exists(model_pack_path):
+            logger.info("Found an existing unziped model pack at: {}, the provided zip will not be touched.".format(model_pack_path))
+        else:
+            logger.info("Unziping the model pack and loading models.")
+            shutil.unpack_archive(zip_path, extract_dir=model_pack_path)
+        return model_pack_path
+
     @classmethod
     def load_model_pack(cls,
                         zip_path: str,
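
The unpack-once behaviour that this hunk factors out into `attempt_unpack` can be tried in isolation. The block below is a minimal standalone sketch, not MedCAT's code: `unpack_once` and `demo` are hypothetical names, and the archive content is a placeholder. It mirrors the logic in the diff: derive the folder name from the zip name, and skip extraction if that folder already exists.

```python
import os
import shutil
import tempfile
import zipfile


def unpack_once(zip_path: str) -> str:
    """Unpack zip_path next to itself unless the target folder already exists."""
    base_dir = os.path.dirname(zip_path)
    # Same folder-name derivation as in the diff above.
    folder_name = os.path.basename(zip_path).replace(".zip", "")
    target = os.path.join(base_dir, folder_name)
    if not os.path.exists(target):
        shutil.unpack_archive(zip_path, extract_dir=target)
    return target


def demo() -> str:
    """Build a throwaway archive, unpack it twice, and return the file content."""
    tmp = tempfile.mkdtemp()
    zip_path = os.path.join(tmp, "example_pack.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("cdb.dat", "placeholder")  # placeholder, not a real CDB
    first = unpack_once(zip_path)
    # Edit the unpacked copy; the second call must leave it untouched.
    with open(os.path.join(first, "cdb.dat"), "w") as f:
        f.write("edited")
    second = unpack_once(zip_path)
    assert first == second
    with open(os.path.join(second, "cdb.dat")) as f:
        return f.read()
```

One consequence of this design is that a stale or partially extracted folder is reused as-is; removing the folder forces a fresh extraction.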
@@ -324,20 +352,12 @@ def load_model_pack(cls,
         from medcat.vocab import Vocab
         from medcat.meta_cat import MetaCAT
 
-        base_dir = os.path.dirname(zip_path)
-        filename = os.path.basename(zip_path)
-        foldername = filename.replace(".zip", '')
-
-        model_pack_path = os.path.join(base_dir, foldername)
-        if os.path.exists(model_pack_path):
-            logger.info("Found an existing unziped model pack at: {}, the provided zip will not be touched.".format(model_pack_path))
-        else:
-            logger.info("Unziping the model pack and loading models.")
-            shutil.unpack_archive(zip_path, extract_dir=model_pack_path)
+        model_pack_path = cls.attempt_unpack(zip_path)
 
         # Load the CDB
         cdb_path = os.path.join(model_pack_path, "cdb.dat")
-        has_jsons = len(glob.glob(os.path.join(model_pack_path, '*.json'))) >= len(SPECIALITY_NAMES)
+        nr_of_jsons_expected = len(SPECIALITY_NAMES) - len(ONE2MANY)
+        has_jsons = len(glob.glob(os.path.join(model_pack_path, '*.json'))) >= nr_of_jsons_expected
         json_path = model_pack_path if has_jsons else None
         logger.info('Loading model pack with %s', 'JSON format' if json_path else 'dill format')
         cdb = CDB.load(cdb_path, json_path)
@@ -823,6 +843,8 @@ def add_and_train_concept(self,
                 for _cui in cuis:
                     self.linker.context_model.train(cui=_cui, entity=spacy_entity, doc=spacy_doc, negative=True)  # type: ignore
 
+    @deprecated(message="Use train_supervised_from_json to train based on data "
+                "loaded from a json file")
     def train_supervised(self,
                          data_path: str,
                          reset_cui_count: bool = False,
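
The pattern applied to `train_supervised` here (keep the old name, mark it deprecated, delegate to the new method) can be sketched on its own. Everything in the block is illustrative: the `deprecated` decorator is a hypothetical stand-in written for this example, not the one in `medcat.utils.decorators`, and `Trainer` with its two methods is invented.

```python
import functools
import warnings
from typing import Any, Callable


def deprecated(message: str) -> Callable:
    """Hypothetical deprecation decorator (an assumption, not MedCAT's implementation)."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            warnings.warn(f"{func.__name__} is deprecated. {message}",
                          DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator


class Trainer:
    def train_from_json(self, data_path: str, nepochs: int = 1) -> tuple:
        # New entry point; in this sketch it only echoes its arguments.
        return (data_path, nepochs)

    @deprecated(message="Use train_from_json instead")
    def train(self, data_path: str, nepochs: int = 1) -> tuple:
        # Old name kept for backwards compatibility; delegates to the new method.
        return self.train_from_json(data_path, nepochs)
```

Existing callers keep working and see a `DeprecationWarning`, while new code can call the replacement directly.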
@@ -842,9 +864,93 @@ def train_supervised(self,
                          checkpoint: Optional[Checkpoint] = None,
                          retain_filters: bool = False,
                          is_resumed: bool = False) -> Tuple:
-        """TODO: Refactor, left from old
-        Run supervised training on a dataset from MedCATtrainer. Please take care that this is more a simulated
-        online training then supervised.
+        """Train supervised by reading data from a json file.
+
+        Refer to `train_supervised_from_json` and/or `train_supervised_raw`
+        for further details.
+        """
+        return self.train_supervised_from_json(data_path, reset_cui_count, nepochs,
+                                               print_stats, use_filters, terminate_last,
+                                               use_overlaps, use_cui_doc_limit, test_size,
+                                               devalue_others, use_groups, never_terminate,
+                                               train_from_false_positives, extra_cui_filter,
+                                               retain_extra_cui_filter, checkpoint,
+                                               retain_filters, is_resumed)
+
+    def train_supervised_from_json(self,
+                                   data_path: str,
+                                   reset_cui_count: bool = False,
+                                   nepochs: int = 1,
+                                   print_stats: int = 0,
+                                   use_filters: bool = False,
+                                   terminate_last: bool = False,
+                                   use_overlaps: bool = False,
+                                   use_cui_doc_limit: bool = False,
+                                   test_size: int = 0,
+                                   devalue_others: bool = False,
+                                   use_groups: bool = False,
+                                   never_terminate: bool = False,
+                                   train_from_false_positives: bool = False,
+                                   extra_cui_filter: Optional[Set] = None,
+                                   retain_extra_cui_filter: bool = False,
+                                   checkpoint: Optional[Checkpoint] = None,
+                                   retain_filters: bool = False,
+                                   is_resumed: bool = False) -> Tuple:
+        """
+        Run supervised training on a dataset from MedCATtrainer in JSON format.
+
+        Refer to `train_supervised_raw` for more details.
+        """
+        with open(data_path) as f:
+            data = json.load(f)
+        return self.train_supervised_raw(data, reset_cui_count, nepochs,
+                                         print_stats, use_filters, terminate_last,
+                                         use_overlaps, use_cui_doc_limit, test_size,
+                                         devalue_others, use_groups, never_terminate,
+                                         train_from_false_positives, extra_cui_filter,
+                                         retain_extra_cui_filter, checkpoint,
+                                         retain_filters, is_resumed)
+
+    def train_supervised_raw(self,
+                             data: Dict[str, List[Dict[str, dict]]],
+                             reset_cui_count: bool = False,
+                             nepochs: int = 1,
+                             print_stats: int = 0,
+                             use_filters: bool = False,
+                             terminate_last: bool = False,
+                             use_overlaps: bool = False,
+                             use_cui_doc_limit: bool = False,
+                             test_size: int = 0,
+                             devalue_others: bool = False,
+                             use_groups: bool = False,
+                             never_terminate: bool = False,
+                             train_from_false_positives: bool = False,
+                             extra_cui_filter: Optional[Set] = None,
+                             retain_extra_cui_filter: bool = False,
+                             checkpoint: Optional[Checkpoint] = None,
+                             retain_filters: bool = False,
+                             is_resumed: bool = False) -> Tuple:
+        """Train supervised based on the raw data provided.
+
+        The raw data is expected in the following format:
+        {'projects':
+            [ # list of projects
+                { # project 1
+                    'name': '<some name>',
+                    # list of documents
+                    'documents': [{'name': '<some name>',  # document 1
+                                    'text': '<text of the document>',
+                                    # list of annotations
+                                    'annotations': [{'start': -1,  # annotation 1
+                                                    'end': 1,
+                                                    'cui': 'cui',
+                                                    'value': '<text value>'}, ...],
+                                    }, ...]
+                }, ...
+            ]
+        }
+
+        Please take care that this is more a simulated online training than supervised.
 
         When filtering, the filters within the CAT model are used first,
         then the ones from MedCATtrainer (MCT) export filters,
@@ -853,8 +959,8 @@ def train_supervised(self,
         extra_cui_filter ⊆ MCT filter ⊆ Model/config filter.
 
         Args:
-            data_path (str):
-                The path to the json file that we get from MedCATtrainer on export.
+            data (Dict[str, List[Dict[str, dict]]]):
+                The raw data, e.g. from a MedCATtrainer export.
             reset_cui_count (boolean):
                 Used for training with weight_decay (annealing). Each concept has a count that is there
                 from the beginning of the CDB, that count is used for annealing. Resetting the count will
@@ -923,8 +1029,7 @@ def train_supervised(self,
         local_filters = self.config.linking.filters.copy_of()
 
         fp = fn = tp = p = r = f1 = examples = {}
-        with open(data_path) as f:
-            data = json.load(f)
+
         cui_counts = {}
 
         if retain_filters:
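
With the file reading moved out of the training routine, the input is a plain dictionary in the format given in the `train_supervised_raw` docstring. The sketch below builds a minimal instance of that format and walks it. The names `make_example_data`, `load_raw_data` and `iter_annotations` are hypothetical, the CUI `C0000000` is a placeholder, and treating `start`/`end` as character offsets into `text` is an assumption of this example.

```python
import json
import os
import tempfile
from typing import Dict, Iterator, List, Tuple

RawData = Dict[str, List[Dict]]


def make_example_data() -> RawData:
    """One project with one document and one annotation (all values made up)."""
    return {
        "projects": [
            {
                "name": "example project",
                "documents": [
                    {
                        "name": "doc 1",
                        "text": "Patient has diabetes.",
                        "annotations": [
                            # Assumed: start/end are character offsets into 'text'.
                            {"start": 12, "end": 20,
                             "cui": "C0000000", "value": "diabetes"},
                        ],
                    },
                ],
            },
        ],
    }


def load_raw_data(data_path: str) -> RawData:
    """Read the JSON file, as the from-JSON wrapper does before delegating."""
    with open(data_path) as f:
        return json.load(f)


def iter_annotations(data: RawData) -> Iterator[Tuple[str, str, dict]]:
    """Yield (document name, document text, annotation) for every annotation."""
    for project in data["projects"]:
        for doc in project["documents"]:
            for ann in doc["annotations"]:
                yield doc["name"], doc["text"], ann
```

Data shaped like this can come from a JSON export or be assembled in memory, which is what the split into a from-JSON method and a raw-data method allows.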
@@ -1489,6 +1594,43 @@ def _mp_cons(self, in_q: Queue, out_list: List, min_free_memory: int, lock: Lock
                         logger.warning(str(e))
         sleep(2)
 
+    def _add_nested_ent(self, doc: Doc, _ents: List[Span], _ent: Union[Dict, Span]) -> None:
+        # if the entities are serialised (PipeRunner.serialize_entities)
+        # then the entities are dicts
+        # otherwise they're Span objects
+        meta_anns = None
+        if isinstance(_ent, dict):
+            start = _ent['start']
+            end = _ent['end']
+            label = _ent['label']
+            cui = _ent['cui']
+            detected_name = _ent['detected_name']
+            context_similarity = _ent['context_similarity']
+            id = _ent['id']
+            if 'meta_anns' in _ent:
+                meta_anns = _ent['meta_anns']
+        else:
+            start = _ent.start
+            end = _ent.end
+            label = _ent.label
+            cui = _ent._.cui
+            detected_name = _ent._.detected_name
+            context_similarity = _ent._.context_similarity
+            if _ent._.has('meta_anns'):
+                meta_anns = _ent._.meta_anns
+            if HAS_NEW_SPACY:
+                id = _ent.id
+            else:
+                id = _ent.ent_id
+        entity = Span(doc, start, end, label=label)
+        entity._.cui = cui
+        entity._.detected_name = detected_name
+        entity._.context_similarity = context_similarity
+        entity._.id = id
+        if meta_anns is not None:
+            entity._.meta_anns = meta_anns
+        _ents.append(entity)
+
     def _doc_to_out(self,
                     doc: Doc,
                     only_cui: bool,
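
`_add_nested_ent` accepts an entity either as a serialised dict or as a Span-like object and reads the same fields from both. That branching can be illustrated without spaCy. In the sketch below, `EntLike` is a hypothetical stand-in for a Span that carries only a few fields, and `ent_fields` is an invented helper; the real method additionally builds a new `Span` and distinguishes `id` from `ent_id` depending on the spaCy version.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Union


@dataclass
class EntLike:
    """Hypothetical stand-in for a spaCy Span with only the fields used here."""
    start: int
    end: int
    cui: str
    meta_anns: Optional[dict] = None


def ent_fields(ent: Union[Dict[str, Any], EntLike]) -> Dict[str, Any]:
    """Read the same fields from a serialised dict or from an object."""
    if isinstance(ent, dict):
        fields = {"start": ent["start"], "end": ent["end"], "cui": ent["cui"]}
        meta_anns = ent.get("meta_anns")
    else:
        fields = {"start": ent.start, "end": ent.end, "cui": ent.cui}
        meta_anns = ent.meta_anns
    # Optional field: only carried over when present.
    if meta_anns is not None:
        fields["meta_anns"] = meta_anns
    return fields
```

Both input shapes produce the same output, so the code that follows does not need to know which one it received.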
@@ -1499,16 +1641,9 @@ def _doc_to_out(self,
         if doc is not None:
             out_ent: Dict = {}
             if self.config.general.show_nested_entities:
-                _ents = []
+                _ents: List[Span] = []
                 for _ent in doc._.ents:
-                    entity = Span(doc, _ent['start'], _ent['end'], label=_ent['label'])
-                    entity._.cui = _ent['cui']
-                    entity._.detected_name = _ent['detected_name']
-                    entity._.context_similarity = _ent['context_similarity']
-                    entity._.id = _ent['id']
-                    if 'meta_anns' in _ent:
-                        entity._.meta_anns = _ent['meta_anns']
-                    _ents.append(entity)
+                    self._add_nested_ent(doc, _ents, _ent)
             else:
                 _ents = doc.ents  # type: ignore