Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CU-8693b0a61 Add method to get spacy model version #381

Merged
merged 27 commits into from
Jan 8, 2024

Conversation

mart-r
Copy link
Collaborator

@mart-r mart-r commented Dec 13, 2023

Most compatibility issues seem to boil down to the spacy version used by medcat being incompatible with the spacy model bundled with a model pack.
We should be able to move forward by knowing which spacy model we're trying to load.

The spacy model is generally saved at <model_pack>/<spacy_model_name> (though at <model_pack>/spacy_model for pre 1.2.4 built models).

The meta.json file is what spacy needs to read the metadata (call spacy.info on the folder).

This PR adds a bunch of methods and their corresponding tests. The most notable of these are the following:

  • medcat.utils.spacy_compatibility.get_name_and_version_of_spacy_model_in_medcat_modelpack
    It allows us to get the spacy model name and version that's saved within a modelpack.
  • medcat.utils.spacy_compatibility.medcat_model_pack_has_compatible_spacy_model
    It allows checking whether the spacy model shipped with a model pack is compatible with the install spacy version.
  • medcat.utils.spacy_compatibility.medcat_model_pack_has_semi_compatible_spacy_model
    It allows checking whether the spacy model shipped with a model pack is older than the currently used spacy model. In my limited testing, older models will work in newer spacy (but not the other way around). Though there are warnings on spacy's side on not guaranteeing 100% compatibility.

A note regarding spacy versions and spacy models:

  • Before v1.7.0
    • medcat requirement (in setup.py) spacy<3.1.4,>=3.1.0
      • Automatically installs 3.1.3
    • medcat requirement (in requirements.txt) en_core_web_md-3.1.0
      • Its constraints >=3.10,<3.2.0
  • From v1.7.0 up
    • medcat requirement (in setup.py) spacy>=3.1.0
      • Automatic installation installs 3.7.2 3.4.2 (at least on GHA) on all supported python versions
        • Apparently spacy is downloaded twice in there
        • At first 3.7.2 is downlaoded
        • Then 3.4.4 is downloaded and installed
    • medcat requirement (in requirements.txt) en_core_web_md-3.4.0
      • Its constraints >=3.4.0,<3.5.0
    • Though interestingly, the HPO models were built with en_core_web_md-3.6.0
      • With constraints >=3.6.0,<3.7.0
      • Which also seems to work fine on installed spacy
      • Not sure where the newer spacy model came from... I don't recall specifically installing / downloading it

@tomolopolis
Copy link
Member

Copy link
Member

@tomolopolis tomolopolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great stuff - small things:

  • is it worth noting some of these defs as package private, i.e. leading underscore
  • fix the doc typo

Can you add these calls to the working_with_cogstack scripts, and provide some instructions to folks on what to do if they encounter warnings.

medcat/utils/spacy_compatibility.py Outdated Show resolved Hide resolved
@mart-r mart-r merged commit d9a1fac into master Jan 8, 2024
5 checks passed
mart-r added a commit that referenced this pull request Jan 8, 2024
* Bump urllib3 from 1.26.5 to 1.26.17 in /webapp/webapp

Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.5 to 1.26.17.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](urllib3/urllib3@1.26.5...1.26.17)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* Cu 8692wbcq5 docs builds (#359)

* CU-8692wbcq5: Pin max version of numpy

* CU-8692wbcq5: Pin max version of numpy in setup.py

* CU-8692wbcq5: Bump python version for readthedocs workflow

* CU-8692wbcq5: Pin all requirement versions in docs requirements

* CU-8692wbcq5: Move docs requirements before setuptools

* CU-8692wbcq5: Fix typo in docs requirements

* CU-8692wbcq5: Remove some less relevant stuff from docs requirements

* CU-8692wbcq5: Add back sphinx-based requirements to docs requirements

* CU-8692wbcq5: Move back to python 3.9 on docs build workflow

* CU-8692wbcq5: Bump sphinx-autoapi version

* CU-8692wbcq5: Bump sphinx version

* CU-8692wbcq5: Bump python version back to 3.10 for future-proofing

* CU-8692wbcq5: Undo pinning numpy to max version in setup.py

* CU-8692wbcq5: Remove docs-build specific dependencies in setup.py

* Bump urllib3 from 1.26.17 to 1.26.18 in /webapp/webapp

Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.17 to 1.26.18.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](urllib3/urllib3@1.26.17...1.26.18)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* CU-8692uznvd: Allow empty-dict config.linking.filters.cuis and convert to set in memory (#352)

* CU-8692uznvd: Allow empty-dict config.linking.filters.cuis and convert to set in memory

* CU-8692uznvd: Move the empty-set detection and conversion from validator to init

* CU-8692uznvd: Remove unused import

* CU-8692t3fdf separate config on save (#350)

* CU-8692t3fdf Move saving config outside of the cdb.dat; Add test to make sure the config does not get saved with the CDB; patch a few existing tests

* CU-8692t3fdf Use class methods on class instead of instance in a few tests

* CU-8692t3fdf Fix typing issue

* CU-8692t3fdf Add additional tests for 2 configs and zero configs when loading model pack

* CU-8692t3fdf: Make sure CDB is linked to the correct config; Treat incorrect configs as dirty CDBs and force a recalc of the hash

* CU-2cdpd4t: Unify default addl_info in different methdos. (#363)

* Bump django from 3.2.20 to 3.2.23 in /webapp/webapp

Bumps [django](https://github.com/django/django) from 3.2.20 to 3.2.23.
- [Commits](django/django@3.2.20...3.2.23)

---
updated-dependencies:
- dependency-name: django
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* Changing cdb.add_concept to a protected method

* Re-added deprecated method  with deprecated flag and addtional comments

* Initial commit for merge_cdb method

* Added indentation to make merge_cdb a class method

* fixed syntax issues

* more lint fixes

* more lint fixes

* bug fixes of merge_cdb

* removed print statements

* CU-86931prq4: Update GHA versions (checkout and setup-python) to v4 (#368)

* Cu 1yn0v9e duplicate multiprocessing methods (#364)

* CU-1yn0v9e: Rename and deprecate one of the multiprocessing methods;

Add docstring. Trying to be more explicit regarding usage and differences between different methods

* CU-1yn0v9e: Rename and deprecate the multiprocessing_pipe method;

Add docstring. Trying to be more explicit regarding usage and differences between different methods

* CU-1yn0v9e: Fix typo in docstring; more consistent naming

* 869377m3u: Add comment regarding demo link load times to README (#376)

* intermediate changes of merge_cdb and testing function

* Added README.md documentation for CPU only installations (#365)

* changed README.md to reflect installation options.

* added setup script to demonstrate how wrapper could look for CPU installations

* removed setup.sh as unnessescary for cpu only builds

* Initial commit for merge_cdb method

* Added indentation to make merge_cdb a class method

* fixed syntax issues

* more lint fixes

* more lint fixes

* bug fixes of merge_cdb

* removed print statements

* Added commentary on disk space usage of pytorch-gpu

* removed merge_cdb from branch

---------

Co-authored-by: adam-sutton-1992 <adam.sutton@kcl.ac.uk>

* Cu 8692zguyq no preferred name (#367)

* CU-8692zguyq: Slight simplification of minimum-name-length logic

* CU-8692zguyq: Add some tests for prepare_name preprocessor

* CU-8692zguyq: Add warning if no preferred name was added along a new CUI

* CU-8692zguyq: Add additional warning messages when adding/training a new CUI with no preferred name

* CU-8692zguyq: Make no preferred name warnings only run if name status is preferred

* CU-8692zguyq: Add tests for no-preferred name warnings

* CU-8692zguyq: Add Vocab.make_unigram_table to CAT tests

* CU-8692zguyq: Move to built in asserting for logging instead of patching the method

* CU-8692zguyq: Add workaround for assertNoLogs on python 3.8 and 3.9

* Add trainer callbacks for Transformer NER (#377)

CU-86938vf30 add trainer callbacks for Transformer NER

* changes to merge_cdb and adding unit tests for method

* fixing lint issues

* fixing flake8 linting

* bug fixes, additional tests, and more documentation

* moved set up of cdbs to be merged to tests.helper

* moved merge_cdb to utils and created test_cdb_utils

* removed class wrapper in cdb utils and fixed class set up in tests

* changed test object setup to class setup

* removed erroneous new line

* CU-2e77a31 improve print stats (#366)

* Add base class for CAT

* Add CDB base class

* Some whitespace fixes for base modules

* CU-2e77a31: Move print stats to their own module and class

* CU-2e77a31: Fix issues introduced by moving print stats

* CU-2e77a31: Rename print_stats to get_stats and add option to avoid printed output

* CU-2e77a31: Add test for print_stats

* CU-2e77a31: Remove unused import

* CU-2e77a31: Add new package to setup.py

* CU-2e77a31: Fix a bunch of typing issues

* CU-2e77a31: Revert CAT and CDB abstraction

* Load stopwords in Defaults before spacy model

* CU-8693az82g Remove cdb tests side effects (#380)

* 8693az82g: Add method to CDBMaker to reset the CDB

* 8693az82g: Add test in CDB tests to ensure a new CDB is used for each test

* 8693az82g: Reset CDB in CDB tests before each test to avoid side effects

* Added tests

* CU-8693bpq82 fallback spacy model (#384)

* CU-8693bpq82: Add fallback spacy model along with test

* CU-8693bpq82: Remove debug output

* CU-8693bpq82: Add exception info to warning upon spacy model load failure and fallback

* Remove tests of internals where possible

* Add test for skipping of stopwords

* Avoid supporting only English for stopwords

* Remove debug output

* Make sure stopwords language getter works for file-path spacy models

* CU-8693cv3w0 Fix fallback spacy model existance on pip installs (#386)

* CU-8693cv3w0: Add method to ensure spacy model and use it when falling back to default model

* CU-8693cv3w0: Add logged output when installing/downloading spacy model

* CU-8693b0a61 Add method to get spacy model version (#381)

* CU-8693b0a61: Add method to find spacy folder in model pack along with some tests

* CU-8693b0a61: Add test for spacy folder finding (full path)

* CU-8693b0a61: Add method for finding spacy model in model pack along with tests

* CU-8693b0a61: Add method for finding current spacy version

* CU-8693b0a61: Add method for getting spacy model version installed

* CU-8693b0a61: Fix getting spacy model folder return path

* CU-8693b0a61: Add method to get name and meta of spacy model based on model pack

* CU-8693b0a61: Add missing fake spacy model meta

* CU-8693b0a61: Add missing docstrings

* CU-8693b0a61: Change name of method for clarity

* CU-8693b0a61: Add method to get spacy model name and version from model pack path

* CU-8693b0a61: Fix a few typing issues

* CU-8693b0a61: Add a missing docstring

* CU-8693b0a61: Match folder name of fake spacy model to its name

* CU-8693b0a61: Make the final method return true name of spacy model instead of folder name

* Add additional output to method for getting spacy model version - the compatible spacy versions

* CU-8693b0a61: Add method for querying whether the spacy version is compatible with a range

* CU-8693b0a61: Add better abstraction for spacy version mocking in tests

* CU-8693b0a61: Add some more abstraction for fake model pack in tests

* CU-8693b0a61: Add method for checking whethera model pack has a spacy model compatible with installed spacy version

* CU-8693b0a61: Improve abstraction within tests

* CU-8693b0a61: Add method to check which of two versions is older

* CU-8693b0a61: Fix fake spacy model versioning

* CU-8693b0a61: Add method for determining whether a model pack has semi-compatible spacy model

* CU-8693b0a61: Add missing word in docstring.

* CU-8693b0a61: Change some method to protected ones

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: tomolopolis <tsearle88@gmail.com>
Co-authored-by: adam-sutton-1992 <adam.sutton@kcl.ac.uk>
Co-authored-by: adam-sutton-1992 <60137864+adam-sutton-1992@users.noreply.github.com>
Co-authored-by: Xi Bai <82581439+baixiac@users.noreply.github.com>
Co-authored-by: jenniferajiang <jennigato95@gmail.com>
Co-authored-by: Jennifer Jiang <37081323+jenniferajiang@users.noreply.github.com>
@mart-r mart-r deleted the CU-8693b0a61-get-spacy-version branch March 1, 2024 15:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants