Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CU-8693n892x environment/dependency snapshots #438

Merged
merged 15 commits into from
Jun 19, 2024

Conversation

mart-r
Copy link
Collaborator

@mart-r mart-r commented May 16, 2024

Saving environment/dependency snapshot alongside the model pack when one is created.

This includes the following information:

  • Installed dependencies along with their versions
  • OS being used
  • CPU architecture being used
  • Python version being used

The idea is that if/when we (in the future) see incompatibilities (i.e an old(er) model doesn't load) this information can be used to better understand why this error exists and/or replicate it in order to fix it.

EDIT:
Just to clarify current state (17.06.2024).

  • This PR moves the installation requirements outside setup.py into install_requires.txt (in project root).
  • During installation install_requires.txt is copied into the package root (the medcat folder)
    • This is so that the file gets distributed alongside the rest of the package (e.g in a wheel)
    • Otherwise it wouldn't be available during run time (e.g pip install)
    • Though this requires a slightly different location to be used

PS: We could move the requirements into medcat altogether. It would simplify the environmental snapshot somewhat. But I feel like having the dependencies in the project root makes more sense.

@tomolopolis
Copy link
Member

@mart-r
Copy link
Collaborator Author

mart-r commented May 23, 2024

Just as a note:
I was also looking at stuff other people may have used for the same purpose.
I more or less found the following options:

  • Using pip freeze and saving its output instead of doing it manually
  • Using pipreqs to generate a requirements file
  • Using conda or poetry or pipenv and capture the environment using them
  • Using environment-kernels kernel manager in a Jupyter notebook

The first 2 don't actually consider / capture information other than the installed dependencies. And they bring some additional steps (pip freeze generally requires spawning and running a sub process and parsing its stdout, and pipreqs requires the reading and (probably) removal of a text file).
The next ones are specific to specific users. While many may be using the model in a Jupyter notebook, we cannot really assume that.

Using pip freeze would probably be slightly fewer lines of code. So I'd be open to using that instead. But I don't think it adds a lot.

@mart-r
Copy link
Collaborator Author

mart-r commented Jun 10, 2024

Just to clarify here:
We wanted a better version that just a regex on setup.py. We found to approaches:

  1. Use importlib_metadata
  1. Separating dependencies to another file (install_requires.txt)
  • Trying to stick to this then
  • Seems to wokr



ENV_SNAPSHOT_FILE_NAME = "environment_snapshot.json"
SETUP_PY_PATH = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "..", "..", "setup.py"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of these module level vars are not needed anymore

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

dep_lines = [line.split("#")[0].replace("'", "").replace('"', "").strip() for line in f.readlines()]
# remove comment-only (or empty) lines
deps = [dep for dep in dep_lines if dep]
return set(re.split("[<=>~]", dep)[0] for dep in deps)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

install directly from a git commit does this work?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added @ separator.

Copy link
Member

@tomolopolis tomolopolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm - in another PR, in load_model_pack - can this env_snapshot be used to asses the env differences between running env and model pack snapshot env?

@mart-r
Copy link
Collaborator Author

mart-r commented Jun 18, 2024

in another PR, in load_model_pack - can this env_snapshot be used to asses the env differences between running env and model pack snapshot env?

Yes, that should be possible and shouldn't be too difficult to do.
But I've not added the functionality anywhere as of yet. Though I agree, adding this in another PR / task would make sense.

Comment on lines +54 to +55
if package.project_name not in direct_deps:
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO why not save transitive dependencies in the snapshot as well, together with direct dependencies? It would be easier for users like me to create a virtual environment with high fidelity to the original training environment before starting to use the model.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a few issues with listing all installed packages:

  • The list of dependencies may change depending on hardware (i.e nvidia or cuda related packages)
  • The version of dependencies may change depending for different python versions
  • Some environments may have unrelated packages

With that said, that was my original idea as well. If we're able to identify the exact environment (down to each individual package) then we should be able to exactly replicate behaviour across environments.

But the main reason to go with the current approach is the fact that downstream / transitive dependencies aren't (or at least shouldn't be) our responsibility. Our direct dependencies should know what they can and cannot work with.
There are, of course, situations where that expectation fails. Especially since we depend on quite a few older versions of other projects.

So all in all, I think it would be worth it to add something to this in the future (as part of another PR). Whether it's all installed packages or some subset of it would need to be figured out as part of that.
But I think this PR will still allow as to get at least 90% of the way there in terms of being able to validate our environment.

@mart-r mart-r merged commit e4715ae into master Jun 19, 2024
8 checks passed
mart-r added a commit that referenced this pull request Jun 19, 2024
* Pushing changes for bert-style models for MetaCAT

* Pushing fix for LSTM

* Pushing changes for flake8 and type fixes

* Pushing type fixes

* Fixing type issue

* Pushing changes

1) Added model.zero_grad to clear accumulated gradients
2) Fixed config save issue
3) Re-structured data preparation for oversampled data

* Pushing change and type fixes

Pushing ml_utils file which was missed in the last commit

* Fixing flake8 issues

* Pushing flake8 fixes

* Pushing fixes for flake8

* Pushing flake8 fix

* Adding peft to list of libraries

* Pushing changes with load and train workflow and type fixes

The workflow for inference is: load() and inference
For training: init() and train()
Train will always not load the model dict, except when the phase_number is set to 2 for 2 phase learning's second phase

* Pushing changes with type hints and new documentation

* Pushing type fix

* Fixing type issue

* Adding test case for BERT and reverting config changes

BERT test cases: Testing for BERT model along with 2 phase learning

* Merging changes from master to metacat_bert branch (#431)

* Small addition to contribution guidelines (#420)

* CU-8694cbcpu: Allow specifying an AU Snomed when preprocessing (#421)

* CU-8694dpy1c: Return empty generator upon empty stream (#423)

* CU-8694dpy1c: Return empty generator upon empty stream

* CU-8694dpy1c: Fix empty generator returns

* CU-8694dpy1c: Simplify empty generator returns

* Relation extraction (#173)

* Added files.

* More additions to rel extraction.

* Rel base.

* Update.

* Updates.

* Dependency parsing.

* Updates.

* Added pre-training steps.

* Added training & model utils.

* Cleanup & fixes.

* Update.

* Evaluation updates for pretraining.

* Removed duplicate relation storage.

* Moved RE model file location.

* Structure revisions.

* Added custom config for RE.

* Implemented custom dataset loader for RE.

* More changes.

* Small fix.

* Latest additions to RelCAT (pipe + predictions)

* Setup.py fix.

* RE utils update.

* rel model update.

* rel dataset + tokenizer improvements.

* RelCAT updates.

* RelCAT saving/loading improvements.

* RelCAT saving/loading improvements.

* RelCAT model fixes.

* Attempted gpu learning fix. Dataset label generation fixes.

* Minor train dataset gen fix.

* Minor train dataset gen fix No.2.

* Config updates.

* Gpu support fixes. Added label stats.

* Evaluation stat fixes.

* Cleaned stat output mode during training.

* Build fix.

* removed unused dependencies and fixed code formatting

* Mypy compliance.

* Fixed linting.

* More Gpu mode train fixes.

* Fixed model saving/loading issues when using other baes models.

* More fixes to stat evaluation. Added proper CAT integration of RelCAT.

* Setup.py typo fix.

* RelCAT loading fix.

* RelCAT Config changes.

* Type fix. Minor additions to RelCAT model.

* Type fixes.

* Type corrections.

* RelCAT update.

* Type fixes.

* Fixed type issue.

* RelCATConfig: added seed param.

* Adaptations to the new codebase + type fixes..

* Doc/type fixes.

* Fixed input size issue for model.

* Fixed issue(s) with model size and config.

* RelCAT: updated configs to new style.

* RelCAT: removed old refs to logging.

* Fixed GPU training + added extra stat print for train set.

* Type fixes.

* Updated dev requirements.

* Linting.

* Fixed pin_memory issue when training on CPU.

* Updated RelCAT dataset get + default config.

* Updated RelDS generator + default config

* Linting.

* Updated RelDatset + config.

* Pushing updates to model

Made changes to:
1) Extracting given number of context tokens left and right of the entities
2) Extracting hidden state from bert for all the tokens of the entities and performing max pooling on them

* Fixing formatting

* Update rel_dataset.py

* Update rel_dataset.py

* Update rel_dataset.py

* RelCAT: added test resource files.

* RelCAT: Fixed model load/checkpointing.

* RelCAT: updated to pipe spacy doc call.

* RelCAT: added tests.

* Fixed lint/type issues & added rel tag to test DS.

* Fixed ann id to token issue.

* RelCAT: updated test dataset + tests.

* RelCAT: updates to requested changes + dataset improvements.

* RelCAT: updated docs/logs according to commends.

* RelCAT: type fix.

* RelCAT: mct export dataset updates.

* RelCAT: test updates + requested changes p2.

* RelCAT: log for MCT export train.

* Updated docs + split train_test & dataset for benchmarks.

* type fixes.

---------

Co-authored-by: Shubham Agarwal <66172189+shubham-s-agarwal@users.noreply.github.com>
Co-authored-by: mart-r <mart.ratas@gmail.com>

* CU-8694fae3r: Avoid publishing PyPI release when doing GH pre-releases (#424)

* CU-8694fae3r: Avoid publishing PyPI release when doing GH pre-releases

* CU-8694fae3r: Fix pre-releases tagging

* CU-8694fae3r: Allow actions to run on release edit

---------

Co-authored-by: Mart Ratas <mart.ratas@gmail.com>
Co-authored-by: Vlad Dinu <62345326+vladd-bit@users.noreply.github.com>

* Pushing changed tests and removing empty change

* Pushing change for logging

* Revert "Pushing change for logging"

This reverts commit fbcdb70.

* CU-8694hukwm: Document the materialising of generator when multiproce… (#433)

* CU-8694hukwm: Document the materialising of generator when multiprocessing and batching for docs

* CU-8694hukwm: Add TODO note for where the generator is materialised

* CU-8694hukwm: Add warning from large amounts of generator data (10k items) is materialised by the docs size mp method

* CU-8694fk90t (almost) only primitive config (#425)

* CU-8694fk90r: Move backwards compatibility method from CDB to config utils

* CU-8694fk90r: Move weighted_average_function from config to CDB; create necessary backwards compatibility workarounds

* CU-8694fk90r: Move usage of weighted_average_function in tests

* CU-8694fk90r: Add JSON encode and decoder for re.Pattern

* CU-8694fk90r: Rebuild custom decoder if needed

* CU-8694fk90r: Add method to detect old style config

* CU-8694fk90r: Use regular json serialisation for config; Retain option to read old jsonpickled config

* CU-8694fk90r: Add test for config serialisation

* CU-8694fk90r: Make sure to fix weighted_average_function upon setting it

* CU-8694fk90t: Add missing tests for config utils

* CU-8694fk90t: Add tests for better raised exception upon old way of using weighted_average_function

* CU-8694fk90t: Fix exception type in an added test

* CU-8694fk90t: Add further tests for exception payload

* CU-8694fk90t: Add improved exceptions when using old/unsupported value of weighted_average_function in config

* CU-8694fk90t: Add typing fix exceptions

* CU-8694fk90t: Make custom exception derive from AttributeError to correctly handle hasattr calls

* CU-8694gza88: Create codeql.yml (#434)

Run CodeQL to identify vulnerabilities.
This will run on any push or pull request to `master`, but also runs once every day in case some new vulnerabilities are discovered (or something else changes).

* CU-8694mbn03: Remove the web app (#441)

* CU-8694n48uw better deprecation (#443)

* CU-8694n493m: Add deprecation and removal versions to deprecation decorator

* CU-8694n493m: Deprecation version to existing deprecated methods.

Made the removal version 2 minor versions from the minor version
in which the method was deprecated, or the next minor version if
the method had been deprecated for longer.

* CU-8694n4ff0: Raise exception upon deprecated method call at test time

* CU-8694n4ff0: Fix usage of deprecated methods call during test time

* CU-8694pey4u: extract cdb load to cls method, to be used in trainer for model pack loading

* CU-8694pey4u: extract meta cat loading also to a cls method

* CU-8694pey4u: docstrings

* CU-8694pey4u: typehints and mypy issues

* CU-8694pey4u: fix flake8

* CU-8694pey4u: fix flake8

* CU-8694pey4u: missing extra config if passed in

* CU-8694py1jr: Fix issue with reuse of opened file when loading old configs

* CU-8694py1jr: Make old config identifier more robust

* CU-8694py1jr: Add doc string to old config identifier

* CU-8694py1jr: Add test for old style MetaCAT config load

* CU-8694py1jr: Add test for old style main config load (functional)

* CU-8694py1jr: Refactor config utils load tests for more flexibility

* CU-8694py1jr: Add config utils load tests for NER and Rel CAT configs

* CU-8694vcvz7: Trust remote code when loading transfomers NER dataset (#453)

* CU-8694vcvz7: Trust remote code when loading transfomers NER dataset

* CU-8694vcvz7: Add support for older datasets without the remote code trusing kwarg

* CU-8694gzbn3 k fold metrics (#432)

* CU-8694gzbud: Add context manager that is able to snapshot CDB state

* CU-8694gzbud: Add tests to snapshotting CDB state

* CU-8694gzbud: Refactor tests for CDB state snapshotting

* CU-8694gzbud: Remove use of deprecated method in CDB utils and use non-deprecated one instead

* CU-8694gzbud: Add tests for training and CDB state capturing

* CU-8694gzbud: Small refactor in tests

* CU-8694gzbud: Add option to save state on disk

* CU-8694gzbud: Add debug logging output when saving state on disk

* CU-8694gzbud: Remove unused import

* CU-8694gzbud: Add tests for disk-based state save

* CU-8694gzbud: Move CDB state code to its own module

* CU-8694gzbud: Remove unused import

* CU-8694gzbud: Add doc strings to methods

* CU-8694gzbx4: Small optimisation for stats

* CU-8694gzbx4: Add MCTExport related module

* CU-8694gzbx4: Add MCTExport related tests

* CU-8694gzbx4: Add code for k-fold statistics

* CU-8694gzbx4: Add tests for k-fold statistics

* CU-8694gzbx4: Add test-MCT export with fake concepts

* CU-8694gzbx4: Fix a doc string

* CU-8694gzbx4: Fix types in MCT export module

* CU-8694gzbx4: Fix types in k-fold module

* CU-8694gzbx4: Remove accidentally committed test class

* CU-8694gzbn3: Add missing test helper file

* CU-8694gzbn3: Remove whitespace change from otherwise uncahnged file

* CU-8694gzbn3: Allow 5 minutes longer for tests

* CU-8694gzbn3: Move to python 3.8-compatible typed dict

* CU-8694gzbn3: Add more time for tests in worklow (now 30 minutes)

* CU-8694gzbn3: Add more time for tests in worklow (now 45 minutes)

* CU-8694gzbn3: Update test-pypi timeout to 45 minutes

* CU-8694gzbn3: Remove timeout from unit tests in main workflow

* CU-8694gzbn3: Make tests stop upon first failure

* CU-8694gzbn3: Fix test stop upon first failure (arg/option order)

* CU-8694gzbn3: Remove debug code and old comments

* CU-8694gzbn3: Remove all timeouts from main workflow

* CU-8694gzbn3: Remove more old / useless comments in tests

* CU-8694gzbn3: Add debug output when running k-fold tests to see where it may be stalling

* CU-8694gzbn3: Add debug output when ANY tests to see where it may be stalling

* CU-8694gzbn3: Remove explicit debug output from k-fold test cases

* CU-8694gzbn3: Remove timeouts from DEID tests in case they're the ones creating issues

* GHA/test fixes (#437)

* Revert "CU-8694gzbn3: Remove timeouts from DEID tests in case they're the ones creating issues"

This reverts commit faaf7fb.

* Revert "CU-8694gzbn3: Remove explicit debug output from k-fold test cases"

This reverts commit 9b02925.

* Revert "CU-8694gzbn3: Add debug output when ANY tests to see where it may be stalling"

This reverts commit 12c519a.

* Revert "CU-8694gzbn3: Add debug output when running k-fold tests to see where it may be stalling"

This reverts commit 03531da.

* Revert "CU-8694gzbn3: Remove all timeouts from main workflow"

This reverts commit e6debce.

* Revert "CU-8694gzbn3: Fix test stop upon first failure (arg/option order)"

This reverts commit 666c013.

* Revert "CU-8694gzbn3: Make tests stop upon first failure"

This reverts commit 94bce56.

* Revert "CU-8694gzbn3: Remove timeout from unit tests in main workflow"

This reverts commit 3618b9c.

* CU-8694gzbn3: Improve state copy code in CDB state tests

* CU-8694gzbn3: Fix a CDB state test issue

* CU-8694gzbn3: Split all tests into 2 halves

* CU-8694gzbn3: Remove legacy / archived / unused tests

* CU-8694gzbn3: Add doc strings for FoldCreator init

* CU-8694gzbn3: Move to a split-type enum

* CU-8694gzbn3: Add documentation to split-type enum

* CU-8694gzbn3: Create separate fold creators for different types of splitting strategies

* CU-8694gzbn3: Resort document order in test time nullification process

* CU-8694gzbn3: Add option to count number of annotations in doc for MCT export

* CU-8694gzbn3: Add weighted documents based split option along with relevant tests

* CU-8694gzbn3: Update default fold creation split type to weighted documents

* CU-8694gzbn3: Add test to ensure weighted documents split creates a reasonable number of annotations per split

* CU-8693n892x environment/dependency snapshots (#438)

* CU-8693n892x: Save environment/dependency snapshot upon model pack creation

* CU-8693n892x: Fix typing for env snapshot module

* CU-8693n892x: Add test for env file existance in .zip

* CU-8693n892x: Add doc strings

* CU-8693n892x: Centralise env snapshot file name

* CU-8693n892x: Add env snapshot file to exceptions in serialisation tests

* CU-8693n892x: Only list direct dependencies

* CU-8693n892x: Add test that verifies all direct dependencies are listed in environment

* CU-8693n892x: Move requirements to separate file and use that for environment snapshot

* CU-8693n892x: Remove unused constants

* CU-8693n892x: Allow URL based dependencies when using direct dependencies

* CU-8693n892x: Distribute install_requires.txt alongside the package; use correct path in distributed version

* CU-8694p8y0k deprecation GHA check (#445)

* CU-8694p8y0k: Add check for deprecations (code)

* CU-8694p8y0k: Add workflow check for deprecations

* CU-8694p8y0k: Fix (hopefully) workflow check for deprecations

* CU-8694p8y0k: Add option to remove version prefix when checking deprecation

* CU-8694p8y0k: Update deprecation checks with more detail (i.e current/next version).

* CU-8694p8y0k: Only run deprecation checking step when merging master into production

* CU-8694u3yd2 cleanup name removal (#450)

* CU-8694u3yd2: Add logged warning for when using full-unlink

* CU-8694u3yd2: Make CDB.remove_names simply expect an iterable of names

* CU-8694u3yd2: Improve CDB.remove_names doc string

* CU-8694u3yd2: Explicitly pass the keys to CDB.remove_names in CAT.unlink_concept_name

* CU-8694u3yd2: Add note regarding state (and order) dependent tests to some CDB maker tests

* CU-8694u3yd2: Rename/make protected CDB.remove_names method

* CU-8694u3yd2: Create deprecated CDB.remove_names method

* CU-8694vte2g 1.12 depr removal (#454)

* CU-8694vte2g: Remove CDB.add_concept method

* CU-8694vte2g: Remove unused import (deprecated decorator)

* CU-8694vte2g: Remove CAT.get_spacy_nlp method

* CU-8694vte2g: Remove CAT.train_supervised method

* CU-8694vte2g: Remove CAT multiprocessing methods

* CU-8694vte2g: Remove MetaCAT.train method

* CU-8694vte2g: Remove medcat.utils.ner.helper.deid_text method

* CU-8694vte2g: Remove use of deprecated method

* CU-8694vte2g: Add back removed deprecation import

---------

Co-authored-by: Shubham Agarwal <66172189+shubham-s-agarwal@users.noreply.github.com>
Co-authored-by: Vlad Dinu <62345326+vladd-bit@users.noreply.github.com>
Co-authored-by: Tom Searle <tsearle88@gmail.com>
@mart-r mart-r deleted the CU-8693n892x-dependency-snapshots branch August 12, 2024 12:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants