Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vendored scripts #1

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion ingest/ingest.smk
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestions:

  1. Add a section in the pathogen repo README on how to update these scripts with git subtree pull
  2. Add that command in commit messages such as 7d507f7 where those changes are the result of pulling new contents from the subtree remote.

Original file line number Diff line number Diff line change
Expand Up @@ -91,7 +91,7 @@ rule transform_metadata:
"""
ingest/scripts/tsv-to-ndjson.py < {input.metadata} |
ingest/scripts/fix_country_field.py |
ingest/scripts/apply-geolocation-rules.py --geolocation-rules ingest/config/geoLocationRules.tsv |
ingest/vendored/apply-geolocation-rules --geolocation-rules ingest/config/geoLocationRules.tsv |
ingest/scripts/add-year.py |
ingest/scripts/ndjson-to-tsv.py --metadata-columns {params.metadata_columns} --metadata {output.metadata}
"""
Expand Down
37 changes: 37 additions & 0 deletions ingest/vendored/README.md
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish the auto-generated messages (f9d6407, ee78e39) had more information on the remote. If I'm reading the git history alone, it's not clear where these come from.

Suggestion: Modify one of the auto-generated messages to include the remote URL, https://github.com/nextstrain/ingest. I'm not sure which is best since the remote commit hash is in f9d6407, but only the merge commit ee78e39 will show up as the one making changes.

Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
# ingest

Shared internal tooling for pathogen data ingest. Used by our individual
pathogen repos which produce Nextstrain builds. Expected to be vendored by
each pathogen repo using `git subtree` (or `git subrepo`).

Some tools may only live here temporarily before finding a permanent home in
`augur curate` or Nextstrain CLI. Others may happily live out their days here.

## History

Much of this tooling originated in
[ncov-ingest](https://github.com/nextstrain/ncov-ingest) and was passaged thru
[monkeypox's ingest/](https://github.com/nextstrain/monkeypox/tree/@/ingest/).
It subsequently proliferated from [monkeypox][] to other pathogen repos
([rsv][], [zika][], [dengue][], [hepatitisB][], [forecasts-ncov][]) primarily
thru copying. To [counter that
proliferation](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079),
this repo was made.

[monkeypox]: https://github.com/nextstrain/monkeypox
[rsv]: https://github.com/nextstrain/rsv
[zika]: https://github.com/nextstrain/zika/pull/24
[dengue]: https://github.com/nextstrain/dengue/pull/10
[hepatitisB]: https://github.com/nextstrain/hepatitisB
[forecasts-ncov]: https://github.com/nextstrain/forecasts-ncov

## Elsewhere

The creation of this repo, in both the abstract and concrete, and the general
approach to "ingest" has been discussed in various internal places, including:

- https://github.com/nextstrain/private/issues/59
- @joverlee521's [workflows document](https://docs.google.com/document/d/1rLWPvEuj0Ayc8MR0O1lfRJZfj9av53xU38f20g8nU_E/edit#heading=h.4g0d3mjvb89i)
- [5 July 2023 Slack thread](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079)
- [6 July 2023 team meeting](https://docs.google.com/document/d/1FPfx-ON5RdqL2wyvODhkrCcjgOVX3nlXgBwCPhIEsco/edit)
- _…many others_
Original file line number Diff line number Diff line change
Expand Up @@ -5,18 +5,6 @@
any additional transformations on top of the user curations.
"""

"""
Copied from https://github.com/nextstrain/monkeypox/blob/62fca491c6775573ad036eedf34b271b4952f2c2/ingest/bin/apply-geolocation-rules
with two changes:
Comment on lines -9 to -10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script vendored here contains two code changes

A comment is used to describe the pathogen-specific changes, when those might be better described by git history.

Suggestion: Put these changes in a separate commit. I'd expect something like one commit to add the original contents from git subtree pull, then another commit with the pathogen-specific modifications.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I just realized that 7d507f7 is a direct copy of nextstrain/ingest@5afbcc7 and these aren't pathogen-specific changes. You can disregard this suggestion then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that 7d507f7 is a direct copy of nextstrain/ingest@5afbcc7

Yeah, or perhaps the other way around! The ingest commit was created by pushing the subtree from this (hbv) repo. In hindsight, probably not an approach I'd reach for again, but it was a good learning exercise.


First change allows missing fields in the input ndjson
- annotated_values = transform_geolocations(geolocation_rules, [record.[field] for field in location_fields])
+ annotated_values = transform_geolocations(geolocation_rules, [record.get(field, '') for field in location_fields])

Second change allows blank lines in the location-rules TSV
- if line.lstrip()[0] == '#':
+ if line.strip()=="" or line.lstrip()[0] == '#':
"""

import argparse
import json
Expand Down