Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vendored scripts #1

Closed
wants to merge 3 commits into from
Closed

Vendored scripts #1

wants to merge 3 commits into from

Conversation

jameshadfield
Copy link
Member

A draft PR to show how vendored files could be consumed by pathogen repos

See nextstrain/ingest#2 for the PR which add the script to the ingest repo, vendored here as ingest/vendored

See nextstrain/ingest#3 for the git subtree commands I used

Note that I probably won't merge this PR, instead I'll recreate the subtree from its main branch once the above PR has been merged.

git-subtree-dir: ingest/vendored
git-subtree-split: 92a8868
This moves the script to the vendored directory, which is a git subtree
of https://github.com/nextstrain/ingest (currently at branch
apply-geolocation-rules). The file suffix is removed to match how it
appears in the other repos which use it.

As per [Overview of duplicated scripts](nextstrain/ingest#1)
this script also appears in:
* [monkeypox](https://github.com/nextstrain/monkeypox/blob/a1f0d7b757d323d87edcbe61c6c5ccfbdf47722c/ingest/bin/apply-geolocation-rules)
* [rsv](https://github.com/nextstrain/rsv/blob/ba171f4a43110382c38b6154be3febd50408d7bf/ingest/bin/apply-geolocation-rules)
* [dengue, branch new_ingest](https://github.com/nextstrain/dengue/blob/247b2fd897361f2548627de1d97d45fae4115c5c/ingest/bin/apply-geolocation-rules)

All three of those scripts are identical to each other. The script vendored
here contains two code changes (whitespace removed from diffs):

**Ignore comment lines in the location-rules TSV**
```diff
< if line.lstrip()[0] == '#':
---
> if line.strip()=="" or line.lstrip()[0] == '#':
```

**Allow fields to be missing from the input NDJSON**

The script previously mandated that the input NDJSON had all four
fields (region/country/division/location). This is relaxed here, with
an empty string used if the field is not present.

```diff
< annotated_values = transform_geolocations(geolocation_rules, [record[field] for field in location_fields])
---
> annotated_values = transform_geolocations(geolocation_rules, [record.get(field, '') for field in location_fields])
```
Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for showing a clear example!

You said you wouldn't merge this PR, but I assume you can force-push to the same PR branch with a new subtree, or take these suggestions for a similar future PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestions:

  1. Add a section in the pathogen repo README on how to update these scripts with git subtree pull
  2. Add that command in commit messages such as 7d507f7 where those changes are the result of pulling new contents from the subtree remote.

Comment on lines -9 to -10
Copied from https://github.com/nextstrain/monkeypox/blob/62fca491c6775573ad036eedf34b271b4952f2c2/ingest/bin/apply-geolocation-rules
with two changes:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script vendored here contains two code changes

A comment is used to describe the pathogen-specific changes, when those might be better described by git history.

Suggestion: Put these changes in a separate commit. I'd expect something like one commit to add the original contents from git subtree pull, then another commit with the pathogen-specific modifications.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I just realized that 7d507f7 is a direct copy of nextstrain/ingest@5afbcc7 and these aren't pathogen-specific changes. You can disregard this suggestion then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that 7d507f7 is a direct copy of nextstrain/ingest@5afbcc7

Yeah, or perhaps the other way around! The ingest commit was created by pushing the subtree from this (hbv) repo. In hindsight, probably not an approach I'd reach for again, but it was a good learning exercise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish the auto-generated messages (f9d6407, ee78e39) had more information on the remote. If I'm reading the git history alone, it's not clear where these come from.

Suggestion: Modify one of the auto-generated messages to include the remote URL, https://github.com/nextstrain/ingest. I'm not sure which is best since the remote commit hash is in f9d6407, but only the merge commit ee78e39 will show up as the one making changes.

@jameshadfield
Copy link
Member Author

Superseded by #5

@victorlin victorlin deleted the vendored-scripts branch August 8, 2023 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

2 participants