Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add duplicated scripts from pathogen repos #1

Closed
22 tasks done
jameshadfield opened this issue Jul 10, 2023 · 4 comments
Closed
22 tasks done

Add duplicated scripts from pathogen repos #1

jameshadfield opened this issue Jul 10, 2023 · 4 comments
Assignees

Comments

@jameshadfield
Copy link
Member

jameshadfield commented Jul 10, 2023

The first step in making this repository useful is to populate it with scripts that are currently manually copied around pathogen repos.

See shared GDoc for additional context and details on scripts.

Progress

This was originally created by @joverlee521 in #1 (comment).

Identical scripts (added in #6)

  • s3-object-exists
  • trigger
  • sha256sum
  • cloudfront-invalidate
  • merge-user-metadata
  • transform-authors
  • transform-field-names
  • tranform-genbank-location

Diverged scripts with various different versions used across workflows
(binned into related groups):

Simple notify scripts (added in #8)

  • notify-slack
  • notify-on-job-start
  • notify-on-job-fail

S3 interaction + notify scripts that depend on S3 files (added in #12)

  • upload-to-s3
  • download-from-s3
  • notify-on-record-change
  • notify-on-diff
  • trigger-on-new-data

Genbank interactions

Nextclade joining

Potential augur curate scripts

Summary of differences

This is the original issue text from @jameshadfield.

Here's a quick scan of duplicated ingest scripts, using monkeypox as the "base", against 4 other ingest script directories:

Directories of scripts considered:

mpx       # monkeypox/ingest/bin at a1f0d7b
hbv       # hepatitisB/ingest/scripts at 1cdd197
rsv       # rsv/ingest/bin at ba171f4
dengue    # dengue/ingest/bin branch: new_ingest @ 247b2fd 
ncov      # ncov-ingest/bin at 88fddbe

Note that when there's only 1-3 lines different that's often just an added comment to indicate where the script's been copied from

mpx/apply-geolocation-rules
        rsv/apply-geolocation-rules     IDENTICAL
        hbv/apply-geolocation-rules.py      17 lines different
        dengue/apply-geolocation-rules  IDENTICAL
mpx/cloudfront-invalidate
        rsv/cloudfront-invalidate       IDENTICAL
        dengue/cloudfront-invalidate    IDENTICAL
        ncov/cloudfront-invalidate      IDENTICAL
mpx/csv-to-ndjson
        rsv/csv-to-ndjson.py      16 lines different
        dengue/csv-to-ndjson    IDENTICAL
        ncov/csv-to-ndjson       3 lines different
mpx/download-from-s3
        dengue/download-from-s3       2 lines different
        ncov/download-from-s3       8 lines different
mpx/fasta-to-ndjson
        rsv/fasta-to-ndjson     IDENTICAL
        dengue/fasta-to-ndjson  IDENTICAL
mpx/fetch-from-genbank
        dengue/fetch-from-genbank       1 lines different
mpx/genbank-url
        rsv/genbank-url      42 lines different
        dengue/genbank-url      11 lines different
mpx/join-metadata-and-clades.py
        rsv/join-metadata-and-clades.py       3 lines different
        dengue/join-metadata-and-clades.py      IDENTICAL
        ncov/join-metadata-and-clades     114 lines different
mpx/merge-user-metadata
        rsv/merge-user-metadata IDENTICAL
        dengue/merge-user-metadata      IDENTICAL
mpx/ndjson-to-tsv-and-fasta
        rsv/ndjson-to-tsv-and-fasta     IDENTICAL
        dengue/ndjson-to-tsv-and-fasta  IDENTICAL
mpx/notify-on-diff
        dengue/notify-on-diff   IDENTICAL
mpx/notify-on-job-fail
        rsv/notify-on-job-fail       1 lines different
        dengue/notify-on-job-fail       1 lines different
        ncov/notify-on-job-fail      10 lines different
mpx/notify-on-job-start
        rsv/notify-on-job-start       3 lines different
        dengue/notify-on-job-start       3 lines different
        ncov/notify-on-job-start      30 lines different
mpx/notify-on-record-change
        rsv/notify-on-record-change       3 lines different
        dengue/notify-on-record-change       3 lines different
        ncov/notify-on-record-change       6 lines different
mpx/notify-slack
        rsv/notify-slack      15 lines different
        dengue/notify-slack     IDENTICAL
        ncov/notify-slack      16 lines different
mpx/reverse_reversed_sequences.py
        dengue/reverse_reversed_sequences.py    IDENTICAL
mpx/s3-object-exists
        rsv/s3-object-exists    IDENTICAL
        dengue/s3-object-exists IDENTICAL
        ncov/s3-object-exists       1 lines different
mpx/sha256sum
        rsv/sha256sum   IDENTICAL
        dengue/sha256sum        IDENTICAL
        ncov/sha256sum       1 lines different
mpx/transform-authors
        rsv/transform-authors   IDENTICAL
        dengue/transform-authors        IDENTICAL
mpx/transform-date-fields
        rsv/transform-date-fields       IDENTICAL
        dengue/transform-date-fields    IDENTICAL
mpx/transform-field-names
        rsv/transform-field-names       IDENTICAL
        dengue/transform-field-names    IDENTICAL
mpx/transform-genbank-location
        rsv/transform-genbank-location  IDENTICAL
        dengue/transform-genbank-location       IDENTICAL
mpx/transform-strain-names
        rsv/transform-strain-names       1 lines different
        dengue/transform-strain-names   IDENTICAL
mpx/transform-string-fields
        rsv/transform-string-fields     IDENTICAL
        dengue/transform-string-fields  IDENTICAL
mpx/trigger
        dengue/trigger  IDENTICAL
        ncov/trigger    IDENTICAL
mpx/trigger-on-new-data
        dengue/trigger-on-new-data       1 lines different
        ncov/trigger-on-new-data       6 lines different
mpx/upload-to-s3
        rsv/upload-to-s3       3 lines different
        dengue/upload-to-s3       3 lines different
        ncov/upload-to-s3       1 lines different
jameshadfield added a commit that referenced this issue Jul 11, 2023
This moves the script to the vendored directory, which is a git subtree
of https://github.com/nextstrain/ingest (currently at branch
apply-geolocation-rules). The file suffix is removed to match how it
appears in the other repos which use it.

As per [Overview of duplicated scripts](#1)
this script also appears in:
* [monkeypox](https://github.com/nextstrain/monkeypox/blob/a1f0d7b757d323d87edcbe61c6c5ccfbdf47722c/ingest/bin/apply-geolocation-rules)
* [rsv](https://github.com/nextstrain/rsv/blob/ba171f4a43110382c38b6154be3febd50408d7bf/ingest/bin/apply-geolocation-rules)
* [dengue, branch new_ingest](https://github.com/nextstrain/dengue/blob/247b2fd897361f2548627de1d97d45fae4115c5c/ingest/bin/apply-geolocation-rules)

All three of those scripts are identical to each other. The script vendored
here contains two code changes (whitespace removed from diffs):

**Ignore comment lines in the location-rules TSV**
```diff
< if line.lstrip()[0] == '#':
---
> if line.strip()=="" or line.lstrip()[0] == '#':
```

**Allow fields to be missing from the input NDJSON**

The script previously mandated that the input NDJSON had all four
fields (region/country/division/location). This is relaxed here, with
an empty string used if the field is not present.

```diff
< annotated_values = transform_geolocations(geolocation_rules, [record[field] for field in location_fields])
---
> annotated_values = transform_geolocations(geolocation_rules, [record.get(field, '') for field in location_fields])
```
jameshadfield added a commit to nextstrain/hbv that referenced this issue Jul 11, 2023
This moves the script to the vendored directory, which is a git subtree
of https://github.com/nextstrain/ingest (currently at branch
apply-geolocation-rules). The file suffix is removed to match how it
appears in the other repos which use it.

As per [Overview of duplicated scripts](nextstrain/ingest#1)
this script also appears in:
* [monkeypox](https://github.com/nextstrain/monkeypox/blob/a1f0d7b757d323d87edcbe61c6c5ccfbdf47722c/ingest/bin/apply-geolocation-rules)
* [rsv](https://github.com/nextstrain/rsv/blob/ba171f4a43110382c38b6154be3febd50408d7bf/ingest/bin/apply-geolocation-rules)
* [dengue, branch new_ingest](https://github.com/nextstrain/dengue/blob/247b2fd897361f2548627de1d97d45fae4115c5c/ingest/bin/apply-geolocation-rules)

All three of those scripts are identical to each other. The script vendored
here contains two code changes (whitespace removed from diffs):

**Ignore comment lines in the location-rules TSV**
```diff
< if line.lstrip()[0] == '#':
---
> if line.strip()=="" or line.lstrip()[0] == '#':
```

**Allow fields to be missing from the input NDJSON**

The script previously mandated that the input NDJSON had all four
fields (region/country/division/location). This is relaxed here, with
an empty string used if the field is not present.

```diff
< annotated_values = transform_geolocations(geolocation_rules, [record[field] for field in location_fields])
---
> annotated_values = transform_geolocations(geolocation_rules, [record.get(field, '') for field in location_fields])
```
@joverlee521 joverlee521 self-assigned this Jul 13, 2023
@joverlee521

This comment was marked as duplicate.

This was referenced Jul 14, 2023
@victorlin victorlin changed the title Overview of duplicated scripts Add duplicated scripts from pathogen repos Aug 1, 2023
@victorlin
Copy link
Member

@joverlee521 thanks for making the checklist in the comment above! It'll be useful to have it continually updated. To make that easier, I've moved it to the main issue text.

@joverlee521
Copy link
Contributor

In talking through #20 with @j23414, we realized that join-metadata-and-clades can mostly be replaced with a couple of csvtk commands (csvtk cut | csvtk rename2 | csvtk join).

The version of the script in ncov-ingest adds clock_deviation, but that can be done separately from the joining.

@joverlee521
Copy link
Contributor

Closing issue as we have resolved all of the listed duplicate scripts.
Any other additions can be opened as separate issues in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

No branches or pull requests

3 participants