Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add scripts for NCBI Virus #16

Merged
merged 13 commits into from
Aug 29, 2023
Merged

Add scripts for NCBI Virus #16

merged 13 commits into from
Aug 29, 2023

Conversation

joverlee521
Copy link
Contributor

@joverlee521 joverlee521 commented Aug 21, 2023

Description of proposed changes

Add scripts for downloading metadata and sequences from NCBI Virus.
See commits for details.

Related issue(s)

Subset of scripts listed in #1

Checklist

  • Checks pass
  • If adding a script, add an entry for it in the README.

Copied from https://github.com/nextstrain/monkeypox/blob/12ccb428aabccd9271fe8c5e61533edd656474e4/ingest/bin/genbank-url

The original script in ncov-ingest has already been removed because of
the switch over to using NCBI Datasets CLI,¹ so I'm copying from the
actively maintained version of the script in monkeypox.

Subsequent copies of the script in other pathogen repos have all made
changes to the URL params, which I will generalize in subsequent commits.

¹ nextstrain/ncov-ingest@60d603f
There are too many potential filters to add as individual options, so
add a general filter option to allow custom filter criteria.

Added based on the changes made in the dengue repo
https://github.com/nextstrain/dengue/blob/a6df20f0e8540a60a403eab81d80e517d1e31ee0/ingest/bin/genbank-url#L41
We have a set of core fields that are shared across viruses, but
certain pathogens may require additional fields from NCBI Virus.

Added based on the changes made in the dengue repo
https://github.com/nextstrain/dengue/blob/a6df20f0e8540a60a403eab81d80e517d1e31ee0/ingest/bin/genbank-url#L60
Copied from https://github.com/nextstrain/ncov-ingest/blob/6c7ae5d7d259d62b8a0f016f02be51a11090592a/bin/csv-to-ndjson

Subsequent copies of the script that are functionally identical:
- https://github.com/nextstrain/monkeypox/blob/12ccb428aabccd9271fe8c5e61533edd656474e4/ingest/bin/csv-to-ndjson
- https://github.com/nextstrain/dengue/blob/a6df20f0e8540a60a403eab81d80e517d1e31ee0/ingest/bin/csv-to-ndjson
- https://github.com/nextstrain/zika/blob/7ec737b8d249302e9dc1e26f42e32696781e1488/ingest/bin/csv-to-ndjson

Copy in the rsv repo¹ has been modified to use argparse to take in input
and output option, which I do not think are necessary to port over to
the centralized script which will just pipe data through stdin/stdout:

Copy in the hepatitisB repo² has been modified to convert TSV to NDJSON
by modifying the delimiter. I don't think we need to port over this
changes since we can drop the `tsv-to-ndjson` script if we modify the
workflow as described in nextstrain/hbv#9.

¹ https://github.com/nextstrain/rsv/blob/75d4c07c0b17693dd19b137a317cb24c434b83f3/ingest/bin/csv-to-ndjson.py
² https://github.com/nextstrain/hepatitisB/blob/1f38f623d493bbafc25a4ab60226a4bd59ef8f6d/ingest/scripts/tsv-to-ndjson.py
Copied from https://github.com/nextstrain/monkeypox/blob/12ccb428aabccd9271fe8c5e61533edd656474e4/ingest/bin/fetch-from-genbank

The original script in ncov-ingest has already been removed because of
the switch over to using NCBI Datasets CLI,¹ so I'm copying the actively
maintained version of the script in monkeypox.

I had considered going back to using a single Python script to fetch the
data, but found that the requests library would still run into a
ChunkedEncodingError as previously explained in
nextstrain/ncov-ingest@6af81e7.
Seems more reliable to just stick with curl.

The following commit will update to use the new options for the
ncbi-virus-url script.

¹ nextstrain/ncov-ingest@60d603f
@joverlee521 joverlee521 requested a review from a team August 21, 2023 21:25
('region', 'Region_s'),
('location', 'CountryFull_s'),
('collected', 'CollectionDate_s'),
('submitted', 'CreateDate_dt'),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding updated as a central output column.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would updated be the date the most recent revision was released, as opposed to the first revision, which is submitted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that's my interpretation of the field. I can't find any official documentation though...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in d304c91

Copy link
Member

@corneliusroemer corneliusroemer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just a few non-blocking comments.

csv-to-ndjson Show resolved Hide resolved
('region', 'Region_s'),
('location', 'CountryFull_s'),
('collected', 'CollectionDate_s'),
('submitted', 'CreateDate_dt'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would updated be the date the most recent revision was released, as opposed to the first revision, which is submitted?

@@ -73,6 +73,9 @@ NCBI interaction scripts that are useful for fetching public metadata and sequen

- [fetch-from-ncbi-entrez](fetch-from-ncbi-entrez) - Fetch metadata and nucleotide sequences from [NCBI Entrez](https://www.ncbi.nlm.nih.gov/books/NBK25501/) and output to a GenBank file.
Useful for pathogens with metadata and annotations in custom fields that are not part of the standard [NCBI Virus](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/) or [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/) outputs.
- [fetch-from-ncbi-virus](fetch-from-ncbi-virus) - Fetch metadata and nucleotide sequences from [NCBI Virus](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/) and output NDJSON records to stdout.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know of a data dictionary for all the csv/ndjson fields returned? The "all-fields" example is amazing, but it would be great if we had a bit more information about what they represent - in case this exists, maybe it doesn't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's not really any documentation since this is not an official API. The NCBI Virus help page is the closest I can find, but that uses the "pretty" field names that they display on the web UI.

Comment on lines +24 to +27
parser.add_argument("--filters", required=False, nargs="*",
help="Filter criteria to add as `fq` param values. " +
"Apply filters via the NCBI Virus UI and observe the network " +
"activity to find the desired filter string.")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our latest meeting with WA DOH, people were concerned whether the clean data that Nextstrain hosts at data.nextstrain.org/files might exclude certain records from NCBI.

Allowing customizable filters here might add to that concern, but there's already so many other points in ingest pipelines that might filter out records. I think we just need to be better about documenting these filters that are used to generate the clean data.

joverlee521 and others added 5 commits August 24, 2023 16:03
Switch over to the new ncbi-virus-url script that allows customizations
of filters and fields.
Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
Useful for knowing when the record was last updated in NCBI, which
@j23414 noted to be the "Modification date".¹

This will ultimately add a new field to the NDJSON output from
fetch-from-ncbi-virus, but it should not affect the final metadata
output of the ingest pipelines. Ingest pipelines usually define the
final metadata output columns in the config and would need to be updated
separately.

¹ nextstrain/mpox#104 (comment)
Add github_repo argument to allow customization of the User-Agent
request header.

Following the pattern of other scripts in this repo that requires the
repo owner and name.
@joverlee521
Copy link
Contributor Author

Merging to use as part of https://github.com/nextstrain/pathogen-repo-template.
Can circle back if we find issues in pathogen ingest workflows.

@joverlee521 joverlee521 merged commit 8238f10 into main Aug 29, 2023
2 checks passed
@joverlee521 joverlee521 deleted the ncbi-virus branch August 29, 2023 21:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

2 participants