Add scripts for NCBI Virus #16

Merged
merged 13 commits into from
Aug 29, 2023
13 changes: 11 additions & 2 deletions ncbi-virus-url
@@ -12,6 +12,7 @@ and observing the network activity at
https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide
"""
from urllib.parse import urlencode
from typing import List, Optional
import argparse

def parse_args():
@@ -20,9 +21,13 @@ def parse_args():
        help="NCBI Taxon ID. Visit NCBI virus at " +
            "https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/find-data/virus " +
            "to search for supported taxon IDs.")
    parser.add_argument("--filters", required=False, nargs="*",
        help="Filter criteria to add as `fq` param values. " +
            "Apply filters via the NCBI Virus UI and observe the network " +
            "activity to find the desired filter string.")
Comment on lines +24 to +27
Contributor Author

In our latest meeting with WA DOH, people were concerned whether the clean data that Nextstrain hosts at data.nextstrain.org/files might exclude certain records from NCBI.

Allowing customizable filters here might add to that concern, but there are already many other points in ingest pipelines that might filter out records. I think we just need to be better about documenting the filters that are used to generate the clean data.

    return parser.parse_args()

def build_query_url(ncbi_taxon_id: str):
def build_query_url(ncbi_taxon_id: str, filters: Optional[List[str]]=None):
    """
    Generate URL to download all viral sequences and their curated metadata
    from GenBank via NCBI Virus.
@@ -33,6 +38,7 @@ def build_query_url(ncbi_taxon_id: str):
        'fq': [
            '{!tag=SeqType_s}SeqType_s:("Nucleotide")', # Nucleotide sequences (as opposed to protein)
            f'VirusLineageId_ss:({ncbi_taxon_id})',
            *(filters or []),
        ],

        # Unclear, but seems necessary.
@@ -80,7 +86,10 @@ def build_query_url(ncbi_taxon_id: str):

def main():
    args = parse_args()
    build_query_url(args.ncbi_taxon_id)
    build_query_url(
        ncbi_taxon_id=args.ncbi_taxon_id,
        filters=args.filters,
    )

if __name__ == '__main__':
    main()
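
For anyone documenting which filters feed a given dataset, the following is a minimal, self-contained sketch of how the new `--filters` values would end up in the query string. It is not the script itself. The endpoint is a placeholder, the `--ncbi-taxon-id` flag spelling is inferred from `args.ncbi_taxon_id`, the filter `ExampleField_s:("value")` is a made-up string, and the use of `urlencode(..., doseq=True)` is an assumption about how the list-valued `fq` param is serialized.

```python
import argparse
from typing import List, Optional
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint for illustration only; NOT the real NCBI Virus URL.
ENDPOINT = "https://example.invalid/search"


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # Flag name assumed from `args.ncbi_taxon_id` in the diff.
    parser.add_argument("--ncbi-taxon-id", required=True)
    # With nargs="*", omitting the flag leaves `filters` as None.
    parser.add_argument("--filters", required=False, nargs="*")
    return parser.parse_args(argv)


def build_query_url(ncbi_taxon_id: str,
                    filters: Optional[List[str]] = None) -> str:
    params = {
        "fq": [
            '{!tag=SeqType_s}SeqType_s:("Nucleotide")',
            f"VirusLineageId_ss:({ncbi_taxon_id})",
            # `filters or []` covers both None and an empty list.
            *(filters or []),
        ],
    }
    # doseq=True (assumed) emits one `fq=` pair per list element.
    return f"{ENDPOINT}?{urlencode(params, doseq=True)}"


# Hypothetical taxon ID and filter string, for illustration only.
args = parse_args(["--ncbi-taxon-id", "12345",
                   "--filters", 'ExampleField_s:("value")'])
url = build_query_url(args.ncbi_taxon_id, args.filters)

# Decode the query string to list every `fq` value that was applied.
applied_filters = parse_qs(urlsplit(url).query)["fq"]
print(applied_filters)
```

Decoding the generated URL this way gives a list of every `fq` value sent, which could be recorded alongside the clean data to address the documentation concern raised in the review comment.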