Create and upload partitioned datasets by year-month, clade, continent #398

Open
wants to merge 1 commit into
base: master
2 changes: 2 additions & 0 deletions Snakefile
@@ -55,6 +55,8 @@ include: "workflow/snakemake_rules/curate.smk"

include: "workflow/snakemake_rules/nextclade.smk"

include: "workflow/snakemake_rules/partition.smk"

if send_notifications and config.get("s3_src"):
    include: "workflow/snakemake_rules/slack_notifications.smk"

52 changes: 52 additions & 0 deletions workflow/snakemake_rules/partition.smk
@@ -0,0 +1,52 @@
"""
Creates partitioned datasets:
- by year_month
- by clade
- by continent
"""


rule metadata_by_year_month:
    input:
        "data/{database}/metadata.tsv",
    output:
        "data/{database}/metadata_year-month_{year}-{month}.tsv",
    shell:
        """
        tsv-filter -H --istr-in-fld "date:{wildcards.year}-{wildcards.month}" {input} > {output}
        """


rule metadata_by_clade:
    input:
        "data/{database}/metadata.tsv",
    output:
        "data/{database}/metadata_clade_{clade}.tsv",
    shell:
        """
        tsv-filter -H --istr-in-fld "Nextstrain_clade:{wildcards.clade}" {input} > {output}
Member

This splits into single clades, so I guess for the 21L-rooted builds we'd have to list as inputs 21L, 22A, 22B, and so on? And that will need continual updating, then, right?

If we end up using multiple inputs, we will start commonly running into long-standing issues with the very poor memory efficiency of the ncov workflow's "combine metadata" step. We avoid that now in our production runs because we only pass a single input.

Separately, a substring condition seems a bit imprecise and fragile.
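
To make the fragility concrete, here is a minimal Python sketch of the two matching modes. The clade labels are made up for illustration and are not taken from the real metadata. The point is only that a case-insensitive substring test, which is what `--istr-in-fld` performs, can select rows that an exact comparison (as with `--istr-eq`) would reject.

```python
# Illustrative only: these labels are hypothetical, not real metadata values.
labels = ["21L", "21L (Omicron)", "recombinant-21L", "22A"]


def substring_match(value: str, query: str) -> bool:
    """Case-insensitive substring test, mirroring --istr-in-fld."""
    return query.lower() in value.lower()


def exact_match(value: str, query: str) -> bool:
    """Case-insensitive equality test, mirroring --istr-eq."""
    return value.lower() == query.lower()


by_substring = [v for v in labels if substring_match(v, "21L")]
by_equality = [v for v in labels if exact_match(v, "21L")]

print(by_substring)  # every label that merely contains "21L"
print(by_equality)   # only the label that is exactly "21L"
```

Whether the looser match is a problem depends on what values the `Nextstrain_clade` column actually holds, which this sketch does not try to reproduce.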

Member Author

We can generate that list programmatically as well if desired, because clade definitions are now hierarchical.

We don't update clades that often, so it wouldn't be much of a headache to update the list explicitly every time.
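
A rough sketch of what generating the list could look like, assuming the hierarchy is available as a child-to-parent mapping. The `PARENT` table below is a hypothetical stand-in written for this example, not data read from the clade definitions, and `descendants` is a helper name invented here.

```python
from collections import defaultdict

# Hypothetical child -> parent mapping; a real one would be derived from the
# clade definitions rather than hard-coded like this.
PARENT = {
    "22A": "21L",
    "22B": "21L",
    "22E": "22B",
    "21K": "21M",
    "21L": "21M",
}


def descendants(root: str, parent: dict[str, str]) -> list[str]:
    """Return root plus every clade below it in the hierarchy, sorted by name."""
    # Invert the mapping so each parent lists its direct children.
    children = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)

    # Depth-first walk from the root, guarding against repeated visits.
    found, stack = set(), [root]
    while stack:
        clade = stack.pop()
        if clade in found:
            continue
        found.add(clade)
        stack.extend(children[clade])
    return sorted(found)


print(descendants("21L", PARENT))
```

With a helper along these lines, the inputs for a 21L-rooted build would follow from the hierarchy instead of being maintained by hand.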

        """


rule metadata_by_continent:
    input:
        "data/{database}/metadata.tsv",
    output:
        "data/{database}/metadata_region_{continent}.tsv",
    shell:
        """
        tsv-filter -H --istr-eq "region:{wildcards.continent}" {input} > {output}
        """

rule sequences_by_metadata:
    input:
        sequences="data/{database}/sequences.fasta",
        metadata="data/{database}/metadata_{partition}.tsv",
    output:
        sequences="data/{database}/sequences_{partition}.fasta",
        strains=temp("data/{database}/strains_{partition}.txt"),
    shell:
        """
        tsv-select -H -f strain {input.metadata} > {output.strains}
        seqkit grep -f {output.strains} {input.sequences} > {output.sequences}
Member

If it's at all possible that strain will contain spaces (and I think it is), then you'll want seqkit grep's --by-name (-n) option.
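
The distinction matters because a FASTA record's ID is conventionally the header text up to the first whitespace, while its name is the whole header line. A small Python sketch of the two views follows. The strain names are made up, and the parsing is a simplification for illustration rather than a reimplementation of seqkit.

```python
# Made-up headers for illustration; the second strain name contains a space.
fasta = """>USA/CA-1/2020
ACGT
>Hong Kong/VM1/2020
TTGA
"""

# Strain names as they would appear, one per line, in the strains file.
wanted = {"USA/CA-1/2020", "Hong Kong/VM1/2020"}

names = [line[1:] for line in fasta.splitlines() if line.startswith(">")]
ids = [name.split()[0] for name in names]  # text before the first whitespace

matched_by_id = [i for i in ids if i in wanted]
matched_by_name = [n for n in names if n in wanted]

print(matched_by_id)    # the strain with a space is missed
print(matched_by_name)  # both strains are found
```

Matching on the ID silently drops any strain whose name contains a space, which is the failure mode that matching on the full name avoids.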

Member Author

Good point

        """
26 changes: 26 additions & 0 deletions workflow/snakemake_rules/upload.smk
@@ -12,6 +12,8 @@ These output files are empty flag files to force Snakemake to run the upload rul
Note: we are doing parallel uploads of zstd compressed files to slowly make the transition to this format.
"""

import datetime

def compute_files_to_upload():
    """
    Compute files to upload
@@ -33,6 +35,30 @@ def compute_files_to_upload():
        "aligned.fasta.zst": f"data/{database}/aligned.fasta",
        "nextclade_21L.tsv.zst": f"data/{database}/nextclade_21L.tsv",
    }

    now = datetime.datetime.now()
    months_since_2020_01 = {f"{year}-{month:02d}" for year in range(2020, now.year+1) for month in range(1, 12+1) if year < now.year or month <= now.month}
    regions={"europe", "north-america", "south-america", "asia", "africa", "oceania"}

    max_per_year = {"19": "B", "20":"K", "21":"M", "22":"F","23":"A"}
Member

This would need to get bumped after every new clade in the current year?

Member Author

In the current implementation, yes, but it could be automated based on clades.tsv. This is just a first attempt at showing how it could work.
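
A minimal sketch of that automation, assuming `clades.tsv` is tab-separated with a header row and the clade name in a column called `clade`. Both the column layout and the sample rows are assumptions made for this example, and `clade_names` is a hypothetical helper, not something that exists in the workflow.

```python
import csv
import io

# Sample content invented for illustration; real rows would come from clades.tsv.
SAMPLE = "clade\tgene\tsite\talt\n22F\tS\t1\tX\n22F\tS\t2\tY\n23A\tS\t3\tZ\n"


def clade_names(handle) -> list[str]:
    """Collect the distinct, non-empty values of the 'clade' column, sorted."""
    reader = csv.DictReader(handle, delimiter="\t")
    return sorted({row["clade"] for row in reader if row["clade"]})


print(clade_names(io.StringIO(SAMPLE)))
```

Something like this could stand in for the hard-coded `max_per_year` table, so that a newly defined clade is picked up without editing the upload rules.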

    clades = set()
    for year, max_letter in max_per_year.items():
        for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
            if letter > max_letter:
                break
            clades.add(f"{year}{letter}")

    for clade in clades:
        files_to_upload[f"metadata_{clade}.tsv.zst"] = f"data/{database}/metadata_clade_{clade}.tsv"
        files_to_upload[f"sequences_{clade}.fasta.zst"] = f"data/{database}/sequences_clade_{clade}.fasta"

    for region in regions:
        files_to_upload[f"metadata_{region}.tsv.zst"] = f"data/{database}/metadata_region_{region}.tsv"
        files_to_upload[f"sequences_{region}.fasta.zst"] = f"data/{database}/sequences_region_{region}.fasta"

    for year_month in months_since_2020_01:
        files_to_upload[f"metadata_{year_month}.tsv.zst"] = f"data/{database}/metadata_year-month_{year_month}.tsv"
        files_to_upload[f"sequences_{year_month}.fasta.zst"] = f"data/{database}/sequences_year-month_{year_month}.fasta"

    if database=="genbank":
        files_to_upload["biosample.tsv.gz"] = f"data/{database}/biosample.tsv"