
Create and upload partitioned datasets by year-month, clade, continent #398

Open · wants to merge 1 commit into master
Conversation


@corneliusroemer (Member) commented Apr 3, 2023

Description of proposed changes

We discussed partitioned sequences/metadata at our most recent Nextstrain call.

This PR shows how easy it would be to produce such partitions for metadata and sequences.

A subsample, e.g. 100k sequences, would be similarly easy to create.

Testing

Test results are available for inspection via:

aws s3 cp s3://nextstrain-staging/files/ncov/open/branch/partitioned-metadata/sequences_2023-03.fasta.zst .
aws s3 cp s3://nextstrain-staging/files/ncov/open/branch/partitioned-metadata/metadata_2023-03.tsv.zst .

and likewise for the other partitions, e.g. metadata_22F.tsv.zst, metadata_africa.tsv.zst, etc.

@corneliusroemer corneliusroemer requested a review from a team April 4, 2023 16:30

@tsibley (Member) left a comment


I'd think this would cause quite an increase in storage usage, both transiently in ncov-ingest workflow runs and permanently in S3.

shell:
"""
tsv-select -H -f strain {input.metadata} > {output.strains}
seqkit grep -f {output.strains} {input.sequences} > {output.sequences}
Member

If it's at all possible that strain will contain spaces (and I think it is), then you'll want seqkit grep's --by-name (-n) option.
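To illustrate the difference with a hypothetical sketch (pure Python, not seqkit itself): by default seqkit grep matches the sequence ID, i.e. the header up to the first whitespace, so a strain name containing spaces never matches; --by-name compares against the full header.

```python
# Hypothetical sketch of the matching difference (not seqkit's actual code).
# ID matching uses only the header's first whitespace-delimited token;
# --by-name compares the full header line.
headers = ["hCoV-19/New Zealand/xyz/2021", "hCoV-19/USA/abc/2021"]
wanted = {"hCoV-19/New Zealand/xyz/2021"}

by_id = [h for h in headers if h.split()[0] in wanted]  # ID = first token
by_name = [h for h in headers if h in wanted]           # full header

print(by_id)    # [] -- "hCoV-19/New" is not the full strain name
print(by_name)  # ['hCoV-19/New Zealand/xyz/2021']
```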

Member Author (corneliusroemer)

Good point

"data/{database}/metadata_clade_{clade}.tsv",
shell:
"""
tsv-filter -H --istr-in-fld "Nextstrain_clade:{wildcards.clade}" {input} > {output}
Member

This splits into single clades, so I guess for the 21L-rooted builds we'd have to list as inputs 21L, 22A, 22B, and so on? And that will need continual updating, then, right?

If we end up using multiple inputs, we will start commonly running into long-standing issues with the very poor memory efficiency of the ncov workflow's "combine metadata" step. We avoid that now in our production runs because we only pass a single input.

Separately, a substring condition seems a bit imprecise and fragile.
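As a sketch of that imprecision (hypothetical clade values; tsv-filter's --istr-in-fld does a case-insensitive substring match on the field):

```python
# Hypothetical clade values; "21AA" does not exist today but shows the hazard
# of substring matching versus comparing the exact clade token.
rows = ["21A (Delta)", "21AA (hypothetical)", "22A (Omicron)"]

substring_match = [r for r in rows if "21A" in r]         # like --istr-in-fld
exact_match = [r for r in rows if r.split()[0] == "21A"]  # exact clade token

print(substring_match)  # ['21A (Delta)', '21AA (hypothetical)']
print(exact_match)      # ['21A (Delta)']
```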

Member Author (corneliusroemer)

We can generate that list programmatically as well if desired, because clade definitions are now hierarchical.

We don't update clades that often, so it wouldn't be much of a headache to update the list explicitly each time.

months_since_2020_01 = {f"{year}-{month:02d}" for year in range(2020, now.year + 1) for month in range(1, 12 + 1) if year < now.year or month <= now.month}
regions = {"europe", "north-america", "south-america", "asia", "africa", "oceania"}

max_per_year = {"19": "B", "20": "K", "21": "M", "22": "F", "23": "A"}
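Assuming clade letters run consecutively from A up to each year's maximum, the max_per_year mapping above could be expanded into the full clade list instead of maintaining one by hand. A sketch:

```python
from string import ascii_uppercase

# Copied from the snippet above; the expansion assumes clade letters
# run consecutively from A up to the per-year maximum.
max_per_year = {"19": "B", "20": "K", "21": "M", "22": "F", "23": "A"}

clades = [
    f"{year}{letter}"
    for year, last in max_per_year.items()
    for letter in ascii_uppercase[: ascii_uppercase.index(last) + 1]
]

print(clades[:4])  # ['19A', '19B', '20A', '20B']
```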
Member

This would need to get bumped after every new clade in the current year?

Member Author (corneliusroemer)

In the current implementation, yes, but it could be automated based on clades.tsv. This is just a first attempt at showing how it could work.

@corneliusroemer (Member Author) commented:

> I'd think this would cause quite an increase in storage usage, both transiently in ncov-ingest workflow runs and permanently in S3.

Since a partitioning divides the data into equivalence classes, all partitions together are the same total size as the dataset being partitioned.

For S3 that would mean an extra ~1.5 GB for open (sequences and metadata together) and ~3 GB for GISAID (sequences + metadata) per run. We could disable versioning if we're worried about S3 storage.

For storage during the run: yes, it's inefficient, but only because upload requires uncompressed inputs. If upload accepted zstd-compressed inputs, the impact would be only ~1.5-3 GB per partitioning (continent counts as one partitioning, year-month as another, etc.), so this PR would add 3 partitionings. Also, the output could be marked as temp so that it gets deleted once uploaded, which should make the overall storage impact much smaller.
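The equivalence-class argument can be checked with a toy sketch (hypothetical records): each key induces a partitioning that covers every record exactly once, so the partitions' sizes always sum to the dataset size.

```python
# Toy records (hypothetical); each partitioning key covers every record once.
records = [
    {"date": "2023-03", "clade": "22F", "region": "africa"},
    {"date": "2023-02", "clade": "23A", "region": "europe"},
    {"date": "2023-03", "clade": "22F", "region": "asia"},
]

def partition(rows, key):
    """Group rows into disjoint partitions by the value of `key`."""
    parts = {}
    for row in rows:
        parts.setdefault(row[key], []).append(row)
    return parts

# Every partitioning totals the same size as the original dataset.
for key in ("date", "clade", "region"):
    parts = partition(records, key)
    assert sum(len(p) for p in parts.values()) == len(records)
```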

tsibley added a commit to nextstrain/ncov that referenced this pull request Apr 5, 2023
Filter rules in the config are applied _after_ subsampling, which poses
issues with reliably getting the desired number of sequences.  As @trvrb
wrote¹:

> In the workflow, the filter rule happens after the subsampling rules.
> This makes it so that if we ask for say 2560 in a sampling bucket, we'll
> lose >50% due to filtering out non-21L-descending clades.
>
> This could be solved by padding count targets to compensate, but this is
> hacky and the numbers will change as time goes on. Or the filter rule
> could be placed again before subsample, but we moved it afterwards for
> good reasons.

A few custom rules for the builds allow us to prefilter the full
dataset before subsampling.  Currently these rules are specific to our
GISAID data source, but they could be easily expanded to our Open data
sources too.

In the future we might also provide clade-partitioned subsets from
ncov-ingest², which we could use here instead with some adaptation of
the build config.

¹ <#1029 (comment)>
² e.g. <nextstrain/ncov-ingest#398>
tsibley added a commit to nextstrain/ncov that referenced this pull request Apr 6, 2023 (same commit message as the Apr 5 commit above)