Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multiple regions analysis / 5R / SMURF / q2-sidle #702

Merged
merged 30 commits into from
Mar 18, 2024

Conversation

d4straub
Copy link
Collaborator

@d4straub d4straub commented Feb 6, 2024

Addresses #701

Status:

  • produce simulated small test data
  • move primer info into meta map, i.e. sample-wise
  • process primers sample-wise
  • read multi region info via --input_multiregion
  • process without error --input samplesheet_multiregion.tsv --input_multiregion regions_multiregion.tsv --skip_taxonomy
  • retain all current function
  • produce per-region ASV table
  • add reconstruction subworkflow via pipesidle
  • plugin sidle output to downstream analysis
  • add sidle reference taxonomy entries & custom input
  • process --sidle_ref_tax_custom <file,file,file> --sidle_ref_tree_custom <file>
  • process --sidle_ref_taxonomy silva
  • process --sidle_ref_taxonomy greengenes
  • add proper test
  • q2-sidle to conda not wanted by q2-sidle developer
  • update docs
  • make sure all tests pass [except linting, template update prevents that!]
  • organize output
  • organize params in nextflow_schema.json
  • add params check for incompatible settings in lib/WorkflowAmpliseq.groovy

Potential problem:

  • All processes using Sidle use container 'docker.io/d4straub/pipesidle:0.1.0-beta', a container I build with https://github.com/d4straub/pipesidle/blob/main/Dockerfile and hosted in my docker hub account. The author of the pypi package doesnt want the software in conda. So thats the way I went for now. Any suggestions welcome.
  • Failing linting is due to template update 2.13 (that restructures pipeline functions in folder lib, therefore postponed for now).

Ideas postponed:

  • optional per-region read trimming definitions (trunclenf & trunlenr)
  • In case there are multiple regions, in overall_summary.tsv & *.stats.tsv remove from sample the region info and aggregate to obtain numbers per-sample instead of currently per-sample-per-region.
  • Pipeline summary report isnt compatible to multi-region analysis, this could be improved in the future
  • use appropriate taxlevels (conf/ref_databases.conf see taxlevels) & output tab separated taxonomies

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/ampliseq branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

Copy link

github-actions bot commented Feb 6, 2024

nf-core lint overall result: Failed ❌

Posted for pipeline commit 8c711a7

+| ✅ 180 tests passed       |+
#| ❔   6 tests were ignored |#
!| ❗   2 tests had warnings |!
-| ❌  10 tests failed       |-

❌ Test failures:

  • files_exist - File must be removed: lib/Utils.groovy
  • files_exist - File must be removed: lib/WorkflowMain.groovy
  • files_exist - File must be removed: lib/NfcoreTemplate.groovy
  • files_exist - File must be removed: lib/WorkflowAmpliseq.groovy
  • files_unchanged - .github/CONTRIBUTING.md does not match the template
  • files_unchanged - .github/PULL_REQUEST_TEMPLATE.md does not match the template
  • files_unchanged - .github/workflows/branch.yml does not match the template
  • files_unchanged - .github/workflows/linting_comment.yml does not match the template
  • files_unchanged - .github/workflows/linting.yml does not match the template
  • files_unchanged - pyproject.toml does not match the template

❗ Test warnings:

  • readme - README did not have a Nextflow minimum version badge.
  • schema_lint - Parameter input is not defined in the correct subschema (input_output_options)

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 2.13.1
  • Run at 2024-03-13 15:41:17

@d4straub
Copy link
Collaborator Author

Failing linting is due to template update 2.13 (that restructures pipeline functions in folder lib, therefore postponed for now).

@d4straub
Copy link
Collaborator Author

d4straub commented Mar 8, 2024

The implemented test_multiregion seems to run quite long, here are the processes with longest walltimes when I run it on my laptop:

process min sec
NFCORE_AMPLISEQ:AMPLISEQ:QIIME2_DIVERSITY:QIIME2_ALPHARAREFACTION 16 14
NFCORE_AMPLISEQ:AMPLISEQ:SIDLE_WF:SIDLE_TREERECON 13 21
NFCORE_AMPLISEQ:AMPLISEQ:SIDLE_WF:SIDLE_DBEXTRACT 5 37
NFCORE_AMPLISEQ:AMPLISEQ:SIDLE_WF:SIDLE_DBFILT 5 29

SIDLE_DBEXTRACT is repeated for each region, i.e. in this example 5 times.

Possible changes to reduce runtime:

  • 16min: --skip_alpha_rarefaction, minimal disadvantage (channel connection not tested) --> DONE
  • 13min+: SIDLE_TREERECON, currently cannot be omitted; disadvantage if omitted: phylogenetic tree isnt constructed, this will lead omission of downstream processes involving alpha & beta diversity --> DONE*
  • 10min+: SIDLE_DBEXTRACT, using less regions, i.e. less repetitions of this process, potentially downstream processes faster
  • 10min+: SIDLE_DBEXTRACT, using smaller database, i.e. greengenes 85% or such, disadvantage: minimal! --> DONE, 88% (85% failed due to no kmer matches)

*: Due to the smaller reference taxonomy database and the tiny dataset, alpha diversity plugin fails. Therefore the complete diversity calculation would need to be omitted, and therefore the tree reconstruction becomes obsolete.

edit: Now test_multiregion requires 17 minutes, which is in line with test (19min) & test_sintax (18min).

@d4straub d4straub mentioned this pull request Mar 18, 2024
11 tasks
@d4straub d4straub marked this pull request as ready for review March 18, 2024 09:24
@d4straub d4straub self-assigned this Mar 18, 2024
Copy link
Member

@erikrikarddaniel erikrikarddaniel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two tiny comments. Otherwise 👍

bin/taxref_reformat_sidle.sh Outdated Show resolved Hide resolved
modules/local/dada2_err.nf Show resolved Hide resolved
@d4straub d4straub merged commit 6e727a4 into nf-core:dev Mar 18, 2024
16 of 17 checks passed
@d4straub
Copy link
Collaborator Author

Thanks!

@d4straub d4straub deleted the add-multiple-regions-analysis branch March 18, 2024 12:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants