This repository has been archived by the owner on Aug 20, 2024. It is now read-only.

Very slow validation for single-end fastq samplesheets #90

Closed
apetkau opened this issue Aug 30, 2023 · 5 comments · Fixed by #120
Labels
enhancement (New feature or request), hackathon (Issues that can be worked on in the nf-core hackathon)

Comments

apetkau commented Aug 30, 2023

Hello. I have been working on how to use the fromSamplesheet function to validate a samplesheet against the assets/schema_input.json file and create a channel of input data. However, with nf-validation version 0.3.1 I have been encountering a large difference in validation time between a samplesheet containing paired-end fastq files and one containing single-end fastq files.

I have written up a method to reproduce the issue at https://github.com/apetkau/from-samplesheet-test-nf, but in brief you can run:

# Get total runtime of pipeline on a samplesheet
time nextflow run apetkau/from-samplesheet-test-nf -r main --input https://raw.githubusercontent.com/apetkau/from-samplesheet-test-nf/main/samplesheet.pe.30.csv

# Get time for validating samplesheet against schema_input.json from logs
grep 'Starting validation' -A1 .nextflow.log

You can replace the samplesheet.pe.30.csv with the listed files in the below table to run the other cases.

| Type | Samplesheet | Number of samples | Total runtime for example pipeline | Time for validating samplesheet with schema_input.json |
| --- | --- | --- | --- | --- |
| Paired-end | samplesheet.pe.30.csv | 30 | 5 seconds | < 1 second |
| Paired-end | samplesheet.pe.60.csv | 60 | 6 seconds | < 1 second |
| Single-end | samplesheet.se.30.csv | 30 | 30 seconds | 25 seconds |
| Single-end | samplesheet.se.60.csv | 60 | 387 seconds | 382 seconds |

That is, the time to validate the samplesheet against the schema_input.json file appears roughly constant as samples increase for paired-end samplesheets, but for single-end samplesheets going from 30 to 60 samples increases the validation time by a factor of about 15 (from 25 to 382 seconds).
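As a rough check on that scaling, the Python sketch below recomputes the growth factor from the single-end rows of the table. It also assumes, purely for illustration, that validation time follows a power law t ∝ n^k, and reports the exponent k implied by the two measurements. Two data points cannot establish the real growth law, and the function and variable names are illustrative, not part of nf-validation.

```python
import math

# Single-end validation times in seconds, copied from the table above.
# The paired-end rows are omitted because only "< 1 second" was reported.
single_end_seconds = {30: 25.0, 60: 382.0}


def growth_factor(timings, n_small, n_large):
    """Ratio of validation time at n_large samples to that at n_small."""
    return timings[n_large] / timings[n_small]


def implied_exponent(timings, n_small, n_large):
    """Exponent k such that t ~ n^k fits the two points (an assumed model)."""
    ratio = growth_factor(timings, n_small, n_large)
    return math.log(ratio) / math.log(n_large / n_small)


factor = growth_factor(single_end_seconds, 30, 60)
exponent = implied_exponent(single_end_seconds, 30, 60)

print(f"growth factor when doubling samples: {factor:.1f}x")
print(f"implied exponent k (assuming t ~ n^k): {exponent:.2f}")
```

Under a linear model, doubling the sample count would give a factor of about 2, and under a quadratic one about 4. The measured factor is far above both, which is consistent with something worse than quadratic, although exponential growth cannot be ruled out from two points.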

I am wondering if someone could help me to sort out this issue?

Thanks so much. And thanks for the amazing software. It's helped me out in my work 😄

Collaborator

nvnieuwk commented Sep 6, 2023

That's weird, I'll have to investigate further when I have some time. Could you in the meantime please try this:

  • Can you do the validation with the validateParameters() function first? fromSamplesheet() doesn't actually perform much validation and is mainly used to create a channel from the samplesheet

Author

apetkau commented Sep 7, 2023

Thanks so much 😄

I have updated the code in that repo to include validation with validateParameters() (the workflow is a minimal example, right here: https://github.com/apetkau/from-samplesheet-test-nf/blob/main/main.nf)

The results are the same, except that it seems like validation of the JSON schema is done twice now (I'm assuming once for validateParameters() and once for fromSamplesheet()).

| Type | Samplesheet | Number of samples | Total runtime for example pipeline | Time for validating samplesheet with schema_input.json |
| --- | --- | --- | --- | --- |
| Paired-end | samplesheet.pe.30.csv | 30 | 6 seconds | < 1 second (x2) |
| Paired-end | samplesheet.pe.60.csv | 60 | 6 seconds | < 1 second (x2) |
| Single-end | samplesheet.se.30.csv | 30 | 52 seconds | 24 + 13 seconds |
| Single-end | samplesheet.se.60.csv | 60 | 734 seconds | 368 + 360 seconds |
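To see how far these numbers fit the assumption that the schema is validated twice, the Python sketch below adds the two reported validation times for each single-end run. It compares the sum with the single-validation time from the earlier table and with the total runtime. The values are copied from the two tables in this thread. The runs were separate, so the comparison is only approximate, and the variable names are illustrative.

```python
# Seconds, copied from the tables in this thread (single-end rows only).
single_validation = {30: 25.0, 60: 382.0}              # fromSamplesheet() only
two_validations = {30: (24.0, 13.0), 60: (368.0, 360.0)}  # with validateParameters()
total_runtime = {30: 52.0, 60: 734.0}                  # whole pipeline, second table

summary = {}
for n_samples in (30, 60):
    combined = sum(two_validations[n_samples])
    summary[n_samples] = {
        "combined_seconds": combined,
        # How much longer two validations took than the single one measured earlier.
        "vs_single": combined / single_validation[n_samples],
        # Fraction of the total pipeline runtime spent validating.
        "share_of_runtime": combined / total_runtime[n_samples],
    }

for n_samples, row in summary.items():
    print(
        f"{n_samples} samples: {row['combined_seconds']:.0f} s validating, "
        f"{row['vs_single']:.2f}x the single-validation time, "
        f"{row['share_of_runtime']:.0%} of total runtime"
    )
```

For the 60-sample case the combined validation time is close to twice the earlier single-validation time and accounts for nearly all of the runtime, which fits the two-pass reading. The 30-sample case is less clear-cut, since the second pass was noticeably faster than the first.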

I was running this with Nextflow version 23.04.1 build 5866

Member

ewels commented Sep 15, 2023

I wonder if there's something unexpected happening here with type coercion of file objects, where a lot of data is being moved around unexpectedly or something... Could be related to #92

cmatKhan commented Sep 23, 2023

bump -- having the same problem with SE input. I appreciate @apetkau's description... I wouldn't have done nearly as nice a job with it. Glad to know that this will eventually get worked on.

For what it is worth, I am using nf-validation@0.3.2 and nextflow 23.04.2.

Author

apetkau commented Oct 23, 2023

Thanks so much @awgymer and the nf-validation team for the fix. It works great 😄
