Reproducibility of dataset downloads #126

Open
faroit opened this issue May 18, 2020 · 2 comments

Comments

@faroit
Contributor

faroit commented May 18, 2020

When downloading corpora from versioned data stores, I would expect audiomate to take into account a tag or a specific hash of that dataset. That way, users can be sure that a specific version of audiomate always yields an identical corpus, which fosters reproducibility.

For example, take the ESC-50 corpus: its download URL points directly at the master branch:

DOWNLOAD_URL = 'https://github.com/karoldvl/ESC-50/archive/master.zip'

To improve reproducibility, I suggest that audiomate use tags where possible (GitHub, Zenodo, ...) and, in addition, provide a checksum mechanism that verifies a successful download.
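A minimal sketch of what this could look like, assuming a SHA-256 checksum stored next to a pinned URL. Nothing here is existing audiomate API: `archive_url`, `sha256_of`, and `verify_download` are hypothetical helper names, and `PINNED_REF` and `EXPECTED_SHA256` are placeholders because this issue does not name a specific tag, commit, or checksum.

```python
import hashlib

# Placeholders, not real values: fill in an actual tag or commit SHA
# and the checksum of the archive it produces.
PINNED_REF = "<tag-or-commit-sha>"
EXPECTED_SHA256 = "<sha256-of-the-archive>"


def archive_url(ref):
    """Build the GitHub archive URL for a fixed ref instead of master."""
    return f"https://github.com/karoldvl/ESC-50/archive/{ref}.zip"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Raise ValueError if the file at `path` does not match the expected checksum."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"Checksum mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
```

The downloader would fetch `archive_url(PINNED_REF)` and call `verify_download` on the result before extracting it, so that a truncated or changed archive fails loudly instead of silently producing a different corpus.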

This issue is part of a JOSS review: openjournals/joss-reviews#2135

@ynop
Owner

ynop commented May 18, 2020

Yes, that is a good idea. I also had something similar in mind.
Due to some datasets changing "frequently", I wanted to introduce versions.
So you could actually select which version you want to use.
Of course, this could be combined with your approach of tags and checksums.

@faroit
Contributor Author

faroit commented May 18, 2020

> Due to some datasets changing "frequently", I wanted to introduce versions.
> So you could actually select which version you want to use.

Yes, that's also a good idea, and it could be added on top of tags. But I think you might end up with a less confusing API if each version of audiomate supports only a fixed set of dataset versions. Of course, this comes with the drawback that users would be unable to load an older dataset version once they move to a newer version of audiomate. But I think this would rarely happen in practice, as most users would use audiomate for a single dataset per project.

For now, I would suggest freezing as many downloads as possible instead of pointing to the latest version.
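To make the "fixed set of dataset versions per audiomate release" idea concrete, here is a rough sketch under stated assumptions. `DatasetRelease`, `SUPPORTED_RELEASES`, `DEFAULT_VERSION`, and `resolve_release` are hypothetical names, not audiomate API, and the version labels, URLs, and checksums are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRelease:
    """A frozen download target: a pinned URL plus its expected checksum."""
    url: str
    sha256: str


# Hypothetical registry shipped with one library release.
# All values are placeholders, not real versions or checksums.
SUPPORTED_RELEASES = {
    "esc-50": {
        "<version-a>": DatasetRelease(url="<pinned-url-a>", sha256="<sha256-a>"),
        "<version-b>": DatasetRelease(url="<pinned-url-b>", sha256="<sha256-b>"),
    },
}

# The version used when the caller does not ask for a specific one.
DEFAULT_VERSION = {"esc-50": "<version-b>"}


def resolve_release(corpus, version=None):
    """Return the pinned release for a corpus, rejecting unsupported versions."""
    if corpus not in SUPPORTED_RELEASES:
        raise KeyError(f"Unknown corpus: {corpus}")
    releases = SUPPORTED_RELEASES[corpus]
    if version is None:
        version = DEFAULT_VERSION[corpus]
    if version not in releases:
        supported = ", ".join(sorted(releases))
        raise ValueError(
            f"Version {version!r} of {corpus} is not supported by this release "
            f"(supported: {supported})"
        )
    return releases[version]
```

With a table like this, the default stays frozen for a given library release, and requesting a dataset version outside the supported set fails with an explicit error rather than silently falling back to the latest.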
