
kaldi_writer.save() aborts after 3 hours #107

Open
davidavdav opened this issue Mar 9, 2020 · 11 comments

@davidavdav

Calling local/prepare_data.py from german-asr, the process consistently aborts after about 3 hours without any diagnostics. Only train/wav.scp is written, and that happens in the first few minutes. After that, the 64 GB machine stays busy at roughly 20% CPU and 14% memory for three hours, after which it silently aborts.

@aahlenst
Collaborator

aahlenst commented Mar 9, 2020

Thanks for the report. What Python version are you using, what operating system? Is there anything else you can tell us to help track down the problem?

@davidavdav
Author

Hi, this would be Python 3.7.3, Linux 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u1 (2019-09-20) x86_64 GNU/Linux. I am now trying the same on a 128GB mem sibling computer.

I followed the megs download instructions, but in such a way that the input corpus in data/full/files.txt contains paths starting with ../../megs/data/download. The resulting data/train/wav.scp contains correctly resolved absolute paths (even though I am not a big fan of absolute paths).

@ynop
Owner

ynop commented Mar 9, 2020

Did your problem occur when executing the scripts as they are?
What exactly have you changed to get relative paths like ../../megs/data/download?

@davidavdav
Author

Typically I never really execute the do-it-all scripts, as I tend to store source databases somewhat differently from what the do-it-all script writer had in mind. Also, the do-it-all scripts tend to never actually run in one go for me, because things go wrong and they take long.

So I have megs as a submodule of german-asr, so that the download directory ended up as megs/data/download. Then within megs I ran python scripts/merge_and_subset.py data/download ../data/full, thinking that this would produce the kaldi-style data/train files (which was not the case). This is how the relative paths ended up in data/full.

But of course I can re-run python scripts/merge_and_subset.py within megs so that it produces a megs/data/full that doesn't contain paths starting with ... Then I can see if that resolves the problem.

@ynop
Owner

ynop commented Mar 9, 2020

Did you also run the waverize script before exporting to kaldi?

@davidavdav
Author

No, I didn't. The instructions about recreate.sh suggested that it was optional ("if needed").
I did check the consistency; the script found some 20 differences w.r.t. the expected data/state.json, but since there was no indication as to what those differences were, I assumed they were due to the relative paths.

@ynop
Owner

ynop commented Mar 9, 2020

OK, so the consistency check is just for making sure you have the same data.
It is not that important for the rest to work at all.
And it is also fine that you run it with different paths.
But I don't know exactly what you are doing, so I can't reproduce the problem.
I would suggest you run waverize as well; having only wav files will help with the kaldi export.

@davidavdav
Author

Sure, I understand that it is hard to reproduce the problem from the description, and the databases are quite large, but is there a way to make kaldi_writer.save() a bit more verbose?

I can run the waverize script, I suppose at the cost of some extra disk space.

But generally, I don't really know why it is so hard to produce the kaldi files; the information is in data/full/{files,utterances}.txt, right?
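
(A minimal sketch, not from the thread, assuming audiomate reports progress through Python's standard logging module; if it does, enabling debug logging before the export should surface more detail, and if it doesn't, this has no effect.)

    import logging

    # Assumption: only useful if the library logs via the standard logging module.
    logging.basicConfig(
        level=logging.DEBUG,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    # ... then run the export as usual, e.g. kaldi_writer.save(...)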

@ynop
Owner

ynop commented Mar 9, 2020

The main problem is that all utterance durations are needed for the kaldi files.
It needs to read these from the audio files.
That is why it takes so long.
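
(Illustrative only, not audiomate's actual implementation: for an uncompressed wav file the duration can be read from the header alone, roughly like this, whereas compressed formats such as mp3 or flac may have to be decoded first, which is much slower and presumably why converting with waverize helps.)

    import wave

    def wav_duration_seconds(path):
        # PCM wav: duration follows from frame count and sample rate in the header.
        with wave.open(path, "rb") as f:
            return f.getnframes() / float(f.getframerate())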

@davidavdav
Author

OK, just for the sake of trying, I re-did cd megs; python scripts/merge_and_subset.py data/download data/full, and then cd ..; python local/prepare_data.py megs/data/full/ data, and this still bails out after 261 minutes, after only having produced data/train/wav.scp, without any further indication why.

I understand you need the utterance durations; I think these are already computed in scripts/merge_and_subset.py. The Kaldi approach is to store the durations in utt2dur.

I think I'm going ahead now by making my own script that processes the relevant files in megs/data/full and produces kaldi train/test files. But this sort of defeats the purpose of audiomate, I suppose.
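
(A minimal sketch of what such a hand-rolled export could look like, not audiomate's implementation. It assumes the corpus metadata has already been parsed into (utt_id, rec_id, wav_path, start, end) tuples with the durations known; the actual layout of files.txt and utterances.txt may differ.)

    import os

    def write_kaldi_dir(utterances, out_dir):
        # utterances: iterable of (utt_id, rec_id, wav_path, start, end), times in seconds.
        os.makedirs(out_dir, exist_ok=True)
        recordings = {}
        with open(os.path.join(out_dir, "utt2dur"), "w") as utt2dur, \
             open(os.path.join(out_dir, "segments"), "w") as segments:
            for utt_id, rec_id, wav_path, start, end in utterances:
                recordings[rec_id] = wav_path
                utt2dur.write(f"{utt_id} {end - start:.3f}\n")
                segments.write(f"{utt_id} {rec_id} {start:.3f} {end:.3f}\n")
        with open(os.path.join(out_dir, "wav.scp"), "w") as wav_scp:
            for rec_id, wav_path in sorted(recordings.items()):
                wav_scp.write(f"{rec_id} {os.path.abspath(wav_path)}\n")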

@ynop
Owner

ynop commented Mar 11, 2020

You are right about utt2dur, but the durations are also needed for the segments file, since some of the utterances are only segments of the full audio file.

Have you tried with waverize?
Is the wav.scp empty?
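
(For reference, the standard Kaldi layout being discussed, not specific to audiomate; the recording/utterance IDs and the path below are made up. wav.scp has one line per recording, "recording-id wav-path-or-pipe", while segments maps each utterance to a time range within a recording, "utterance-id recording-id start end" in seconds, which is why the offsets and durations have to be known.)

    wav.scp:
        rec001 /data/audio/rec001.wav

    segments:
        rec001-utt01 rec001 0.00 4.35
        rec001-utt02 rec001 4.35 9.80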
