
kaldi_writer.save() aborts after 3 hours #107

Open
davidavdav opened this issue Mar 9, 2020 · 11 comments

@davidavdav

Calling local/prepare_data.py from german-asr, the process consistently aborts after about 3 hours without any diagnostics. Only train/wav.scp is written, and that happens in the first few minutes. After that, the 64 GB machine stays busy at roughly 20% CPU and 14% memory for three hours, after which it silently aborts.

@aahlenst
Collaborator

aahlenst commented Mar 9, 2020

Thanks for the report. What Python version are you using, what operating system? Is there anything else you can tell us to help track down the problem?

@davidavdav
Author

Hi, this would be Python 3.7.3, Linux 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u1 (2019-09-20) x86_64 GNU/Linux. I am now trying the same on a 128GB mem sibling computer.

I followed the megs download instructions, but in such a way that the input corpus in data/full/files.txt contains paths starting with ../../megs/data/download. The resulting data/train/wav.scp contains correctly resolved absolute paths (even though I am not a big fan of absolute paths).

@ynop
Owner

ynop commented Mar 9, 2020

Did your problem occur when executing the scripts as they are?
What exactly have you changed to get relative paths like ../../megs/data/download?

@davidavdav
Author

Typically I never really execute the do-it-all scripts, as I tend to store source databases somewhat differently from what the do-it-all script writer had in mind. Also, the do-it-all scripts tend to never actually run in one go for me, because things go wrong and they take long.

So I have megs as a submodule of german-asr, so that the download directory ended up as megs/data/download. Then within megs I ran python scripts/merge_and_subset.py data/download ../data/full, thinking that this would produce the kaldi-style data/train files (which was not the case). This is how the relative paths ended up in data/full.

But of course I can re-run python scripts/merge_and_subset.py within megs so that it produces a megs/data/full that doesn't contain paths starting with ... Then I can see if that resolves the problem.

@ynop
Owner

ynop commented Mar 9, 2020

Did you also run the waverize script before exporting to kaldi?

@davidavdav
Author

No, I didn't. The instructions about recreate.sh suggested that it was optional ("if needed").
I did check the consistency; the script found some 20 differences w.r.t. the expected data/state.json, but since there was no indication as to what those differences were, I assumed they were due to the relative paths.

@ynop
Owner

ynop commented Mar 9, 2020

OK, so the consistency check is just for making sure you have the same data.
It is not that important for the rest to work at all.
And it is also fine that you run it with different paths.
But I don't know exactly what you are doing, so I can't reproduce the problem.
I would suggest you run waverize as well; having only wav files will help with the kaldi export.

@davidavdav
Author

Sure, I understand that it is hard to reproduce the problem from the description, and the databases are quite large, but is there a way to make kaldi_writer.save() a bit more verbose?

I can run the waverize script, I suppose at the cost of some extra disk space.

But generally, I don't really know why it is so hard to produce the kaldi files; the information is in data/full/{files,utterances}.txt, right?
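
(A minimal sketch, not from the thread, assuming audiomate reports progress through Python's standard logging module; if it does, enabling debug logging before the export should surface more detail, and if it doesn't, this has no effect.)

    import logging

    # Assumption: only useful if the library logs via the standard logging module.
    logging.basicConfig(
        level=logging.DEBUG,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    # ... then run the export as usual, e.g. kaldi_writer.save(...)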

@ynop
Owner

ynop commented Mar 9, 2020

The main problem is that all utterance durations are needed for the kaldi files.
It needs to read these from the audio files.
That is why it takes so long.
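
(Illustrative only, not audiomate's actual implementation: for an uncompressed wav file the duration can be read from the header alone, roughly like this, whereas compressed formats such as mp3 or flac may have to be decoded first, which is much slower and presumably why converting with waverize helps.)

    import wave

    def wav_duration_seconds(path):
        # PCM wav: duration follows from frame count and sample rate in the header.
        with wave.open(path, "rb") as f:
            return f.getnframes() / float(f.getframerate())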

@davidavdav
Author

OK, just for the sake of trying, I re-did cd megs; python scripts/merge_and_subset.py data/download data/full, and then cd ..; python local/prepare_data.py megs/data/full/ data, and this still bails out after 261 minutes, after only having produced data/train/wav.scp, without any further indication why.

I understand you need the utterance durations; I think these are already computed in scripts/merge_and_subset.py. The Kaldi approach is to store the durations in utt2dur.

I think I'm going ahead now by making my own script that processes the relevant files in megs/data/full and produces kaldi train/test files. But this sort of defeats the purpose of audiomate, I suppose.
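
(A minimal sketch of what such a hand-rolled export could look like, not audiomate's implementation. It assumes the corpus metadata has already been parsed into (utt_id, rec_id, wav_path, start, end) tuples with the durations known; the actual layout of files.txt and utterances.txt may differ.)

    import os

    def write_kaldi_dir(utterances, out_dir):
        # utterances: iterable of (utt_id, rec_id, wav_path, start, end), times in seconds.
        os.makedirs(out_dir, exist_ok=True)
        recordings = {}
        with open(os.path.join(out_dir, "utt2dur"), "w") as utt2dur, \
             open(os.path.join(out_dir, "segments"), "w") as segments:
            for utt_id, rec_id, wav_path, start, end in utterances:
                recordings[rec_id] = wav_path
                utt2dur.write(f"{utt_id} {end - start:.3f}\n")
                segments.write(f"{utt_id} {rec_id} {start:.3f} {end:.3f}\n")
        with open(os.path.join(out_dir, "wav.scp"), "w") as wav_scp:
            for rec_id, wav_path in sorted(recordings.items()):
                wav_scp.write(f"{rec_id} {os.path.abspath(wav_path)}\n")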

@ynop
Owner

ynop commented Mar 11, 2020

You are right about utt2dur, but the durations are also needed for the segments file, since some of the utterances are only segments of the full audio file.

Have you tried with waverize?
Is the wav.scp empty?
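
(For reference, the standard Kaldi layout being discussed, not specific to audiomate; the recording/utterance IDs and the path below are made up. wav.scp has one line per recording, "recording-id wav-path-or-pipe", while segments maps each utterance to a time range within a recording, "utterance-id recording-id start end" in seconds, which is why the offsets and durations have to be known.)

    wav.scp:
        rec001 /data/audio/rec001.wav

    segments:
        rec001-utt01 rec001 0.00 4.35
        rec001-utt02 rec001 4.35 9.80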
