Incorrect training data #15
Hi. The cleanup script also decodes the training set (with a biased language model built from the reference), so that's similar to what you did. The ids from s5_r2 are similar to s5; there is just a suffix added for the different microphones (a, b, c), since s5 didn't use all the available data. Can you send me your diff, and/or would you be able to fix the corpus XMLs directly? Otherwise, since this doesn't seem to be a lot of the total data (1.5%), I'd simply remove the broken data from the next release (v3) of the corpus.
If fixing the XML files needs some helping hands and makes sense, let me know.
Hi svenha!
That would be greatly appreciated! There is already a list of files with
wrong transcriptions in the cleanup folder in the repository. That should
be a good starting point!
I'm currently preparing v3 of the corpus, where I sorted these files out
into a separate folder. I can send you a link tomorrow.
svenha <notifications@github.com> wrote on Mon., 28 May 2018, 16:17:
… If fixing the XML files needs some helping hands and makes sense, let me
know.
Here is our planned v3 package, where hopefully most of the bad utterances have been moved into a separate folder: http://ltdata1.informatik.uni-hamburg.de/kaldi_tuda_de/german-speechdata-package-v3.tar.gz @KlausBuchegger can you upload your diff and decoded output?
Two questions about the file problematic_wavs.txt.
I think the uploads from @KlausBuchegger might be helpful ...
Hello. What speaks for a mixed-up (wrong) assignment is that sometimes I have the impression the text was just shifted: train/2014-03-17-13-03-49_Kinect-Beam.wav should have the same text as train/2014-03-17-13-03-33_Kinect-Beam.wav. The dates are pretty close (they are neighboring lines in my training file).
@fbenites Yes, it would be good to distribute the correction work. Benjamin (bmilde) offered to produce a list of problematic transcripts by running the recognizer on the train set. When this is done, let us partition the manual checking work into two parts.
@fbenites @svenha thanks for offering to help. I'm running the decode on the train set right now and should have the results soon.

@svenha As for the number of problematic utterances in the proposed v3 tar: the numbers differ because I excluded the wav files of all microphones whenever the decode of at least one microphone fails. There are multiple microphone recordings of the same utterance; better safe than sorry. Still, only about 1.5% of all utterances are problematic, but the problematic files tend to cluster in the same recording session(s).

@fbenites Low RNN-CTC performance will probably remain even if we fix all the problematic files. There is probably not enough data for end-to-end RNN training (40h x number of microphones, but that is more like doing augmentation learning on 40h of data). What kind of WERs are you seeing with DeepSpeech at the moment? Are you training with or without a phonetic dictionary? Utterance variance is unfortunately also fairly low; there are only about 3000 distinct sentences in train/dev/test combined, so I suggest using a phoneme dictionary if possible. I also suggest adding German speech data from SWC (https://nats.gitlab.io/swc/); it worked very well in our r2 scripts in this repository (18.39% WER dev / 19.60% WER test now).
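The "better safe than sorry" exclusion rule described above can be sketched as follows. This is a hypothetical helper, assuming utterance IDs end in a microphone suffix such as `_a`, `_b`, `_c`:

```python
# Hypothetical sketch of the exclusion rule: if the decode of any microphone
# recording of an utterance fails, all wav files of that utterance are dropped.
def excluded_recordings(failed_utt_ids):
    """Return the set of base recording IDs (microphone suffix stripped)."""
    return {utt_id.rsplit("_", 1)[0] for utt_id in failed_utt_ids}

def should_exclude(utt_id, bad_recordings):
    """True if any microphone variant of this utterance's recording failed."""
    return utt_id.rsplit("_", 1)[0] in bad_recordings
```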
@bmilde Thanks, I will have a look at the Wikipedia data (SWC).
I uploaded the decode of the tuda train set to: http://speech.tools/decode_tuda_train.tar.gz

This file might be interesting:

exp/chain_cleaned/tdnn1f_1024_sp_bi/decode_tuda_train/scoring_kaldi/wer_details/per_utt

Alternatively, you can also diff, e.g.:

diff exp/chain_cleaned/tdnn1f_1024_sp_bi/decode_tuda_train/scoring_kaldi/penalty_0.5/10.txt exp/chain_cleaned/tdnn1f_1024_sp_bi/decode_tuda_train/scoring_kaldi/test_filt.txt

This doesn't look so pretty in standard diff, though, so a graphical diff tool like the one in Klaus's screenshot is probably a better idea.
Thanks @bmilde. I investigated the per_utt file with a little script that sums the S, I, and D parts of the "csid" fields, counting all of them as errors of equal weight, and then applied an error threshold t. For t=4 there are 2490 affected files; for t=5, 1788 files; for t=6, 1511 files. If I ignore the microphone suffix (_a etc.), there remain 616 files to be checked for t=6. As we have only 2 annotators, I would suggest using t=6. (We can repeat the decoding of the train set with the improved corpus and do a second annotation round.)

If you agree, I can produce the file list, sort it, and cut it in the middle. I would take the first half, @fbenites the second part?

@bmilde: How should we contribute the changes? We could edit the .xml files, collect all changed .xml files, and send them to you. But I am open to other approaches, like a normal pull request (if the repo is not too large for this).

One final point: in per_utt there are file names like 02dae8284f104451a8de85538da6fdec_20140317140355_a. Is it safe to assume that this corresponds 1:1 to an xml file derived from the date/time part, here 2014-03-17-14-03-55.xml?
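A minimal sketch of such a counting script (hypothetical function name), assuming Kaldi's per_utt format where "#csid" lines carry the per-utterance correct/substitution/insertion/deletion counts, and microphone suffixes like _a:

```python
# Sketch: read per_utt lines, sum substitutions + insertions + deletions per
# utterance from "#csid" lines (fields: utt_id #csid correct subs ins dels),
# and keep recordings at or above an error threshold t, ignoring the
# microphone suffix (_a, _b, _c).
def files_to_check(per_utt_lines, t=6):
    worst = {}
    for line in per_utt_lines:
        fields = line.split()
        if len(fields) == 6 and fields[1] == "#csid":
            errors = sum(int(x) for x in fields[3:6])  # subs + ins + dels
            base = fields[0].rsplit("_", 1)[0]         # drop microphone suffix
            worst[base] = max(worst.get(base, 0), errors)
    return sorted(b for b, e in worst.items() if e >= t)
```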
Many thanks @svenha! Yes, it is safe to assume that 02dae8284f104451a8de85538da6fdec_20140317140355_a belongs to 2014-03-17-14-03-55.xml: 02dae8284f104451a8de85538da6fdec is the speaker hash, the last letter indicates the microphone, and in between the ID contains the timestamp without the dashes. Note that it is also safe to assume that _a, _b, _c etc. all belong to the same utterance and should contain the same transcription. If all of them decode to something else, it's safe to assume an incorrect transcription. Maybe it's also a good idea to make the threshold t dependent on the length of the transcription; there are some very short utterances, too. But I can also always rerun the decoding for you after getting corrected xml files.

Since the corpus is not hosted on GitHub, it's probably easier if you send me the corrected xml files directly. But you can also send me a pull request containing only the corrected xml files, placed somewhere in a subfolder of the local directory. I can then also write a script that checks the corpus files and patches them if needed, so that it is not necessary to redownload the whole tar.gz file.

One final point: note that the xml files have sentence IDs. E.g. sentence_id 59 in:
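The ID-to-filename mapping described here can be expressed as a small helper (hypothetical name), given the layout <speaker-hash>_<YYYYMMDDHHMMSS>_<microphone-letter>:

```python
# Sketch: derive the xml filename from the timestamp part of an utterance ID.
def utt_id_to_xml(utt_id):
    _speaker, ts, _mic = utt_id.split("_")
    # re-insert the dashes: YYYY-MM-DD-HH-MM-SS
    parts = [ts[0:4], ts[4:6], ts[6:8], ts[8:10], ts[10:12], ts[12:14]]
    return "-".join(parts) + ".xml"
```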
A text file with all of the sentence IDs and transcriptions is in the root of the corpus archive. Since there are multiple recordings per sentence ID, it is very unlikely that the correct sentence is not included. Maybe we can also just try to find the closest match automatically and check manually that it is correct?
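The proposed automatic matching could be sketched with Python's difflib, for example (hypothetical helper name; the candidate list would come from the sentence-ID text file mentioned above):

```python
import difflib

# Sketch: find the corpus sentence closest to a decoded hypothesis, as a
# candidate for the correct transcription, to be verified manually.
def closest_sentence(hypothesis, corpus_sentences):
    matches = difflib.get_close_matches(hypothesis, corpus_sentences,
                                        n=1, cutoff=0.0)
    return matches[0] if matches else None
```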
I switched from an absolute threshold to a relative threshold, as suggested by @bmilde: number_of_errors / number_of_words >= 0.2
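As a sketch, the relative criterion is simply (hypothetical function name):

```python
# Flag an utterance if at least 20% of its reference words are wrong.
def is_problematic(num_errors, num_words, threshold=0.2):
    return num_words > 0 and num_errors / num_words >= threshold
```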
tuda-20perc-err.files1.txt is finished; my manual correction speed was around 2 minutes per recording. I will send the files to @bmilde. Any volunteers for tuda-20perc-err.files2.txt? (If not, I might have time in August.)
Just to avoid duplicate work, I'd like to let you know that I am working on files2.
The second half (i.e. files2) was finished some weeks ago. Is there a rough estimate of the release date of tuda v3?
Hi, sorry for the radio silence; I was caught up in other projects. Thanks again!
@silenterus: It would be nice to keep the discussion on-topic. This bug is about mix-ups and errors in the data files. Feel free to open a new bug about training DeepSpeech with our data. @fbenites: speech.tools is hosted by @bmilde afaik; maybe he can fix that.
Sorry, you are absolutely right.
Are there any plans to integrate all the corrections from 2018 or later into a new corpus version?
I created a repository with the whole ~20 GB dataset here: https://github.com/Alienmaster/TudaDataset
@Alienmaster Thanks for picking this up and the clever issue template in the new TudaDataset repo. If you have any new evaluation results, please let us know :-)
Closing this; release 4 of the tuda dataset contains the fixes: http://ltdata1.informatik.uni-hamburg.de/kaldi_tuda_de/german-speechdata-package-v4.tar.gz Our CV7 branch already uses this to train the models, together with the newest Common Voice data. See https://github.com/uhh-lt/kaldi-tuda-de/tree/CV7
Hi,
so I trained Kaldi using your (old) s5 script, and as a sanity check I tried to decode the training data.
When I compared the text file from the training data to my results, I noticed that there seem to be quite a number of errors in the texts.
I checked the audio and xml files and saw that the sentences were wrong.
I added a screenshot of a partial vimdiff of the text and my results.
This appears to be a mix-up, as those sentences do exist, but in other audio files.