good results #30
Comments
@ggsonic Nice work! If you could share your training time or loss curve, as well as your modified code, it would be appreciated. |
@ggsonic Nice work! Looking forward to your code modification. |
committed! actually it is a simple idea and simple change. |
@ggsonic Can you share the 10.mp3 via another service (e.g. Dropbox)? Btw, your loss is still high; a loss around ~0.001 is most likely to yield good results. |
@basuam Just click Download and then "save as" (right click, or ctrl-s). |
@Spotlight0xff I did right click and saved it, but the default player (Windows Media Player) and even VLC cannot play it. The problem with right click (in my case) is that it does not save the file itself, it saves the metadata related to the file, which is why I cannot play it. I'm using MATLAB to read the file and it kind of worked; I just need the sampling frequency to listen clearly to whatever you have listened to. Without it, playback is too slow or too fast and it sounds like either a demon talking or a chipmunk talking. Btw, that small-loss rule of thumb is not from experience with AUDIO. I skimmed the paper you attached, but it is applied to images and not to sequential data. We cannot extrapolate that information unless someone has tested it for generative models on sequential data. |
That is terrific output. Your output sounds similar to the audio samples at https://google.github.io/tacotron/ without the post-processing network and with the vanilla seq2seq. The buzz is present, but a voice is clearly audible. Great job! I look forward to replicating your output. |
@ggsonic I've trained your latest commit to 200 epochs (seems to be the default?). Here is the trained model: https://ufile.io/yuu7e |
Guys, I've run some training for around 32k global steps using @ggsonic's latest commit. I used a German dataset (pavoque, in case someone is interested) and I got some really cool results: https://soundcloud.com/user-604181615/tacotron-german-pavoque-voice (the words are: "Fühlst du dich etwas unsicher" — "Are you feeling a bit unsure"). I did the training on my GTX 1070 for around 12 hours. I adjusted the sentence max characters according to my dataset, as well as the wave length. I used Zero Masking for the loss calculation. I also observed that @ggsonic's instance normalization without the latest merged PRs gives better results. |
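Zero masking here means excluding padded frames from the loss so that silence padding does not dominate training. A minimal sketch of such a masked loss in TensorFlow 1.x, assuming padded targets of shape [batch, time, dims] and a per-utterance lengths vector (the function name and shapes are illustrative, not taken from this repo):

```python
import tensorflow as tf

def masked_l1_loss(predictions, targets, lengths):
    """L1 loss that ignores padded (zero) frames.

    predictions, targets: [batch, time, dims]; lengths: [batch] valid frame counts.
    """
    diff = tf.reduce_mean(tf.abs(predictions - targets), axis=-1)   # [batch, time]
    mask = tf.sequence_mask(lengths, maxlen=tf.shape(targets)[1],
                            dtype=tf.float32)                        # 1.0 for real frames
    # Average only over valid (non-padded) positions.
    return tf.reduce_sum(diff * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)
```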
@chief7 did you do any dynamic step size stuff? While I don't speak German, I think it sounds really good. Perhaps the result could be even better if we adjusted the step size (as in the paper)? |
No, not yet. I definitely plan to do things like that, but my first goal is (or was) to prove that it's worth spending all that time :D
I'll keep you posted on my progress!
|
@chief7 Can you try with the dataset that is mentioned in the GitHub repo? I would like to know whether the dataset we are using is big enough in comparison to the dataset you have used. Hopefully you can try with the Bible dataset. Thank you for your time. |
@chief7 could you please tell us the size of your corpus? |
Sorry guys, I totally forgot to answer ... weekend. I use a really small corpus. Around 5.3 hours of utterances. It seems to be enough to generate random new sentences. So I guess the bible corpus isn't too small. I'll try to check as soon as possible but my computing resources are limited. |
According to @barronalex's code, the frames should be reshaped to output multiple non-overlapping frames at each time step. Then, in eval.py, these frames are reshaped back to the normal overlapping representation. |
@ggsonic In the latest pull request, I have added this feature. Could you please help check whether it is right? |
@candlewill Not exactly. A simple tf.reshape of the above example data will get …
I think the get_spectrograms function in utils.py should do this trick.
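A sketch, in NumPy, of the reshape being discussed: group r (hp.r) consecutive spectrogram frames into one decoder step, and undo the grouping at synthesis time. The helper names here are hypothetical, not the repo's:

```python
import numpy as np

def reduce_frames(spec, r):
    """Group r consecutive frames so the decoder emits them in a single step.

    spec: [T, D]  ->  [ceil(T / r), D * r]
    """
    T, D = spec.shape
    pad = (r - T % r) % r
    spec = np.pad(spec, [[0, pad], [0, 0]], mode="constant")
    return spec.reshape(-1, D * r)

def expand_frames(reduced, r):
    """Inverse reshape, used at synthesis time (e.g. in eval.py)."""
    D = reduced.shape[1] // r
    return reduced.reshape(-1, D)
```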
@ggsonic Thanks! I guess you're right. I've changed the … |
@ggsonic Have you tested the new commits regarding …? Could you explain more why it's …? In my mind, the latter seems more correct. The paper says "neighboring speech frames are correlated", and therefore these neighboring frames are the ones that must be grouped and predicted together so the attention can move forward faster. Unless … |
@reiinakano In my experiments, the new reduce_frames method makes the training process stable, while the former method always showed a sort of "mode collapse". But the new method might need more global steps to get better results, and I am still waiting. |
@ggsonic Okay, I am also running it right now with default hyperparameters. Do you use batch normalization, or should I just stick with instance normalization? Currently at epoch 94 with batch norm and no voice.. :( Edit: IIRC, mode collapse is a term for GANs. What do you mean by mode collapse in this context? Edit2: My loss curves so far at epoch 94. Any comments? |
I didn't have much luck with the new features introduced during the last commits. I do get the best results with 7ed2f20 by ggsonic and the loss curves are similar to the ones posted by @reiinakano - especially when it comes to numbers. The sample I uploaded last week was sampled from the model while loss was around 1.2... just in case someone's interested. |
@chief7 Thanks for sharing. Perhaps we should rethink the … Update: I am at epoch 115 and still no voice can be heard. |
As far as I understand what the paper says, they're predicting frames … |
@reiinakano It is possible that after certain epochs your learning rate is too high for the model to converge, hence the spiking up, going down, repeat. Here are some of the best results I've obtained after training it for about 2 days (before reduce frames commit with instance normalization). It seems that reduce frames is actually making the convergence take longer. |
@minsangkim142 What is your learning rate? I have reverted to the 7ed2f20 commit by @ggsonic and am at epoch 157. I am using the default learning rate of 0.001. So far, it's been better than with … |
I started with 0.001 then moved it down to 0.0005, 0.0003 and 0.0001 as suggested by the original paper, except that I changed my learning rate every time the loss spiked (suggesting that the model may have jumped out of the local minimum), in which case I reverted the model to before the spike and changed the learning rate. I started hearing some clear voices after 30k global steps with a dataset of 4 hours of utterances, which is about 190 epochs. I also used .npy objects and np.memmap instead of librosa.load, which roughly doubled the global steps per second. |
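For reference, a fixed piecewise-constant schedule along those lines in TensorFlow 1.x; the decay points follow the original paper, whereas the manual approach described above instead drops the rate whenever the loss spikes:

```python
import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
boundaries = [500000, 1000000, 2000000]       # decay steps suggested by the paper
values = [0.001, 0.0005, 0.0003, 0.0001]      # learning rates suggested by the paper
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)
optimizer = tf.train.AdamOptimizer(learning_rate)
```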
@minsangkim142 can you share with us those clear voices? |
@minsangkim142 did you use a more sophisticated save/restore logic or did you go for the one checked in here? Or did you turn the whole process of learning into a supervised process and adjusted everything manually? |
@minsangkim142 In May, I implemented (based on seq2seq) Deep Voice 2's multi-speaker Tacotron, and learned awesome attention/alignments on the VCTK corpus. |
@ggsonic In my opinion, |
@minsangkim142 - On which dataset have you trained? |
Hi, does anyone know if there is a dependency on file size when training? For example, multiple files of around 1h of recordings each vs. smaller files of around 5 min each, etc. |
@TherapyBox those are way too long. Tacotron trains on utterances, so only 10-30 seconds at max (not sure about exact limit for this implementation) |
@minsangkim142 I see the same spiking you describe, this on a 594 utterance training set (Arctic SLT, set A). I'm getting a muffled voice, but the fine detail isn't there, so I'm going to try reducing the learning rate too. I was also interested to hear what you said about converting the sounds to .npy objects and loading with np.memmap instead of librosa.load to speed things up. Have you considered submitting that as a pull request? |
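A sketch of that caching idea: decode each file once, save it as .npy, and memory-map it during training instead of calling librosa.load on every pass. Paths, function names, and the sampling rate below are placeholders:

```python
import numpy as np
import librosa

def cache_audio(wav_path, npy_path, sr=22050):
    """One-time preprocessing: decode the audio and cache it as float32 .npy."""
    wav, _ = librosa.load(wav_path, sr=sr)
    np.save(npy_path, wav.astype(np.float32))

def load_cached(npy_path):
    """Training-time load: memory-map the cached array instead of decoding it."""
    return np.load(npy_path, mmap_mode="r")   # returns a read-only np.memmap
```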
In terms of transcriptions, the format in Bible transcripts is: |
So, everyone agrees that a multi-speaker Tacotron will probably have worse quality (but be more easily trainable, which could offset it), even with the modifications detailed in Deep Voice 2? |
I've had time during the last days to confirm the following: Instance Normalization and Zero Masking are the way to go. I've trained on the pavoque set again and I get clear voices after ~30k global steps. |
@chief7 are the results better than the previous example you posted? |
@chief7 Inspired by the samples you shared (good job!), I'm also trying to work with the pavoque dataset.
Thanks! |
But what about going multi-speaker now, to be able to exploit exponentially more data (and to not be left behind; Tacotron will probably be based on the Deep Voice 2 version)? |
@lifeiteng Hello, very nice job. You mentioned that you implemented (based on seq2seq) Deep Voice 2's multi-speaker Tacotron and learned awesome attention/alignments on the VCTK corpus. I am quite interested in this, and I also want to know:
|
@jarheadfa Nice to see that you're trying the pavoque set! Concerning your questions:
I'll share everything I have as soon as there's something worth sharing. Right now, there's a lot of work you have to do manually. Here's what I did:
I noticed that you may have to adjust hyperparams after some time. Here's my very brittle process for "how to train the pavoque set". Please be aware: this is by no means complete or even certain. It's just my experience. |
@chief7 Thank you for your response!
Thanks again :) |
@jarheadfa I'm on vacation right now - I'll upload it as soon as possible ;) aloha! |
Hi, has anyone tried training the model on two or more voices rather than a single voice? That would open up much bigger datasets, but I have doubts about the voice quality, so if anyone can share their experience, that would be interesting. |
Hi, does anyone know if I can keep adding data while the model is training? I have 5h of audio ready, but will have 10h by the end of next day, so I am thinking of feeding 5h in and then adding 5h more while in the process. Has anyone tried this approach and knows if it works fine? |
Hi, we have about 170 audio files of about 30 seconds each (just a sample set to test whether training works; the full dataset is much bigger). The script finishes running the main program with the print("Done"), but I don't think any training is happening, as I get the error. Any help? |
Another reason to go multi-speaker now: https://twitter.com/throwawayndelet/status/887418621877772291 |
@chief7 it would be interesting to discuss your findings and potential use of them. Please drop me a line at kkolbus@therapy-box.co.uk if you are interested |
Will do so ... but not before the weekend. |
@ggsonic hi, have you tried to feed in every hp.r-th frame to decode_inputs? Thank you for sharing some details. |
@ggsonic I ran your code (the multiple-GPU version) and got a 200-epoch model, but when I run eval.py to test the result, there is a NotFoundError I can't figure out. Can you tell me how to fix it? I really want to see my training result, thanks so much. |
@ggsonic I listened to your sample output file (10.mp3) from the top of this thread/issue earlier this year. At this point I'm no longer sure how representative those original samples from Google have been, vs. the possibility that they've been cherry-picked from a much larger pool of mediocre outputs before being showcased. Or perhaps the internal dataset they used had some auspicious properties to it, absent from the public datasets available for training. I am just wondering whether you have ultimately been able to accomplish results more like those of the original authors in later training? That aside, the Deep Voice 2 article from Baidu claims a significant improvement by changing the architecture a little bit; I wonder whether that change is part of this repo. I hope to be creating a high-quality and perfectly aligned dataset (non-English) for training with this architecture, but I can't seem to find evidence that it will really work to the standard of the Google showcased samples. I wonder if anyone here has much insight into the reproducibility of the Google results yet... |
@jaron and @minsangkim142 : did you find out how to fix the spikes in loss or why they appear? I wonder if it is related to gradients suddenly "exploding" whenever a silence-like spectrogram or outlier sequence, e.g. short sentences or repeated words, appears. |
@rafaelvalle I noticed the same loss spikes, but I'm working on another project at the moment so I haven't had time to investigate. But I think your intuition is correct. The Tacotron paper mentions a 50 ms frame length and a 12.5 ms frame shift to look 4 intervals ahead. But phonemes can be very short. I wonder if the more examples we encounter, the greater the risk we begin to "hear" things that aren't actually part of the set of phonemes in common speech. Which might be why people in the woods sometimes hear "words" when the wind is blowing through the branches... |
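Those frame parameters translate directly into STFT settings; for example, with librosa (the file path, sampling rate, and n_fft below are assumptions, not values from the paper or the repo):

```python
import librosa

wav, sr = librosa.load("utterance.wav", sr=22050)   # placeholder path and rate
win_length = int(0.050 * sr)     # 50 ms frame length
hop_length = int(0.0125 * sr)    # 12.5 ms frame shift
spec = librosa.stft(wav, n_fft=2048, hop_length=hop_length, win_length=win_length)
```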
I'm using the architecture for a different problem and will report here if I find out what's correlated with the spikes! |
This is a run with gradient clipping at 1.0 norm of gradients, LR 0.001. ITER: LOSS SUM(GRADNORM) |
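A minimal sketch of global-norm gradient clipping at 1.0 in TensorFlow 1.x, assuming a loss tensor defined elsewhere; this is one plausible way to set up such a run, not necessarily how it was actually done:

```python
import tensorflow as tf

optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
gradients, variables = zip(*optimizer.compute_gradients(loss))   # loss defined elsewhere
# Rescale all gradients together so their global norm does not exceed 1.0.
clipped, global_norm = tf.clip_by_global_norm(gradients, 1.0)
train_op = optimizer.apply_gradients(zip(clipped, variables))
```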
Continuing the analysis: the standard deviations of the activations of the pre layer and decoder layer of the decoder RNN increase, and keep increasing, considerably when there's a spike in the loss, possibly due to a "bad" input going into the pre layer of the decoder! |
https://github.com/ggsonic/tacotron/blob/master/10.mp3
Based on your code, I can get clear voices, like the one above.
The text is: The seven angels who had the seven trumpets prepared themselves to sound.
You can hear some of the words clearly.
The main changes are about 'batch_norm': I use instance normalization instead, and I think there are problems in the batch_norms.
There may also be something wrong with the hp.r-related data flows, but I don't have time to figure that out for now.
Later this week I will commit my code. Thanks for your great work!
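For context, a minimal sketch of instance normalization over the time axis, assuming [batch, time, channels] inputs; it omits the trainable scale/offset that a full replacement for batch_norm would likely carry:

```python
import tensorflow as tf

def instance_norm(x, epsilon=1e-8):
    """Normalize each example over its own time axis, independently of the batch.

    x: [batch, time, channels]
    """
    mean, variance = tf.nn.moments(x, axes=[1], keep_dims=True)
    return (x - mean) / tf.sqrt(variance + epsilon)
```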