
good results #30

Open
ggsonic opened this issue Jun 8, 2017 · 68 comments

Comments

@ggsonic

ggsonic commented Jun 8, 2017

https://github.com/ggsonic/tacotron/blob/master/10.mp3
Based on your code, I can get clear voices, like the one above.
The text is: "The seven angels who had the seven trumpets prepared themselves to sound."
You can hear some of the words clearly.
The main change is to 'batch_norm': I use instance normalization instead, and I think there are problems in the batch norm code. There may also be something wrong with the hp.r-related data flow, but I don't have time to figure it out for now.
Later this week I will commit my code. Thanks for your great work!
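For reference, instance normalization over a [N, T, C] tensor normalizes each sample and channel across the time axis. A minimal TF 1.x sketch of the idea (illustrative only; the function name and defaults are assumptions, not necessarily ggsonic's committed change):

```python
import tensorflow as tf  # TF 1.x, as used by this repo at the time

def instance_norm(inputs, epsilon=1e-8, scope="instance_norm"):
    # inputs: [N, T, C]; normalize each sample and channel over the time axis.
    with tf.variable_scope(scope):
        mean, variance = tf.nn.moments(inputs, axes=[1], keep_dims=True)  # [N, 1, C]
        channels = inputs.get_shape().as_list()[-1]
        beta = tf.get_variable("beta", [channels], initializer=tf.zeros_initializer())
        gamma = tf.get_variable("gamma", [channels], initializer=tf.ones_initializer())
        return gamma * (inputs - mean) / tf.sqrt(variance + epsilon) + beta
```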

@Kyubyong
Owner

Kyubyong commented Jun 8, 2017

@ggsonic Nice work! If you could share your training time or loss curve as well as your modified code, it would be appreciated.
Plus, instance normalization instead of batch normalization... interesting. Is anyone willing to review my normalization code in modules.py? Basically I use tf.contrib.layers.batch_norm. Many people complain that the performance of the batch normalization op in TF is poor, so for that reason a fused version is officially recommended. But fused batch normalization doesn't work on 3-D tensors, so I reshape the 3-D input tensor to 4-D before applying the fused batch norm and then restore its shape.
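The reshape-around-fused-batch-norm trick described above could look roughly like the sketch below (illustrative only, not the repo's exact normalize code; the function name and defaults are assumptions):

```python
import tensorflow as tf  # TF 1.x

def fused_bn_3d(inputs, is_training=True, scope="bn"):
    # Fused batch norm expects a 4-D tensor, so expand a [N, T, C] input to
    # [N, T, 1, C], normalize, and squeeze back to [N, T, C].
    x = tf.expand_dims(inputs, axis=2)
    x = tf.contrib.layers.batch_norm(x,
                                     center=True,
                                     scale=True,
                                     is_training=is_training,
                                     fused=True,
                                     scope=scope)
    return tf.squeeze(x, axis=2)
```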

@ggsonic
Author

ggsonic commented Jun 9, 2017

This is my TensorBoard graph. I used your provided dataset and your default hyperparams: batch size 32, lr 0.001, 200 epochs, trained for 8 hours on a single GPU card to get the above result. After 70 epochs you can hear some words, but it seems that after 110 epochs the learning rate should be lowered; I will test that this weekend.
[image: TensorBoard graph]

@candlewill
Contributor

@ggsonic Nice work! Looking forward to your code modification.

@ggsonic
Author

ggsonic commented Jun 9, 2017

Committed! Actually it is a simple idea and a simple change.

@basuam

basuam commented Jun 9, 2017

@ggsonic Can you share 10.mp3 via another service (e.g. Dropbox)?
I wasn't able to listen to your file; it never downloaded.

Btw, your loss is still high; a loss around ~0.001 is more likely to yield good results.

@Spotlight0xff
Contributor

@basuam Just click Download and then "Save as" (right-click, or Ctrl-S).
Also, the loss doesn't have to be low to yield good results; it is not necessarily directly related to perceptual quality. See e.g. this paper.

@basuam

basuam commented Jun 9, 2017

@Spotlight0xff I did right-click and save, but the default player (Windows Media Player) and even VLC cannot play it. The problem with right-click (in my case) is that it saves the metadata related to the file rather than the file itself, which is why I cannot play it. I'm using MATLAB to read the file and it kind of worked; I just need the sample rate to listen properly to whatever you have listened to. Without it, playback is too slow or too fast and it sounds like either a demon or a chipmunk talking.

Btw, the small-loss remark comes from experience working with audio. I glanced at the paper you attached, but it applies to images, not to sequential data. We cannot extrapolate that information unless someone has tested it for generative models on sequential data.

@dbiglari

dbiglari commented Jun 9, 2017

That is terrific output. It sounds similar to the audio samples at https://google.github.io/tacotron/ generated without the post-processing network, i.e. the vanilla seq2seq. The buzz is present, but a voice is clearly audible. Great job! I look forward to replicating your output.

@greigs

greigs commented Jun 11, 2017

@ggsonic I've trained your latest commit to 200 epochs (seems to be the default?). Here is the trained model: https://ufile.io/yuu7e
And here are the samples generated by default when running eval.py on the latest epoch: https://ufile.io/41djx

@chief7

chief7 commented Jun 11, 2017

Guys, I've run training for around 32k global steps using @ggsonic's latest commit. I used a German dataset (pavoque, in case someone is interested) and got some really cool results:

https://soundcloud.com/user-604181615/tacotron-german-pavoque-voice (the words are "Fühlst du dich etwas unsicher", i.e. "Do you feel a bit unsure?").

I did the training on my GTX 1070 for around 12 hours.

I adjusted the max sentence characters and the max wave length according to my dataset, and I used zero masking for the loss calculation.

I also observed that @ggsonic's instance normalization without the latest merged PRs gives better results.

@DarkDefender

DarkDefender commented Jun 11, 2017

@chief7 Did you do any dynamic step-size (learning-rate) adjustment?
While I don't speak German, I think it sounds really good. Perhaps it could be even better if we adjusted the step size (as in the paper)?

@chief7

chief7 commented Jun 11, 2017 via email

@basuam

basuam commented Jun 11, 2017

@chief7 Can you try with the dataset mentioned in this GitHub repo?
I'm really impressed that "Fühlst du dich etwas unsicher" is so clear; a little robotic, but still so clear.
I'm pretty sure it can be improved; it looks like you have found a way.

I would like to know whether the dataset we are using is too small compared to the one you used. Hopefully you can try with the Bible dataset. Thank you for your time.

@ghost

ghost commented Jun 13, 2017

@chief7 could you please tell us the size of your corpus?

@chief7

chief7 commented Jun 13, 2017

Sorry guys, I totally forgot to answer ... weekend.

I use a really small corpus, around 5.3 hours of utterances. It seems to be enough to generate new, arbitrary sentences.

So I guess the Bible corpus isn't too small. I'll try to check as soon as possible, but my computing resources are limited.

@ggsonic
Author

ggsonic commented Jun 14, 2017

According to @barronalex's code, the frames should be reshaped so that multiple non-overlapping frames are output at each time step; then, in eval.py, these frames are reshaped back to the normal overlapping representation.
An example data flow is shown below. This seems to be the correct hp.r-related data flow as described in the paper. I look forward to better results after doing so.
[[ 1 1 1 1 1 1 1]
 [ 2 2 2 2 2 2 2]
 [ 3 3 3 3 3 3 3]
 [ 4 4 4 4 4 4 4]
 [ 5 5 5 5 5 5 5]
 [ 6 6 6 6 6 6 6]
 [ 7 7 7 7 7 7 7]
 [ 8 8 8 8 8 8 8]
 [ 9 9 9 9 9 9 9]
 [10 10 10 10 10 10 10]
 [11 11 11 11 11 11 11]
 [12 12 12 12 12 12 12]
 [13 13 13 13 13 13 13]
 [14 14 14 14 14 14 14]
 [15 15 15 15 15 15 15]
 [16 16 16 16 16 16 16]
 [17 17 17 17 17 17 17]
 [18 18 18 18 18 18 18]
 [19 19 19 19 19 19 19]
 [20 20 20 20 20 20 20]
 [21 21 21 21 21 21 21]
 [22 22 22 22 22 22 22]
 [23 23 23 23 23 23 23]
 [24 24 24 24 24 24 24]
 [25 25 25 25 25 25 25]
 [26 26 26 26 26 26 26]
 [27 27 27 27 27 27 27]
 [28 28 28 28 28 28 28]
 [29 29 29 29 29 29 29]
 [30 30 30 30 30 30 30]
 [31 31 31 31 31 31 31]
 [32 32 32 32 32 32 32]
 [33 33 33 33 33 33 33]
 [34 34 34 34 34 34 34]
 [35 35 35 35 35 35 35]
 [36 36 36 36 36 36 36]]

is reshaped to

[[ 1 1 1 1 1 1 1 5 5 5 5 5 5 5 9 9 9 9 9 9 9 13 13 13 13 13 13 13 17 17 17 17 17 17 17]
 [ 2 2 2 2 2 2 2 6 6 6 6 6 6 6 10 10 10 10 10 10 10 14 14 14 14 14 14 14 18 18 18 18 18 18 18]
 [ 3 3 3 3 3 3 3 7 7 7 7 7 7 7 11 11 11 11 11 11 11 15 15 15 15 15 15 15 19 19 19 19 19 19 19]
 [ 4 4 4 4 4 4 4 8 8 8 8 8 8 8 12 12 12 12 12 12 12 16 16 16 16 16 16 16 20 20 20 20 20 20 20]
 [21 21 21 21 21 21 21 25 25 25 25 25 25 25 29 29 29 29 29 29 29 33 33 33 33 33 33 33 0 0 0 0 0 0 0]
 [22 22 22 22 22 22 22 26 26 26 26 26 26 26 30 30 30 30 30 30 30 34 34 34 34 34 34 34 0 0 0 0 0 0 0]
 [23 23 23 23 23 23 23 27 27 27 27 27 27 27 31 31 31 31 31 31 31 35 35 35 35 35 35 35 0 0 0 0 0 0 0]
 [24 24 24 24 24 24 24 28 28 28 28 28 28 28 32 32 32 32 32 32 32 36 36 36 36 36 36 36 0 0 0 0 0 0 0]]
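A toy NumPy reconstruction that reproduces the grouping in the printed example (the constants are read off the example itself: r = 5 reduced frames per decoder step, and frames 4 apart are treated as non-overlapping because the frame shift is a quarter of the frame length; this is a sketch, not the repo's exact reduce_frames code):

```python
import numpy as np

def group_nonoverlapping(frames, r=5, k=4):
    # frames: [T, C] spectrogram frames; each output row holds r frames spaced k apart.
    T, C = frames.shape
    chunk = r * k                                    # 20 frames per block in the example
    pad = (-T) % chunk
    frames = np.pad(frames, [(0, pad), (0, 0)], mode="constant")
    return (frames.reshape(-1, r, k, C)              # [blocks, r, k, C]
                  .transpose(0, 2, 1, 3)             # [blocks, k, r, C]
                  .reshape(-1, r * C))               # each row: r non-overlapping frames

frames = np.repeat(np.arange(1, 37)[:, None], 7, axis=1)  # the 36 x 7 toy input above
print(group_nonoverlapping(frames))                       # reproduces the reshaped matrix
```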

@candlewill
Contributor

@ggsonic In the latest pull request, I have added this feature. Could you please help check whether it is right?

@ggsonic
Author

ggsonic commented Jun 14, 2017

@candlewill Not exactly. A simple tf.reshape applied to the example data above gives [[ 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 5 5] [ 6 6 6 6 6 6 6 ......], but we need the non-overlapping frames [[ 1 1 1 1 1 1 1 5 5 5 5 5 5 5 9 9 9 9 9 9 9 13 13 13 13 13 13 13 17 17 17 17 17 17 17] [ 2 2 2 2 2 2 2 ......].
This does the trick the paper describes:

This is likely because neighboring speech frames are correlated and each character usually corresponds to
multiple frames. Emitting one frame at a time forces the model to attend to the same input token for
multiple timesteps; emitting multiple frames allows the attention to move forward early in training.

I think the get_spectrograms function in utils.py should implement this trick.
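For comparison, a plain reshape of the same padded toy array groups adjacent frames, which is the first pattern above rather than the non-overlapping one (a small NumPy illustration, continuing the earlier sketch; the constants are assumptions read off the example):

```python
import numpy as np

frames = np.repeat(np.arange(1, 37)[:, None], 7, axis=1)    # the 36 x 7 toy input
frames = np.pad(frames, [(0, 4), (0, 0)], mode="constant")  # pad 36 -> 40 rows

r, C = 5, 7
naive = frames.reshape(-1, r * C)   # row 0 holds frames 1..5 (adjacent, overlapping)
print(naive[0])                     # the non-overlapping grouping would put 1, 5, 9, 13, 17 here
```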

@Kyubyong
Owner

@ggsonic Thanks! I guess you're right. I've changed reduce_frames and adjusted the other relevant parts.

@reiinakano

reiinakano commented Jun 17, 2017

@ggsonic Have you tested the new commits regarding reduce_frames and seen if it works better?

Could you explain more why it's [[ 1 1 1 1 1 1 1 5 5 5 5 5 5 5 9 9 9 9 9 9 9 13 13 13 13 13 13 13 17 17 17 17 17 17 17] [ 2 2 2 2 2 2 2 ...... and not [[ 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 5 5] [ 6 6 6 6 6 6 6 ......]?

In my mind, the latter seems more correct. The paper said "neighboring speech frames are correlated" and therefore, these neighboring frames are the ones that must be grouped and predicted together so the attention can move forward faster. Unless [1 1 1 1 1 1 1] and [5 5 5 5 5 5 5] are considered "neighboring frames" while [1 1 1 1 1 1 1] and [2 2 2 2 2 2 2] are not, the first reshaping (the one currently committed) does not make much sense to me.

@ggsonic
Author

ggsonic commented Jun 19, 2017

@reiinakano In my experiments, the new reduce_frames method makes the training process stable, while the former method always showed something like "mode collapse". But the new method might need more global steps to get good results, and I am still waiting.

@reiinakano

reiinakano commented Jun 19, 2017

@ggsonic Okay, I am also running it right now with default hyperparameters. Do you use batch normalization, or should I just stick with instance normalization? Currently at epoch 94 with batch norm and no voice.. :(

Edit: IIRC, mode collapse is a GAN term. What do you mean by mode collapse in this context?

Edit 2: My loss curves so far, at epoch 94. Any comments?

[image: loss-curve screenshot, 2017-06-19]

@chief7

chief7 commented Jun 19, 2017

I didn't have much luck with the new features introduced in the last commits. I get the best results with 7ed2f20 by ggsonic, and my loss curves are similar to the ones posted by @reiinakano, especially in terms of the numbers. The sample I uploaded last week was generated from the model while the loss was around 1.2, just in case someone's interested.

@reiinakano

@chief7 Thanks for sharing. Perhaps we should rethink the reduce_frames method? What is your opinion on the "neighboring speech frames" discussion?

Update: I am at epoch 115 and still no voice can be heard.

@chief7

chief7 commented Jun 19, 2017

As far as I understand the paper, they predict frames 1..r at once. If that's correct, then the current state of the reduce_frames method is not correct. I'm still digging into this, though ...

@ghost

ghost commented Jun 19, 2017

@reiinakano It is possible that after a certain number of epochs your learning rate is too high for the model to converge, hence the repeated spiking up and coming back down.

Here are some of the best results I've obtained after training for about 2 days (before the reduce_frames commit, with instance normalization). It seems that reduce_frames actually makes convergence take longer.
Here is the script:
2: Talib Kweli confirmed to All Hip Hop that he will be releasing an album in the next year.
8: The quick brown fox jumps over the lazy dog
11: Basilar membrane and otolaryngology are not auto correlations.
12: would you like to know more about something.
17: Generative adversarial network or variational auto encoder.
19: New Zealand is an island nation in the south western pacific ocean.
31: New Zealand's capital city is wellington while its most populous city is auckland.
https://www.dropbox.com/s/o1yhsaew8h2lsix/examples.zip?dl=0

@reiinakano

reiinakano commented Jun 20, 2017

@minsangkim142 What is your learning rate? I have reverted to the 7ed2f20 commit by @ggsonic and am at epoch 157, using the default learning rate of 0.001. So far it's been better than with reduce_frames (I can hear a very robotic voice), but I'm not really hearing actual words yet. The loss curve also looks much steadier now, gradually going down.

@ghost

ghost commented Jun 20, 2017

I started with 0.001 then moved it down to 0.0005, 0.0003 and 0.0001 as suggested by the original paper, except I changed my learning rate every time the model spiked (suggesting that it may have jumped out of the local minimum) in which case I reverted the model before it spiked and changed the learning rate. I started hearing some clear voices after 30k global timesteps with dataset size of 4 hrs of utterance which is about 190 epochs.
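The paper's schedule (0.001, then 0.0005 / 0.0003 / 0.0001 after 500k / 1M / 2M global steps) could be wired up as a piecewise-constant schedule like the sketch below; note that the manual, revert-on-spike procedure described above is not the same thing, this is just an automated variant:

```python
import tensorflow as tf  # TF 1.x

global_step = tf.train.get_or_create_global_step()
boundaries = [500000, 1000000, 2000000]          # step counts from the Tacotron paper
values = [0.001, 0.0005, 0.0003, 0.0001]         # learning rates from the Tacotron paper
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)
optimizer = tf.train.AdamOptimizer(learning_rate)
```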

Also, I used .npy files and np.memmap instead of librosa.load, which roughly doubled the global steps per second.
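A minimal sketch of that kind of caching (file layout, function names, and the sample rate are illustrative, not @minsangkim142's actual code):

```python
import numpy as np
import librosa

def cache_wav(wav_path, npy_path, sr=22050):
    # One-time preprocessing: decode each audio file once and cache the samples as .npy.
    y, _ = librosa.load(wav_path, sr=sr)
    np.save(npy_path, y.astype(np.float32))

def load_cached(npy_path):
    # At training time: memory-map the cached array instead of decoding the audio again.
    return np.load(npy_path, mmap_mode="r")
```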

@basuam

basuam commented Jun 20, 2017

@minsangkim142 can you share with us those clear voices?
Thank you for your help and knowledge.

@chief7

chief7 commented Jun 20, 2017

@minsangkim142 Did you use more sophisticated save/restore logic, or did you go with the one checked in here? Or did you turn the whole training process into a supervised one and adjust everything manually?

@lifeiteng

lifeiteng commented Jun 21, 2017

@minsangkim142 In May I implemented Deep Voice 2's multi-speaker Tacotron (based on seq2seq) and learned awesome attention/alignments on the VCTK corpus.
[image: 00077_spk0015_p241_154_attention (attention alignment)]
[image: 00077_spk0015_p241_154_linear_spectrums (linear spectrogram)]
Reduction factor = 4 (older style), but the audio quality is not so good (I'm busy with other things nowadays).

@lifeiteng

@ggsonic In my opinion, 10_reduce.mp3 isn't as good as 10.mp3.

@jacksmithe

@minsangkim142 - On which dataset have you trained?

@TherapyBox

Hi, does anyone know whether there is a dependency on file size when training? For example, multiple files of around 1 hour of recordings each vs. smaller files of around 5 minutes each, etc.

@reiinakano

@TherapyBox those are way too long. Tacotron trains on utterances, so only 10-30 seconds at max (not sure about exact limit for this implementation)

@jaron

jaron commented Jun 22, 2017

@minsangkim142 I see the same spiking you describe, in my case on a 594-utterance training set (Arctic SLT, set A). I'm getting a muffled voice, but the fine detail isn't there, so I'm going to try reducing the learning rate too.

[image: loss-curve screenshot, 2017-06-22]

I was also interested to hear what you said about converting the sounds to .npy objects and loading with np.memmap instead of librosa.load to speed things up. Have you considered submitting that as a pull request?

@TherapyBox

In terms of transcriptions, the format in the Bible transcripts is:
Genesis/Genesis_1-1,In the beginning God created the heavens and the earth.,4.9
i.e. location / text / length of the audio file.
Is the length of the audio file necessary? I am trying to run Tacotron on my own dataset and I am trying to understand the best format for the transcriptions.

@GunpowderGuy

So, does everyone agree that a multi-speaker Tacotron will probably have worse quality (but be more easily trainable, which could offset that), even with the modifications detailed in Deep Voice 2?

@chief7

chief7 commented Jun 27, 2017

Over the last few days I've had time to confirm the following: instance normalization and zero masking are the way to go. I've trained on the pavoque set again and I get clear voices after ~30k global steps.

@DarkDefender

@chief7 are the results better than the previous example you posted?

@jarheadfa

@chief7 Inspired by the samples you shared (good job!), I'm also trying to work with the pavoque dataset.
Since my samples are not as good as the ones you shared, I wanted to ask some questions:

  1. I'm using the https://github.com/marytts/pavoque-data/releases dataset.
    I noticed that it has a few modes (happy, angry, etc.), 10 hours in total, and you mentioned
    you are using 5.3 hours. Which modes have you chosen?
  2. The current code version removes all non-English characters. Did you do any text
    normalization? Can you share it?
  3. What values did you set for the max wav length and max text length?
  4. Did you do any preprocessing on the wav files?
  5. You are sampling when the loss is about 1.2 - are you referring to self.mean_loss = self.mean_loss1 + self.mean_loss2 ?
  6. Can you share your code modifications?

Thanks!

@GunpowderGuy

GunpowderGuy commented Jun 28, 2017

But what about going multi-speaker now, to be able to exploit far more data (and not be left behind; Tacotron will probably end up based on the Deep Voice 2 version)?

@GuangChen2016

@lifeiteng Hello, very nice job. You mentioned that you implemented Deep Voice 2's multi-speaker Tacotron (based on seq2seq) and learned awesome attention/alignments on the VCTK corpus. I am quite interested in this, and I would also like to know:

  1. Did you implement multi-speaker Tacotron based on this repo and just replace the seq2seq module? Did you implement the single-speaker Tacotron as well?
  2. Could you provide some synthesized voices (multi-speaker and single-speaker)?
  3. Any other suggestions or tips to improve the results?
    Thank you very much.

@chief7

chief7 commented Jun 29, 2017

@jarheadfa Nice to see that you're trying on the pavoque set!

Concerning your questions:

  1. I used only the neutral voices, as every other style has different emphasis and the like.
  2. Yeah, I did some normalization. I replaced all the German umlauts, e.g. ö with oe, etc. I have a very crappy script that does so and will upload it as soon as possible (see the sketch after this list).
  3. In the current revision the max_wave_length is gone. Previously, I had it set to 16.8. max_len is set to 247.
  4. I did no preprocessing.
  5. I'm talking about the combination of the two. Though since the last commits, the loss is typically around 0.02 to 0.04.
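A minimal sketch of that kind of umlaut normalization (illustrative only, not chief7's actual script):

```python
# Map German umlauts and ß to ASCII digraphs so the existing character set covers them.
UMLAUT_MAP = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}

def normalize_german(text):
    for src, dst in UMLAUT_MAP.items():
        text = text.replace(src, dst)
    return text

print(normalize_german("Fühlst du dich etwas unsicher"))  # -> Fuehlst du dich etwas unsicher
```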

I'll share everything I have as soon as there's something worth sharing. Right now, there's a lot of work you have to do manually.

Here's what I did:

  • tweaked data_load.py to store the features and text in RAM to prevent it from reading them over and over again
  • did some text normalization
  • tweaked some stuff in the code (tried RandomShuffleQueue etc.)

I noticed that you may have to adjust hyperparams after some time. Here's my very brittle process for "how to train the pavoque set". Please be aware: this is by no means complete or even certain; it's just my experience.
Start training with no zero masking and lr = 0.001 until you are able to hear words plus random noise in the samples. Then adjust to lr = 0.0005 until the words become clearer. After some more time I enable zero masking, as it seems to remove the random noise in the samples.
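Zero masking here means excluding padded frames from the loss. A rough TF 1.x sketch of the idea (tensor names and the masking rule are illustrative, not the repo's exact loss code):

```python
import tensorflow as tf

def masked_l1_loss(y_hat, y):
    # y_hat, y: [N, T_out, C] predicted and zero-padded target spectrogram frames.
    # A frame counts as "real" if any of its target features is non-zero.
    mask = tf.to_float(tf.reduce_any(tf.not_equal(y, 0.0), axis=-1, keep_dims=True))  # [N, T_out, 1]
    diff = tf.abs(y_hat - y) * mask
    num_real = tf.reduce_sum(mask) * tf.to_float(tf.shape(y)[-1])
    return tf.reduce_sum(diff) / tf.maximum(num_real, 1.0)
```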

@jarheadfa

@chief7 Thank you for your response!
A couple of things:

  1. Which commit are your recommendations based on?
  2. Other than replacing ö with oe, did you also replace äöüß or did you remove them?
  3. You mention random shuffling - did that help?
  4. What was the loss when you had the best samples?
  5. At which global step (give or take) did you start hearing words and adjust
    the learning rate? Did you make any other learning-rate modifications apart from lowering it to 0.0005?
  6. You said you started using zero masking; at which global step was that?

Thanks again :)

@chief7

chief7 commented Jun 30, 2017

@jarheadfa I'm on vacation right now - I'll upload it as soon as possible ;) aloha!

@TherapyBox

Hi, has anyone tried training the model on two or more voices rather than a single voice? That would open up much bigger datasets, but I have doubts about the voice quality, so if anyone can share their experience, that would be interesting.

@TherapyBox

Hi, does anyone know whether I can keep adding data while the model is training? I have 5 h of audio ready but will have 10 h by the end of tomorrow, so I am thinking of feeding in 5 h and then adding 5 h more during training. Has anyone tried this approach and knows whether it works?

@TherapyBox

Hi,
for some reason this bit of code is not running on our dataset:
"for step in tqdm(range(g.num_batch), total=g.num_batch, ncols=70, leave=False, unit='b'):
    sess.run(g.train_op)"

We have about 170 audio files of about 30 seconds each (just a sample set to test whether training works; the full dataset is much bigger). The script finishes running the main program with the print("Done"), but I don't think any training is happening, as I get the error. Any help?

@GunpowderGuy

Another reason to go multi-speaker now: https://twitter.com/throwawayndelet/status/887418621877772291

@TherapyBox

@chief7 it would be interesting to discuss your findings and potential use of them. Please drop me a line at kkolbus@therapy-box.co.uk if you are interested

@chief7

chief7 commented Jul 21, 2017

Will do so ... but not before the weekend.

@jpdz

jpdz commented Aug 2, 2017

@ggsonic Hi, have you tried feeding in every hp.r-th frame to decode_inputs? Thank you for sharing some details.
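For context, this refers to the paper's trick of feeding only the last frame of each group of r back into the decoder during training. A rough NumPy sketch of building such inputs (array and function names are illustrative, not this repo's code):

```python
import numpy as np

def make_decoder_inputs(mel, r=5):
    # mel: [N, T, n_mels] ground-truth frames, with T a multiple of r.
    every_rth = mel[:, r - 1::r, :]                   # last frame of each group of r
    go_frame = np.zeros_like(every_rth[:, :1, :])     # <GO> frame for the first step
    return np.concatenate([go_frame, every_rth[:, :-1, :]], axis=1)  # shift right one step
```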

@jackchinor

@ggsonic I ran the multi-GPU version of your code and got a 200-epoch model, but when I run eval.py to test the result there is a NotFoundError I can't figure out. Can you tell me how to fix it? I really want to see my training result, thanks so much.

@matanox

matanox commented Sep 5, 2017

@ggsonic I listened to your sample output file from the top of this thread/issue earlier this year (10.mp3), and it sounds nothing like the sleek audio outputs showcased alongside the original Tacotron article (the voice is metallic, and there's a long stretch of noise appended to the intelligible part of the speech).

At this point I'm no longer sure how representative those original samples from Google have been, versus the possibility that they were cherry-picked from a much larger pool of mediocre outputs before being showcased. Or perhaps the internal dataset they used had some auspicious properties absent from the public datasets available for training.

I am just wondering whether you have ultimately been able to accomplish results more like those of the original authors in later training? That aside, the Deep Voice 2 article from Baidu claims a significant improvement from changing the architecture a little; I wonder whether that change is part of this repo.

I hope to create a high-quality, perfectly aligned dataset (non-English) for training with this architecture, but I'm failing to find evidence that it will really work to the standard of the Google showcase samples. I wonder whether anyone here has much insight into the reproducibility of the Google results yet...

@rafaelvalle

rafaelvalle commented Nov 17, 2017

@jaron and @minsangkim142 : did you find out how to fix the spikes in loss or why they appear? I wonder if it is related to gradients suddenly "exploding" whenever a silence-like spectrogram or outlier sequence, e.g. short sentences or repeated words, appears.

@jaron

jaron commented Nov 17, 2017

@rafaelvalle I noticed the same loss spikes, but I'm working on another project at the moment so haven't had time to investigate.

But I think your intuition is correct. The Tacotron paper mentions a 50 ms frame length and a 12.5 ms frame shift, i.e. each frame looks 4 shift intervals ahead. But phonemes can be very short. I wonder if the more examples we encounter, the greater the risk that we begin to "hear" things that aren't actually part of the set of phonemes in common speech. Which might be why people in the woods sometimes hear "words" when the wind is blowing through the branches...

@rafaelvalle

I'm using the architecture for a different problem and will report here if I find out what's correlated with the spikes!

@rafaelvalle

rafaelvalle commented Nov 18, 2017

This is a run with gradient clipping at a global gradient norm of 1.0 and LR 0.001.
My gradients are vanishing, not exploding. I'll look into the data to see what's producing this.

ITER: LOSS SUM(GRADNORM)
1885: 1.586263299 4.938561218
1886: 1.683523536 5.001653109
1887: 1.629835010 2.163018592
1887: 1.629835010 2.163018592
1888: 1.616167665 4.112304520
1889: 2.122112274 0.000000000
1890: 2.226523161 3.525334871
1891: 2.122971773 4.609711523
1892: 2.150014400 0.000000000
1893: 2.201522827 0.000000000
1894: 2.185153008 4.181719988
1895: 2.158463240 0.000000000
1896: 2.221507549 0.000000000
1897: 2.052042246 0.000000000
1898: 2.191329002 0.000000000
1899: 2.194499731 nan
1900: nan nan
1901: nan nan
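For reference, a typical TF 1.x way to clip by global norm and monitor the pre-clip norm, which is roughly what the log above tracks (a sketch of the general technique, not the exact code used here; `loss` is whatever loss tensor the model defines):

```python
import tensorflow as tf  # TF 1.x

def clipped_train_op(loss, learning_rate=1e-3, clip_norm=1.0):
    # Build a train op that clips gradients by global norm and exposes the pre-clip norm.
    global_step = tf.train.get_or_create_global_step()
    optimizer = tf.train.AdamOptimizer(learning_rate)
    grads, variables = zip(*optimizer.compute_gradients(loss))
    clipped, global_norm = tf.clip_by_global_norm(grads, clip_norm)
    train_op = optimizer.apply_gradients(zip(clipped, variables), global_step=global_step)
    tf.summary.scalar("grad_global_norm", global_norm)  # zeros or NaNs here flag vanishing/exploding gradients
    return train_op, global_norm
```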

@rafaelvalle

rafaelvalle commented Nov 19, 2017

Continuing the analysis: the standard deviation of the activations of the pre layer and decoder layer of the decoder RNN increases considerably, and keeps increasing, when there's a spike in the loss, possibly due to a "bad" input going into the pre layer of the decoder!
