Parallel TTS

[TOC]

What's New !

2021/04/20 Merge from wavegan branch to main branch, delete wavegan branch!
2021/04/13 Create encoder branch to dev the speech style transfer module!
2021/04/13 softdtw branch support SoftDTW loss!
2021/04/09 ~~wavegan branch~~ support PWG / MelGAN / Multi-band MelGAN vocoder!
2021/04/05 Support ParallelText2Mel + MelGAN vocoder!
[ Key Info ] Speed Indicator, Samples, Web Demo, [Few Issues](#Few Issues), Communication ......

Repo Structure

.
|--- config/      # config file
     |--- default.yaml
     |--- ...
|--- datasets/    # data process
|--- encoder/     # voice encoder
     |--- voice_encoder.py
     |--- ...
|--- helpers/     # some helpers
     |--- trainer.py
     |--- synthesizer.py
     |--- ...
|--- logdir/      # training log directory
|--- losses/      # loss function
|--- models/      # synthesizor
     |--- layers.py
     |--- duration.py
     |--- parallel.py
|--- pretrained/  # pretrained (LJSpeech dataset)
|--- samples/     # synthesized samples
|--- utils/       # some common utils
|--- vocoder/     # vocoder
     |--- melgan.py
     |--- ...
|--- wandb/       # Wandb save directory
|--- extract-duration.py
|--- extract-embedding.py
|--- LICENSE
|--- prepare-dataset.py  # prepare dataset
|--- README.md
|--- requirements.txt    # dependencies
|--- synthesize.py       # synthesize script
|--- train-duration.py   # train script
|--- train-parallel.py

Samples

Here are some synthesized samples.

Pretrained

Here are some pretrained models.

Quick Start

Step (1)：clone repo

$ git clone https://github.com/atomicoo/ParallelTTS.git

Step (2)：install dependencies

$ conda create -n ParallelTTS python=3.7.9
$ conda activate ParallelTTS
$ pip install -r requirements.txt

Step (3)：synthesize audio

$ python synthesize.py \
  --checkpoint ./pretrained/ljspeech-parallel-epoch0100.pth \
  --melgan_checkpoint ./pretrained/ljspeech-melgan-epoch3200.pth \
  --input_texts ./samples/english/synthesize.txt \
  --outputs_dir ./outputs/

If synthesizing audio of other languages, you should set config file through --config.

Training

Step (1)：prepare dataset

$ python prepare-dataset.py

Through --config to set config file, default (default.yaml) is for LJSpeech dataset.

Step (2)：train alignment model

$ python train-duration.py

Step (3)：extract durations

$ python extract-duration.py

Through --ground_truth to set weather generating ground-truth spectrograms or not。

Step (4)：train synthesize model

$ python train-parallel.py

Through --ground_truth to set weather training model by ground-truth spectrograms。

Training Log

if use TensorBoardX, run this:

$ tensorboard --logdir logdir/[DIR]/

It is highly recommended to use Wandb（Weights & Biases）, just set --enable_wandb when training。

Datasets

LJSpeech: English, Female, 22050 Hz, ~24 h
LibriSpeech: English, Multi-speakers (only use audios of train-clean-100),16000 Hz，total ~1000 h
JSUT: Japanese, Female, 48000 Hz, ~10 h
BiaoBei: Mandarin, Female, 48000 Hz, ~12 h
KSS: Korean, Female, 44100 Hz, ~12 h
RuLS: Russian, Multi-speakers (only use audios of single speaker), 16000 Hz, total ~98 h
TWLSpeech (non-public, poor quality): Tibetan, Female (multi-speakers, sound similar), 16000 Hz，~23 h

Quality

TODO: to be added.

Speed

Speed of Training：LJSpeech dataset, batch size = 64, training on 8GB GTX 1080 GPU, elapsed ~8h (~300 epochs).

Speed of Synthesizing：test under CPU @ Intel Core i7-8550U / GPU @ NVIDIA GeForce MX150, 8s per synthesized audio (about 20 words)

Batch Size	Spec (GPU)	Audio (GPU)	Spec (CPU)	Audio (CPU)
1	0.042	0.218	0.100	2.004
2	0.046	0.453	0.209	3.922
4	0.053	0.863	0.407	7.897
8	0.062	2.386	0.878	14.599

Attention, no multiple tests, for reference only.

Few Issues

In wavegan branch, code of vocoder is from ParallelWaveGAN. Since the method of acoustic feature extraction is not compatible, it needs to be transformed. See here.
The input of mandarin model is pinyin. Because of the lack of punctuations in BiaoBei's raw pinyin sequence and the incomplete alignment model training, there's something wrong with the rhythm of synthesized samples.
I haven't trained a Korean vocoder specially, and just use the vocoder of LJSpeech (22050 Hz), which might slightly affect the quality of synthesized audio.

References

Kyubyong/tacotron
r9y9/deepvoice3_pytorch
tugstugi/pytorch-dc-tts
janvainer/speedyspeech
Po-Hsun-Su/pytorch-ssim
Maghoumi/pytorch-softdtw-cuda
seungwonpark/melgan
kan-bayashi/ParallelWaveGAN

TODO

Synthetic speech quality assessment (MOS)
More tests in different languages
Speech style transfer (tone)

Communication

E-mail: atomicoo95@gmail.com

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README_en.md

README_en.md

Parallel TTS

What's New !

Repo Structure

Samples

Pretrained

Quick Start

Training

Training Log

Datasets

Quality

Speed

Few Issues

References

TODO

Communication

Files

README_en.md

Latest commit

History

README_en.md

File metadata and controls

Parallel TTS

What's New !

Repo Structure

Samples

Pretrained

Quick Start

Training

Training Log

Datasets

Quality

Speed

Few Issues

References

TODO

Communication