简体中文 | English

Parallel TTS

[TOC]

What's New!

Repo Structure

```
.
|--- config/      # config files
     |--- default.yaml
     |--- ...
|--- datasets/    # data processing
|--- encoder/     # voice encoder
     |--- voice_encoder.py
     |--- ...
|--- helpers/     # helpers
     |--- trainer.py
     |--- synthesizer.py
     |--- ...
|--- logdir/      # training log directory
|--- losses/      # loss functions
|--- models/      # synthesizer
     |--- layers.py
     |--- duration.py
     |--- parallel.py
|--- pretrained/  # pretrained models (LJSpeech dataset)
|--- samples/     # synthesized samples
|--- utils/       # common utils
|--- vocoder/     # vocoder
     |--- melgan.py
     |--- ...
|--- wandb/       # wandb save directory
|--- extract-duration.py
|--- extract-embedding.py
|--- LICENSE
|--- prepare-dataset.py  # prepare dataset
|--- README.md
|--- requirements.txt    # dependencies
|--- synthesize.py       # synthesis script
|--- train-duration.py   # training scripts
|--- train-parallel.py
```

Samples

Here are some synthesized samples.

Pretrained

Here are some pretrained models.

Quick Start

Step (1): clone the repo

```shell
$ git clone https://github.com/atomicoo/ParallelTTS.git
```

Step (2): install dependencies

```shell
$ conda create -n ParallelTTS python=3.7.9
$ conda activate ParallelTTS
$ pip install -r requirements.txt
```

Step (3): synthesize audio

```shell
$ python synthesize.py \
  --checkpoint ./pretrained/ljspeech-parallel-epoch0100.pth \
  --melgan_checkpoint ./pretrained/ljspeech-melgan-epoch3200.pth \
  --input_texts ./samples/english/synthesize.txt \
  --outputs_dir ./outputs/
```

To synthesize audio in another language, specify the corresponding config file via --config.

Training

Step (1): prepare the dataset

```shell
$ python prepare-dataset.py
```

Use --config to set the config file; the default (default.yaml) is for the LJSpeech dataset.
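For orientation, a hypothetical config fragment in the spirit of a typical TTS setup. The field names below are illustrative assumptions, not necessarily those used in default.yaml; the batch size, epoch count, and LJSpeech sample rate are taken from elsewhere in this README.

```yaml
# Hypothetical sketch -- field names are illustrative,
# not guaranteed to match default.yaml
dataset:
  path: ./data/LJSpeech-1.1
  language: en
audio:
  sample_rate: 22050   # LJSpeech is 22050 Hz
  n_mels: 80
train:
  batch_size: 64
  epochs: 300
```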

Step (2): train the alignment model

```shell
$ python train-duration.py
```

Step (3): extract durations

```shell
$ python extract-duration.py
```

Use --ground_truth to control whether ground-truth spectrograms are generated.
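For intuition, durations are commonly derived from an alignment model's attention by counting, for each input token, how many spectrogram frames attend to it most strongly. A minimal sketch of that idea, not the actual logic of extract-duration.py:

```python
def alignment_to_durations(frame_to_token, n_tokens):
    """Count how many spectrogram frames are assigned to each input token.

    frame_to_token: for each frame, the index of the input token it
    attends to most strongly (argmax over the attention matrix rows).
    """
    durations = [0] * n_tokens
    for token_idx in frame_to_token:
        durations[token_idx] += 1
    return durations

# 6 frames aligned to 3 tokens: token 0 gets 2 frames,
# token 1 gets 3, token 2 gets 1
print(alignment_to_durations([0, 0, 1, 1, 1, 2], 3))  # [2, 3, 1]
```

By construction, the durations sum to the number of spectrogram frames, which is what the parallel synthesizer needs for length regulation.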

Step (4): train the synthesis model

```shell
$ python train-parallel.py
```

Use --ground_truth to control whether the model is trained on ground-truth spectrograms.
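Both extract-duration.py and train-parallel.py gate this behavior behind a boolean flag. A minimal sketch of how such a --ground_truth switch is typically wired; the actual scripts may define it differently:

```python
import argparse

# Minimal sketch of a boolean --ground_truth flag; the real
# training scripts likely define many more options.
parser = argparse.ArgumentParser()
parser.add_argument("--ground_truth", action="store_true",
                    help="use ground-truth spectrograms")

args = parser.parse_args(["--ground_truth"])
print(args.ground_truth)  # True

args = parser.parse_args([])
print(args.ground_truth)  # False (flag omitted)
```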

Training Log

If you use TensorBoardX, run:

```shell
$ tensorboard --logdir logdir/[DIR]/
```

It is highly recommended to use wandb (Weights & Biases); just set --enable_wandb when training.

Datasets

  • LJSpeech: English, female, 22050 Hz, ~24 h
  • LibriSpeech: English, multi-speaker (only the train-clean-100 audio is used), 16000 Hz, ~1000 h total
  • JSUT: Japanese, female, 48000 Hz, ~10 h
  • BiaoBei: Mandarin, female, 48000 Hz, ~12 h
  • KSS: Korean, female, 44100 Hz, ~12 h
  • RuLS: Russian, multi-speaker (only one speaker's audio is used), 16000 Hz, ~98 h total
  • TWLSpeech (non-public, poor quality): Tibetan, female (multi-speaker, similar voices), 16000 Hz, ~23 h

Quality

TODO: to be added.

Speed

Speed of training: on the LJSpeech dataset with batch size 64, training on an 8 GB GTX 1080 GPU takes ~8 h (~300 epochs).

Speed of synthesizing: tested on CPU @ Intel Core i7-8550U and GPU @ NVIDIA GeForce MX150; synthesizing one audio clip (about 20 words) takes ~8 s. The table below reports spectrogram generation (Spec) and vocoding (Audio) times per batch.

| Batch Size | Spec (GPU) | Audio (GPU) | Spec (CPU) | Audio (CPU) |
| ---------- | ---------- | ----------- | ---------- | ----------- |
| 1          | 0.042      | 0.218       | 0.100      | 2.004       |
| 2          | 0.046      | 0.453       | 0.209      | 3.922       |
| 4          | 0.053      | 0.863       | 0.407      | 7.897       |
| 8          | 0.062      | 2.386       | 0.878      | 14.599      |

Note: these are single-run measurements, for reference only.
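Reading the table as seconds per batch, the per-utterance cost drops as the batch grows. A quick sanity check on the GPU figures above (assuming the times are in seconds, which the ~8 s note suggests):

```python
# (batch_size, spec_time, audio_time) on GPU, taken from the table above
gpu = [(1, 0.042, 0.218), (2, 0.046, 0.453),
       (4, 0.053, 0.863), (8, 0.062, 2.386)]

for batch, spec, audio in gpu:
    per_utt = (spec + audio) / batch  # seconds per utterance
    print(f"batch {batch}: {per_utt:.3f} s per utterance")
```

At batch size 8, the per-utterance time is (0.062 + 2.386) / 8 ≈ 0.306 s, versus 0.260 s of spectrogram-plus-vocoder time spent on a single utterance at batch size 1 on the spectrogram side alone; most of the total is vocoding.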

Few Issues

  • In the wavegan branch, the vocoder code comes from ParallelWaveGAN. Since its acoustic feature extraction is not compatible, the features need to be transformed. See here.
  • The input of the Mandarin model is pinyin. Because BiaoBei's raw pinyin sequences lack punctuation and the alignment model was not fully trained, the rhythm of the synthesized samples is somewhat off.
  • I have not trained a dedicated Korean vocoder and simply reuse the LJSpeech vocoder (22050 Hz), which might slightly degrade the quality of the synthesized audio.

References

TODO

  • Synthetic speech quality assessment (MOS)
  • More tests in different languages
  • Speech style transfer (tone)

Communication