This document collects papers in the field of speech synthesis, covering both classic work and emerging directions.
Acoustic Model
- Tacotron1: Tacotron: Towards End-to-End Speech Synthesis (Interspeech 2017)
- Tacotron2: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (ICASSP 2018)
- FastSpeech1: FastSpeech: Fast, Robust and Controllable Text to Speech (NeurIPS 2019)
- FastSpeech2: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech (arXiv 2020)
- Glow-TTS: Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search (NeurIPS 2020)
- EfficientTTS: EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture (arXiv 2020)
- BVAE-TTS: Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (ICLR 2021)
Vocoder
- WaveNet: WaveNet: A Generative Model for Raw Audio (ISCA SS Workshop 2016)
- FFTNet: FFTNet: a Real-Time Speaker-Dependent Neural Vocoder (ICASSP 2018)
- WaveRNN: Efficient Neural Audio Synthesis (ICML 2018) [Code]
- WaveGlow: WaveGlow: A Flow-based Generative Network for Speech Synthesis (ICASSP 2019)
- MelGAN: MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis (NeurIPS 2019) [Code]
- HiFi-GAN: HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis (NeurIPS 2020) [Code]
Prosody Modeling
- Prosody-Tacotron: Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron (ICML 2018)
- GST-Tacotron: Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis (ICML 2018)
- VAE-Tacotron: Learning Latent Representations for Style Control and Transfer in End-to-End Speech Synthesis (ICASSP 2019)
- VAE-Flow: Using VAEs and Normalizing Flows for One-shot Text-To-Speech Synthesis of Expressive Speech (ICASSP 2020)
- Fine-grained-Attention: Robust and Fine-Grained Prosody Control of End-to-End Speech Synthesis (ICASSP 2019)
- Manual-feature-based: Fine-grained robust prosody transfer for single-speaker neural text-to-speech (Interspeech 2019)
- CopyCat: CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech (Interspeech 2020)