Speech Synthesis and Personalization under Unimodal and Multimodal Conditions

Authors

  • Wanlin Gao

DOI:

https://doi.org/10.62051/7b0mc109

Keywords:

TTS; Speech Synthesis; Speech-to-Gesture; Personalized Speech.

Abstract

Recently, there have been notable advancements in text-to-speech (TTS) technology, with researchers optimizing the efficiency, quality, and flexibility of speech generation through a variety of models. This paper systematically reviews end-to-end TTS models based on waveform generation, including Parallel WaveGAN, NaturalSpeech, and Multi-Band MelGAN, each of which offers distinct improvements in real-time generation capability and sound quality. The paper also discusses the development of speech separation and synthesis technologies, highlighting the application of models such as CONTENTVEC to pitch adjustment and speaker information disentanglement. On the multimodal side, speech-to-gesture generation has likewise seen important breakthroughs, exploiting multimodal information to produce natural gestures. The paper further summarizes the main datasets used in related research, such as LibriTTS, LJSpeech, and VCTK, aiming to offer reference and guidance for future research on speech generation. Although these technologies have achieved significant gains in efficiency and versatility, the associated models remain complex and require substantial computational resources, which limits their widespread application in practical scenarios.
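To ground the dataset and vocoder pipeline surveyed above, the short Python sketch below (assuming the torchaudio library and a hypothetical ./data directory) loads an LJSpeech utterance and computes the mel-spectrogram that waveform-generation vocoders such as Parallel WaveGAN and Multi-Band MelGAN learn to invert back to audio; the spectrogram parameters are common illustrative values, not the exact settings of any cited model.

    import torchaudio

    # Hypothetical local path; torchaudio downloads the LJSpeech corpus on first use.
    DATA_ROOT = "./data"

    # LJSpeech is one of the single-speaker corpora summarized in the paper.
    dataset = torchaudio.datasets.LJSPEECH(root=DATA_ROOT, download=True)
    waveform, sample_rate, _, transcript = dataset[0]

    # Mel-spectrogram extraction: the intermediate acoustic representation that
    # neural vocoders (e.g., Parallel WaveGAN, Multi-Band MelGAN) map to waveforms.
    # The parameter values below are illustrative, not those of the cited models.
    mel_transform = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=1024,
        hop_length=256,
        n_mels=80,
    )
    mel = mel_transform(waveform)  # shape: (channels, n_mels, frames)
    print(transcript, tuple(mel.shape))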

References

[1] Yamamoto R, Song E, Kim J M. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram[C]//ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 6199-6203.

[2] Tan X, Chen J, Liu H, et al. NaturalSpeech: End-to-end text-to-speech synthesis with human-level quality[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

[3] Yang G, Yang S, Liu K, et al. Multi-Band MelGAN: Faster waveform generation for high-quality text-to-speech[C]//2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021: 492-498.

[4] Polyak A, Adi Y, Copet J, et al. Speech resynthesis from discrete disentangled self-supervised representations[J]. arXiv preprint arXiv:2104.00355, 2021.

[5] Bińkowski M, Donahue J, Dieleman S, et al. High fidelity speech synthesis with adversarial networks[J]. arXiv preprint arXiv:1909.11646, 2019.

[6] Wang X, Takaki S, Yamagishi J. Neural source-filter-based waveform model for statistical parametric speech synthesis[C]//ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019: 5916-5920.

[7] Łańcucki A. FastPitch: Parallel text-to-speech with pitch prediction[C]//ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021: 6588-6592.

[8] Huang R, Lam M W Y, Wang J, et al. FastDiff: A fast conditional diffusion model for high-quality speech synthesis[J]. arXiv preprint arXiv:2204.09934, 2022.

[9] Rosenberg A, Zhang Y, Ramabhadran B, et al. Speech recognition with augmented synthesized speech[C]//2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019: 996-1002.

[10] Ning Y, He S, Wu Z, et al. A review of deep learning based speech synthesis[J]. Applied Sciences, 2019, 9(19): 4050.

[11] Qian K, Zhang Y, Gao H, et al. ContentVec: An improved self-supervised speech representation by disentangling speakers[C]//International Conference on Machine Learning. PMLR, 2022: 18003-18017.

[12] Mustafa A, Pia N, Fuchs G. StyleMelGAN: An efficient high-fidelity adversarial vocoder with temporal adaptive normalization[C]//ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021: 6034-6038.

[13] Wang X, Thakker M, Chen Z, et al. SpeechX: Neural codec language model as a versatile speech transformer[J]. arXiv preprint arXiv:2308.06873, 2023.

[14] Le M, Vyas A, Shi B, et al. Voicebox: Text-guided multilingual universal speech generation at scale[J]. Advances in Neural Information Processing Systems, 2024, 36.

[15] Kucherenko T, Hasegawa D, Henter G E, et al. Analyzing input and output representations for speech-driven gesture generation[C]//Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents. 2019: 97-104.

[16] Yoon Y, Cha B, Lee J H, et al. Speech gesture generation from the trimodal context of text, audio, and speaker identity[J]. ACM Transactions on Graphics (TOG), 2020, 39(6): 1-16.

[17] Min D, Lee D B, Yang E, et al. Meta-stylespeech: Multi-speaker adaptive text-to-speech generation[C]//International Conference on Machine Learning. PMLR, 2021: 7748-7759.

[18] Lei Y, Yang S, Xie L. Fine-grained emotion strength transfer, control and prediction for emotional speech synthesis[C]//2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021: 423-430.

[19] Jia Y, Ramanovich M T, Remez T, et al. Translatotron 2: High-quality direct speech-to-speech translation with voice preservation[C]//International Conference on Machine Learning. PMLR, 2022: 10120-10134.

Published

25-11-2024

How to Cite

Gao, W. (2024) “Speech Synthesis and Personalization under Unimodal and Multimodal Conditions”, Transactions on Computer Science and Intelligent Systems Research, 7, pp. 126–137. doi:10.62051/7b0mc109.