This repository explores text-to-speech (TTS) models and voice cloning with Generative AI. The main goal is to find a solution capable of generating natural, personalized speech that accurately replicates the voice of any individual. The technology should infer the unique characteristics of a person's voice, producing artificial speech that realistically and faithfully reproduces the original voice's tone, timbre, and nuances.
The Coqui AI TTS library provides several pre-trained models covering a variety of languages for creating high-quality multilingual solutions. It also offers tools for training new models and fine-tuning existing ones, enabling advanced customizations according to each project's needs.
Tip
Among the algorithms used for generating spectrograms are prominent models such as Tacotron, Glow-TTS, and SpeedySpeech, which are known for their efficiency and accuracy in natural speech synthesis.
Additionally, the library supports end-to-end models, including XTTS and YourTTS, which are well suited to more complex applications involving voice cloning and personalized speech synthesis.
Important
This repository covers the xtts_v2 and VITS voice-cloning models and how to perform inference using the Coqui AI TTS library.
The XTTS model generates audio with high production quality and performs well in multilingual speech generation. Its voice cloning capability is remarkable, requiring only a 3-second sample of the original voice, and cross-language cloning is also effective.
The code uses the xtts_v2 model from Coqui AI to synthesize speech in Brazilian Portuguese, cloning the voice from a provided `input_speaker.wav` sample and saving the cloned voice to `output_xtts.wav`.
Warning
It is important to note that the XTTS model license does not allow commercial use of the model or its generated outputs. For more information, read: Coqui Public Model License 1.0.0 / Coqui
VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) is an end-to-end TTS model that combines Deep Learning techniques such as GANs, VAEs, and Normalizing Flows. The model does not require external alignment annotations; it learns text-to-audio alignment via monotonic alignment search (MAS). The architecture combines the GlowTTS encoder and the HiFiGAN vocoder.
YourTTS is a multilingual and multi-speaker TTS model that performs zero-shot voice conversion and adaptation. It also has the capability to learn a new language or voice with only a 1-minute audio clip, making it an interesting model for training new TTS models in various languages using limited computational resources.
Note
The VITS model did not deliver satisfactory performance in generating natural-sounding speech when used for voice cloning. Although the technique can replicate a speaker's vocal characteristics, the final synthesis lacks the fluidity and expressiveness expected of natural human speech, which limits its applicability in contexts that demand high-quality synthesis. The model's main benefit, however, is its generation and cloning speed: its architecture allows real-time generation, making it viable for applications that require speech synthesis without a high level of naturalness.
Create a virtual environment:
python -m venv .venv
.venv\Scripts\activate
Upgrade Setuptools, Wheel and Pip:
pip install --upgrade setuptools wheel pip
Install the TTS package from Coqui:
pip install TTS
Alternatively, install from source:
git clone https://github.com/coqui-ai/TTS
cd TTS
pip install -e .
Install PyTorch with CUDA support:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
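After installation, a quick sanity check (assuming the steps above completed without errors) is to confirm that PyTorch can see the GPU and that the `tts` command-line tool is available:

```shell
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
tts --list_models
```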
Caution
Common installation errors: use `pip install --upgrade pip setuptools wheel` to resolve TTS package installation issues. For Spacy version compatibility, use `pip install spacy==3.4.0`.
- Giovane Iwamoto, computer science student at UFMS - Brazil, Campo Grande - MS.
I am always open to receiving constructive criticism and suggestions for improvement in my developed code. I believe that feedback is an essential part of the learning and growth process, and I am eager to learn from others and make my code the best it can be. Whether it's a minor tweak or a major overhaul, I am willing to consider all suggestions and implement the changes that will benefit my code and its users.