🐸 Voice Cloning XTTS VITS - Coqui AI library for TTS models and voice cloning, aiming to generate natural, personalized speech that accurately replicates an individual's voice.


VOICE CLONING XTTS VITS

OVERVIEW

Text-to-speech (TTS) models and voice cloning using generative AI. The main goal is a solution capable of generating natural, personalized speech that accurately replicates the voice of any individual. The system should infer the unique characteristics of a person's voice from a sample, producing artificial speech that faithfully reproduces the original voice's tone, timbre, and nuances.


The Coqui AI TTS library provides several pre-trained models covering a variety of languages for creating high-quality multilingual solutions. It also offers tools for training new models and fine-tuning existing ones, enabling advanced customizations according to each project's needs.

Tip

Among the algorithms used for generating spectrograms are prominent models such as Tacotron, Glow-TTS, and SpeedySpeech, which are known for their efficiency and accuracy in natural speech synthesis.

Additionally, the library supports end-to-end models, including XTTS and YOURTTS, which are highly suitable for more complex applications involving voice cloning and personalized speech synthesis.

Important

This repository covers the xtts_v2 and VITS voice-cloning models and how to perform inference with them using the Coqui AI TTS library.

XTTS

The model generates audio with high production quality and performs well in multilingual speech synthesis. Its voice-cloning capability is remarkable: it needs only a 3-second sample of the original voice, and it is also effective at cross-language cloning.

The code uses the xtts_v2 model from Coqui AI to synthesize speech in Brazilian Portuguese, cloning the voice from a provided sample input_speaker.wav and saving the result to output_xtts.wav.
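The inference step described above can be sketched with the Coqui TTS Python API (a minimal sketch: the file names input_speaker.wav and output_xtts.wav match those mentioned, and the model is downloaded automatically on first use):

```python
# Minimal sketch of XTTS v2 voice-cloning inference with the Coqui TTS API.
MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"
LANGUAGE = "pt"  # Brazilian Portuguese

def clone_voice(text: str,
                speaker_wav: str = "input_speaker.wav",
                out_path: str = "output_xtts.wav") -> str:
    """Synthesize `text` in the voice of `speaker_wav`, saving to `out_path`."""
    from TTS.api import TTS  # imported lazily: first use downloads the model
    tts = TTS(MODEL_NAME)
    tts.tts_to_file(text=text,
                    speaker_wav=speaker_wav,
                    language=LANGUAGE,
                    file_path=out_path)
    return out_path
```

Calling clone_voice("Olá, mundo!") should write output_xtts.wav in the cloned voice; the example text is a placeholder.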

Warning

It is important to note that the XTTS license does not allow commercial use of the model or its generated outputs. For more information, see the Coqui Public Model License 1.0.0.

VITS

VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) is an end-to-end TTS model built on deep-learning techniques such as GANs, VAEs, and normalizing flows. It requires no external alignment annotations, learning text-to-audio alignment via Monotonic Alignment Search (MAS). The architecture combines the Glow-TTS encoder with the HiFi-GAN vocoder.

YourTTS is a multilingual and multi-speaker TTS model that performs zero-shot voice conversion and adaptation. It also has the capability to learn a new language or voice with only a 1-minute audio clip, making it an interesting model for training new TTS models in various languages using limited computational resources.
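Zero-shot cloning with YourTTS follows the same API pattern as XTTS (a sketch: the model name and the pt-br language code come from the Coqui model catalogue, and the reference clip is assumed to exist):

```python
# Sketch of zero-shot voice cloning with the multilingual YourTTS model.
YOUR_TTS_MODEL = "tts_models/multilingual/multi-dataset/your_tts"

def yourtts_clone(text: str,
                  speaker_wav: str = "input_speaker.wav",
                  out_path: str = "output_vits.wav",
                  language: str = "pt-br") -> str:
    """Clone the voice in `speaker_wav` onto `text` without fine-tuning."""
    from TTS.api import TTS  # lazy import: the model downloads on first use
    tts = TTS(YOUR_TTS_MODEL)
    tts.tts_to_file(text=text,
                    speaker_wav=speaker_wav,
                    language=language,
                    file_path=out_path)
    return out_path
```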

Note

The VITS model did not deliver satisfactory naturalness when used for voice cloning. Although the technique can replicate a speaker's vocal characteristics, the synthesized speech lacks the fluidity and expressiveness of natural human speech, which limits its applicability in contexts that demand high-quality synthesis. Its main benefit is speed: the architecture supports real-time generation and cloning, making it viable for applications that need fast speech synthesis without a high degree of naturalness.


INSTALLATION GUIDE

Create and activate a virtual environment:

python -m venv .venv
.venv\Scripts\activate       # Windows
source .venv/bin/activate    # Linux / macOS

Upgrade Setuptools, Wheel and Pip:

pip install --upgrade setuptools wheel pip

Install the TTS package from Coqui, either from PyPI:

pip install TTS

or from source, for the latest development version:

git clone https://github.com/coqui-ai/TTS
cd TTS
pip install -e .

Install PyTorch with CUDA support:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Caution

Common installation errors: run pip install --upgrade pip setuptools wheel to resolve TTS package installation issues. For spaCy version compatibility, use pip install spacy==3.4.0.
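A quick, dependency-free way to confirm that the packages installed above are importable (a sketch using only the Python standard library; the package names are the import names, not the pip names):

```python
# Check that installed packages can be located without fully importing them.
import importlib.util

def is_installed(pkg: str) -> bool:
    """Return True if `pkg` is importable in the current environment."""
    return importlib.util.find_spec(pkg) is not None

if __name__ == "__main__":
    for pkg in ("TTS", "torch", "torchaudio"):
        status = "ok" if is_installed(pkg) else "missing"
        print(f"{pkg}: {status}")
```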


AUTHOR

  • Giovane Iwamoto, computer science student at UFMS - Brazil, Campo Grande - MS.

I am always open to receiving constructive criticism and suggestions for improvement in my developed code. I believe that feedback is an essential part of the learning and growth process, and I am eager to learn from others and make my code the best it can be. Whether it's a minor tweak or a major overhaul, I am willing to consider all suggestions and implement the changes that will benefit my code and its users.
