How to Install and Use GPT-SoVITS: A New Era in Voice Conversion and Synthesis
Thursday, Jan 2, 2025 | 7 minute read
Revolutionizing voice tech, this innovative tool delivers instant voice conversion and text-to-speech synthesis with minimal data input. Its advanced features include cross-language support and user-friendly interfaces, making it accessible for all!
"In today's rapidly advancing tech landscape, voice technology is transforming our means of communication and creative expression at an astonishing pace."
In this information-heavy era, voice conversion and synthesis technologies have become hot topics across various industries. From online education to content creation, game development to customer service, voice technology plays a crucial role. Among these innovations, GPT-SoVITS, as a cutting-edge tool, utilizes the latest AI techniques to provide users with convenient voice conversion and text-to-speech (TTS) synthesis services, enabling nearly instantaneous personalized voice experiences.
1. Unveiling the Mystery: What is GPT-SoVITS?
GPT-SoVITS is an innovative tool designed specifically for voice conversion and text-to-speech (TTS) synthesis with minimal sample input! Simply upload a 5-second audio clip, and you can achieve voice conversion. It supports multiple languages including English, Japanese, Korean, Cantonese, and Chinese, demonstrating a wide range of application potential that allows easy use for anyone from any background! The tool was developed to enhance the flexibility and convenience of voice conversion and synthesis, catering to both professionals and beginners with tailored solutions.
2. Disruptive Features: What Makes GPT-SoVITS Stand Out?
One of the standout features of GPT-SoVITS is its "zero-shot TTS" technology! This enables immediate text-to-speech conversion with minimal sample data, making the conversion process efficient and quick! With its "few-shot TTS" technology, users only need 1 minute of training data to achieve nearly realistic voice effects, significantly lowering the barrier for data recording and inviting more participation. Furthermore, its cross-language support makes voice generation possible across different languages, expanding its application scenarios and meeting diverse user needs. The integrated WebUI tool simplifies the voice processing workflow, encompassing vocal/accompaniment separation, automated training data segmentation, Chinese automatic speech recognition (ASR), and text tagging functions, making it incredibly user-friendly for beginners!
3. Ideal for Developers: Why Choose GPT-SoVITS?
There are countless reasons to choose GPT-SoVITS! First and foremost, its open-source nature and robust community support offer users a seamless experience and abundant resources, a true boon for developers! The latest version has notably enhanced the quality of voice synthesis, especially excelling with low-quality reference audio, meeting user demands for high-quality synthesis! Its ease of use and flexibility allow users of all skill levels to get started effortlessly and adjust as needed! Continuous development and maintenance also ensure GPT-SoVITS remains cutting-edge and practical, retaining its competitiveness in the voice technology field.
With its powerful capabilities and wide applicability, GPT-SoVITS is ushering in unprecedented development opportunities in the voice conversion and synthesis realm, making it an indispensable tool for users and developers alike!
1. Installation and Usage
To begin using GPT-SoVITS, the first step is to ensure you have all the required dependencies. Type the following command in your terminal or command prompt:
```bash
# Install the necessary dependencies for the project
pip install -r requirements.txt
```
Note: This command reads the contents of the `requirements.txt` file and automatically installs all listed dependencies required to run the project.
Using Docker
For easier environment management, you can also run GPT-SoVITS using Docker. Docker effectively avoids issues like dependency version mismatches. Here are the steps you can follow:
`docker-compose.yaml` Configuration
- You can visit Docker Hub to obtain a pre-packaged image, or build a Docker image locally using the Dockerfile included in the project.
- Environment variables: `is_half` sets the model's precision at runtime; half precision can generally enhance performance, so give it a try!
- Volume configuration: set the working directory of the application in Docker to `/workspace`.
- Shared memory size: adjust according to your machine's resources; it's generally recommended to set a larger value to avoid performance bottlenecks.
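The points above can be sketched as a minimal `docker-compose.yaml`; the image tag, host paths, and port list here are illustrative placeholders, not the project's official file:

```yaml
version: "3.8"
services:
  gpt-sovits:
    image: breakstring/gpt-sovits:latest   # placeholder tag; or a locally built image
    environment:
      - is_half=False                      # try True for half precision
    volumes:
      - ./output:/workspace/output         # keep results and logs on the host
      - ./logs:/workspace/logs
    working_dir: /workspace
    shm_size: "16G"                        # a larger value avoids bottlenecks
    ports:
      - "9880:9880"
      - "9874:9874"
```

Each key maps directly to one of the configuration points listed above: `environment` for `is_half`, `working_dir` for `/workspace`, and `shm_size` for the shared memory size.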
Run with Docker Compose
To start the project, run the following command:
```bash
docker compose -f "docker-compose.yaml" up -d
```
Note: This command reads the `docker-compose.yaml` file and runs all configured services in the background, making it super easy!
Run with Docker Command
You can also run the project directly in Docker with the following command:
```bash
docker run --rm -it --gpus=all --env=is_half=False --volume=G:\GPT-SoVITS-DockerTest\output:/workspace/output --volume=G:\GPT-SoVITS-DockerTest\logs:/workspace/logs --volume=G:\GPT-SoVITS-DockerTest\SoVITS_weights:/workspace/SoVITS_weights --workdir=/workspace -p 9880:9880 -p 9871:9871 -p 9872:9872 -p 9873:9873 -p 9874:9874 --shm-size="16G" -d breakstring/gpt-sovits:xxxxx
```
Note: This command has detailed parameter configurations:
- `--gpus=all` means the system will utilize all available GPUs to accelerate computation.
- The `--volume` parameters map local folders into the Docker container, ensuring that your output results and logs are saved locally, while the internal paths of the program remain `/workspace/...`.
- The `-p` parameters specify port mappings between the host and the container.
- `--shm-size="16G"` sets the shared memory size, which is crucial when handling large amounts of data.
2. Pre-trained Models
Once your environment is set up, you will need to prepare some pre-trained voice models for voice generation and conversion.
- Download the pre-trained models from GPT-SoVITS Models and place them in the `GPT_SoVITS/pretrained_models` folder.
- For Chinese text-to-speech (TTS), you can download the model from G2PWModel_1.1.zip and place it in the `GPT_SoVITS/text` folder.
- For voice conversion, you can download related models from UVR5 Weights and place them in `tools/uvr5/uvr5_weights`.
- Finally, for Chinese automatic speech recognition (ASR), you can download models from the following links and save them in the `tools/asr/models` folder:
3. Dataset Format
When using GPT-SoVITS for voice generation, you need to prepare a TTS annotation file in the following format:
```
vocal_path|speaker_name|language|text
```
- vocal_path: Path to the audio file.
- speaker_name: Name of the speaker.
- language: Type of language.
- text: The text you wish to generate.
You can use the following language identifiers:
- `zh` for Chinese
- `ja` for Japanese
- `en` for English
- `ko` for Korean
- `yue` for Cantonese
Example format:
```
D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin.
```
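A minimal sketch of how such an annotation line can be parsed and validated before training; the helper name and error messages are illustrative, not project code:

```python
VALID_LANGUAGES = {"zh", "ja", "en", "ko", "yue"}

def parse_annotation(line):
    """Split one TTS annotation line into its four fields and sanity-check it."""
    parts = line.rstrip("\n").split("|")
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}")
    vocal_path, speaker_name, language, text = parts
    if language not in VALID_LANGUAGES:
        raise ValueError(f"unknown language tag: {language!r}")
    return {"vocal_path": vocal_path, "speaker_name": speaker_name,
            "language": language, "text": text}
```

Running every line of your annotation file through a check like this catches typos in the language tag or a missing field early, rather than partway through training.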
4. Fine-tuning and Inference
Launching WebUI
To start WebUI, you have different options to choose from:
For Integrated Package Users
- Double-click `go-webui.bat` or use the `go-webui.ps1` file to launch it effortlessly!
For Others
```bash
python webui.py <language(optional)>
```
Note: Running this command starts the WebUI. You can specify the interface language via the optional `<language>` parameter.
Fine-tuning Steps
During fine-tuning, you need to follow these steps:
- Fill in the audio file path.
- Choose to slice the audio into smaller chunks.
- (Optional) Perform noise reduction.
- Conduct ASR (automatic speech recognition) processing.
- Proofread the transcribed text generated by ASR.
- Fine-tune the model in the next tab.
Command Line Usage
To open UVR5's WebUI from the command line, type the following command:
```bash
python tools/uvr5/webui.py "<infer_device>" <is_half> <webui_port_uvr5>
```
Note: Here, `<infer_device>` is the device you want to use; `<is_half>` can be set to `True` or `False` to choose between half-precision and full-precision models; `<webui_port_uvr5>` is the port number you'd like to use for the WebUI.
To use the command line for dataset splitting, you can use:
```bash
python audio_slicer.py --input_path "<path_to_original_audio_file_or_directory>" --output_root "<directory_where_subdivided_audio_clips_will_be_saved>" --threshold <volume_threshold> --min_length <minimum_duration_of_each_subclip> --min_interval <shortest_time_gap_between_adjacent_subclips> --hop_size <step_size_for_computing_volume_curve>
```
Note: This command slices audio files into smaller segments for easier processing.
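To make those parameters concrete, here is an illustrative sketch of the idea behind volume-threshold slicing, operating on a precomputed per-hop volume curve. It mimics the concept (threshold, minimum length, minimum silent interval), not the actual `audio_slicer.py` implementation:

```python
def slice_by_volume(volumes, threshold, min_length, min_interval):
    """Split a per-hop volume curve into voiced segments.

    A segment ends once the volume stays below `threshold` for at least
    `min_interval` consecutive hops; segments shorter than `min_length`
    hops are discarded. Returns (start, end) hop-index pairs, end exclusive.
    """
    segments, start, silent = [], None, 0
    for i, v in enumerate(volumes):
        if v >= threshold:
            if start is None:
                start = i          # a new voiced segment begins
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= min_interval:
                end = i - silent + 1
                if end - start >= min_length:
                    segments.append((start, end))
                start, silent = None, 0
    # Close a segment still open at the end of the curve
    if start is not None and (len(volumes) - silent) - start >= min_length:
        segments.append((start, len(volumes) - silent))
    return segments
```

In the real tool, the volume curve would come from the audio at steps of `hop_size` samples, and the thresholds would be in dB and milliseconds rather than bare hop counts.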
For ASR processing (Chinese only):
```bash
python tools/asr/funasr_asr.py -i <input> -o <output>
```
The command for non-Chinese ASR processing is:
```bash
python ./tools/asr/fasterwhisper_asr.py -i <input> -o <output> -l <language> -p <precision>
```
Note: These two commands are for processing automatic speech recognition for Chinese and non-Chinese languages, and the input and output paths should be specified correctly.
If you want to generate outputs using GPT-SoVITS from Python, a basic code example is as follows:
```python
from gpt_sovits import GPTSoVITS

# Create an instance of the GPTSoVITS model
model = GPTSoVITS()

# Generate output
output = model.generate("Your input text")

# Print the output result
print(output)
```
Note: Here, we import the `GPTSoVITS` class, instantiate a model, and call the `generate` method to create the output. Simple and straightforward, right?
Advanced Features
You can achieve more complex text generation effects by adjusting parameters, for example:
```python
output = model.generate("Your input text", max_length=100, temperature=0.7)
print(output)
```
Note: Setting `max_length` defines the maximum length of the generated text, while `temperature` controls the randomness of the text generation; higher values lead to more random results. Explore and have fun!
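The effect of `temperature` can be illustrated with a plain softmax; this is a generic sketch of how samplers use temperature, not GPT-SoVITS internals:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into sampling probabilities.

    Dividing the scores by the temperature before the softmax means
    temperatures below 1 sharpen the distribution (more deterministic),
    while temperatures above 1 flatten it toward uniform (more random).
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

For the same scores, a low temperature puts almost all probability on the top candidate, while a high temperature spreads it out, which is why higher values give more varied output.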
5. Code Examples
Overall Structure
Finally, the project file structure generally looks like this:
```
gpt_sovits/
├── main.py            # Main program entry
├── requirements.txt   # List of project dependencies
└── gpt_sovits/
    ├── __init__.py    # Package initialization file
    ├── model.py       # Model definitions
    └── utils.py       # Utility functions
```
Note: Each file serves a clear function: `main.py` is the starting file for the program, `requirements.txt` contains all the dependencies required for the project, and the `gpt_sovits/` directory primarily stores model-related code and utility functions.
Through the detailed steps and instructions provided above, you should be able to smoothly install, configure, and use the GPT-SoVITS project, embarking on your journey into voice generation and conversion!