Tortoise TTS
July 30, 2024
Tortoise TTS is a text-to-speech program that turns text into realistic speech. It comes with multiple voices that reproduce different speakers' timbre and intonation, so you can pick whichever voice style suits your needs. The Tortoise TTS source code contains everything required to run it in inference mode.
Sample code
# Imports used through the rest of the notebook.
import torch
import torchaudio
import torch.nn as nn
import torch.nn.functional as F
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio, load_voice, load_voices
# This will download all the models used by Tortoise from the HuggingFace hub.
tts = TextToSpeech()
# This is the text that will be spoken.
text = "Thanks for reading this article. I hope you learned something."
# Pick a "preset mode" to determine quality. Options: {"ultra_fast", "fast" (default), "standard", "high_quality"}. See docs in api.py
# Generate speech with the custotm voice.
gen = tts.tts_with_preset(text, preset="fast")
torchaudio.save(f'generated.wav', gen.squeeze(0).cpu(), 24000)
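Tortoise also ships a set of named voices under tortoise/voices/. A minimal sketch of picking one of them by name with load_voice ('tom' is only an example folder; any bundled or custom voice directory should work):
# Use one of the bundled voices by name instead of the default one.
# NOTE: 'tom' is just an example; substitute any folder under tortoise/voices/.
voice_samples, conditioning_latents = load_voice('tom')
gen = tts.tts_with_preset(text, voice_samples=voice_samples,
                          conditioning_latents=conditioning_latents, preset=preset)
torchaudio.save('generated-tom.wav', gen.squeeze(0).cpu(), 24000)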
If you want to use a custom voice:
# Optionally, upload your own voice by running the next two cells. I recommend
# uploading at least 2 audio clips. They must be WAV files, 6-10 seconds long.
CUSTOM_VOICE_NAME = "martin"
import os
from google.colab import files
custom_voice_folder = f"tortoise/voices/{CUSTOM_VOICE_NAME}"
os.makedirs(custom_voice_folder, exist_ok=True)
for i, file_data in enumerate(files.upload().values()):
with open(os.path.join(custom_voice_folder, f'{i}.wav'), 'wb') as f:
f.write(file_data)
# Generate speech with the custom voice.
voice_samples, conditioning_latents = load_voice(CUSTOM_VOICE_NAME)
gen = tts.tts_with_preset(text, voice_samples=voice_samples,
                          conditioning_latents=conditioning_latents, preset=preset)
torchaudio.save(f'generated-{CUSTOM_VOICE_NAME}.wav', gen.squeeze(0).cpu(), 24000)

# Play the result inline (works in Colab/Jupyter).
import IPython
IPython.display.Audio(f'generated-{CUSTOM_VOICE_NAME}.wav')
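The load_voices import above (so far unused) also lets you blend two or more voices into a single synthetic speaker, a feature shown in the upstream Colab. A sketch, assuming the bundled 'pat' and 'william' voice folders are present:
# Blend two voices into one speaker.
# NOTE: 'pat' and 'william' are example folders; any two under tortoise/voices/ should work.
voice_samples, conditioning_latents = load_voices(['pat', 'william'])
gen = tts.tts_with_preset(text, voice_samples=voice_samples,
                          conditioning_latents=conditioning_latents, preset=preset)
torchaudio.save('generated-pat-william.wav', gen.squeeze(0).cpu(), 24000)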
Result
The generated audio clip(s) are returned as a torch tensor: shape (1, S) if k=1, otherwise (k, 1, S), where S is the sample length. The sample rate is 24 kHz. (For saving multiple candidates when k > 1, see the sketch after the examples below.)
- English text: Thanks for reading this article. I hope you learned something.
- Chinese text: 感謝您閱讀本文。我希望你學到了一些東西。
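When more than one candidate is requested via k, the return value is no longer a single clip, so saving takes a small loop. A hedged sketch; depending on the Tortoise version, gen may come back as a list of tensors or a stacked tensor, so both cases are handled:
# Ask for several candidate clips and save each one (k=3 is only an example).
gen = tts.tts_with_preset(text, preset=preset, k=3)
candidates = gen if isinstance(gen, (list, tuple)) else [g for g in gen]
for i, clip in enumerate(candidates):
    wav = clip.squeeze().cpu()      # drop singleton batch/channel dims
    if wav.dim() == 1:              # torchaudio.save expects (channels, samples)
        wav = wav.unsqueeze(0)
    torchaudio.save(f'generated-candidate-{i}.wav', wav, 24000)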
Parameters
- preset:
- ultra_fast : Produces speech at a speed which belies the name of this repo. (Not really, but it's definitely fastest).
- fast : Decent quality speech at a decent inference rate. A good choice for mass inference.
- standard : Very good quality. This is generally about as good as you are going to get.
- high_quality : Use if you want the absolute best. This is not really worth the compute, though.
- temperature
- length_penalty
- repetition_penalty
- top_p
- cond_free_k
- diffusion_temperature
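These knobs are not documented further in this post; as far as I can tell from api.py they are sampling and diffusion settings that each preset fills in, and tts_with_preset() accepts them as keyword arguments to override the preset defaults. A sketch under that assumption (the values and comments are illustrative, not tuned recommendations):
# Override individual generation knobs on top of a preset (values are only examples).
gen = tts.tts_with_preset(
    text,
    preset="standard",
    temperature=0.8,            # softmax temperature of the autoregressive model
    length_penalty=1.0,         # penalizes overly long autoregressive outputs
    repetition_penalty=2.0,     # discourages repeated tokens / long silences
    top_p=0.8,                  # nucleus sampling cutoff
    cond_free_k=2.0,            # weight of the conditioning-free diffusion guidance
    diffusion_temperature=1.0,  # variance of the noise fed to the diffusion model
)
torchaudio.save('generated-tuned.wav', gen.squeeze(0).cpu(), 24000)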
Custom voice notes
- Pick "clean" audio:
  - Apart from the voice itself, the clips must contain no background noise or music (audiobooks are the best source).
- Avoid audio that is overly breathy or very low-pitched:
  - The program cannot handle it properly and will produce noise and strange artifacts.
- Watch the volume:
  - The clips should not be too loud. If you use an audiobook, check the waveform in Audacity; the segments you pick should resemble the overall waveform. To keep the synthesized audio from turning into noise or strange sounds, lower the volume by at least 6 dB.
- Pick a suitable number and length of clips:
  - 3-5 clips of 8-12 seconds each are recommended, with some variation in intonation within each clip. If the delivery is too flat, the synthesized voice will sound emotionless.
- Mind differences in timbre:
  - Even the same narrator sounds different across recordings, for example the same voice actor across different audiobooks. Choosing material with clearly distinct timbres helps improve the synthesized result.
- Audio encoding requirements:
  - Although the developer's README says the WAV files must be floating-point encoded, in practice anything sampled at 22050 Hz and mono should be fine. Even if synthesis prints a "Chunk (non-data) not understood" error, don't worry; the audio will still be generated normally. A preprocessing sketch follows this list.
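To apply these notes mechanically, each clip can be run through a small preprocessing pass before it goes into tortoise/voices/<name>/. A sketch using torchaudio; the input path is a placeholder, the -6 dB figure comes from the list above, and the float-WAV encoding flags are assumptions about your torchaudio backend:
import torchaudio
import torchaudio.functional as AF

def prepare_clip(src_path, dst_path, gain_db=-6.0, target_sr=22050):
    wav, sr = torchaudio.load(src_path)
    wav = wav.mean(dim=0, keepdim=True)                 # downmix to mono
    if sr != target_sr:
        wav = AF.resample(wav, sr, target_sr)           # resample to 22050 Hz
    wav = wav * (10.0 ** (gain_db / 20.0))              # lower volume by ~6 dB
    # Save as 32-bit float WAV, as suggested by the upstream README.
    torchaudio.save(dst_path, wav, target_sr, encoding="PCM_F", bits_per_sample=32)

# Example: prepare a raw recording for the "martin" custom voice used above.
prepare_clip("raw/clip0.wav", "tortoise/voices/martin/0.wav")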