Tag: voice recognition

Android SpeechRecognizer

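On Android 14 (API level 34) and later, android.speech.SpeechRecognizer provides standard speech recognition, letting an app convert audio input into text through the system's default recognition service or a specified one. Its main functions are creating a recognizer instance, checking whether recognition is available on the system, starting and stopping a recognition session, cancelling or destroying the recognizer, and receiving results and error messages through a callback interface. Android 14 also tightens behavioral constraints around main-thread calls and foreground service type declarations, so developers need to take care to avoid ANRs and to configure the service type correctly. In addition, the API has limitations in offline versus cloud recognition, continuous recognition, and power consumption, so the implementation should be chosen to fit the application's scenario. This document covers five parts: feature overview, core methods, callback mechanism, permissions and configuration, and Android 14 features and limitations.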

Aya · About 6 min

WhisperX

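WhisperX is an enhancement tool built on Whisper, OpenAI's open-source speech recognition model. It focuses on some limitations of standard Whisper in automatic speech recognition (ASR) applications, in particular precise timestamps and speaker diarization.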

whisperx-pipeline

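WhisperX Features

1. Precise Timestamps (Word-level Alignment)

Characteristics:
- Whisper natively provides only utterance-level (phrase-level) timestamps, which may not be precise enough for applications such as subtitle generation.
- WhisperX integrates a phoneme alignment algorithm to provide word-level timestamps, so that the start and end time of each word are more accurate.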
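Technology used:
- Forced alignment against a phoneme-based acoustic model such as wav2vec 2.0, which matches the transcript to the audio's acoustic features.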
Use cases:
- Subtitle generation (accurate to each word).
- Time-sensitive speech-to-text applications (such as speech retrieval and index building).
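
As a rough illustration of the subtitle use case, the sketch below turns a list of word-level timestamps into SRT cues. It does not call WhisperX itself. The input shape (dicts with `word`, `start`, and `end` keys, times in seconds) and the helper names `format_srt_time` and `words_to_srt` are assumptions made for this example.

```python
from typing import Dict, List, Union

Word = Dict[str, Union[str, float]]


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: List[Word], max_words: int = 3) -> str:
    """Group consecutive words into cues of at most max_words words.

    Each cue starts at its first word's start time and ends at its
    last word's end time (hypothetical input format, see lead-in).
    """
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = format_srt_time(float(chunk[0]["start"]))
        end = format_srt_time(float(chunk[-1]["end"]))
        text = " ".join(str(w["word"]) for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


if __name__ == "__main__":
    sample = [
        {"word": "hello", "start": 0.5, "end": 0.75},
        {"word": "world", "start": 1.0, "end": 1.25},
    ]
    print(words_to_srt(sample, max_words=2))
```

Grouping by a fixed word count keeps the example short; a real subtitle tool would more likely also cap cue duration and line length.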

Aya · About 5 min

Coqui.ai TTS

web demo

TTS Performance

Performance from git

Underlined "TTS*" and "Judy*" are internal 🐸TTS models that are not released open-source. They are here to show the potential. Models prefixed with a dot (.Jofish, .Abe and .Janice) are real human voices.

Aya · About 3 min

Tortoise TTS

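Tortoise TTS is a text-to-speech program that converts text into realistic speech. It offers multiple voices and can mimic the timbre and intonation of different speakers, so you can choose a voice style to suit your needs. The Tortoise TTS source code includes everything needed to run the model in inference mode.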

Web UI tool

sample code

colab sample

Aya · About 3 min

Open Source Text-to-Speech Models (TTS)

Started to save /u/M4xM9450’s comment on the topic of open source TTS models.
Disclaimer: I’m far from an expert in this field, but I saw some desire to have a shared resource.
Please feel free to suggest or comment to clean this up or extend as you see fit.

Neural TTS Models

Aya · About 3 min

Fine-Tune Whisper For Multilingual ASR with huggingface Transformers

All 11 of the pre-trained checkpoints are available on the Hugging Face Hub. The checkpoints are summarised in the following table with links to the models on the Hub:

Size	Layers	Width	Heads	Parameters	English-only	Multilingual
tiny	4	384	6	39 M	✓	✓
base	6	512	8	74 M	✓	✓
small	12	768	12	244 M	✓	✓
medium	24	1024	16	769 M	✓	✓
large	32	1280	20	1550 M	x	✓
large-v2	32	1280	20	1550 M	x	✓
large-v3	32	1280	20	1550 M	x	✓
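
The table can be read as a small lookup: four sizes come in both English-only and multilingual variants, and the three large variants are multilingual only. The sketch below builds a Hub model ID for a given size. The helper name `whisper_checkpoint_id` is hypothetical, and the `openai/whisper-<size>` pattern with a `.en` suffix for English-only checkpoints is assumed from the Hub's naming convention rather than stated in the text above.

```python
# Sizes listed in the table above.
ENGLISH_ONLY_SIZES = ("tiny", "base", "small", "medium")
MULTILINGUAL_SIZES = ENGLISH_ONLY_SIZES + ("large", "large-v2", "large-v3")


def whisper_checkpoint_id(size: str, english_only: bool = False) -> str:
    """Return the assumed Hugging Face Hub ID for a Whisper checkpoint.

    Raises ValueError when the table lists no checkpoint for the request,
    for example an English-only "large" model.
    """
    if english_only:
        if size not in ENGLISH_ONLY_SIZES:
            raise ValueError(f"no English-only checkpoint for size {size!r}")
        return f"openai/whisper-{size}.en"
    if size not in MULTILINGUAL_SIZES:
        raise ValueError(f"unknown Whisper size {size!r}")
    return f"openai/whisper-{size}"


def all_checkpoint_ids() -> list:
    """Enumerate every checkpoint in the table (4 English-only + 7 multilingual)."""
    ids = [whisper_checkpoint_id(s, english_only=True) for s in ENGLISH_ONLY_SIZES]
    ids += [whisper_checkpoint_id(s) for s in MULTILINGUAL_SIZES]
    return ids


if __name__ == "__main__":
    for checkpoint in all_checkpoint_ids():
        print(checkpoint)
```

An ID produced this way would typically be passed to a `from_pretrained` call in Transformers, which downloads the weights on first use.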

Aya · About 1 min

Robust Speech Recognition via Large-Scale Weak Supervision

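1. Introduction

Large-scale models pre-trained purely on speech have made strong progress recently (wav2vec 2.0 and others).

However, the authors argue that the goal of a speech recognition system should be to work out of the box in general environments, rather than requiring a dedicated decoder with supervised fine-tuning for every dataset.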

“The goal of a speech recognition system should be to work reliably "out of the box" in a broad range of environments without requiring supervised fine-tuning of a decoder for every deployment distribution.”

Aya · About 8 min

Whisper

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio (680,000 hours of multilingual and multitask supervised data) and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
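
One way to picture the multitask design: as described in the Whisper paper, the task is selected by special tokens placed at the start of the decoder input. The sketch below only assembles that token sequence as strings. It does not run the model, and the function name `build_decoder_prompt` is made up for this example.

```python
from typing import List


def build_decoder_prompt(language: str = "en",
                         task: str = "transcribe",
                         timestamps: bool = False) -> List[str]:
    """Assemble the special-token prefix that selects a Whisper task.

    language   -- language code of the audio, e.g. "en" or "zh"
    task       -- "transcribe" (same-language text) or "translate" (to English)
    timestamps -- if False, append the token that disables timestamp prediction
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task {task!r}")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens


if __name__ == "__main__":
    # Speech recognition in Chinese, without timestamps.
    print(build_decoder_prompt("zh", "transcribe"))
    # Speech translation from French audio into English text.
    print(build_decoder_prompt("fr", "translate", timestamps=True))
```

Language identification fits the same scheme: the model predicts the language token that follows `<|startoftranscript|>` instead of being given it.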

Aya · About 2 min