Fine-Tune Whisper For Multilingual ASR with huggingface Transformers

AyaAbout 1 min

Fine-Tune Whisper For Multilingual ASR with huggingface Transformers

All 11 of the pre-trained checkpoints are available on the Hugging Face Hub. The checkpoints are summarised in the following table with links to the models on the Hub:

Size	Layers	Width	Heads	Parameters	English-only	Multilingual
tiny	4	384	6	39 M	✓	✓
base	6	512	8	74 M	✓	✓
small	12	768	12	244 M	✓	✓
medium	24	1024	16	769 M	✓	✓
large	32	1280	20	1550 M	x	✓
large-v2	32	1280	20	1550 M	x	✓
large-v3	32	1280	20	1550 M	x	✓

*fine-tune時須考量local端硬體能夠支援的model大小

Fine-tuning Whisper in a Google Colab

colab: 採用small model來當預訓練模型，利用mozilla-foundation/common_voice_11_0 的這個資料集來 fine-tune 印地語。原先的 Whisper small模型的 WER 為 63.5%，經過 4000 steps後，最後使得該WER降到 32.0%。

Fine-tuning by own dataset

根據上面colab的sample code,這邊找了兩段語音，來對tiny model做微調。

語音分別為 :

"這個BenQ大約是2月底購買的" -> training
"又看到BenQ官方主打最適合Mac" -> test

tiny-model predction :

"這個Benz大約是2月底購買的"
"又看到Ben Koo官方主打最適合MET"

after Fine-tuning :

"BenQ 這是大約二月底購買的"
"又看到BenQ 官方主打最適合美"

可以看到BenQ、Mac在原先的model是不認識的，我們收集了有BenQ的語音後，就可以讓model成功辨識出該字串，在Mac的部分因為training data中還是沒有包含該字串，所以在 fine-tuning 後依然不認識該字串。

Sample code

whisper_fine_tune sample : flow the README.md, tuning self model.