---
license: apache-2.0
datasets:
- dolly-vn/dolly-audio-1000h-vietnamese
base_model:
- k2-fsa/OmniVoice
pipeline_tag: text-to-speech
tags:
- voice-cloning
- multilingual
- zero-shot
- voice-design
language:
- aae
- aal
- aao
- ab
- abb
- abn
- abr
- abs
- abv
- acm
- acw
- acx
- adf
- adx
- ady
- aeb
- aec
- af
- afb
- afo
- ahl
- ahs
- ajg
- aju
- ala
- aln
- alo
- am
- amu
- an
- anc
- ank
- anp
- anw
- aom
- apc
- apd
- arb
- arq
- ars
- ary
- arz
- as
- ast
- avl
- awo
- ayl
- ayp
- az
- ba
- bag
- bas
- bax
- bba
- bbj
- bbl
- bbu
- bce
- bci
- bcs
- bcy
- bda
- bde
- bdm
- be
- beb
- bew
- bfd
- bft
- bg
- bgp
- bhb
- bhh
- bho
- bhp
- bhr
- bjj
- bjk
- bjn
- bjt
- bkh
- bkm
- bky
- bmm
- bmq
- bn
- bnm
- bnn
- bns
- bo
- bou
- bqg
- br
- bra
- brh
- bri
- brx
- bs
- bsh
- bsj
- bsk
- btm
- btv
- bug
- bum
- buo
- bux
- bwr
- bxf
- byc
- bys
- byv
- byx
- bzc
- bzw
- ca
- ccg
- ceb
- cen
- cfa
- cgg
- chq
- cjk
- ckb
- ckl
- ckr
- cky
- cnh
- cpy
- cs
- cte
- ctl
- cut
- cux
- cv
- cy
- da
- dag
- dar
- dav
- dbd
- dcc
- de
- deg
- dgh
- dgo
- dje
- dmk
- dml
- dru
- dty
- dua
- dv
- dyu
- dzg
- ebr
- ebu
- ego
- eiv
- eko
- ekr
- el
- elm
- en
- eo
- es
- esu
- et
- eto
- ets
- etu
- eu
- ewo
- ext
- eyo
- fa
- fan
- fat
- ff
- ffm
- fi
- fia
- fil
- fip
- fkk
- fmp
- fr
- fub
- fuc
- fue
- fuf
- fuh
- fui
- fuq
- fuv
- fy
- ga
- gbm
- gbr
- gby
- gcc
- gdf
- gej
- ges
- ggg
- gid
- gig
- giz
- gjk
- gju
- gl
- glw
- gn
- gol
- gom
- gsl
- gu
- gui
- gur
- guz
- gv
- gwc
- gwe
- gwt
- gya
- gyz
- ha
- hah
- hao
- haw
- haz
- hbb
- he
- hem
- hi
- hia
- hkk
- hla
- hno
- hoj
- hr
- hsb
- ht
- hu
- hue
- hul
- hux
- hwo
- hy
- hz
- ia
- ibb
- id
- ida
- idu
- ig
- ijc
- ijn
- ik
- ikw
- is
- ish
- iso
- it
- its
- itw
- itz
- ja
- jal
- jax
- jgo
- jmx
- jns
- jqr
- juk
- juo
- jv
- ka
- kab
- kai
- kaj
- kam
- kbd
- kbl
- kbt
- kcq
- kdh
- kea
- keu
- kfe
- kfk
- kfp
- khg
- khw
- kj
- kjc
- kjk
- kk
- kln
- kls
- km
- kmr
- kmy
- kn
- kna
- knn
- ko
- kol
- koo
- kpo
- kqo
- ks
- ksd
- ksf
- kto
- kuh
- kvx
- kw
- kwm
- kxp
- ky
- kyx
- lag
- lb
- lcm
- ldb
- lg
- lij
- lir
- lkb
- lla
- ln
- lnu
- lo
- loa
- lrk
- lss
- lt
- ltg
- lto
- lua
- luo
- lus
- lv
- lwg
- mab
- maf
- mai
- mau
- max
- mbo
- mcf
- mcn
- mcx
- mdd
- mde
- mdf
- mek
- mer
- meu
- mfm
- mfn
- mfo
- mfv
- mgg
- mgi
- mhk
- mhr
- mi
- mig
- miu
- mk
- mkf
- mki
- ml
- mlq
- mn
- mne
- mni
- mqy
- mr
- mrj
- mrr
- mrt
- ms
- mse
- msh
- msw
- mt
- mtr
- mtu
- mtx
- mua
- mug
- mui
- mve
- mvy
- mxs
- mxu
- mxy
- my
- myv
- mzl
- nal
- nan
- nap
- nb
- nbh
- ncf
- nco
- ncx
- ndi
- ng
- ngi
- nhg
- nhi
- nhn
- nhq
- nja
- nl
- nla
- nlv
- nmg
- nmz
- nn
- nnh
- 'no'
- noe
- npi
- nso
- ny
- nyu
- oc
- odk
- odu
- ogo
- om
- orc
- oru
- ory
- os
- pa
- pbs
- pbt
- pbu
- pcm
- pex
- phl
- phr
- pip
- piy
- pko
- pl
- plk
- plt
- pmq
- pms
- pmy
- pnb
- poc
- poe
- pow
- prq
- ps
- pst
- pt
- pua
- pwn
- qug
- qum
- qup
- qur
- qus
- quv
- qux
- quy
- qva
- qvi
- qvj
- qvl
- qwa
- qws
- qxa
- qxp
- qxt
- qxu
- qxw
- rag
- rm
- ro
- rob
- rof
- roo
- rth
- ru
- rup
- rw
- sa
- sah
- sat
- sau
- say
- sbn
- sc
- scl
- scn
- sd
- sei
- shu
- si
- sip
- siw
- sjr
- sk
- skg
- skr
- sl
- sn
- snc
- snk
- so
- sol
- sps
- sq
- sr
- src
- sro
- ssi
- ste
- sua
- sv
- sva
- sw
- szy
- ta
- tan
- tar
- tay
- tbf
- tcf
- tcy
- tdn
- tdx
- te
- tg
- tgc
- th
- the
- thq
- thr
- thv
- ti
- tig
- tio
- tk
- tkg
- tkt
- tli
- tlp
- tn
- tok
- tpl
- tpz
- tqp
- tr
- trp
- trq
- trv
- trw
- tt
- ttj
- ttr
- ttu
- tui
- tul
- tuq
- tuv
- tuy
- tvo
- tvu
- tw
- twu
- txs
- txy
- udl
- ug
- uk
- uki
- umb
- ur
- ush
- uz
- uzn
- vai
- var
- ver
- vi
- vmc
- vmj
- vmm
- vmp
- vmz
- vot
- vro
- wbl
- wci
- weo
- wes
- wja
- wji
- wo
- wof
- xh
- xhe
- xka
- xmf
- xmv
- xmw
- xpe
- xti
- xtu
- yaq
- yav
- yay
- ydd
- ydg
- yer
- 'yes'
- yi
- yo
- yue
- zga
- zgh
- zh
- zoc
- zoh
- zor
- zpv
- zpy
- ztg
- ztn
- ztp
- zts
- ztu
- zu
- zza
---
# OmniVoice Vietnamese — Fine-tuned for Vietnamese Speech Synthesis

A fine-tuned version of [OmniVoice](https://huggingface.co/k2-fsa/OmniVoice), trained on roughly 1,000 hours of high-quality Vietnamese speech and optimized for Vietnamese voice cloning and text-to-speech.

- **Model:** [splendor1811/omnivoice-vietnamese](https://huggingface.co/splendor1811/omnivoice-vietnamese)
- **Base model:** [k2-fsa/OmniVoice](https://huggingface.co/k2-fsa/OmniVoice) (Qwen3-0.6B backbone, diffusion language model)
- **Dataset:** [dolly-vn/dolly-audio-1000h-vietnamese](https://huggingface.co/datasets/dolly-vn/dolly-audio-1000h-vietnamese)
- **License:** Apache 2.0 (model), CC-BY-NC-SA-4.0 (dataset)

## Training Details

| Parameter | Value |
|---|---|
| Base model | k2-fsa/OmniVoice (0.6B params) |
| Architecture | Diffusion Language Model (non-autoregressive, iterative masked decoding) |
| Training steps | 8,000 |
| Training time | 6 hours |
| Hardware | NVIDIA H200 SXM (141 GB VRAM) |
| Output sample rate | 24,000 Hz |

## Dataset

[Dolly-Audio: Vietnamese Multi-Speaker High-Quality Speech Corpus](https://huggingface.co/datasets/dolly-vn/dolly-audio-1000h-vietnamese)

| Property | Value |
|---|---|
| Duration | ~1,000 hours |
| Samples | 664,125 |
| Speakers | 152 (multi-region, diverse accents) |
| Language | Vietnamese (100%) |
| Audio duration | 0.63 – 32.1 seconds per sample |
| Quality | Cleaned, noise-free, sentence-level boundary trimming |
| Domains | News, entertainment, education, conversational |
| License | CC-BY-NC-SA-4.0 (research use only) |

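The table's totals imply some rough per-clip and per-speaker averages. The sketch below derives them from the figures above; since the duration is listed as approximately 1,000 hours, the results are estimates, not published statistics.

```python
# Figures copied from the Dataset table; "~1,000 hours" is approximate.
TOTAL_HOURS = 1_000
NUM_SAMPLES = 664_125
NUM_SPEAKERS = 152

# Average clip length in seconds and average audio per speaker in hours.
mean_clip_seconds = TOTAL_HOURS * 3600 / NUM_SAMPLES
mean_hours_per_speaker = TOTAL_HOURS / NUM_SPEAKERS

print(f"mean clip length: {mean_clip_seconds:.2f} s")
print(f"mean audio per speaker: {mean_hours_per_speaker:.1f} h")
```
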
## Installation

### Step 1: Install PyTorch (NVIDIA GPU, CUDA 12.8 build)

```bash
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
```

### Step 2: Install OmniVoice

```bash
pip install omnivoice
```

## Usage

### Python API

```python
import torch
import torchaudio
from omnivoice import OmniVoice

# Load the Vietnamese fine-tuned model
model = OmniVoice.from_pretrained(
    "splendor1811/omnivoice-vietnamese",
    device_map="cuda:0",
    dtype=torch.float16,
)

# Zero-shot voice cloning
audio = model.generate(
    text="Xin chào, đây là mô hình tổng hợp giọng nói tiếng Việt.",
    language="vietnamese",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
)

# Write the result at the model's 24 kHz output sample rate
torchaudio.save("output.wav", audio[0], 24000)
```

### With Cached Voice Prompt (recommended for serving)

```python
# Reuses `model` from the Python API example above.
# Create the voice prompt once (caches the encoded reference audio)
voice_prompt = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
)

# Reuse for multiple generations — no re-encoding cost
audio = model.generate(
    text="Em chào anh, em gọi từ tổng đài ngân hàng.",
    language="vietnamese",
    voice_clone_prompt=voice_prompt,
)
```
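
When serving several voices, the prompt can be built once per reference and looked up afterwards. The following is a minimal sketch of that pattern, not part of the OmniVoice package: `VoicePromptCache` is a hypothetical helper, and it assumes only that the builder accepts `ref_audio=` and `ref_text=` keywords, as `model.create_voice_clone_prompt` does in the example above.

```python
from typing import Any, Callable, Dict, Tuple


class VoicePromptCache:
    """Hypothetical helper: builds each voice prompt once and reuses it."""

    def __init__(self, build_prompt: Callable[..., Any]) -> None:
        # build_prompt must accept ref_audio= and ref_text= keyword arguments.
        self._build_prompt = build_prompt
        self._prompts: Dict[Tuple[str, str], Any] = {}

    def get(self, ref_audio: str, ref_text: str) -> Any:
        # Key on both the audio path and its transcript.
        key = (ref_audio, ref_text)
        if key not in self._prompts:
            self._prompts[key] = self._build_prompt(
                ref_audio=ref_audio, ref_text=ref_text
            )
        return self._prompts[key]


# Usage with the model loaded earlier (shown as comments, not executed here):
# cache = VoicePromptCache(model.create_voice_clone_prompt)
# voice_prompt = cache.get("reference.wav", "Transcript of the reference audio.")
```

The cache is a plain dictionary with no eviction or locking, so a multi-threaded server would likely need to add both.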

### With torch.compile (recommended for production)

```python
import torch
from omnivoice import OmniVoiceGenerationConfig

# Reuses `model` and `voice_prompt` from the examples above.

# Apply torch.compile for faster inference
torch.set_float32_matmul_precision("high")
model.llm = torch.compile(model.llm, mode="reduce-overhead", dynamic=True)

# Warmup (triggers compilation)
config = OmniVoiceGenerationConfig(num_step=8, guidance_scale=2.0)
for _ in range(3):
    model.generate(
        text="Xin chào.",
        language="vietnamese",
        voice_clone_prompt=voice_prompt,
        generation_config=config,
    )

# Production inference at num_step=8 for speed
audio = model.generate(
    text="Dạ chào anh, anh có cần hỗ trợ gì không ạ?",
    language="vietnamese",
    voice_clone_prompt=voice_prompt,
    generation_config=config,
)
```

## Performance

Benchmarked on an NVIDIA L4 GPU with `num_step=8` and `torch.compile(mode="reduce-overhead")`:

| Metric | Value |
|---|---|
| RTF (Real-Time Factor) | ~0.07 (14x real-time) |
| TTFB (time to first byte, short sentence) | ~210 ms |
| P95 TTFB at 4 concurrent users (CCU=4) | ~1.26 s |
| Sample rate | 24,000 Hz |

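For reference, RTF is processing time divided by the duration of the generated audio, so an RTF of 0.07 corresponds to about 14x real-time. The sketch below shows that arithmetic and a nearest-rank P95. Both helper functions are hypothetical, and the input numbers are illustrative only, not measurements from this model.

```python
import math
from typing import Sequence


def real_time_factor(
    processing_seconds: float, num_samples: int, sample_rate: int = 24_000
) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    return processing_seconds / audio_seconds


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Illustrative inputs: 0.35 s of compute for 5 s of 24 kHz audio.
rtf = real_time_factor(processing_seconds=0.35, num_samples=5 * 24_000)
print(f"RTF = {rtf:.2f}, speed-up = {1 / rtf:.1f}x real-time")

# Made-up TTFB samples in milliseconds, used only to exercise the helper.
ttfb_ms = [190, 205, 210, 215, 230, 240, 260, 300, 410, 900]
print(f"P95 TTFB = {percentile_nearest_rank(ttfb_ms, 95)} ms")
```
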
## Base Model: OmniVoice

[OmniVoice](https://huggingface.co/k2-fsa/OmniVoice) is a massively multilingual zero-shot TTS model supporting 600+ languages, built on a diffusion language model architecture with Qwen3-0.6B as the backbone.

Key features of the base model:
- **600+ languages** — broadest coverage among zero-shot TTS models
- **Voice cloning** — state-of-the-art from short reference audio
- **Voice design** — control via speaker attributes (gender, age, pitch, accent)
- **Non-verbal symbols** — `[laughter]`, `[breath]`, etc.
- **Fast inference** — RTF as low as 0.025 (40x real-time)

## Citation

```bibtex
@article{zhu2026omnivoice,
  title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
  author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
  journal={arXiv preprint arXiv:2604.00688},
  year={2026}
}

@dataset{dolly_audio_2025,
  title={Dolly-Audio: Vietnamese Multi-Speaker High-Quality Speech Corpus},
  author={Nguyen, Vinh Huy and Nguyen, Dinh Thuan},
  year={2025},
  publisher={Dolly AI Team},
  howpublished={\url{https://huggingface.co/datasets/dolly-vn/dolly-audio-1000h-vietnamese}},
  note={Released under CC-BY-NC-SA-4.0. Research use only.}
}
```