---
license: apache-2.0
datasets:
- dolly-vn/dolly-audio-1000h-vietnamese
base_model:
- k2-fsa/OmniVoice
pipeline_tag: text-to-speech
tags:
- voice-cloning
- multilingual
- zero-shot
- voice-design
language:
- aae
- aal
- aao
- ab
- abb
- abn
- abr
- abs
- abv
- acm
- acw
- acx
- adf
- adx
- ady
- aeb
- aec
- af
- afb
- afo
- ahl
- ahs
- ajg
- aju
- ala
- aln
- alo
- am
- amu
- an
- anc
- ank
- anp
- anw
- aom
- apc
- apd
- arb
- arq
- ars
- ary
- arz
- as
- ast
- avl
- awo
- ayl
- ayp
- az
- ba
- bag
- bas
- bax
- bba
- bbj
- bbl
- bbu
- bce
- bci
- bcs
- bcy
- bda
- bde
- bdm
- be
- beb
- bew
- bfd
- bft
- bg
- bgp
- bhb
- bhh
- bho
- bhp
- bhr
- bjj
- bjk
- bjn
- bjt
- bkh
- bkm
- bky
- bmm
- bmq
- bn
- bnm
- bnn
- bns
- bo
- bou
- bqg
- br
- bra
- brh
- bri
- brx
- bs
- bsh
- bsj
- bsk
- btm
- btv
- bug
- bum
- buo
- bux
- bwr
- bxf
- byc
- bys
- byv
- byx
- bzc
- bzw
- ca
- ccg
- ceb
- cen
- cfa
- cgg
- chq
- cjk
- ckb
- ckl
- ckr
- cky
- cnh
- cpy
- cs
- cte
- ctl
- cut
- cux
- cv
- cy
- da
- dag
- dar
- dav
- dbd
- dcc
- de
- deg
- dgh
- dgo
- dje
- dmk
- dml
- dru
- dty
- dua
- dv
- dyu
- dzg
- ebr
- ebu
- ego
- eiv
- eko
- ekr
- el
- elm
- en
- eo
- es
- esu
- et
- eto
- ets
- etu
- eu
- ewo
- ext
- eyo
- fa
- fan
- fat
- ff
- ffm
- fi
- fia
- fil
- fip
- fkk
- fmp
- fr
- fub
- fuc
- fue
- fuf
- fuh
- fui
- fuq
- fuv
- fy
- ga
- gbm
- gbr
- gby
- gcc
- gdf
- gej
- ges
- ggg
- gid
- gig
- giz
- gjk
- gju
- gl
- glw
- gn
- gol
- gom
- gsl
- gu
- gui
- gur
- guz
- gv
- gwc
- gwe
- gwt
- gya
- gyz
- ha
- hah
- hao
- haw
- haz
- hbb
- he
- hem
- hi
- hia
- hkk
- hla
- hno
- hoj
- hr
- hsb
- ht
- hu
- hue
- hul
- hux
- hwo
- hy
- hz
- ia
- ibb
- id
- ida
- idu
- ig
- ijc
- ijn
- ik
- ikw
- is
- ish
- iso
- it
- its
- itw
- itz
- ja
- jal
- jax
- jgo
- jmx
- jns
- jqr
- juk
- juo
- jv
- ka
- kab
- kai
- kaj
- kam
- kbd
- kbl
- kbt
- kcq
- kdh
- kea
- keu
- kfe
- kfk
- kfp
- khg
- khw
- kj
- kjc
- kjk
- kk
- kln
- kls
- km
- kmr
- kmy
- kn
- kna
- knn
- ko
- kol
- koo
- kpo
- kqo
- ks
- ksd
- ksf
- kto
- kuh
- kvx
- kw
- kwm
- kxp
- ky
- kyx
- lag
- lb
- lcm
- ldb
- lg
- lij
- lir
- lkb
- lla
- ln
- lnu
- lo
- loa
- lrk
- lss
- lt
- ltg
- lto
- lua
- luo
- lus
- lv
- lwg
- mab
- maf
- mai
- mau
- max
- mbo
- mcf
- mcn
- mcx
- mdd
- mde
- mdf
- mek
- mer
- meu
- mfm
- mfn
- mfo
- mfv
- mgg
- mgi
- mhk
- mhr
- mi
- mig
- miu
- mk
- mkf
- mki
- ml
- mlq
- mn
- mne
- mni
- mqy
- mr
- mrj
- mrr
- mrt
- ms
- mse
- msh
- msw
- mt
- mtr
- mtu
- mtx
- mua
- mug
- mui
- mve
- mvy
- mxs
- mxu
- mxy
- my
- myv
- mzl
- nal
- nan
- nap
- nb
- nbh
- ncf
- nco
- ncx
- ndi
- ng
- ngi
- nhg
- nhi
- nhn
- nhq
- nja
- nl
- nla
- nlv
- nmg
- nmz
- nn
- nnh
- 'no'
- noe
- npi
- nso
- ny
- nyu
- oc
- odk
- odu
- ogo
- om
- orc
- oru
- ory
- os
- pa
- pbs
- pbt
- pbu
- pcm
- pex
- phl
- phr
- pip
- piy
- pko
- pl
- plk
- plt
- pmq
- pms
- pmy
- pnb
- poc
- poe
- pow
- prq
- ps
- pst
- pt
- pua
- pwn
- qug
- qum
- qup
- qur
- qus
- quv
- qux
- quy
- qva
- qvi
- qvj
- qvl
- qwa
- qws
- qxa
- qxp
- qxt
- qxu
- qxw
- rag
- rm
- ro
- rob
- rof
- roo
- rth
- ru
- rup
- rw
- sa
- sah
- sat
- sau
- say
- sbn
- sc
- scl
- scn
- sd
- sei
- shu
- si
- sip
- siw
- sjr
- sk
- skg
- skr
- sl
- sn
- snc
- snk
- so
- sol
- sps
- sq
- sr
- src
- sro
- ssi
- ste
- sua
- sv
- sva
- sw
- szy
- ta
- tan
- tar
- tay
- tbf
- tcf
- tcy
- tdn
- tdx
- te
- tg
- tgc
- th
- the
- thq
- thr
- thv
- ti
- tig
- tio
- tk
- tkg
- tkt
- tli
- tlp
- tn
- tok
- tpl
- tpz
- tqp
- tr
- trp
- trq
- trv
- trw
- tt
- ttj
- ttr
- ttu
- tui
- tul
- tuq
- tuv
- tuy
- tvo
- tvu
- tw
- twu
- txs
- txy
- udl
- ug
- uk
- uki
- umb
- ur
- ush
- uz
- uzn
- vai
- var
- ver
- vi
- vmc
- vmj
- vmm
- vmp
- vmz
- vot
- vro
- wbl
- wci
- weo
- wes
- wja
- wji
- wo
- wof
- xh
- xhe
- xka
- xmf
- xmv
- xmw
- xpe
- xti
- xtu
- yaq
- yav
- yay
- ydd
- ydg
- yer
- 'yes'
- yi
- yo
- yue
- zga
- zgh
- zh
- zoc
- zoh
- zor
- zpv
- zpy
- ztg
- ztn
- ztp
- zts
- ztu
- zu
- zza
---

# OmniVoice Vietnamese: Fine-tuned for Vietnamese Speech Synthesis

A fine-tuned version of [OmniVoice](https://huggingface.co/k2-fsa/OmniVoice), trained on about 1,000 hours of high-quality Vietnamese speech data and optimized for Vietnamese voice cloning and text-to-speech.

- **Model:** [splendor1811/omnivoice-vietnamese](https://huggingface.co/splendor1811/omnivoice-vietnamese)
- **Base model:** [k2-fsa/OmniVoice](https://huggingface.co/k2-fsa/OmniVoice) (Qwen3-0.6B backbone, diffusion language model)
- **Dataset:** [dolly-vn/dolly-audio-1000h-vietnamese](https://huggingface.co/datasets/dolly-vn/dolly-audio-1000h-vietnamese)
- **License:** Apache 2.0 (model), CC-BY-NC-SA-4.0 (dataset)

## Training Details

| Parameter | Value |
|---|---|
| Base model | k2-fsa/OmniVoice (0.6B params) |
| Architecture | Diffusion Language Model (non-autoregressive, iterative masked decoding) |
| Training steps | 8,000 |
| Training time | 6 hours |
| Hardware | NVIDIA H200 SXM (141 GB VRAM) |
| Output sample rate | 24,000 Hz |

## Dataset

[Dolly-Audio: Vietnamese Multi-Speaker High-Quality Speech Corpus](https://huggingface.co/datasets/dolly-vn/dolly-audio-1000h-vietnamese)

| Property | Value |
|---|---|
| Duration | ~1,000 hours |
| Samples | 664,125 |
| Speakers | 152 (multi-region, diverse accents) |
| Language | Vietnamese (100%) |
| Audio duration | 0.63 – 32.1 seconds per sample |
| Quality | Cleaned, noise-free, sentence-level boundary trimming |
| Domains | News, entertainment, education, conversational |
| License | CC-BY-NC-SA-4.0 (research use only) |

## Installation

### Step 1: Install PyTorch (NVIDIA GPU)

```bash
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
```

### Step 2: Install OmniVoice

```bash
pip install omnivoice
```

## Usage

### Python API

```python
import torch
import torchaudio
from omnivoice import OmniVoice

# Load the Vietnamese fine-tuned model
model = OmniVoice.from_pretrained(
    "splendor1811/omnivoice-vietnamese",
    device_map="cuda:0",
    dtype=torch.float16,
)

# Zero-shot voice cloning
audio = model.generate(
    text="Xin chào, đây là mô hình tổng hợp giọng nói tiếng Việt.",
    language="vietnamese",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
)

# Save the first generated waveform at the model's 24 kHz output rate
torchaudio.save("output.wav", audio[0], 24000)
```

### With Cached Voice Prompt (recommended for serving)

```python
# Assumes `model` was loaded as in the Python API example above.

# Create the voice prompt once (caches the encoded reference audio)
voice_prompt = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
)

# Reuse it for multiple generations without re-encoding the reference
audio = model.generate(
    text="Em chào anh, em gọi từ tổng đài ngân hàng.",
    language="vietnamese",
    voice_clone_prompt=voice_prompt,
)
```

### With torch.compile (recommended for production)

```python
# Assumes `model` and `voice_prompt` from the examples above.
import torch
from omnivoice import OmniVoiceGenerationConfig

# Apply torch.compile for faster inference
torch.set_float32_matmul_precision("high")
model.llm = torch.compile(model.llm, mode="reduce-overhead", dynamic=True)

# Warmup (triggers compilation)
config = OmniVoiceGenerationConfig(num_step=8, guidance_scale=2.0)
for _ in range(3):
    model.generate(
        text="Xin chào.",
        language="vietnamese",
        voice_clone_prompt=voice_prompt,
        generation_config=config,
    )

# Production inference at num_step=8 for speed
audio = model.generate(
    text="Dạ chào anh, anh có cần hỗ trợ gì không ạ?",
    language="vietnamese",
    voice_clone_prompt=voice_prompt,
    generation_config=config,
)
```

## Performance

Benchmarked on an NVIDIA L4 GPU with `num_step=8` and `torch.compile(mode="reduce-overhead")`:

| Metric | Value |
|---|---|
| RTF (Real-Time Factor) | ~0.07 (14x real-time) |
| TTFB (time to first byte, short sentence) | ~210 ms |
| P95 TTFB at CCU=4 (4 concurrent users) | ~1.26 s |
| Sample rate | 24,000 Hz |

## Base Model: OmniVoice

[OmniVoice](https://huggingface.co/k2-fsa/OmniVoice) is a massively multilingual zero-shot TTS model supporting 600+ languages, built on a diffusion language model architecture with Qwen3-0.6B as the backbone.

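
The real-time factors quoted in this card can be related to their "x real-time" speedups with a little arithmetic. The sketch below assumes the common definition, RTF = synthesis time divided by the duration of the generated audio, and a nearest-rank 95th percentile for latency. The card does not state how its measurements were taken, so treat this as an illustration only. The function names `real_time_factor`, `speedup`, and `p95` are hypothetical helpers and are not part of the `omnivoice` package.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: time spent synthesizing / duration of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup(rtf: float) -> float:
    """How many times faster than real time a given RTF is (1 / RTF)."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


def p95(latencies):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies:
        raise ValueError("latencies must not be empty")
    ordered = sorted(latencies)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding issues
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical measurement: 0.7 s of compute for 10 s of generated audio
    rtf = real_time_factor(0.7, 10.0)
    print(f"RTF {rtf:.3f} is about {speedup(rtf):.1f}x real-time")
    print(f"RTF 0.025 is about {speedup(0.025):.0f}x real-time")
```

Under this definition, an RTF of about 0.07 corresponds to roughly 14x real-time, which matches the figure in the Performance table.
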
Key features of the base model:

- **600+ languages**: broadest coverage among zero-shot TTS models
- **Voice cloning**: state-of-the-art from short reference audio
- **Voice design**: control via speaker attributes (gender, age, pitch, accent)
- **Non-verbal symbols**: `[laughter]`, `[breath]`, etc.
- **Fast inference**: RTF as low as 0.025 (40x real-time)

## Citation

```bibtex
@article{zhu2026omnivoice,
  title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
  author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
  journal={arXiv preprint arXiv:2604.00688},
  year={2026}
}

@dataset{dolly_audio_2025,
  title={Dolly-Audio: Vietnamese Multi-Speaker High-Quality Speech Corpus},
  author={Nguyen, Vinh Huy and Nguyen, Dinh Thuan},
  year={2025},
  publisher={Dolly AI Team},
  howpublished={\url{https://huggingface.co/datasets/dolly-vn/dolly-audio-1000h-vietnamese}},
  note={Released under CC-BY-NC-SA-4.0. Research use only.}
}
```