The Viseme Based Continuous Speech Recognition System for a Talking Head
Abstract: To drive a talking-head animation from audio/visual input, this paper presents a continuous speech recognition system based on the viseme, the basic speech unit in the visual domain. Viseme hidden Markov models (HMMs) are trained to segment speech into viseme (mouth-shape) image sequences with timing boundaries. Trisemes are formalized to model the context dependency of visemes, but they require a very large amount of training data. Based on the 3D talking-head images, the viseme similarity weight (VSW) is defined, and 166 visual questions are designed for building triseme decision trees, which tie the states of trisemes with similar contexts so that they can share the same HMM parameters. For comparison, the output of a phoneme-based recognizer (the phoneme being the basic speech unit in the auditory domain) is also mapped to viseme sequences. For system evaluation, besides the recognition rate, an image-related measure, the viseme-similarity-weighted accuracy, accounts for the mismatches between the recognized viseme sequence and its reference, and jerky points in the lip-rounding and VSW curves help evaluate the smoothness of the resulting viseme image sequences. Results show that the viseme-based speech recognition system gives smoother and more plausible mouth-shape sequences.
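The viseme-similarity-weighted accuracy described above can be sketched as follows. This is a minimal illustration assuming a frame-aligned comparison between recognized and reference sequences; the similarity weights and viseme labels used here are hypothetical placeholders, not the paper's actual VSW table.

```python
# Hedged sketch of a viseme-similarity-weighted accuracy: instead of
# counting a mismatch as a hard error, each recognized viseme is scored
# by how visually similar its mouth shape is to the reference viseme.

def vsw_accuracy(recognized, reference, similarity):
    """Average similarity between recognized and reference visemes.

    similarity maps (recognized, reference) pairs to a weight in [0, 1]
    (1.0 = identical mouth shape); exact matches score 1.0 directly.
    """
    assert len(recognized) == len(reference), "sequences must be aligned"
    total = sum(1.0 if r == g else similarity.get((r, g), 0.0)
                for r, g in zip(recognized, reference))
    return total / len(reference)

# Toy symmetric similarity weights between viseme classes (hypothetical).
sim = {
    ("p", "b"): 0.9, ("b", "p"): 0.9,    # bilabial closures look alike
    ("p", "aa"): 0.1, ("aa", "p"): 0.1,  # closed vs. wide-open mouth
}

print(vsw_accuracy(["p", "aa"], ["b", "aa"], sim))  # 0.95
```

Confusing "p" for "b" costs only 0.1 here because the two mouth shapes are nearly indistinguishable, which is exactly the mismatch a plain recognition rate would over-penalize.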