Fast Decoding Algorithm for Automatic Speech Recognition Based on Recurrent Neural Networks
doi: 10.11999/JEIT160543 cstr: 32379.14.JEIT160543
國(guó)家自然科學(xué)基金(U1536117, 11590770-4),國(guó)家重點(diǎn)研發(fā)計(jì)劃重點(diǎn)專項(xiàng)(2016YFB0801200, 2016YFB0801203),新疆維吾爾自治區(qū)科技重大專項(xiàng)(2016A03007-1)
Fast Decoding Algorithm for Automatic Speech Recognition Based on Recurrent Neural Networks
The National Natural Science Foundation of China (U1536117, 11590770-4), The National Key Research and Development Plan of China (2016YFB0801200, 2016YFB0801203), The Key Science and Technology Project of the Xinjiang Uygur Autonomous Region (2016A03007-1)
Abstract: Recurrent Neural Networks (RNNs) are now widely used for acoustic modeling in Automatic Speech Recognition (ASR). Although RNNs offer clear advantages over traditional acoustic modeling methods, their relatively high computational cost limits their use, especially in real-time applications. Because the input features of an RNN usually cover a long acoustic context, the overlapped information can be exploited to lower the time complexity of both acoustic posterior calculation and token passing. This paper introduces a novel decoder structure that regularly drops overlapped frames, which significantly reduces the computational cost of decoding. Notably, the approach can directly use the original RNN model with only minor modifications to the Hidden Markov Model (HMM) topology, which makes it highly flexible. The method is validated with a Time Delay Neural Network (TDNN) on conversational telephone speech datasets, achieving a 2 to 4 times speedup with relatively small loss of accuracy.
Key words:
- Speech recognition
- Recurrent Neural Network (RNN)
- Decoder
- Frame skipping
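
To make the scheme concrete, below is a minimal sketch, assuming a hybrid NN/HMM recognizer; it is illustrative only and not the paper's implementation. `posterior_fn` stands in for the RNN/TDNN acoustic model, `log_trans` and `log_prior` for an HMM whose self-loops are assumed to have been re-tuned for the lower frame rate (the "minor modifications on the HMM topology"), and all names are hypothetical.

```python
import numpy as np

def decode_with_frame_skip(features, posterior_fn, log_trans, log_prior, skip=3):
    """Viterbi decoding over a frame-subsampled utterance (hypothetical sketch)."""
    # Evaluate the acoustic model only on every `skip`-th frame: this is
    # where the saving in posterior computation comes from.
    log_post = posterior_fn(features[::skip])       # (T', S) log-posteriors

    # The Viterbi recursion (the token-passing analogue here) also runs over
    # fewer frames, giving the second part of the saving.
    T, S = log_post.shape
    delta = log_prior + log_post[0]                 # best score per state
    back = np.zeros((T, S), dtype=int)              # best-predecessor table
    for t in range(1, T):
        scores = delta[:, None] + log_trans        # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_post[t]

    # Trace back the best state path on the subsampled time axis.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy usage: a random stand-in for the RNN and a 3-state left-to-right HMM.
rng = np.random.default_rng(0)
feats = rng.standard_normal((120, 40))              # 120 frames, 40-dim features
fake_rnn = lambda x: np.log(rng.dirichlet(np.ones(3), size=len(x)))
log_trans = np.log(np.array([[0.6, 0.4, 0.0],
                             [0.0, 0.6, 0.4],
                             [0.0, 0.0, 1.0]]) + 1e-30)
log_prior = np.log(np.array([1.0, 1e-30, 1e-30]))
print(decode_with_frame_skip(feats, fake_rnn, log_trans, log_prior, skip=3))
```

Setting `skip=1` recovers the baseline decoder, so the accuracy cost of the reported 2 to 4 times speedup can be probed simply by sweeping `skip`.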
計(jì)量
- 文章訪問(wèn)數(shù): 1923
- HTML全文瀏覽量: 202
- PDF下載量: 744
- 被引次數(shù): 0