Efficient Audio-visual Cross-modal Speaker Tagging via Supervised Joint Correspondence Auto-encoder
doi: 10.11999/JEIT171011 cstr: 32379.14.JEIT171011
LIU Xin①② LI Heyang① ZHONG Bineng①② DU Jixiang①②
①(College of Computer Science and Technology, Huaqiao University, Xiamen 361021, China) ②(Xiamen Key Laboratory of Computer Vision and Pattern Recognition, Xiamen 361021, China)
The National Natural Science Foundation of China (61673185, 61572205, 61673186), The Natural Science Foundation of Fujian Province (2017J01112), The Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University (ZQN-309)
Abstract: Cross-modal speaker tagging aims to learn the latent relationship between different biometrics of a speaker for mutual annotation, and it can be widely applied in various human-computer interaction scenarios. To bridge the evident "semantic gap" between the face and audio modalities, this paper presents a cross-modal audio-visual speaker tagging method based on a supervised joint correspondence auto-encoder. First, a Convolutional Neural Network (CNN) and a Deep Belief Network (DBN) are employed to extract discriminative features from the face images and the audio samples, respectively. Then, building on the joint auto-encoder model, a new supervised cross-modal neural network is proposed, in which a softmax regression model is embedded to preserve both inter-modal and intra-modal sample similarities; it is further extended into three supervised correspondence auto-encoder variants that mine the latent relationships between the heterogeneous audio-visual features, so that faces and voices can be cross-modally annotated in an efficient way. Experimental results show that the proposed network performs cross-modal speaker tagging effectively, with notable performance, and is robust to facial pose variations and sample diversity.
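The abstract outlines a two-branch design: pre-extracted CNN face features and DBN audio features feed a joint correspondence auto-encoder whose shared codes are additionally supervised by a softmax regression head. The following is a minimal PyTorch sketch of that idea, assuming 4096-d face and 1024-d audio feature vectors; the class name SupervisedCorrespondenceAE, the layer sizes, and the loss weights alpha/beta are illustrative assumptions, not the paper's exact architecture or any of its three published variants.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupervisedCorrespondenceAE(nn.Module):
    """Two-branch auto-encoder: each branch reconstructs its own modality,
    matched face/audio codes are pulled together (correspondence term), and
    a shared softmax head supervises both codes with speaker identities."""

    def __init__(self, face_dim=4096, audio_dim=1024, code_dim=256, n_speakers=100):
        super().__init__()
        self.face_enc = nn.Sequential(nn.Linear(face_dim, 1024), nn.ReLU(),
                                      nn.Linear(1024, code_dim))
        self.face_dec = nn.Sequential(nn.Linear(code_dim, 1024), nn.ReLU(),
                                      nn.Linear(1024, face_dim))
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 512), nn.ReLU(),
                                       nn.Linear(512, code_dim))
        self.audio_dec = nn.Sequential(nn.Linear(code_dim, 512), nn.ReLU(),
                                       nn.Linear(512, audio_dim))
        self.classifier = nn.Linear(code_dim, n_speakers)  # softmax regression head

    def forward(self, face, audio):
        zf, za = self.face_enc(face), self.audio_enc(audio)
        return zf, za, self.face_dec(zf), self.audio_dec(za)

def joint_loss(model, face, audio, speaker_ids, alpha=1.0, beta=0.5):
    zf, za, face_hat, audio_hat = model(face, audio)
    recon = F.mse_loss(face_hat, face) + F.mse_loss(audio_hat, audio)  # intra-modal
    corresp = F.mse_loss(zf, za)                  # inter-modal: align matched pairs
    logits = torch.cat([model.classifier(zf), model.classifier(za)], dim=0)
    labels = torch.cat([speaker_ids, speaker_ids], dim=0)
    ident = F.cross_entropy(logits, labels)       # softmax supervision on both codes
    return recon + alpha * corresp + beta * ident

# Toy usage with random stand-ins for the CNN/DBN features.
model = SupervisedCorrespondenceAE()
face = torch.randn(8, 4096)                       # hypothetical CNN face features
audio = torch.randn(8, 1024)                      # hypothetical DBN audio features
ids = torch.randint(0, 100, (8,))
joint_loss(model, face, audio, ids).backward()
```

At test time, cross-modal tagging in this sketch would amount to encoding a probe from one modality, encoding the candidates from the other, and ranking candidates by distance between codes, e.g. with torch.cdist(zf, za).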