Efficient Audio-visual Cross-modal Speaker Tagging via Supervised Joint Correspondence Auto-encoder
doi: 10.11999/JEIT171011 cstr: 32379.14.JEIT171011
LIU Xin①② LI Heyang① ZHONG Bineng①② DU Jixiang①②
①(College of Computer Science and Technology, Huaqiao University, Xiamen 361021, China) ②(Xiamen Key Laboratory of Computer Vision and Pattern Recognition, Xiamen 361021, China)
The National Natural Science Foundation of China (61673185, 61572205, 61673186), The Natural Science Foundation of Fujian Province (2017J01112), The Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University (ZQN-309)
Abstract: Cross-modal speaker tagging aims to learn the latent relationship between different biometrics of a speaker for mutual annotation, and it can be widely applied in various human-computer interaction scenarios. To bridge the evident "semantic gap" between the face and audio modalities, this paper presents a cross-modal audio-visual speaker tagging method based on a supervised joint correspondence auto-encoder. First, a Convolutional Neural Network (CNN) and a Deep Belief Network (DBN) are employed to extract discriminative features from the face images and the audio samples, respectively. Then, building on the joint auto-encoder model, a new supervised cross-modal neural network is proposed, in which a softmax regression model is embedded to preserve both inter-modal and intra-modal sample similarities; it is further extended into three supervised correspondence auto-encoder variants that mine the latent relationships between the heterogeneous audio-visual features, so that faces and voices can be cross-modally annotated in an efficient way. Experimental results show that the proposed network performs cross-modal speaker tagging effectively, with notable performance, and is robust to facial pose variations and sample diversity.
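The abstract outlines a two-branch design: pre-extracted CNN face features and DBN audio features feed a joint correspondence auto-encoder whose shared codes are additionally supervised by a softmax regression head. The following is a minimal PyTorch sketch of that idea, assuming 4096-d face and 1024-d audio feature vectors; the class name SupervisedCorrespondenceAE, the layer sizes, and the loss weights alpha/beta are illustrative assumptions, not the paper's exact architecture or any of its three published variants.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupervisedCorrespondenceAE(nn.Module):
    """Two-branch auto-encoder: each branch reconstructs its own modality,
    matched face/audio codes are pulled together (correspondence term), and
    a shared softmax head supervises both codes with speaker identities."""

    def __init__(self, face_dim=4096, audio_dim=1024, code_dim=256, n_speakers=100):
        super().__init__()
        self.face_enc = nn.Sequential(nn.Linear(face_dim, 1024), nn.ReLU(),
                                      nn.Linear(1024, code_dim))
        self.face_dec = nn.Sequential(nn.Linear(code_dim, 1024), nn.ReLU(),
                                      nn.Linear(1024, face_dim))
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 512), nn.ReLU(),
                                       nn.Linear(512, code_dim))
        self.audio_dec = nn.Sequential(nn.Linear(code_dim, 512), nn.ReLU(),
                                       nn.Linear(512, audio_dim))
        self.classifier = nn.Linear(code_dim, n_speakers)  # softmax regression head

    def forward(self, face, audio):
        zf, za = self.face_enc(face), self.audio_enc(audio)
        return zf, za, self.face_dec(zf), self.audio_dec(za)

def joint_loss(model, face, audio, speaker_ids, alpha=1.0, beta=0.5):
    zf, za, face_hat, audio_hat = model(face, audio)
    recon = F.mse_loss(face_hat, face) + F.mse_loss(audio_hat, audio)  # intra-modal
    corresp = F.mse_loss(zf, za)                  # inter-modal: align matched pairs
    logits = torch.cat([model.classifier(zf), model.classifier(za)], dim=0)
    labels = torch.cat([speaker_ids, speaker_ids], dim=0)
    ident = F.cross_entropy(logits, labels)       # softmax supervision on both codes
    return recon + alpha * corresp + beta * ident

# Toy usage with random stand-ins for the CNN/DBN features.
model = SupervisedCorrespondenceAE()
face = torch.randn(8, 4096)                       # hypothetical CNN face features
audio = torch.randn(8, 1024)                      # hypothetical DBN audio features
ids = torch.randint(0, 100, (8,))
joint_loss(model, face, audio, ids).backward()
```

At test time, cross-modal tagging in this sketch would amount to encoding a probe from one modality, encoding the candidates from the other, and ranking candidates by distance between codes, e.g. with torch.cdist(zf, za).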