

Efficient Audio-visual Cross-modal Speaker Tagging via Supervised Joint Correspondence Auto-encoder

LIU Xin, LI Heyang, ZHONG Bineng, DU Jixiang

Citation: LIU Xin, LI Heyang, ZHONG Bineng, DU Jixiang. Efficient Audio-visual Cross-modal Speaker Tagging via Supervised Joint Correspondence Auto-encoder[J]. Journal of Electronics & Information Technology, 2018, 40(7): 1635-1642. doi: 10.11999/JEIT171011

doi: 10.11999/JEIT171011 cstr: 32379.14.JEIT171011

Author information:

    LIU Xin: Male, born in 1982, Ph.D., associate professor. His research interests include biometric recognition and machine learning.
    LI Heyang: Male, born in 1994, M.S. candidate. His research interests include computer vision and pattern recognition.
    ZHONG Bineng: Male, born in 1981, Ph.D., professor. His research interests include machine learning and pattern recognition.
    DU Jixiang: Male, born in 1977, Ph.D., professor. His research interests include computer vision and machine learning.

  • CLC number: TP391.4

Funds: The National Natural Science Foundation of China (61673185, 61572205, 61673186), The Natural Science Foundation of Fujian Province (2017J01112), The Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University (ZQN-309)

  • Abstract: Cross-modal speaker tagging aims to match and mutually annotate the different biometric traits of a speaker, and has wide application in human-computer interaction. To bridge the evident "semantic gap" between the two heterogeneous biometric modalities of face and voice, this paper proposes an audio-visual cross-modal speaker tagging method built on a supervised joint correspondence auto-encoder. First, a convolutional neural network and a deep belief network are employed to extract discriminative features from face images and speech data, respectively. Then, building on the correspondence auto-encoder, a new supervised cross-modal neural network model is proposed, in which a softmax regression model is embedded to preserve both inter-modal and intra-modal sample similarity. This model is further extended into three supervised correspondence auto-encoder networks to mine the latent relationships between the heterogeneous audio-visual features, thereby enabling effective cross-modal mutual annotation of faces and voices. Experimental results show that the proposed network models tag speakers across modalities effectively and are robust to pose variation and sample diversity.
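
The architecture outlined in the abstract (modality-specific deep features coupled through a correspondence auto-encoder with an embedded softmax head) can be made concrete with a short sketch. The code below is an illustrative assumption, not the authors' implementation: PyTorch, the feature dimensions, the speaker count, and the loss weights alpha and beta are all hypothetical choices.

    # Minimal sketch of a supervised correspondence auto-encoder (illustrative only).
    # Two auto-encoders share a code space; a correspondence loss ties the two codes,
    # and an embedded softmax classifier supervises both with speaker labels.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SupervisedCorrAE(nn.Module):
        def __init__(self, face_dim=512, voice_dim=256, code_dim=128, n_speakers=40):
            super().__init__()
            self.face_enc = nn.Sequential(nn.Linear(face_dim, code_dim), nn.ReLU())
            self.face_dec = nn.Linear(code_dim, face_dim)
            self.voice_enc = nn.Sequential(nn.Linear(voice_dim, code_dim), nn.ReLU())
            self.voice_dec = nn.Linear(code_dim, voice_dim)
            self.classifier = nn.Linear(code_dim, n_speakers)  # softmax regression head

        def forward(self, face, voice):
            zf, zv = self.face_enc(face), self.voice_enc(voice)
            return zf, zv, self.face_dec(zf), self.voice_dec(zv)

    def total_loss(model, face, voice, labels, alpha=1.0, beta=0.5):
        zf, zv, face_hat, voice_hat = model(face, voice)
        recon = F.mse_loss(face_hat, face) + F.mse_loss(voice_hat, voice)  # intra-modal fidelity
        corr = F.mse_loss(zf, zv)                                          # inter-modal correspondence
        sup = (F.cross_entropy(model.classifier(zf), labels)
               + F.cross_entropy(model.classifier(zv), labels))            # supervised similarity
        return recon + alpha * corr + beta * sup

    # Usage on random stand-in features (a CNN/DBN would supply real ones):
    model = SupervisedCorrAE()
    face, voice = torch.randn(8, 512), torch.randn(8, 256)
    labels = torch.randint(0, 40, (8,))
    total_loss(model, face, voice, labels).backward()

At tagging time, a face query would be encoded with face_enc and matched against stored voice codes (for example, by cosine distance), and vice versa for a voice query.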
Publication history
  • Received: 2017-10-30
  • Revised: 2018-04-10
  • Published: 2018-07-19
