
Emotion Recognition with Speech and Facial Images

XUE Peiyun, DAI Shutao, BAI Jing, GAO Xiang

Citation: XUE Peiyun, DAI Shutao, BAI Jing, GAO Xiang. Emotion Recognition with Speech and Facial Images[J]. Journal of Electronics & Information Technology, 2024, 46(12): 4542-4552. doi: 10.11999/JEIT240087


doi: 10.11999/JEIT240087 cstr: 32379.14.JEIT240087
Details
    Author biographies:

    XUE Peiyun: Female, Lecturer. Research interest: signal processing

    DAI Shutao: Male, Master's student. Research interest: signal processing

    BAI Jing: Female, Professor. Research interest: signal processing

    GAO Xiang: Male, Master's student. Research interest: signal processing

    Corresponding author:

    XUE Peiyun, xuepeiyun@tyut.edu.cn

  • CLC number: TN911.7; TP391.41


Funds: Shanxi Youth Foundation (20210302124544), The Applied Basic Research Program of Shanxi Province (201901D111094)
  • Abstract: To improve the accuracy of emotion recognition models and address insufficient emotion feature extraction, this work studies bimodal emotion recognition from speech and facial images. For the speech modality, a feature extraction model is proposed that combines a channel-spatial attention mechanism with a Multi-branch Convolutional Neural Network (MCNN), extracting emotion features from speech spectrograms along the temporal, spatial, and local feature dimensions. For the facial image modality, a feature extraction model based on a Residual Hybrid Convolutional Neural Network (RHCNN) is proposed, which further establishes a parallel attention mechanism to attend to global emotion features and improve recognition accuracy. The extracted speech and facial image features are classified by separate classification layers, and decision fusion combines the two recognition results into a final classification. Experimental results show that the proposed bimodal fusion model reaches recognition accuracies of 97.22%, 94.78%, and 96.96% on the RAVDESS, eNTERFACE'05, and RML datasets, respectively, which is 11.02%, 4.24%, and 8.83% higher than the speech-only model and 4.60%, 6.74%, and 4.10% higher than the facial-image-only model, and which also exceeds recent related methods on the corresponding datasets. This indicates that the proposed bimodal fusion model can effectively focus on emotional information and thereby improve the accuracy of emotion recognition.
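
    The decision fusion step that combines the two classifiers' outputs can be illustrated with a minimal sketch in Python/NumPy, assuming the fusion is a weighted average of the per-modality softmax probabilities followed by an argmax. The paper's actual rule is the one defined in Figure 6; the equal weighting, class ordering, and function name below are illustrative assumptions, not the authors' implementation.

    ```python
    import numpy as np

    def decision_fusion(p_speech: np.ndarray, p_face: np.ndarray,
                        w_speech: float = 0.5) -> int:
        """Fuse per-modality class probabilities at the decision level.

        p_speech, p_face: softmax outputs of the speech (MCNN) and facial
        image (RHCNN) classifiers, each of shape (num_classes,).
        The 0.5/0.5 weighting is an illustrative assumption.
        """
        p_fused = w_speech * p_speech + (1.0 - w_speech) * p_face
        return int(np.argmax(p_fused))  # index of the fused emotion class

    # Example with the eight RAVDESS classes (neutral, calm, happy, sad,
    # angry, fearful, disgust, surprised):
    p_s = np.array([0.05, 0.02, 0.60, 0.05, 0.10, 0.08, 0.05, 0.05])
    p_f = np.array([0.02, 0.03, 0.70, 0.05, 0.05, 0.05, 0.05, 0.05])
    print(decision_fusion(p_s, p_f))  # -> 2, i.e. "happy"
    ```
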
  • Figure 1. Overall structure of the bimodal emotion recognition model

    Figure 2. Log-Mel spectrogram extraction pipeline (a code sketch follows the figure list)

    Figure 3. MCNN structure

    Figure 4. RHCNN network structure

    Figure 5. Parallel attention mechanism

    Figure 6. Decision fusion algorithm

    Figure 7. Confusion matrix of the speech model

    Figure 8. Confusion matrix of the facial image model

    Figure 9. Confusion matrix of the bimodal model

    Figure 10. Validation of the effectiveness of different modules on model performance

    Figure 11. Effect of different amounts of data on model performance
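
    To accompany Figure 2, the following is a minimal sketch of a log-Mel spectrogram extraction pipeline using librosa. The sampling rate, FFT size, hop length, and number of Mel bands are illustrative assumptions, not the settings used in the paper.

    ```python
    import librosa
    import numpy as np

    def log_mel_spectrogram(wav_path: str, sr: int = 16000, n_fft: int = 1024,
                            hop_length: int = 256, n_mels: int = 64) -> np.ndarray:
        """Compute a log-Mel spectrogram from a speech file.

        All parameter values (16 kHz sampling, 1024-point FFT, 256-sample
        hop, 64 Mel bands) are illustrative assumptions.
        """
        y, _ = librosa.load(wav_path, sr=sr)          # load and resample
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length,
                                             n_mels=n_mels)
        return librosa.power_to_db(mel, ref=np.max)   # log compression (dB)

    # Usage: spec = log_mel_spectrogram("clip.wav")   # shape (n_mels, n_frames)
    ```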

    Table 1. Per-class emotion recognition results of the speech model (%)

    Class    | RAVDESS (P / R / S / F1)      | RML (P / R / S / F1)          | eNTERFACE'05 (P / R / S / F1)
    Neutral  | 83.22 / 87.94 / 98.84 / 85.52 | –                             | –
    Calm     | 86.94 / 88.35 / 97.94 / 87.64 | –                             | –
    Happy    | 79.86 / 97.46 / 97.20 / 87.79 | 78.12 / 92.59 / 95.62 / 84.75 | 95.66 / 95.94 / 99.12 / 95.80
    Sad      | 80.46 / 78.64 / 97.04 / 79.56 | 91.95 / 78.82 / 98.47 / 84.88 | 94.59 / 90.27 / 99.01 / 92.38
    Angry    | 87.03 / 84.62 / 97.93 / 85.80 | 96.09 / 94.51 / 99.25 / 95.29 | 84.44 / 87.99 / 96.85 / 86.18
    Fear     | 85.40 / 91.06 / 97.65 / 88.14 | 83.43 / 82.51 / 96.80 / 82.97 | 85.15 / 88.68 / 96.90 / 87.36
    Disgust  | 90.66 / 83.61 / 98.41 / 87.00 | 86.50 / 87.37 / 97.07 / 86.93 | 91.83 / 93.68 / 98.30 / 92.75
    Surprise | 94.31 / 82.30 / 99.19 / 87.89 | 93.30 / 94.27 / 98.60 / 93.78 | 92.15 / 85.67 / 98.47 / 88.79

    Table 2. Accuracy comparison between existing speech-modality methods and the proposed model (%)

    Compared methods: CNN-Transformer [18], 2D Feature Extraction+SVM [19], CNN-BiLSTM [20], CNN-X [21], 3D-CNN [22], DCNN-GWO [23], Auto-encoder+SVM [24], Multi-label model [25]. Each row lists, in citation order, the accuracies of those compared methods that report results on the dataset, followed by the proposed method.

    RAVDESS      | compared methods: 82.0, 81.94, 84.72, 82.99 | proposed: 86.20
    RML          | compared methods: 74.07, 84.72              | proposed: 88.13
    eNTERFACE'05 | compared methods: 88.47, 90.40              | proposed: 90.54

    Table 3. Per-class emotion recognition results of the facial image model (%)

    Class    | RAVDESS (P / R / S / F1)      | RML (P / R / S / F1)          | eNTERFACE'05 (P / R / S / F1)
    Neutral  | 91.95 / 91.33 / 99.44 / 91.64 | –                             | –
    Calm     | 95.22 / 93.44 / 99.24 / 94.32 | –                             | –
    Happy    | 94.79 / 92.54 / 99.25 / 93.65 | 96.84 / 97.35 / 99.36 / 97.10 | 89.31 / 88.29 / 97.82 / 88.79
    Sad      | 94.04 / 92.21 / 99.10 / 93.11 | 93.22 / 88.24 / 98.71 / 90.66 | 86.62 / 84.21 / 97.57 / 85.40
    Angry    | 92.41 / 93.59 / 98.80 / 92.99 | 91.67 / 95.38 / 98.42 / 93.48 | 87.61 / 89.68 / 97.48 / 88.63
    Fear     | 87.27 / 92.13 / 97.95 / 89.63 | 91.35 / 87.56 / 98.27 / 89.42 | 87.04 / 87.78 / 97.29 / 87.41
    Disgust  | 94.58 / 96.62 / 99.09 / 95.59 | 93.75 / 94.74 / 98.71 / 94.24 | 90.42 / 89.68 / 97.99 / 90.81
    Surprise | 90.39 / 87.89 / 98.66 / 89.12 | 90.31 / 94.15 / 97.96 / 92.19 | 87.01 / 86.75 / 97.49 / 86.88

    Table 4. Comparison between classic networks and the proposed method on the facial image modality (%)

    Dataset      | Method            | A     | R     | S     | F1
    RAVDESS      | ResNet-34 [26]    | 88.37 | 88.64 | 98.34 | 88.19
                 | ShuffleNetV2 [27] | 82.03 | 82.06 | 97.43 | 81.60
                 | MobileNetV2 [28]  | 84.03 | 84.46 | 97.71 | 83.86
                 | Proposed          | 92.62 | 92.47 | 98.94 | 92.51
    RML          | ResNet-34 [26]    | 89.29 | 89.41 | 97.86 | 89.27
                 | ShuffleNetV2 [27] | 79.55 | 79.82 | 95.95 | 79.24
                 | MobileNetV2 [28]  | 86.07 | 86.33 | 97.22 | 86.05
                 | Proposed          | 92.86 | 92.90 | 98.57 | 92.85
    eNTERFACE'05 | ResNet-34 [26]    | 83.30 | 83.57 | 96.67 | 83.28
                 | ShuffleNetV2 [27] | 82.13 | 82.17 | 96.43 | 82.04
                 | MobileNetV2 [28]  | 83.50 | 83.68 | 96.70 | 83.49
                 | Proposed          | 88.04 | 87.98 | 97.61 | 87.99

    Table 5. Per-class emotion recognition results of the bimodal model (%)

    Class    | RAVDESS (P / R / S / F1)        | RML (P / R / S / F1)          | eNTERFACE'05 (P / R / S / F1)
    Neutral  | 98.66 / 96.71 / 99.91 / 97.67   | –                             | –
    Calm     | 100.00 / 98.13 / 100.00 / 99.05 | –                             | –
    Happy    | 97.57 / 96.56 / 99.65 / 97.06   | 96.95 / 98.96 / 99.37 / 97.95 | 96.53 / 96.53 / 99.00 / 96.53
    Sad      | 96.36 / 97.98 / 99.45 / 97.16   | 97.22 / 95.63 / 99.48 / 96.42 | 95.54 / 95.24 / 98.59 / 95.39
    Angry    | 98.10 / 95.38 / 99.70 / 96.72   | 98.91 / 99.45 / 99.79 / 99.18 | 95.18 / 94.83 / 98.41 / 94.96
    Fear     | 93.48 / 96.47 / 98.95 / 94.95   | 95.24 / 93.75 / 99.06 / 94.49 | 92.44 / 94.56 / 99.30 / 93.48
    Disgust  | 96.99 / 98.47 / 99.49 / 97.72   | 98.02 / 95.59 / 99.58 / 97.30 | 93.24 / 93.77 / 99.19 / 93.50
    Surprise | 97.51 / 97.86 / 99.65 / 97.68   | 95.50 / 97.45 / 99.06 / 96.46 | 96.07 / 93.81 / 99.24 / 94.93

    Table 6. Comparison between existing bimodal methods and the proposed method (%)

    Compared methods: the methods of references [29]–[37]. Each row lists, in citation order, the accuracies of those compared methods that report results on the dataset, followed by the proposed method.

    RAVDESS      | compared methods: 86.0, 86.7, 87.5, 87.89, 82.99, 93.23 | proposed: 97.22
    RML          | compared methods: 82.47, 96.79                          | proposed: 96.96
    eNTERFACE'05 | compared methods: 72.27, 91.62, 88.11, 87.2             | proposed: 94.78
  • [1] KUMARAN U, RADHA RAMMOHAN S, NAGARAJAN S M, et al. Fusion of mel and gammatone frequency cepstral coefficients for speech emotion recognition using deep C-RNN[J]. International Journal of Speech Technology, 2021, 24(2): 303–314. doi: 10.1007/s10772-020-09792-x.
    [2] HAN Hu, FAN Yating, and XU Xuefeng. Multi-channel enhanced graph convolutional network for aspect-based sentiment analysis[J]. Journal of Electronics & Information Technology, 2024, 46(3): 1022–1032. doi: 10.11999/JEIT230353.
    [3] CORNEJO J and PEDRINI H. Bimodal emotion recognition based on audio and facial parts using deep convolutional neural networks[C]. Proceedings of the 18th IEEE International Conference On Machine Learning And Applications, Boca Raton, USA, 2019: 111–117. doi: 10.1109/ICMLA.2019.00026.
    [4] O’TOOLE A J, CASTILLO C D, PARDE C J, et al. Face space representations in deep convolutional neural networks[J]. Trends in Cognitive Sciences, 2018, 22(9): 794–809. doi: 10.1016/j.tics.2018.06.006.
    [5] CHEN Qiupu and HUANG Guimin. A novel dual attention-based BLSTM with hybrid features in speech emotion recognition[J]. Engineering Applications of Artificial Intelligence, 2021, 102: 104277. doi: 10.1016/J.ENGAPPAI.2021.104277.
    [6] PAN Bei, HIROTA K, JIA Zhiyang, et al. A review of multimodal emotion recognition from datasets, preprocessing, features, and fusion methods[J]. Neurocomputing, 2023, 561: 126866. doi: 10.1016/j.neucom.2023.126866.
    [7] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278–2324. doi: 10.1109/5.726791.
    [8] HOCHREITER S and SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735–1780. doi: 10.1162/neco.1997.9.8.1735.
    [9] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
    [10] KIM B K, LEE H, ROH J, et al. Hierarchical committee of deep CNNs with exponentially-weighted decision fusion for static facial expression recognition[C]. Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, Seattle Washington, USA, 2015: 427–434. doi: 10.1145/2818346.2830590.
    [11] TZIRAKIS P, TRIGEORGIS G, NICOLAOU M A, et al. End-to-end multimodal emotion recognition using deep neural networks[J]. IEEE Journal of Selected Topics in Signal Processing, 2017, 11(8): 1301–1309. doi: 10.1109/JSTSP.2017.2764438.
    [12] SAHOO S and ROUTRAY A. Emotion recognition from audio-visual data using rule based decision level fusion[C]. Proceedings of 2016 IEEE Students’ Technology Symposium, Kharagpur, India, 2016: 7–12. doi: 10.1109/TechSym.2016.7872646.
    [13] WANG Qilong, WU Banggu, ZHU Pengfei, et al. ECA-Net: Efficient channel attention for deep convolutional neural networks[C]. Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 11534–11542. doi: 10.1109/CVPR42600.2020.01155.
    [14] CHOLLET F. Xception: Deep learning with depthwise separable convolutions[C]. Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 1251–1258. doi: 10.1109/CVPR.2017.195.
    [15] LIVINGSTONE S R and RUSSO F A. The Ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English[J]. PLoS One, 2018, 13(5): e0196391. doi: 10.1371/journal.pone.0196391.
    [16] WANG Yongjin and GUAN Ling. Recognizing human emotional state from audiovisual signals[J]. IEEE Transactions on Multimedia, 2008, 10(5): 936–946. doi: 10.1109/TMM.2008.927665.
    [17] MARTIN O, KOTSIA I, MACQ B, et al. The eNTERFACE’ 05 audio-visual emotion database[C]. Proceedings of the 22nd International Conference on Data Engineering Workshops, Atlanta, USA, 2006: 8. doi: 10.1109/ICDEW.2006.145.
    [18] VIJAYAN D M, ARUN A V, GANESHNATH R, et al. Development and analysis of convolutional neural network based accurate speech emotion recognition models[C]. Proceedings of the 19th India Council International Conference, Kochi, India, 2022: 1–6. doi: 10.1109/INDICON56171.2022.10040174.
    [19] AGGARWAL A, SRIVASTAVA A, AGARWAL A, et al. Two-way feature extraction for speech emotion recognition using deep learning[J]. Sensors, 2022, 22(6): 2378. doi: 10.3390/s22062378.
    [20] ZHANG Limin, LI Yang, ZHANG Yueting, et al. A deep learning method using gender-specific features for emotion recognition[J]. Sensors, 2023, 23(3): 1355. doi: 10.3390/s23031355.
    [21] KANANI C S, GILL K S, BEHERA S, et al. Shallow over deep neural networks: A empirical analysis for human emotion classification using audio data[M]. MISRA R, KESSWANI N, RAJARAJAN M, et al. Internet of Things and Connected Technologies. Cham: Springer, 2021: 134–146. doi: 10.1007/978-3-030-76736-5_13.
    [22] FALAHZADEH M R, FARSA E Z, HARIMI A, et al. 3D convolutional neural network for speech emotion recognition with its realization on Intel CPU and NVIDIA GPU[J]. IEEE Access, 2022, 10: 112460–112471. doi: 10.1109/ACCESS.2022.3217226.
    [23] FALAHZADEH M R, FAROKHI F, HARIMI A, et al. Deep convolutional neural network and gray wolf optimization algorithm for speech emotion recognition[J]. Circuits, Systems, and Signal Processing, 2023, 42(1): 449–492. doi: 10.1007/s00034-022-02130-3.
    [24] HARÁR P, BURGET R, and DUTTA M K. Speech emotion recognition with deep learning[C]. Proceedings of the 4th International Conference on Signal Processing and Integrated Networks, Noida, India, 2017: 137–140. doi: 10.1109/SPIN.2017.8049931.
    [25] SLIMI A, HAFAR N, ZRIGUI M, et al. Multiple models fusion for multi-label classification in speech emotion recognition systems[J]. Procedia Computer Science, 2022, 207: 2875–2882. doi: 10.1016/j.procs.2022.09.345.
    [26] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778. doi: 10.1109/CVPR.2016.90.
    [27] MA Ningning, ZHANG Xiangyu, ZHENG Haitao, et al. ShuffleNet V2: Practical guidelines for efficient CNN architecture design[C]. Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 2018: 116–131. doi: 10.1007/978-3-030-01264-9_8.
    [28] SANDLER M, HOWARD A, ZHU Menglong, et al. MobileNetV2: Inverted residuals and linear bottlenecks[C]. Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 4510–4520. doi: 10.1109/CVPR.2018.00474.
    [29] MIDDYA A I, NAG B, and ROY S. Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities[J]. Knowledge-Based Systems, 2022, 244: 108580. doi: 10.1016/j.knosys.2022.108580.
    [30] LUNA-JIMÉNEZ C, KLEINLEIN R, GRIOL D, et al. A proposal for multimodal emotion recognition using aural transformers and action units on RAVDESS dataset[J]. Applied Sciences, 2021, 12(1): 327. doi: 10.3390/app12010327.
    [31] BOUALI Y L, AHMED O B, and MAZOUZI S. Cross-modal learning for audio-visual emotion recognition in acted speech[C]. Proceedings of the 6th International Conference on Advanced Technologies for Signal and Image Processing, Sfax, Tunisia, 2022: 1–6. doi: 10.1109/ATSIP55956.2022.9805959.
    [32] MOCANU B and TAPU R. Audio-video fusion with double attention for multimodal emotion recognition[C]. Proceedings of the 14th Image, Video, and Multidimensional Signal Processing Workshop, Nafplio, Greece, 2022: 1–5. doi: 10.1109/IVMSP54334.2022.9816349.
    [33] WOZNIAK M, SAKOWICZ M, LEDWOSINSKI K, et al. Bimodal emotion recognition based on vocal and facial features[J]. Procedia Computer Science, 2023, 225: 2556–2566. doi: 10.1016/j.procs.2023.10.247.
    [34] PAN Bei, HIROTA K, JIA Zhiyang, et al. Multimodal emotion recognition based on feature selection and extreme learning machine in video clips[J]. Journal of Ambient Intelligence and Humanized Computing, 2023, 14(3): 1903–1917. doi: 10.1007/s12652-021-03407-2.
    [35] TANG Guichen, XIE Yue, LI Ke, et al. Multimodal emotion recognition from facial expression and speech based on feature fusion[J]. Multimedia Tools and Applications, 2023, 82(11): 16359–16373. doi: 10.1007/s11042-022-14185-0.
    [36] CHEN Luefeng, WANG Kuanlin, LI Min, et al. K-means clustering-based kernel canonical correlation analysis for multimodal emotion recognition in human-robot interaction[J]. IEEE Transactions on Industrial Electronics, 2023, 70(1): 1016–1024. doi: 10.1109/TIE.2022.3150097.
    [37] CHEN Guanghui and ZENG Xiaoping. Multi-modal emotion recognition by fusing correlation features of speech-visual[J]. IEEE Signal Processing Letters, 2021, 28: 533–537. doi: 10.1109/LSP.2021.3055755.
Publication history
  • Received: 2024-02-05
  • Revised: 2024-11-06
  • Available online: 2024-11-08
  • Issue published: 2024-12-01
