殘差網(wǎng)絡(luò)在嬰幼兒哭聲識(shí)別中的應(yīng)用
doi: 10.11999/JEIT180276 cstr: 32379.14.JEIT180276
-
北京理工大學(xué)信息與電子學(xué)院 ??北京 ??100081
Application of Residual Network to Infant Crying Recognition
-
School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
-
Abstract: A deep learning model combining the spectrogram with a residual network is used to recognize infant crying, trained on a corpus with a balanced proportion of crying and non-crying samples. Under 5-fold cross validation, the spectrogram-based residual network is compared with three other models: a Support Vector Machine (SVM), a Convolutional Neural Network (CNN), and a cochleagram residual network based on Gammatone filters (GT-Resnet). It achieves the best result, an F1-score of 0.9965, while satisfying real-time requirements. This shows that the spectrogram reflects acoustic features intuitively and comprehensively in infant crying recognition, and that a spectrogram-based residual network is a strong solution to this task.
-
Key words:
- Infant crying recognition /
- Deep learning /
- Residual network /
- Spectrogram
-
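As a concrete illustration of the pipeline described in the abstract, the sketch below converts an audio clip into the 128×128 log-spectrogram image that is fed to the residual network. Only the 128×128 spectrogram input comes from the paper; the sample rate, STFT parameters, and the librosa/OpenCV tooling are assumptions.

```python
import cv2                    # assumed tooling for image resizing
import librosa                # assumed tooling for audio I/O and STFT
import numpy as np


def spectrogram_image(path, size=128, sr=16000, n_fft=512, hop=256):
    """Log-magnitude spectrogram resized to a (size x size) input image.

    The signal-processing parameters here are illustrative guesses;
    the paper only specifies the final 128x128 spectrogram.
    """
    y, _ = librosa.load(path, sr=sr)
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    log_mag = librosa.amplitude_to_db(mag, ref=np.max)
    # Rescale to the fixed image size the network expects.
    img = cv2.resize(log_mag, (size, size), interpolation=cv2.INTER_LINEAR)
    # Min-max normalize to [0, 1] before feeding the network.
    return ((img - img.min()) / (img.max() - img.min() + 1e-8)).astype(np.float32)
```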
Table 1 Average dataset size per fold under 5-fold cross validation (clips)

| | Infant crying | Non-crying | Total |
| --- | --- | --- | --- |
| Training set | 1243 | 1148 | 2391 |
| Test set | 310 | 286 | 596 |
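A minimal sketch of how splits of this size arise, assuming stratified 5-fold cross validation with scikit-learn; the array names and sizes are placeholders matching Table 1's totals.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical placeholders: one feature row per clip; 1 = crying, 0 = non-crying.
features = np.random.rand(2987, 79)
labels = np.random.randint(0, 2, size=2987)

# Stratification keeps the crying/non-crying ratio in every fold,
# giving roughly the 2391-train / 596-test split reported in Table 1.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(features, labels):
    train_X, test_X = features[train_idx], features[test_idx]
    train_y, test_y = labels[train_idx], labels[test_idx]
    # ... train and evaluate one fold here ...
```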
Table 2 Features extracted for the SVM experiment

| Feature type | Statistics | Dimensions |
| --- | --- | --- |
| MFCC with 1st- and 2nd-order deltas | mean, variance | 72 |
| Short-time energy | mean, variance | 2 |
| Pitch (F0) | mean, variance, max, min, range | 5 |
Table 3 Performance of SVM with different kernel functions

| Kernel type | F1-score | Parameters |
| --- | --- | --- |
| Linear | 0.8717 | c=0.68 |
| Polynomial | 0.9316 | c=0.30, g=0.35, r=–0.20, d=3.00 |
| Gaussian (RBF) | 0.9458 | c=0.98, g=1.71 |
| Sigmoid | 0.8874 | c=5.00, g=0.04, r=1.80 |
Table 4 Performance of CNNs with different depths

| CNN model | Input feature | F1-score |
| --- | --- | --- |
| CNN-4-MEL | 40×128 Mel spectrogram | 0.9184 |
| CNN-4-227 | 227×227 spectrogram | 0.9233 |
| CNN-4 | 128×128 spectrogram | 0.9229 |
| CNN-5-227 | 227×227 spectrogram | 0.9482 |
| CNN-5 | 128×128 spectrogram | 0.9489 |
| CNN-6 | 128×128 spectrogram | 0.9365 |
| CNN-7 | 128×128 spectrogram | 0.9398 |
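Table 5 below specifies CNN-5, the best CNN here, as four convolutional layers plus one fully connected layer on a 128×128 spectrogram. A minimal PyTorch sketch of that topology; the channel widths, kernel sizes, and pooling are illustrative guesses, not the paper's published configuration.

```python
import torch.nn as nn


class CNN5(nn.Module):
    """4 conv layers + 1 fully connected layer (Table 5: '4conv+1fc')."""

    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # -> 64x64
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # -> 32x32
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # -> 16x16
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # -> 8x8
        )
        self.fc = nn.Linear(128 * 8 * 8, n_classes)

    def forward(self, x):            # x: (batch, 1, 128, 128) spectrogram images
        return self.fc(self.features(x).flatten(1))
```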
Table 5 Model performance comparison

| Model | Network structure | Input feature | Model size (MB) | Average test time (s) | F1-score |
| --- | --- | --- | --- | --- | --- |
| SVM | single-layer network | statistical features | 0.7 | 0.0910+0.0001 | 0.9458 |
| CNN-5 | 4conv+1fc | spectrogram | 10 | 0.1251+0.0093 | 0.9489 |
| Resnet15 | 3resblock+1fc | spectrogram | 48 | 0.1251+0.0281 | 0.9836 |
| Resnet19 | 4resblock+1fc | spectrogram | 87 | 0.1251+0.0315 | 0.9965 |
| Resnet27 | 6resblock+1fc | spectrogram | 171 | 0.1251+0.0355 | 0.9965 |
| GT-Resnet15 | 3resblock+1fc | cochleagram | 48 | 0.1933+0.0218 | 0.9803 |
| GT-Resnet19 | 4resblock+1fc | cochleagram | 87 | 0.1933+0.0237 | 0.9782 |
| GT-Resnet27 | 6resblock+1fc | cochleagram | 171 | 0.1933+0.0285 | 0.9719 |

Note: average test time = feature extraction time + model prediction time.
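Resnet19, the smallest model reaching the best F1-score of 0.9965, is listed as four residual blocks plus one fully connected layer. A PyTorch sketch of that topology, assuming standard two-convolution residual blocks with shortcut connections in the style of He et al.; the channel widths and block internals are assumptions.

```python
import torch.nn as nn


class ResBlock(nn.Module):
    """Residual block: two 3x3 convs with a shortcut connection (assumed internals)."""

    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(cout), nn.ReLU(),
            nn.Conv2d(cout, cout, 3, padding=1, bias=False),
            nn.BatchNorm2d(cout),
        )
        # 1x1 projection shortcut when the tensor shape changes.
        self.shortcut = (nn.Identity() if cin == cout and stride == 1 else
                         nn.Conv2d(cin, cout, 1, stride=stride, bias=False))
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))


class Resnet19(nn.Module):
    """'4resblock+1fc' topology from Table 5, with illustrative widths."""

    def __init__(self, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            ResBlock(1, 32), ResBlock(32, 64, stride=2),
            ResBlock(64, 128, stride=2), ResBlock(128, 256, stride=2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, n_classes),
        )

    def forward(self, x):            # x: (batch, 1, 128, 128) spectrogram images
        return self.net(x)
```

The shortcut connections are what let the deeper variants train without degradation, and the table suggests depth saturates here: Resnet27 doubles the model size and adds latency over Resnet19 without improving the F1-score.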
-