殘差網(wǎng)絡(luò)在嬰幼兒哭聲識(shí)別中的應(yīng)用
doi: 10.11999/JEIT180276 cstr: 32379.14.JEIT180276
-
北京理工大學(xué)信息與電子學(xué)院 ??北京 ??100081
Application of Residual Network to Infant Crying Recognition
-
School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
-
Abstract: A deep learning model combining the spectrogram with a residual network is used to recognize infant crying, trained on a corpus with a balanced proportion of crying and non-crying samples. Under 5-fold cross validation, the spectrogram-based residual network is compared with three other models: a Support Vector Machine (SVM), a Convolutional Neural Network (CNN), and a cochleagram residual network based on Gammatone filters (GT-Resnet). It achieves the best result, an F1-score of 0.9965, while satisfying real-time requirements. This shows that the spectrogram reflects acoustic features intuitively and comprehensively in infant crying recognition, and that a spectrogram-based residual network is a strong solution to this task.
-
Key words:
- Infant crying recognition /
- Deep learning /
- Residual network /
- Spectrogram
-
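As a concrete illustration of the pipeline described in the abstract, the sketch below converts an audio clip into the 128×128 log-spectrogram image that is fed to the residual network. Only the 128×128 spectrogram input comes from the paper; the sample rate, STFT parameters, and the librosa/OpenCV tooling are assumptions.

```python
import cv2                    # assumed tooling for image resizing
import librosa                # assumed tooling for audio I/O and STFT
import numpy as np


def spectrogram_image(path, size=128, sr=16000, n_fft=512, hop=256):
    """Log-magnitude spectrogram resized to a (size x size) input image.

    The signal-processing parameters here are illustrative guesses;
    the paper only specifies the final 128x128 spectrogram.
    """
    y, _ = librosa.load(path, sr=sr)
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    log_mag = librosa.amplitude_to_db(mag, ref=np.max)
    # Rescale to the fixed image size the network expects.
    img = cv2.resize(log_mag, (size, size), interpolation=cv2.INTER_LINEAR)
    # Min-max normalize to [0, 1] before feeding the network.
    return ((img - img.min()) / (img.max() - img.min() + 1e-8)).astype(np.float32)
```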
Table 1 Average dataset size per fold under 5-fold cross validation (clips)

| | Infant crying | Non-crying | Total |
| --- | --- | --- | --- |
| Training set | 1243 | 1148 | 2391 |
| Test set | 310 | 286 | 596 |
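A minimal sketch of how splits of this size arise, assuming stratified 5-fold cross validation with scikit-learn; the array names and sizes are placeholders matching Table 1's totals.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical placeholders: one feature row per clip; 1 = crying, 0 = non-crying.
features = np.random.rand(2987, 79)
labels = np.random.randint(0, 2, size=2987)

# Stratification keeps the crying/non-crying ratio in every fold,
# giving roughly the 2391-train / 596-test split reported in Table 1.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(features, labels):
    train_X, test_X = features[train_idx], features[test_idx]
    train_y, test_y = labels[train_idx], labels[test_idx]
    # ... train and evaluate one fold here ...
```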
Table 2 Features extracted for the SVM experiment

| Feature type | Statistics | Dimensions |
| --- | --- | --- |
| MFCC with 1st- and 2nd-order deltas | mean, variance | 72 |
| Short-time energy | mean, variance | 2 |
| Pitch (F0) | mean, variance, max, min, range | 5 |
Table 3 Performance of SVM with different kernel functions

| Kernel type | F1-score | Parameters |
| --- | --- | --- |
| Linear | 0.8717 | c=0.68 |
| Polynomial | 0.9316 | c=0.30, g=0.35, r=–0.20, d=3.00 |
| Gaussian (RBF) | 0.9458 | c=0.98, g=1.71 |
| Sigmoid | 0.8874 | c=5.00, g=0.04, r=1.80 |
Table 4 Performance of CNNs with different depths

| CNN model | Input feature | F1-score |
| --- | --- | --- |
| CNN-4-MEL | 40×128 Mel spectrogram | 0.9184 |
| CNN-4-227 | 227×227 spectrogram | 0.9233 |
| CNN-4 | 128×128 spectrogram | 0.9229 |
| CNN-5-227 | 227×227 spectrogram | 0.9482 |
| CNN-5 | 128×128 spectrogram | 0.9489 |
| CNN-6 | 128×128 spectrogram | 0.9365 |
| CNN-7 | 128×128 spectrogram | 0.9398 |
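Table 5 below specifies CNN-5, the best CNN here, as four convolutional layers plus one fully connected layer on a 128×128 spectrogram. A minimal PyTorch sketch of that topology; the channel widths, kernel sizes, and pooling are illustrative guesses, not the paper's published configuration.

```python
import torch.nn as nn


class CNN5(nn.Module):
    """4 conv layers + 1 fully connected layer (Table 5: '4conv+1fc')."""

    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # -> 64x64
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # -> 32x32
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # -> 16x16
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # -> 8x8
        )
        self.fc = nn.Linear(128 * 8 * 8, n_classes)

    def forward(self, x):            # x: (batch, 1, 128, 128) spectrogram images
        return self.fc(self.features(x).flatten(1))
```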
Table 5 Model performance comparison

| Model | Network structure | Input feature | Model size (MB) | Average test time (s) | F1-score |
| --- | --- | --- | --- | --- | --- |
| SVM | single-layer network | statistical features | 0.7 | 0.0910+0.0001 | 0.9458 |
| CNN-5 | 4conv+1fc | spectrogram | 10 | 0.1251+0.0093 | 0.9489 |
| Resnet15 | 3resblock+1fc | spectrogram | 48 | 0.1251+0.0281 | 0.9836 |
| Resnet19 | 4resblock+1fc | spectrogram | 87 | 0.1251+0.0315 | 0.9965 |
| Resnet27 | 6resblock+1fc | spectrogram | 171 | 0.1251+0.0355 | 0.9965 |
| GT-Resnet15 | 3resblock+1fc | cochleagram | 48 | 0.1933+0.0218 | 0.9803 |
| GT-Resnet19 | 4resblock+1fc | cochleagram | 87 | 0.1933+0.0237 | 0.9782 |
| GT-Resnet27 | 6resblock+1fc | cochleagram | 171 | 0.1933+0.0285 | 0.9719 |

Note: average test time = feature extraction time + model prediction time.
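Resnet19, the smallest model reaching the best F1-score of 0.9965, is listed as four residual blocks plus one fully connected layer. A PyTorch sketch of that topology, assuming standard two-convolution residual blocks with shortcut connections in the style of He et al.; the channel widths and block internals are assumptions.

```python
import torch.nn as nn


class ResBlock(nn.Module):
    """Residual block: two 3x3 convs with a shortcut connection (assumed internals)."""

    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(cout), nn.ReLU(),
            nn.Conv2d(cout, cout, 3, padding=1, bias=False),
            nn.BatchNorm2d(cout),
        )
        # 1x1 projection shortcut when the tensor shape changes.
        self.shortcut = (nn.Identity() if cin == cout and stride == 1 else
                         nn.Conv2d(cin, cout, 1, stride=stride, bias=False))
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))


class Resnet19(nn.Module):
    """'4resblock+1fc' topology from Table 5, with illustrative widths."""

    def __init__(self, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            ResBlock(1, 32), ResBlock(32, 64, stride=2),
            ResBlock(64, 128, stride=2), ResBlock(128, 256, stride=2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, n_classes),
        )

    def forward(self, x):            # x: (batch, 1, 128, 128) spectrogram images
        return self.net(x)
```

The shortcut connections are what let the deeper variants train without degradation, and the table suggests depth saturates here: Resnet27 doubles the model size and adds latency over Resnet19 without improving the F1-score.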
-