1.

Classification of stop place in consonant-vowel contexts using feature extrapolation of acoustic-phonetic features in telephone speech

Lee JW Choi JY Kang HG 《The Journal of the Acoustical Society of America》2012,131(2):1536-1546

Knowledge-based speech recognition systems extract acoustic cues from the signal to identify speech characteristics. For channel-deteriorated telephone speech, acoustic cues, especially those for stop consonant place, are expected to be degraded or absent. To investigate the use of knowledge-based methods in degraded environments, feature extrapolation of acoustic-phonetic features based on Gaussian mixture models is examined. This process is applied to a stop place detection module that uses burst release and vowel onset cues for consonant-vowel tokens of English. Results show that classification performance is enhanced in telephone channel-degraded speech, with extrapolated acoustic-phonetic features reaching or exceeding performance using estimated Mel-frequency cepstral coefficients (MFCCs). Results also show acoustic-phonetic features may be combined with MFCCs for best performance, suggesting these features provide information complementary to MFCCs.

2.

Evaluation of formant-like features on an automatic vowel classification task

de Wet F Weber K Boves L Cranen B Bengio S Bourlard H 《The Journal of the Acoustical Society of America》2004,116(3):1781-1792

Numerous attempts have been made to find low-dimensional, formant-related representations of speech signals that are suitable for automatic speech recognition. However, it is often not known how these features behave in comparison with true formants. The purpose of this study was to compare two sets of automatically extracted formant-like features, i.e., robust formants and HMM2 features, to hand-labeled formants. The robust formant features were derived by means of the split Levinson algorithm, while the HMM2 features correspond to the frequency segmentation of speech signals obtained by two-dimensional hidden Markov models. Mel-frequency cepstral coefficients (MFCCs) were also included in the investigation as an example of state-of-the-art automatic speech recognition features. The feature sets were compared in terms of their performance on a vowel classification task. The speech data and hand-labeled formants that were used in this study are a subset of the American English vowels database presented in Hillenbrand et al. [J. Acoust. Soc. Am. 97, 3099-3111 (1995)]. Classification performance was measured on the original, clean data and in noisy acoustic conditions. When using clean data, the classification performance of the formant-like features compared very well to the performance of the hand-labeled formants in a gender-dependent experiment, but was inferior to the hand-labeled formants in a gender-independent experiment. The results that were obtained in noisy acoustic conditions indicated that the formant-like features used in this study are not inherently noise robust. For clean and noisy data, as well as for the gender-dependent and gender-independent experiments, the MFCCs achieved results equal to or better than those of the formant features, but at the price of a much higher feature dimensionality.

3.

Perceptual MVDR-based cepstral coefficients(PMCCs)for speaker recognition

LIANG Chunyan ZHANG Xiang YANG Lin ZHANG Jianping YAN Yonghong 《Chinese Journal of Acoustics (声学学报：英文版)》2012,(4):489-498

A feature extraction technique named perceptual MVDR-based cepstral coefficients (PMCCs) is introduced into speaker recognition. PMCCs are extracted and modeled using Gaussian mixture models (GMMs) for speaker recognition. To compensate for speaker and channel variability effects, joint factor analysis (JFA) is used. The experiments are carried out on the core conditions of the NIST 2008 speaker recognition evaluation data. The experimental results show that systems based on PMCCs achieve performance comparable to those based on conventional MFCCs. Moreover, fusing the two kinds of systems improves performance significantly over the MFCC system alone, reducing the equal error rate (EER) by a relative 7.6% to 30.5% and the minimum detection cost function (minDCF) by a relative 3.2% to 21.2% on different test sets. The results indicate that PMCCs can be applied effectively to speaker recognition and are complementary to MFCCs to some extent.
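The equal error rate and the relative reductions reported in this abstract can be illustrated from raw trial scores. What follows is a minimal sketch, not the evaluation tooling used in the paper: the function names `equal_error_rate` and `relative_reduction`, the simple threshold sweep, and the toy scores are all assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # Miss rate: fraction of target trials rejected (score below threshold).
    fnr = np.array([(target_scores < t).mean() for t in thresholds])
    # False-alarm rate: fraction of nontarget trials accepted.
    fpr = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # The EER lies where the two error rates are closest to each other.
    i = int(np.argmin(np.abs(fnr - fpr)))
    return 0.5 * (fnr[i] + fpr[i])

def relative_reduction(baseline, improved):
    """Relative reduction of an error measure, e.g. 0.3 means 30% lower."""
    return (baseline - improved) / baseline

# Toy example with hypothetical scores (not data from the paper).
eer = equal_error_rate([0.0, 2.0], [1.0, -1.0])
```

A production scorer would interpolate between thresholds; the nearest-point approximation here is chosen only to keep the example short.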

4.

Application of perceptual minimum variance distortionless response cepstral coefficients to speaker recognition

LIANG Chunyan ZHANG Xiang YANG Lin ZHANG Jianping YAN Yonghong 《Acta Acustica (声学学报)》2012,37(6):673-678

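This paper studies the application of perceptual minimum variance distortionless response (MVDR) cepstral coefficients to speaker recognition. The coefficients are extracted and modeled with Gaussian mixture models, and joint factor analysis is used to fit the speaker and channel variability in the Gaussian mixture models. Perceptual MVDR cepstral coefficients and conventional Mel-frequency cepstral coefficients are each tested on the core test set of the 2008 speaker recognition evaluation of the U.S. National Institute of Standards and Technology. The results show that systems based on the two feature types perform comparably. After linear fusion, the equal error rate falls by a relative 7.6% to 30.5% and the minimum detection cost by a relative 3.2% to 21.2% across the different test sets. The experiments show that perceptual MVDR cepstral coefficients can be applied effectively to speaker recognition and are complementary, to some extent, to conventional Mel-frequency cepstral coefficients.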

5.

Statistical modeling of speech Poincaré sections in combination of frequency analysis to improve speech recognition performance

Jafari A Almasganj F Bidhendi MN 《Chaos (Woodbury, N.Y.)》2010,20(3):033106

This paper introduces a combinational feature extraction approach to improve speech recognition systems. The main idea is to benefit simultaneously from features obtained from Poincaré sections applied to the speech reconstructed phase space (RPS) and from typical Mel frequency cepstral coefficients (MFCCs), which have a proven role in the speech recognition field. With an appropriate dimension, the reconstructed phase space of a speech signal is assured to be topologically equivalent to the dynamics of the speech production system, and could therefore include information that may be absent in linear analysis approaches. Moreover, complicated systems such as the speech production system can present cyclic and oscillatory patterns, and Poincaré sections could be used as an effective tool in the analysis of such trajectories. In this research, a statistical modeling approach based on Gaussian mixture models (GMMs) is applied to Poincaré sections of the speech RPS. A final pruned feature set is obtained by applying an efficient feature selection approach to the combination of the parameters of the GMM model and MFCC-based features. A hidden Markov model-based speech recognition system and the TIMIT speech database are used to evaluate the performance of the proposed feature set by conducting isolated and continuous speech recognition experiments. With the proposed feature set, a 5.7% absolute improvement in isolated phoneme recognition is obtained over MFCC-based features alone.

6.

A probabilistic framework for landmark detection based on phonetic features for automatic speech recognition

Juneja A Espy-Wilson C 《The Journal of the Acoustical Society of America》2008,123(2):1154-1168

A probabilistic framework for a landmark-based approach to speech recognition is presented for obtaining multiple landmark sequences in continuous speech. The landmark detection module uses as input acoustic parameters (APs) that capture the acoustic correlates of some of the manner-based phonetic features. The landmarks include stop bursts, vowel onsets, syllabic peaks and dips, fricative onsets and offsets, and sonorant consonant onsets and offsets. Binary classifiers of the manner phonetic features (syllabic, sonorant, and continuant) are used for probabilistic detection of these landmarks. The probabilistic framework exploits two properties of the acoustic cues of phonetic features: (1) sufficiency of acoustic cues of a phonetic feature for a probabilistic decision on that feature and (2) invariance of the acoustic cues of a phonetic feature with respect to other phonetic features. Probabilistic landmark sequences are constrained using manner class pronunciation models for isolated word recognition with known vocabulary. The performance of the system is compared with (1) the same probabilistic system but with mel-frequency cepstral coefficients (MFCCs), (2) a hidden Markov model (HMM) based system using APs, and (3) an HMM based system using MFCCs.

7.

Analysis of acoustic parameters for consonant voicing classification in clean and telephone speech

Lee SM Choi JY 《The Journal of the Acoustical Society of America》2012,131(3):EL197-EL202

This paper describes acoustic cues for classification of consonant voicing in a distinctive feature-based speech recognition system. Initial acoustic cues are selected by studying consonant production mechanisms. Spectral representations, band-limited energies, and correlation values, along with Mel-frequency cepstral coefficient (MFCC) features, are also examined. Analysis of variance is performed to assess the relative significance of features. Overall, 82.2%, 80.6%, and 78.4% classification rates are obtained on the TIMIT database for stops, fricatives, and affricates, respectively. Combining acoustic parameters with MFCCs improves performance in all cases. In addition, performance on NTIMIT telephone channel speech shows that acoustic parameters are more robust than MFCCs.

8.

Deception detection in Chinese speech based on sparse decomposition of cepstral parameters

FAN Xiaohe ZHAO Heming CHEN Xueqin ZHOU Yan 《Acta Acustica (声学学报)》2018,43(1):121-128

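To improve the accuracy of deception detection in Chinese speech, a method for sparse decomposition of the cepstral parameters of the signal is proposed. First, a wavelet packet filterbank divides the speech signal into multiple frequency bands; the log energies of the sub-bands are computed and passed through a discrete cosine transform to extract wavelet packet band cepstral coefficients, which are combined with Mel-frequency cepstral coefficients to form the cepstral parameters. Second, following the K-singular value decomposition (K-SVD) method, the speech cepstral parameter sets from the deceptive and non-deceptive states are each used to train an overcomplete mixture dictionary, and the parameter sets are sparsely coded on this dictionary with the orthogonal matching pursuit algorithm to extract sparse features. Finally, recognition experiments are run with several classification models. The experimental results show that sparse decomposition optimizes better than conventional parameter dimension-reduction methods. The sparse spectral feature recommended in this paper reaches a best recognition rate of 78.34%, which is better than the other feature parameters and significantly improves the accuracy of deception detection.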
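The sparse coding step named in this abstract, orthogonal matching pursuit over a dictionary, can be sketched as follows. This is a generic textbook form of the algorithm under stated assumptions, not the authors' implementation: the function name `omp`, the fixed sparsity budget `n_nonzero`, and the toy identity dictionary are hypothetical, and the dictionary columns are assumed to have unit norm.

```python
import numpy as np

def omp(D, x, n_nonzero):
    """Greedy sparse code of x over dictionary D (columns are unit-norm atoms)."""
    x = np.asarray(x, dtype=float)
    residual = x.copy()
    support = []                      # indices of the atoms selected so far
    coef = np.zeros(D.shape[1])
    sol = np.zeros(0)
    for _ in range(n_nonzero):
        # Select the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break
        support.append(k)
        # Refit all selected atoms jointly by least squares; this is the
        # "orthogonal" step that distinguishes OMP from plain matching pursuit.
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
        if np.linalg.norm(residual) < 1e-12:
            break
    if support:
        coef[support] = sol
    return coef

# Toy example: with an identity dictionary the two largest entries are recovered.
code = omp(np.eye(4), np.array([0.0, 3.0, 0.0, -2.0]), n_nonzero=2)
```

The coefficient vector returned here plays the role of the sparse feature; in a setting like the one described, the dictionary would be learned (for example with K-SVD) rather than fixed.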

9.

Effect of speech-intrinsic variations on human and automatic recognition of spoken phonemes

Meyer BT Brand T Kollmeier B 《The Journal of the Acoustical Society of America》2011,129(1):388-403

The aim of this study is to quantify the gap between the recognition performance of human listeners and an automatic speech recognition (ASR) system with special focus on intrinsic variations of speech, such as speaking rate and effort, altered pitch, and the presence of dialect and accent. Second, it is investigated if the most common ASR features contain all information required to recognize speech in noisy environments by using resynthesized ASR features in listening experiments. For the phoneme recognition task, the ASR system achieved the human performance level only when the signal-to-noise ratio (SNR) was increased by 15 dB, which is an estimate for the human-machine gap in terms of the SNR. The major part of this gap is attributed to the feature extraction stage, since human listeners achieve comparable recognition scores when the SNR difference between unaltered and resynthesized utterances is 10 dB. Intrinsic variabilities result in strong increases of error rates, both in human speech recognition (HSR) and ASR (with a relative increase of up to 120%). An analysis of phoneme duration and recognition rates indicates that human listeners are better able to identify temporal cues than the machine at low SNRs, which suggests incorporating information about the temporal dynamics of speech into ASR systems.
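To make concrete what a gap expressed in dB of SNR means, the sketch below scales a noise signal so that a speech-plus-noise mixture has a requested SNR. The helper name `mix_at_snr` and the synthetic signals are assumptions for illustration only and are not the stimuli or procedure of the study.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Return speech + gain * noise such that the mixture has the given SNR in dB."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # SNR_dB = 10 * log10(p_speech / (gain**2 * p_noise)), solved for gain.
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise, gain

# Synthetic stand-ins for a speech and a noise recording.
rng = np.random.default_rng(0)
speech = rng.standard_normal(1000)
noise = rng.standard_normal(1000)
mixed, gain = mix_at_snr(speech, noise, 10.0)
```

Because the dB scale is logarithmic, a 15 dB difference in SNR corresponds to a noise power ratio of 10^(15/10), roughly 31.6.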

10.

Sparse feature extraction and source type recognition for impact sounds

LIANG Yong CHEN Ke'an 《Acta Acustica (声学学报)》2018,43(4):708-718

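For the task of fine-grained classification of source material type at low signal-to-noise ratios, sparse representation is applied to source type recognition of impact sound signals; the extracted sparse features improve recognition performance over conventional MFCC features. Using three predefined dictionaries and one dictionary learned from the training signals, recorded impact sounds are sparsely represented with orthogonal matching pursuit (OMP), and the extracted sparse features are used for source identification of impact sound signals at different signal-to-noise ratios and compared with MFCC features. Classification results on an impact sound database covering 12 material classes show that sparse features give better recognition than MFCC features in almost all cases. In particular, sparse features are more robust to noise at low signal-to-noise ratios.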

11.

Automatic classification and speaker identification of African elephant (Loxodonta africana) vocalizations

Clemins PJ Johnson MT Leong KM Savage A 《The Journal of the Acoustical Society of America》2005,117(2):956-963

A hidden Markov model (HMM) system is presented for automatically classifying African elephant vocalizations. The development of the system is motivated by successful models from human speech analysis and recognition. Classification features include frequency-shifted Mel-frequency cepstral coefficients (MFCCs) and log energy, spectrally motivated features which are commonly used in human speech processing. Experiments, including vocalization type classification and speaker identification, are performed on vocalizations collected from captive elephants in a naturalistic environment. The system classified vocalizations with accuracies of 94.3% and 82.5% for type classification and speaker identification classification experiments, respectively. Classification accuracy, statistical significance tests on the model parameters, and qualitative analysis support the effectiveness and robustness of this approach for vocalization analysis in nonhuman species.

12.

Deceptive Chinese speech detection based on sparse decomposition of cepstral feature

FAN Xiaohe ZHAO Heming CHEN Xueqin ZHOU Yan 《Chinese Journal of Acoustics (声学学报：英文版)》2019,(1)

To improve the performance of deception detection based on Chinese speech signals, a method of sparse decomposition of cepstral features is proposed. First, the wavelet packet transform is applied to divide the speech signal into multiple sub-bands. Wavelet packet band cepstral features are obtained by applying the discrete cosine transform to the logarithmic energy of each sub-band. The cepstral feature is generated by combining Mel Frequency Cepstral Coefficients and Wavelet Packet Band Cepstral Coefficients. Second, the K-singular value decomposition algorithm is employed to train an over-complete mixture dictionary based on both the truthful and deceptive feature sets, and an orthogonal matching pursuit algorithm is used for sparse coding over the mixture dictionary to obtain the sparse feature. Finally, recognition experiments are performed with various classification models. Experimental results show that the sparse decomposition method performs better than conventional dimension-reduction methods. The recognition accuracy of the method proposed in this paper is 78.34%, which is higher than that of methods using other features and significantly improves the recognition ability of the deception detection system.

13.

Sub-band energy normalized perceptual linear prediction coefficients for noise-robust speech recognition

CAI Shang JIN Xin GAO Shengxiang PAN Jielin YAN Yonghong 《Acta Acustica (声学学报)》2012,37(6):667-672

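To improve the recognition performance of perceptual linear prediction (PLP) coefficients in noisy environments, sub-band energy normalized perceptual linear prediction coefficients (SPNPLP) are proposed, based on sub-band energy bias subtraction. PLP concentrates the useful information in speech effectively, and automatic speech recognition systems using PLP achieve good recognition rates in quiet environments, but recognition performance drops sharply in noise. SPNPLP normalizes the sub-band energies of PLP by energy bias subtraction, which suppresses background noise excitation and makes automatic speech recognition systems more robust in noisy environments. Tests were run on an isolated-word recognition task with a grammar size of 501 and on a large-vocabulary continuous speech recognition task. Compared with PLP, SPNPLP improved Chinese character recognition accuracy on the two tasks by 11.26% and 9.2% absolute, respectively. The experimental results show that SPNPLP is more robust to noise than PLP.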

14.

Automatic speech recognition in cocktail-party situations: a specific training for separated speech

Marti A Cobos M Lopez JJ 《The Journal of the Acoustical Society of America》2012,131(2):1529-1535

15.

A new target recognition method based on Gabor wavelet texture features

ZHANG Min XU Tingfa 《Physics Experimentation (物理实验)》2004,24(4):12-15

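A new target recognition method based on Gabor wavelet texture features is presented. A multi-channel wavelet filter is designed using Gabor wavelets, and the wavelet transform is applied directly to the image target. The mean and standard deviation of the magnitudes of the Gabor wavelet transform coefficients represent the extracted features of the image target; the wavelet features are normalized and fed to an improved BP neural network classifier for classification and recognition. Finally, a series of simulation experiments shows that this feature extraction method effectively extracts the texture features of image targets and is robust to noise and to changes in shape. When applied to target recognition, the training time of the neural network is reduced to 10 min and the recognition rate reaches 94%.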
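The feature described in this abstract, the mean and standard deviation of Gabor response magnitudes per channel followed by normalization, can be sketched as below. This is an illustrative reading of the abstract, not the paper's filter design: the kernel parameters, the choice of two wavelengths and four orientations, the FFT-based circular convolution, and the names `gabor_kernel` and `gabor_texture_features` are all assumptions.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma):
    """Complex Gabor kernel: Gaussian envelope times a plane-wave carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # coordinate along the carrier
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    carrier = np.exp(2j * np.pi * xr / wavelength)
    return envelope * carrier

def gabor_texture_features(image, wavelengths=(4.0, 8.0), n_orient=4, size=15):
    """Mean and standard deviation of |response| for each channel, scaled to [0, 1]."""
    image = np.asarray(image, dtype=float)
    spectrum = np.fft.fft2(image)
    feats = []
    for lam in wavelengths:
        for k in range(n_orient):
            kern = gabor_kernel(size, lam, np.pi * k / n_orient, sigma=0.5 * lam)
            # Circular convolution via the FFT; the kernel is zero-padded
            # to the image size before transforming.
            resp = np.fft.ifft2(spectrum * np.fft.fft2(kern, s=image.shape))
            mag = np.abs(resp)
            feats.extend([mag.mean(), mag.std()])
    feats = np.array(feats)
    # Assumed normalization: divide by the largest feature value.
    return feats / (feats.max() + 1e-12)

# Toy image: vertical stripes with a period of 8 pixels.
stripes = np.tile(np.cos(2 * np.pi * np.arange(32) / 8.0), (32, 1))
features = gabor_texture_features(stripes)
```

With these defaults the feature vector has 16 entries (2 wavelengths x 4 orientations x 2 statistics), and such a vector would then be the input to a classifier like the neural network mentioned in the abstract.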

16.

Application of acoustic features based on an auditory model and the adaptive fractional Fourier transform to speech recognition

YIN Hui XIE Xiang KUANG Jingming 《Acta Acustica (声学学报)》2012,37(1):97-103

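The fractional Fourier transform has unique advantages in processing non-stationary signals, chirp signals in particular, and the human auditory system performs far better than automatic speech recognition systems. In this paper, a Gammatone auditory filterbank is used for front-end time-domain filtering of the speech signal, and acoustic features are then extracted from each output sub-band signal with the fractional Fourier transform. The order of the fractional Fourier transform strongly affects its performance. For the sub-band time-domain signals, a method that obtains the order by fitting the instantaneous frequency curve is proposed and compared with a method based on the ambiguity function. Speech recognition results on clean and noisy Mandarin isolated-digit databases show that the recognition accuracy obtained with the newly proposed acoustic features is significantly higher than that of the MFCC baseline system. Compared with the ambiguity function method, the algorithm that searches for the order from the instantaneous frequency curve requires far less computation, and the acoustic features extracted with it give the highest average recognition accuracy.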

17.

Predicting fundamental frequency from mel-frequency cepstral coefficients to enable speech reconstruction

Shao X Milner B 《The Journal of the Acoustical Society of America》2005,118(2):1134-1143

This work proposes a method to reconstruct an acoustic speech signal solely from a stream of mel-frequency cepstral coefficients (MFCCs) as may be encountered in a distributed speech recognition (DSR) system. Previous methods for speech reconstruction have required, in addition to the MFCC vectors, fundamental frequency and voicing components. In this work the voicing classification and fundamental frequency are predicted from the MFCC vectors themselves using two maximum a posteriori (MAP) methods. The first method enables fundamental frequency prediction by modeling the joint density of MFCCs and fundamental frequency using a single Gaussian mixture model (GMM). The second scheme uses a set of hidden Markov models (HMMs) to link together a set of state-dependent GMMs, which enables a more localized modeling of the joint density of MFCCs and fundamental frequency. Experimental results on speaker-independent male and female speech show that accurate voicing classification and fundamental frequency prediction is attained when compared to hand-corrected reference fundamental frequency measurements. The use of the predicted fundamental frequency and voicing for speech reconstruction is shown to give very similar speech quality to that obtained using the reference fundamental frequency and voicing.
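Predicting one variable from a joint GMM, as in the first method of this abstract, usually comes down to conditioning each Gaussian component on the observed part. The sketch below uses a posterior-weighted conditional mean, which is a common estimator for this setup but not necessarily the exact MAP estimator of the paper. The function names, the convention that fundamental frequency is the last dimension of each component, and the toy parameters are assumptions.

```python
import numpy as np

def log_gauss(x, mean, cov):
    """Log density of a multivariate Gaussian at x."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

def predict_f0(x, weights, means, covs):
    """Estimate f0 from feature vector x using a joint GMM over [features, f0]."""
    x = np.asarray(x, dtype=float)
    means = [np.asarray(m, dtype=float) for m in means]
    covs = [np.asarray(c, dtype=float) for c in covs]
    D = len(x)  # the first D dimensions are the features, dimension D is f0
    # Posterior probability of each component given only the features.
    log_post = np.array([
        np.log(w) + log_gauss(x, m[:D], c[:D, :D])
        for w, m, c in zip(weights, means, covs)
    ])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # Conditional mean of f0 given the features, under each component.
    cond = np.array([
        m[D] + c[D, :D] @ np.linalg.solve(c[:D, :D], x - m[:D])
        for m, c in zip(means, covs)
    ])
    return float(post @ cond)

# Toy single-component model with hypothetical parameters.
f0_hat = predict_f0(
    [2.0, 0.0],
    weights=[1.0],
    means=[[0.0, 0.0, 100.0]],
    covs=[[[1.0, 0.0, 0.5], [0.0, 1.0, 0.0], [0.5, 0.0, 1.0]]],
)
```

Taking only the most probable component instead of the weighted sum would give a harder, MAP-style decision over components.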

18.

Exploiting independent filter bandwidth of human factor cepstral coefficients in automatic speech recognition

Skowronski MD Harris JG 《The Journal of the Acoustical Society of America》2004,116(3):1774-1780

Mel frequency cepstral coefficients (MFCC) are the most widely used speech features in automatic speech recognition systems, primarily because the coefficients fit well with the assumptions used in hidden Markov models and because of the superior noise robustness of MFCC over alternative feature sets such as linear prediction-based coefficients. The authors have recently introduced human factor cepstral coefficients (HFCC), a modification of MFCC that uses the known relationship between center frequency and critical bandwidth from human psychoacoustics to decouple filter bandwidth from filter spacing. In this work, the authors introduce a variation of HFCC called HFCC-E in which filter bandwidth is linearly scaled in order to investigate the effects of wider filter bandwidth on noise robustness. Experimental results show an increase in signal-to-noise ratio of 7 dB over traditional MFCC algorithms when filter bandwidth increases in HFCC-E. An important attribute of both HFCC and HFCC-E is that the algorithms only differ from MFCC in the filter bank coefficients: increased noise robustness using wider filters is achieved with no additional computational cost.
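For context on the coupling between filter bandwidth and filter spacing that this abstract refers to, here is a minimal sketch of a conventional triangular mel filterbank and a single-frame MFCC computation. It shows only the standard MFCC side, in which each triangle's edges are its neighbors' center frequencies; it is not the HFCC or HFCC-E filter design. The function names, the common 2595 log10(1 + f/700) mel formula, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are the neighboring center frequencies."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    fb = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        # Bandwidth (hi - lo) is fixed by the spacing of the centers.
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb, hz_pts

def mfcc_frame(frame, fb, n_ceps=13):
    """MFCCs of one frame: power spectrum, mel log energies, then a DCT-II."""
    n_fft = 2 * (fb.shape[1] - 1)
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=n_fft)) ** 2
    log_e = np.log(fb @ spec + 1e-10)
    n = len(log_e)
    k = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * k * (np.arange(n) + 0.5) / n)   # unnormalized DCT-II
    return basis @ log_e

fb, hz_pts = mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000)
frame = np.sin(2 * np.pi * 440.0 * np.arange(400) / 16000.0)
ceps = mfcc_frame(frame, fb)
```

A design that decouples bandwidth from spacing would replace the `lo` and `hi` edges with values derived from a separate bandwidth rule while leaving the rest of the pipeline unchanged, which is consistent with the abstract's remark that only the filter bank coefficients differ.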

19.

A 3D face recognition method robust to pose and expression variations

CAI Chuanli ZHANG Jianping ZHANG Yanbo 《Journal of Applied Optics (应用光学)》2018,39(4):491-499

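To improve face recognition rates under pose and expression variations, a face recognition method that matches faces using an isometry-invariant representation is proposed, exploiting the ability of the distance to local plane (DLP) to judge the local concavity and convexity of a surface. First, depth images captured by a depth camera undergo distance constraints, position constraints, conversion, and other operations to obtain a clean and complete 3D face; the DLP value of every point on the 3D face is used to locate the nose tip, and a clustering approach is used to locate the nose root. Second, an improved fast marching algorithm computes the geodesic distance matrix of the face, and a threshold is set to cut out the effective face region. Finally, high-order moment features of the effective face region are computed and used as the face feature vector for matching. Experimental results show that the recognition rate of the proposed algorithm is close to 97% on different databases. Compared with a face recognition algorithm based on contour-line features and one based on Gabor features, the recognition rate is 14.1% and 8.3% higher, respectively, while computational efficiency remains high.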

20.

Spectro-temporal modulation energy based mask for robust speaker identification

Chi TS Lin TH Hsu CC 《The Journal of the Acoustical Society of America》2012,131(5):EL368-EL374

Spectro-temporal modulations of speech encode speech structures and speaker characteristics. An algorithm which distinguishes speech from non-speech based on spectro-temporal modulation energies is proposed and evaluated in robust text-independent closed-set speaker identification simulations using the TIMIT and GRID corpora. Simulation results show the proposed method produces much higher speaker identification rates in all signal-to-noise ratio (SNR) conditions than the baseline system using mel-frequency cepstral coefficients. In addition, the proposed method also outperforms the system, which uses auditory-based nonnegative tensor cepstral coefficients [Q. Wu and L. Zhang, "Auditory sparse representation for robust speaker recognition based on tensor structure," EURASIP J. Audio, Speech, Music Process. 2008, 578612 (2008)], in low SNR (≤ 10 dB) conditions.