1.

An overlapping-feature-based phonological model incorporating linguistic constraints: applications to speech recognition

Sun J Deng L 《The Journal of the Acoustical Society of America》2002,111(2):1086-1101

Modeling phonological units of speech is a critical issue in speech recognition. In this paper, our recent development of an overlapping-feature-based phonological model that represents long-span contextual dependency in speech acoustics is reported. In this model, high-level linguistic constraints are incorporated in automatic construction of the patterns of feature-overlapping and of the hidden Markov model (HMM) states induced by such patterns. The main linguistic information explored includes word and phrase boundaries, morpheme, syllable, syllable constituent categories, and word stress. A consistent computational framework developed for the construction of the feature-based model and the major components of the model are described. Experimental results on the use of the overlapping-feature model in an HMM-based system for speech recognition show improvements over the conventional triphone-based phonological model.

2.

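Isolated-word recognition of Chinese whispered speech Total citations: 6, self-citations: 0, citations by others: 6

Yang Lili, Lin Wei, Xu Boling 《Applied Acoustics (应用声学)》2006,25(3):187-192

Whispered speech recognition has broad application prospects and is a new research topic. However, characteristics of whispered speech, such as its low sound level and lack of fundamental frequency, make recognition difficult. Based on the production model of whispered speech signals and the acoustic properties of whispered speech, this paper builds an isolated-word recognition system for Chinese whispered speech. Because whispered speech has a low signal-to-noise ratio, speech enhancement must be applied; in addition, using tone information in the recognition system improves recognition performance. Experimental results show that MFCCs combined with the amplitude envelope can serve as feature parameters for automatic recognition of Chinese whispered speech; on a small vocabulary, recognition with HMMs achieves a recognition rate of 90.4%.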

3.

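A Chinese initial/final recognition model based on nasal coda separation

Shao Jian, Zhao Qingwei, Yan Yonghong 《Acta Acustica (声学学报)》2010,35(5):587-592

This paper studies the choice of modeling units for recognition of spontaneous Chinese speech. In three-state HMMs, initial/final units and phoneme units are the two most popular modeling units, each with its own strengths and weaknesses. On the one hand, because sound variation is severe in spontaneous speech, coarse-grained initial/final units are favored since they can cover the various variations; on the other hand, because a three-state topology may not describe complex units effectively, fine-grained phoneme units are favored. Building on theoretical results from experimental phonetics and on duration analysis experiments for initials and finals, this paper advocates selectively splitting the extended initial/final units and proposes a splitting method based on nasal coda separation. Experimental results show that, compared with extended initial/final units and phoneme units, the proposed method clearly improves recognition performance, reducing the character error rate by 2.23% and 9.45%, respectively.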
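
The character error rate quoted in this abstract is conventionally computed from the edit distance between the reference and recognized character strings. The following is a minimal sketch of that convention, not the authors' scoring tool; the function names are illustrative, and since the abstract does not say whether the reported reductions are absolute or relative, `relative_reduction` is included only to show the relative form.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # delete a reference character
                            curr[j - 1] + 1,       # insert a hypothesis character
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """CER = edit distance divided by the number of reference characters."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative error-rate reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical reference and recognizer output (one substitution, one deletion).
    print(character_error_rate("今天天气很好", "今天天汽好"))
```

Scoring on characters rather than words avoids any dependence on a word segmenter, which is why character error rate is the usual metric for Chinese recognition.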

4.

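Design and implementation of a Chongqing dialect speech recognition system

Zhang Ce, Wei Pengcheng, Lu Xiaoyan, Shi Xi 《Applied Acoustics (应用声学)》2018,26(1)

Speech recognition enables computers to recognize the content of speech and is an important research topic in human-computer interaction. With the development of computer technology, speech recognition has matured, but dialect speech recognition still has much room for development. China is a vast and populous country with many dialects, one of which is the Chongqing dialect, used by more than 30 million people. Text files of selected Chongqing dialect words and the corresponding audio files were collected to build a corpus. Based on the pronunciation characteristics of the Chongqing dialect, its initials and finals were chosen as the acoustic modeling units and the hidden Markov model (HMM) was chosen as the acoustic model, yielding an HMM-based Chongqing dialect speech recognition system. In training, the acoustic models are trained on the training-set portion of the corpus to form an HMM model library; in recognition, the test-set portion is used for recognition tests. Experimental results show that the system can recognize Chongqing dialect speech, with a recognition accuracy of 100%.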

5.

A probabilistic framework for landmark detection based on phonetic features for automatic speech recognition

Juneja A Espy-Wilson C 《The Journal of the Acoustical Society of America》2008,123(2):1154-1168

A probabilistic framework for a landmark-based approach to speech recognition is presented for obtaining multiple landmark sequences in continuous speech. The landmark detection module uses as input acoustic parameters (APs) that capture the acoustic correlates of some of the manner-based phonetic features. The landmarks include stop bursts, vowel onsets, syllabic peaks and dips, fricative onsets and offsets, and sonorant consonant onsets and offsets. Binary classifiers of the manner phonetic features (syllabic, sonorant, and continuant) are used for probabilistic detection of these landmarks. The probabilistic framework exploits two properties of the acoustic cues of phonetic features: (1) sufficiency of acoustic cues of a phonetic feature for a probabilistic decision on that feature and (2) invariance of the acoustic cues of a phonetic feature with respect to other phonetic features. Probabilistic landmark sequences are constrained using manner class pronunciation models for isolated word recognition with known vocabulary. The performance of the system is compared with (1) the same probabilistic system but with mel-frequency cepstral coefficients (MFCCs), (2) a hidden Markov model (HMM) based system using APs, and (3) an HMM-based system using MFCCs.

6.

Automatic classification and speaker identification of African elephant (Loxodonta africana) vocalizations

Clemins PJ Johnson MT Leong KM Savage A 《The Journal of the Acoustical Society of America》2005,117(2):956-963

A hidden Markov model (HMM) system is presented for automatically classifying African elephant vocalizations. The development of the system is motivated by successful models from human speech analysis and recognition. Classification features include frequency-shifted Mel-frequency cepstral coefficients (MFCCs) and log energy, spectrally motivated features which are commonly used in human speech processing. Experiments, including vocalization type classification and speaker identification, are performed on vocalizations collected from captive elephants in a naturalistic environment. The system classified vocalizations with accuracies of 94.3% and 82.5% for type classification and speaker identification classification experiments, respectively. Classification accuracy, statistical significance tests on the model parameters, and qualitative analysis support the effectiveness and robustness of this approach for vocalization analysis in nonhuman species.
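
HMM-based classification of this kind is typically done by training one model per class and assigning a new observation sequence to the class whose model gives it the highest likelihood. The sketch below shows that decision rule with the scaled forward algorithm on a discrete-symbol HMM. It is a simplified illustration under stated assumptions: the system in the abstract uses continuous MFCC and log-energy features, whereas the two-symbol models, class names, and all probabilities here are hypothetical.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Log P(obs | model) for a discrete HMM, using the scaled forward algorithm."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by the emission.
            alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][obs[t]]
                     for s in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid numerical underflow
    return log_lik


def classify(obs, models):
    """Return the name of the model with the highest log-likelihood for obs."""
    return max(models, key=lambda name: forward_log_likelihood(obs, *models[name]))


# Hypothetical two-state models over symbols 0 ("low band") and 1 ("high band").
# Each entry is (start probabilities, transition matrix, emission matrix).
MODELS = {
    "class_a": ([0.9, 0.1], [[0.8, 0.2], [0.2, 0.8]], [[0.9, 0.1], [0.6, 0.4]]),
    "class_b": ([0.1, 0.9], [[0.8, 0.2], [0.2, 0.8]], [[0.4, 0.6], [0.1, 0.9]]),
}

if __name__ == "__main__":
    print(classify([0, 0, 0, 0], MODELS))
    print(classify([1, 1, 1, 1], MODELS))
```

The same rule covers both experiments described in the abstract: only the set of competing models changes, one per vocalization type or one per individual.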

7.

Detection of speech landmarks: use of temporal information

Salomon A Espy-Wilson CY Deshmukh O 《The Journal of the Acoustical Society of America》2004,115(3):1296-1305

Studies by Shannon et al. [Science, 270, 303-304 (1995)], Van Tasell et al. [J. Acoust. Soc. Am. 82, 1152-1161 (1987)], and others show that human listeners can understand important aspects of the speech signal when spectral shape has been significantly degraded. These experiments suggest that temporal information is particularly important in human speech perception when the speech signal is heavily degraded. In this study, a system is developed that extracts linguistically relevant temporal information that can be used in the front end of an automatic speech recognition system. The parameters targeted include energy onsets and offsets (computed using an adaptive algorithm) and measures of periodic and aperiodic content; together these are used to find abrupt acoustic events which signify landmarks. Overall detection rates for strongly robust events, robust events, and weak events in a portion of the TIMIT test database are 98.9%, 94.7%, and 52.1%, respectively. Error rates increase by less than 5% when the speech signals are spectrally impoverished. Use of the four temporal parameters as the front end of a hidden Markov model (HMM)-based system for the automatic recognition of the manner classes "sonorant," "fricative," "stop," and "silence" results in the same recognition accuracy achieved when the standard 39 cepstral-based parameters are used, 70.1%. The combination of the temporal parameters and cepstral parameters results in an accuracy of 74.8%.

8.

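Research on mixed bilingual speech recognition Total citations: 1, self-citations: 0, citations by others: 1

Zhang Qingqing, Pan Jielin, Yan Yonghong 《Acta Acustica (声学学报)》2010,35(2):270-275

With the globalization of information in modern society, bilingual and multilingual language mixing is increasingly common, and bilingual or multilingual speech recognition has accordingly become a hot topic in speech recognition research. Mixed bilingual speech recognition faces two main problems: first, controlling system complexity while maintaining the recognition rate for both languages; second, effectively handling the non-native accent in embedded-language speech that is induced by the speaker's primary language. To handle language mixing and reduce the amount of data needed for statistical modeling, a unified bilingual recognition system is built by merging phonemes through clustering. For clustering, a new two-pass phoneme clustering algorithm based on the confusion matrix is proposed and compared with the traditional clustering method based on the acoustic likelihood criterion. To address the low recognition performance on non-native speech in bilingual speech, a new bilingual model modification algorithm is proposed. Experimental results show that the resulting Chinese-English bilingual speech recognition system recognizes both languages while keeping model size under control, and that recognition performance on both monolingual and mixed-language speech is maintained.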

9.

Robust speech recognition from binary masks

Narayanan A Wang D 《The Journal of the Acoustical Society of America》2010,128(5):EL217-EL222

Inspired by recent evidence that a binary pattern may provide sufficient information for human speech recognition, this letter proposes a fundamentally different approach to robust automatic speech recognition. Specifically, recognition is performed by classifying binary masks corresponding to a word utterance. The proposed method is evaluated using a subset of the TIDigits corpus to perform isolated digit recognition. Despite dramatic reduction of speech information encoded in a binary mask, the proposed system performs surprisingly well. The system is compared with a traditional HMM based approach and is shown to perform well under low SNR conditions.
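
To make "classifying binary masks" concrete, the sketch below builds a binary time-frequency mask from per-unit speech and noise energies and then picks the closest stored template by Hamming distance. Both parts are assumptions for illustration: the abstract does not say how its masks are computed or which classifier is used, so the local-SNR threshold (the common "ideal binary mask" definition), the `lc_db` parameter, and the nearest-template rule are stand-ins rather than the letter's method.

```python
import math


def binary_mask(speech_energy, noise_energy, lc_db=0.0):
    """Return a 0/1 mask with 1 where the local SNR in dB exceeds lc_db.

    Both inputs are 2-D lists indexed as [frequency_channel][time_frame].
    """
    mask = []
    for s_row, n_row in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(s_row, n_row):
            if s <= 0.0:
                row.append(0)          # no speech energy in this unit
            elif n <= 0.0:
                row.append(1)          # speech present and no noise
            else:
                snr_db = 10.0 * math.log10(s / n)
                row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask


def hamming_distance(mask_a, mask_b):
    """Count the time-frequency units where two equally sized masks differ."""
    return sum(a != b
               for row_a, row_b in zip(mask_a, mask_b)
               for a, b in zip(row_a, row_b))


def nearest_template(mask, templates):
    """Return the label of the stored template mask closest to `mask`."""
    return min(templates, key=lambda label: hamming_distance(mask, templates[label]))


if __name__ == "__main__":
    speech = [[4.0, 1.0], [0.0, 2.0]]
    noise = [[1.0, 4.0], [1.0, 0.0]]
    print(binary_mask(speech, noise))
```

A mask discards all magnitude information and keeps only which units are speech-dominated, which is the "dramatic reduction of speech information" the abstract refers to.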

10.

Experimental and numerical study on long-span reticulate structure with multidimensional high-damping earthquake isolation devices Total citations: 1, self-citations: 0, citations by others: 1

Zhao-Dong Xu Shao-An Wang Chao Xu 《Journal of sound and vibration》2014

Multidimensional high-damping earthquake isolation device (MHEID) for long-span reticulate structure is an innovative passive vibration control device. In this paper, the results of horizontal and vertical property tests are presented first and then effects of excitation frequency, displacement on MHEID are studied. In order to consider the effects of excitation frequency, displacement amplitude and temperature on MHEID, a new mathematical model, i.e., fractional-derivative equivalent standard solid model, is put forward to describe the dynamic properties of MHEID precisely in both horizontal and vertical directions. Then, horizontal and vertical pseudo-dynamic tests on structures with and without MHEID are conducted. It can be seen from the experimental results that MHEID can obviously reduce the displacement responses, acceleration responses and input forces of the long-span reticulate structure. In order to analyze the earthquake isolation effect of MHEID on long-span reticulate structures, the dynamic responses are simulated by using a new dynamic response analysis method. The numerical results fit well with the experimental results and it is indicated that the proposed method can simulate the dynamic responses of the long-span reticulate structure.

11.

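Speech recognition based on an improved hybrid model of hidden Markov models and neural networks

Chen Liwei, Zhang Ye 《Applied Acoustics (应用声学)》2006,25(2):90-95

An inhomogeneous hidden Markov model is studied and then combined with a self-organizing feature map neural network to train noise-robust HMMs; the resulting hybrid model is applied to speech recognition. Experimental results show that the model is suited to recognizing speech in noisy backgrounds. It is more robust to noise, and at low signal-to-noise ratios (5 dB to 10 dB) the recognition rate improves by about 5%.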

12.

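Tone recognition in spontaneous Chinese speech Total citations: 2, self-citations: 0, citations by others: 2

Liu Zhaojie, Shao Jian, Zhang Pengyuan, Zhao Qingwei, Yan Yonghong, Feng Ji 《Acta Physica Sinica (物理学报)》2007,56(12):7064-7069

Chinese is a tonal language, and tone information is very important in Chinese speech recognition. Traditional tone recognition generally studies only the relatively standard tones of read speech and rarely gives dedicated treatment to spontaneous speech, whose tone contours are more complex. In view of the characteristics of spontaneous Chinese speech, a real-context model is proposed for choosing tone modeling units. In addition, to model tone patterns finely, a hierarchical clustering method is used to obtain more tone patterns. Experimental results demonstrate the effectiveness of the method. Keywords: tone recognition; spontaneous speech; real-context model; clustering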

13.

Statistical properties of Chinese phonemic networks

Shuiyuan Yu Chunshan Xu 《Physica A》2011,390(7):1370-1380

The study of properties of speech sound systems is of great significance in understanding the human cognitive mechanism and the working principles of speech sound systems. Some properties of speech sound systems, such as the listener-oriented feature and the talker-oriented feature, have been unveiled with the statistical study of phonemes in human languages and the research of the interrelations between human articulatory gestures and the corresponding acoustic parameters. With all the phonemes of speech sound systems treated as a coherent whole, our research, which focuses on the dynamic properties of speech sound systems in operation, investigates some statistical parameters of Chinese phoneme networks based on real text and dictionaries. The findings are as follows: phonemic networks have high connectivity degrees and short average distances; the degrees obey normal distribution and the weighted degrees obey power law distribution; vowels enjoy higher priority than consonants in the actual operation of speech sound systems; the phonemic networks have high robustness against targeted attacks and random errors. In addition, for investigating the structural properties of a speech sound system, a statistical study of dictionaries is conducted, which shows the higher frequency of shorter words and syllables and the tendency that the longer a word is, the shorter the syllables composing it are. From these structural properties and dynamic properties one can derive the following conclusion: the static structure of a speech sound system tends to promote communication efficiency and save articulation effort while the dynamic operation of this system gives preference to reliable transmission and easy recognition. In short, a speech sound system is an effective, efficient and reliable communication system optimized in many aspects.
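
The network statistics named in this abstract (degree, weighted degree, average distance) can be sketched on a toy example. The abstract does not state how edges are defined, so this sketch assumes that two phonemes are linked whenever they occur next to each other in a sequence, with the edge weight counting how often that happens; the sequences themselves are hypothetical.

```python
from collections import defaultdict, deque


def build_network(sequences):
    """Map each undirected edge (a frozenset of two phonemes) to its adjacency count."""
    weights = defaultdict(int)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            if a != b:  # ignore self-loops
                weights[frozenset((a, b))] += 1
    return weights


def degrees(weights):
    """Return (degree, weighted degree) dictionaries keyed by phoneme."""
    degree = defaultdict(int)
    weighted = defaultdict(int)
    for edge, w in weights.items():
        for node in edge:
            degree[node] += 1      # number of distinct neighbours
            weighted[node] += w    # total co-occurrence count
    return dict(degree), dict(weighted)


def average_distance(weights):
    """Mean shortest-path length over all connected ordered node pairs (BFS)."""
    adj = defaultdict(set)
    for edge in weights:
        a, b = tuple(edge)
        adj[a].add(b)
        adj[b].add(a)
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


if __name__ == "__main__":
    toy = [["n", "i", "h", "a", "o"], ["h", "a", "o"], ["n", "a"]]
    net = build_network(toy)
    print(degrees(net))
    print(average_distance(net))
```

On a real corpus, the degree and weighted-degree dictionaries would be the inputs to the distribution fits (normal and power law) that the abstract reports.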

14.

Recognizing articulatory gestures from speech for robust speech recognition

Mitra V Nam H Espy-Wilson C Saltzman E Goldstein L 《The Journal of the Acoustical Society of America》2012,131(3):2270-2287

Studies have shown that supplementary articulatory information can help to improve the recognition rate of automatic speech recognition systems. Unfortunately, articulatory information is not directly observable, necessitating its estimation from the speech signal. This study describes a system that recognizes articulatory gestures from speech, and uses the recognized gestures in a speech recognition system. Recognizing gestures for a given utterance involves recovering the set of underlying gestural activations and their associated dynamic parameters. This paper proposes a neural network architecture for recognizing articulatory gestures from speech and presents ways to incorporate articulatory gestures for a digit recognition task. The lack of natural speech database containing gestural information prompted us to use three stages of evaluation. First, the proposed gestural annotation architecture was tested on a synthetic speech dataset, which showed that the use of estimated tract-variable-time-functions improved gesture recognition performance. In the second stage, gesture-recognition models were applied to natural speech waveforms and word recognition experiments revealed that the recognized gestures can improve the noise-robustness of a word recognition system. In the final stage, a gesture-based Dynamic Bayesian Network was trained and the results indicate that incorporating gestural information can improve word recognition performance compared to acoustic-only systems.

15.

Recognition of Putonghua voiceless stop like initials based on speech main periods

OU Guiwen 《Chinese Journal of Acoustics (声学学报：英文版)》1994,(1)

I. Introduction. Nowadays, there is much advancement in the research into speech recognition. Many researchers have been interested in the implementation of a reliable real-time recognition system of unlimited vocabulary. There are a few products converting syllables into Chinese characters in the market. However, the implementation of a robust real-time recognition system of unlimited vocabulary is very difficult, and it is the great aim of our research. We have a TMS320-C25 signal processing board attached to a computer of the IBM-PC/AT 80386. We hope that our speech recognit…

16.

Vocal quality factors: analysis, synthesis, and perception. Total citations: 4, self-citations: 0, citations by others: 4

D G Childers C K Lee 《The Journal of the Acoustical Society of America》1991,90(5):2394-2410

The purpose of this study was to examine several factors of vocal quality that might be affected by changes in vocal fold vibratory patterns. Four voice types were examined: modal, vocal fry, falsetto, and breathy. Three categories of analysis techniques were developed to extract source-related features from speech and electroglottographic (EGG) signals. Four factors were found to be important for characterizing the glottal excitations for the four voice types: the glottal pulse width, the glottal pulse skewness, the abruptness of glottal closure, and the turbulent noise component. The significance of these factors for voice synthesis was studied and a new voice source model that accounted for certain physiological aspects of vocal fold motion was developed and tested using speech synthesis. Perceptual listening tests were conducted to evaluate the auditory effects of the source model parameters upon synthesized speech. The effects of the spectral slope of the source excitation, the shape of the glottal excitation pulse, and the characteristics of the turbulent noise source were considered. Applications for these research results include synthesis of natural sounding speech, synthesis and modeling of vocal disorders, and the development of speaker independent (or adaptive) speech recognition systems.

17.

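Speech emotion recognition with a cascaded "generative/discriminative" hybrid model

Huang Yongming, Zhang Guobao, Dong Fei, Li Yue 《Acta Acustica (声学学报)》2013,38(2):231-240

A speech emotion recognition method based on a cascaded "generative/discriminative" hybrid model is proposed. First, 63-dimensional utterance-level features are extracted, the Fisher method is used to select the 12 best utterance-level features, and a cascaded generative model built on a wavelet neural network (WNN) is used for speech emotion recognition. Then 69-dimensional frame-level features are extracted, SFS is used to select the 8 features to be used, Gaussian mixture models (GMMs) provide multidimensional probability outputs, and a cascaded "generative/discriminative" hybrid model is built for speech emotion recognition. Experimental results show that: (1) the cascaded "generative/discriminative" hybrid model achieves a higher recognition rate than the WNN, GMM, HMM (hidden Markov model), or SVM (support vector machine) alone; (2) the cascaded "generative/discriminative" hybrid model achieves a higher recognition rate than the WNN-based cascaded generative model; (3) the serially fused D-dimensional GMM-MAP/SVM model (MAP, maximum a posteriori) with M=13 is the best cascaded "generative/discriminative" hybrid model, reaching the highest recognition rate of 85.1%.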

18.

Structural design of hidden Markov model speech recognizer using multivalued phonetic features: comparison with segmental speech units.

L Deng K Erler 《The Journal of the Acoustical Society of America》1992,92(6):3058-3067

19.

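Prominence analysis and synthesis of Chinese stress

Meng Fanbo, Wu Zhiyong, Jia Jia, Cai Lianhong 《Acta Acustica (声学学报)》2015,40(1):1-11

Stress is an important intonational feature, and stress synthesis can improve the naturalness and expressiveness of speech. Targeting the local prominence of stress, this paper proposes a representation of acoustic-feature prominence and analyzes the acoustic-feature prominence of stressed syllables at different prosodic positions (initial, medial, and final in a prosodic word; initial, medial, and final in a prosodic phrase; and so on). It finds that for stress at the end of a prosodic unit (the final syllable of a prosodic word and the final prosodic word of a prosodic phrase), the prominence of the maximum fundamental frequency is lower than for stress elsewhere. A nonlinear algorithm for generating stress acoustic parameters based on acoustic-feature prominence is proposed, which solves the problem that traditional linear modification algorithms change the parameters too little or too much. Using this algorithm, a hidden Markov model based speech synthesis system that supports stress synthesis is built. Experiments show that the system can effectively synthesize stressed speech and improves the naturalness and expressiveness of the synthesized speech.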

20.

An Improvement to Conformer-Based Model for High-Accuracy Speech Feature Extraction and Learning

Mengzhuo Liu Yangjie Wei 《Entropy (Basel, Switzerland)》2022,24(7)

Owing to the loss of effective information and incomplete feature extraction caused by the convolution and pooling operations in a convolution subsampling network, the accuracy and speed of current speech processing architectures based on the conformer model are influenced because the shallow features of speech signals are not completely extracted. To solve these problems, in this study, we researched a method that used a capsule network to improve the accuracy of feature extraction in a conformer-based model, and then, we proposed a new end-to-end model architecture for speech recognition. First, to improve the accuracy of speech feature extraction, a capsule network with a dynamic routing mechanism was introduced into the conformer model; thus, the structural information in speech was preserved, and it was input to the conformer blocks via sequestered vectors; the learning ability of the conformer-based model was significantly enhanced using dynamic weight updating. Second, a residual network was added to the capsule blocks; thus, the mapping ability of our model was improved and the training difficulty was reduced. Furthermore, the bi-transformer model was adopted in the decoding network to promote the consistency of the hypotheses in different directions through bidirectional modeling. Finally, the effectiveness and robustness of the proposed model were verified against different types of recognition models by performing multiple sets of experiments. The experimental results demonstrated that our speech recognition model achieved a lower word error rate without a language model because of the higher accuracy of speech feature extraction and learning using our model architecture with a capsule network. Furthermore, our model architecture benefited from the advantage of the capsule network and the conformer encoder, and also has potential for other speech-related applications.