Similar Documents
Found 20 similar documents (search time: 390 ms)
1.
周璐璐  邓江洪 《应用声学》2014,22(10):3267-3269,3273
To address the low recognition rate of intelligent robots in speaker-independent speech recognition, a dual-threshold endpoint detection algorithm is proposed that accurately locates speech endpoints. Combining fractal dimension with Mel-frequency cepstral coefficients (MFCC), and building on a hidden Markov model (HMM), a command recognition system for intelligent robots is presented. In a laboratory environment, speech from 5 male and 5 female speakers was recorded with Cool Edit software at an 8 kHz sampling rate and 16-bit precision. The material consisted of 5 command words, each recorded 6 times per speaker; each speaker's first 3 utterances served as templates and the last 3 as test speech. Experimental results show a recognition rate above 85%; the feature algorithm combining MFCC with fractal dimension raises the recognition rate and improves system performance. The method is feasible and effective for speaker-independent intelligent speech recognition.
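The abstract above does not give the detection details, so the following is a minimal sketch of the dual-threshold idea on short-time energy alone (a full version would also use zero-crossing rate): a frame must cross a high threshold to confirm speech, and the detected segment is then extended outward while energy stays above a low threshold. The frame length and threshold ratios here are illustrative assumptions, not values from the paper.

```python
import numpy as np

def endpoint_detect(signal, frame_len=200, high_ratio=0.5, low_ratio=0.1):
    """Dual-threshold endpoint detection on short-time energy.

    A frame must exceed the high threshold to confirm speech; the segment
    is then extended outward while energy stays above the low threshold.
    Thresholds are set relative to peak frame energy (an illustrative choice).
    """
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)
    high, low = high_ratio * energy.max(), low_ratio * energy.max()

    above = np.flatnonzero(energy > high)
    if above.size == 0:
        return None  # no speech found
    start, end = above[0], above[-1]
    while start > 0 and energy[start - 1] > low:
        start -= 1
    while end < n_frames - 1 and energy[end + 1] > low:
        end += 1
    return start * frame_len, (end + 1) * frame_len  # sample indices

# synthetic check: silence, a tone burst, silence
fs = 8000
sig = np.zeros(fs)
t = np.arange(fs // 4) / fs
sig[3000:3000 + fs // 4] = np.sin(2 * np.pi * 440 * t)
print(endpoint_detect(sig))
```

On the synthetic signal the detected span snaps to the frame grid around the tone burst; real speech would need the low threshold tuned to the noise floor rather than to peak energy.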

2.
A Real-Time Recognition System for All Syllables of Isolated Mandarin Characters   Cited by: 1 (self: 0, other: 1)
Based on extensive speech experiments, this paper explores methods for Mandarin speech recognition and presents a speaker-dependent real-time recognition system for all Mandarin syllables, built on an IBM PC/AT equipped with a self-developed TMS320C25-E high-speed signal processing board. Exploiting the phonetic characteristics of Mandarin, the system adopts a hierarchical recognition strategy, with a total response time under 0.2 s. Rigorous testing with 4 passes over 1240 full-syllable utterances shows an average tone recognition accuracy of about 99%; the cumulative syllable recognition rates for the top 5 candidates are 82%, 91%, 94%, 96%, and 97%, respectively. From these results, confusion matrices for initials and finals and a similarity clustering dendrogram based on Shepard's method were constructed. Comparing these against intelligibility tests of Mandarin speech synthesis and cluster analyses of Mandarin speech perception, each component of the system is analyzed in depth and corresponding improvements are proposed.

3.
A Pattern Recognition Method for the Four Tones of Isolated Mandarin Characters   Cited by: 4 (self: 0, other: 4)
Tone recognition of isolated Mandarin characters is an important task in Mandarin speech recognition. This paper proposes a new pattern recognition algorithm for the four Mandarin tones. Based on extensive statistical experiments, four parameters are defined to describe the pitch (F0) contour. Assuming these parameters follow a multivariate normal distribution (an assumption the statistical experiments support), a distance between the parameter vector and each tone class is derived from the minimum-error-probability criterion, achieving statistically optimal recognition. Speaker-independent tone recognition experiments show the algorithm performs very satisfactorily.
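Under equal priors, the minimum-error-probability rule for Gaussian classes reduces to minimizing a Mahalanobis-type distance plus a log-determinant term, which is presumably the form the paper derives. The sketch below illustrates that decision rule; the two contour parameters (slope, range) and the class means/covariances are invented for illustration, not the paper's four measured parameters.

```python
import numpy as np

def gaussian_distance(x, mean, cov):
    """Distance to a tone class under a multivariate normal model.

    Minimizing this quantity is equivalent to maximizing the class
    log-likelihood, i.e., the minimum-error-probability decision rule
    under equal priors.
    """
    inv = np.linalg.inv(cov)
    diff = x - mean
    return diff @ inv @ diff + np.log(np.linalg.det(cov))

def classify(x, classes):
    return min(classes, key=lambda c: gaussian_distance(x, *classes[c]))

# hypothetical 2-parameter pitch-contour models (slope, range)
classes = {
    "tone1": (np.array([0.0, 0.1]), np.eye(2) * 0.05),
    "tone2": (np.array([1.0, 0.8]), np.eye(2) * 0.05),
    "tone4": (np.array([-1.2, 1.0]), np.eye(2) * 0.05),
}
print(classify(np.array([0.9, 0.7]), classes))
```

With shared spherical covariances this collapses to nearest-mean classification; the full-covariance form matters when the contour parameters are correlated, as F0 features typically are.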

4.
Speaker-Independent Recognition of the Four Mandarin Tones   Cited by: 5 (self: 0, other: 5)
This paper presents a reliable method for speaker-independent recognition of the four Mandarin tones. Pitch periods are detected with a center-clipped, unbiased autocorrelation method; after data selection, error correction, smoothing, and curve fitting of the pitch-period sequence, a two-dimensional decision vector is obtained for tone classification. The tone recognition rate for monosyllabic Mandarin utterances exceeds 98%.
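Center clipping zeroes low-amplitude samples before autocorrelation, suppressing formant structure so the main off-zero peak reflects the pitch period. The sketch below shows that core step on a synthetic periodic frame; the clipping ratio and minimum-lag bound are illustrative assumptions (the paper's subsequent smoothing and fitting stages are omitted).

```python
import numpy as np

def pitch_period(frame, clip_ratio=0.3, min_lag=20):
    """Pitch-period estimate via center-clipped autocorrelation.

    Samples within +/- clip_ratio * peak are zeroed; the lag of the
    largest autocorrelation peak beyond min_lag is taken as the period.
    """
    c = clip_ratio * np.max(np.abs(frame))
    clipped = np.where(frame > c, frame - c,
                       np.where(frame < -c, frame + c, 0.0))
    ac = np.correlate(clipped, clipped, mode="full")[len(clipped) - 1:]
    return min_lag + int(np.argmax(ac[min_lag:]))

fs = 8000
t = np.arange(800) / fs
# 100 Hz fundamental plus a weaker second harmonic -> period of 80 samples
frame = np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 200 * t)
print(pitch_period(frame))
```

For real speech the min_lag/max-lag bounds would be set from the plausible F0 range, and the resulting period track fed to the error-correction and smoothing stages the abstract describes.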

5.
A New Inter-Syllable Context-Dependent Recognition Unit for Continuous Mandarin Speech Recognition   Cited by: 1 (self: 0, other: 1)
Accounting for coarticulation in continuous Mandarin speech is essential for improving recognition performance. Targeting the characteristics of Mandarin, this paper proposes a new recognition unit that refines the acoustic model by capturing inter-syllable coarticulation in continuous speech recognition. Inter-syllable contextual influences are then classified using phonetic knowledge, enabling state-parameter sharing across units, which reduces model complexity while keeping the models trainable. The key difference from traditional approaches is that clustering here is driven entirely by phonetic knowledge, whereas traditional methods use data-driven clustering. Recognition experiments show that inter-syllable context-dependent units based on phonetic classification clearly improve performance, reducing the top-candidate error rate by 17%.

6.
景春进  陈东东  周琳琦 《应用声学》2014,22(8):2571-2573
Targeting the characteristics of naval command training systems, a method is proposed that uses speech recognition to improve training efficiency. The linguistic features of shipborne command phrases are first analyzed; problems in continuous Mandarin speech recognition on the Sphinx platform are then studied, including acoustic model training, language model training, and the recognition engine. Finally, a speaker-independent continuous Mandarin recognition system with a medium-sized specialized vocabulary is designed and implemented. Experiments with a set of digits and specialized terms show that acoustic model training substantially improves the recognition rate. The method offers guidance for raising the automation level of naval command training systems.

7.
This paper proposes normalizing each speaker's features with speaker-specific perceptual warping factors during parameter extraction, thereby achieving speaker normalization in speaker-independent recognition. The perceptual warping factor is estimated from the subglottal resonance frequency arising from coupling between the supraglottal and subglottal systems; compared with using the third vocal-tract formant as the reference frequency, it filters out more semantic information and better captures speaker individuality. Perceptual minimum-variance distortionless response features, whose noise robustness exceeds that of Mel-cepstral parameters, serve as recognition features, with a classical hidden Markov model (HMM) as the acoustic model. Experiments show that, compared with conventional features and with spectral warping based on the third formant, the word error rate drops by 4% and 3% respectively on clean speech, and by 9% and 5% in noise, effectively improving speaker-independent recognition performance.

8.
A generalized model is proposed that unifies dynamic time warping (DTW) and the hidden Markov model (HMM) within a single acoustic modeling framework. Analysis shows that the generalized model better matches the realities of speech while requiring very little storage. A recognizer for all Mandarin syllables was built on the generalized model; comparative experiments against discrete HMM and DTW show that for speaker-dependent recognition the generalized model performs on par with DTW and above discrete HMM, while for speaker-independent recognition it outperforms both.
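As background for the DTW half of this comparison, the following is a textbook dynamic-programming sketch of the DTW distance between two feature sequences (not the paper's generalized model itself). The toy sequences are invented to show that warping absorbs a locally repeated frame at zero cost.

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping distance between two feature sequences.

    D[i, j] holds the minimal accumulated cost of aligning a[:i] with b[:j],
    using the standard (diagonal / vertical / horizontal) step pattern.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([[0.0], [1.0], [1.0], [2.0], [3.0]])  # one repeated frame
print(dtw(x, y))
```

A discrete HMM replaces this template-to-template alignment with alignment against state output distributions; the generalized model the abstract describes is meant to interpolate between the two views.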

9.
Research on Mixed Bilingual Speech Recognition   Cited by: 1 (self: 0, other: 1)
With the globalization of information, bilingual and multilingual code-mixing is increasingly common, and bilingual and multilingual speech recognition has accordingly become a hot research topic. Mixed bilingual recognition faces two main problems: controlling system complexity while maintaining recognition accuracy for both languages, and handling the non-native accent that the matrix language induces in embedded-language segments. To handle code-mixing and reduce the data required for statistical modeling, a unified bilingual recognizer is built through cross-lingual phone clustering. A new two-pass phone clustering algorithm based on the confusion matrix is proposed and compared with traditional clustering based on acoustic-likelihood criteria. To address the lower recognition accuracy of non-native speech, a new bilingual model adaptation algorithm is proposed. Experiments show that the resulting Chinese-English bilingual system recognizes both languages simultaneously while keeping model size under control, and maintains recognition performance on both monolingual and mixed-language speech.

10.
阴法明  赵焱  赵力 《应用声学》2019,38(1):39-44
To raise phoneme recognition accuracy in continuous speech recognition, a phoneme recognition algorithm is proposed based on restricted Boltzmann machines (RBMs) trained with improved parallel tempering. First, RBMs are trained with an improved parallel tempering algorithm using equal-energy partitioning; the RBMs are then stacked into a deep belief network that serves as the pretrained base of a deep neural network, whose softmax output layer yields posterior probabilities of phoneme states. A small amount of labeled data is then used to fine-tune the network weights via backpropagation. Finally, the posteriors serve as HMM emission probabilities and a Viterbi decoder performs phoneme recognition. Experiments on the TIMIT corpus show the recognition rate improves by about 4.5% over traditional contrastive-divergence algorithms, and by about 1% over the original parallel tempering algorithm with no increase in computation.
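For orientation, the following sketches the plain CD-1 update for a Bernoulli RBM, i.e., the baseline gradient approximation that the paper's improved parallel tempering is meant to refine (the tempering chains themselves are not shown). The layer sizes, learning rate, and toy binary patterns are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, a, b, v0, lr=0.1):
    """One contrastive-divergence (CD-1) update for a Bernoulli RBM.

    Positive phase uses the data; negative phase uses a single Gibbs
    reconstruction. Returns the mean squared reconstruction error.
    """
    ph0 = sigmoid(v0 @ W + b)                 # hidden activation probs
    h0 = (rng.random(ph0.shape) < ph0) * 1.0  # sampled hidden states
    pv1 = sigmoid(h0 @ W.T + a)               # reconstructed visibles
    ph1 = sigmoid(pv1 @ W + b)
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    a += lr * (v0 - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return float(((v0 - pv1) ** 2).mean())

n_vis, n_hid = 6, 4
W = rng.normal(0, 0.01, (n_vis, n_hid))
a, b = np.zeros(n_vis), np.zeros(n_hid)
# two toy binary patterns, repeated
data = np.tile(np.array([[1, 1, 1, 0, 0, 0],
                         [0, 0, 0, 1, 1, 1]], dtype=float), (10, 1))
errs = [cd1_step(W, a, b, data) for _ in range(50)]
print(f"reconstruction error: {errs[0]:.3f} -> {errs[-1]:.3f}")
```

In a DBN pipeline, each trained RBM's hidden activations become the visible data for the next RBM, and the stacked weights initialize the DNN that backpropagation then fine-tunes.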

11.
We present the results of a large-scale study on speech perception, assessing the number and type of perceptual hypotheses which listeners entertain about possible phoneme sequences in their language. Dutch listeners were asked to identify gated fragments of all 1179 diphones of Dutch, providing a total of 488,520 phoneme categorizations. The results manifest orderly uptake of acoustic information in the signal. Differences across phonemes in the rate at which fully correct recognition was achieved arose as a result of whether or not potential confusions could occur with other phonemes of the language (long with short vowels, affricates with their initial components, etc.). These data can be used to improve models of how acoustic-phonetic information is mapped onto the mental lexicon during speech comprehension.

12.
Recent studies have shown that synthesized versions of American English vowels are less accurately identified when the natural time-varying spectral changes are eliminated by holding the formant frequencies constant over the duration of the vowel. A limitation of these experiments has been that vowels produced by formant synthesis are generally less accurately identified than the natural vowels after which they are modeled. To overcome this limitation, a high-quality speech analysis-synthesis system (STRAIGHT) was used to synthesize versions of 12 American English vowels spoken by adults and children. Vowels synthesized with STRAIGHT were identified as accurately as the natural versions, in contrast with previous results from our laboratory showing identification rates 9%-12% lower for the same vowels synthesized using the cascade formant model. Consistent with earlier studies, identification accuracy was not reduced when the fundamental frequency was held constant across the vowel. However, elimination of time-varying changes in the spectral envelope using STRAIGHT led to a greater reduction in accuracy (23%) than was previously found with cascade formant synthesis (11%). A statistical pattern recognition model, applied to acoustic measurements of the natural and synthesized vowels, predicted both the higher identification accuracy for vowels synthesized using STRAIGHT compared to formant synthesis, and the greater effects of holding the formant frequencies constant over time with STRAIGHT synthesis. Taken together, the experiment and modeling results suggest that formant estimation errors and incorrect rendering of spectral and temporal cues by cascade formant synthesis contribute to lower identification accuracy and underestimation of the role of time-varying spectral change in vowels.

13.
A controversial issue in neurolinguistics is whether basic neural auditory representations found in many animals can account for human perception of speech. This question was addressed by examining how a population of neurons in the primary auditory cortex (A1) of the naive awake ferret encodes phonemes and whether this representation could account for the human ability to discriminate them. When neural responses were characterized and ordered by spectral tuning and dynamics, perceptually significant features including formant patterns in vowels and place and manner of articulation in consonants, were readily visualized by activity in distinct neural subpopulations. Furthermore, these responses faithfully encoded the similarity between the acoustic features of these phonemes. A simple classifier trained on the neural representation was able to simulate human phoneme confusion when tested with novel exemplars. These results suggest that A1 responses are sufficiently rich to encode and discriminate phoneme classes and that humans and animals may build upon the same general acoustic representations to learn boundaries for categorical and robust sound classification.

14.
Auditory-perceptual interpretation of the vowel   Cited by: 1 (self: 0, other: 1)

15.
Icelandic has a phonologic contrast of quantity, distinguishing long and short vowels and consonants. Perceptual studies have shown that a major cue for quantity in perception is relational, involving the vowel-to-rhyme ratio. This cue is approximately invariant under transformations of rate, thus yielding a higher-order invariant for the perception of quantity in Icelandic. Recently it has, however, been shown that vowel spectra can also influence the perception of quantity. This holds for vowels which have different spectra in their long and short varieties. This finding raises the question of whether the durational contrast is less well articulated in those cases where vowel spectra provide another cue for quantity. To test this possibility, production measurements were carried out on vowels and consonants in words which were spoken by a number of speakers at different utterance rates in two experiments. A simple neural network was then trained on the production measurements. Using the network to classify the training stimuli shows that the durational distinctions between long and short phonemes are as clearly articulated whether or not there is a secondary, spectral, cue to quantity.
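The rate invariance of the vowel-to-rhyme ratio follows directly from its definition: a uniform change of speaking rate scales vowel and coda durations by the same factor, which cancels in the ratio. A tiny sketch, with invented durations rather than measured Icelandic data:

```python
def vowel_to_rhyme_ratio(vowel_ms, coda_ms):
    """Relational quantity cue: vowel duration over rhyme (vowel + coda) duration.

    A uniform rate change multiplies both durations by the same factor,
    so the ratio is (approximately) rate-invariant.
    """
    return vowel_ms / (vowel_ms + coda_ms)

# illustrative durations for the same hypothetical long-vowel word
slow = vowel_to_rhyme_ratio(180, 90)  # slow speaking rate
fast = vowel_to_rhyme_ratio(120, 60)  # same word, 1.5x faster
print(slow, fast)
```

In practice the invariance is only approximate, since vowels and consonants do not compress by exactly the same factor under rate change, which is part of what the production measurements in the study address.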

16.
Human listeners are better able to identify two simultaneous vowels if the fundamental frequencies of the vowels are different. A computational model is presented which, for the first time, is able to simulate this phenomenon at least qualitatively. The first stage of the model is based upon a bank of bandpass filters and inner hair-cell simulators that simulate approximately the most relevant characteristics of the human auditory periphery. The output of each filter/hair-cell channel is then autocorrelated to extract pitch and timbre information. The pooled autocorrelation function (ACF) based on all channels is used to derive a pitch estimate for one of the component vowels from a signal composed of two vowels. Individual channel ACFs showing a pitch peak at this value are combined and used to identify the first vowel using a template matching procedure. The ACFs in the remaining channels are then combined and used to identify the second vowel. Model recognition performance shows a rapid improvement in correct vowel identification as the difference between the fundamental frequencies of two simultaneous vowels increases from zero to one semitone in a manner closely resembling human performance. As this difference increases up to four semitones, performance improves further only slowly, if at all.
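The pitch-extraction stage of such a model can be illustrated in a single broadband channel (the full model pools ACFs across a filterbank): the lag of the ACF peak inside the plausible pitch range gives the period of the dominant voice. The sampling rate, search range, and harmonic-complex stimulus below are illustrative assumptions.

```python
import numpy as np

def acf_pitch(x, fs, fmin=60.0, fmax=400.0):
    """F0 estimate from the autocorrelation function (ACF).

    A single-channel simplification of the pooled-ACF stage: the lag of
    the ACF maximum within the [fmin, fmax] pitch range is taken as the
    period of the dominant voice.
    """
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

fs = 16000
t = np.arange(int(0.05 * fs)) / fs
# harmonic complex at 125 Hz: a crude stand-in for a voiced vowel
vowel = sum(np.sin(2 * np.pi * 125 * k * t) / k for k in range(1, 6))
print(acf_pitch(vowel, fs))
```

In the two-vowel case, the model then selects the channels whose individual ACFs peak at this lag for the first template match and identifies the second vowel from the remaining channels.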

17.
There is a significant body of research examining the intelligibility of sinusoidal replicas of natural speech. Discussion has followed about what the sinewave speech phenomenon might imply about the mechanisms underlying phonetic recognition. However, most of this work has been conducted using sentence material, making it unclear what the contributions are of listeners' use of linguistic constraints versus lower level phonetic mechanisms. This study was designed to measure vowel intelligibility using sinusoidal replicas of naturally spoken vowels. The sinusoidal signals were modeled after 300 /hVd/ syllables spoken by men, women, and children. Students enrolled in an introductory phonetics course served as listeners. Recognition rates for the sinusoidal vowels averaged 55%, which is much lower than the ~95% intelligibility of the original signals. Attempts to improve performance using three different training methods met with modest success, with post-training recognition rates rising by ~5-11 percentage points. Follow-up work showed that more extensive training produced further improvements, with performance leveling off at ~73%-74%. Finally, modeling work showed that a fairly simple pattern-matching algorithm trained on naturally spoken vowels classified sinewave vowels with 78.3% accuracy, showing that the sinewave speech phenomenon does not necessarily rule out template matching as a mechanism underlying phonetic recognition.
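The template-matching idea the study tests can be reduced to its simplest form: label a vowel token with the nearest stored template in formant space. The (F1, F2) values below are rough textbook-style averages for illustration, not the study's measurements, and a realistic matcher would use more features and per-class variances.

```python
import numpy as np

# crude illustrative (F1, F2) templates in Hz for three American English vowels
templates = {
    "i": (270, 2290),
    "a": (730, 1090),
    "u": (300, 870),
}

def classify_vowel(f1, f2):
    """Nearest-template vowel classification in (F1, F2) space.

    A minimal version of pattern matching as a candidate mechanism for
    phonetic recognition: the token gets the label of the closest template.
    """
    x = np.array([f1, f2], dtype=float)
    return min(templates,
               key=lambda v: np.linalg.norm(x - np.array(templates[v])))

print(classify_vowel(700, 1150))
```

For sinewave vowels, the inputs would be the frequencies of the first two sinusoids standing in for F1 and F2, which is why a matcher trained on natural formants can transfer to sinewave replicas at all.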

18.
Information transfer analysis [G. A. Miller and P. E. Nicely, J. Acoust. Soc. Am. 27, 338-352 (1955)] is a tool used to measure the extent to which speech features are transmitted to a listener, e.g., duration or formant frequencies for vowels; voicing, place and manner of articulation for consonants. An information transfer of 100% occurs when no confusions arise between phonemes belonging to different feature categories, e.g., between voiced and voiceless consonants. Conversely, an information transfer of 0% occurs when performance is purely random. As asserted by Miller and Nicely, the maximum-likelihood estimate for information transfer is biased to overestimate its true value when the number of stimulus presentations is small. This small-sample bias is examined here for three cases: a model of random performance with pseudorandom data, a data set drawn from Miller and Nicely, and reported data from three studies of speech perception by hearing impaired listeners. The amount of overestimation can be substantial, depending on the number of samples, the size of the confusion matrix analyzed, as well as the manner in which data are partitioned therein.
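Relative information transfer is the mutual information between stimulus and response divided by the stimulus entropy, estimated from the confusion-matrix counts. A sketch of that maximum-likelihood estimate, verified on the two boundary cases the abstract describes (the toy 4x4 matrices are invented):

```python
import numpy as np

def information_transfer(confusions):
    """Relative information transfer from a confusion-matrix of counts.

    T = I(stimulus; response) / H(stimulus): 1.0 when responses uniquely
    determine the stimulus category, 0.0 when they are independent of it.
    This plug-in estimate is the one shown to be biased upward for small
    sample counts.
    """
    p = confusions / confusions.sum()
    px = p.sum(axis=1)  # stimulus probabilities
    py = p.sum(axis=0)  # response probabilities
    mask = p > 0
    mi = (p[mask] * np.log2(p[mask] / np.outer(px, py)[mask])).sum()
    hx = -(px[px > 0] * np.log2(px[px > 0])).sum()
    return mi / hx

perfect = np.eye(4) * 25            # 25 trials per stimulus, never confused
random_ = np.full((4, 4), 25 / 4)   # responses independent of the stimulus
print(information_transfer(perfect), information_transfer(random_))
```

The small-sample bias arises because sampling noise in a finite count matrix mimics stimulus-response dependence, so the plug-in estimate of a truly random matrix sits above 0 on average, more so for larger matrices and fewer trials.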

19.
Shuiyuan Yu  Chunshan Xu 《Physica A》2011,390(7):1370-1380
The study of properties of speech sound systems is of great significance in understanding the human cognitive mechanism and the working principles of speech sound systems. Some properties of speech sound systems, such as the listener-oriented feature and the talker-oriented feature, have been unveiled with the statistical study of phonemes in human languages and the research of the interrelations between human articulatory gestures and the corresponding acoustic parameters. With all the phonemes of speech sound systems treated as a coherent whole, our research, which focuses on the dynamic properties of speech sound systems in operation, investigates some statistical parameters of Chinese phoneme networks based on real text and dictionaries. The findings are as follows: phonemic networks have high connectivity degrees and short average distances; the degrees obey normal distribution and the weighted degrees obey power law distribution; vowels enjoy higher priority than consonants in the actual operation of speech sound systems; the phonemic networks have high robustness against targeted attacks and random errors. In addition, for investigating the structural properties of a speech sound system, a statistical study of dictionaries is conducted, which shows the higher frequency of shorter words and syllables and the tendency that the longer a word is, the shorter the syllables composing it are. From these structural properties and dynamic properties one can derive the following conclusion: the static structure of a speech sound system tends to promote communication efficiency and save articulation effort while the dynamic operation of this system gives preference to reliable transmission and easy recognition. In short, a speech sound system is an effective, efficient and reliable communication system optimized in many aspects.

20.
A method is proposed to reduce the ambiguity of vowels in connected speech by normalizing the coarticulation effects. The method is applied to vowels in phonetic environments where great ambiguity would be likely to occur, taking as their features the first and second formant trajectories. The separability between vowel clusters is found to be greatly improved for the vowel samples. In addition, distribution of the vowels on a feature plane characterized by this method seems to reflect their perceptual nature when presented to listeners without isolation from their phonetic environments. The results suggest that the method proposed here is useful for automatic speech recognition and helps infer some possible mechanisms underlying dynamic aspects of human speech recognition.
