共查询到19条相似文献,搜索用时 109 毫秒
1.
2.
3.
基于连续高斯混合密度HMM的汉语全音节语音识别研究 总被引:5,自引:0,他引:5
本文在大量语音分析实验的基础上,对HMM用于汉语全音节语音识别进行了较为深入的探讨,建立了一个连续高斯混合密度HMM的汉语全音节语音识别系统.该系统在训练算法上撇开了传统的Baum-Welch算法,代之以计算复杂度小、存储量小、迭代次数少且具有自动分割效应的分段K平均算法。对于HMM的模型单元的选择,单元的结构以及模型参数的选取,充分考虑了汉语语音的特点;并在语音特征上做了深入的实验分析工作,采用了符合人耳听觉特性的Mel-Scaled参数,用FFT倒谱代替了LPC倒谱,同时利用了语音的动态谱特征和能量特征。另外,本文还针对汉语声母的特点,独特地提出了变帧移分析策略。整个识别系统的首选正识率为91.1%. 相似文献
4.
5.
语音识别赋予了计算机能够识别出语音内容的功能,是人机交互技术领域的重要研究内容。随着计算机技术的发展,语音识别已经得到了成熟的发展。但是关于方言的语音识别还有很大的发展空间。中国是一个幅员辽阔、人口众多的国家,因此方言种类繁多,其中有3000多万人交流使用的重庆方言就是其中之一。采集了重庆方言的部分词语的文本文件和对应的语音文件建立语料库,根据重庆方言的发音特点,选取重庆方言的声韵母作为声学建模基元,选取隐马尔可夫模型(Hidden Markov Model, HMM)为声学模型设计了一个基于HMM的重庆方言语音识别系统。在训练过程利用语料库中训练集语料对声学模型进行训练,形成HMM模型库;在识别过程利用语料库中的测试集语料进行识别测试。实验结果表明,该系统能够实现重庆方言的语音识别,并且识别的正确率为100%。 相似文献
6.
基于随机轨迹模型的汉语连续语音识别方法研究 总被引:1,自引:0,他引:1
本文在指出隐马尔可夫模型(HMM)不合理假设的基础上,介绍了随机轨迹模型(STM)的理论机制及优越性。随机轨迹模型将语音基元的声学观察表示为参数空间中轨迹的聚类,并将轨迹建模为状态随机序列概率密度函数的混合,该模型可以克服HMM的不合理假设,在理论上更合理。根据STM的特点及汉语语音特色,本文对汉语连续语音识别基元的选取进行了讨论,提出了音素类单元作为识别系统的识别基元。基于STM的汉语连续语音识别的实验结果证明了STM的有效性和音素类单元的一致性。 相似文献
7.
在将耳语音转换为正常音时,为了研究降维后语音特征对耳语音转换的影响,分别对耳语音和正常音谱包络进行自适应编码以提取耳语音和正常音的低维特征,然后使用BP网络建立耳语音和正常音低维谱包络特征之间的映射关系以及正常音基频和耳语音低维谱包络特征之间的关系。转换时,根据耳语音低维谱包络特征获得对应正常音的低维谱包络特征和基频,对低维谱包络特征进行解码后获得对应的正常音谱包络。实验结果表明,采用此方法转换后的语音与正常音之间的倒谱距离相比高斯混合模型方法下降了10%,转换后语音的自然度和可懂度都有所提高。 相似文献
8.
基于发音特征的汉语普通话语音声学建模 总被引:3,自引:0,他引:3
将表征汉语普通话语音特点的发音特征引入汉语普通话语音识别的声学建模中,根据普通话发音特点,确定了用于区别普通话元音、辅音以及声调信息的9种发音特征,并以此为目标值训练神经网络得到语音信号属于各类发音特征的后验概率,将此概率作为语音识别的输入特征建立声学模型。在汉语普通话非特定人大词表自然口语对话识别系统中进行了实验验证,并与基于频谱特征的声学模型进行了比较,在相同解码速度下,由此方法建立的声学模型汉字错误率相对下降6.8%;将发音特征和频谱特征进行了融合实验,融合以后的识别系统相对基于频谱特征系统的汉字错误率相对下降10.1%。上述结果表明,基于发音特征的声学模型更加有效的实现了对语音特性的表征,通过利用发音特征和频谱特征的互补性,能够进一步实现对语音识别性能的提高。 相似文献
9.
针对传统多数据流语音识别方法不考虑数据流内各特征分量受噪声影响差异的缺点,提出了一种基于特征分量输出概率加权的数据流结合新方法,分析了特征分量输出概率加权对识别的影响,并结合丢失数据技术中的边缘化(Marginalisation)模型和软判决(Soft decision)模型给出了两种具体的数据流结合方案.将所提数据流结合方案应用到复合子带语音识别系统中,实验结果表明,所提识别方法可以根据噪声环境的不同自适应地调整数据流对识别影响的大小,其性能显著优于传统的多数据流识别方法. 相似文献
10.
11.
Study on the acoustical characteristic is important to speech and speaker recognition in Chinese whispered speech. In this paper, the characteristics of whispered speech are introduced and the acoustical characteristics in Chinese whispered speech are discussed. There is no fundamental frequency in the whispered speech, so other characteristics such as the duration and frequency of formant are extracted and analyzed. From experiments with six simple Chinese whispered vowels, it is proved that the duration and the frequency of formant can be used as the main acoustical characteristics in the Chinese whispered recognition. 相似文献
12.
I.IntroductionRecentlytherearemanykindsofsystemsandproductsforspeechrecognition,butalmostallofthemareworkinginquietenvironment,theperformancearedegradedorevencan'tworkwhenitisoperatedinhighnoisyenvironmentssuchasincockpits,vehicle,workshopsetc.SonoiserobustnesshasbecomeoneofthemainobstaclesfortherealaPplicationsoftheautomaticspeechrecognizersanditattractstheattentionofresearchersinspeechtechnologyareas.Since1978,substantialeffortshavebeendevotedtotestandevaluatethespeechrecognizersusedinfight… 相似文献
13.
针对深度神经网络与隐马尔可夫模型(DNN-HMM)结合的声学模型在语音识别过程中建模能力有限等问题,提出了一种改进的DNN-HMM模型语音识别算法。首先根据深度置信网络(DBN)结合深度玻尔兹曼机(DBM),建立深度神经网络声学模型,然后提取梅尔频率倒谱系数(MFCC)和对数域的Mel滤波器组系数(Fbank)作为声学特征参数,通过TIMIT语音数据集进行实验。实验结果表明:结合了DBM的DNN-HMM模型相比DNN-HMM模型更具优势,其中,使用MFCC声学特征在词错误率与句错误率方面分别下降了1.26%和0.20%。此外,使用默认滤波器组的Fbank特征在词错误率与句错误率方面分别下降了0.48%和0.82%,并且适量增加滤波器组可以降低错误率。总之,研究取得句错误率与词错误率分别降低到21.06%和3.12%的好成绩。 相似文献
14.
The study of properties of speech sound systems is of great significance in understanding the human cognitive mechanism and the working principles of speech sound systems. Some properties of speech sound systems, such as the listener-oriented feature and the talker-oriented feature, have been unveiled with the statistical study of phonemes in human languages and the research of the interrelations between human articulatory gestures and the corresponding acoustic parameters. With all the phonemes of speech sound systems treated as a coherent whole, our research, which focuses on the dynamic properties of speech sound systems in operation, investigates some statistical parameters of Chinese phoneme networks based on real text and dictionaries. The findings are as follows: phonemic networks have high connectivity degrees and short average distances; the degrees obey normal distribution and the weighted degrees obey power law distribution; vowels enjoy higher priority than consonants in the actual operation of speech sound systems; the phonemic networks have high robustness against targeted attacks and random errors. In addition, for investigating the structural properties of a speech sound system, a statistical study of dictionaries is conducted, which shows the higher frequency of shorter words and syllables and the tendency that the longer a word is, the shorter the syllables composing it are. From these structural properties and dynamic properties one can derive the following conclusion: the static structure of a speech sound system tends to promote communication efficiency and save articulation effort while the dynamic operation of this system gives preference to reliable transmission and easy recognition. In short, a speech sound system is an effective, efficient and reliable communication system optimized in many aspects. 相似文献
15.
In order to increase short time whispered speaker recognition rate in variable channel conditions,the hybrid compensation in model and feature domains was proposed.This method is based on joint factor analysis in training model stage.It extracts speaker factor and eliminates channel factor by estimating training speech speaker and channel spaces.Then in the test stage,the test speech channel factor is projected into feature space to engage in feature compensation,so it can remove channel information both in model and feature domains in order to improve recognition rate.The experiment result shows that the hybrid compensation can obtain the similar recognition rate in the three different training channel conditions and this method is more effective than joint factor analysis in the test of short whispered speech. 相似文献
16.
当前基于深度神经网络模型中,虽然其隐含层可设置多层,对复杂问题适应能力强,但每层之间的节点连接是相互独立的,这种结构特性导致了在语音序列中无法利用上下文相关信息来提高识别效果,而传统的循环神经网络虽然做出了改进,但是只能对上文信息进行利用。针对以上问题,该文采用可以同时利用语音序列中上下文相关信息的双向循环神经网络模型与深度神经网络模型相结合,并应用于语音识别。构建具有5层隐含层的模型,其中第3层为双向循环神经网络结构,其他层采用深度神经网络结构。实验结果表明:加入了双向循环神经网络结构的模型与其他模型相比,较好地提高了识别正确率;噪声对双向循环神经网络汉语识别有重要影响,尤其是训练集和测试集附加噪声类型不同时,单一的含噪声语音的训练模型无法适应不同噪声类型的语音识别;调整神经网络模型中隐含层神经元数量后,识别正确率并不是一直随着隐含层中神经元数量的增加而增加,神经元数量数目增加到一定程度后正确率出现了降低的趋势。 相似文献
17.
Li Y Zhang G Kang HY Liu S Han D Fu QJ 《The Journal of the Acoustical Society of America》2011,129(6):EL242-EL247
Cochlear implant (CI) users' speech understanding may be influenced by different speaking styles. In this study, speech recognition was measured in Mandarin-speaking CI and normal-hearing (NH) subjects for sentences produced according to four styles: slow, normal, fast, and whispered. CI subjects were tested using their clinical processors; NH subjects were tested while listening to a four-channel CI simulation. Performance gradually worsened with increasing speaking rate and was much poorer with whispered speech. CI performance was generally similar to NH performance with the four-channel simulation. Results suggest that some speaking styles, especially whispering, may negatively affect Mandarin-speaking CI users' speech understanding. 相似文献
18.
Information about the acoustic properties of a talker's voice is available in optical displays of speech, and vice versa, as evidenced by perceivers' ability to match faces and voices based on vocal identity. The present investigation used point-light displays (PLDs) of visual speech and sinewave replicas of auditory speech in a cross-modal matching task to assess perceivers' ability to match faces and voices under conditions when only isolated kinematic information about vocal tract articulation was available. These stimuli were also used in a word recognition experiment under auditory-alone and audiovisual conditions. The results showed that isolated kinematic displays provide enough information to match the source of an utterance across sensory modalities. Furthermore, isolated kinematic displays can be integrated to yield better word recognition performance under audiovisual conditions than under auditory-alone conditions. The results are discussed in terms of their implications for describing the nature of speech information and current theories of speech perception and spoken word recognition. 相似文献
19.
Inspired by recent evidence that a binary pattern may provide sufficient information for human speech recognition, this letter proposes a fundamentally different approach to robust automatic speech recognition. Specifically, recognition is performed by classifying binary masks corresponding to a word utterance. The proposed method is evaluated using a subset of the TIDigits corpus to perform isolated digit recognition. Despite dramatic reduction of speech information encoded in a binary mask, the proposed system performs surprisingly well. The system is compared with a traditional HMM based approach and is shown to perform well under low SNR conditions. 相似文献