Similar Documents
20 similar documents found.
1.
Acoustic Modeling of Mandarin Chinese Speech Based on Articulatory Features
Articulatory features that characterize Mandarin Chinese speech are introduced into acoustic modeling for Mandarin speech recognition. Based on the articulatory properties of Mandarin, nine articulatory features are defined to distinguish vowels, consonants, and tone information. Neural networks are trained with these features as targets to obtain the posterior probabilities that the speech signal belongs to each articulatory-feature class, and these probabilities are used as input features for building the acoustic model. The approach was tested in a speaker-independent large-vocabulary spontaneous conversational Mandarin recognition system and compared with an acoustic model based on spectral features: at the same decoding speed, the acoustic model built this way reduced the character error rate by a relative 6.8%. In a fusion experiment combining articulatory and spectral features, the fused system reduced the character error rate by a relative 10.1% compared with the spectral-feature system. These results show that the articulatory-feature-based acoustic model captures the characteristics of speech more effectively, and that exploiting the complementarity of articulatory and spectral features can further improve recognition performance.
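Below is a minimal sketch, in PyTorch, of the feature-extraction stage the abstract describes: a small network trained to output articulatory-feature posteriors, which are then concatenated with spectral features. The frame dimension, network shape, and training loop are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    # Hypothetical sizes: 39-dim spectral frames, 9 articulatory-feature classes.
    N_FEATS, N_AF = 39, 9

    # Small MLP mapping a spectral frame to articulatory-feature posteriors.
    af_net = nn.Sequential(
        nn.Linear(N_FEATS, 256), nn.ReLU(),
        nn.Linear(256, N_AF),
    )

    frames = torch.randn(1000, N_FEATS)          # stand-in spectral frames
    af_labels = torch.randint(0, N_AF, (1000,))  # stand-in articulatory targets

    opt = torch.optim.Adam(af_net.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(10):                          # toy training loop
        opt.zero_grad()
        loss = loss_fn(af_net(frames), af_labels)
        loss.backward()
        opt.step()

    # The posteriors become acoustic-model input features; fusion with the
    # spectral features is shown here as simple concatenation.
    with torch.no_grad():
        af_post = torch.softmax(af_net(frames), dim=-1)
    fused = torch.cat([frames, af_post], dim=-1)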

2.
To address the performance loss caused by causal network inputs in deep-learning-based speech enhancement, a speech enhancement method based on a lightweight convolutional gated recurrent network (LCGRU) is proposed. Gated recurrent networks can model the temporal correlation of speech, but their fully connected structure destroys the time-frequency structure of the signal and carries a large number of parameters, which hinders training. This paper therefore replaces the fully connected structure of the gated recurrent network with convolution kernels, modeling temporal correlation while preserving the time-frequency structure of the speech signal and reducing the parameter count. To make full use of feature information from preceding frames, the unit's input at the current time step fuses the input and output of the previous time step. Against the overfitting that easily arises during training, a linear gating mechanism controls the flow of information, which alleviates overfitting and improves enhancement performance. Experimental results show that the proposed structure outperforms traditional network structures on perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and segmental signal-to-noise ratio (SSNR).
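A minimal sketch of a convolutional GRU cell of the kind the abstract describes, with convolutions over the frequency axis replacing the fully connected gates. Kernel size and channel counts are assumptions, and the paper's previous input/output fusion and linear gating are not reproduced here.

    import torch
    import torch.nn as nn

    class ConvGRUCell(nn.Module):
        """GRU cell whose gates are 1-D convolutions over the frequency axis."""
        def __init__(self, channels: int, kernel: int = 3):
            super().__init__()
            pad = kernel // 2
            # Gates see the current input and previous hidden state stacked on
            # the channel axis; convolution preserves the T-F structure.
            self.zr = nn.Conv1d(2 * channels, 2 * channels, kernel, padding=pad)
            self.hh = nn.Conv1d(2 * channels, channels, kernel, padding=pad)

        def forward(self, x, h):
            zr = torch.sigmoid(self.zr(torch.cat([x, h], dim=1)))
            z, r = zr.chunk(2, dim=1)
            h_tilde = torch.tanh(self.hh(torch.cat([x, r * h], dim=1)))
            return (1 - z) * h + z * h_tilde

    # One time step: batch of 4, 16 channels, 129 frequency bins.
    cell = ConvGRUCell(16)
    x = torch.randn(4, 16, 129)
    h = torch.zeros(4, 16, 129)
    h = cell(x, h)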

3.
王玥  李平  崔杰 《声学学报》2013,38(4):501-508
To strike the best balance between noise suppression and speech distortion, a speech enhancement algorithm is proposed that combines auditory frequency-domain masking with adaptive β-order Bayesian perceptual estimation, with the aim of improving overall enhancement performance. The algorithm exploits the auditory masking effect of the human ear: the β value of the β-order Bayesian perceptual estimator is adapted according to the computed frequency-domain masking threshold, so that noise is suppressed only to just below the masking threshold, retaining more speech information and reducing speech distortion. The algorithm was evaluated both objectively and subjectively and compared with the original SNR-based adaptive β-order Bayesian perceptual estimation algorithm. The results show that at SNRs from -10 dB to 5 dB the composite objective scores of the masking-based method are higher than those of the SNR-based adaptive method. The subjective evaluation likewise shows that the masking-based method suppresses background noise well while retaining as much speech information as possible.
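The β-adaptation step might look like the following sketch, where a hypothetical adapt_beta maps the masking-threshold-to-noise ratio in each frequency bin to a β value. The mapping, its limits, and the stand-in spectra are all illustrative; the paper's actual adaptation rule is not reproduced here.

    import numpy as np

    def adapt_beta(mask_thresh, noise_psd, beta_min=0.5, beta_max=2.0):
        """Hypothetical mapping: where noise already lies below the masking
        threshold, little suppression is needed (small beta); where noise
        exceeds the threshold, suppression is increased (large beta)."""
        ratio = 10 * np.log10(noise_psd / (mask_thresh + 1e-12) + 1e-12)
        t = np.clip((ratio + 10) / 20, 0.0, 1.0)   # map [-10, 10] dB -> [0, 1]
        return beta_min + t * (beta_max - beta_min)

    # Stand-in masking threshold and noise PSD for one frame (129 bins).
    mask = np.abs(np.random.randn(129)) + 0.5
    noise = np.abs(np.random.randn(129))
    beta = adapt_beta(mask, noise)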

4.
The Minimal Temporal Perception Threshold of Mandarin Triphthong Syllables and Its Acoustic Characteristics
祖漪清 《应用声学》1994,12(2):27-34
The experimental material was taken from the speech database of the Institute of Linguistics, Chinese Academy of Social Sciences, which contains recordings of 15 young male speakers, 15 × 15 = 225 triphthong syllables in all. The main purpose of this study is, starting from Mandarin triphthongs and building on statistics over the 15 speakers' material, to measure and study the minimal temporal perception threshold Tlim and thereby identify, from the acoustic and perceptual points of view, which information is indispensable to a triphthong and which is redundant. The experimental results show that the formant movement within Tlim falls into two classes. The first is dynamic characteristics: (a) ΔF1 > 90% and ΔF2 about 50%; (b) Tlim includes at least one of the two turning points of F1 and F3; (c) Tlim includes the portion where F2 changes most rapidly. These observations are consistent across the four triphthongs. The second class is boundary conditions: Tlim is constrained in both position and extent, showing that the formant frequencies at its boundaries are very important.

5.
杨琳  张建平  王迪  颜永红 《声学学报》2009,34(2):151-157
Building on the traditional continuous interleaved sampling (CIS) strategy for cochlear implants, a speech processing algorithm based on fine structure (frequency-modulation information) is proposed that markedly improves speech recognition without introducing excessively high frequency components and while remaining feasible to manufacture. Auditory simulation experiments show that, compared with the traditional envelope-based CIS algorithm, the fine-structure CIS algorithm improves vowel intelligibility by as much as 28%; tone recognition improves by more than 20% under various noise conditions; and under ordinary noise conditions consonant and sentence intelligibility improve by 22.9% and 28.3%, respectively.
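A generic envelope-plus-FM vocoder sketch in the spirit of the fine-structure strategy described above (not the authors' algorithm): each band's envelope modulates a carrier at the band's slowly varying instantaneous frequency rather than at a fixed band-center tone. Band edges and channel count are assumptions.

    import numpy as np
    from scipy.signal import butter, hilbert, sosfiltfilt

    fs = 16000
    x = np.random.randn(fs)              # stand-in for one second of speech

    # Four illustrative analysis bands (Hz); real strategies use more channels.
    edges = [(80, 500), (500, 1000), (1000, 2000), (2000, 4000)]

    channels = []
    for lo, hi in edges:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        analytic = hilbert(sosfiltfilt(sos, x))
        env = np.abs(analytic)                          # temporal envelope (CIS cue)
        phase = np.unwrap(np.angle(analytic))
        inst_f = np.gradient(phase) * fs / (2 * np.pi)  # fine structure (FM cue)
        # The envelope modulates a carrier at the slowly varying instantaneous
        # frequency instead of a fixed band-center tone.
        carrier = np.cos(2 * np.pi * np.cumsum(inst_f) / fs)
        channels.append(env * carrier)

    y = np.sum(channels, axis=0)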

6.
An Analysis of Consonant Durations in Mandarin Chinese
Consonant duration in Mandarin Chinese is one of the basic parameters of speech and is of considerable practical value in speech synthesis, speech recognition, and related research. In this study seven male and six female voices were analyzed to obtain the mean durations and standard deviations of the 22 Mandarin consonants. A test word list was designed for the measurements. It consists of 22 two-syllable words; within each word the two syllables share the same consonant but differ in vowel and tone, so naturally the syllables differ in length. This design makes it possible to examine, in connected speech, how consonant duration relates to position, to the accompanying vowel, and to tone, and to compare absolute and relative durations. The speakers, mostly young adults speaking standard Mandarin, were recorded in an anechoic room, and the recordings were analyzed with a sound spectrograph. Because some consonants have wide bandwidth, weak energy, and short duration, the high-speed setting was used and the playback and marking voltages were raised so that the consonant portions showed clearly in the spectrograms. After statistical processing, the following preliminary conclusions were reached: 1. Consonant duration is directly related to aspiration: unaspirated stops are the shortest and aspirated affricates the longest, with fairly fixed duration ratios among the manners of articulation, whereas duration has little to do with place of articulation. 2. Consonant duration has little to do with tone or with total syllable length, but aspirated affricates are affected by the following vowel: the more open the vowel, the shorter the consonant. 3. In connected speech, consonant duration does not depend on whether the syllable comes first or second.

7.
颜永红 《应用声学》2012,31(1):35-41
This paper reviews recent progress in speech acoustics and content understanding. It first surveys advances in human speech production, perception, and acoustic analysis; it then presents recent results on using computers to extract various kinds of information from speech (including speech, speaker, and language recognition) and on content analysis and understanding (including document content analysis and understanding, and dialogue); finally, it summarizes research on speech acoustics and content understanding and discusses its prospects.

8.
For sparse representation and compressed sensing of speech signals, this paper introduces auditory perception into the screening of sparse coefficients, using the masking threshold to select the important coefficients and thus obtain a sparse representation better matched to auditory perception. In a comparison on one frame of voiced speech, selecting coefficients by masking threshold and by energy threshold, the masking-threshold method gave the better sparse representation. To verify the effect of auditory perception on compressed-sensing performance, test speech was observed and reconstructed by compressed sensing, with the energy-threshold method as the control, and performance was evaluated by objective and subjective measures including compression ratio, signal-to-noise ratio, and mean opinion score. The results show that the masking-threshold method effectively raises the compression ratio while keeping the reconstructed speech at high subjective quality.
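A toy comparison of masking-threshold versus energy-threshold coefficient selection. The "masking threshold" here is a crude smoothed-spectrum stand-in for a real psychoacoustic model, and the thresholds and frame content are illustrative.

    import numpy as np
    from scipy.fft import dct, idct

    frame = np.random.randn(256)                 # stand-in voiced frame
    coef = dct(frame, norm="ortho")

    # Crude stand-in for a per-coefficient masking threshold: a smoothed,
    # scaled copy of the coefficient magnitudes. A real implementation would
    # derive this from a psychoacoustic model of the frame's spectrum.
    mask = 0.5 * np.convolve(np.abs(coef), np.ones(9) / 9, mode="same")

    keep_mask = np.abs(coef) > mask                         # masking threshold
    keep_energy = np.abs(coef) > 0.3 * np.abs(coef).max()   # energy threshold

    sparse = np.where(keep_mask, coef, 0.0)
    recon = idct(sparse, norm="ortho")
    print(keep_mask.sum(), keep_energy.sum())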

9.
Nonlinear Dynamical Characteristics of Mandarin Speech and Their Application to Noise Reduction
The correlation dimension, minimal embedding dimension, and reconstructed phase portraits of Mandarin speech are analyzed. The results show that Mandarin speech exhibits chaotic characteristics. These nonlinear features can effectively distinguish voiced sounds, unvoiced sounds, and random noise in Mandarin, and can therefore be used for speech denoising. The principle and algorithm of local-projection chaotic speech denoising are introduced, and the algorithm is applied to several typical vowels and consonants, achieving good denoising results.
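A minimal sketch of the phase-space reconstruction underlying such analyses: time-delay embedding of a stand-in speech frame. The embedding dimension and delay are illustrative; in practice the delay is chosen from the first minimum of mutual information and the dimension from a false-nearest-neighbor test.

    import numpy as np

    def delay_embed(x, dim, tau):
        """Reconstruct a phase-space trajectory by time-delay embedding."""
        n = len(x) - (dim - 1) * tau
        return np.stack([x[i * tau : i * tau + n] for i in range(dim)], axis=1)

    fs = 8000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 120 * t) + 0.05 * np.random.randn(fs)  # stand-in vowel

    traj = delay_embed(x, dim=3, tau=8)
    print(traj.shape)  # (n_points, 3) phase-space points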

10.
A speech endpoint detection algorithm using a perceptual spectrogram structure boundary (PSSB) parameter is proposed for preprocessing speech signals in low-SNR environments. After auditory-perception-based enhancement of the noisy speech, a two-dimensional enhancement is applied to the time-frequency spectrogram, exploiting the difference between the continuous distribution of speech and the random distribution of residual noise to further emphasize the continuously distributed spectral structure of clean speech. The PSSB parameter is then obtained by two-dimensional boundary detection on the enhanced spectrogram structure and is used for endpoint detection. Experimental results show that, in white noise at SNRs from -10 dB to 10 dB, endpoint detection with the PSSB parameter detects speech endpoints more effectively than other endpoint-detection algorithms. At the extremely low SNR of -10 dB, the proposed method still achieves 75.2% accuracy. The PSSB-based algorithm is thus well suited to speech endpoint detection in low-SNR white-noise environments.
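A generic stand-in for the two-dimensional spectrogram enhancement and boundary detection described above (not the PSSB parameter itself): 2-D smoothing followed by a per-frame edge-strength score. The decision threshold and test signal are illustrative.

    import numpy as np
    from scipy.signal import stft
    from scipy.ndimage import median_filter, sobel

    fs = 16000
    x = np.random.randn(fs)               # stand-in noisy utterance

    _, _, Z = stft(x, fs=fs, nperseg=256)
    S = np.log(np.abs(Z) + 1e-6)

    # 2-D smoothing favors the time-frequency-continuous speech structure
    # over randomly scattered residual noise.
    S2 = median_filter(S, size=(3, 5))

    # Boundary strength along time; frames with strong, sustained structure
    # boundaries are taken as speech. The threshold is illustrative.
    edges = np.abs(sobel(S2, axis=1))
    score = edges.mean(axis=0)
    is_speech = score > score.mean() + 0.5 * score.std()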

11.
Different patterns of performance across vowels and consonants in tests of categorization and discrimination indicate that vowels tend to be perceived more continuously, or less categorically, than consonants. The present experiments examined whether analogous differences in perception would arise in nonspeech sounds that share critical transient acoustic cues of consonants and steady-state spectral cues of simplified synthetic vowels. Listeners were trained to categorize novel nonspeech sounds varying along a continuum defined by a steady-state cue, a rapidly changing cue, or both cues. Listeners' categorization of stimuli varying on the rapidly changing cue showed a sharp category boundary, and posttraining discrimination was well predicted from the assumption of categorical perception. Listeners more accurately discriminated but less accurately categorized steady-state nonspeech stimuli. When listeners categorized stimuli defined by both rapidly changing and steady-state cues, discrimination performance was accurate and the categorization function exhibited a sharp boundary. These data are similar to those found in experiments with dynamic vowels, which are defined by both steady-state and rapidly changing acoustic cues. A general account for the speech and nonspeech patterns is proposed based on the supposition that the perceptual trace of rapidly changing sounds decays faster than the trace of steady-state sounds.

12.
Several studies have demonstrated that when talkers are instructed to speak clearly, the resulting speech is significantly more intelligible than speech produced in ordinary conversation. These speech intelligibility improvements are accompanied by a wide variety of acoustic changes. The current study explored the relationship between acoustic properties of vowels and their identification in clear and conversational speech, for young normal-hearing (YNH) and elderly hearing-impaired (EHI) listeners. Monosyllabic words excised from sentences spoken either clearly or conversationally by a male talker were presented in 12-talker babble for vowel identification. While vowel intelligibility was significantly higher in clear speech than in conversational speech for the YNH listeners, no clear speech advantage was found for the EHI group. Regression analyses were used to assess the relative importance of spectral target, dynamic formant movement, and duration information for perception of individual vowels. For both listener groups, all three types of information emerged as primary cues to vowel identity. However, the relative importance of the three cues for individual vowels differed greatly for the YNH and EHI listeners. This suggests that hearing loss alters the way acoustic cues are used for identifying vowels.

13.
The speech signal contains many acoustic properties that may contribute differently to spoken word recognition. Previous studies have demonstrated that the importance of properties present during consonants or vowels is dependent upon the linguistic context (i.e., words versus sentences). The current study investigated three potentially informative acoustic properties that are present during consonants and vowels for monosyllabic words and sentences. Natural variations in fundamental frequency were either flattened or removed. The speech envelope and temporal fine structure were also investigated by limiting the availability of these cues via noisy signal extraction. Thus, this study investigated the contribution of these acoustic properties, present during either consonants or vowels, to overall word and sentence intelligibility. Results demonstrated that all processing conditions displayed better performance for vowel-only sentences. Greater performance with vowel-only sentences remained, despite removing dynamic cues of the fundamental frequency. Word and sentence comparisons suggest that the speech envelope may be at least partially responsible for additional vowel contributions in sentences. Results suggest that speech information transmitted by the envelope is responsible, in part, for greater vowel contributions in sentences, but is not predictive for isolated words.
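A minimal sketch of the envelope/fine-structure decomposition such studies rely on, here via the Hilbert transform on a single band (real designs operate per band and control cue fidelity with noisy extraction); the test signal is a stand-in.

    import numpy as np
    from scipy.signal import hilbert

    fs = 16000
    t = np.arange(fs) / fs
    # Stand-in signal: a 150-Hz tone with a slow amplitude modulation.
    x = np.sin(2 * np.pi * 150 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))

    analytic = hilbert(x)
    envelope = np.abs(analytic)            # slow amplitude contour
    tfs = np.cos(np.angle(analytic))       # temporal fine structure

    # Envelope-only signal: envelope imposed on a noise carrier, discarding TFS.
    env_only = envelope * np.random.randn(len(x))
    # TFS-only signal: fine structure with a flattened envelope.
    tfs_only = tfs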

14.
This study investigated the relative contributions of consonants and vowels to the perceptual intelligibility of monosyllabic consonant-vowel-consonant (CVC) words. A noise replacement paradigm presented CVCs with only consonants or only vowels preserved. Results demonstrated no difference between overall word accuracy in these conditions; however, different error patterns were observed. A significant effect of lexical difficulty was demonstrated for both types of replacement, whereas the noise level used during replacement did not influence results. The contribution of consonant and vowel transitional information present at the consonant-vowel boundary was also explored. The proportion of speech presented, regardless of the segmental condition, overwhelmingly predicted performance. Comparisons were made with previous segment replacement results using sentences [Fogerty and Kewley-Port (2009). J. Acoust. Soc. Am. 126, 847-857]. Results demonstrated that consonants contribute to intelligibility equally in both isolated CVC words and sentences. However, vowel contributions were mediated by context, with greater contributions to intelligibility in sentence contexts. Therefore, it appears that vowels in sentences carry unique speech cues that greatly facilitate intelligibility which are not informative and/or present during isolated word contexts. Consonants appear to provide speech cues that are equally available and informative during sentence and isolated word presentations.
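A minimal sketch of the noise-replacement paradigm: segments of a word are replaced by noise so that only consonants or only vowels survive. The segment boundaries, noise spectrum, and level are hypothetical; the study used labeled segment boundaries.

    import numpy as np

    fs = 16000
    word = np.random.randn(int(0.5 * fs))        # stand-in CVC word

    # Hypothetical C-V-C boundaries (seconds).
    c1 = (0.00, 0.12)
    v = (0.12, 0.35)
    c2 = (0.35, 0.50)

    def replace_with_noise(x, spans, fs, level_db=-10):
        """Replace the given time spans with a white-noise stand-in."""
        y = x.copy()
        rms = 10 ** (level_db / 20)
        for lo, hi in spans:
            i, j = int(lo * fs), int(hi * fs)
            y[i:j] = rms * np.random.randn(j - i)
        return y

    vowel_only = replace_with_noise(word, [c1, c2], fs)   # consonants -> noise
    cons_only = replace_with_noise(word, [v], fs)         # vowel -> noise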

15.
Research on the perception of vowels in the last several years has given rise to new conceptions of vowels as articulatory, acoustic, and perceptual events. Starting from a "simple" target model in which vowels were characterized articulatorily as static vocal tract shapes and acoustically as points in a first and second formant (F1/F2) vowel space, this paper briefly traces the evolution of vowel theory in the 1970s and 1980s in two directions. (1) Elaborated target models represent vowels as target zones in perceptual spaces whose dimensions are specified as formant ratios. These models have been developed primarily to account for perceivers' solution of the "speaker normalization" problem. (2) Dynamic specification models emphasize the importance of formant trajectory patterns in specifying vowel identity. These models deal primarily with the problem of "target undershoot" associated with the coarticulation of vowels with consonants in natural speech and with the issue of "vowel-inherent spectral change" or diphthongization of English vowels. Perceptual studies are summarized that motivate these theoretical developments.

16.
In order to assess the limitations imposed on a cochlear implant system by a wearable speech processor, the parameters extracted from a set of 11 vowels and 24 consonants were examined. An estimate of the fundamental frequency EF0 was derived from the zero crossings of the low-pass filtered envelope of the waveform. Estimates of the first and second formant frequencies EF1 and EF2 were derived from the zero crossings of the waveform, which was filtered in the ranges 300-1000 and 800-4000 Hz. Estimates of the formant amplitudes EA1 and EA2 were derived by peak detectors operating on the outputs of the same filters. For vowels, these parameters corresponded well to the first and second formants and gave sufficient information to identify each vowel. For consonants, the relative levels and onset times of EA1 and EA2 and the EF0 values gave cues to voicing. The variation in time of EA1, EA2, EF1, and EF2 gave cues to the manner of articulation. Cues to the place of articulation were given by EF1 and EF2. When pink noise was added, the parameters were gradually degraded as the signal-to-noise ratio decreased. Consonants were affected more than vowels, and EF2 was affected more than EF1. Results for three good patients using a speech processor that coded EF0 as an electric pulse rate, EF1 and EF2 as electrode positions, and EA1 and EA2 as electric current levels confirmed that the parameters were useful for recognition of vowels and consonants. Average scores were 76% for recognition of 11 vowels and 71% for 12 consonants in the hearing-alone condition. The error rates were 4% for voicing, 12% for manner, and 25% for place.
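A minimal sketch of the zero-crossing parameter extraction described above for EF1, EF2, EA1, and EA2 (EF0, from zero crossings of the low-pass filtered envelope, is omitted for brevity). Filter orders and the test signal are assumptions.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def zc_rate(x, fs):
        """Frequency estimate from the zero-crossing rate of a signal."""
        crossings = np.sum(np.signbit(x[:-1]) != np.signbit(x[1:]))
        return crossings * fs / (2 * len(x))

    fs = 16000
    t = np.arange(int(0.05 * fs)) / fs
    # Stand-in vowel with formant-like components near 700 and 1800 Hz.
    x = np.sin(2 * np.pi * 700 * t) + 0.6 * np.sin(2 * np.pi * 1800 * t)

    # EF1 and EF2: zero crossings of the 300-1000 Hz and 800-4000 Hz bands.
    sos1 = butter(4, [300, 1000], btype="bandpass", fs=fs, output="sos")
    sos2 = butter(4, [800, 4000], btype="bandpass", fs=fs, output="sos")
    b1, b2 = sosfiltfilt(sos1, x), sosfiltfilt(sos2, x)
    EF1, EF2 = zc_rate(b1, fs), zc_rate(b2, fs)

    # EA1 and EA2: peak levels of the same band outputs.
    EA1, EA2 = np.abs(b1).max(), np.abs(b2).max()
    print(EF1, EF2, EA1, EA2)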

17.
Previous research has suggested that speech loudness is determined primarily by the vowel in consonant-vowel-consonant (CVC) monosyllabic words, and that consonant intensity has a negligible effect. The current study further examines the unique aspects of speech loudness by manipulating consonant-vowel intensity ratios (CVRs), while holding the vowel constant at a comfortable listening level (70 dB), to determine the extent to which vowels and consonants contribute differentially to the loudness of monosyllabic words with voiced and voiceless consonants. The loudness of words edited to have CVRs ranging from -6 to +6 dB was compared to that of standard words with unaltered CVR by 10 normal-hearing listeners in an adaptive procedure. Loudness and overall level as a function of CVR were compared for four CVC word types: both voiceless consonants modified; only initial voiceless consonants modified; both voiced consonants modified; and only initial voiced consonants modified. Results indicate that the loudness of CVC monosyllabic words is not based strictly on the level of the vowel; rather, the overall level of the word and the level of the vowel contribute approximately equally. In addition to furthering the basic understanding of speech perception, the current results may be of value for the coding of loudness by hearing aids and cochlear implants.
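A minimal sketch of the CVR manipulation: the consonant portions of a word are rescaled relative to a fixed vowel. The vowel boundaries are hypothetical; in the study the vowel was held at 70 dB and CVR stepped from -6 to +6 dB.

    import numpy as np

    fs = 16000
    word = np.random.randn(int(0.4 * fs))        # stand-in CVC word
    v_span = (int(0.10 * fs), int(0.28 * fs))    # hypothetical vowel boundaries

    def set_cvr(x, v_span, cvr_db):
        """Scale the consonant portions so the consonant-vowel intensity
        ratio equals cvr_db, leaving the vowel segment untouched."""
        y = x.copy()
        i, j = v_span
        v_rms = np.sqrt(np.mean(x[i:j] ** 2))
        c = np.concatenate([x[:i], x[j:]])
        c_rms = np.sqrt(np.mean(c ** 2))
        gain = v_rms * 10 ** (cvr_db / 20) / c_rms
        y[:i] *= gain
        y[j:] *= gain
        return y

    stimuli = {d: set_cvr(word, v_span, d) for d in range(-6, 7, 2)}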

18.
Synthesis (carrier) signals in acoustic models embody assumptions about perception of auditory electric stimulation. This study compared speech intelligibility of consonants and vowels processed through a set of nine acoustic models that used Spectral Peak (SPEAK) and Advanced Combination Encoder (ACE)-like speech processing, using synthesis signals which were representative of signals used previously in acoustic models as well as two new ones. Performance of the synthesis signals was determined in terms of correspondence with cochlear implant (CI) listener results for 12 attributes of phoneme perception (consonant and vowel recognition; F1, F2, and duration information transmission for vowels; voicing, manner, place of articulation, affrication, burst, nasality, and amplitude envelope information transmission for consonants) using four measures of performance. Modulated synthesis signals produced the best correspondence with CI consonant intelligibility, while sinusoids, narrow noise bands, and varying noise bands produced the best correspondence with CI vowel intelligibility. The signals that performed best overall (in terms of correspondence with both vowel and consonant attributes) were modulated and unmodulated noise bands of varying bandwidth that corresponded to a linearly varying excitation width of 0.4 mm at the apical to 8 mm at the basal channels.
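A minimal sketch of one family of synthesis signals discussed above: noise-band carriers whose bandwidth widens from apical to basal channels. The channel centers and bandwidths are illustrative stand-ins for the cochlear-place excitation widths quoted in the abstract, not the study's mapping.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    fs = 16000
    n = fs  # one second of carrier

    def noise_band(center, bw, fs, n):
        """Band-limited noise carrier centered at `center` with bandwidth `bw`."""
        lo, hi = max(center - bw / 2, 50), min(center + bw / 2, fs / 2 - 50)
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        return sosfiltfilt(sos, np.random.randn(n))

    # Hypothetical channel centers; bandwidths widen from apical to basal
    # channels, standing in for the 0.4 mm -> 8 mm excitation widths.
    centers = [250, 500, 1000, 2000, 4000]
    widths = np.linspace(60, 1200, len(centers))
    carriers = [noise_band(c, w, fs, n) for c, w in zip(centers, widths)]

    # Each carrier would then be modulated by its channel envelope.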

19.
This paper examines whether correlations between speech perception and speech production exist, and, if so, whether they might provide a way of evaluating different acoustic metrics. The cues listeners use for many phonemic distinctions are not known, often because many different acoustic cues are highly correlated with one another, making it difficult to distinguish among them. Perception-production correlations may provide a new means of doing so. In the present paper, correlations were examined between acoustic measures taken on listeners' perceptual prototypes for a given speech category and on their average production of members of that category. Significant correlations were found for VOT among stop consonants, and for spectral peaks (but not centroids or skewness) for voiceless fricatives. These results suggest that correlations between speech perception and production may provide a methodology for evaluating different proposed acoustic metrics.

20.
Recent evidence suggests that spectral change, as measured by cochlea-scaled entropy (CSE), predicts speech intelligibility better than the information carried by vowels or consonants in sentences. Motivated by this finding, the present study investigates whether intelligibility indices implemented to include segments marked with significant spectral change better predict speech intelligibility in noise than measures that include all phonetic segments paying no attention to vowels/consonants or spectral change. The prediction of two intelligibility measures [normalized covariance measure (NCM), coherence-based speech intelligibility index (CSII)] is investigated using three sentence-segmentation methods: relative root-mean-square (RMS) levels, CSE, and traditional phonetic segmentation of obstruents and sonorants. While the CSE method makes no distinction between spectral changes occurring within vowels/consonants, the RMS-level segmentation method places more emphasis on the vowel-consonant boundaries wherein the spectral change is often most prominent, and perhaps most robust, in the presence of noise. Higher correlation with intelligibility scores was obtained when including sentence segments containing a large number of consonant-vowel boundaries than when including segments with highest entropy or segments based on obstruent/sonorant classification. These data suggest that in the context of intelligibility measures the type of spectral change captured by the measure is important.
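A minimal sketch of relative-RMS-level segmentation: frames within some margin of the sentence's overall RMS level are selected, and only those segments would be passed to an intelligibility measure such as the NCM or CSII. The 10 dB criterion, frame length, and signal are illustrative.

    import numpy as np

    fs = 16000
    clean = np.random.randn(2 * fs)              # stand-in clean sentence

    frame = int(0.02 * fs)
    frames = clean[: len(clean) // frame * frame].reshape(-1, frame)
    rms_db = 20 * np.log10(np.sqrt(np.mean(frames ** 2, axis=1)) + 1e-12)

    # Relative-RMS segmentation: keep frames within 10 dB of the sentence's
    # overall RMS level (an illustrative criterion).
    overall = 20 * np.log10(np.sqrt(np.mean(clean ** 2)))
    selected = rms_db >= overall - 10
    print(selected.mean())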
