1.

肖东莫福源陈庚马力《应用声学》2016,35(1):77-83

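Transition segments play a non-negligible role in speech clarity, intelligibility, and auditory perception. In parametric speech coding, whether frames that contain transition segments are handled properly is key to whether the synthesized speech is clear and intelligible. Taking mixed excitation linear prediction coding as the reference, this paper classifies speech frames into four classes (silence, unvoiced, voiced, and transition) and processes each class separately. Building on earlier work on low-bit-rate speech coding (1 kbps), it compares the effect of eight transition-frame classification methods on the PESQ MOS of the synthesized speech. The analysis shows that different transition frames contribute differently to PESQ MOS: transition frames that go from unvoiced speech or silence to voiced speech contribute the most, and transition frames between voiced consonants and vowels should not be neglected either.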

2.

A new wavelet-packet adaptive-threshold algorithm for speech denoising

田玉静左红伟董玉民王超《应用声学》2011,30(1):72-80

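At low input signal-to-noise ratios, speech enhancement tends to lose the weak components of unvoiced speech and thereby distorts the envelope of the reconstructed signal. To overcome this problem, the paper proposes a new speech enhancement method. Following a speech perception model, the method uses an incomplete wavelet-packet decomposition to approximate the critical bands of hearing, and separates voiced from unvoiced speech by subband energy so that the two can be processed differently. For threshold calculation, it proposes a wavelet-packet adaptive-threshold algorithm that treats voiced and unvoiced speech separately and is based on subband signal energy. Simulation experiments, objective evaluation, and listening tests show that, at low input SNR, the algorithm reduces envelope distortion of the reconstructed signal more effectively than conventional algorithms and markedly raises the output SNR without harming speech clarity or naturalness. Combining the algorithm with energy spectral subtraction for a second enhancement pass further improves the quality of the denoised speech.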
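As a rough illustration of the thresholding step mentioned in this abstract, the sketch below applies generic soft thresholding to a list of subband coefficients. It does not reproduce the paper's adaptive threshold rule, which the abstract does not specify; the function name `soft_threshold` and the idea of passing one threshold per subband are assumptions made for illustration only.

```python
def soft_threshold(coeffs, threshold):
    """Generic soft thresholding (assumed, not the paper's rule):
    shrink each coefficient toward zero by `threshold`, and zero out
    coefficients whose magnitude does not exceed it."""
    out = []
    for c in coeffs:
        mag = abs(c) - threshold
        if mag <= 0:
            out.append(0.0)                    # treated as noise
        else:
            out.append(mag if c > 0 else -mag)  # keep sign, reduce magnitude
    return out


# Hypothetical usage: a smaller threshold for an unvoiced subband keeps
# more of its weak components than a larger threshold would.
subband = [3.0, -0.5, -2.0, 0.25]
print(soft_threshold(subband, 1.0))
print(soft_threshold(subband, 0.2))
```

An energy-dependent scheme would choose `threshold` per subband from the subband's energy; how that mapping is defined is left open here.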

3.

Joint modeling and maximum-likelihood estimation of pitch and linear prediction coefficient parameters.

D Burshtein 《The Journal of the Acoustical Society of America》1992,91(3):1531-1537

The well-known speech production model is considered, where the speech signal is modeled as the output of an all-pole filter driven either by some white noise sequence (unvoiced speech) or by the sum of a periodic excitation and a noise sequence (voiced speech). Approximate maximum-likelihood (ML) estimation algorithms for the unvoiced case are well known. The ML estimator of the parameters is obtained for the voiced speech model. These parameters consist of the parameters of the periodic excitation (pitch parameters) and the parameters of the filter [linear prediction coefficient (LPC) parameters]. The results of the application of the algorithm on simulated and on real speech data are presented.
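The speech production model described in this abstract can be sketched in a few lines. The code below only synthesizes a signal from the model (all-pole filter driven by noise, or by a pulse train plus noise); it is not the paper's ML estimator. The function names, the unit-impulse pulse train, and the sign convention `y[n] = x[n] - sum(a[k] * y[n-k])` are assumptions chosen for illustration.

```python
import random


def all_pole_filter(excitation, a):
    """Run an all-pole filter: y[n] = x[n] - sum_k a[k-1] * y[n-k].
    `a` holds the coefficients a_1..a_p (assumed sign convention)."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y


def make_excitation(n_samples, pitch_period=None, noise_std=0.1, seed=0):
    """Unvoiced: white Gaussian noise only.
    Voiced: unit impulses every `pitch_period` samples plus the same noise."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, noise_std) for _ in range(n_samples)]
    if pitch_period is None:
        return noise
    return [noise[n] + (1.0 if n % pitch_period == 0 else 0.0)
            for n in range(n_samples)]


# Hypothetical usage: a one-pole filter driven by a voiced excitation.
voiced = all_pole_filter(make_excitation(200, pitch_period=40), a=[-0.9])
unvoiced = all_pole_filter(make_excitation(200), a=[-0.9])
```

Estimating `a` and the pitch parameters jointly from an observed signal is the harder inverse problem the paper addresses.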

4.

Reconstruction of breathy-voice phonation using the cepstrum method

李国锋刘莹《应用声学》1996,15(5):41-44

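This paper presents a method for reconstructing breathy-voice phonation using the complex cepstrum. It first analyzes the speech characteristics of breathy phonation, and then adds fundamental-frequency features to the complex cepstrum sequence to restore normal speech. The method was applied to the vowel [a] and to real speech segments, with good results in both cases.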

5.

Reconstruction of breathy-voice phonation using the cepstrum method (cited by: 1; self-citations: 0; citations by others: 1)

李国锋刘莹《应用声学》1996,15(5):41-44

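This paper presents a method for reconstructing breathy-voice phonation using the complex cepstrum. It first analyzes the speech characteristics of breathy phonation, and then adds fundamental-frequency features to the complex cepstrum sequence to restore normal speech. The method was applied to the vowel [a] and to real speech segments, with good results in both cases.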

6.

A study of the characteristics of Chinese speech using short-time nonlinear prediction

徐歆胡水清陶超杜功焕《应用声学》2003,22(5):36-40,44

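This paper applies the short-time nonlinear prediction method, as improved by Short [8-10], to the analysis of Chinese syllables and phrases spoken at a normal rate. It reveals a difference in short-time nonlinear predictability between voiced and unvoiced sounds in Chinese speech, and finds that this difference can still be detected with short-time nonlinear prediction even under strong background noise. These findings offer a possible means of segmenting voiced and unvoiced sounds.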

7.

Pitch and voiced/unvoiced determination with an auditory model.

L M Van Immerseel J P Martens 《The Journal of the Acoustical Society of America》1992,91(6):3511-3526

In this paper, an accurate pitch and voiced/unvoiced determination algorithm for speech analysis is described. The algorithm is called AMPEX (auditory model-based pitch extractor) and it performs a temporal analysis of the outputs emerging from a new auditory model. However, in spite of its use of an auditory model, AMPEX should not be regarded as a substitute for any psychophysical theory of human auditory pitch perception. What is mainly described is the design of a computationally efficient auditory model, the perceptually motivated determination of the model parameters, the conception of a reliable pitch extractor for speech analysis, and the elaboration of an experimental procedure for evaluating the performance of such a pitch extractor. In the course of the evaluation experiment several kinds of speech stimuli including clean speech, bandpass-filtered speech, and noisy speech were presented to three different pitch extractors. The experimental results clearly indicate that AMPEX outperforms the best algorithms available.

8.

An approach based on simplified KLT and wavelet transform for enhancing speech degraded by non-stationary wideband noise

Hong Wei Lou Guang Rui Hu 《Journal of sound and vibration》2003,268(4):717-729

Non-stationary wideband noise is well known to be the most difficult noise to remove in speech enhancement. In this paper a novel speech enhancement algorithm based on the dyadic wavelet transform and the simplified Karhunen-Loeve transform (KLT) is proposed to suppress non-stationary wideband noise. The noisy speech is decomposed into components by the wavelet space and KLT-based vector space, and the components are processed and reconstructed, respectively, by distinguishing between voiced speech and unvoiced speech. Neither noise whitening nor SNR pre-calculation is required. In order to evaluate the performance of this algorithm in more detail, a three-dimensional spectral distortion measure is introduced. Experiments and comparisons between different speech enhancement systems by means of the distortion measure show that the proposed method avoids the drawbacks of the previous methods and performs better shaping and suppression of non-stationary wideband noise for speech enhancement.

9.

The control of air flow during loud soprano singing

Martin Rothenberg Donald Miller Richard Molitor Dolores Leffingwell 《Journal of voice》1987,1(3)

Previous research on the special characteristics of the professional singing voice has at least partially explained how singers can commonly use much higher lung pressures than nonsingers without vocal damage or excessive air flow during the voiced sounds. In this study, the control of air flow during the unvoiced consonants is examined for an operatic-style soprano. It was found that this singer could maintain a low average air flow during the consonants even though the lung pressure reached values over five times those used during normal conversational speech. The air flow was kept low primarily by the use of a number of mechanisms involving rapid, accurate, coordinated valving of the air flow at the point of articulation and at the glottis.

10.

Estimation of vocal dysperiodicities in disordered connected speech by means of distant-sample bidirectional linear predictive analysis

Bettens F Grenez F Schoentgen J 《The Journal of the Acoustical Society of America》2005,117(1):328-337

The article presents an analysis of vocal dysperiodicities in connected speech produced by dysphonic speakers. The processing is based on a comparison of the present speech fragment with future and past fragments. The size of the dysperiodicity estimate is zero for periodic speech signals. A feeble increase of the vocal dysperiodicity is guaranteed to produce a feeble increase of the estimate. No spurious noise boosting occurs owing to cycle insertion and omission errors, or phonetic segment boundary artifacts. Additional objectives of the study have been investigating whether deviations from periodicity are larger or more commonplace in connected speech than in sustained vowels, and whether sentences that comprise frequent voice onsets and offsets are noisier than sentences that comprise few. The corpora contain sustained vowels as well as grammatically- and phonetically matched sentences. An acoustic marker that correlates with the perceived degree of hoarseness summarizes the size of the dysperiodicities. The marker values for sustained vowels have been highly correlated with those for connected speech, and the marker values for sentences that comprise few voiced/unvoiced transients have been highly correlated with the marker values for sentences that comprise many.

11.

Frication noise modulated by voicing, as revealed by pitch-scaled decomposition

Jackson PJ Shadle CH 《The Journal of the Acoustical Society of America》2000,108(4):1421-1434

A decomposition algorithm that uses a pitch-scaled harmonic filter was evaluated using synthetic signals and applied to mixed-source speech, spoken by three subjects, to separate the voiced and unvoiced parts. Pulsing of the noise component was observed in voiced frication, which was analyzed by complex demodulation of the signal envelope. The timing of the pulsation, represented by the phase of the anharmonic modulation coefficient, showed a step change during a vowel-fricative transition corresponding to the change in location of the noise source within the vocal tract. Analysis of fricatives [see text] demonstrated a relationship between steady-state phase and place, and f0 glides confirmed that the main cause was a place-dependent delay.

12.

Acoustic Analysis of Consonants in Whispered Speech

Slobodan T. Jovičić Zoran Šarić 《Journal of voice》2008,22(3):263-274

An acoustic analysis of whispered consonants in comparison to normally phonated consonants was conducted in time and intensity domains. Consonant duration and average root mean square intensity were measured for six speakers in both articulation modes. Each of 25 Serbian consonants (C) was set between /a/ vowels, forming a syllable of /aCa/ type. Such a syllable was placed in initial, medial, and final position in the carrier sentence. Results showed that whispered consonants have a prolonged duration of about 10% on average (statistically significant, ANOVA test), and that the unvoiced consonants have a smaller time dimension extension (5.8%) than voiced ones (15.3%). Examination at subphonemic level showed that there is no difference in voice-onset-time and affrication duration in unvoiced plosives and affricates, in both whispered and phonated mode of articulation, but the difference is significant for voiced ones. Analysis of consonant duration versus place of articulation showed that palatal place is most sensitive in the process of whispering. In all experiments, the results are very consistent with respect to the subjects and test material (Pearson's correlation was between 0.6 and 0.9). In intensity domain, all unvoiced consonants in whispered mode of articulation have almost unchanged intensity in comparison to phonated mode (the difference is maximum 3.5 dB). On the contrary, voiced consonants in the whispered mode were reduced in intensity by as much as 25 dB, as nasals and semivowels. Average intensity of whispered consonants is lowered by 12 dB in comparison to phonated ones, and does not depend on syllabic position inside the sentences.

13.

Informational masking of speech produced by speech-like sounds without linguistic content

Chen J Li H Li L Wu X Moore BC 《The Journal of the Acoustical Society of America》2012,131(4):2914-2926

This study investigated whether speech-like maskers without linguistic content produce informational masking of speech. The target stimuli were nonsense Chinese Mandarin sentences. In experiment I, the masker contained harmonics the fundamental frequency (F0) of which was sinusoidally modulated and the mean F0 of which was varied. The magnitude of informational masking was evaluated by measuring the change in intelligibility (releasing effect) produced by inducing a perceived spatial separation of the target speech and masker via the precedence effect. The releasing effect was small and was only clear when the target and masker had the same mean F0, suggesting that informational masking was small. Performance with the harmonic maskers was better than with a steady speech-shaped noise (SSN) masker. In experiments II and III, the maskers were speech-like synthesized signals, alternating between segments with harmonic structure and segments composed of SSN. Performance was much worse than for experiment I, and worse than when an SSN masker was used, suggesting that substantial informational masking occurred. The similarity of the F0 contours of the target and masker had little effect. The informational masking effect was not influenced by whether or not the noise-like segments of the masker were synchronous with the unvoiced segments of the target speech.

14.

An improved LVAMDF and a multi-factor pitch detection algorithm

薛帅强陈波陈菲 《应用声学》2016,24(4):253-256

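Building on a classification of the speech signal into silence, unvoiced, and voiced segments, and addressing the random distribution of segments with clear periodicity, this paper proposes an improved variable-length average magnitude difference function (LVAMDF) and a multi-factor pitch detection algorithm. The algorithm clusters the speech signal into segments with clear periodicity and segments without it, and at the same time obtains the pitch period of the clearly periodic segments. For the small number of pitch periods that are marked at double or half the true frequency, it proposes identification and correction methods, for which a very high identification and correction rate is reported. On a large amount of real speech, the algorithm accurately detects the pitch-period endpoints of the clearly periodic segments, with almost no frequency-doubling or frequency-halving errors. The algorithm is also compared with the AMDF and ACF algorithms.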
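For readers unfamiliar with the baseline named in this abstract, here is a minimal sketch of the plain average magnitude difference function (AMDF) and a pitch-period pick by its minimum. It is not the paper's improved LVAMDF, whose details the abstract does not give; the function names and the lag search range are assumptions for illustration.

```python
import math


def amdf(frame, lag):
    """Average magnitude difference between the frame and a copy of
    itself shifted by `lag` samples (averaged over the overlap)."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n


def estimate_period(frame, min_lag, max_lag):
    """Pick the lag in [min_lag, max_lag] with the smallest AMDF value."""
    return min(range(min_lag, max_lag + 1), key=lambda lag: amdf(frame, lag))


# Hypothetical usage: a pure tone with a period of 50 samples.
frame = [math.sin(2 * math.pi * i / 50) for i in range(400)]
print(estimate_period(frame, 20, 80))
```

Because the AMDF is also near zero at integer multiples of the true period, a search range that includes lag 100 in this example could return the doubled period, which is the kind of doubling or halving error the abstract says the proposed method identifies and corrects.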

15.

A method for segmenting initials and finals using a loss function and acoustic features

李皓唐朝京《声学学报》2012,37(3):339-345

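To achieve robust segmentation of initials and finals for large-vocabulary continuous speech recognition systems, this paper proposes a method that builds a loss function and exploits the quasi-periodicity of voiced speech together with the duration of initials. The autocorrelation function of the speech is computed first. A cost (loss) function is then built, and dynamic programming is applied to the autocorrelation results to detect voiced speech. Next, the search range for the initial is set according to the distribution of initial durations. Finally, within that range, auditory event detection is applied around the onset of the voiced segment to separate the initial from the final. Experiments show that dynamic programming detects voiced segments better than thresholding, and that segmenting initials and finals on the basis of the voiced segments raises segmentation accuracy, reduces the influence of noise and of sound changes in Chinese, and leaves segmentation performance only weakly dependent on the manner of articulation of the initial.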

16.

Sinusoidal modeling for nonstationary voiced speech based on a local vector transform

Ito M Yano M 《The Journal of the Acoustical Society of America》2007,121(3):1717-1727

A voiced speech signal can be expressed as a sum of sinusoidal components whose instantaneous frequency and amplitude vary continuously with time. When these parameters are determined from the input, the time-varying characteristics are crucial error sources for algorithms that assume stationarity within a local analysis segment. To overcome this problem, a new method is proposed, local vector transform (LVT), which can determine instantaneous frequency and amplitude for nonstationary sinusoids. The method does not assume the local stationarity. The effectiveness of LVT was examined in parameter determination for synthesized and naturally uttered speech signals. The instantaneous frequency for the first harmonic component was determined with an accuracy almost equal to that of the time-corrected instantaneous frequency method and higher accuracy than that of spectral peak-picking, autocorrelation, and cepstrum. The instantaneous amplitude was also determined accurately by LVT while considerable errors were left in the other algorithms. The signal reconstructed from the determined parameters by LVT agreed well with the corresponding component of voiced speech. These results suggest that the method is effective for analyzing time-varying voiced speech signals.

17.

Pitch-based monaural segregation of reverberant speech

Roman N Wang D 《The Journal of the Acoustical Society of America》2006,120(1):458-469

In everyday listening, both background noise and reverberation degrade the speech signal. Psychoacoustic evidence suggests that human speech perception under reverberant conditions relies mostly on monaural processing. While speech segregation based on periodicity has achieved considerable progress in handling additive noise, little research in monaural segregation has been devoted to reverberant scenarios. Reverberation smears the harmonic structure of speech signals, and our evaluations using a pitch-based segregation algorithm show that an increase in the room reverberation time causes degraded performance due to weakened periodicity in the target signal. We propose a two-stage monaural separation system that combines the inverse filtering of the room impulse response corresponding to target location and a pitch-based speech segregation method. As a result of the first stage, the harmonicity of a signal arriving from target direction is partially restored while signals arriving from other directions are further smeared, and this leads to improved segregation. A systematic evaluation of the system shows that the proposed system results in considerable signal-to-noise ratio gains across different conditions. Potential applications of this system include robust automatic speech recognition and hearing aid design.

18.

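A single-channel voiced speech separation system with improved harmonic grouping rules (cited by: 1; self-citations: 0; citations by others: 1)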

张学良刘文举李鹏徐波《声学学报》2011,36(1)

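To address shortcomings of earlier algorithms for separating voiced speech from noise in a single channel, the harmonic grouping algorithm is improved. The algorithm uses the carrier-to-envelope energy ratio to classify time-frequency units as determinate or indeterminate, and extracts the fundamental frequency as the grouping cue. In the grouping stage, determinate time-frequency units are grouped using the harmonic principle and the minimum-amplitude principle, while indeterminate units are grouped using an improved envelope autocorrelation function that measures the amplitude modulation rate. Compared with the results of earlier algorithms, the improved algorithm raises the average signal-to-noise ratio (SNR) by 0.96 dB. Improving the harmonic grouping rules thus improves separation performance.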

19.

Neural control of vocalization: Respiratory and emotional influences

Pamela J. Davis Shi Ping Zhang Alison Winkworth Richard Bandler 《Journal of voice》1996,10(1):23-38

Previous research has shown that a region of the midbrain, the periaqueductal gray matter (PAG), is critical for vocalization. In this review, we describe the results of previous investigations in which we sought to find out how PAG neurons integrate the activity and precise timing of respiratory, laryngeal, and oral muscle activity for natural-sounding vocalization using the technique of excitatory amino acid microinjections in cats. In these studies, all surgical procedures were carried out under deep anaesthesia. In the precollicular decerebrate cat two general types of vocalization, classified as voiced and unvoiced, could be evoked by exciting neurons in the lateral part of the intermediate part of the PAG. The patterns of evoked electromyographic activity were strikingly similar to previously reported patterns of human muscle activity. Coordinated patterns of activity were evoked with just-threshold excitation, leading to the conclusion that patterned muscle activity corresponding to the major categories of voiced and voiceless sound production is represented in the PAG. In a parallel series of human and animal experiments, we also determined that the speech and vocalization respiratory patterns are integrated and coordinated with afferent signals related to lung volume. These data have led to the proposal of a new hypothesis for the neural control of vocalization: that the PAG is a crucial brain site for mammalian voice production, not only in the production of emotional or involuntary sounds, but also as a generator of specific respiratory and laryngeal motor patterns essential for human speech and song.

20.

Binaural segregation in multisource reverberant environments

Roman N Srinivasan S Wang D 《The Journal of the Acoustical Society of America》2006,120(6):4040-4051

In a natural environment, speech signals are degraded by both reverberation and concurrent noise sources. While human listening is robust under these conditions using only two ears, current two-microphone algorithms perform poorly. The psychological process of figure-ground segregation suggests that the target signal is perceived as a foreground while the remaining stimuli are perceived as a background. Accordingly, the goal is to estimate an ideal time-frequency (T-F) binary mask, which selects the target if it is stronger than the interference in a local T-F unit. In this paper, a binaural segregation system that extracts the reverberant target signal from multisource reverberant mixtures by utilizing only the location information of target source is proposed. The proposed system combines target cancellation through adaptive filtering and a binary decision rule to estimate the ideal T-F binary mask. The main observation in this work is that the target attenuation in a T-F unit resulting from adaptive filtering is correlated with the relative strength of target to mixture. A comprehensive evaluation shows that the proposed system results in large SNR gains. In addition, comparisons using SNR as well as automatic speech recognition measures show that this system outperforms standard two-microphone beamforming approaches and a recent binaural processor.
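The ideal T-F binary mask defined in this abstract can be written down directly when the target and interference energies are known separately, which is only the case in simulation. The sketch below shows that definition, not the paper's estimator based on adaptive filtering; the function names and the nested-list layout (rows as time frames, columns as frequency channels) are assumptions for illustration.

```python
def ideal_binary_mask(target_energy, interference_energy):
    """Mask is 1 in a T-F unit where the target is strictly stronger
    than the interference, and 0 elsewhere."""
    return [[1 if t > i else 0 for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(target_energy, interference_energy)]


def apply_mask(mixture, mask):
    """Keep the mixture in units selected by the mask, zero the rest."""
    return [[m * x for m, x in zip(m_row, x_row)]
            for m_row, x_row in zip(mask, mixture)]


# Hypothetical 2-frame, 2-channel example with made-up energies.
target = [[4.0, 1.0], [0.5, 9.0]]
interference = [[1.0, 2.0], [0.5, 3.0]]
mixture = [[5.0, 3.0], [1.0, 12.0]]

mask = ideal_binary_mask(target, interference)
print(mask)
print(apply_mask(mixture, mask))
```

A practical system has to estimate this mask from the mixture alone, which is where the correlation between target attenuation and relative target strength reported in the abstract comes in.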