期刊界 All Journals

1.

A text-to-speech system with high intelligibility and naturalness for Chinese (Total citations: 1, self-citations: 0, citations by others: 1)

CHU Min, LU Shinan 《声学学报：英文版》1996,(1)

I. Introduction. Researches on Chinese synthesis disclose that only when both the segmental and suprasegmental features of the synthetic speech are similar to those of the natural one, the synthetic speech will sound intelligible and natural [1]. Among existing synthetic techniques, the approach based on acoustic parameters can adjust both the segmental and suprasegmental features of synthetic units flexibly and can be considered as the most reasonable synthetic technique in theory. However, the parameter based synthesizer is over-dependent on the developments of paramet…

2.

Constrained tone transformation technique for separation and combination of Mandarin tone and intonation

Ni J, Kawai H, Hirose K 《The Journal of the Acoustical Society of America》2006,119(3):1764-1782

This paper addresses a classical but important problem: the coupling of lexical tones and sentence intonation in tonal languages, such as Chinese, focusing particularly on voice fundamental frequency (F0) contours of speech. It is important because it forms the basis of speech synthesis technology and prosody analysis. We provide a solution to the problem with a constrained tone transformation technique based on structural modeling of the F0 contours. This consists of transforming target values in pairs from norms to variants. These targets are intended to sparsely specify the prosodic contributions to the F0 contours, while the alignment of target pairs between norms and variants is based on underlying lexical tone structures. When the norms take the citation forms of lexical tones, the technique makes it possible to separate sentence intonation from observed F0 contours. When the norms take normative F0 contours, it is possible to measure intonation variations from the norms to the variants, both having identical lexical tone structures. This paper explains the underlying scientific and linguistic principles and presents an algorithm that was implemented on computers. The method's capability of separating and combining tone and intonation is evaluated through analysis and re-synthesis of several hundred observed F0 contours.

3.

Contribution of low-frequency acoustic information to Chinese speech recognition in cochlear implant simulations

Luo X, Fu QJ 《The Journal of the Acoustical Society of America》2006,120(4):2260-2266

Chinese sentence recognition strongly relates to the reception of tonal information. For cochlear implant (CI) users with residual acoustic hearing, tonal information may be enhanced by restoring low-frequency acoustic cues in the nonimplanted ear. The present study investigated the contribution of low-frequency acoustic information to Chinese speech recognition in Mandarin-speaking normal-hearing subjects listening to acoustic simulations of bilaterally combined electric and acoustic hearing. Subjects listened to a 6-channel CI simulation in one ear and low-pass filtered speech in the other ear. Chinese tone, phoneme, and sentence recognition were measured in steady-state, speech-shaped noise, as a function of the cutoff frequency for low-pass filtered speech. Results showed that low-frequency acoustic information below 500 Hz contributed most strongly to tone recognition, while low-frequency acoustic information above 500 Hz contributed most strongly to phoneme recognition. For Chinese sentences, speech reception thresholds (SRTs) improved with increasing amounts of low-frequency acoustic information, and significantly improved when low-frequency acoustic information above 500 Hz was preserved. SRTs were not significantly affected by the degree of spectral overlap between the CI simulation and low-pass filtered speech. These results suggest that, for CI patients with residual acoustic hearing, preserving low-frequency acoustic information can improve Chinese speech recognition in noise.

4.

An end-to-end method for Tibetan speech synthesis

拉巴顿珠珠杰欧珠尼玛《应用声学》2023,42(2):324-332

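In recent years, thanks to growing computing power and the steady accumulation of speech data, many new machine-learning-based speech processing techniques have emerged. Among them, the end-to-end Tacotron2 speech synthesis framework, built on deep neural networks, has been widely adopted in the field. It is open source, simple to use, and has been applied successfully to speech synthesis in many languages and voices. This paper studies the application of Tacotron2 to Tibetan and reports good experimental results. First, a medium-sized (5,500 sentences) speech corpus of the Ü-Tsang dialect of Tibetan was built through natural speech recording, automatic annotation, and acoustic analysis; it includes Tibetan phoneme transcriptions, special-symbol handling, and Mel spectrograms. Second, Tibetan speech synthesis experiments were carried out with the open-source Tacotron2 program and this corpus. Finally, an analysis of the deviation between synthesized and natural speech, together with a subjective evaluation of the naturalness of the synthesized speech, shows that the end-to-end approach effectively reduces spectral degradation in the synthesized speech and improves its naturalness. The end-to-end Tacotron2 framework therefore has significant practical value for Tibetan speech synthesis and merits further study and wider use.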

5.

Enhancing Chinese tone recognition by manipulating amplitude envelope: implications for cochlear implants (Total citations: 1, self-citations: 0, citations by others: 1)

Luo X, Fu QJ 《The Journal of the Acoustical Society of America》2004,116(6):3659-3667

Tone recognition is important for speech understanding in tonal languages such as Mandarin Chinese. Cochlear implant patients are able to perceive some tonal information by using temporal cues such as periodicity-related amplitude fluctuations and similarities between the fundamental frequency (F0) contour and the amplitude envelope. The present study investigates whether modifying the amplitude envelope to better resemble the F0 contour can further improve tone recognition in multichannel cochlear implants. Chinese tone and vowel recognition were measured for six native Chinese normal-hearing subjects listening to a simulation of a four-channel cochlear implant speech processor with and without amplitude envelope enhancement. Two algorithms were proposed to modify the amplitude envelope to more closely resemble the F0 contour. In the first algorithm, the amplitude envelope as well as the modulation depth of periodicity fluctuations was adjusted for each spectral channel. In the second algorithm, the overall amplitude envelope was adjusted before multichannel speech processing, thus reducing any local distortions to the speech spectral envelope. The results showed that both algorithms significantly improved Chinese tone recognition. By adjusting the overall amplitude envelope to match the F0 contour before multichannel processing, vowel recognition was better preserved and less speech-processing computation was required. The results suggest that modifying the amplitude envelope to more closely resemble the F0 contour may be a useful approach toward improving Chinese-speaking cochlear implant patients' tone recognition.

6.

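A study of isolated-word recognition for Chinese whispered speech (Total citations: 6, self-citations: 0, citations by others: 6)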

杨莉莉, 林玮, 徐柏龄 《应用声学》2006,25(3):187-192

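Whispered speech recognition has broad application prospects and is a new research topic. However, characteristics of whispered speech, such as its low sound level and lack of fundamental frequency, make recognition difficult. Based on a production model of the whispered speech signal and the acoustic properties of whispered speech, this paper builds an isolated-word recognition system for Chinese whispered speech. Because whispered speech has a low signal-to-noise ratio, speech enhancement must be applied; in addition, using tone information in the recognition system improves recognition performance. Experimental results show that MFCC combined with the amplitude envelope can serve as feature parameters for automatic recognition of Chinese whispered speech, and an HMM-based recognizer achieves a recognition rate of 90.4% on a small vocabulary.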

7.

Intrinsic fundamental frequency of vowels in sentence context (Total citations: 1, self-citations: 0, citations by others: 1)

C H Shadle 《The Journal of the Acoustical Society of America》1985,78(5):1562-1567

High vowels have a higher intrinsic fundamental frequency (F0) than low vowels. This phenomenon has been verified in several languages. However, most studies of intrinsic F0 of vowels have used words either in isolation or bearing the main phrasal stress in a carrier sentence. As a first step towards an understanding of how the intrinsic F0 of vowels interacts with intonation in running speech, this study examined F0 of the vowels [i,a,u] in four sentence positions. The four speakers used for this study showed a statistically significant main effect of intrinsic F0 (high vowels had higher F0). Three of the four speakers also showed an interaction between intrinsic F0 and sentence position such that no significant F0 difference was observed in the unaccented, sentence-final position. The interaction was shown not to be due to vowel neutralization or correlated with changes in the glottal waveform shape, as evidenced by measures of the first formant frequency and spectral slope. Comparison with studies of tone languages and speech of the deaf suggests that both the lack of accent and the lower F0 caused the reduction in the intrinsic F0 difference.

8.

Effects of language experience and stimulus complexity on the categorical perception of pitch direction

Xu Y, Gandour JT, Francis AL 《The Journal of the Acoustical Society of America》2006,120(2):1063-1074

Whether or not categorical perception results from the operation of a special, language-specific, speech mode remains controversial. In this cross-language (Mandarin Chinese, English) study of the categorical nature of tone perception, we compared native Mandarin and English speakers' perception of a physical continuum of fundamental frequency contours ranging from a level to rising tone in both Mandarin speech and a homologous (nonspeech) harmonic tone. This design permits us to evaluate the effect of language experience by comparing Chinese and English groups; to determine whether categorical perception is speech-specific or domain-general by comparing speech to nonspeech stimuli for both groups; and to examine whether categorical perception involves a separate categorical process, distinct from regions of sensory discontinuity, by comparing speech to nonspeech stimuli for English listeners. Results show evidence of strong categorical perception of speech stimuli for Chinese but not English listeners. Categorical perception of nonspeech stimuli was comparable to that for speech stimuli for Chinese but weaker for English listeners, and perception of nonspeech stimuli was more categorical for English listeners than was perception of speech stimuli. These findings lead us to adopt a memory-based, multistore model of perception in which categorization is domain-general but influenced by long-term categorical representations.

9.

Encoding voice pitch for profoundly hearing-impaired listeners

K W Grant 《The Journal of the Acoustical Society of America》1987,82(2):423-432

The ability of five profoundly hearing-impaired subjects to "track" connected speech and to make judgments about the intonation and stress in spoken sentences was evaluated under a variety of auditory-visual conditions. These included speechreading alone, speechreading plus speech (low-pass filtered at 4 kHz), and speechreading plus a tone whose frequency, intensity, and temporal characteristics were matched to the speaker's fundamental frequency (F0). In addition, several frequency transfer functions were applied to the normal F0 range resulting in new ranges that were both transposed and expanded with respect to the original F0 range. Three of the five subjects were able to use several of the tonal representations of F0 nearly as well as speech to improve their speechreading rates and to make appropriate judgments concerning sentence intonation and stress. The remaining two subjects greatly improved their identification performance for intonation and stress patterns when expanded F0 signals were presented alone (i.e., without speechreading), but had difficulty integrating visual and auditory information at the connected discourse level, despite intensive training in the connected discourse tracking procedure lasting from 27.8-33.8 h.

10.

A study of tone recognition in spontaneous Chinese speech (Total citations: 2, self-citations: 0, citations by others: 2)

刘赵杰, 邵健, 张鹏远, 赵庆卫, 颜永红, 冯稷 《物理学报》2007,56(12):7064-7069

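Chinese is a tonal language, and tone information is very important for Chinese speech recognition. Conventional tone recognition generally studies only the relatively standard tones of read speech and rarely gives special treatment to spontaneous speech, whose tone contours are more complex. To address the characteristics of spontaneous Chinese speech, a true-context model is proposed for the choice of tone modeling units. In addition, to model tone patterns in fine detail, a hierarchical clustering method is used to obtain more tone patterns. Experimental results confirm the effectiveness of the method. Keywords: tone recognition; spontaneous speech; true-context model; clustering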

11.

Autoregressive multi-speaker Chinese speech synthesis using a variational autoencoder

蒿晓阳, 张鹏远 《声学学报》2022,47(3):405-416

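Multi-speaker speech synthesis commonly uses one of two approaches: parameter adaptation or adding speaker labels. A model obtained by parameter adaptation can only synthesize speech for the speakers it was adapted to, so it is not robust. The conventional speaker-label approach obtains speaker information in a supervised way and does not learn speaker labels from the speech signal itself without supervision. To address these problems, an autoregressive multi-speaker speech synthesis method based on a variational autoencoder is proposed. The method first uses a variational autoencoder to learn speaker information without supervision and encode it implicitly as a speaker label, which is then fed, together with the linguistic features of the text, into an autoregressive acoustic parameter prediction network. In addition, to suppress the overfitting of fundamental frequency prediction caused by multi-speaker speech data, the acoustic parameter network uses multi-task learning for fundamental frequency. Preliminary experiments show that adding the autoregressive structure reduces spectral error by 1.018 dB, and that multi-task learning of fundamental frequency reduces the root-mean-square error of fundamental frequency by 6.861 Hz. In the subsequent multi-speaker comparison experiments, the proposed method achieves mean opinion scores (MOS) of 3.71, 3.55, and 3.15 in three multi-speaker experiments, with pinyin error rates of 6.71%, 7.54%, and 9.87% respectively, improving the quality of multi-speaker synthesized speech.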

12.

Automatic acoustic synthesis of human-like laughter

Sundaram S, Narayanan S 《The Journal of the Acoustical Society of America》2007,121(1):527-535

A technique to synthesize laughter based on time-domain behavior of real instances of human laughter is presented. In the speech synthesis community, interest in improving the expressive quality of synthetic speech has grown considerably. While the focus has been on the linguistic aspects, such as precise control of speech intonation to achieve desired expressiveness, inclusion of nonlinguistic cues could further enhance the expressive quality of synthetic speech. Laughter is one such cue used for communicating, say, a happy or amusing context. It can be generated in many varieties and qualities: from a short exhalation to a long full-blown episode. Laughter is modeled at two levels, the overall episode level and the local call level. At the episode level, a parametric model is presented that attempts to capture the overall temporal behavior, based on the equations that govern the simple harmonic motion of a mass-spring system. By changing a set of easily available parameters, the authors are able to synthesize a variety of laughter. At the call level, the authors relied on a standard linear prediction based analysis-synthesis model. Results of subjective tests to assess the acceptability and naturalness of the synthetic laughter relative to real human laughter samples are presented.

13.

Prominence analysis and synthesis of stress in Chinese

孟凡博, 吴志勇, 贾珈, 蔡莲红 《声学学报》2015,40(1):1-11

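Stress is an important intonational feature, and stress synthesis can improve the naturalness and expressiveness of speech. Targeting the local prominence of stress, this paper proposes a representation of acoustic feature prominence and analyzes the acoustic feature prominence of stressed syllables at different prosodic positions (initial, medial, and final in the prosodic word; initial, medial, and final in the prosodic phrase; and so on). It finds that stress at the end of a prosodic unit (the final syllable of a prosodic word and the final prosodic word of a prosodic phrase) has lower prominence of maximum fundamental frequency than stress elsewhere. A nonlinear algorithm for generating stress acoustic parameters based on acoustic feature prominence is proposed, which resolves the problem that conventional linear modification algorithms change the parameters either too little or too much. Using this algorithm, a hidden Markov model based speech synthesis system that supports stress synthesis was built. Experiments show that the system can effectively synthesize stressed speech and improves the naturalness and expressiveness of the synthesized speech.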

14.

Testing a model of intonation in a tone language

M Lindau 《The Journal of the Acoustical Society of America》1986,80(3):757-764

Schematic fundamental frequency curves of simple statements and questions are generated for Hausa, a two-tone language of Nigeria, using a modified version of an intonational model developed by Gårding and Bruce [Nordic Prosody II, edited by T. Fretheim (Tapir, Trondheim, 1981), pp. 33-39]. In this model, rules for intonation and tones are separated. Intonation is represented as sloping grids of (near) parallel lines, inside which tones are placed. The tones are associated with turning points of the fundamental frequency contour. Local rules may also modify the exact placement of a tone within the grid. The continuous fundamental frequency contour is modeled by concatenating the tonal points using polynomial equations. Thus the final pitch contour is modeled as an interaction between global and local factors. The slope of the intonational grid lines depends at least on sentence type (statement or question), sentence length, and tone pattern. The model is tested by reference to data from nine speakers of Kano Hausa.
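
The grid idea in this abstract can be illustrated with a minimal sketch. It is not the authors' model: it assumes just two parallel grid lines, places every high tone ("H") on the top line and every low tone ("L") on the bottom line, and joins the tonal points with a cubic polynomial that is flat at each point so that the points become turning points. All function names, parameters, and values are hypothetical.

```python
def grid_line(t, start_hz, slope_hz_per_s):
    """F0 (Hz) of one sloping grid line at time t (seconds)."""
    return start_hz + slope_hz_per_s * t


def tonal_points(tones, times, top_start, bottom_start, slope):
    """Place each tone on the top (H) or bottom (L) grid line.

    Assumption for this sketch: both lines share one slope (parallel grid).
    """
    points = []
    for tone, t in zip(tones, times):
        start = top_start if tone == "H" else bottom_start
        points.append((t, grid_line(t, start, slope)))
    return points


def contour(points, t):
    """Interpolate F0 at time t between tonal points.

    Uses the cubic 3u^2 - 2u^3, whose slope is zero at both ends,
    so each tonal point is a turning point of the contour.
    """
    if t <= points[0][0]:
        return points[0][1]
    for (t0, f0), (t1, f1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            u = (t - t0) / (t1 - t0)
            s = 3 * u**2 - 2 * u**3
            return f0 + (f1 - f0) * s
    return points[-1][1]


# Hypothetical statement: tones H L H at 0.0, 0.5, 1.0 s on a falling grid.
pts = tonal_points("HLH", [0.0, 0.5, 1.0], top_start=140.0,
                   bottom_start=100.0, slope=-20.0)
curve = [contour(pts, i / 20) for i in range(21)]
```

Changing `slope` alone changes the global trend while the tone sequence stays fixed, which mirrors the separation of intonation and tone rules described in the abstract.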

15.

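A pronunciation quality evaluation method based on phoneme model perceptibility (Total citations: 1, self-citations: 1, citations by others: 0)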

张茹, 韩纪庆 《声学学报》2013,38(2):201-207

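To improve the accuracy of pronunciation quality assessment, a pronunciation quality evaluation method based on phoneme model perceptibility is proposed. It takes the difference between the expected log posterior probabilities of the acoustic features of samples from different speech sample sets as the phoneme model's perceptibility to variant pronunciations, and on this basis generates a candidate set of recognition models for each phoneme. Experiments show that the proposed method reduces the size of the candidate phoneme model set in the speech recognition network by about 95%. On a non-native speech database, the correlation between the method's scores and human expert scores is 0.828, the detection rate for initial and final errors is 70.8%, and the detection rate for tone errors is 42.5%, all better than other methods.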

16.

Features of stimulation affecting tonal-speech perception: implications for cochlear prostheses (Total citations: 5, self-citations: 0, citations by others: 5)

Xu L, Tsai Y, Pfingst BE 《The Journal of the Acoustical Society of America》2002,112(1):247-258

Tone languages differ from English in that the pitch pattern of a single-syllable word conveys lexical meaning. In the present study, dependence of tonal-speech perception on features of the stimulation was examined using an acoustic simulation of a CIS-type speech-processing strategy for cochlear prostheses. Contributions of spectral features of the speech signals were assessed by varying the number of filter bands, while contributions of temporal envelope features were assessed by varying the low-pass cutoff frequency used for extracting the amplitude envelopes. Ten normal-hearing native Mandarin Chinese speakers were tested. When the low-pass cutoff frequency was fixed at 512 Hz, consonant, vowel, and sentence recognition improved as a function of the number of channels and reached a plateau at 4 to 6 channels. Subjective judgments of sound quality continued to improve as the number of channels increased to 12, the highest number tested. Tone recognition, i.e., recognition of the four Mandarin tone patterns, depended on both the number of channels and the low-pass cutoff frequency. The trade-off between the temporal and spectral cues for tone recognition indicates that temporal cues can compensate for diminished spectral cues for tone recognition and vice versa. An additional tone recognition experiment using syllables of equal duration showed a marked decrease in performance, indicating that duration cues contribute to tone recognition. A third experiment showed that recognition of processed FM patterns that mimic Mandarin tone patterns was poor when temporal envelope and duration cues were removed.

17.

Emotional speech synthesis for Chinese-speaking children

胡航烨, 王蔚 《应用声学》2023,42(1):76-83

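Emotional speech synthesis is important for human-computer interaction. To address the scarcity of Chinese speech data needed for children's emotional speech synthesis and the long model training time, this paper proposes a transfer learning method for Chinese children's emotional speech synthesis. First, a deep learning model is trained on a Chinese speech database to obtain an end-to-end Chinese speech synthesis model; next, a high-quality, large Chinese emotional speech corpus is used to build an emotional speech synthesis model; finally, the model is adapted by transfer learning on a small, self-collected corpus of Chinese children's emotional speech to achieve low-resource speech synthesis. In the objective experiments the Mel-cepstral distortion is 4.91, and the two subjective listening test scores are 3.61 and 4.17. Comparative experiments show that the method performs well for emotional speech synthesis and outperforms existing state-of-the-art low-resource emotional speech synthesis methods.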
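
For readers unfamiliar with the Mel-cepstral distortion (梅尔倒谱失真) figure quoted in this abstract, one commonly used per-frame definition is given below. The abstract does not say which variant, cepstral order, or frame alignment the authors used, so this is an assumed reference form: $c_d$ and $\hat{c}_d$ denote the $d$-th Mel-cepstral coefficients of the natural and synthesized frame, and $D$ is the number of coefficients compared (the energy term $c_0$ is usually excluded).

```latex
\mathrm{MCD} = \frac{10}{\ln 10}\,\sqrt{2 \sum_{d=1}^{D} \left(c_d - \hat{c}_d\right)^2}
```

Under this definition the value is expressed in dB and is averaged over all time-aligned frame pairs; lower values indicate a synthesized spectrum closer to the natural one.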

18.

Stimulus presentation order and the perception of lexical tones in Cantonese

Francis AL, Ciocca V 《The Journal of the Acoustical Society of America》2003,114(3):1611-1621

Listeners' auditory discrimination of vowel sounds depends in part on the order in which stimuli are presented. Such presentation order effects have been argued to be language independent, and to result from psychophysical (not speech- or language-specific) factors such as the decay of memory traces over time or increased weighting of later-occurring stimuli. In the present study, native Cantonese speakers' discrimination of a linguistic tone continuum is shown to exhibit order of presentation effects similar to those shown for vowels in previous studies. When presented with two successive syllables differing in fundamental frequency by approximately 4 Hz, listeners were significantly more sensitive to this difference when the first syllable was higher in frequency than the second. However, American English-speaking listeners with no experience listening to Cantonese showed no such contrast effect when tested in the same manner using the same stimuli. Neither English nor Cantonese listeners showed any order of presentation effects in the discrimination of a nonspeech continuum in which tokens had the same fundamental frequencies as the Cantonese speech tokens but had a qualitatively non-speech-like timbre. These results suggest that tone presentation order effects, unlike vowel effects, may be language specific, possibly resulting from the need to compensate for utterance-related pitch declination when evaluating fundamental frequency for tone identification.

19.

Speech processing studies using an acoustic model of a multiple-channel cochlear implant (Total citations: 1, self-citations: 0, citations by others: 1)

P J Blamey, R C Dowell, Y C Tong, A M Brown, S M Luscombe, G M Clark 《The Journal of the Acoustical Society of America》1984,76(1):104-110

The speech perception of two multiple-channel cochlear implant patients was compared with that of three normally hearing listeners using an acoustic model of the implant for 22 different speech tests. The tests used included a minimal auditory capabilities battery, both closed-set and open-set word and sentence tests, speech tracking and a 12-consonant confusion study using nonsense syllables. The acoustic model represented electrical current pulses by bursts of noise and the effects of different electrodes were represented by using bandpass filters with different center frequencies. All subjects used a speech processor that coded the fundamental voicing frequency of speech as a pulse rate and the second formant frequency of speech as the electrode position in the cochlea, or the center frequency of the bandpass filter. Very good agreement was found for the two groups of subjects, indicating that the acoustic model is a useful tool for the development and evaluation of alternative cochlear implant speech processing strategies.

20.

Measuring the rate of change of voice fundamental frequency in fluent speech during mental depression

A Nilsonne, J Sundberg, S Ternström, A Askenfelt 《The Journal of the Acoustical Society of America》1988,83(2):716-728

A method of measuring the rate of change of fundamental frequency has been developed in an effort to find acoustic voice parameters that could be useful in psychiatric research. A minicomputer program was used to extract seven parameters from the fundamental frequency contour of tape-recorded speech samples: (1) the average rate of change of the fundamental frequency and (2) its standard deviation, (3) the absolute rate of fundamental frequency change, (4) the total reading time, (5) the percent pause time of the total reading time, (6) the mean, and (7) the standard deviation of the fundamental frequency distribution. The method is demonstrated on (a) a material consisting of synthetic speech and (b) voice recordings of depressed patients who were examined during depression and after improvement.
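
The seven parameters listed in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' program: the contour is assumed to be sampled at a fixed frame interval, a value of 0 (or None) is assumed to mark an unvoiced or silent frame and is counted as pause time, the rate of change is taken between consecutive voiced frames, and "absolute rate" is read as the mean of the absolute rates. The function name and dictionary keys are hypothetical.

```python
from statistics import mean, pstdev


def f0_contour_stats(f0_hz, frame_s):
    """Summarise an F0 contour sampled every `frame_s` seconds.

    `f0_hz` holds one value per frame; 0 or None marks a frame without
    voicing. Needs at least one pair of adjacent voiced frames.
    """
    voiced = [f for f in f0_hz if f]
    # Rate of change (Hz/s) between consecutive frames that are both voiced.
    rates = [
        (b - a) / frame_s
        for a, b in zip(f0_hz, f0_hz[1:])
        if a and b
    ]
    pause_frames = len(f0_hz) - len(voiced)
    return {
        "mean_rate": mean(rates),                      # (1)
        "sd_rate": pstdev(rates),                      # (2)
        "mean_abs_rate": mean(abs(r) for r in rates),  # (3)
        "total_time_s": len(f0_hz) * frame_s,          # (4)
        "pause_percent": 100.0 * pause_frames / len(f0_hz),  # (5)
        "mean_f0": mean(voiced),                       # (6)
        "sd_f0": pstdev(voiced),                       # (7)
    }


# Hypothetical five-frame contour with one unvoiced frame, 10 ms per frame.
stats = f0_contour_stats([100, 110, 0, 120, 100], frame_s=0.01)
```

Skipping frame pairs that straddle an unvoiced frame keeps pauses from showing up as spurious jumps in the rate statistics; a real analysis would also need to handle octave errors in the pitch tracker.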