Similar literature
 20 similar articles found
1.
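Gu Dong, Jian Zhihua. Acta Acustica (《声学学报》), 2018, 43(5): 864-872
To address the case where the target speaker's corpus may be insufficient, this paper proposes a unified tensor dictionary voice conversion algorithm for limited corpora. N speakers are selected from the corpus as the base speakers of the speech tensor dictionary, and the parallel speech segments of these N speakers are aligned with a multi-sequence dynamic time warping algorithm, yielding a tensor dictionary composed of N two-dimensional basic dictionaries. In the conversion stage, the speech dictionaries of the source and target speakers are each constructed as a linear combination of the basic dictionaries in the tensor dictionary, which makes the conversion possible. Experimental results show that once the number of base speakers reaches 14, only a very small amount of target speaker data is needed to obtain conversion performance comparable to that of the conventional conversion algorithm based on non-negative matrix factorization (NMF), which greatly facilitates the application of voice conversion systems.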

2.
For the case where the target speaker's corpus is limited, this paper proposes a voice conversion algorithm that uses a unified tensor dictionary. First, parallel speech from N speakers is selected randomly from the speech corpus to form the base of the tensor dictionary. Then, after multi-series dynamic time warping is applied to the chosen speech, N two-dimensional basic dictionaries are generated, which together constitute the unified tensor dictionary. During the conversion stage, the dictionaries of the source and target speakers are each established as a linear combination of the N basic dictionaries, using the two speakers' speech. The experimental results show that when the number of basic speakers is 14, the algorithm achieves performance comparable to that of the traditional NMF-based method with very little target speaker data, which greatly facilitates the application of voice conversion systems.
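The linear-combination step described in entries 1 and 2 can be sketched as follows. This is a minimal illustration only: the abstract does not say how the combination weights are estimated, so the weights, the toy dictionaries, and the function name `combine_dictionaries` are all assumptions made for the example.

```python
def combine_dictionaries(basic_dicts, weights):
    """Return sum_n weights[n] * basic_dicts[n] for equally shaped 2-D lists.

    basic_dicts: list of N basic dictionaries (each a list of rows).
    weights:     list of N scalars (hypothetical; not taken from the paper).
    """
    if len(basic_dicts) != len(weights):
        raise ValueError("need exactly one weight per basic dictionary")
    rows, cols = len(basic_dicts[0]), len(basic_dicts[0][0])
    combined = [[0.0] * cols for _ in range(rows)]
    for dictionary, w in zip(basic_dicts, weights):
        for i in range(rows):
            for j in range(cols):
                combined[i][j] += w * dictionary[i][j]
    return combined


# Toy tensor dictionary: N = 2 basic dictionaries, each 2 features x 3 atoms.
basic = [
    [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0]],
    [[3.0, 2.0, 0.0], [2.0, 3.0, 0.0]],
]

# Source and target speakers share the same basic dictionaries and differ
# only in their (assumed) weight vectors.
source_dict = combine_dictionaries(basic, [0.75, 0.25])
target_dict = combine_dictionaries(basic, [0.25, 0.75])
```

Because both speaker dictionaries are built from the same aligned basic dictionaries, their atoms stay in correspondence, which is what would let an activation computed on the source dictionary be reused with the target dictionary.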

3.
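Yu Yibiao, Zeng Daojian, Jiang Ying. Acta Acustica (《声学学报》), 2012, 37(3): 346-352
A voice conversion method based on fully independent speaker models is proposed. Each speaker first trains a Structured Gaussian Mixture Model (SGMM) on his or her own corpus. The models of the source and target speakers are then matched, and their Gaussian components aligned, using the Acoustical Universal Structure (AUS), which yields the conversion function used for voice conversion. ABX and MOS experiments show that the conversion performance is close to that of the traditional method of joint training on parallel corpora, and that the converted speech is identified as the target speaker with 94.5% accuracy. The results indicate that the proposed method not only converts well but also requires less training and scales easily to new speakers.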

4.
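Hui Lin, Yu Yibiao. Acta Acustica (《声学学报》), 2017, 42(6): 762-768
A method for age-based voice conversion is proposed that combines a group of universal background models (UBMs) for the short-time spectrum with prosodic parameters. For spectral conversion, short-time spectral coefficients are extracted from each speaker in the same age group and used to build Gaussian mixture models; speakers are then clustered by the similarity of their speech features and one UBM is trained per cluster, yielding a UBM group and a set of short-time spectrum conversion functions. After spectral conversion, the formants are further fine-tuned. For prosodic conversion, a single-Gaussian model of fundamental frequency and an average duration ratio model of speaking rate are built to derive the conversion functions. Experimental results show that the proposed method clearly outperforms the traditional bilinear method on ABX and MOS measures, and improves the rate of change of log-likelihood by 4% relative to the single-UBM method. These results indicate that the proposed method gives the converted speech a clear tendency toward the target while maintaining good speech quality, a marked improvement over traditional methods.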

5.
Journal of Voice, 2020, 34(5): 806.e7-806.e18
There is a high prevalence of dysphonia among professional voice users and the impact of the disordered voice on the speaker is well documented. However, there is minimal research on the impact of the disordered voice on the listener. Considering that professional voice users include teachers and air-traffic controllers, among others, it is imperative to determine the impact of a disordered voice on the listener. To address this, the objectives of the current study were to: (1) determine whether there are differences in speech intelligibility between individuals with healthy voices and those with dysphonia; (2) understand whether cognitive-perceptual strategies increase speech intelligibility for dysphonic speakers; and (3) determine the relationship between subjective voice quality ratings and speech intelligibility. Sentence stimuli were recorded from 12 speakers with dysphonia and four age- and gender-matched typical, healthy speakers and presented to 129 healthy listeners divided into one of three strategy groups (i.e., control, acknowledgement, and listener strategies). Four expert raters also completed a perceptual voice assessment using the Consensus Auditory-Perceptual Evaluation of Voice for each speaker. Results indicated that dysphonic voices were significantly less intelligible than healthy voices (P < 0.001) and that the cognitive-perceptual strategies provided to the listener did not significantly improve speech intelligibility scores (P = 0.602). Using the subjective voice quality ratings, regression analysis found that breathiness predicted 41% of the variance associated with number of errors (P = 0.008). Overall, the results suggest that speakers with dysphonia demonstrate reduced speech intelligibility and that providing the listener with specific strategies may not result in improved intelligibility.

6.
The goal of cross-language voice conversion is to preserve the speech characteristics of one speaker when that speaker's speech is translated and used to synthesize speech in another language. In this paper, two preliminary studies are reported: a statistical analysis of spectrum differences between languages and a first attempt at cross-language voice conversion. Speech uttered by a bilingual speaker is analyzed to examine the spectrum difference between English and Japanese. The experimental results are as follows: (1) the codebook size for mixed speech from English and Japanese should be almost twice the codebook size of either English or Japanese; (2) although many code vectors occurred in both English and Japanese, some have a tendency to predominate in one language or the other; (3) code vectors that predominantly occurred in English are contained in the phonemes /r/, /ae/, /f/, /s/, and code vectors that predominantly occurred in Japanese are contained in /i/, /u/, /N/; and (4) judged from listening tests, listeners cannot reliably distinguish English speech decoded by a Japanese codebook from English speech decoded by an English codebook. A voice conversion algorithm based on codebook mapping was applied to cross-language voice conversion, and its performance was somewhat less effective than for voice conversion in the same language.

7.
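Most current deep learning methods for voice conversion rely on large amounts of training data to generate high-quality speech. This paper proposes a voice conversion framework based on an average model and an error reduction network that can work with a limited amount of training data. First, an average model based on the CBHG network is trained on multi-speaker speech data that excludes the source and target speakers. Next, the average model is adapted using a limited amount of target speech data. Finally, an error reduction network is proposed to further improve the quality of the converted speech. Experiments show that the proposed framework handles limited training data flexibly and outperforms conventional frameworks in both objective and subjective evaluations.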

8.
A new method of voice conversion in the cepstrum eigenspace, based on a structured Gaussian mixture model, is proposed for non-parallel corpora without joint training. For each speaker, the cepstrum features of speech are extracted and mapped to the eigenspace formed by the eigenvectors of their scatter matrix, and the Structured Gaussian Mixture Model in the EigenSpace (SGMM-ES) is then trained. The SGMM-ES models of the source and target speakers are matched according to the Acoustic Universal Structure (AUS) principle to obtain the spectrum transform function. Experimental results show that the speaker identification rate of the converted speech reaches 95.25% and the average cepstrum distortion is 1.25, improvements of 0.8% and 7.3%, respectively, over the SGMM method. ABX and MOS evaluations indicate that the conversion performance is quite close to that of the traditional method under the parallel corpora condition. The results show that the eigenspace-based structured Gaussian mixture model is effective for voice conversion with non-parallel corpora.

9.

Background  

The speech signal contains information both about phonological features such as place of articulation and about non-phonological features such as speaker identity. These are different aspects of the 'what'-processing stream (speaker vs. speech content), and here we show that they can be further segregated, as they may occur in parallel but within different neural substrates. Subjects listened to two different vowels, each spoken by two different speakers. During one block, they were asked to identify a given vowel irrespective of the speaker (phonological categorization), while during the other block the speaker had to be identified irrespective of the vowel (speaker categorization). Auditory evoked fields were recorded using 148-channel magnetoencephalography (MEG), and magnetic source imaging was obtained for 17 subjects.

10.
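For voice conversion with non-parallel corpora and without joint training, a method based on a structured Gaussian mixture model in the cepstrum eigenspace is proposed. After the cepstral features of a speaker's speech are extracted, eigenvectors are computed from their scatter matrix to construct the cepstrum eigenspace, in which a Structured Gaussian Mixture Model in Eigen Space (SGMM-ES) is trained. The independently trained SGMM-ES models of the source and target speakers are matched and aligned according to the Acoustical Universal Structure (AUS) principle, yielding a short-time spectrum conversion function based on the cepstrum eigenspace. Experimental results show that the converted speech reaches an average target-speaker identification rate of 95.25% and an average spectral distortion of 1.25, improvements of 0.8% and 7.3%, respectively, over the SGMM method based on the original cepstral feature space, while ABX and MOS evaluations show conversion performance very close to that of the traditional parallel-corpus method. These results show that the structured Gaussian mixture model in the cepstrum eigenspace is effective for voice conversion with non-parallel corpora.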

11.
How are listeners able to identify whether the pitch of a brief isolated sample of an unknown voice is high or low in the overall pitch range of that speaker? Does the speaker's voice quality convey crucial information about pitch level? Results and statistical models of two experiments that provide answers to these questions are presented. First, listeners rated the pitch levels of vowels taken over the full pitch ranges of male and female speakers. The absolute f0 of the samples was by far the most important determinant of listeners' ratings, but with some effect of the sex of the speaker. Acoustic measures of voice quality had only a very small effect on these ratings. This result suggests that listeners have expectations about f0s for average speakers of each sex, and judge voice samples against such expectations. Second, listeners judged speaker sex for the same speech samples. Again, absolute f0 was the most important determinant of listeners' judgments, but now voice quality measures also played a role. Thus it seems that pitch level judgments depend on voice quality mostly indirectly, through its information about sex. Absolute f0 is the most important information for deciding both pitch level and speaker sex.

12.
Spectral- and cepstral-based acoustic measures are preferable to time-based measures for accurately representing dysphonic voices during continuous speech. Although these measures show promising relationships to perceptual voice quality ratings, less is known regarding their ability to differentiate normal from dysphonic voice during continuous speech and the consistency of these measures across multiple utterances by the same speaker. The purpose of this study was to determine whether spectral moments of the long-term average spectrum (LTAS) (spectral mean, standard deviation, skewness, and kurtosis) and cepstral peak prominence measures were significantly different for speakers with and without voice disorders when assessed during continuous speech. The consistency of these measures within a speaker across utterances was also addressed. Continuous speech samples from 27 subjects without voice disorders and 27 subjects with mixed voice disorders were acoustically analyzed. In addition, voice samples were perceptually rated for overall severity. Acoustic analyses were performed on three continuous speech stimuli from a reading passage: two full sentences and one constituent phrase. Significant between-group differences were found for both cepstral measures and three LTAS measures (P < 0.001): spectral mean, skewness, and kurtosis. These five measures also showed moderate to strong correlations to overall voice severity. Furthermore, high degrees of within-speaker consistency (correlation coefficients ≥0.89) across utterances with varying length and phonemic content were evidenced for both subject groups.
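The four LTAS spectral moments named in entry 12 can be illustrated with one common definition, in which the normalized spectrum is treated as a probability distribution over frequency. The abstract does not state which formula, weighting (amplitude or power), or kurtosis convention the study used, so the choices below, including the function name `spectral_moments` and the toy spectrum, are assumptions.

```python
import math


def spectral_moments(freqs_hz, magnitudes):
    """Spectral mean, SD, skewness and (non-excess) kurtosis of a spectrum.

    The magnitudes are normalized to sum to 1 and used as weights over
    frequency. This is one common convention, not necessarily the one
    used in the study.
    """
    total = sum(magnitudes)
    if total <= 0:
        raise ValueError("spectrum must have positive total magnitude")
    weights = [m / total for m in magnitudes]

    mean = sum(w * f for w, f in zip(weights, freqs_hz))
    variance = sum(w * (f - mean) ** 2 for w, f in zip(weights, freqs_hz))
    sd = math.sqrt(variance)
    if sd == 0:
        raise ValueError("spectrum has zero spread; higher moments undefined")

    skewness = sum(w * (f - mean) ** 3 for w, f in zip(weights, freqs_hz)) / sd ** 3
    kurtosis = sum(w * (f - mean) ** 4 for w, f in zip(weights, freqs_hz)) / sd ** 4
    return {"mean": mean, "sd": sd, "skewness": skewness, "kurtosis": kurtosis}


# Toy symmetric three-bin spectrum (made-up values, not data from the study).
moments = spectral_moments([100.0, 200.0, 300.0], [1.0, 2.0, 1.0])
```

For a symmetric spectrum like this toy one the skewness is zero; energy concentrated toward low frequencies with a long high-frequency tail would give a positive skewness.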

13.
A novel panel-form loudspeaker is presented in which the panel is excited, for sound radiation, by the forces generated through the flat voice coil of a rectangular electro-magnetic type exciter. When properly designed, the exciter has the advantage of exerting appropriate loads on the panel so that the major sound pressure level (SPL) dips of the speaker can be suppressed or even eliminated. To design such a panel-form speaker, a method formulated on the basis of the classical plate theory (CPT), the Ritz method, and the first Rayleigh integral is proposed for predicting the SPL curve of the speaker. An experimental investigation was performed to verify the feasibility of the proposed method. The effects of some system parameters on the major SPL dips of the proposed panel-form speakers are investigated by means of several numerical examples. The optimal locations of flat voice coils for exciting several panel-form speakers are determined to illustrate the important role of excitation location in enhancing the sound quality of such speakers via the removal or suppression of the major SPL dips.

14.
15.
The contribution of the nasal murmur and vocalic formant transition to the perception of the [m]-[n] distinction by adult listeners was investigated for speakers of different ages in both consonant-vowel (CV) and vowel-consonant (VC) syllables. Three children in each of the speaker groups 3, 5, and 7 years old, and three adult females and three adult males produced CV and VC syllables consisting of either [m] or [n] and followed or preceded by [i ae u a], respectively. Two productions of each syllable were edited into seven murmur and transition segments. Across speaker groups, a segment including the last 25 ms of the murmur and the first 25 ms of the vowel yielded higher perceptual identification of place of articulation than any other segment edited from the CV syllable. In contrast, the corresponding vowel+murmur segment in the VC syllable position improved nasal identification relative to other segment types for only the adult talkers. Overall, the CV syllable was perceptually more distinctive than the VC syllable, but this distinctiveness interacted with speaker group and stimulus duration. As predicted by previous studies and the current results of perceptual testing, acoustic analyses of adult syllable productions showed systematic differences between labial and alveolar places of articulation, but these differences were only marginally observed in the youngest children's speech. Also as predicted by the current perceptual results, these acoustic properties differentiating place of articulation of nasal consonants were reliably different for CV syllables compared to VC syllables. A series of comparisons of perceptual data across speaker groups, segment types, and syllable shape provided strong support, in adult speakers, for the "discontinuity hypothesis" [K. N. Stevens, in Phonetic Linguistics: Essays in Honor of Peter Ladefoged, edited by V. A. Fromkin (Academic, London, 1985), pp. 243-255], according to which spectral discontinuities at acoustic boundaries provide critical cues to the perception of place of articulation. In child speakers, the perceptual support for the "discontinuity hypothesis" was weaker and the results were indicative of developmental changes in speech production.

16.
The objectives of this prospective and exploratory study are to determine: (1) naïve listener preference for gender in tracheoesophageal (TE) speech when speech severity is controlled; (2) the accuracy of identifying TE speaker gender; (3) the effects of gender identification on judgments of speech acceptability (ACC) and naturalness (NAT); and (4) the acoustic basis of ACC and NAT judgments. Six male and six female adult TE speakers were matched for speech severity. Twenty naïve listeners made auditory-perceptual judgments of speech samples in three listening sessions. First, listeners performed preference judgments using a paired comparison paradigm. Second, listeners made judgments of speaker gender, speech ACC, and NAT using rating scales. Last, listeners made ACC and NAT judgments when speaker gender was provided coincidentally. Duration, frequency, and spectral measures were performed. No significant differences were found for preference of male or female speakers. All male speakers were accurately identified, but only two of six female speakers were accurately identified. Significant interactions were found between gender and listening condition (gender known) for NAT and ACC judgments. Males were judged more natural when gender was known; female speakers were judged less natural and less acceptable when gender was known. Regression analyses revealed that judgments of female speakers were best predicted with duration measures when gender was unknown, but with spectral measures when gender was known; judgments of males were best predicted with spectral measures. Naïve listeners have difficulty identifying the gender of female TE speakers. Listeners show no preference for speaker gender, but when gender is known, female speakers are least acceptable and natural. The nature of the perceptual task may affect the acoustic basis of listener judgments.

17.
Three-dimensional vocal tract shapes and consequent area functions representing the vowels [i, ae, a, u] have been obtained from one male and one female speaker using magnetic resonance imaging (MRI). The two speakers were trained vocal performers and both were adept at manipulation of vocal tract shape to alter voice quality. Each vowel was performed three times, each with one of the three voice qualities: normal, yawny, and twangy. The purpose of the study was to determine some ways in which the vocal tract shape can be manipulated to alter voice quality while retaining a desired phonetic quality. To summarize any overall tract shaping tendencies, mean area functions were subsequently computed across the four vowels produced within each specific voice quality. Relative to normal speech, both the vowel area functions and mean area functions showed, in general, that the oral cavity is widened and tract length increased for the yawny productions. The twangy vowels were characterized by shortened tract length, widened lip opening, and a slightly constricted oral cavity. The resulting acoustic characteristics of these articulatory alterations consisted of the first two formants (F1 and F2) being close together for all yawny vowels and far apart for all the twangy vowels.

18.
In the experiments reported here, perceived speaker identity was controlled by manipulating the fundamental frequency (F0) range of carrier phrases in which speech tokens were embedded. In the first experiment, words from two "hood"-"hud" continua were synthesized with different F0. The words were then embedded in synthetic carrier phrases with intonation contours which reduced perceived speaker identity differences for test items with different F0. The results indicated that when perceived speaker identity differences were reduced, the effect of F0 on vowel identification was also reduced. Experiment 2 indicated that when items presented in carrier phrases are matched for speaker identity and F0 with items in isolation, there is no effect for presentation in a carrier phrase. Experiment 3 involved the presentation of vowels from the "hood"-"hud" continuum in two different intonational contexts which were judged to have been produced by different speakers, even though the F0 of the test word was identical in the two contexts. There was a shift in identification as a result of the intonational context which was interpreted as evidence for the role of perceived identity in vowel normalization. Overall, the experiments suggest that perceived speaker identity is a better predictor of vowel normalization effects than is intrinsic F0. This indicates that the role of F0 in vowel normalization is mediated through perceived speaker identity.

19.
The effects of variations in vocal effort corresponding to common conversation situations on spectral properties of vowels were investigated. A database was recorded in which three degrees of vocal effort were suggested to the speakers by varying the distance to their interlocutor in three steps (close, 0.4 m; normal, 1.5 m; far, 6 m). The speech materials consisted of isolated French vowels, uttered by ten naive speakers in a quiet furnished room. Manual measurements of fundamental frequency F0, of the frequencies and amplitudes of the first three formants (F1, F2, F3, A1, A2, and A3), and of total amplitude were carried out. The speech materials were perceptually validated in three respects: identity of the vowel, gender of the speaker, and vocal effort. Results indicated that the speech materials were appropriate for the study. Acoustic analysis showed that F0 and F1 were highly correlated with vocal effort and varied at rates close to 5 Hz/dB for F0 and 3.5 Hz/dB for F1. Statistically, F2 and F3 did not vary significantly with vocal effort. Formant amplitudes A1, A2, and A3 increased significantly; the amplitudes in the high-frequency range increased more than those in the lower part of the spectrum, revealing a change in spectral tilt. On average, when the overall amplitude is increased by 10 dB, A1, A2, and A3 are increased by 11, 12.4, and 13 dB, respectively. Using "auditory" dimensions, such as the F1-F0 difference, and a "spectral center of gravity" between adjacent formants for representing vowel features did not reveal a better constancy of these parameters with respect to the variations of vocal effort and speaker. Thus a global view is evoked, in which all of the aspects of the signal should be processed simultaneously.
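The average rates reported in entry 19 lend themselves to a short worked calculation. The sketch below simply scales those averages linearly with the change in overall level. That linearity, and the function name `predicted_shifts`, are assumptions for illustration: the abstract gives average trends, not a model valid for every speaker or vowel.

```python
# Average rates quoted in the abstract.
F0_RATE_HZ_PER_DB = 5.0
F1_RATE_HZ_PER_DB = 3.5
# Formant amplitude gains (dB) per 10 dB increase in overall amplitude.
FORMANT_GAIN_PER_10_DB = {"A1": 11.0, "A2": 12.4, "A3": 13.0}


def predicted_shifts(delta_db):
    """Linearly extrapolate the reported averages to a level change delta_db.

    Linear scaling is an assumption made for this example only.
    """
    shifts = {
        "F0_Hz": F0_RATE_HZ_PER_DB * delta_db,
        "F1_Hz": F1_RATE_HZ_PER_DB * delta_db,
    }
    scale = delta_db / 10.0
    for name, gain in FORMANT_GAIN_PER_10_DB.items():
        shifts[name + "_dB"] = gain * scale
    return shifts


# A 10 dB rise in overall amplitude: F0 rises by about 50 Hz, F1 by about
# 35 Hz, and A3 gains 2 dB more than A1, i.e. the spectral tilt flattens.
shifts = predicted_shifts(10.0)
tilt_change_db = shifts["A3_dB"] - shifts["A1_dB"]
```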

20.
This study investigated the perceptual and acoustical characteristics of vocal presentation in both the masculine and the feminine modes by the same group of male subjects. Listeners (N = 88) evaluated 22 voice samples by using 18 semantic differential scales and 57 adjectives. The 22 voice samples were provided by 11 biologically male speakers, who described themselves as heterosexual crossdressers. Each speaker read a standard passage under controlled conditions. In one reading, they demonstrated their typical masculine voice and in the other they spoke in their feminine voice. Acoustical analyses included mean fundamental frequency, frequency range, overall passage duration, and duration of a sample of stressed vowels. Results indicated that listeners heard significant differences between masculine and feminine presentations across the 11 speakers and the 18 semantic differential scales. Masculine-feminine and high-low pitch were the most salient scales in the perceptual judgments. Acoustical analyses indicated wide variation according to speaker and condition. Clinical applications are provided.
