Similar literature
20 similar documents found.
1.
The goal of cross-language voice conversion is to preserve the speech characteristics of one speaker when that speaker's speech is translated and used to synthesize speech in another language. In this paper, two preliminary studies are reported: a statistical analysis of spectrum differences between languages and a first attempt at cross-language voice conversion. Speech uttered by a bilingual speaker is analyzed to examine the spectrum difference between English and Japanese. The experimental results show that (1) the codebook size for mixed speech from English and Japanese should be almost twice the codebook size of either English or Japanese; (2) although many code vectors occurred in both English and Japanese, some have a tendency to predominate in one language or the other; (3) code vectors that predominantly occurred in English are contained in the phonemes /r/, /ae/, /f/, /s/, and code vectors that predominantly occurred in Japanese are contained in /i/, /u/, /N/; and (4) judged from listening tests, listeners cannot reliably distinguish English speech decoded by a Japanese codebook from English speech decoded by an English codebook. A voice conversion algorithm based on codebook mapping was applied to cross-language voice conversion, and its performance was somewhat less effective than for voice conversion in the same language.
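The codebook-mapping idea mentioned above can be illustrated with a minimal sketch. It assumes a simplified one-to-one pairing between source and target code vectors; the abstract does not describe how the mapping is actually built, and the names `nearest_index`, `convert`, and the toy codebooks are hypothetical.

```python
def nearest_index(vector, codebook):
    """Index of the code vector closest to `vector` (squared Euclidean distance)."""
    def dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def convert(frames, source_codebook, target_codebook):
    """Quantize each source frame, then emit the paired target code vector.

    Assumption: entry i of target_codebook is paired with entry i of
    source_codebook. This pairing is illustrative, not the paper's mapping.
    """
    return [target_codebook[nearest_index(f, source_codebook)] for f in frames]


# Toy 2-D vectors standing in for spectral parameters.
source_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
target_codebook = [[0.5, 0.2], [1.4, 1.1], [2.3, -0.1]]
frames = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]
converted = convert(frames, source_codebook, target_codebook)
```

Each input frame is replaced by the target speaker's code vector that corresponds to its nearest source code vector, which is the basic step any codebook-based conversion has to perform.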

2.
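Multichannel sound reproduction systems enhance a listener's sense of realism and space, but in hands-free communication they are inevitably affected by noise and echo, which severely degrade communication quality. To address this problem, this paper proposes a multichannel acoustic echo cancellation and noise suppression method based on a gated convolutional recurrent neural network. The method takes the compressed complex spectra of the microphone signal and of the reproduction channels as network inputs and the compressed complex spectrum of the near-end speech as the training target, recovering clean near-end speech directly from the microphone signal. It requires no decorrelation of the reproduction signals, which resolves the non-uniqueness problem of conventional adaptive filtering while preserving multichannel reproduction quality. Experiments in both simulated and real acoustic environments show that the proposed method markedly suppresses the noise and echo of multichannel sound reproduction systems and outperforms conventional algorithms in both speech quality and echo return loss enhancement.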
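A compressed complex spectrum can be sketched as follows. This assumes power-law magnitude compression with the phase left unchanged and a hypothetical exponent of 0.5; the abstract does not state the compression rule or its parameters.

```python
import cmath


def compress(z, c=0.5):
    """Compress the magnitude of a complex STFT bin, keeping its phase.

    Assumption: power-law compression |z|**c; the exponent c is hypothetical.
    """
    mag, phase = cmath.polar(z)
    return cmath.rect(mag ** c, phase)


def decompress(z, c=0.5):
    """Invert compress() so a waveform can be resynthesized from network output."""
    mag, phase = cmath.polar(z)
    return cmath.rect(mag ** (1.0 / c), phase)


# Three example bins: a strong one, a weak one, and an empty one.
spectrum = [3 + 4j, -0.01 + 0.02j, 0j]
compressed = [compress(z) for z in spectrum]
restored = [decompress(z) for z in compressed]
```

Compression of this kind narrows the dynamic range between strong and weak bins, which is a common reason for using it on network inputs and targets.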

3.
This study investigated the relationship among the magnitude of jaw opening, intrinsic fundamental frequency (F0), and glottal parameters in natural speech. Acoustic, jaw opening, and electroglottographic (EGG) signals were simultaneously recorded. The subjects were 10 healthy men with New Zealand English as their native language. Subjects were asked to repeat a standard nonemphasized sentence in which one of the target vowels (/a/, /e/, /i/, /o/, and /u/) was embedded in various contexts. The glottal parameters F0, open quotient (OQ), and speed quotient (SQ) were measured from the EGG signal. Results of a series of one-way repeated-measures analyses of variance (ANOVA) showed a significant vowel effect on the magnitude of jaw opening [F(4, 24) = 25.512, P < .001], F0 [F(4, 28) = 45.415, P < .001] and speed quotient [F(4, 28) = 5.233, P = .003], but not on the open quotient [F(4, 28) = 0.501, P = .735]. The magnitude of jaw opening was found to be inversely related to F0 (r = -0.624, n = 25, P = .0009). These findings showed that the magnitude of jaw opening was related to F0 and that jaw opening might be a control signal for simulation of long-term F0 variation to achieve a higher degree of naturalness in artificial voice.

4.
Discharge patterns of auditory-nerve fibers in anesthetized cats were obtained for two stimulus levels in response to synthetic stimuli with dynamic characteristics appropriate for selected consonants. A set of stimuli was constructed by preceding a signal that was identified as /da/ by another sound that was systematically manipulated so that the entire complex would sound like either /da/, /ada/, /na/, /sa/, /sa/, or others. Discharge rates of auditory-nerve fibers in response to the common /da/-like formant transitions depended on the preceding context. Average discharge rates during these transitions decreased most for fibers whose CFs were in frequency regions where the context had considerable energy. Some effect of the preceding context on fine time patterns of response to the transitions was also found, but the identity of the largest response components (which often corresponded to the formant frequencies) was in general unaffected. Thus the response patterns during the formant transitions contain cues about both the nature of the transitions and the preceding context. A second set of stimuli sounding like /s/ and /c/ was obtained by varying the duration of the rise in amplitude at the onset of a filtered noise burst. At both 45 and 60 dB SPL, there were fibers which showed a more prominent peak in discharge rate at stimulus onset for /c/ than for /s/, but the CF regions that reflected the clearest distinctions depended on stimulus level. The peaks in discharge rate that occur in response to rapid changes in amplitude or spectrum might be used by the central processor as pointers to portions of speech signals that are rich in phonetic information.

5.
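In recent years, many encoder-decoder architectures such as fully convolutional networks and U-Net have been applied to speech enhancement, offering low computational complexity and few model parameters. Compared with methods such as long short-term memory models, however, these encoder-decoder structures cannot fully exploit the correlations across time and between low and high frequencies, and they lose information on long input sequences in particular. To model time-frequency correlations more fully while maintaining computational efficiency, this paper proposes a bone-conducted speech enhancement method based on a U-Net with an attention mechanism (Att-U-Net). An attention mechanism introduced into the skip connections generates a weight matrix, and the global information in each encoder layer is fused into the corresponding decoder layer according to these weights, so that the network attends to the information in the input that is most relevant to the enhancement target while suppressing irrelevant information. Experiments on a bone-conducted speech dataset show that the attention-based U-Net effectively improves the enhancement of bone-conducted speech while keeping the model lightweight, and the enhanced speech outperforms the baseline model on all objective evaluation metrics. Visualization of the intermediate layers of the encoder-decoder network shows that, during decoding, the attention mechanism effectively preserves the information in voiced segments and filters out the mid-frequency resonance caused by the transmission characteristics of bone conduction, giving the enhanced bone-conducted speech good perceptual quality.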
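The attention-weighted skip connection described above can be sketched in a few lines. This is a toy element-wise gate under stated assumptions: the scalar parameters `w_enc`, `w_dec`, and `bias` are hypothetical stand-ins for the learned layers of a real network, and the abstract does not specify the form of attention that Att-U-Net uses.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_skip(enc, dec, w_enc=1.0, w_dec=1.0, bias=0.0):
    """Gate encoder features with weights computed from encoder and decoder features.

    enc and dec are time-frequency feature maps (lists of rows) of equal shape.
    Returns the gated encoder features and the weight matrix.
    """
    weights = [
        [sigmoid(w_enc * e + w_dec * d + bias) for e, d in zip(enc_row, dec_row)]
        for enc_row, dec_row in zip(enc, dec)
    ]
    gated = [
        [w * e for w, e in zip(w_row, enc_row)]
        for w_row, enc_row in zip(weights, enc)
    ]
    return gated, weights


enc = [[2.0, -1.0], [0.5, 0.0]]
dec = [[1.0, -3.0], [0.0, 0.0]]
gated, weights = attention_skip(enc, dec)
```

Every weight lies strictly between 0 and 1, so encoder features the gate scores as relevant pass through nearly unchanged while the others are attenuated before reaching the decoder.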

6.
HearFones (HF) have been designed to enhance auditory feedback during phonation. This study investigated the effects of HF (1) on sound perceivable by the subject, (2) on voice quality in reading and singing, and (3) on voice production in speech and singing at the same pitch and sound level.

Test 1: Text reading was recorded with two identical microphones in the ears of a subject. One ear was covered with HF, and the other was free. Four subjects took part in this test. Tests 2 and 3: A reading sample was recorded from 13 subjects and a song from 12 subjects without and with HF on. Test 4: Six females repeated [pa:p:a] in speaking and singing modes without and with HF on, at the same pitch and sound level.

Long-term average spectra were made (Tests 1–3), and formant frequencies, fundamental frequency, and sound level were measured (Tests 2 and 3). Subglottic pressure was estimated from oral pressure in [p], and simultaneously electroglottography (EGG) was registered during voicing on [a:] (Test 4). Voice quality in speech and singing was evaluated by three professional voice trainers (Tests 2–4).

HF seemed to enhance sound perceivable over the whole range studied (0–8 kHz), with the greatest enhancement (up to ca 25 dB) being at 1–3 kHz and at 4–7 kHz. The subjects tended to decrease loudness with HF (when sound level was not being monitored). In more than half of the cases, voice quality was evaluated as “less strained” and “better controlled” with HF. When pitch and loudness were constant, no clear differences were heard but closed quotient of the EGG signal was higher and the signal more skewed, suggesting a better glottal closure and/or diminished activity of the thyroarytenoid muscle.


7.
Effects of Family Therapy on Children's Voices   Cited by: 1 (self-citations: 0, citations by others: 1)
The families of nine children with deviant voice qualities were selected for family treatment according to the SYGESTI model. Recordings of the children's speech were made before and after therapy. Perceptual evaluation of their voice quality showed significant improvement in various perceptual parameters after the therapy. Acoustical analysis confirmed changes of voice quality and mean fundamental frequency in speech. The therapy was also found to improve relations between family members, conflict management and other aspects of communication. The results suggest that these children's deviant voices were related to family conditions.

8.
This study documents the vocal characteristics of an actor before and after a series of eight performances involving extended voice use. The hypothesis was that this type of extended voice use would result in symptoms of vocal abuse and that damage to the actor's voice would be evident in measures made after the performance series. Three pre-performance and three post-performance speech samples were gathered and analyzed using the CSL and Visipitch II. Measurements taken included maximum phonational range; maximum sustained phonation; fundamental frequency during reading; maximum intensity levels; sound pressure levels for soft, moderate, and loud productions of sustained /a/; and perturbation including jitter, shimmer, harmonics-to-noise ratio, and an s/z ratio. Pre- and post-performance samples of the “Rainbow passage” and sustained vowel phonation were rated by a group of blinded listeners that included professional voice trainers and speech pathologists. In addition, sample lines from the performance were played for the listeners to judge whether this technique would result in symptoms of vocal abuse. Eleven out of 12 professional voice trainers judged that this technique would result in symptoms of vocal abuse. The data revealed post-performance improvement in phonational range, maximum intensity levels, perturbation measures, and s/z ratio. Measures of maximum sustained phonation, fundamental frequency, and sound pressure levels remained stable. Videoendoscopy revealed normal function of the larynx and vocal folds.

9.
The aim of the study was to identify the acoustic correlates of female teachers' subjective voice complaints by recording their voices in their working environment. The subjects made recordings during lessons (N = 10) and breaks (N = 11). The subjects were divided into 2 groups: those with few voice complaints (FC group) and those with many voice complaints (MC group). The speech sample made in the breaks was maximally sustained /a/, from which fundamental frequency (F0), jitter, and shimmer were analyzed. The classroom samples were analyzed for F0, sound pressure level (SPL), and F0 time (the active vibration time of the vocal folds). Additionally, an index for assessing voice loading is presented. The results revealed a tendency of the MC group to have higher F0 and lower SPL and perturbation values than the FC group. The index values correlated moderately with the subjective vocal complaints.

10.
The effects of six-channel compression and expansion amplification on the intelligibility of nonsense syllables embedded in speech spectrum noise were examined for four hearing-impaired subjects. For one condition (linear) the stimulus was given six-channel amplification with frequency shaping to suit the subject's hearing loss. The other condition (nonlinear) was the same except that low level inputs, to any given channel, received expansion amplification and high level inputs received compression. For each condition, each subject received the nonsense syllables at three different input levels, representing low, average, and high intensity speech. The results of this study, like those of most other studies of multichannel compression, are mainly negative. Nonlinear processing (mainly expansion) of low intensity speech resulted in a significant degradation of speech intelligibility for two subjects and in no improvement for the others. One subject showed a significant improvement in intelligibility for the nonlinearly processed average intensity speech and another subject showed significant improvement for the high intensity input (mainly compression). Clearly, nonlinear processing is beneficial for some subjects, under some listening conditions, but further research is needed to identify the relevant characteristics of such subjects. An acoustic analysis of selected items revealed that the failure of expansion to improve intelligibility was primarily due to the very low intensity consonants /e/ and /k/, in final position, being presented at an even lower intensity in the expansion condition than in the linear condition. Expansion may be worth further investigation with different parameters. Several other problems caused by the multichannel processing were also revealed. These included alteration of spectral shapes and band interaction effects. Ways of overcoming these problems, and of capitalizing on the likely advantages of multichannel amplification, are currently being investigated.

11.
王泽林  陈锴  卢晶 《声学学报》2020,45(5):696-706
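For in-vehicle distributed microphone arrays, the blind source separation algorithm TRINICON (Triple-N ICA for convolutive mixtures) is combined with multi-talker state detection to extract the desired speech. Based on the relative positions of the distributed microphone array and the sound sources, specific initialization conditions for the blind source separation are designed to guarantee the mapping between output channels and sources. Based on the frequency response characteristics of the distributed array, a feature vector is designed for multi-talker detection, and the detection results are fed into the parameter iteration of the TRINICON algorithm. In simulated evaluations using real in-vehicle recordings, the proposed method is highly robust across signal-to-noise ratios, effectively improves the convergence speed of TRINICON and the signal-to-interference ratio of the speech signal, and ensures accurate channel mapping. The evaluation results show that the method effectively extracts the desired speech in in-vehicle scenarios, providing a reliable and fast-converging solution for acoustic information extraction in complex in-vehicle environments.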

12.
Ten male-to-female transsexuals participated in five sessions of oral resonance voice therapy targeting lip spreading and forward tongue carriage. Acoustic analysis of recordings made pre- and posttherapy found that participant formant frequency values (F1, F2, and F3, from the vowels /a/, /i/, and /mho/), as well as fundamental frequency (F0), underwent a general increase posttherapy. F3 values, in particular, increased significantly posttreatment. Trends in listener ratings of these recordings showed that the majority of participants were perceived to sound more feminine following treatment. Participants' self-ratings of their voices pre- and posttreatment also indicated that participants perceived their voices as sounding more feminine and that they were more satisfied with their voices following treatment. The present study supports the findings of previous studies that have demonstrated that resonance characteristics in male-to-female transsexuals can be changed to more closely approximate those of females through oral resonance therapy. This intervention study also demonstrates that a spontaneous increase in F0 is achieved during the course of therapy. Further, this study provides preliminary evidence to suggest that oral resonance therapy may be effective in increasing femininity of voice in male-to-female transsexual clients.

13.
Several dereverberation algorithms have been studied. The sampling frequencies used in conventional studies are typically 8–16 kHz because their main purpose is preprocessing for improving the intelligibility of speech communication and articulation for automatic speech recognition. However, in next-generation communication systems, techniques to analyze and reproduce not only semantic information of sound but also more high-definition components such as spatial information and directivity will be increasingly necessary. To decompose these sound field characteristics with high definition, a dereverberation algorithm that is useful at high sampling frequencies is an important technique to process sound that includes high-frequency spectra such as musical sounds. The LInear-predictive Multichannel Equalization (LIME) algorithm is a promising dereverberation method. Using the LIME algorithm, however, a dereverberated signal cannot be obtained at high sampling frequencies when the source signal is colored, as is the case for speech and musical sounds. Because the rank of the correlation matrix calculated from such a colored signal is not full, the characteristic polynomial cannot be calculated precisely. To alleviate this problem, we propose preprocessing of all input signals with filters to whiten their spectra so that this algorithm can function for colored signals at high sampling frequencies.
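Spectral whitening of a colored signal can be sketched with a linear-prediction error filter. This is an assumption for illustration only: the abstract does not say how the whitening filters are designed, and `autocorr`, `levinson`, and `whiten` are hypothetical helper names. The test signal is a synthetic first-order autoregressive process, not speech or music.

```python
import random


def autocorr(x, max_lag):
    """Autocorrelation sums r[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson(r, order):
    """Levinson-Durbin recursion; returns error-filter taps [1, a1, ..., ap]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        k_m = -acc / err  # reflection coefficient for stage m
        new_a = a[:]
        for k in range(1, m):
            new_a[k] = a[k] + k_m * a[m - k]
        new_a[m] = k_m
        a = new_a
        err *= 1.0 - k_m * k_m
    return a


def whiten(x, order):
    """Filter x with its own prediction-error filter to flatten its spectrum."""
    a = levinson(autocorr(x, order), order)
    return [
        sum(a[k] * x[n - k] for k in range(order + 1) if n - k >= 0)
        for n in range(len(x))
    ]


def lag1_corr(s):
    """Normalized lag-1 autocorrelation; near 0 for a white sequence."""
    r = autocorr(s, 1)
    return r[1] / r[0]


# Colored test signal: x[n] = 0.9 * x[n-1] + white Gaussian noise.
random.seed(0)
x = [0.0]
for _ in range(1999):
    x.append(0.9 * x[-1] + random.gauss(0.0, 1.0))
white = whiten(x, order=2)
```

Comparing `lag1_corr(x)` with `lag1_corr(white)` shows the effect: the colored input is strongly correlated from sample to sample, while the filtered output is close to uncorrelated.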

14.
To quantify several acoustic features of the voice in patients with essential tremor (ET), 28 patients and 28 age- and sex-matched controls were studied. ET severity was assessed with the rating scale for tremor of Fahn, Tolosa, and Marín. The Computerized Speech Lab 4300 program (Kay Elemetrics) was used. Two-second samples of a sustained /a/ and a sentence were captured with a microphone and laryngograph equipment. Measures included fundamental frequency (F0), frequency perturbation (jitter, Koike algorithm), intensity perturbation (shimmer, Horii algorithm), and harmonic-to-noise ratio (H/N, Yumoto algorithm) of the vowel /a/, and the frequency and intensity variability of the sentence, phonational range, and dynamic range at the natural frequency, maximum phonational time, and s/z ratio. All subjects underwent indirect laryngoscopy and/or laryngeal fibroscopy. When compared with controls, ET patients showed higher jitter and lower H/N ratio (the latter only with the laryngographic signal) of the vowel /a/, lower frequency variability in the microphone signal, lower intensity variability in the laryngographic signal of the sentence, and significantly lower dynamic range at natural frequency of phonation. ET patients more frequently reported high voice intensity, tremor, and struggle. Several acoustic parameters were influenced by the severity of the disease, including shimmer, jitter, H/N ratio, frequency variability of the sentence, and s/z ratio, although neither the acoustic analysis values nor the phonetometric measurements were affected by the presence of voice tremor or by a successful pharmacological treatment of ET.

15.
A non-audible murmur (NAM), a very weak speech sound produced without vocal cord vibration, can be detected by a special NAM microphone attached to the neck, thereby providing a new speech communication tool for functional speech disorders as well as human-to-machine and human-to-human interfaces with inaudible voice input for use by people without speech impairments. The NAM microphone is a condenser microphone covered with soft-silicone impression material that provides good impedance matching with the soft tissues of the neck. Because higher-frequency components are suppressed severely, however, the NAM detected with this device can be insufficiently clear. To improve NAM clarity, the mechanism of NAM production as well as the transfer characteristics of the NAM in soft neck tissues must be clarified. We have investigated sound propagation from the vocal tract to the neck surface, using a finite difference time domain method and a head model based on magnetic resonance imaging scans. Numerical results show that, compared to air-conducted sound detected in front of a mouth, soft-tissue-conducted sound attenuates 50 dB at 1 kHz, which consists of 30 dB full-range attenuation due to air-to-soft-tissues transmission loss and −10 dB/octave spectral decay due to a propagation loss in soft tissues. The decay agrees well with the spectral characteristics of the measured NAM.

16.
Five commonly used methods for determining the onset of voicing of syllable-initial stop consonants were compared. The speech and glottal activity of 16 native speakers of Cantonese with normal voice quality were investigated during the production of consonant vowel (CV) syllables in Cantonese. Syllables consisted of the initial consonants /ph/, /th/, /kh/, /p/, /t/, and /k/ followed by the vowel /a/. All syllables had a high level tone, and were all real words in Cantonese. Measurements of voicing onset were made based on the onset of periodicity in the acoustic waveform, and on spectrographic measures of the onset of a voicing bar (f0), the onset of the first formant (F1), second formant (F2), and third formant (F3). These measurements were then compared against the onset of glottal opening as determined by electroglottography. Both accuracy and variability of each measure were calculated. Results suggest that the presence of aspiration in a syllable decreased the accuracy and increased the variability of spectrogram-based measurements, but did not strongly affect measurements made from the acoustic waveform. Overall, the acoustic waveform provided the most accurate estimate of voicing onset; measurements made from the amplitude waveform were also the least variable of the five measures. These results can be explained as a consequence of differences in spectral tilt of the voicing source in breathy versus modal phonation.

17.
陈学煌 《应用声学》2007,26(6):341-346
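Detecting the fundamental frequency of periodic signals is of considerable importance. The usual hardware approach shapes the input signal with a zero-crossing comparator or a Schmitt trigger to obtain fundamental-frequency pulses, but this detection becomes inaccurate when the signal waveform is complex. This paper proposes a new hardware detection method, the effective-transition-point real-time fundamental frequency detection method, which slices the measured signal at two suitable levels to obtain the effective zero points, overcoming the increase in zero crossings that occurs with complex signals. An application example is also given in which the fundamental frequency of the human voice is detected and the voice is converted into an instrument sound.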

18.
Time-reversed speech is known to mask information effectively in speech privacy applications. However, the annoyance and distraction caused by a time-reversed, speech-like masking sound are higher than those caused by other masking sounds. This study investigates the effects of adding artificial reverberation to the time-reversed speech. Subjective listening tests have been conducted to measure the intelligibility of target speech and the annoyance and distraction caused by the masking sound. The experimental results suggest that adding artificial reverberation to a speech-like masking sound significantly reduces the annoyance level while maintaining the masking effectiveness of the original masking sound. A trend was also observed that the addition of artificial reverberation could reduce the level of distraction caused by the masking sound.

19.
20.
薛帅强  陈波  陈菲 《应用声学》2016,24(4):253-256
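Based on the division of speech signals into silent, unvoiced, and voiced segments, and to address the random distribution of segments with pronounced periodicity, an improved variable-length average magnitude difference function (LVAMDF) and a pitch detection algorithm that combines multiple factors are proposed. The algorithm clusters the speech signal into segments with pronounced periodicity and segments without it, and at the same time obtains the pitch period of the segments with pronounced periodicity. For the few pitch periods assigned a doubled or halved frequency, methods of identification and correction are proposed, and their identification and correction rate is reported to be very high. In processing a large amount of real speech, the algorithm accurately detects the pitch-period endpoints of the segments with pronounced periodicity, with essentially no frequency-doubling or frequency-halving errors, and it is compared with the AMDF and ACF algorithms.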
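For orientation, the baseline AMDF that the abstract compares against can be sketched as below. This is the plain average magnitude difference function, not the proposed LVAMDF, whose details the abstract does not give; the function names, the 8 kHz sampling rate, and the lag search range are assumptions for illustration.

```python
import math


def amdf(frame, max_lag):
    """Average magnitude difference for lags 1..max_lag.

    values[k] corresponds to lag k + 1. Each lag is averaged over the
    samples that overlap at that lag.
    """
    n = len(frame)
    values = []
    for lag in range(1, max_lag + 1):
        diffs = [abs(frame[i] - frame[i + lag]) for i in range(n - lag)]
        values.append(sum(diffs) / len(diffs))
    return values


def estimate_pitch_period(frame, min_lag, max_lag):
    """Pitch period in samples: the lag with the deepest AMDF valley."""
    d = amdf(frame, max_lag)
    return min(range(min_lag, max_lag + 1), key=lambda lag: d[lag - 1])


# Synthetic voiced frame: a 200 Hz sine sampled at 8 kHz (period = 40 samples).
fs = 8000
f0 = 200.0
frame = [math.sin(2 * math.pi * f0 * i / fs) for i in range(400)]
period = estimate_pitch_period(frame, min_lag=20, max_lag=60)
estimated_f0 = fs / period
```

Restricting the search to lags 20 to 60 keeps the valley at twice the period (lag 80) out of range. Valleys at multiples of the true period are the source of the halved-frequency errors the abstract mentions, while errors in the other direction yield a doubled frequency.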
