Similar Documents
20 similar documents found (search time: 0 ms)
1.
When listening to natural speech, listeners are fairly adept at using cues such as pitch, vocal tract length, prosody, and level differences to extract a target speech signal from an interfering speech masker. However, little is known about the cues that listeners might use to segregate synthetic speech signals that retain the intelligibility characteristics of speech but lack many of the features that listeners normally use to segregate competing talkers. In this experiment, intelligibility was measured in a diotic listening task that required the segregation of two simultaneously presented synthetic sentences. Three types of synthetic signals were created: (1) sine-wave speech (SWS); (2) modulated noise-band speech (MNB); and (3) modulated sine-band speech (MSB). The listeners performed worse for all three types of synthetic signals than they did with natural speech signals, particularly at low signal-to-noise ratio (SNR) values. Of the three synthetic signals, the results indicate that SWS signals preserve more of the voice characteristics used for speech segregation than MNB and MSB signals. These findings have implications for cochlear implant users, who rely on signals very similar to MNB speech and thus are likely to have difficulty understanding speech in cocktail-party listening environments.
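Sine-wave speech replaces the formants of an utterance with time-varying sinusoids, discarding the harmonic fine structure that normally cues talker identity. A minimal synthesis sketch (the formant tracks, amplitudes, and sampling rate below are illustrative, not taken from the study):

```python
import math

def sine_wave_speech(formant_tracks, amp_tracks, fs=16000):
    """Synthesize sine-wave speech (SWS): each formant track becomes a
    single frequency- and amplitude-modulated sinusoid, summed with the
    others via per-track phase accumulation."""
    n = len(formant_tracks[0])
    out = [0.0] * n
    for freqs, amps in zip(formant_tracks, amp_tracks):
        phase = 0.0
        for i in range(n):
            phase += 2.0 * math.pi * freqs[i] / fs  # instantaneous frequency
            out[i] += amps[i] * math.sin(phase)
    return out

# Toy example: two "formants" gliding over 100 ms.
fs = 16000
n = fs // 10
f1 = [500 + 300 * i / n for i in range(n)]   # 500 -> 800 Hz glide
f2 = [1500 - 200 * i / n for i in range(n)]  # 1500 -> 1300 Hz glide
a = [0.5] * n
sws = sine_wave_speech([f1, f2], [a, a], fs)
```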

2.
3.
In several auditoria, it has been observed that the reverberation time is longer than expected and that the cause is a horizontal reverberant field established in the region near the ceiling, a field which is remote from the sound-absorbing audience. This has been observed in the Boston Symphony Hall, Massachusetts, and the Stadthalle Göttingen, Germany. Subjective remarks on their acoustics suggest that there are no unfavourable comments linked to the secondary sound field. Two acoustic scale models are considered here. In a generic rectangular concert hall model, the walls and ceiling contained openings in which either plane or scattering panels could be placed. With plane panels, the model reverberation time (RT) was measured as 53% higher than the Sabine prediction (at 500/1000 Hz), compared with 8% higher with scattering panels. The second model, of a 300-seat lecture theatre with a 6 m or 8 m high ceiling, had raked seating. In this case, the amount of absorption in the model was increased until speech had acceptable intelligibility, with the early energy fraction D ≥ 0.5. For this acceptable speech condition with the 6 m ceiling, the measured mid-frequency T15 was 1.47 s, whereas the Sabine-predicted RT was 1.06 s. The sound decay was essentially non-linear, with T30 > T15 > EDT. Exploiting a high-level horizontal reverberant field offers the possibility of acoustics that are well suited to both speech and unamplified music, without any physical change in the auditorium. Using secondary reverberation in an auditorium for a wide variety of music might also be beneficial.
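The Sabine prediction against which the model RTs were compared is the standard statistical formula T = 0.161 V / A. A minimal sketch with illustrative hall dimensions and absorption coefficients (not the values used in the scale models):

```python
def sabine_rt(volume_m3, surface_areas_m2, absorption_coeffs):
    """Sabine reverberation time T = 0.161 * V / A, where A is the total
    absorption area: sum of each surface area times its absorption
    coefficient."""
    A = sum(s * a for s, a in zip(surface_areas_m2, absorption_coeffs))
    return 0.161 * volume_m3 / A

# Toy shoebox hall: 30 x 20 x 15 m, with illustrative coefficients.
V = 30 * 20 * 15
surfaces = [30 * 20 * 2, 30 * 15 * 2, 20 * 15 * 2]  # floor+ceiling, long walls, end walls
alphas = [0.30, 0.05, 0.05]   # absorbent audience plane, hard walls
rt = sabine_rt(V, surfaces, alphas)
```

A secondary reverberant field near the ceiling, as described above, would make the measured RT exceed this prediction because the statistical assumption of a single diffuse field no longer holds.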

4.
The accuracy of noise estimation directly affects the performance of speech enhancement algorithms. To improve the noise-suppression performance of current speech enhancement algorithms and effectively solve the underlying unconstrained optimization problem, a time-frequency mask optimization algorithm for single-channel speech enhancement combining a deep neural network (DNN) with convex optimization is proposed. First, the energy spectrum of the noisy speech is extracted as the DNN input feature. Next, the in-band cross-correlation coefficient (ICC factor) between the noise and the noisy speech is used as the DNN training target. Then, the cross-correlation coefficients predicted by the DNN are used to construct the objective function of a convex optimization problem. Finally, combining the DNN and convex optimization, a new hybrid conjugate gradient method iteratively refines the initial mask, and enhanced speech is synthesized from the new mask. Simulation experiments show that at low signal-to-noise ratios under various background noises, the new mask yields better log-spectral distance (LSD), perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and segmental SNR (segSNR) scores than the unimproved baseline, raising overall speech quality and effectively suppressing noise.
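The abstract does not detail its ICC-based mask construction, so as a generic illustration of the time-frequency masking it optimizes, here is the standard ideal ratio mask (a common baseline in this literature, not the paper's method):

```python
def ideal_ratio_mask(speech_power, noise_power):
    """IRM in each time-frequency cell: speech power over total power.
    Values near 1 keep the cell; values near 0 suppress it."""
    return [[s / (s + n) if (s + n) > 0 else 0.0
             for s, n in zip(srow, nrow)]
            for srow, nrow in zip(speech_power, noise_power)]

def apply_mask(noisy_mag, mask):
    """Scale each noisy-speech magnitude cell by the mask gain."""
    return [[m * g for m, g in zip(mrow, grow)]
            for mrow, grow in zip(noisy_mag, mask)]

S = [[4.0, 1.0], [0.0, 9.0]]   # toy speech power spectrogram (freq x time)
N = [[1.0, 1.0], [1.0, 0.0]]   # toy noise power spectrogram
mask = ideal_ratio_mask(S, N)
```

In a DNN-based enhancer, the network predicts such a mask (or a quantity it is derived from) from the noisy features; the paper's contribution is refining that initial mask by convex optimization.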

5.
赵立恒  汪增福 《声学学报》2012,37(2):218-224
A monaural voiced speech separation method based on harmonic and energy features is proposed. The method recasts voiced speech separation as a classification problem over time-frequency units. First, an energy feature is introduced alongside the existing harmonic features. Then, for time-frequency units with strong harmonic features and high energy, the features are duplicated during classifier training. Experimental results show that the method achieves a better signal-to-noise ratio gain than previous methods. Introducing the energy feature and feature duplication improves voiced speech separation.

6.
Under low signal-to-noise ratio (SNR) and bursty background-noise conditions, existing deep learning models perform poorly at single-channel speech enhancement, whereas humans can exploit the long-term correlations of speech to form an integrated percept of different speech signals. Modeling the long-term dependencies of speech therefore helps improve enhancement under low SNR and bursty noise. Inspired by this property, an enhancement model, TU-net, combining a multi-head attention mechanism with a U-net deep network is proposed for end-to-end single-channel speech enhancement in the time domain. TU-net uses the encoder-decoder layers of a U-net for multi-scale feature fusion of the noisy speech signal, and uses multi-head attention to build a dual-path Transformer that computes the speech mask and better models long-term correlations. The model computes loss functions in the time, time-frequency, and perceptual domains and guides training with their weighted combination. Simulation results show that under low-SNR and bursty-noise conditions, TU-net outperforms comparable single-channel enhancement networks on several metrics, including perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and SNR gain, while keeping a relatively small number of model parameters.

7.
A monaural voiced speech separation system with improved harmonic grouping rules
To address the shortcomings of previous single-channel noise/voiced-speech separation algorithms, the harmonic grouping algorithm is improved. The algorithm uses the carrier-to-envelope energy ratio to divide time-frequency units into determined and undetermined ones, and extracts the fundamental frequency as the grouping cue. In the grouping stage, determined time-frequency units are grouped using the harmonicity and minimum-amplitude principles, while undetermined units are grouped by measuring the amplitude modulation rate with an improved envelope autocorrelation function. Compared with previous algorithms, the improved algorithm raises the average signal-to-noise ratio (SNR) by 0.96 dB. Improving the harmonic grouping rules improves separation performance.

8.
This paper describes acoustic cues for classification of consonant voicing in a distinctive-feature-based speech recognition system. Initial acoustic cues are selected by studying consonant production mechanisms. Spectral representations, band-limited energies, and correlation values, along with Mel-frequency cepstral coefficient (MFCC) features, are also examined. Analysis of variance is performed to assess the relative significance of the features. Overall, classification rates of 82.2%, 80.6%, and 78.4% are obtained on the TIMIT database for stops, fricatives, and affricates, respectively. Combining the acoustic parameters with MFCCs improves performance in all cases. Performance on NTIMIT telephone-channel speech also shows that the acoustic parameters are more robust than MFCCs.
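One of the cue families named above, band-limited energy, is simply the signal power falling within a chosen frequency band of a short frame. A naive DFT-based illustration (the band edges and frame length are illustrative, not the paper's):

```python
import math

def band_energy(frame, fs, f_lo, f_hi):
    """Band-limited energy: sum of DFT power over bins in [f_lo, f_hi).
    Direct O(n^2) DFT for clarity; an FFT would be used in practice."""
    n = len(frame)
    total = 0.0
    for k in range(n // 2 + 1):
        f = k * fs / n
        if f_lo <= f < f_hi:
            re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
            im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
            total += re * re + im * im
    return total

# A 1 kHz tone has its energy in the 0.5-1.5 kHz band, not above 2 kHz.
fs, n = 8000, 160
tone = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(n)]
low = band_energy(tone, fs, 500, 1500)
high = band_energy(tone, fs, 2000, 4000)
```

Voicing classifiers typically compare such energies across low and high bands, since voiced consonants carry relatively more low-frequency energy.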

9.
Although there have been numerous studies investigating subjective spatial impression in rooms, only a few of those studies have addressed the influence of visual cues on the judgment of auditory measures. In the psychophysical study presented here, video footage of five solo music/speech performers was shown for four different listening positions within a general-purpose space. The videos were presented in addition to the acoustic signals, which were auralized using binaural room impulse responses (BRIR) that were recorded in the same general-purpose space. The participants were asked to adjust the direct-to-reverberant energy ratio (D/R ratio) of the BRIR according to their expectation considering the visual cues. They were also directed to rate the apparent source width (ASW) and listener envelopment (LEV) for each condition. Visual cues generated by changing the sound-source position in the multi-purpose space, as well as the makeup of the sound stimuli affected the judgment of spatial impression. Participants also scaled the direct-to-reverberant energy ratio with greater direct sound energy than was measured in the acoustical environment.
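The D/R ratio that participants adjusted can be computed from an impulse response by splitting it at the arrival of the first reflections. A minimal sketch (the 2.5 ms split point is a common convention, not a value from the study):

```python
import math

def direct_to_reverberant_db(ir, fs, direct_ms=2.5):
    """D/R ratio in dB: energy in the first few milliseconds of the
    impulse response (direct sound) over the energy in the remainder
    (reverberant sound)."""
    split = int(fs * direct_ms / 1000)
    direct = sum(x * x for x in ir[:split])
    reverb = sum(x * x for x in ir[split:])
    return 10.0 * math.log10(direct / reverb)

# Toy impulse response: a unit direct spike plus an exponential tail.
fs = 48000
ir = [0.0] * fs
ir[0] = 1.0
for i in range(240, fs):  # reflections start after 5 ms
    ir[i] = 0.01 * math.exp(-i / (0.5 * fs))
drr = direct_to_reverberant_db(ir, fs)
```

Scaling the D/R ratio upward, as the participants did, corresponds to amplifying `ir[:split]` relative to the tail before convolving with the dry signal.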

10.
To separate vocals under noisy conditions, a single-channel vocal separation algorithm combining sparse non-negative matrix factorization (NMF) with a deep attractor network is proposed. First, dictionary matrices for the voice and the noise are obtained by training and used as prior information to separate the coefficient matrices of voice and noise from the noisy mixture. Then, exploiting the differing similarity of the source components of the vocal coefficient matrix in the embedding space, a deep attractor network separates it into per-source coefficient matrices. Finally, clean separated speech is reconstructed from the separated coefficient matrices and the vocal dictionary matrix. Experimental results under different noise conditions show that the algorithm improves the overall quality of the separated speech while suppressing background noise, outperforming comparison algorithms that combine noise/vocal separation models.
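The first stage above, recovering coefficient matrices against fixed trained dictionaries, is the standard supervised-NMF step. A sketch using plain (non-sparse) Euclidean multiplicative updates, which omits the paper's sparsity term:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf_coefficients(V, W, iters=200, eps=1e-9):
    """Given a fixed nonnegative dictionary W (columns = spectral bases),
    find H >= 0 with V ~= W H via the multiplicative update for the
    Euclidean cost: H <- H * (W^T V) / (W^T W H)."""
    n_comp, n_frames = len(W[0]), len(V[0])
    H = [[1.0] * n_frames for _ in range(n_comp)]
    Wt = transpose(W)
    WtV = matmul(Wt, V)
    WtW = matmul(Wt, W)
    for _ in range(iters):
        WtWH = matmul(WtW, H)
        H = [[H[i][j] * WtV[i][j] / (WtWH[i][j] + eps)
              for j in range(n_frames)] for i in range(n_comp)]
    return H

# Toy mixture: one observed frame composed of 2x basis 1 plus 3x basis 2.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # columns: "voice" and "noise" bases
V = [[2.0], [3.0], [5.0]]
H = nmf_coefficients(V, W)
```

Stacking voice and noise bases into one dictionary and splitting the resulting H rows by source is what lets the trained dictionaries act as the prior the abstract describes.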

11.
Objective parameters for the evaluation of the Rudolfinum concert hall in Prague, Czech Republic are the focus of the present article. The measured results for the reverberation, energy, intelligibility, and spatial parameters of the building's two halls are presented and discussed, including a comparison with recommended values and theory, as well as several unique architectural and acoustical qualities of the halls. The early lateral energy fraction parameter is measured by the intensity-probe method discussed in the supplement; its performance is verified by tests in anechoic and reverberant rooms.

13.
We determined how the perceived naturalness of music and speech signals (male and female talkers) was affected by various forms of linear filtering, some of which were intended to mimic the spectral "distortions" introduced by transducers such as microphones, loudspeakers, and earphones. The filters introduced spectral tilts and ripples of various types, variations in upper and lower cutoff frequency, and combinations of these. All of the differently filtered signals (168 conditions) were intermixed in random order within one block of trials. Levels were adjusted to give approximately equal loudness in all conditions. Listeners were required to judge the perceptual quality (naturalness) of the filtered signals on a scale from 1 to 10. For spectral ripples, perceived quality decreased with increasing ripple density up to 0.2 ripple/ERB(N) and with increasing ripple depth. Spectral tilts also degraded quality, and the effects were similar for positive and negative tilts. Ripples and/or tilts degraded quality more when they extended over a wide frequency range (87-6981 Hz) than when they extended over subranges. Low- and mid-frequency ranges were roughly equally important for music, but the mid-range was most important for speech. For music, the highest quality was obtained for the broadband signal (55-16,854 Hz). Increasing the lower cutoff frequency from 55 Hz resulted in a clear degradation of quality. There was also a distinct degradation as the upper cutoff frequency was decreased from 16,854 Hz. For speech, there was a marked degradation when the lower cutoff frequency was increased from 123 to 208 Hz and when the upper cutoff frequency was decreased from 10,869 Hz. Typical telephone bandwidth (313 to 3547 Hz) gave very poor quality.
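A spectral tilt of the kind used above is a gain that changes by a fixed number of dB per octave. A minimal sketch of the gain function (the reference frequency is an illustrative choice; the ERB(N)-spaced ripples are not shown):

```python
import math

def tilt_gain_db(freq_hz, tilt_db_per_octave, ref_hz=1000.0):
    """Gain of a spectral-tilt filter: tilt_db_per_octave dB per doubling
    of frequency, 0 dB at the reference frequency. Positive tilt boosts
    highs; negative tilt boosts lows."""
    return tilt_db_per_octave * math.log2(freq_hz / ref_hz)

# A +3 dB/octave tilt boosts 4 kHz by 6 dB and cuts 250 Hz by 6 dB.
up = tilt_gain_db(4000, 3.0)
down = tilt_gain_db(250, 3.0)
```

Applying such a tilt to a signal means scaling each FFT bin's magnitude by `10 ** (tilt_gain_db(f) / 20)` before resynthesis.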

14.
15.
An integral equation generalizing a variety of known geometrical room acoustics modeling algorithms is presented. The formulation of the room acoustic rendering equation is adopted from computer graphics. Based on the room acoustic rendering equation, an acoustic radiance transfer method, which can handle both diffuse and nondiffuse reflections, is derived. In a case study, the method is used to predict several acoustic parameters of a room model. The results are compared to measured data of the actual room and to the results given by other acoustics prediction software. It is concluded that the method can predict most acoustic parameters reliably and provides results as accurate as current commercial room acoustic prediction software. Although the presented acoustic radiance transfer method relies on geometrical acoustics, it can be extended to model diffraction and transmission through materials in the future.

16.
An unconstrained optimization technique is used to find the values of parameters, of a combination of an articulatory and a vocal tract model, that minimize the difference between model spectra and natural speech spectra. The articulatory model is anatomically realistic and the vocal tract model is a "lossy" Webster equation for which a method of solution is given. For English vowels in the steady state, anatomically reasonable articulatory configurations whose corresponding spectra match those of human speech to within 2 dB have been computed in fewer than ten iterations. Results are also given which demonstrate a limited ability of the system to track the articulatory dynamics of voiced speech.

17.
Congenital amusia is a lifelong disorder of music processing that has been ascribed to impaired pitch perception and memory. The present study tested a large group of amusics (n=17) and provided evidence that their pitch deficit affects pitch processing in speech to a lesser extent: Fine-grained pitch discrimination was better in spoken syllables than in acoustically matched tones. Unlike amusics, control participants performed fine-grained pitch discrimination better for musical material than for verbal material. These findings suggest that pitch extraction can be influenced by the nature of the material (music vs speech), and that amusics' pitch deficit is not restricted to musical material, but extends to segmented speech events.

18.
YIN, a fundamental frequency estimator for speech and music
An algorithm is presented for the estimation of the fundamental frequency (F0) of speech or musical sounds. It is based on the well-known autocorrelation method with a number of modifications that combine to prevent errors. The algorithm has several desirable features. Error rates are about three times lower than the best competing methods, as evaluated over a database of speech recorded together with a laryngograph signal. There is no upper limit on the frequency search range, so the algorithm is suited for high-pitched voices and music. The algorithm is relatively simple and may be implemented efficiently and with low latency, and it involves few parameters that must be tuned. It is based on a signal model (periodic signal) that may be extended in several ways to handle various forms of aperiodicity that occur in particular applications. Finally, interesting parallels may be drawn with models of auditory processing.
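The core modifications YIN makes to plain autocorrelation are a difference function and its cumulative-mean normalization. A minimal sketch of those published steps (it omits YIN's parabolic interpolation and best-local-estimate refinements, so it is less accurate than the full algorithm):

```python
import math

def yin_f0(x, fs, fmin=60.0, fmax=500.0, threshold=0.1):
    """Minimal YIN sketch: difference function, cumulative-mean
    normalized difference, and absolute-threshold dip picking."""
    tau_min = int(fs / fmax)
    tau_max = int(fs / fmin)
    w = len(x) - tau_max  # integration window length
    # Difference function d(tau) = sum_t (x[t] - x[t+tau])^2
    d = [0.0] * (tau_max + 1)
    for tau in range(1, tau_max + 1):
        d[tau] = sum((x[t] - x[t + tau]) ** 2 for t in range(w))
    # Cumulative mean normalized difference d'(tau); d'(0) := 1
    dprime = [1.0] * (tau_max + 1)
    cumsum = 0.0
    for tau in range(1, tau_max + 1):
        cumsum += d[tau]
        dprime[tau] = d[tau] * tau / cumsum if cumsum > 0 else 1.0
    # First dip below the threshold, descended to its local minimum
    for tau in range(tau_min, tau_max + 1):
        if dprime[tau] < threshold:
            while tau + 1 <= tau_max and dprime[tau + 1] < dprime[tau]:
                tau += 1
            return fs / tau
    # Fallback: global minimum of d'(tau) in the search range
    return fs / min(range(tau_min, tau_max + 1), key=lambda t: dprime[t])

# 220 Hz sine sampled at 8 kHz; the period is about 36.4 samples.
fs = 8000
x = [math.sin(2 * math.pi * 220 * t / fs) for t in range(1024)]
f0 = yin_f0(x, fs)
```

The normalization is what removes the need for an upper limit on the search range: unlike raw autocorrelation, d'(tau) does not favor the zero-lag peak, so high-pitched voices are handled naturally.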

19.
Reiterant speech, or nonsense syllable mimicry, has been proposed as a way to study prosody, particularly syllable and word durations, unconfounded by segmental influences. Researchers have shown that segmental influences on durations can be neutralized in reiterant speech. If it is to be a useful tool in the study of prosody, it must also be shown that reiterant speech preserves the suprasegmental duration and intonation differences relevant to perception. In the present study, syllable durations for nonreiterant and reiterant ambiguous sentences were measured to seek evidence of the duration differences which can enable listeners to resolve surface structure ambiguities in nonreiterant speech. These duration patterns were found in both nonreiterant and reiterant speech. A perceptual study tested listeners' perception of these ambiguous sentences as spoken by four "good" speakers--speakers who neutralized intrinsic duration differences and whose sentences were independently rated by skilled listeners as good imitations of normal speech. The listeners were able to choose the correct interpretation when the ambiguous sentences were in reiterant form as well as they did when the sentences were spoken normally. These results support the notion that reiterant speech is like nonreiterant speech in aspects which are important in the study of prosody.

20.
This study investigated which acoustic cues within the speech signal are responsible for bimodal speech perception benefit. Seven cochlear implant (CI) users with usable residual hearing at low frequencies in the non-implanted ear participated. Sentence tests were performed in near-quiet (some noise on the CI side to reduce scores from ceiling) and in a modulated noise background, with the implant alone and with the addition, in the hearing ear, of one of four types of acoustic signals derived from the same sentences: (1) a complex tone modulated by the fundamental frequency (F0) and amplitude envelope contours; (2) a pure tone modulated by the F0 and amplitude contours; (3) a noise-vocoded signal; (4) unprocessed speech. The modulated tones provided F0 information without spectral shape information, whilst the vocoded signal presented spectral shape information without F0 information. For the group as a whole, only the unprocessed speech condition provided significant benefit over implant-alone scores, in both near-quiet and noise. This suggests that, on average, F0 or spectral cues in isolation provided limited benefit for these subjects in the tested listening conditions, and that the significant benefit observed in the full-signal condition was derived from implantees' use of a combination of these cues.
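Stimulus type (2) above, a pure tone carrying only the F0 and amplitude-envelope contours, can be synthesized by phase accumulation. A minimal sketch (the contours below are illustrative, not derived from the study's sentences):

```python
import math

def f0_modulated_tone(f0_track, amp_track, fs=16000):
    """Pure tone conveying only the F0 and amplitude-envelope contours of
    an utterance; all spectral-shape information is discarded."""
    phase = 0.0
    out = []
    for f0, amp in zip(f0_track, amp_track):
        phase += 2.0 * math.pi * f0 / fs  # track the instantaneous F0
        out.append(amp * math.sin(phase))
    return out

# Toy contours: F0 falling 200 -> 150 Hz under a raised-cosine envelope.
fs = 16000
n = fs // 5  # 200 ms
f0s = [200 - 50 * i / n for i in range(n)]
amps = [0.5 * (1 - math.cos(2 * math.pi * i / n)) / 2 for i in range(n)]
tone = f0_modulated_tone(f0s, amps, fs)
```

The complementary stimulus type (3), noise-vocoded speech, does the reverse: it keeps band-wise envelopes (spectral shape) while replacing the carrier with noise, which removes the F0 contour.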


Copyright©北京勤云科技发展有限公司  京ICP备09084417号