Similar literature
 Found 20 similar articles (search time: 531 ms)
1.
A model for predicting the intelligibility of processed noisy speech is proposed. The speech-based envelope power spectrum model has a structure similar to that of the model of Ewert and Dau [(2000). J. Acoust. Soc. Am. 108, 1181-1196], which was developed to account for modulation detection and masking data. The model estimates the speech-to-noise envelope power ratio, SNR_env, at the output of a modulation filterbank and relates this metric to speech intelligibility using the concept of an ideal observer. Predictions were compared to data on the intelligibility of speech presented in stationary speech-shaped noise. The model was further tested in conditions with noisy speech subjected to reverberation and spectral subtraction. Good agreement between predictions and data was found in all cases. For spectral subtraction, an analysis of the model's internal representation of the stimuli revealed that the predicted decrease of intelligibility was caused by the estimated noise envelope power exceeding that of the speech. The classical concept of the speech transmission index fails in this condition. The results strongly suggest that the signal-to-noise ratio at the output of a modulation-frequency-selective process provides a key measure of speech intelligibility.

2.
A large number of single-channel noise-reduction algorithms have been proposed based largely on mathematical principles. Most of these algorithms, however, have been evaluated with English speech. Given the different perceptual cues used by native listeners of different languages, including tonal languages, it is of interest to examine whether there are any language effects when the same noise-reduction algorithm is used to process noisy speech in different languages. This study presents a comparative evaluation of various single-channel noise-reduction algorithms applied to noisy speech in three languages: Chinese, Japanese, and English. Clean speech signals (Chinese words and Japanese words) were first corrupted by three types of noise at two signal-to-noise ratios and then processed by five single-channel noise-reduction algorithms. The processed signals were finally presented to normal-hearing listeners for recognition. Intelligibility evaluation showed that the majority of noise-reduction algorithms did not improve speech intelligibility. Consistent with a previous study with the English language, the Wiener filtering algorithm produced small, but statistically significant, improvements in intelligibility for car and white noise conditions. Significant differences between the performances of noise-reduction algorithms across the three languages were observed.

3.
This study demonstrates a new possibility for estimating the intelligibility of speech in informational maskers. The temporal and spectral properties of sound maskers are investigated to achieve acoustic privacy in public spaces. Speech intelligibility (SI) tests were conducted using Japanese sentences in daily use with energetic (white noise) or informational (reversed speech) maskers. We found that the masking effects on SI, including informational masking, might not be estimated by analyzing the narrow-band temporal envelopes, which is a common way of predicting SI under noisy conditions. The masking effects might instead be visualized by spectral auto-correlation analysis on a frame-by-frame basis, for the series of dominant spectral peaks of the masked target in the frequency domain. Consequently, we found that dissimilarity in frame-based spectral auto-correlation sequences between the original and masked targets was the key to evaluating maskers, including informational masking effects on SI.

4.
Artificial bandwidth extension methods have been developed to improve the quality and intelligibility of narrowband telephone speech and to reduce the difference with wideband speech. Such methods have commonly been evaluated with objective measures or subjective listening-only tests, but conversational evaluations have been rare. This article presents a conversational evaluation of two methods for the artificial bandwidth extension of telephone speech. Bandwidth-extended narrowband speech is compared with narrowband and wideband speech in a test setting including a simulated telephone connection, realistic conversation tasks, and various background noise conditions. The responses of the subjects indicate that speech processed with one of the methods is preferred to narrowband speech in noise, but wideband speech is superior to both narrowband and bandwidth-extended speech. Bandwidth extension was found to be beneficial for telephone conversation in noisy listening conditions.

5.
We propose and evaluate an estimation method for forced-selection speech intelligibility tests. The method takes into account the forced-selection format of the Diagnostic Rhyme Test (DRT), in which the listener must select one word from a pair of rhyming words. A distance measure is calculated between the test word and each of the two candidate words, and the distances are compared to select the more likely word. We compared two distance measures. The first was based on the Articulation index Band Correlation (ABC), the correlation of time–frequency (T–F) patterns between the test word and the template speech of each of the two words in the candidate pair. The word with the higher correlation is taken as the likely candidate. The T–F pattern is calculated in the Articulation Index (AI) bands, and the correlation is calculated between corresponding bands of the test and candidate word samples. To estimate intelligibility, we calculate the ratio of the number of bands in which the correlation is higher for the correct word to the total number of bands (named ABC-est). This ratio quantifies how well the test word matches the correct word in the pair. The second measure was based on the frequency-weighted segmental SNR (fwSNRseg). Segmental SNR (SNRseg) is calculated in the AI bands and compared among the candidate word templates. We then calculate the frequency-weighted ratio of the number of bands in which SNRseg is higher for the correct word to the total number of bands (named fwSNRseg-est), again to quantify how well the test word matches the selected candidate word in the pair. We estimated a logistic mapping function from these two ratios to intelligibility scores using speech mixed with known noise. The mapping functions were then used to estimate the intelligibility of speech mixed with unknown noise.
These estimates were compared with a measure that we evaluated previously, the conventional fwSNRseg, which is mapped directly to intelligibility. Both proposed measures were significantly more accurate than the conventional fwSNRseg. In most cases the accuracy of the two proposed measures, ABC-est and fwSNRseg-est, was comparable. The latter showed a correlation between subjective and estimated intelligibility as high as 0.97 and a root-mean-square error as low as 0.11 for one of the test sets, but was less accurate for the other sets. ABC-est showed more stable accuracy across all sets. Both measures nevertheless showed practical accuracy in all conditions tested. Thus, it should be possible to "screen" the intelligibility in many of the noise conditions to be tested and so reduce the scale of the subjective test needed.
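The band-counting step behind ABC-est can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the plain Pearson correlation over each band's pattern, and the logistic parameters `a` and `b` are hypothetical, and the AI band analysis itself is assumed to have been done already.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def abc_est(test_bands, correct_bands, wrong_bands):
    """Fraction of bands in which the test word correlates more strongly
    with the correct word template than with the other candidate."""
    wins = sum(
        1
        for t, c, w in zip(test_bands, correct_bands, wrong_bands)
        if pearson(t, c) > pearson(t, w)
    )
    return wins / len(test_bands)

def logistic_map(ratio, a, b):
    """Hypothetical logistic mapping from a band ratio to an intelligibility score.
    In practice a and b would be fitted on speech mixed with known noise."""
    return 1.0 / (1.0 + math.exp(-(a * ratio + b)))
```

Each argument to `abc_est` is a list with one sequence per band. The frequency-weighted variant (fwSNRseg-est) would replace the simple count with a weighted count, using weights that the abstract does not specify.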

6.
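An improved U-Net model for end-to-end speech enhancement, the Attention Dilated Convolution U-Net (ADC-U-Net), is designed. Compared with the baseline U-Net, the model adds dilated convolutions to reduce the information loss caused by sampling, and introduces an attention mechanism that incorporates more contextual information from the noisy speech and extracts deeper and richer feature information...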

7.
Recent evidence suggests that spectral change, as measured by cochlea-scaled entropy (CSE), predicts speech intelligibility better than the information carried by vowels or consonants in sentences. Motivated by this finding, the present study investigates whether intelligibility indices implemented to include segments marked with significant spectral change better predict speech intelligibility in noise than measures that include all phonetic segments paying no attention to vowels/consonants or spectral change. The prediction of two intelligibility measures [normalized covariance measure (NCM), coherence-based speech intelligibility index (CSII)] is investigated using three sentence-segmentation methods: relative root-mean-square (RMS) levels, CSE, and traditional phonetic segmentation of obstruents and sonorants. While the CSE method makes no distinction between spectral changes occurring within vowels/consonants, the RMS-level segmentation method places more emphasis on the vowel-consonant boundaries wherein the spectral change is often most prominent, and perhaps most robust, in the presence of noise. Higher correlation with intelligibility scores was obtained when including sentence segments containing a large number of consonant-vowel boundaries than when including segments with highest entropy or segments based on obstruent/sonorant classification. These data suggest that in the context of intelligibility measures the type of spectral change captured by the measure is important.

8.
Objective measures were investigated as predictors of the speech security of closed offices and rooms. A new signal-to-noise type measure is shown to be a superior indicator for security than existing measures such as the Articulation Index, the Speech Intelligibility Index, the ratio of the loudness of speech to that of noise, and the A-weighted level difference of speech and noise. This new measure is a weighted sum of clipped one-third-octave-band signal-to-noise ratios; various weightings and clipping levels are explored. Listening tests had 19 subjects rate the audibility and intelligibility of 500 English sentences, filtered to simulate transmission through various wall constructions, and presented along with background noise. The results of the tests indicate that the new measure is highly correlated with sentence intelligibility scores and also with three security thresholds: the threshold of intelligibility (below which speech is unintelligible), the threshold of cadence (below which the cadence of speech is inaudible), and the threshold of audibility (below which speech is inaudible). The ratio of the loudness of speech to that of noise, and simple A-weighted level differences are both shown to be well correlated with these latter two thresholds (cadence and audibility), but not well correlated with intelligibility.
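A weighted sum of clipped band signal-to-noise ratios, as described in this abstract, could be computed along the following lines. This is only a sketch: the function name is hypothetical, and the weights and clipping limits are left as parameters because the abstract says several were explored without giving values.

```python
def clipped_weighted_snr(band_snr_db, weights, lo_db, hi_db):
    """Weighted sum of per-band SNRs (dB), each clipped to [lo_db, hi_db].

    band_snr_db: one SNR value per one-third-octave band.
    weights:     one weight per band (hypothetical; not given in the abstract).
    """
    if len(band_snr_db) != len(weights):
        raise ValueError("need exactly one weight per band")
    if lo_db > hi_db:
        raise ValueError("lo_db must not exceed hi_db")
    total = 0.0
    for snr, w in zip(band_snr_db, weights):
        clipped = min(max(snr, lo_db), hi_db)  # limit each band's contribution
        total += w * clipped
    return total
```

Clipping keeps a single very quiet or very loud band from dominating the sum, which is one plausible reason for using it in a security measure.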

9.
Previous work [Morimoto et al., J. Acoust. Soc. Am. 116, 1607-1613] showed that listening difficulty ratings can be used to evaluate speech transmission performance more exactly and sensitively than intelligibility. Meanwhile, speech transmission performance is usually evaluated using acoustical objective measures, which are directly associated with physical parameters of room acoustic design. However, the relationship between listening difficulty ratings and acoustical objective measures has not been investigated in detail. In the present study, a total of 96 impulse responses were used to investigate the relationship between listening difficulty ratings and several objective measures in unidirectional sound fields. The results of the listening test showed that (1) the correlation between listening difficulty ratings and the speech transmission index (STI) is the strongest of all tested objective measures, and (2) A-weighted D50, C50, and center time, which are obtained from the impulse responses passed through an A-weighting filter, also correlate strongly with listening difficulty ratings, and their correlations are not statistically different from the correlation between listening difficulty ratings and STI.

10.
Most noise-reduction algorithms used in hearing aids apply a gain to the noisy envelopes to reduce noise interference. The present study assesses the impact of two types of speech distortion introduced by noise-suppressive gain functions: amplification distortion, occurring when the amplitude of the target signal is over-estimated, and attenuation distortion, occurring when the target amplitude is under-estimated. Sentences corrupted by steady noise and a competing talker were processed through a noise-reduction algorithm and synthesized to contain either amplification distortion, attenuation distortion, or both. The attenuation distortion was found to have a minimal effect on speech intelligibility. In fact, substantial improvements (>80 percentage points) in intelligibility, relative to noise-corrupted speech, were obtained when the processed sentences contained only attenuation distortion. When the amplification distortion was limited to be smaller than 6 dB, performance was nearly unaffected in the steady-noise conditions, but was severely degraded in the competing-talker conditions. Overall, the present data suggest that one reason existing algorithms do not improve speech intelligibility is that they allow amplification distortions in excess of 6 dB. These distortions are shown in this study to be always associated with masker-dominated envelopes and should thus be eliminated.
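The two distortion types and the 6 dB limit can be illustrated with a small classifier. This is a sketch under assumptions: the function name and category labels are hypothetical, amplitudes are assumed positive, and the over- or under-estimate is expressed as 20·log10 of the amplitude ratio, which the abstract does not state explicitly.

```python
import math

def classify_distortion(true_amp, est_amp, limit_db=6.0):
    """Label an estimated target amplitude relative to the true one.

    Returns "none", "attenuation", "amplification_below_limit",
    or "amplification_at_or_above_limit".
    """
    if true_amp <= 0 or est_amp <= 0:
        raise ValueError("amplitudes must be positive")
    diff_db = 20.0 * math.log10(est_amp / true_amp)
    if diff_db == 0:
        return "none"
    if diff_db < 0:
        return "attenuation"  # target amplitude under-estimated
    if diff_db < limit_db:
        return "amplification_below_limit"  # over-estimated by less than the limit
    return "amplification_at_or_above_limit"
```

In the terms of the abstract, only the last category is the one the authors argue should be eliminated.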

11.
A number of objective evaluation methods are currently used to quantify the speech intelligibility in a built environment, including the speech transmission index (STI), rapid speech transmission index (RASTI), articulation index (AI), and the percent articulation loss of consonants (%ALCons). Certain software programs can quickly evaluate STI, RASTI, and %ALCons from a measured room impulse response. In this project, two impulse-response-based software packages (WinMLS and SIA-Smaart Acoustic Tools) were evaluated for their ability to determine intelligibility accurately. In four different spaces with background noise levels less than NC 45, speech intelligibility was measured via three methods: (1) with WinMLS 2000; (2) with SIA-Smaart Acoustic Tools (v4.0.2); and (3) from listening tests with humans. The study found that WinMLS measurements of speech intelligibility based on STI, RASTI, and %ALCons corresponded well with performance on the listening tests. SIA-Smaart results were correlated to human responses, but tended to under-predict intelligibility based on STI and RASTI, and over-predict intelligibility based on %ALCons.

12.
Recent research results show that combined electric and acoustic stimulation (EAS) significantly improves speech recognition in noise, and it is generally established that access to the improved F0 representation of target speech, along with the glimpse cues, provides the EAS benefits. Under noisy listening conditions, noise signals degrade these important cues by introducing undesired temporal-frequency components and corrupting the harmonic structure. In this study, the potential of combining noise reduction and harmonics regeneration techniques was investigated to further improve speech intelligibility in noise by providing improved beneficial cues for EAS. Three hypotheses were tested: (1) noise reduction methods can improve speech intelligibility in noise for EAS; (2) harmonics regeneration after noise reduction can further improve speech intelligibility in noise for EAS; and (3) harmonics sideband constraints in the frequency domain (or equivalently, amplitude modulation in the temporal domain), even deterministic ones, can provide additional benefits. Test results demonstrate that combining noise reduction and harmonics regeneration can significantly improve speech recognition in noise for EAS, and that it is also beneficial to preserve the harmonics sidebands under adverse listening conditions. This finding warrants further work on the development of algorithms that regenerate harmonics and the related sidebands for EAS processing under noisy conditions.

13.
The Speech Transmission Index (STI) is a physical metric that is well correlated with the intelligibility of speech degraded by additive noise and reverberation. The traditional STI uses modulated noise as a probe signal and is valid for assessing degradations that result from linear operations on the speech signal. Researchers have attempted to extend the STI to predict the intelligibility of nonlinearly processed speech by proposing variations that use speech as a probe signal. This work considers four previously proposed speech-based STI methods and four novel methods, studied under conditions of additive noise, reverberation, and two nonlinear operations (envelope thresholding and spectral subtraction). Analyzing intermediate metrics in the STI calculation reveals why some methods fail for nonlinear operations. Results indicate that none of the previously proposed methods is adequate for all of the conditions considered, while four proposed methods produce qualitatively reasonable results and warrant further study. The discussion considers the relevance of this work to predicting the intelligibility of cochlear-implant processed speech.

14.
A method for computing the speech transmission index (STI) using real speech stimuli is presented and evaluated. The method reduces the effects of some of the artifacts that can be encountered when speech waveforms are used as probe stimuli. Speech-based STIs are computed for conversational and clearly articulated speech in several noisy, reverberant, and noisy-reverberant environments and compared with speech intelligibility scores. The results indicate that, for each speaking style, the speech-based STI values are monotonically related to intelligibility scores for the degraded speech conditions tested. Therefore, the STI can be computed using speech probe waveforms and the values of the resulting indices are as good predictors of intelligibility scores as those derived from MTFs by theoretical methods.

15.
This paper addresses the problem of speech quality improvement using adaptive filtering algorithms. In Djendi and Bendoumia (2014) [1], we proposed a new two-channel backward algorithm for noise reduction and speech intelligibility enhancement. The main drawback of the proposed two-channel subband algorithm is its poor performance when the number of subbands is high, which is clearly visible in the steady-state values. The source of this problem is the fixed step-sizes of the cross-adaptive filtering algorithms, which distort the speech signal when set high and slow the convergence when set small. In this paper, we propose four modifications of this algorithm that improve both the convergence speed and the steady-state values, even in very noisy conditions and with a high number of subbands. To confirm the good performance of the four proposed variable-step-size SBBSS algorithms, we carried out several simulations in various noisy environments. In these simulations, we evaluated objective and subjective criteria, namely the system mismatch, the cepstral distance, the output signal-to-noise ratio, and the mean opinion score (MOS), to compare the four proposed variable-step-size versions of the SBBSS algorithm with their original versions and with the two-channel fullband backward (2CFB) least mean square algorithm.

16.
Listening difficulty ratings, using words with high word familiarity, are proposed as a new subjective measure for the evaluation of speech transmission in public spaces to provide realistic and objective results. Two listening tests were performed to examine their validity, compared with intelligibility scores. The tests included a reverberant signal and noise as detrimental sounds. The subject was asked to repeat each word and simultaneously to rate the listening difficulty in one of four categories: (1) not difficult, (2) a little difficult, (3) fairly difficult, and (4) extremely difficult. After the tests, the four categories were reclassified into "not difficult" [response (1)] and "some level of difficulty" (the other three responses). Listening difficulty is defined as the percentage of the total number of responses indicating some level of difficulty [i.e., not (1)]. The results of the two listening tests demonstrated that listening difficulty ratings can evaluate speech transmission performance more accurately and sensitively than intelligibility scores for sound fields with higher speech transmission performance.
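The definition of listening difficulty given in this abstract translates directly into a short calculation. The sketch below assumes responses are recorded as the integers 1 to 4 for the four categories; the function name is hypothetical.

```python
def listening_difficulty(ratings):
    """Percentage of responses indicating some level of difficulty.

    ratings: iterable of ints, 1 = not difficult, 2 = a little difficult,
             3 = fairly difficult, 4 = extremely difficult.
    """
    ratings = list(ratings)
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in (1, 2, 3, 4) for r in ratings):
        raise ValueError("ratings must be 1, 2, 3, or 4")
    difficult = sum(1 for r in ratings if r != 1)  # categories 2-4 are pooled
    return 100.0 * difficult / len(ratings)
```

For example, four responses of 1, 1, 2, and 4 give a listening difficulty of 50%.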

17.
When a target-speech/masker mixture is processed with the ideal binary mask (IBM) signal-separation technique, the intelligibility of target speech is remarkably improved in both normal-hearing and hearing-impaired listeners. Intelligibility of speech can also be improved by filling in speech gaps with un-modulated broadband noise. This study investigated whether the intelligibility of target speech in the IBM-treated target-speech/masker mixture can be further improved by adding a broadband-noise background. The results show that following the IBM manipulation, which remarkably released target speech from speech-spectrum noise, foreign-speech, or native-speech masking (experiment 1), adding a broadband-noise background with a signal-to-noise ratio of no less than 4 dB significantly improved the intelligibility of target speech when the masker was either noise (experiment 2) or speech (experiment 3). The results suggest that, because adding the noise background makes the areas of silence in the time-frequency domain of the IBM-treated mixture shallower, the abruptness of transient changes in the mixture is smoothed and the perceived continuity of target-speech components is enhanced, leading to improved target-speech intelligibility. The findings are useful for advancing computational auditory scene analysis, hearing-aid/cochlear-implant designs, and understanding of speech perception under "cocktail-party" conditions.
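For readers unfamiliar with the ideal binary mask, the sketch below shows the usual idea: a time-frequency unit is kept when the target energy exceeds the masker energy by more than a local criterion. It is not taken from this study; the function name, the list-of-lists energy layout, and the `lc_db` criterion (0 dB by default) are assumptions made for illustration.

```python
def ideal_binary_mask(target_energy, masker_energy, lc_db=0.0):
    """Return a 0/1 mask with the same shape as the inputs.

    target_energy, masker_energy: lists of rows (time frames), each a list of
    non-negative energies per frequency channel. A unit is kept (1) when the
    target-to-masker energy ratio exceeds lc_db decibels.
    """
    ratio_threshold = 10.0 ** (lc_db / 10.0)  # dB criterion as a linear energy ratio
    mask = []
    for t_row, m_row in zip(target_energy, masker_energy):
        mask.append([1 if t > m * ratio_threshold else 0
                     for t, m in zip(t_row, m_row)])
    return mask
```

Units set to 0 become silent after masking, and these are the silent areas that the added broadband noise in the study partly fills.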

18.
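蒋斌, 匡正, 吴鸣, 杨军. 声学学报 (Acta Acustica), 2012, 37(6): 659-666
The effect of frame length on the intelligibility of segment-reversed (locally time-reversed) Chinese speech was studied experimentally. The results show that intelligibility is high for frame lengths below 64 ms, decreases gradually as the frame length increases from 64 to 203 ms, and is zero for frame lengths above 203 ms. At a frame length of 8 ms, distortion of the Chinese tones reduces intelligibility. Analysis of the modulation spectra of the original speech and the segment-reversed speech shows that the amount of modulation-spectrum distortion is closely related to intelligibility. The normalized correlation between the narrow-band envelopes of the original and segment-reversed speech can therefore be used to measure the modulation-spectrum distortion, and the objective values computed with the speech-based speech transmission index method correlate significantly with the experimental results (r=0.876, p<0.01). The study shows that speech intelligibility is related to the narrow-band envelope, and that the intelligibility of segment-reversed speech depends closely on how well the narrow-band envelopes of the original speech are preserved.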
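Segment reversal itself is simple to reproduce: the signal is cut into consecutive frames of a fixed length and each frame is played backwards. The sketch below assumes non-overlapping frames with no windowing or cross-fading at the boundaries, which the abstract does not specify; both function names are hypothetical.

```python
def frame_length_samples(frame_ms, sample_rate_hz):
    """Convert a frame length in milliseconds to a whole number of samples."""
    return max(1, round(frame_ms * sample_rate_hz / 1000))

def reverse_segments(samples, frame_len):
    """Time-reverse each consecutive frame of frame_len samples.

    A final frame shorter than frame_len is reversed as it is.
    """
    if frame_len < 1:
        raise ValueError("frame_len must be at least 1")
    out = []
    for start in range(0, len(samples), frame_len):
        out.extend(reversed(samples[start:start + frame_len]))
    return out
```

At a 16 kHz sampling rate (an assumed value), the 64 ms frame length mentioned above corresponds to 1024 samples per frame.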

19.
In the n-of-m strategy, the signal is processed through m bandpass filters from which only the n maximum envelope amplitudes are selected for stimulation. While this maximum selection criterion, adopted in the advanced combination encoder strategy, works well in quiet, it can be problematic in noise as it is sensitive to the spectral composition of the input signal and does not account for situations in which the masker completely dominates the target. A new selection criterion is proposed based on the signal-to-noise ratio (SNR) of individual channels. The new criterion selects target-dominated (SNR ≥ 0 dB) channels and discards masker-dominated (SNR < 0 dB) channels. Experiment 1 assessed cochlear implant users' performance with the proposed strategy assuming that the channel SNRs are known. Results indicated that the proposed strategy can restore speech intelligibility to the level attained in quiet independent of the type of masker (babble or continuous noise) and SNR level (0-10 dB) used. Results from experiment 2 showed that a 25% error rate can be tolerated in channel selection without compromising speech intelligibility. Overall, the findings from the present study suggest that the SNR criterion is an effective selection criterion for n-of-m strategies with the potential of restoring speech intelligibility.
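The two selection criteria contrasted in this abstract can be placed side by side in a short sketch. The function names are hypothetical, and the per-channel SNRs are assumed to be known, as in experiment 1.

```python
def select_n_of_m(envelopes, n):
    """Maximum-selection criterion: indices of the n largest envelope amplitudes."""
    order = sorted(range(len(envelopes)), key=lambda i: envelopes[i], reverse=True)
    return sorted(order[:n])

def select_by_snr(channel_snr_db, threshold_db=0.0):
    """SNR criterion: keep target-dominated channels (SNR >= threshold_db)
    and discard masker-dominated ones."""
    return [i for i, snr in enumerate(channel_snr_db) if snr >= threshold_db]
```

Unlike maximum selection, the SNR criterion does not fix the number of selected channels in advance: it can return fewer than n channels, or none, when the masker dominates.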

20.
Recent papers have discussed the optimal reverberation times in classrooms for speech intelligibility, based on the assumption of a diffuse sound field. Here this question was investigated for more 'typical' classrooms with non-diffuse sound fields. A ray-tracing model was modified to predict the speech-intelligibility metric U50. It was used to predict U50 in various classroom configurations for various values of the room absorption, allowing the optimal absorption (that predicting the highest U50), and the corresponding optimal reverberation time, to be identified in each case. The ranges of absorptions and reverberation times corresponding to high speech intelligibility were also predicted in each case. Optimal reverberation times were also predicted from the optimal surface-absorption coefficients using the Sabine and Eyring versions of diffuse-field theory, and using the diffuse-field expression of Hodgson and Nosal. In order to validate the ray-tracing model, predictions were made for three classrooms with highly diffuse sound fields; these were compared to values obtained by the diffuse-field models, with good agreement. The methods were then applied to three 'typical' classrooms with non-diffuse fields. Optimal reverberation times increased with room volume and noise level to over 1 s. The accuracy of the Hodgson and Nosal expression varied with classroom size and noise level. The optimal average surface-absorption coefficients varied from 0.19 to 0.83 in the different classroom configurations tested. High speech intelligibility was, in general, predicted for a wide range of coefficients, but could not be obtained in a large, noisy classroom.
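The Sabine and Eyring diffuse-field formulas mentioned in this abstract convert an average surface-absorption coefficient into a reverberation time. The sketch below uses the standard metric forms with air absorption neglected; the function names and the example room dimensions are assumptions, and the Hodgson and Nosal expression is not reproduced here.

```python
import math

def sabine_rt(volume_m3, surface_m2, alpha):
    """Sabine reverberation time (s): T = 0.161 V / (S * alpha)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    return 0.161 * volume_m3 / (surface_m2 * alpha)

def eyring_rt(volume_m3, surface_m2, alpha):
    """Eyring reverberation time (s): T = 0.161 V / (-S * ln(1 - alpha))."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    return 0.161 * volume_m3 / (-surface_m2 * math.log(1.0 - alpha))

# Hypothetical classroom: 200 m^3 volume, 210 m^2 total surface area.
rt_sabine = sabine_rt(200.0, 210.0, 0.2)
rt_eyring = eyring_rt(200.0, 210.0, 0.2)
```

For the same coefficient the Eyring value is always the shorter of the two, and the gap widens as the absorption increases, which matters at the upper end of the 0.19 to 0.83 range reported above.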
