Similar literature
Found 20 similar documents (search time: 46 ms)
1.
The application of the ideal binary mask to an auditory mixture has been shown to yield substantial improvements in intelligibility. This mask is commonly applied to the time-frequency (T-F) representation of a mixture signal and eliminates portions of a signal below a signal-to-noise-ratio (SNR) threshold while allowing others to pass through intact. The factors influencing intelligibility of ideal binary-masked speech are not well understood and are examined in the present study. Specifically, the effects of the local SNR threshold, input SNR level, masker type, and errors introduced in estimating the ideal mask are examined. Consistent with previous studies, intelligibility of binary-masked stimuli is quite high even at -10 dB SNR for all maskers tested. Performance was affected the most when masker-dominated T-F units were wrongly labeled as target-dominated T-F units. Performance plateaued near 100% correct for SNR thresholds ranging from -20 to 5 dB. The existence of the plateau region suggests that it is the pattern of the ideal binary mask that matters the most rather than the local SNR of each T-F unit. This pattern directs the listener's attention to where the target is and enables them to segregate speech effectively in multitalker environments.
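
The masking rule described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the function names `ideal_binary_mask` and `apply_mask`, the nested-list layout of per-unit energies, the strict "greater than" comparison at the threshold, and the handling of zero-energy units are all hypothetical. Only the general rule (units below the local SNR threshold are removed, the rest pass intact) comes from the abstract.

```python
import math


def ideal_binary_mask(target_energy, masker_energy, threshold_db=0.0):
    """Return a 0/1 mask with 1 where the local SNR in dB exceeds threshold_db.

    target_energy and masker_energy are equally shaped lists of rows, one
    value per time-frequency (T-F) unit. They must be known separately,
    which is why the mask is called "ideal".
    """
    mask = []
    for t_row, m_row in zip(target_energy, masker_energy):
        row = []
        for t, m in zip(t_row, m_row):
            if t <= 0.0:
                keep = False          # no target energy: discard the unit
            elif m <= 0.0:
                keep = True           # assumption: target alone is always kept
            else:
                keep = 10.0 * math.log10(t / m) > threshold_db
            row.append(1 if keep else 0)
        mask.append(row)
    return mask


def apply_mask(mixture, mask):
    """Zero the discarded T-F units of the mixture; retained units are unchanged."""
    return [[x * k for x, k in zip(x_row, k_row)]
            for x_row, k_row in zip(mixture, mask)]
```

Lowering `threshold_db` retains more units. The abstract reports that intelligibility stays near ceiling for thresholds from -20 to 5 dB, which this sketch does not model.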

2.
In the n-of-m strategy, the signal is processed through m bandpass filters from which only the n maximum envelope amplitudes are selected for stimulation. While this maximum selection criterion, adopted in the advanced combination encoder strategy, works well in quiet, it can be problematic in noise as it is sensitive to the spectral composition of the input signal and does not account for situations in which the masker completely dominates the target. A new selection criterion is proposed based on the signal-to-noise ratio (SNR) of individual channels. The new criterion selects target-dominated (SNR ≥ 0 dB) channels and discards masker-dominated (SNR < 0 dB) channels. Experiment 1 assessed cochlear implant users' performance with the proposed strategy assuming that the channel SNRs are known. Results indicated that the proposed strategy can restore speech intelligibility to the level attained in quiet independent of the type of masker (babble or continuous noise) and SNR level (0-10 dB) used. Results from experiment 2 showed that a 25% error rate can be tolerated in channel selection without compromising speech intelligibility. Overall, the findings from the present study suggest that the SNR criterion is an effective selection criterion for n-of-m strategies with the potential of restoring speech intelligibility.
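
The two selection criteria contrasted in this abstract can be sketched for a single stimulation frame as below. The function names `select_n_of_m` and `select_by_snr` and the list-based representation are assumptions made for illustration. As in Experiment 1, the per-channel SNRs are simply taken as known.

```python
def select_n_of_m(envelopes, n):
    """Maximum selection: indices of the n largest envelope amplitudes."""
    ranked = sorted(range(len(envelopes)),
                    key=lambda i: envelopes[i], reverse=True)
    return sorted(ranked[:n])


def select_by_snr(channel_snr_db, threshold_db=0.0):
    """SNR criterion: keep target-dominated channels (SNR >= threshold_db)."""
    return [i for i, snr in enumerate(channel_snr_db) if snr >= threshold_db]
```

Maximum selection always returns n channels, even when all of them are masker-dominated. The SNR criterion returns a variable number of channels and may return none.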

3.
For a mixture of target speech and noise in anechoic conditions, the ideal binary mask is defined as follows: it selects the time-frequency units where target energy exceeds noise energy by a certain local threshold and cancels the other units. In this study, the definition of the ideal binary mask is extended to reverberant conditions. Given the division between early and late reflections in terms of speech intelligibility, three ideal binary masks can be defined: an ideal binary mask that uses the direct path of the target as the desired signal, an ideal binary mask that uses the direct path and early reflections of the target as the desired signal, and an ideal binary mask that uses the reverberant target as the desired signal. The effects of these ideal binary mask definitions on speech intelligibility are compared across two types of interference: speech-shaped noise and concurrent female speech. As suggested by psychoacoustical studies, the ideal binary mask based on the direct path and early reflections of target speech outperforms the other masks as reverberation time increases and produces substantial reductions in terms of speech reception threshold for normal-hearing listeners.

4.
The effect of perceived spatial differences on masking release was examined using a 4AFC speech detection paradigm. Targets were 20 words produced by a female talker. Maskers were recordings of continuous streams of nonsense sentences spoken by two female talkers and mixed into each of two channels (two talker, and the same masker time reversed). Two masker spatial conditions were employed: "RF" with a 4 ms time lead to the loudspeaker 60 degrees horizontally to the right, and "FR" with the time lead to the front (0 degrees) loudspeaker. The reference nonspatial "F" masker was presented from the front loudspeaker only. Target presentation was always from the front loudspeaker. In Experiment 1, target detection threshold for both natural and time-reversed spatial maskers was 17-20 dB lower than that for the nonspatial masker, suggesting that significant release from informational masking occurs with spatial speech maskers regardless of masker understandability. In Experiment 2, the effectiveness of the FR and RF maskers was evaluated as the right loudspeaker output was attenuated until the two-source maskers were indistinguishable from the F masker, as measured independently in a discrimination task. Results indicated that spatial release from masking can be observed with barely noticeable target-masker spatial differences.

5.
Speech recognition in noise improves with combined acoustic and electric stimulation compared to electric stimulation alone [Kong et al., J. Acoust. Soc. Am. 117, 1351-1361 (2005)]. Here the contribution of fundamental frequency (F0) and low-frequency phonetic cues to speech recognition in combined hearing was investigated. Normal-hearing listeners heard vocoded speech in one ear and low-pass (LP) filtered speech in the other. Three listening conditions (vocode-alone, LP-alone, combined) were investigated. Target speech (average F0=120 Hz) was mixed with a time-reversed masker (average F0=172 Hz) at three signal-to-noise ratios (SNRs). LP speech aided performance at all SNRs. Low-frequency phonetic cues were then removed by replacing the LP speech with a LP equal-amplitude harmonic complex, frequency and amplitude modulated by the F0 and temporal envelope of voiced segments of the target. The combined hearing advantage disappeared at 10 and 15 dB SNR, but persisted at 5 dB SNR. A similar finding occurred when, additionally, F0 contour cues were removed. These results are consistent with a role for low-frequency phonetic cues, but not with a combination of F0 information between the two ears. The enhanced performance at 5 dB SNR with F0 contour cues absent suggests that voicing or glimpsing cues may be responsible for the combined hearing benefit.

6.
In a natural environment, speech signals are degraded by both reverberation and concurrent noise sources. While human listening is robust under these conditions using only two ears, current two-microphone algorithms perform poorly. The psychological process of figure-ground segregation suggests that the target signal is perceived as a foreground while the remaining stimuli are perceived as a background. Accordingly, the goal is to estimate an ideal time-frequency (T-F) binary mask, which selects the target if it is stronger than the interference in a local T-F unit. In this paper, a binaural segregation system is proposed that extracts the reverberant target signal from multisource reverberant mixtures using only the location information of the target source. The proposed system combines target cancellation through adaptive filtering and a binary decision rule to estimate the ideal T-F binary mask. The main observation in this work is that the target attenuation in a T-F unit resulting from adaptive filtering is correlated with the relative strength of target to mixture. A comprehensive evaluation shows that the proposed system results in large SNR gains. In addition, comparisons using SNR as well as automatic speech recognition measures show that this system outperforms standard two-microphone beamforming approaches and a recent binaural processor.

7.
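Liang Shan (梁山), Liu Wenju (刘文举), Jiang Wei (江巍). 《声学学报》 (Acta Acustica), 2013, 38(5): 632-637
Although real-valued (ratio) masks give better speech separation than binary masks, the ideal ratio mask is difficult to estimate directly, so existing speech separation systems usually take estimation of the ideal binary mask as their computational goal. We propose an algorithm that generalizes a binary mask to a ratio mask. Because the key to ratio mask estimation is tracking the noise energy, we first use an exponential distribution to model the conditional distribution of the noise energy given the mixture energy and the binary mask as observations. Second, a Gauss-Markov conditional random field is used to model the correlation of the noise estimates across several consecutive frames. Finally, Markov chain Monte Carlo is used to compute the minimum mean-square-error estimate of the noise energy, from which the ratio mask is then computed. Experiments show that, compared with conventional algorithms based on binary mask estimation, the proposed algorithm yields significant improvements in both SNR gain and objective perceptual quality.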

8.
When a target-speech/masker mixture is processed with the ideal binary mask (IBM) signal-separation technique, intelligibility of target speech is remarkably improved in both normal-hearing and hearing-impaired listeners. Intelligibility of speech can also be improved by filling in speech gaps with unmodulated broadband noise. This study investigated whether intelligibility of target speech in the IBM-treated target-speech/masker mixture can be further improved by adding a broadband-noise background. The results show that following the IBM manipulation, which remarkably released target speech from speech-spectrum noise, foreign-speech, or native-speech masking (experiment 1), adding a broadband-noise background with a signal-to-noise ratio no less than 4 dB significantly improved intelligibility of target speech when the masker was either noise (experiment 2) or speech (experiment 3). The results suggest that because the added noise background partly fills the silent regions in the time-frequency domain of the IBM-treated target-speech/masker mixture, the abruptness of transient changes in the mixture is smoothed and the perceived continuity of target-speech components is enhanced, leading to improved target-speech intelligibility. The findings are useful for advancing computational auditory scene analysis, hearing-aid/cochlear-implant designs, and understanding of speech perception under "cocktail-party" conditions.

9.
The effects of variations in vocal effort corresponding to common conversation situations on spectral properties of vowels were investigated. A database was recorded in which three degrees of vocal effort were suggested to the speakers by varying the distance to their interlocutor in three steps (close: 0.4 m, normal: 1.5 m, and far: 6 m). The speech materials consisted of isolated French vowels, uttered by ten naive speakers in a quiet furnished room. Manual measurements of fundamental frequency F0, of the frequencies and amplitudes of the first three formants (F1, F2, F3, A1, A2, and A3), and of total amplitude were carried out. The speech materials were perceptually validated in three respects: identity of the vowel, gender of the speaker, and vocal effort. Results indicated that the speech materials were appropriate for the study. Acoustic analysis showed that F0 and F1 were highly correlated with vocal effort and varied at rates close to 5 Hz/dB for F0 and 3.5 Hz/dB for F1. Statistically, F2 and F3 did not vary significantly with vocal effort. Formant amplitudes A1, A2, and A3 increased significantly; the amplitudes in the high-frequency range increased more than those in the lower part of the spectrum, revealing a change in spectral tilt. On average, when the overall amplitude is increased by 10 dB, A1, A2, and A3 are increased by 11, 12.4, and 13 dB, respectively. Using "auditory" dimensions, such as the F1-F0 difference, and a "spectral center of gravity" between adjacent formants for representing vowel features did not reveal a better constancy of these parameters with respect to the variations of vocal effort and speaker. Thus a global view is evoked, in which all of the aspects of the signal should be processed simultaneously.

10.
Electrical field interaction caused by current spread in a cochlear implant was modeled in an explicit way in an acoustic model (the SPREAD model) presented to six listeners with normal hearing. The typical processing of cochlear implants was modeled more closely than in traditional acoustic models by careful selection of parameters related to current spread or parameters that could amplify the electrical field interactions caused by current spread. These parameters were the insertion depth, electrode spacing, electrical dynamic range, and dynamic range compression function. The hypothesis was that current spread could account for the asymptote in performance in speech intelligibility experiments observed at around seven stimulation channels in a number of cochlear implant studies. Speech intelligibility for sentences, vowels, and consonants at three noise levels (SNR of +15 dB, +10 dB, and +5 dB) was measured as a function of the number of spectral channels (4, 7, and 16). The SPREAD model appears to explain the asymptote in speech intelligibility at seven channels for all noise levels for all speech material used in this study. It is shown that the compressive amplitude mapping used in cochlear implants can have a detrimental effect on the number of effective channels.

11.
Inspired by recent evidence that a binary pattern may provide sufficient information for human speech recognition, this letter proposes a fundamentally different approach to robust automatic speech recognition. Specifically, recognition is performed by classifying binary masks corresponding to a word utterance. The proposed method is evaluated using a subset of the TIDigits corpus to perform isolated digit recognition. Despite the dramatic reduction of speech information encoded in a binary mask, the proposed system performs surprisingly well. The system is compared with a traditional HMM-based approach and is shown to perform well under low SNR conditions.

12.
Little is known about the extent to which reverberation affects speech intelligibility by cochlear implant (CI) listeners. Experiment 1 assessed CI users' performance using Institute of Electrical and Electronics Engineers (IEEE) sentences corrupted with varying degrees of reverberation. Reverberation times of 0.30, 0.60, 0.80, and 1.0 s were used. Results indicated that for all subjects tested, speech intelligibility decreased exponentially with an increase in reverberation time. A decaying-exponential model provided an excellent fit to the data. Experiment 2 evaluated (offline) a speech coding strategy for reverberation suppression using a channel-selection criterion based on the signal-to-reverberant ratio (SRR) of individual frequency channels. The SRR reflects implicitly the ratio of the energies of the signal originating from the early (and direct) reflections and the signal originating from the late reflections. Channels with SRR larger than a preset threshold were selected, while channels with SRR smaller than the threshold were zeroed out. Results in a highly reverberant scenario indicated that the proposed strategy led to substantial gains (over 60 percentage points) in speech intelligibility over the subjects' daily strategy. Further analysis indicated that the proposed channel-selection criterion reduces the temporal envelope smearing effects introduced by reverberation and also diminishes the self-masking effects responsible for flattened formants.

13.
I. Introduction. Kalman filtering is a method for statistically estimating the state of an observed system from corrupted signals; the estimate is recursive, linear, unbiased, and minimum-variance. Moreover, Kalman filtering is applicable to non-stationary signals and time-variant dynamic systems. Kalman filtering is therefore well suited to enhancing speech signals that are corrupted by noise. This paper reports a concrete method for enhancing noisy speech and its experimental results. Experiments indicate: After the s…

14.
15.
Weak consonants (e.g., stops) are more susceptible to noise than vowels, owing partially to their lower intensity. This raises the question whether hearing-impaired (HI) listeners are able to perceive (and utilize effectively) the high-frequency cues present in consonants. To answer this question, HI listeners were presented with clean (noise absent) weak consonants in otherwise noise-corrupted sentences. Results indicated that HI listeners received significant benefit in intelligibility (4 dB decrease in speech reception threshold) when they had access to clean consonant information. At extremely low signal-to-noise ratio (SNR) levels, however, HI listeners received only 64% of the benefit obtained by normal-hearing listeners. This lack of equitable benefit was investigated in Experiment 2 by testing the hypothesis that the high-frequency cues present in consonants were not audible to HI listeners. This was tested by selectively amplifying the noisy consonants while leaving the noisy sonorant sounds (e.g., vowels) unaltered. Listening tests indicated small (~10%), but statistically significant, improvements in intelligibility at low SNR conditions when the consonants were amplified in the high-frequency region. Selective consonant amplification provided reliable low-frequency acoustic landmarks that in turn facilitated a better lexical segmentation of the speech stream and contributed to the small improvement in intelligibility.

16.
Previous research has demonstrated reduced speech recognition when speech is presented at higher-than-normal levels (e.g., above conversational speech levels), particularly in the presence of speech-shaped background noise. Persons with hearing loss frequently listen to speech-in-noise at these levels through hearing aids, which incorporate multiple-channel, wide dynamic range compression. This study examined the interactive effects of signal-to-noise ratio (SNR), speech presentation level, and compression ratio on consonant recognition in noise. Nine subjects with normal hearing identified CV and VC nonsense syllables in a speech-shaped noise at two SNRs (0 and +6 dB), three presentation levels (65, 80, and 95 dB SPL) and four compression ratios (1:1, 2:1, 4:1, and 6:1). Stimuli were processed through a simulated three-channel, fast-acting, wide dynamic range compression hearing aid. Consonant recognition performance decreased as compression ratio increased and presentation level increased. Interaction effects were noted between SNR and compression ratio, as well as between presentation level and compression ratio. Performance decrements due to increases in compression ratio were larger at the better (+6 dB) SNR and at the lowest (65 dB SPL) presentation level. At higher levels (95 dB SPL), such as those experienced by persons with hearing loss, increasing compression ratio did not significantly affect speech intelligibility.
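
To make the compression ratios in this abstract concrete, the sketch below shows the textbook static input-output rule for a single compression channel. It is an illustration only: the function name `compress_level_db` and the 50 dB compression threshold are hypothetical (the abstract gives no threshold), and the attack and release dynamics of a fast-acting compressor are omitted entirely.

```python
def compress_level_db(input_db, ratio, threshold_db=50.0):
    """Static compression: above threshold_db, each `ratio` dB of input
    increase produces 1 dB of output increase; below it, gain is linear.

    threshold_db is a hypothetical value chosen for illustration.
    """
    if ratio < 1.0:
        raise ValueError("compression ratio must be >= 1")
    if input_db <= threshold_db:
        return input_db
    return threshold_db + (input_db - threshold_db) / ratio
```

With a 1:1 ratio the level passes unchanged, while higher ratios progressively flatten the level differences above the threshold.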

17.
These experiments examined how high presentation levels influence speech recognition for high- and low-frequency stimuli in noise. Normally hearing (NH) and hearing-impaired (HI) listeners were tested. In Experiment 1, high- and low-frequency bandwidths yielding 70%-correct word recognition in quiet were determined at levels associated with broadband speech at 75 dB SPL. In Experiment 2, broadband and band-limited sentences (based on passbands measured in Experiment 1) were presented at this level in speech-shaped noise filtered to the same frequency bandwidths as targets. Noise levels were adjusted to produce approximately 30%-correct word recognition. Frequency bandwidths and signal-to-noise ratios supporting criterion performance in Experiment 2 were tested at 75, 87.5, and 100 dB SPL in Experiment 3. Performance tended to decrease as levels increased. For NH listeners, this "rollover" effect was greater for high-frequency and broadband materials than for low-frequency stimuli. For HI listeners, the 75- to 87.5-dB increase improved signal audibility for high-frequency stimuli and rollover was not observed. However, the 87.5- to 100-dB increase produced qualitatively similar results for both groups: scores decreased most for high-frequency stimuli and least for low-frequency materials. Predictions of speech intelligibility by quantitative methods such as the Speech Intelligibility Index may be improved if rollover effects are modeled as frequency dependent.

18.
Beamformer performance with acoustic vector sensors in air (total citations: 1; self-citations: 0; citations by others: 1)
For some time, compact acoustic vector sensors (AVSs) capable of sensing particle velocity in three orthogonal directions have been used in underwater acoustic sensing applications. Potential advantages of using AVSs in air include substantial noise reduction with a very small aperture and few channels. For this study, a four-microphone array approximating a small (1 cm³) AVS in air was constructed using three gradient microphones and one omnidirectional microphone. This study evaluates the signal extraction performance of one nonadaptive and four adaptive beamforming algorithms. Test signals, consisting of two to five speech sources, were processed with each algorithm, and the signal extraction performance was quantified by calculating the signal-to-noise ratio (SNR) of the output. For a three-microphone array, robust and nonrobust versions of a frequency-domain minimum-variance (FMV) distortionless-response beamformer produced SNR improvements of 11 to 14 dB, and a generalized sidelobe canceller (GSC) produced improvements of 5.5 to 8.5 dB. In comparison, a two-microphone omnidirectional array with a spacing of 15 cm yielded slightly lower SNR improvements for similar multi-interferer speech signals.

19.
The amount of masking exerted by one speech sound on another can be reduced by presenting the masker twice, from two different locations in the horizontal plane, with one of the presentations delayed to simulate an acoustical reflection. Three experiments were conducted on various aspects of this phenomenon. Experiment 1 varied the number of masking talkers from one to three and the signal-to-noise (S/N) ratio from -12 to +4 dB. Evidence of masking release was found for every combination of these variables tested. For the most difficult conditions (multiple maskers and negative S/N) the amount of release was approximately 10 dB. Experiment 2 varied the timing of leading and lagging masker presentations over a broad range, to include shorter delay times where room reflections of speech are rarely noticed by listeners and longer delays where reflections can become disruptive. Substantial masking release was found for all of the shorter delay times tested, and negligible release was found at the longer delays. Finally, Experiment 3 used speech-spectrum noise as a masker and searched for possible energetic masking release as a function of the lead-lag time delay. Release of up to 4 dB was found whenever delays were 2 ms or less. No energetic masking release was found at longer delays.

20.
The ability of hearing-impaired (HI) listeners to use high-rate envelope information in a competing-talker situation was assessed. In experiment 1, signals were tone vocoded and the cutoff frequency (f(c)) of the envelope extraction filter was either 50 Hz (E filter) or 200 Hz (P filter). The channels for which the P or E filter was used were varied. Intelligibility was higher with the P filter regardless of whether it was used for low or high center frequencies. Performance was best when the P filter was used for all channels. Experiment 2 explored the dynamic range over which HI listeners made use of high-rate cues. In each channel of a vocoder, the envelope extracted using f(c) = 16 Hz was replaced by the envelope extracted using f(c) = 300 Hz, either at the peaks or valleys, with a parametrically varied "switching threshold." For a target-to-background ratio of +5 dB, changes in speech intelligibility occurred mainly when the switching threshold was between -8 and +8 dB relative to the channel root-mean-square level. This range is similar in width to, but about 3 dB higher in absolute level than, that found for normal-hearing listeners, despite the reduced dynamic range of the HI listeners.


Copyright©北京勤云科技发展有限公司  京ICP备09084417号