Related Articles
20 related articles found (search time: 281 ms).
1.
The classic [MN55] confusion matrix experiment (16 consonants, white noise masker) was repeated using computerized procedures similar to those of Phatak and Allen (2007) ["Consonant and vowel confusions in speech-weighted noise," J. Acoust. Soc. Am. 121, 2312-2316]. The consonant scores in white noise can be categorized in three sets: a low-error set [/m/, /n/], an average-error set [/p/, /t/, /k/, /s/, /ʃ/, /d/, /g/, /z/, /ʒ/], and a high-error set [/f/, /θ/, /b/, /v/, /ð/]. The consonant confusions match those from MN55, except for the highly asymmetric voicing confusions of fricatives, biased in favor of voiced consonants. Masking noise can not only reduce the recognition of a consonant, but also perceptually morph it into another consonant. There is a significant and systematic variability in the scores and confusion patterns of different utterances of the same consonant, which can be characterized as (a) confusion heterogeneity, where the competitors in the confusion groups of a consonant vary, and (b) threshold variability, where the confusion threshold [i.e., the signal-to-noise ratio (SNR) and score at which the confusion group is formed] varies. The average consonant error, and the errors for most of the individual consonants and consonant sets, can be approximated as exponential functions of the articulation index (AI). An AI based on the peak-to-rms ratios of speech can explain the SNR differences across experiments.
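The exponential-in-AI error model invoked in this abstract (and in items 2 and 3 below) is compact enough to sketch. The snippet below is a rough Python illustration rather than the papers' fitted model: it assumes a chance error of 15/16 for a 16-alternative task, a hypothetical residual error e_min at AI = 1, and a simple linear SNR-to-AI mapping in place of the band-weighted AI the papers actually compute.

```python
import numpy as np

def ai_from_snr(snr_db, snr_floor=-20.0, snr_ceil=10.0):
    """Map wideband SNR (dB) to an articulation index in [0, 1]; the
    linear mapping and its endpoints are illustrative assumptions, not
    the band-weighted AI computed in the papers."""
    return float(np.clip((snr_db - snr_floor) / (snr_ceil - snr_floor), 0.0, 1.0))

def consonant_error(ai, e_min=0.015, e_chance=15.0 / 16.0):
    """Exponential-in-AI error: chance error (15/16 for 16 alternatives)
    at AI = 0, decaying to a small residual error e_min at AI = 1."""
    return e_min ** ai * e_chance ** (1.0 - ai)

for snr in (-18, -12, -6, 0, 6):
    ai = ai_from_snr(snr)
    print(f"SNR {snr:+3d} dB -> AI {ai:.2f} -> predicted error {consonant_error(ai):.3f}")
```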

2.
Studies on consonant perception under noise conditions typically describe the average consonant error as exponential in the Articulation Index (AI). While this AI formula nicely fits the average error over all consonants, it does not fit the error for any consonant at the utterance level. This study analyzes the error patterns of six stop consonants /p, t, k, b, d, g/ with four vowels (/ɑ/, /ɛ/, /ɪ/, /æ/) at the individual consonant (i.e., utterance) level. The findings include that the utterance error is essentially zero for signal-to-noise ratios (SNRs) of at least -2 dB for >78% of the stop consonant utterances. For these utterances, the error is essentially a step function of SNR at the utterance's detection threshold. This binary error dependence is consistent with the audibility of a single binary defining acoustic feature, having zero error above the feature's detection threshold. A further 11% of the sounds have high error, defined as ≥20% for SNRs of at least -2 dB. A grand average across many such sounds, having a natural distribution in thresholds, results in the error being exponential in the AI measure, as observed. A detailed analysis of the variance from the AI error is provided, along with a Bernoulli-trials analysis of the statistical significance.
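The paper's central observation, that individual utterances show step-function errors at their detection thresholds while the grand average looks exponential, is easy to reproduce numerically. In this sketch the normal distribution of thresholds is a hypothetical stand-in for the natural distribution the paper describes, not a fit to its data.

```python
import numpy as np

rng = np.random.default_rng(0)
snr_grid = np.linspace(-24.0, 6.0, 61)              # test SNRs in dB

# Hypothetical per-utterance detection thresholds (dB), normally distributed.
thresholds = rng.normal(loc=-10.0, scale=5.0, size=500)

# Each utterance: error is 1 below its threshold and 0 above (a step function).
per_utterance = (snr_grid[None, :] < thresholds[:, None]).astype(float)

# The grand average over utterances is a smooth, roughly exponential curve.
grand_average = per_utterance.mean(axis=0)
for snr, err in zip(snr_grid[::10], grand_average[::10]):
    print(f"SNR {snr:+5.1f} dB  average error {err:.2f}")
```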

3.
This paper presents the results of a closed-set recognition task for 64 consonant-vowel sounds (16 C × 4 V, spoken by 18 talkers) in speech-weighted noise (-22, -20, -16, -10, and -2 dB) and in quiet. The confusion matrices were generated using responses of a homogeneous set of ten listeners, and the confusions were analyzed using a graphical method. In speech-weighted noise the consonants separate into three sets: a low-scoring set C1 (/f/, /θ/, /v/, /ð/, /b/, /m/), a high-scoring set C2 (/t/, /s/, /z/, /ʃ/, /ʒ/), and a set C3 (/n/, /p/, /g/, /k/, /d/) with intermediate scores. The perceptual consonant groups are C1: {/f/-/θ/, /b/-/v/-/ð/, /θ/-/ð/}, C2: {/s/-/z/, /ʃ/-/ʒ/}, and C3: /m/-/n/, while the perceptual vowel groups are /ɑ/-/æ/ and /ɛ/-/ɪ/. The exponential articulation index (AI) model for consonant score works for 12 of the 16 consonants, using a refined expression of the AI. Finally, a comparison with past work shows that white noise masks the consonants more uniformly than speech-weighted noise, and that the AI, because it can account for the differences in noise spectra, is a better measure than the wideband signal-to-noise ratio for modeling and comparing the scores with different noise maskers.

4.
A glimpsing model of speech perception in noise
Do listeners process noisy speech by taking advantage of "glimpses," spectrotemporal regions in which the target signal is least affected by the background? This study used an automatic speech recognition system, adapted for use with partially specified inputs, to identify consonants in noise. Twelve masking conditions were chosen to create a range of glimpse sizes. Several different glimpsing models were employed, differing in the local signal-to-noise ratio (SNR) used for detection, the minimum glimpse size, and the use of information in the masked regions. Recognition results were compared with behavioral data. A quantitative analysis demonstrated that the proportion of the time-frequency plane glimpsed is a good predictor of intelligibility. Recognition scores in each noise condition confirmed that sufficient information exists in glimpses to support consonant identification. Close fits to listeners' performance were obtained at two local SNR thresholds: one at around 8 dB and another in the range -5 to -2 dB. A transmitted information analysis revealed that cues to voicing are degraded more in the model than in human auditory processing.
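The headline predictor, the glimpsed proportion of the time-frequency plane, reduces to a few lines. The sketch below ignores the minimum-glimpse-size constraint the study also manipulated, and the exponential toy spectrograms merely stand in for real short-time power spectra.

```python
import numpy as np

def glimpse_proportion(speech_power, noise_power, local_snr_db=-5.0):
    """Fraction of time-frequency cells whose local SNR exceeds the
    detection threshold -- the intelligibility predictor described above.
    Inputs are same-shaped power spectrograms (frequency x time)."""
    local_snr = 10.0 * np.log10(speech_power / np.maximum(noise_power, 1e-12))
    return float(np.mean(local_snr > local_snr_db))

# Toy spectrograms: exponential power cells at roughly 0 dB overall SNR.
rng = np.random.default_rng(1)
s = rng.exponential(1.0, size=(64, 200))
n = rng.exponential(1.0, size=(64, 200))
print(f"glimpsed proportion at -5 dB local SNR: {glimpse_proportion(s, n):.2f}")
```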

5.
Weak consonants (e.g., stops) are more susceptible to noise than vowels, owing partially to their lower intensity. This raises the question whether hearing-impaired (HI) listeners are able to perceive (and utilize effectively) the high-frequency cues present in consonants. To answer this question, HI listeners were presented with clean (noise absent) weak consonants in otherwise noise-corrupted sentences. Results indicated that HI listeners received significant benefit in intelligibility (4 dB decrease in speech reception threshold) when they had access to clean consonant information. At extremely low signal-to-noise ratio (SNR) levels, however, HI listeners received only 64% of the benefit obtained by normal-hearing listeners. This lack of equitable benefit was investigated in Experiment 2 by testing the hypothesis that the high-frequency cues present in consonants were not audible to HI listeners. This was tested by selectively amplifying the noisy consonants while leaving the noisy sonorant sounds (e.g., vowels) unaltered. Listening tests indicated small (~10%), but statistically significant, improvements in intelligibility at low SNR conditions when the consonants were amplified in the high-frequency region. Selective consonant amplification provided reliable low-frequency acoustic landmarks that in turn facilitated a better lexical segmentation of the speech stream and contributed to the small improvement in intelligibility.

6.
The evaluation of the intelligibility of noise reduction algorithms is reported. IEEE sentences and consonants were corrupted by four types of noise (babble, car, street, and train) at two signal-to-noise ratio levels (0 and 5 dB), and then processed by eight speech enhancement methods encompassing four classes of algorithms: spectral subtractive, subspace, statistical-model-based, and Wiener-type algorithms. The enhanced speech was presented to normal-hearing listeners for identification. With the exception of a single noise condition, no algorithm produced significant improvements in speech intelligibility. Information transmission analysis of the consonant confusion matrices indicated that no algorithm significantly improved the place feature score, which is critically important for speech recognition. The algorithms found in previous studies to perform best in terms of overall quality were not the same algorithms that performed best in terms of speech intelligibility. The subspace algorithm, for instance, was previously found to perform the worst in terms of overall quality, but performed well in the present study in terms of preserving speech intelligibility. Overall, the analysis of consonant confusion matrices suggests that in order for noise reduction algorithms to improve speech intelligibility, they need to improve the place and manner feature scores.
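The information transmission analysis mentioned here (and again in item 20) follows Miller and Nicely: collapse the confusion matrix onto feature groups, then compute the mutual information of the grouped matrix. The 4-consonant matrix and place labels below are hypothetical, for illustration only.

```python
import numpy as np

def transmitted_information(confusions, labels):
    """Miller-Nicely-style feature analysis: collapse a consonant confusion
    matrix onto feature groups (e.g., place), then return the mutual
    information of the grouped matrix in bits."""
    groups = sorted(set(labels))
    idx = {g: [i for i, lab in enumerate(labels) if lab == g] for g in groups}
    m = np.array([[confusions[np.ix_(idx[a], idx[b])].sum()
                   for b in groups] for a in groups], dtype=float)
    p = m / m.sum()
    px, py = p.sum(axis=1), p.sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0.0, p * np.log2(p / np.outer(px, py)), 0.0)
    return terms.sum()

# Hypothetical 4-consonant confusion matrix with place-of-articulation labels.
conf = np.array([[80, 10, 5, 5],
                 [12, 78, 6, 4],
                 [6, 4, 75, 15],
                 [5, 5, 20, 70]])
place = ["bilabial", "bilabial", "alveolar", "alveolar"]
print(f"place feature: {transmitted_information(conf, place):.3f} bits")
```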

7.
Speech waveform envelope cues for consonant recognition
This study investigated the cues for consonant recognition that are available in the time-intensity envelope of speech. Twelve normal-hearing subjects listened to three sets of spectrally identical noise stimuli created by multiplying noise with the speech envelopes of 19 /aCa/ natural-speech nonsense syllables. The speech envelope for each of the three noise conditions was derived using a different low-pass filter cutoff (20, 200, and 2000 Hz). Average consonant identification performance was above chance for the three noise conditions and improved significantly with the increase in envelope bandwidth from 20 to 200 Hz. SINDSCAL multidimensional scaling analysis of the consonant confusion data identified three speech envelope features that divided the 19 consonants into four envelope feature groups ("envemes"). The enveme groups in combination with visually distinctive speech feature groupings ("visemes") can distinguish most of the 19 consonants. These results suggest that near-perfect consonant identification performance could be attained by subjects who receive only enveme and viseme information and no spectral information.
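A minimal recreation of this stimulus type, spectrally flat noise carrying only the speech's time-intensity envelope, might look like the following; the Hilbert-envelope-plus-Butterworth chain is an assumed implementation, not necessarily the one used in the study.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def envelope_noise(speech, fs, cutoff_hz=200.0):
    """White noise multiplied by the speech's time-intensity envelope,
    with the envelope smoothed to the given low-pass cutoff (20, 200,
    or 2000 Hz in the study)."""
    env = np.abs(hilbert(speech))                  # instantaneous envelope
    b, a = butter(4, cutoff_hz / (fs / 2), "low")  # limit envelope bandwidth
    env = np.maximum(filtfilt(b, a, env), 0.0)
    noise = np.random.default_rng(0).standard_normal(len(speech))
    return env * noise

# Toy /aCa/-like input: an AM tone at a 16 kHz sample rate.
fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 150 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t))
stim = envelope_noise(speech, fs, cutoff_hz=200.0)
```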

8.
Nonlinear sensory and neural processing mechanisms have been exploited to enhance spectral contrast for improvement of speech understanding in noise. The "companding" algorithm employs both two-tone suppression and adaptive gain mechanisms to achieve spectral enhancement. This study implemented a 50-channel companding strategy and evaluated its efficiency as a front-end noise suppression technique in cochlear implants. The key parameters were identified and evaluated to optimize the companding performance. Both normal-hearing (NH) listeners and cochlear-implant (CI) users performed phoneme and sentence recognition tests in quiet and in steady-state speech-shaped noise. Data from the NH listeners showed that for noise conditions, the implemented strategy improved vowel perception but not consonant and sentence perception. However, the CI users showed significant improvements in both phoneme and sentence perception in noise. Maximum average improvement for vowel recognition was 21.3 percentage points (p<0.05) at 0 dB signal-to-noise ratio (SNR), followed by 17.7 percentage points (p<0.05) at 5 dB SNR for sentence recognition and 12.1 percentage points (p<0.05) at 5 dB SNR for consonant recognition. While the observed results could be attributed to the enhanced spectral contrast, it is likely that the corresponding temporal changes caused by companding also played a significant role and should be addressed by future studies.

9.
Speech recognition in noise improves with combined acoustic and electric stimulation compared to electric stimulation alone [Kong et al., J. Acoust. Soc. Am. 117, 1351-1361 (2005)]. Here the contribution of fundamental frequency (F0) and low-frequency phonetic cues to speech recognition in combined hearing was investigated. Normal-hearing listeners heard vocoded speech in one ear and low-pass (LP) filtered speech in the other. Three listening conditions (vocoder-alone, LP-alone, and combined) were investigated. Target speech (average F0 = 120 Hz) was mixed with a time-reversed masker (average F0 = 172 Hz) at three signal-to-noise ratios (SNRs). LP speech aided performance at all SNRs. Low-frequency phonetic cues were then removed by replacing the LP speech with an LP equal-amplitude harmonic complex, frequency and amplitude modulated by the F0 and temporal envelope of voiced segments of the target. The combined-hearing advantage disappeared at 10 and 15 dB SNR, but persisted at 5 dB SNR. A similar finding occurred when, additionally, F0 contour cues were removed. These results are consistent with a role for low-frequency phonetic cues, but not with a combination of F0 information between the two ears. The enhanced performance at 5 dB SNR with F0 contour cues absent suggests that voicing or glimpsing cues may be responsible for the combined-hearing benefit.

10.
Cochlear implant users receive limited spectral and temporal information, and their speech recognition deteriorates dramatically in noise. The aim of the present study was to determine the relative contributions of spectral and temporal cues to speech recognition in noise. Spectral information was manipulated by varying the number of channels from 2 to 32 in a noise-excited vocoder. Temporal information was manipulated by varying the low-pass cutoff frequency of the envelope extractor from 1 to 512 Hz. Ten normal-hearing, native speakers of English participated in tests of phoneme recognition using vocoder-processed consonants and vowels under three conditions (quiet, and +6 and 0 dB signal-to-noise ratios). The number of channels required for vowel-recognition performance to plateau increased from 12 in quiet to 16-24 in the two noise conditions. However, for consonant recognition, no further improvement in performance was evident when the number of channels was ≥12 in any of the three conditions. The contribution of temporal cues to phoneme recognition showed a similar pattern in both quiet and noise conditions. As in quiet, there was a trade-off between temporal and spectral cues for phoneme recognition in noise.
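A bare-bones noise-excited vocoder exposing the two manipulated parameters, channel count and envelope low-pass cutoff, can be sketched as below. The log-spaced band edges, filter orders, and 16 kHz example rate are illustrative assumptions, not the study's processing chain.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def noise_vocoder(x, fs, n_channels=8, env_cutoff_hz=160.0):
    """Minimal noise-excited vocoder: band-pass analysis, envelope
    extraction with a variable low-pass cutoff, and noise carriers
    re-filtered into each band. Log-spaced edges, 100 Hz to 7 kHz."""
    edges = np.logspace(np.log10(100.0), np.log10(7000.0), n_channels + 1)
    be, ae = butter(2, env_cutoff_hz / (fs / 2), "low")
    rng = np.random.default_rng(0)
    out = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], "bandpass")
        band = filtfilt(b, a, x)                                  # analysis band
        env = np.maximum(filtfilt(be, ae, np.abs(hilbert(band))), 0.0)
        carrier = filtfilt(b, a, rng.standard_normal(len(x)))     # band noise
        out += env * carrier                                      # modulate
    return out

# Toy input: an AM tone standing in for speech, at a 16 kHz sample rate.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220 * t) * (0.6 + 0.4 * np.sin(2 * np.pi * 3 * t))
y = noise_vocoder(x, fs, n_channels=8, env_cutoff_hz=160.0)
```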

11.
Voice onset time (VOT) signifies the interval between consonant onset and the start of rhythmic vocal-cord vibrations. Differential perception of consonants such as /d/ and /t/ is categorical in American English, with the boundary generally lying at a VOT of 20-40 ms. This study tests whether previously identified response patterns that differentially reflect VOT are maintained in large-scale population activity within primary auditory cortex (A1) of the awake monkey. Multiunit activity and current source density patterns evoked by the syllables /da/ and /ta/ with variable VOTs are examined. Neural representation is determined by the tonotopic organization. Differential response patterns are restricted to lower best-frequency regions. Response peaks time-locked to both consonant and voicing onsets are observed for syllables with a 40- and 60-ms VOT, whereas syllables with a 0- and 20-ms VOT evoke a single response time-locked only to consonant onset. Duration of aspiration noise is represented in higher best-frequency regions. Representation of VOT and aspiration noise in discrete tonotopic areas of A1 suggests that integration of these phonetic cues occurs in secondary areas of auditory cortex. Findings are consistent with the evolving concept that complex stimuli are encoded by synchronized activity in large-scale neuronal ensembles.

12.
The purpose of this study was to determine the role of static, dynamic, and integrated cues for perception in three adult age groups, and to determine whether age has an effect on both consonant and vowel perception, as predicted by the "age-related deficit hypothesis." Eight adult subjects in each of the age ranges of young (ages 20-26), middle-aged (ages 52-59), and old (ages 70-76) listened to synthesized syllables composed of combinations of [b d g] and [i u a]. The synthesis parameters included manipulations of the following stimulus variables: formant transition (moving or straight), noise burst (present or absent), and voicing duration (10, 30, or 46 ms). Vowel perception was high across all conditions and there were no significant differences among age groups. Consonant identification showed a definite effect of age. Young and middle-aged adults were significantly better than older adults at identifying consonants from secondary cues only. Older adults relied on the integration of static and dynamic cues to a greater extent than young and middle-aged listeners for identification of place of articulation of stop consonants. Duration facilitated correct stop-consonant identification in the young and middle-aged groups for the no-burst conditions, but not in the old group. These findings for the duration of stop-consonant transitions indicate reductions in processing speed with age. In general, the results did not support the age-related deficit hypothesis for adult identification of vowels and consonants from dynamic spectral cues.

13.
The purpose of this study was to evaluate the efficiency of three acoustic modifications derived from clear speech for improving consonant recognition by young and elderly normal-hearing subjects. Percent-correct nonsense syllable recognition was measured for four stimulus sets: unmodified stimuli; stimuli with consonant duration increased by 100%; stimuli with consonant-vowel ratio increased by 10 dB; and stimuli with both consonant duration and consonant-vowel ratio increased. Analyses of overall nonsense syllable recognition, consonant feature recognition, and consonant confusion patterns demonstrated that the consonant-vowel ratio increase produced better performance than the other acoustic modifications for both subject groups. However, elderly subjects exhibited poorer performance than young subjects in most conditions. These results and their implications are discussed.
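The two acoustic modifications are simple waveform operations once the consonant-vowel boundary is known. The helpers below are hypothetical illustrations that assume hand-marked boundaries; the original study's waveform editing was presumably more careful around segment transitions.

```python
import numpy as np

def increase_cvr(syllable, c_end, gain_db=10.0):
    """Raise the consonant-vowel intensity ratio by amplifying the consonant
    segment, samples [0, c_end); the 10 dB default matches the most
    effective modification reported above."""
    out = np.asarray(syllable, dtype=float).copy()
    out[:c_end] *= 10.0 ** (gain_db / 20.0)
    return out

def double_consonant_duration(syllable, c_end):
    """Crude 100% consonant-duration increase by repeating the consonant
    segment."""
    x = np.asarray(syllable, dtype=float)
    return np.concatenate([x[:c_end], x[:c_end], x[c_end:]])
```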

14.
Previous research has demonstrated reduced speech recognition when speech is presented at higher-than-normal levels (e.g., above conversational speech levels), particularly in the presence of speech-shaped background noise. Persons with hearing loss frequently listen to speech-in-noise at these levels through hearing aids, which incorporate multiple-channel, wide dynamic range compression. This study examined the interactive effects of signal-to-noise ratio (SNR), speech presentation level, and compression ratio on consonant recognition in noise. Nine subjects with normal hearing identified CV and VC nonsense syllables in a speech-shaped noise at two SNRs (0 and +6 dB), three presentation levels (65, 80, and 95 dB SPL) and four compression ratios (1:1, 2:1, 4:1, and 6:1). Stimuli were processed through a simulated three-channel, fast-acting, wide dynamic range compression hearing aid. Consonant recognition performance decreased as compression ratio increased and presentation level increased. Interaction effects were noted between SNR and compression ratio, as well as between presentation level and compression ratio. Performance decrements due to increases in compression ratio were larger at the better (+6 dB) SNR and at the lowest (65 dB SPL) presentation level. At higher levels (95 dB SPL), such as those experienced by persons with hearing loss, increasing compression ratio did not significantly affect speech intelligibility.
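One channel of a fast-acting wide-dynamic-range compressor can be sketched as follows; for the three-channel aid simulated above, the signal would first be band-split and each band compressed separately. The envelope smoothing, the threshold (here in dB re full scale for a normalized signal), and the other settings are illustrative assumptions, not the study's parameters.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def wdrc_channel(x, fs, ratio=2.0, threshold_db=-30.0, env_cutoff_hz=50.0):
    """One fast-acting WDRC channel: estimate a smoothed level, then above
    the compression threshold apply gain so that input-level changes are
    divided by `ratio` at the output."""
    b, a = butter(2, env_cutoff_hz / (fs / 2), "low")
    env = np.maximum(filtfilt(b, a, np.abs(x)), 1e-9)     # smoothed envelope
    level_db = 20.0 * np.log10(env)
    # Static curve: out = thr + (in - thr) / ratio above the threshold.
    gain_db = np.where(level_db > threshold_db,
                       (threshold_db - level_db) * (1.0 - 1.0 / ratio), 0.0)
    return x * 10.0 ** (gain_db / 20.0)
```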

15.
This study investigated the effect of five speech processing parameters, currently employed in cochlear implant processors, on speech understanding. Experiment 1 examined speech recognition as a function of stimulation rate in six Med-El/CIS-Link cochlear implant listeners. Results showed that higher stimulation rates (2100 pulses/s) produced significantly higher performance on word and consonant recognition than lower stimulation rates (<800 pulses/s). The effect of stimulation rate on consonant recognition was highly dependent on the vowel context. The largest benefit was noted for consonants in the /uCu/ and /iCi/ contexts, while the smallest benefit was noted for consonants in the /aCa/ context. This finding suggests that the /aCa/ consonant test, which is widely used today, is not sensitive enough to parametric variations of implant processors. Experiment 2 examined vowel and consonant recognition as a function of pulse width for low-rate (400 and 800 pps) implementations of the CIS strategy. For the 400-pps condition, wider pulse widths (208 μs/phase) produced significantly higher performance on consonant recognition than shorter pulse widths (40 μs/phase). Experiments 3-5 examined vowel and consonant recognition as a function of the filter overlap in the analysis filters, the shape of the amplitude mapping function, and the signal bandwidth. Results showed that the amount of filter overlap (ranging from -20 to -60 dB/oct) and the signal bandwidth (ranging from 6.7 to 9.9 kHz) had no effect on phoneme recognition. The shape of the amplitude mapping functions (ranging from strongly compressive to weakly compressive) had only a minor effect on performance, with the lowest performance obtained for nearly linear mapping functions. Of the five speech processing parameters examined in this study, the pulse rate and the pulse width had the largest (positive) effect on speech recognition. For a fixed pulse width, higher rates (2100 pps) of stimulation provided significantly better performance on word recognition than lower rates (<800 pps). High performance was also achieved by jointly varying the pulse rate and pulse width. These results indicate that audiologists can optimize an implant listener's performance either by increasing the pulse rate or by jointly varying the pulse rate and pulse width.

16.
Children 5, 9, and 11 years of age and young adults attempted to identify the final word of sentences recorded by a female speaker. The sentences were presented in two levels of multitalker babble, and participants responded by selecting one of four pictures. In a low-noise condition, the signal-to-noise ratio (SNR) was adjusted for each age group to yield 85% correct performance. In a high-noise condition, the SNR was set 7 dB lower than the low-noise condition. Although children required more favorable SNRs than adults to achieve comparable performance in low noise, an equivalent decrease in SNR had comparable consequences for all age groups. Thus age-related differences on this task can be attributed primarily to sensory factors.

17.
This study examined the degree to which masker-spectral variability contributes to children's susceptibility to informational masking. Listeners were younger children (5-7 years), older children (8-10 years), and adults (19-34 years). Masked thresholds were measured using a 2IFC, adaptive procedure for a 300-ms, 1000-Hz signal presented simultaneously with (1) broadband noise, (2) a random-frequency ten-tone complex, or (3) a fixed-frequency ten-tone complex. Maskers were presented at an overall level of 60 dB SPL. Thresholds were similar across age for the noise condition. For both ten-tone conditions, however, thresholds for most children were higher than for most adults. The average difference in threshold between the random and fixed ten-tone conditions was comparable across age, suggesting a similar effect of reducing masker-spectral variability in children and adults. Children appear more susceptible to informational masking than adults, however, both with and without masker-spectral variability. The addition of a masker fringe (delayed onset of the signal relative to the masker) provided a release from masking for the fixed and random ten-tone conditions in all age groups, suggesting that at least part of the masking observed for both ten-tone maskers was informational.

18.
Four normal-hearing young adults were extensively trained in the use of a tactile speech-transmission system. Subjects were tested in the recognition of various phonetic elements, including vowels and stop, nasal, and fricative consonants, under three receiving conditions: visual reception alone (lipreading), tactile reception alone, and tactile plus visual reception. Subjects were artificially deafened using earplugs and white noise, and all speech tokens were presented live voice. Analysis of the data demonstrates that the tactile transform enables receivers to achieve excellent recognition of vowels in CVC context and of the consonantal features of voicing and nasality. This, in combination with high recognition of vowels and the consonantal feature place of articulation through visual receptors, leads to recognition performance in the combined condition (visual plus tactile) that far exceeds either reception condition in isolation.

19.
Listeners identified a phonetically balanced set of consonant-vowel-consonant (CVC) words and nonsense syllables in noise at four signal-to-noise ratios. The identification scores for phonemes and syllables were analyzed using the j-factor model [Boothroyd and Nittrouer, J. Acoust. Soc. Am. 84, 101-114 (1988)], which measures the perceptual independence of the parts of a whole. Results indicate that nonsense CVC syllables are perceived as having three independent phonemes, while words show j = 2.34 independent units. Among the words, high-frequency words are perceived as having significantly fewer independent units than low-frequency words. Words with dense phonetic neighborhoods are perceived as having 0.5 more independent units than words with sparse neighborhoods. The neighborhood effect in these data is due almost entirely to density as determined by the initial consonant and vowel, demonstrated in analyses by subjects and items, and correlation analyses of syllable recognition with the neighborhood activation model [Luce and Pisoni, Ear Hear. 19, 1-36 (1998)]. The j factors are interpreted as measuring increased efficiency of the perception of word-final consonants of words in sparse neighborhoods during spoken word recognition.
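The j-factor computation itself is a one-liner: since P_whole = P_part^j for j independent parts, j = log(P_whole) / log(P_part). The probabilities below are illustrative, chosen to reproduce a three-phoneme nonsense syllable and the word-like j ≈ 2.3 reported above.

```python
import numpy as np

def j_factor(p_whole, p_part):
    """Boothroyd-Nittrouer j factor: the number of perceptually
    independent units, from P_whole = P_part ** j."""
    return np.log(p_whole) / np.log(p_part)

print(j_factor(p_whole=0.512, p_part=0.80))  # 3.0: CVC-like, 3 independent phonemes
print(j_factor(p_whole=0.65, p_part=0.83))   # ~2.3: word-like, fewer than 3 units
```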

20.
Consonant recognition in quiet using the Nonsense Syllable Test (NST) [Resnick et al., J. Acoust. Soc. Am. Suppl. 1 58, S114 (1975)] was investigated in 62 normal-hearing subjects 20 to 65 years of age at their most comfortable listening levels (MCLs) and at 8 dB above and below MCL. Although overall consonant recognition performance was high (as expected for normal listeners), the effects of age decade, relative presentation level, and NST subset were all significant, as was the interaction of age × level. The interactions of age × NST subset and age × subset × level were nonsignificant. These findings suggest that consonant recognition decreases with normal aging, particularly below MCL. However, the relative perceptual difficulty of the seven subtests is the same across age groups. Confusion matrices were similar across levels and age groups. Percent information transmitted for several consonant features was calculated from the confusion matrices. Older subjects showed decrements in performance primarily for the features recognized relatively less accurately by the younger subjects. The results suggest that normal-hearing older individuals listening in quiet have decreased consonant recognition ability, but that their confusions are similar to those of younger persons.
