20 similar records found (search time: 15 ms)
1.
R V Shannon F G Zeng J Wygonski 《The Journal of the Acoustical Society of America》1998,104(4):2467-2476
Recognition of consonants, vowels, and sentences was measured in conditions of reduced spectral resolution and distorted spectral distribution of temporal envelope cues. Speech materials were processed through four bandpass filters (analysis bands), half-wave rectified, and low-pass filtered to extract the temporal envelope from each band. The envelope from each speech band modulated a band-limited noise (carrier bands). Analysis and carrier bands were manipulated independently to alter the spectral distribution of envelope cues. Experiment I demonstrated that the location of the cutoff frequencies defining the bands was not a critical parameter for speech recognition, as long as the analysis and carrier bands were matched in frequency extent. Experiment II demonstrated a dramatic decrease in performance when the analysis and carrier bands did not match in frequency extent, which resulted in a warping of the spectral distribution of envelope cues. Experiment III demonstrated a large decrease in performance when the carrier bands were shifted in frequency, mimicking the basal position of electrodes in a cochlear implant. Experiment IV showed a relatively minor effect of the overlap in the noise carrier bands, simulating the overlap in neural populations responding to adjacent electrodes in a cochlear implant. Overall, these results show that, for four bands, the frequency alignment of the analysis bands and carrier bands is critical for good performance, while the exact frequency divisions and overlap in carrier bands are not as critical.
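The band-by-band processing chain described above (band-pass, half-wave rectify, low-pass, modulate a noise carrier) can be sketched for one channel as follows; filter orders, the 160-Hz envelope cutoff, and the test band edges are illustrative assumptions, not the paper's exact values:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def vocode_band(speech, fs, band, env_cutoff=160.0, rng=None):
    """One channel of a noise-excited vocoder: band-pass (analysis band),
    half-wave rectify, low-pass to extract the temporal envelope, then use
    the envelope to modulate band-limited noise (carrier band)."""
    rng = np.random.default_rng(0) if rng is None else rng
    band_sos = butter(3, [band[0] / (fs / 2), band[1] / (fs / 2)],
                      btype="band", output="sos")
    analysis = sosfiltfilt(band_sos, speech)
    rectified = np.maximum(analysis, 0.0)          # half-wave rectification
    env_sos = butter(2, env_cutoff / (fs / 2), output="sos")
    envelope = sosfiltfilt(env_sos, rectified)
    # Carrier band matched in frequency extent to the analysis band,
    # the alignment Experiments I and II show to be critical.
    carrier = sosfiltfilt(band_sos, rng.standard_normal(len(speech)))
    return envelope * carrier

fs = 16000
t = np.arange(fs) / fs                              # 1-s synthetic test signal
speech = np.sin(2 * np.pi * 500 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))
out = vocode_band(speech, fs, (300.0, 800.0))
```

Shifting or warping the carrier band relative to the analysis band (Experiments II and III) would amount to passing a different `band` to the carrier filter than to the analysis filter.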
2.
van der Horst R Leeuw AR Dreschler WA 《The Journal of the Acoustical Society of America》1999,105(3):1801-1809
The role of different modulation frequencies in the speech envelope was studied by means of the manipulation of vowel-consonant-vowel (VCV) syllables. The envelope of the signal was extracted from the speech and the fine-structure was replaced by speech-shaped noise. The temporal envelopes in every critical band of the speech signal were notch filtered in order to assess the relative importance of different modulation frequency regions between 0 and 20 Hz. For this purpose notch filters around three center frequencies (8, 12, and 16 Hz) with three different notch widths (4-, 8-, and 12-Hz wide) were used. These stimuli were used in a consonant-recognition task in which ten normal-hearing subjects participated, and their results were analyzed in terms of recognition scores. More qualitative information was obtained with a multidimensional scaling method (INDSCAL) and sequential information analysis (SINFA). Consonant recognition is very robust for the removal of certain modulation frequency areas. Only when a wide notch around 8 Hz is applied does the speech signal become heavily degraded. As expected, the voicing information is lost, while there are different effects on plosiveness and nasality. Even the smallest filtering has a substantial effect on the transfer of the plosiveness feature, while on the other hand, filtering out only the low-modulation frequencies has a substantial effect on the transfer of nasality cues.
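Notch filtering a temporal envelope in the modulation-frequency domain, as in the conditions above, can be approximated with a band-stop filter applied to the envelope signal; the Butterworth design and the envelope sampling rate here are assumptions for illustration, not the authors' filter:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def notch_envelope(envelope, fs_env, center, width):
    """Band-stop ('notch') filter the temporal envelope around a modulation
    frequency, e.g. a 4-Hz-wide notch centered at 8 Hz."""
    lo = max(center - width / 2, 0.5)
    hi = center + width / 2
    sos = butter(2, [lo / (fs_env / 2), hi / (fs_env / 2)],
                 btype="bandstop", output="sos")
    return sosfiltfilt(sos, envelope)

fs_env = 100.0                         # envelope sampling rate (Hz), assumed
t = np.arange(0, 2, 1 / fs_env)
# Toy envelope with modulation components at 2 Hz and 8 Hz
env = 1 + 0.5 * np.sin(2 * np.pi * 8 * t) + 0.5 * np.sin(2 * np.pi * 2 * t)
filtered = notch_envelope(env, fs_env, center=8.0, width=4.0)

spec_in = np.abs(np.fft.rfft(env))
spec_out = np.abs(np.fft.rfft(filtered))
```

The spectra confirm the intended behavior: the 8-Hz modulation component is strongly attenuated while the 2-Hz component, outside the notch, passes largely unchanged.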
3.
Effects of hearing loss on utilization of short-duration spectral cues in stop consonant recognition
J R Dubno D D Dirks A B Schaefer 《The Journal of the Acoustical Society of America》1987,81(6):1940-1947
The purpose of this experiment was to evaluate the utilization of short-term spectral cues for recognition of initial plosive consonants (/b,d,g/) by normal-hearing and by hearing-impaired listeners differing in audiometric configuration. Recognition scores were obtained for these consonants paired with three vowels (/a,i,u/) while systematically reducing the duration (300 to 10 ms) of the synthetic consonant-vowel syllables. Results from 10 normal-hearing and 15 hearing-impaired listeners suggest that audiometric configuration interacts in a complex manner with the identification of short-duration stimuli. For consonants paired with the vowels /a/ and /u/, performance deteriorated as the slope of the audiometric configuration increased. The one exception to this result was a subject who had significantly elevated pure-tone thresholds relative to the other hearing-impaired subjects. Despite the changes in the shape of the onset spectral cues imposed by hearing loss, with increasing duration, consonant recognition in the /a/ and /u/ context for most hearing-impaired subjects eventually approached that of the normal-hearing listeners. In contrast, scores for consonants paired with /i/ were poor for a majority of hearing-impaired listeners for stimuli of all durations.
4.
Temporal cues for consonant recognition: training, talker generalization, and use in evaluation of cochlear implants.
D J Van Tasell D G Greenfield J J Logemann D A Nelson 《The Journal of the Acoustical Society of America》1992,92(3):1247-1257
Limited consonant phonemic information can be conveyed by the temporal characteristics of speech. In the two experiments reported here, the effects of practice and of multiple talkers on identification of temporal consonant information were evaluated. Naturally produced /aCa/ disyllables were used to create "temporal-only" stimuli having instantaneous amplitudes identical to the natural speech stimuli, but flat spectra. Practice improved normal-hearing subjects' identification of temporal-only stimuli from a single talker over that reported earlier for a different group of unpracticed subjects [J. Acoust. Soc. Am. 82, 1152-1161 (1987)]. When the number of talkers was increased to six, however, performance was poorer than that observed for one talker, demonstrating that subjects had been able to learn the individual stimulus items derived from the speech of the single talker. Even after practice, subjects varied greatly in their abilities to extract temporal information related to consonant voicing and manner. Identification of consonant place was uniformly poor in the multiple-talker situation, indicating that for these stimuli consonant place is cued via spectral information. Comparison of consonant identification by users of multi-channel cochlear implants showed that the implant users' identification of temporal consonant information was largely within the range predicted from the normal data. In the instances where the implant users were performing especially well, they were identifying consonant place information at levels well beyond those predicted by the normal-subject data. Comparison of implant-user performance with the temporal-only data reported here can help determine whether the speech information available to the implant user consists of entirely temporal cues, or is augmented by spectral cues.
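A "temporal-only" stimulus of the kind described above can be sketched by imposing the instantaneous amplitude of a speech token on wideband noise, flattening the spectrum; the Hilbert-envelope extraction and the 500-Hz smoothing cutoff are assumptions, not the authors' exact procedure:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def temporal_only(speech, fs, env_cutoff=500.0, rng=None):
    """Keep the instantaneous amplitude of the input but replace its
    spectral content with wideband (flat-spectrum) noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    env = np.abs(hilbert(speech))                      # instantaneous amplitude
    sos = butter(2, env_cutoff / (fs / 2), output="sos")
    env = np.maximum(sosfiltfilt(sos, env), 0.0)       # smoothed, nonnegative
    return env * rng.standard_normal(len(speech))      # flat-spectrum carrier

fs = 16000
t = np.arange(fs) / fs
syllable = np.sin(2 * np.pi * 1000 * t) * np.hanning(fs)  # crude amplitude-shaped token
stim = temporal_only(syllable, fs)
```

The output's short-term level follows the original amplitude contour (loud in the middle of the token, quiet at the edges), while its long-term spectrum is that of the noise carrier.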
5.
Kwon BJ 《The Journal of the Acoustical Society of America》2002,112(2):634-641
Comodulation masking release (CMR) refers to an improvement in the detection threshold of a signal masked by noise with coherent amplitude fluctuation across frequency, as compared to noise without the envelope coherence. The present study tested whether such an advantage for signal detection would facilitate the identification of speech phonemes. Consonant identification of bandpass speech was measured under the following three masker conditions: (1) a single band of noise in the speech band ("on-frequency" masker); (2) two bands of noise, one in the on-frequency band and the other in the "flanking band," with coherence of temporal envelope fluctuation between the two bands (comodulation); and (3) two bands of noise (on-frequency band and flanking band), without the coherence of the envelopes (noncomodulation). A pilot experiment with a small number of consonant tokens was followed by the main experiment with 12 consonants and the following masking conditions: three frequency locations of the flanking band and two masker levels. Results showed that in all conditions, the comodulation condition provided higher identification scores than the noncomodulation condition, and the difference in score was 3.5% on average. No significant difference was observed between the on-frequency only condition and the comodulation condition, i.e., an "unmasking" effect by the addition of a comodulated flanking band was not observed. The positive effect of CMR on consonant recognition found in the present study endorses a "cued-listening" theory, rather than an envelope correlation theory, as a basis of CMR in a suprathreshold task.
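The comodulated versus noncomodulated masker conditions can be sketched by giving two noise bands either a shared or an independent slow amplitude modulator; the sinusoidal modulators, band placements, and bandwidths below are illustrative assumptions, not the study's stimulus parameters:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

fs = 16000
n = fs                                   # 1 s
t = np.arange(n) / fs
rng = np.random.default_rng(0)

def modulated_band(center, bw, modulator):
    """Narrowband noise multiplied by a slow amplitude modulator."""
    sos = butter(3, [(center - bw / 2) / (fs / 2), (center + bw / 2) / (fs / 2)],
                 btype="band", output="sos")
    return modulator * sosfiltfilt(sos, rng.standard_normal(n))

# Shared 10-Hz modulator -> comodulated flanking band;
# a different (7-Hz) modulator -> the noncomodulated control.
mod = 1.0 + 0.9 * np.sin(2 * np.pi * 10 * t)
on_band   = modulated_band(1000.0, 200.0, mod)
flank_co  = modulated_band(2000.0, 200.0, mod)
flank_non = modulated_band(2000.0, 200.0, 1.0 + 0.9 * np.sin(2 * np.pi * 7 * t))

def slow_envelope(x):
    sos = butter(2, 30.0 / (fs / 2), output="sos")
    return sosfiltfilt(sos, np.abs(hilbert(x)))

corr_co = np.corrcoef(slow_envelope(on_band), slow_envelope(flank_co))[0, 1]
corr_non = np.corrcoef(slow_envelope(on_band), slow_envelope(flank_non))[0, 1]
```

The across-band envelope correlation is high for the comodulated pair and near zero for the noncomodulated pair, which is exactly the coherence manipulation the masker conditions differ in.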
6.
This study assessed the acoustic and perceptual effect of noise on vowel and stop-consonant spectra. Multi-talker babble and speech-shaped noise were added to vowel and stop stimuli at -5 to +10 dB S/N, and the effect of noise was quantified in terms of (a) spectral envelope differences between the noisy and clean spectra in three frequency bands, (b) presence of reliable F1 and F2 information in noise, and (c) changes in burst frequency and slope. Acoustic analysis indicated that F1 was detected more reliably than F2 and the largest spectral envelope differences between the noisy and clean vowel spectra occurred in the mid-frequency band. This finding suggests that in extremely noisy conditions listeners must be relying on relatively accurate F1 frequency information along with partial F2 information to identify vowels. Stop consonant recognition remained high even at -5 dB despite the disruption of burst cues due to additive noise, suggesting that listeners must be relying on other cues, perhaps formant transitions, to identify stops.
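Adding noise at a specified signal-to-noise ratio, as in the -5 to +10 dB S/N conditions above, reduces to scaling the noise against the long-term signal power; a minimal sketch (the 440-Hz test tone is only a stand-in for the stimuli):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the long-term signal-to-noise ratio of the
    mixture equals snr_db, then add it to the speech."""
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    gain = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
speech = np.sin(2 * np.pi * 440 * t)
noisy = mix_at_snr(speech, rng.standard_normal(len(t)), -5.0)
```

Subtracting the clean signal from the mixture recovers the scaled noise, so the realized SNR can be verified to match the requested value exactly.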
7.
Auditory filter characteristics and consonant recognition for hearing-impaired listeners
To examine the association between frequency resolution and speech recognition, auditory filter parameters and stop-consonant recognition were determined for 9 normal-hearing and 24 hearing-impaired subjects. In an earlier investigation, the relationship between stop-consonant recognition and the articulation index (AI) had been established on normal-hearing listeners. Based on AI predictions, speech-presentation levels for each subject in this experiment were selected to obtain a wide range of recognition scores. This strategy provides a method of interpreting speech-recognition performance among listeners who vary in magnitude and configuration of hearing loss by assuming that conditions which yield equal audible spectra will result in equivalent performance. It was reasoned that an association between frequency resolution and consonant recognition may be more appropriately estimated if hearing-impaired listeners' performance was measured under conditions that assured equivalent audibility of the speech stimuli. Derived auditory filter parameters indicated that filter widths and dynamic ranges were strongly associated with threshold. Stop-consonant recognition scores for most hearing-impaired listeners were not significantly poorer than predicted by the AI model. Furthermore, differences between observed recognition scores and those predicted by the AI were not associated with auditory filter characteristics, suggesting that frequency resolution and speech recognition may appear to be associated primarily because both are degraded by threshold elevation.
8.
M F Dorman S Soli K Dankowski L M Smith G McCandless J Parkin 《The Journal of the Acoustical Society of America》1990,88(5):2074-2079
Ten patients who use the Ineraid cochlear implant were tested on a consonant identification task. The stimuli were 16 consonants in the "aCa" environment. The patients who scored greater than 60 percent correct were found to have high feature information scores for amplitude envelope features and for features requiring the detection of high-frequency energy. The patients who scored less than 60 percent correct exhibited lower scores for all features of the signal. The difference in performance between the two groups of patients may be due, at least in part, to differences in the detection or resolution of high-frequency components in the speech signal.
9.
The speech signal contains many acoustic properties that may contribute differently to spoken word recognition. Previous studies have demonstrated that the importance of properties present during consonants or vowels is dependent upon the linguistic context (i.e., words versus sentences). The current study investigated three potentially informative acoustic properties that are present during consonants and vowels for monosyllabic words and sentences. Natural variations in fundamental frequency were either flattened or removed. The speech envelope and temporal fine structure were also investigated by limiting the availability of these cues via noisy signal extraction. Thus, this study investigated the contribution of these acoustic properties, present during either consonants or vowels, to overall word and sentence intelligibility. Results demonstrated that all processing conditions displayed better performance for vowel-only sentences, and this vowel advantage remained even when dynamic fundamental-frequency cues were removed. Word and sentence comparisons suggest that speech information transmitted by the envelope is responsible, at least in part, for the additional vowel contributions in sentences, but is not predictive for isolated words.
10.
Hazan V Sennema A Faulkner A Ortega-Llebaria M Iba M Chunge H 《The Journal of the Acoustical Society of America》2006,119(3):1740-1751
This study assessed the extent to which second-language learners are sensitive to phonetic information contained in visual cues when identifying a non-native phonemic contrast. In experiment 1, Spanish and Japanese learners of English were tested on their perception of a labial/labiodental consonant contrast in audio (A), visual (V), and audio-visual (AV) modalities. Spanish students showed better performance overall, and much greater sensitivity to visual cues than Japanese students. Both learner groups achieved higher scores in the AV than in the A test condition, thus showing evidence of audio-visual benefit. Experiment 2 examined the perception of the less visually-salient /l/-/r/ contrast in Japanese and Korean learners of English. Korean learners obtained much higher scores in auditory and audio-visual conditions than in the visual condition, while Japanese learners generally performed poorly in both modalities. Neither group showed evidence of audio-visual benefit. These results show the impact of the language background of the learner and visual salience of the contrast on the use of visual cues for a non-native contrast. Significant correlations between scores in the auditory and visual conditions suggest that increasing auditory proficiency in identifying a non-native contrast is linked with an increasing proficiency in using visual cues to the contrast.
11.
Two experiments investigated the effects of critical bandwidth and frequency region on the use of temporal envelope cues for speech. In both experiments, spectral details were reduced using vocoder processing. In experiment 1, consonant identification scores were measured in a condition for which the cutoff frequency of the envelope extractor was half the critical bandwidth (HCB) of the auditory filters centered on each analysis band. Results showed that performance was similar to that obtained in conditions for which the envelope cutoff was set to 160 Hz or above. Experiment 2 evaluated the impact of setting the cutoff frequency of the envelope extractor to values of 4, 8, and 16 Hz or to HCB in one or two contiguous bands for an eight-band vocoder. The cutoff was set to 16 Hz for all the other bands. Overall, consonant identification was not affected by removing envelope fluctuations above 4 Hz in the low- and high-frequency bands. In contrast, speech intelligibility decreased as the cutoff frequency was decreased in the midfrequency region from 16 to 4 Hz. The behavioral results were fairly consistent with a physical analysis of the stimuli, suggesting that clearly measurable envelope fluctuations cannot be attenuated without affecting speech intelligibility.
12.
Cochlear implant users receive limited spectral and temporal information. Their speech recognition deteriorates dramatically in noise. The aim of the present study was to determine the relative contributions of spectral and temporal cues to speech recognition in noise. Spectral information was manipulated by varying the number of channels from 2 to 32 in a noise-excited vocoder. Temporal information was manipulated by varying the low-pass cutoff frequency of the envelope extractor from 1 to 512 Hz. Ten normal-hearing, native speakers of English participated in tests of phoneme recognition using vocoder processed consonants and vowels under three conditions (quiet, and +6 and 0 dB signal-to-noise ratios). The number of channels required for vowel-recognition performance to plateau increased from 12 in quiet to 16-24 in the two noise conditions. However, for consonant recognition, no further improvement in performance was evident when the number of channels was 12 or more in any of the three conditions. The contribution of temporal cues for phoneme recognition showed a similar pattern in both quiet and noise conditions. Similar to the quiet conditions, there was a trade-off between temporal and spectral cues for phoneme recognition in noise.
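A noise-excited vocoder with the two manipulated parameters (number of channels, envelope low-pass cutoff) made explicit can be sketched as below; the log-spaced channel edges, frequency range, and filter orders are illustrative choices, not the study's processor design:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def noise_vocoder(speech, fs, n_channels, env_cutoff,
                  f_lo=100.0, f_hi=6000.0, seed=0):
    """Noise-excited vocoder: per channel, band-pass the speech, extract a
    low-pass-smoothed envelope, and modulate band-matched noise; sum channels.
    n_channels controls spectral resolution, env_cutoff (Hz) temporal resolution."""
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)
    rng = np.random.default_rng(seed)
    env_sos = butter(2, env_cutoff / (fs / 2), output="sos")
    out = np.zeros_like(speech)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(3, [lo / (fs / 2), hi / (fs / 2)], btype="band", output="sos")
        env = sosfiltfilt(env_sos, np.abs(sosfiltfilt(sos, speech)))
        out += np.maximum(env, 0.0) * sosfiltfilt(sos, rng.standard_normal(len(speech)))
    return out

fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)
coarse = noise_vocoder(speech, fs, n_channels=4, env_cutoff=16.0)
```

Sweeping `n_channels` over 2-32 and `env_cutoff` over 1-512 Hz reproduces the structure of the stimulus conditions described in the abstract.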
13.
The contribution of temporal fine structure (TFS) cues to consonant identification was assessed in normal-hearing listeners with two speech-processing schemes designed to remove temporal envelope (E) cues. Stimuli were processed vowel-consonant-vowel speech tokens. Derived from the analytic signal, carrier signals were extracted from the output of a bank of analysis filters. The "PM" and "FM" processing schemes estimated a phase- and frequency-modulation function, respectively, of each carrier signal and applied them to a sinusoidal carrier at the analysis-filter center frequency. In the FM scheme, processed signals were further restricted to the analysis-filter bandwidth. A third scheme retaining only E cues from each band was used for comparison. Stimuli processed with the PM and FM schemes were found to be highly intelligible (50-80% correct identification) over a variety of experimental conditions designed to affect the putative reconstruction of E cues subsequent to peripheral auditory filtering. Analysis of confusions between consonants showed that the contribution of TFS cues was greater for place than manner of articulation, whereas the converse was observed for E cues. Taken together, these results indicate that TFS cues convey important phonetic information that is not solely a consequence of E reconstruction.
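The core idea of envelope removal via the analytic signal can be sketched for one analysis band: discard the envelope and keep a unit-amplitude carrier with the original phase. This is only a structural sketch in the spirit of the "PM" scheme, with an assumed band and filter order, not the authors' processing:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def tfs_only_band(speech, fs, band):
    """Band-pass the speech, take the analytic signal, discard its magnitude
    (the envelope E), and retain cos of its phase (the TFS carrier)."""
    sos = butter(3, [band[0] / (fs / 2), band[1] / (fs / 2)],
                 btype="band", output="sos")
    analytic = hilbert(sosfiltfilt(sos, speech))
    return np.cos(np.angle(analytic))      # E discarded, TFS retained

fs = 16000
t = np.arange(fs) / fs
# Amplitude-modulated 500-Hz tone as a stand-in for a VCV band signal
vcv = np.sin(2 * np.pi * 500 * t) * (1 + 0.8 * np.sin(2 * np.pi * 3 * t))
tfs = tfs_only_band(vcv, fs, (300.0, 800.0))
```

The output has a flat envelope (the 3-Hz amplitude modulation is removed) while the fine-structure oscillation is preserved.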
14.
S Gordon-Salant 《The Journal of the Acoustical Society of America》1987,81(4):1199-1202
In a recent study [S. Gordon-Salant, J. Acoust. Soc. Am. 80, 1599-1607 (1986)], young and elderly normal-hearing listeners demonstrated significant improvements in consonant-vowel (CV) recognition with acoustic modification of the speech signal incorporating increments in the consonant-vowel ratio (CVR). Acoustic modification of consonant duration failed to enhance performance. The present study investigated whether consonant recognition deficits of elderly hearing-impaired listeners would be reduced by these acoustic modifications, as well as by increases in speech level. Performance of elderly hearing-impaired listeners with gradually sloping and sharply sloping sensorineural hearing losses was compared to performance of elderly normal-threshold listeners (reported previously) for recognition of a variety of nonsense syllable stimuli. These stimuli included unmodified CVs, CVs with increases in CVR, CVs with increases in consonant duration, and CVs with increases in both CVR and consonant duration. Stimuli were presented at each of two speech levels with a background of noise. Results obtained from the hearing-impaired listeners agreed with those observed previously from normal-hearing listeners. Differences in performance between the three subject groups as a function of level were observed also.
15.
J Hillenbrand 《The Journal of the Acoustical Society of America》1984,75(5):1613-1622
This study examined the ability of six-month-old infants to recognize the perceptual similarity of syllables sharing a phonetic segment when variations were introduced in phonetic environment and talker. Infants in a "phonetic" group were visually reinforced for head turns when a change occurred from a background category of labial nasals to a comparison category of alveolar nasals. The infants were initially trained on a [ma]-[na] contrast produced by a male talker. Novel tokens differing in vowel environment and talker were introduced over several stages of increasing complexity. In the most complex stage infants were required to make a head turn when a change occurred from [ma,mi,mu] to [na,ni,nu], with the tokens in each category produced by both male and female talkers. A "nonphonetic" control group was tested using the same pool of stimuli as the phonetic condition. The only difference was that the stimuli in the background and comparison categories were chosen in such a way that the sounds could not be organized by acoustic or phonetic characteristics. Infants in the phonetic group transferred training to novel tokens produced by different talkers and in different vowel contexts. However, infants in the nonphonetic control group had difficulty learning the phonetically unrelated tokens that were introduced as the experiment progressed. These findings suggest that infants recognize the similarity of nasal consonants sharing place of articulation independent of variation in talker and vowel context.
16.
Cochlear implants provide users with limited spectral and temporal information. In this study, the amount of spectral and temporal information was systematically varied through simulations of cochlear implant processors using a noise-excited vocoder. Spectral information was controlled by varying the number of channels between 1 and 16, and temporal information was controlled by varying the lowpass cutoff frequencies of the envelope extractors from 1 to 512 Hz. Consonants and vowels processed using those conditions were presented to seven normal-hearing native-English-speaking listeners for identification. The results demonstrated that both spectral and temporal cues were important for consonant and vowel recognition with the spectral cues having a greater effect than the temporal cues for the ranges of numbers of channels and lowpass cutoff frequencies tested. The lowpass cutoff for asymptotic performance in consonant and vowel recognition was 16 and 4 Hz, respectively. The number of channels at which performance plateaued for consonants and vowels was 8 and 12, respectively. Within the above-mentioned ranges of lowpass cutoff frequency and number of channels, the temporal and spectral cues showed a tradeoff for phoneme recognition. Information transfer analyses showed different relative contributions of spectral and temporal cues in the perception of various phonetic/acoustic features.
17.
Schimmel O van de Par S Breebaart J Kohlrausch A 《The Journal of the Acoustical Society of America》2008,124(2):1130-1145
The ability to segregate two spectrally and temporally overlapping signals based on differences in temporal envelope structure and binaural cues was investigated. Signals were a harmonic tone complex (HTC) with 20 Hz fundamental frequency and a bandpass noise (BPN). Both signals had interaural differences of the same absolute value, but with opposite signs to establish lateralization to different sides of the median plane, such that their combination yielded two different spatial configurations. As an indication for segregation ability, threshold interaural time and level differences were measured for discrimination between these spatial configurations. Discrimination based on interaural level differences was good, although absolute thresholds depended on signal bandwidth and center frequency. Discrimination based on interaural time differences required the signals' temporal envelope structures to be sufficiently different. Long-term interaural cross-correlation patterns or long-term averaged patterns after equalization-cancellation of the combined signals did not provide information for the discrimination. The binaural system must, therefore, have been capable of processing changes in interaural time differences within the period of the harmonic tone complex, suggesting that monaural information from the temporal envelopes influences the use of binaural information in the perceptual organization of signal components.
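Imposing an interaural time difference and level difference on a signal, as in the lateralized stimuli above, can be sketched as a delay-and-scale on one ear; the whole-sample delay (rather than fractional-delay interpolation) and the test values are simplifying assumptions:

```python
import numpy as np

def apply_itd_ild(x, fs, itd_s, ild_db):
    """Lateralize a mono signal: delay the right channel by the ITD
    (rounded to a whole sample) and attenuate it by the ILD, returning
    a (left, right) pair. Flipping the signs lateralizes to the other side."""
    d = int(round(itd_s * fs))
    left = x.copy()
    right = np.roll(x, d) * 10.0 ** (-ild_db / 20.0)
    return left, right

fs = 48000
rng = np.random.default_rng(0)
x = rng.standard_normal(fs // 10)                 # 100-ms noise burst
left, right = apply_itd_ild(x, fs, itd_s=500e-6, ild_db=6.0)
```

Giving the HTC and the BPN equal-magnitude, opposite-sign ITDs/ILDs, as the study did, amounts to calling this with `+itd_s`/`+ild_db` for one signal and `-itd_s`/`-ild_db` for the other before summing the ear signals.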
18.
Ganapathy S Thomas S Hermansky H 《The Journal of the Acoustical Society of America》2010,128(6):3769-3780
A robust feature extraction technique for phoneme recognition is proposed which is based on deriving modulation frequency components from the speech signal. The modulation frequency components are computed from syllable-length segments of sub-band temporal envelopes estimated using frequency domain linear prediction. Although the baseline features provide good performance in clean conditions, the performance degrades significantly in noisy conditions. In this paper, a technique for noise compensation is proposed where an estimate of the noise envelope is subtracted from the noisy speech envelope. The noise compensation technique suppresses the effect of additive noise in speech. The robustness of the proposed features is further enhanced by the gain normalization technique. The normalized temporal envelopes are compressed with static (logarithmic) and dynamic (adaptive loops) compression and are converted into modulation frequency features. These features are used in an automatic phoneme recognition task. Experiments are performed in mismatched train/test conditions where the test data are corrupted with various environmental distortions like telephone channel noise, additive noise, and room reverberation. Experiments are also performed on large amounts of real conversational telephone speech. In these experiments, the proposed features show substantial improvements in phoneme recognition rates compared to other speech analysis techniques. Furthermore, the contribution of various processing stages for robust speech signal representation is analyzed.
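The overall pipeline (sub-band envelope, compression, modulation-frequency components over syllable-length segments) can be sketched as follows. Note this uses a Hilbert envelope and a plain FFT as structural stand-ins for the paper's frequency-domain linear prediction (FDLP), so it illustrates the feature layout, not the proposed estimator:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def modulation_features(speech, fs, band, seg_len_s=0.2, n_mod=12):
    """Sub-band temporal envelope -> log compression -> magnitude of the
    first n_mod modulation-frequency components per syllable-length segment."""
    sos = butter(3, [band[0] / (fs / 2), band[1] / (fs / 2)],
                 btype="band", output="sos")
    env = np.log1p(np.abs(hilbert(sosfiltfilt(sos, speech))))  # static compression
    seg = int(seg_len_s * fs)
    frames = env[: len(env) // seg * seg].reshape(-1, seg)     # segment the envelope
    return np.abs(np.fft.rfft(frames, axis=1))[:, :n_mod]

fs = 8000                                   # telephone-bandwidth rate
t = np.arange(fs) / fs                      # 1 s of test signal
speech = np.sin(2 * np.pi * 700 * t) * (1 + 0.5 * np.sin(2 * np.pi * 5 * t))
feats = modulation_features(speech, fs, (300.0, 2000.0))
```

The noise-compensation step described in the abstract would operate on `env` before the FFT, subtracting an estimate of the noise envelope.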
19.
P J Blamey R C Dowell A M Brown G M Clark P M Seligman 《The Journal of the Acoustical Society of America》1987,82(1):48-57
Vowel and consonant confusion matrices were collected in the hearing alone (H), lipreading alone (L), and hearing plus lipreading (HL) conditions for 28 patients participating in the clinical trial of the multiple-channel cochlear implant. All patients were profound-to-totally deaf and "hearing" refers to the presentation of auditory information via the implant. The average scores were 49% for vowels and 37% for consonants in the H condition and the HL scores were significantly higher than the L scores. Information transmission and multidimensional scaling analyses showed that different speech features were conveyed at different levels in the H and L conditions. In the HL condition, the visual and auditory signals provided independent information sources for each feature. For vowels, the auditory signal was the major source of duration information, while the visual signal was the major source of first and second formant frequency information. The implant provided information about the amplitude envelope of the speech and the estimated frequency of the main spectral peak between 800 and 4000 Hz, which was useful for consonant recognition. A speech processor that coded the estimated frequency and amplitude of an additional peak between 300 and 1000 Hz was shown to increase the vowel and consonant recognition in the H condition by improving the transmission of first formant and voicing information.
20.
V M Richards 《The Journal of the Acoustical Society of America》1990,88(2):786-795
Human observers are able to discriminate between simultaneously presented bands of noise having envelopes that are identical (synchronous) rather than statistically independent (asynchronous). The possibility that the detection of envelope synchrony is based on cues available in a single critical band, rather than on a simultaneous comparison of envelopes extracted via independent critical bands, is examined. Two potential single-channel cues were examined, both relying on the assumption that information present in the envelope of the summed bands is available to the listener. One such single-channel cue is the rms of the envelope of the summed waveform; the envelope is more deeply modulated for the summed synchronous bands than for the summed asynchronous bands. The second cue examined was envelope regularity; the envelope of the summed synchronous bands has periodic envelope minima, while the summed asynchronous bands exhibit aperiodic envelope minima. Psychophysical results suggest that such within-channel cues may be both available to, and utilized by, the listener when the component bands are separated by less than one-third of an octave.
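The first within-channel cue above, the rms of the summed waveform's envelope, can be illustrated numerically. Tonal carriers and sinusoidal modulators stand in for the noise bands here (an assumption for determinism, not the study's stimuli): synchronous bands share one 8-Hz modulator, while the asynchronous pair gets modulators in antiphase.

```python
import numpy as np
from scipy.signal import hilbert

def envelope_rms(x):
    """Depth-of-modulation cue: standard deviation of the Hilbert envelope
    of the summed waveform (deeper modulation -> larger value)."""
    return np.std(np.abs(hilbert(x)))

fs = 16000
t = np.arange(fs) / fs
m = 0.9 * np.sin(2 * np.pi * 8 * t)
c1 = np.cos(2 * np.pi * 1000 * t)
c2 = np.cos(2 * np.pi * 1100 * t)
sync_sum = (1 + m) * c1 + (1 + m) * c2    # identical envelopes on both bands
async_sum = (1 + m) * c1 + (1 - m) * c2   # envelopes out of step
```

Consistent with the cue the study describes, the summed synchronous bands yield a more deeply modulated envelope, and hence a larger envelope rms, than the summed asynchronous bands.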