Similar articles
20 similar articles found (search time: 265 ms)
1.
Recent studies have shown that synthesized versions of American English vowels are less accurately identified when the natural time-varying spectral changes are eliminated by holding the formant frequencies constant over the duration of the vowel. A limitation of these experiments has been that vowels produced by formant synthesis are generally less accurately identified than the natural vowels after which they are modeled. To overcome this limitation, a high-quality speech analysis-synthesis system (STRAIGHT) was used to synthesize versions of 12 American English vowels spoken by adults and children. Vowels synthesized with STRAIGHT were identified as accurately as the natural versions, in contrast with previous results from our laboratory showing identification rates 9%-12% lower for the same vowels synthesized using the cascade formant model. Consistent with earlier studies, identification accuracy was not reduced when the fundamental frequency was held constant across the vowel. However, elimination of time-varying changes in the spectral envelope using STRAIGHT led to a greater reduction in accuracy (23%) than was previously found with cascade formant synthesis (11%). A statistical pattern recognition model, applied to acoustic measurements of the natural and synthesized vowels, predicted both the higher identification accuracy for vowels synthesized using STRAIGHT compared to formant synthesis, and the greater effects of holding the formant frequencies constant over time with STRAIGHT synthesis. Taken together, the experiment and modeling results suggest that formant estimation errors and incorrect rendering of spectral and temporal cues by cascade formant synthesis contribute to lower identification accuracy and underestimation of the role of time-varying spectral change in vowels.

2.
Vowel identification was tested in quiet, noise, and reverberation with 20 normal-hearing subjects and 20 hearing-impaired subjects. Stimuli were 15 English vowels spoken in a /b-t/ context by six male talkers. Each talker produced five tokens of each vowel. In quiet, all stimuli were identified by two judges as the intended targets. The stimuli were degraded by reverberation or speech-spectrum noise. Vowel identification scores depended upon talker, listening condition, and subject type. The relationship between identification errors and spectral details of the vowels is discussed.

3.
Neural-population interactions resulting from excitation overlap in multi-channel cochlear implants (CI) may cause blurring of the "internal" auditory representation of complex sounds such as vowels. In experiment I, confusion matrices for eight German steady-state vowel-like signals were obtained from seven CI listeners. Identification performance ranged between 42% and 74% correct. On the basis of an information transmission analysis across all vowels, pairs of most and least frequently confused vowels were selected for each subject. In experiment II, vowel masking patterns (VMPs) were obtained using the previously selected vowels as maskers. The VMPs were found to resemble the "electrical" vowel spectra to a large extent, indicating a relatively weak effect of neural-population interactions. Correlation between vowel identification data and VMP spectral similarity, measured by means of several spectral distance metrics, showed that the CI listeners identified the vowels based on differences in the between-peak spectral information as well as the location of spectral peaks. The effect of nonlinear amplitude mapping of acoustic into "electrical" vowels, as performed in the implant processors, was evaluated separately and compared to the effect of neural-population interactions. Amplitude mapping was found to cause more blurring than neural-population interactions. Subjects exhibiting strong blurring effects yielded lower overall vowel identification scores.
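The spectral-similarity comparison above can be illustrated with a minimal sketch. The two metrics below (RMS difference in dB, capturing between-peak shape, and a peak-channel shift, capturing peak location) are generic examples rather than the specific metrics used in the study; the function and key names are hypothetical.

```python
import numpy as np

def spectral_distances(pattern_a_db, pattern_b_db):
    """Compare two channel patterns (e.g., a vowel masking pattern and
    an 'electrical' vowel spectrum), both given in dB per channel."""
    a = np.asarray(pattern_a_db, dtype=float)
    b = np.asarray(pattern_b_db, dtype=float)
    rmse_db = float(np.sqrt(np.mean((a - b) ** 2)))          # overall shape
    peak_shift = abs(int(np.argmax(a)) - int(np.argmax(b)))  # peak location
    return {"rmse_db": rmse_db, "peak_shift_channels": peak_shift}
```

Identical patterns score zero on both metrics; moving a spectral peak by one channel changes only the peak metric's units, while the RMS metric responds to any level difference.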

4.
The purpose of this experiment was to evaluate the utilization of short-term spectral cues for recognition of initial plosive consonants (/b,d,g/) by normal-hearing and by hearing-impaired listeners differing in audiometric configuration. Recognition scores were obtained for these consonants paired with three vowels (/a,i,u/) while systematically reducing the duration (300 to 10 ms) of the synthetic consonant-vowel syllables. Results from 10 normal-hearing and 15 hearing-impaired listeners suggest that audiometric configuration interacts in a complex manner with the identification of short-duration stimuli. For consonants paired with the vowels /a/ and /u/, performance deteriorated as the slope of the audiometric configuration increased. The one exception to this result was a subject who had significantly elevated pure-tone thresholds relative to the other hearing-impaired subjects. Despite the changes in the shape of the onset spectral cues imposed by hearing loss, with increasing duration, consonant recognition in the /a/ and /u/ context for most hearing-impaired subjects eventually approached that of the normal-hearing listeners. In contrast, scores for consonants paired with /i/ were poor for a majority of hearing-impaired listeners for stimuli of all durations.

5.
Two signal-processing algorithms, designed to separate the voiced speech of two talkers speaking simultaneously at similar intensities in a single channel, were compared and evaluated. Both algorithms exploit the harmonic structure of voiced speech and require a difference in fundamental frequency (F0) between the voices to operate successfully. One attenuates the interfering voice by filtering the cepstrum of the combined signal. The other uses the method of harmonic selection [T. W. Parsons, J. Acoust. Soc. Am. 60, 911-918 (1976)] to resynthesize the target voice from fragmentary spectral information. Two perceptual evaluations were carried out. One involved the separation of pairs of vowels synthesized on static F0's; the other involved the recovery of consonant-vowel (CV) words masked by a synthesized vowel. Normal-hearing listeners and four listeners with moderate-to-severe, bilateral, symmetrical, sensorineural hearing impairments were tested. All listeners showed increased accuracy of identification when the target voice was enhanced by processing. The vowel-identification data show that intelligibility enhancement is possible over a range of F0 separations between the target and interfering voice. The recovery of CV words demonstrates that the processing is valid not only for spectrally static vowels but also for less intense time-varying voiced consonants. The results for the impaired listeners suggest that the algorithms may be applicable as components of a noise-reduction system in future digital signal-processing hearing aids. The vowel-separation test, and subjective listening, suggest that harmonic selection, which is the more computationally expensive method, produces the more effective voice separation.
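A minimal sketch of the cepstrum-filtering idea (not the authors' implementation): given the interferer's F0, zero out the quefrency regions around that voice's pitch period and its multiples in the real cepstrum of a frame, then rebuild the magnitude spectrum with the original phase. The notch width and single-frame handling are illustrative assumptions.

```python
import numpy as np

def lifter_out_interferer(frame, fs, f0_interferer, notch=3):
    """Attenuate an interfering voice of known F0 by notching the
    rahmonics (multiples of its pitch period) in the real cepstrum."""
    n = len(frame)
    spec = np.fft.fft(frame)
    log_mag = np.log(np.abs(spec) + 1e-12)
    ceps = np.fft.ifft(log_mag).real
    period = int(round(fs / f0_interferer))  # pitch period in samples
    k = 1
    while k * period < n // 2:
        c = k * period
        ceps[max(c - notch, 0):c + notch + 1] = 0.0   # positive quefrencies
        ceps[n - c - notch:n - c + notch + 1] = 0.0   # mirrored quefrencies
        k += 1
    new_mag = np.exp(np.fft.fft(ceps).real)           # rebuilt magnitude
    return np.fft.ifft(new_mag * np.exp(1j * np.angle(spec))).real
```

In practice this would be applied frame by frame with overlap-add, and the interferer's F0 would have to be tracked rather than assumed known.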

6.
The present study measured the recognition of spectrally degraded and frequency-shifted vowels in both acoustic and electric hearing. Vowel stimuli were passed through 4, 8, or 16 bandpass filters and the temporal envelopes from each filter band were extracted by half-wave rectification and low-pass filtering. The temporal envelopes were used to modulate noise bands which were shifted in frequency relative to the corresponding analysis filters. This manipulation not only degraded the spectral information by discarding within-band spectral detail, but also shifted the tonotopic representation of spectral envelope information. Results from five normal-hearing subjects showed that vowel recognition was sensitive to both spectral resolution and frequency shifting. The effect of a frequency shift did not interact with spectral resolution, suggesting that spectral resolution and spectral shifting are orthogonal in terms of intelligibility. High vowel recognition scores were observed for as few as four bands. Regardless of the number of bands, no significant performance drop was observed for tonotopic shifts equivalent to 3 mm along the basilar membrane, that is, for frequency shifts of 40%-60%. Similar results were obtained from five cochlear implant listeners, when electrode locations were fixed and the spectral location of the analysis filters was shifted. Changes in recognition performance in electrical and acoustic hearing were similar in terms of the relative location of electrodes rather than the absolute location of electrodes, indicating that cochlear implant users may at least partly accommodate to the new patterns of speech sounds after long-term exposure to their normal speech processor.
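The band-envelope processing described above can be sketched as a noise-band vocoder. Log-spaced Butterworth analysis bands, the filter orders, cutoffs, and the simple multiplicative carrier-band shift below are illustrative assumptions, not the study's exact parameters.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def noise_vocoder(x, fs, n_bands=8, shift_ratio=1.0,
                  f_lo=100.0, f_hi=6000.0, env_cutoff=160.0):
    """Noise-band vocoder sketch: split x into log-spaced analysis bands,
    extract each band's temporal envelope by half-wave rectification and
    low-pass filtering, and use the envelopes to modulate noise bands
    shifted in frequency by `shift_ratio`."""
    assert f_hi * shift_ratio < fs / 2, "shifted bands must stay below Nyquist"
    edges = np.logspace(np.log10(f_lo), np.log10(f_hi), n_bands + 1)
    env_sos = butter(2, env_cutoff / (fs / 2), btype="low", output="sos")
    rng = np.random.default_rng(0)
    noise = rng.standard_normal(len(x))
    out = np.zeros(len(x))
    for b in range(n_bands):
        band = [edges[b] / (fs / 2), edges[b + 1] / (fs / 2)]
        sub = sosfilt(butter(4, band, btype="band", output="sos"), x)
        env = sosfilt(env_sos, np.maximum(sub, 0.0))  # rectify + smooth
        shifted = [v * shift_ratio for v in band]     # shifted carrier band
        carrier = sosfilt(butter(4, shifted, btype="band", output="sos"), noise)
        out += env * carrier
    return out
```

With `shift_ratio=1.0` this is a conventional envelope vocoder; values above 1 move the carriers basalward relative to the analysis bands, mimicking the tonotopic shift manipulation.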

7.
Abilities to detect and discriminate ten synthetic steady-state English vowels were compared in Old World monkeys (Cercopithecus, Macaca) and humans using standard animal psychophysical procedures and positive-reinforcement operant conditioning techniques. Monkeys' detection thresholds were close to humans' for the front vowels /i-I-E-ae-E/, but 10-20 dB higher for the back vowels /V-D-C-U-u/. Subjects were subsequently presented with groups of vowels to discriminate. All monkeys experienced difficulty with spectrally similar pairs such as /V-D/, /E-ae/, and /U-u/, but macaques were superior to Cercopithecus monkeys. Humans discriminated all vowels at 100% correct levels, but their increased response latencies reflected spectral similarity and correlated with higher error rates by monkeys. Varying the intensity level of the vowel stimuli had little effect on either monkey or human discrimination, except at the lowest levels tested. These qualitative similarities in monkey and human vowel discrimination suggest that some monkey species may provide useful models of human vowel processing at the sensory level.

8.
The identification of front vowels was studied in normal-hearing listeners using stimuli whose spectra had been altered to approximate the spectrum of vowels processed by auditory filters similar to those that might accompany sensorineural hearing loss. In the first experiment, front vowels were identified with greater than 95% accuracy when the first formant was specified in a normal manner and the higher frequency formants were represented by a broad, flat spectral plateau ranging from approximately 1600 to 3500 Hz. In the second experiment, the bandwidth of the first formant was systematically widened for stimuli with already flattened higher frequency formants. Normal vowel identification was preserved until the first formant was widened to six times its normal bandwidth. These results may account for the coexistence of abnormal vowel masking patterns (indicating flattened auditory spectra) and normal vowel recognition.

9.
This paper addresses the problem of automatic identification of vowels uttered in isolation by female and child speakers. In this case, the magnitude spectrum of voiced vowels is sparsely sampled since only frequencies at integer multiples of F0 are significant. This impacts negatively on the performance of vowel identification techniques that either ignore pitch or rely on global shape models. A new pitch-dependent approach to vowel identification is proposed that emerges from the concept of timbre and that defines perceptual spectral clusters (PSC) of harmonic partials. A representative set of static PSC-related features are estimated and their performance is evaluated in automatic classification tests using the Mahalanobis distance. Linear prediction features and Mel-frequency cepstral coefficients (MFCC) are used as a reference and a database of five (Portuguese) natural vowel sounds uttered by 44 speakers (including 27 child speakers) is used for training and testing the Gaussian models. Results indicate that PSC features perform better than plain linear prediction features, but perform slightly worse than MFCC features. However, PSC features have the potential to take full advantage of the pitch structure of voiced vowels, namely in the analysis of concurrent voices, or by using pitch as a normalization parameter.
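Classification by Mahalanobis distance to per-class Gaussian models, as used in the evaluation above, can be sketched as follows. Feature extraction is omitted, and the small regularization term added to each covariance is an assumption for numerical stability, not part of the described method.

```python
import numpy as np

def fit_gaussians(features_by_class):
    """Fit a mean and inverse covariance per vowel class from a dict
    mapping class label -> (n_samples, n_features) array."""
    models = {}
    for label, X in features_by_class.items():
        X = np.asarray(X, dtype=float)
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        models[label] = (mu, np.linalg.inv(cov))
    return models

def classify(x, models):
    """Assign x to the class with the smallest Mahalanobis distance."""
    x = np.asarray(x, dtype=float)
    def dist2(mu, icov):
        diff = x - mu
        return float(diff @ icov @ diff)
    return min(models, key=lambda label: dist2(*models[label]))
```

Unlike plain Euclidean distance, the Mahalanobis form weights each feature dimension by the class's own variance structure, which matters when feature scales differ (e.g., formant frequencies versus cepstral coefficients).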

10.
The research presented here concerns the simultaneous grouping of the components of a vocal sound source. McAdams [J. Acoust. Soc. Am. 86, 2148-2159 (1989)] found that when three simultaneous vowels at different pitches were presented with subaudio frequency modulation, subjects judged them as being more prominent than when no vibrato was present. In a normal voice, when the harmonics of a vowel undergo frequency modulation they also undergo an amplitude modulation that traces the spectral envelope. Hypothetically, this spectral tracing could be one of the criteria used by the ear to group components of each vowel, which may help explain the lack of effect of frequency modulation coherence among different vowels in the previous study. In this experiment, two types of vowel synthesis were used in which the component amplitudes of each vowel either remained constant with frequency modulation or traced the spectral envelope. The stimuli for the experiment were chords of three different vowels at pitch intervals of five semitones (ratio 1.33). All the vowels of a given stimulus were produced by the same synthesis method. The subjects' task involved rating the prominence of each vowel in the stimulus. It was assumed that subjects would judge this prominence to be lower when they were not able to distinguish the vowel from the background sound. Also included as stimulus parameters were the different permutations of the three vowels at three pitches and a number of modulation conditions in which vowels were unmodulated, modulated alone, and modulated either coherently with, or independently of, the other vowels. Spectral tracing did not result in increased ratings of vowel prominence compared to stimuli where no spectral tracing was present. It would therefore seem that it has no effect on grouping components of sound sources. Modulated vowels received higher prominence ratings than unmodulated vowels. Vowels modulated alone were judged to be more prominent than vowels modulated with other vowels. There was, however, no significant difference between coherent and independent modulation of the three vowels. Differences among modulation conditions were more marked when the modulation width was 6% than when it was 3%.

11.
Three adult male baboons were trained on a psychophysical procedure to discriminate five synthetic, steady-state vowel sounds (/a/, /ae/, /c/, /U/, and /epsilon/) from one another. A pulsed train of one vowel comprised the reference stimulus during a session. Animals were trained to press a lever and release the lever only when this reference vowel sound changed to one of the comparison vowels. All animals learned the vowel discriminations rapidly and, once learned, performed the discriminations at the 95%-100% correct level. During the initial acquisition of the discriminations, however, percent correct detections were higher for those vowels with greater spectral differences from the reference vowel. For some cases, the detection scores correlated closely with differences between first formant peaks, while for others the detection scores correlated more closely with differences between second formant peaks. Once the discriminations were acquired, no discriminability differences were apparent among the different vowels. Underlying discriminability differences were still present, however, and could be revealed by giving a minor tranquilizer (diazepam) that lowered discrimination performances. These drug-induced decrements in vowel discriminability were also correlated with spectral differences, with lower vowel discriminability scores found for those vowels with smaller spectral differences from the reference vowel.

12.
The ability of subjects to identify vowels in vibrotactile transformations of consonant-vowel syllables was measured for two types of displays: a spectral display (frequency by intensity), and a vocal tract area function display (vocal tract location by cross-sectional area). Both displays were presented to the fingertip via the tactile display of the Optacon transducer. In the first experiments the spectral display was effective for identifying vowels in /bV/ context when as many as 24 or as few as eight spectral channels were presented to the skin. However, performance fell when the 12- and 8-channel displays were reduced in size to occupy 1/2 or 1/3 of the 24-row tactile matrix. The effect of reducing the size of the display was greater when the spectrum was represented as a solid histogram ("filled" patterns) than when it was represented as a simple spectral contour ("unfilled" patterns). Spatial masking within the filled pattern was postulated as the cause for this decline in performance. Another experiment measured the utility of the spectral display when the syllables were produced by multiple speakers. The resulting increase in response confusions was primarily attributable to variations in the tactile patterns caused by differences in vocal tract resonances among the speakers. The final experiment found an area function display to be inferior to the spectral display for identification of vowels. The results demonstrate that a two-dimensional spectral display is worthy of further development as a basic vibrotactile display for speech.

13.
The effects of noise and reverberation on the identification of monophthongs and diphthongs were evaluated for ten subjects with moderate sensorineural hearing losses. Stimuli were 15 English vowels spoken in a /b-t/ context, in a carrier sentence. The original tape was recorded without reverberation, in a quiet condition. This test tape was degraded either by recording in a room with reverberation time of 1.2 s, or by adding a babble of 12 voices at a speech-to-noise ratio of 0 dB. Both types of degradation caused statistically significant reductions of mean identification scores as compared to the quiet condition. Although the mean identification scores for the noise and reverberant conditions were not significantly different, the patterns of errors for these two conditions were different. Errors for monophthongs in reverberation but not in noise seemed to be related to an overestimation of vowel duration, and there was a tendency to weight the formant frequencies differently in the reverberation and quiet conditions. Errors for monophthongs in noise seemed to be related to spectral proximity of formant frequencies for confused pairs. For the diphthongs in both noise and reverberation, there was a tendency to judge a diphthong as the beginning monophthong. This may have been due to temporal smearing in the reverberation condition, and to a higher masked threshold for changing compared to stationary formant frequencies in the noise condition.
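Degrading stimuli with babble at a fixed speech-to-noise ratio, as described above, reduces to scaling the noise before adding it to the speech. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio of the mixture equals
    `snr_db` (e.g., 0 dB for the babble condition), then add it."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain that sets p_speech / p_scaled_noise to the target ratio
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```

Note that the SNR here is defined over the whole token; studies sometimes instead fix the level of the noise and vary the speech, which changes absolute presentation levels but not the ratio.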

14.
It has been hypothesized that the wider-than-normal auditory bandwidths attributed to sensorineural hearing loss lead to a reduced ability to discriminate spectral characteristics in speech signals. In order to investigate this possibility, the minimum detectable depth of a spectral "notch" between the second (F2) and third (F3) formants of a synthetic vowel-like stimulus was determined for normal and hearing-impaired subjects. The minimum detectable notch for all subjects was surprisingly small; values obtained were much smaller than those found in actual vowels. An analysis of the stimuli based upon intensity discrimination within a single critical band predicted only small differences in performance on this task for rather large differences in the size of the auditory bandwidth. These results suggest that impairments of auditory frequency resolution in sensorineural hearing loss may not be critical in the perception of steady-state vowels.

15.
This study explored how across-talker differences influence non-native vowel perception. American English (AE) and Korean listeners were presented with recordings of 10 AE vowels in /bVd/ context. The stimuli were mixed with noise and presented for identification in a 10-alternative forced-choice task. The two listener groups heard recordings of the vowels produced by 10 talkers at three signal-to-noise ratios. Overall the AE listeners identified the vowels 22% more accurately than the Korean listeners. There was a wide range of identification accuracy scores across talkers for both AE and Korean listeners. At each signal-to-noise ratio, the across-talker intelligibility scores were highly correlated for AE and Korean listeners. Acoustic analysis was conducted for 2 vowel pairs that exhibited variable accuracy across talkers for Korean listeners but high identification accuracy for AE listeners. Results demonstrated that Korean listeners' error patterns for these four vowels were strongly influenced by variability in vowel production that was within the normal range for AE talkers. These results suggest that non-native listeners are strongly influenced by across-talker variability perhaps because of the difficulty they have forming native-like vowel categories.

16.
An experiment investigated the effects of amplitude ratio (-35 to 35 dB in 10-dB steps) and fundamental frequency difference (0%, 3%, 6%, and 12%) on the identification of pairs of concurrent synthetic vowels. Vowels as weak as -25 dB relative to their competitor were easier to identify in the presence of a fundamental frequency difference (delta F0). Vowels as weak as -35 dB were not. Identification was generally the same at delta F0 = 3%, 6%, and 12% for all amplitude ratios: unfavorable amplitude ratios could not be compensated by larger delta F0's. Data for each vowel pair and each amplitude ratio, at delta F0 = 0%, were compared to the spectral envelope of the stimulus at the same ratio, in order to determine which spectral cues determined identification. This information was then used to interpret the pattern of improvement with delta F0 for each vowel pair, to better understand mechanisms of F0-guided segregation. Identification of a vowel was possible in the presence of strong cues belonging to its competitor, as long as cues to its own formants F1 and F2 were prominent. delta F0 enhanced the prominence of a target vowel's cues, even when the spectrum of the target was up to 10 dB below that of its competitor at all frequencies. The results are incompatible with models of segregation based on harmonic enhancement, beats, or channel selection.

17.
A new method to code the speech envelope in continuous interleaved sampling (CIS) processors for cochlear implants is proposed. In this enhanced envelope, the rapid adaptation seen in the response of auditory nerves to sound stimuli is incorporated. Two strategies, one using the standard envelope (CIS) and one using the enhanced envelope (EECIS), were tested perceptually with six postlingually deafened users of the LAURA cochlear implant. The tests included identification of stop consonants in three different vowel contexts and monosyllabic consonant-vowel-consonant (CVC) words. Significant improvements in correct identification scores were observed for stop consonants in intervocalic /a/ context (p = 0.026): average results varied from 46% correct for CIS to 55% for EECIS. This improvement was mainly due to the better transmission of place of articulation. The differences in identification scores for stop consonants in /i/ and /u/ context were not significant. The identification scores for the medial vowels of the CVC words were significantly higher when the EECIS strategy was used: average results increased from 39% correct to 46% correct (p = 0.018). No significant differences were observed between the results for initial and final consonants of the CVC words. The present results demonstrate that the inclusion of the rapid adaptation in the speech processing for cochlear implants can improve speech intelligibility.
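One generic way to emphasize onsets in a channel envelope, mimicking rapid auditory-nerve adaptation, is to add back the rectified difference between a fast- and a slow-smoothed copy of the envelope. This is a sketch of the general idea only, not the EECIS algorithm; the time constants and gain are illustrative.

```python
import numpy as np

def enhance_envelope(env, fs, fast_tau=0.001, slow_tau=0.05, gain=2.0):
    """Onset-emphasized envelope: env + gain * rectified(fast - slow),
    where fast/slow are one-pole low-pass smoothed copies of env."""
    def smooth(x, tau):
        a = np.exp(-1.0 / (tau * fs))  # one-pole low-pass coefficient
        y = np.empty_like(x)
        acc = 0.0
        for i, v in enumerate(x):
            acc = a * acc + (1.0 - a) * v
            y[i] = acc
        return y
    fast = smooth(env, fast_tau)
    slow = smooth(env, slow_tau)
    onset = np.maximum(fast - slow, 0.0)  # transient (adaptation-like) part
    return env + gain * onset
```

At a step onset the fast branch responds before the slow one, so the output briefly overshoots; in the steady state both branches converge and the enhancement decays toward zero, qualitatively like an adapting nerve response.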

18.
The ability of listeners to identify pairs of simultaneous synthetic vowels has been investigated in the first of a series of studies on the extraction of phonetic information from multiple-talker waveforms. Both members of the vowel pair had the same onset and offset times and a constant fundamental frequency of 100 Hz. Listeners identified both vowels with an accuracy significantly greater than chance. The pattern of correct responses and confusions was similar for vowels generated by (a) cascade formant synthesis and (b) additive harmonic synthesis that replaced each of the lowest three formants with a single pair of harmonics of equal amplitude. In order to choose an appropriate model for describing listeners' performance, four pattern-matching procedures were evaluated. Each predicted the probability that (i) any individual vowel would be selected as one of the two responses, and (ii) any pair of vowels would be selected. These probabilities were estimated from measures of the similarities of the auditory excitation patterns of the double vowels to those of single-vowel reference patterns. Up to 88% of the variance in individual responses and up to 67% of the variance in pairwise responses could be accounted for by procedures that highlighted spectral peaks and shoulders in the excitation pattern. Procedures that assigned uniform weight to all regions of the excitation pattern gave poorer predictions. These findings support the hypothesis that the auditory system pays particular attention to the frequencies of spectral peaks, and possibly also of shoulders, when identifying vowels. One virtue of this strategy is that the spectral peaks and shoulders can indicate the frequencies of formants when other aspects of spectral shape are obscured by competing sounds.

19.
Vowel identification in quiet, noise, and reverberation was tested with 40 subjects who varied in age and hearing level. Stimuli were 15 English vowels spoken in a /b-t/ context in a carrier sentence, which were degraded by reverberation or noise (a babble of 12 voices). Vowel identification scores were correlated with various measures of hearing loss and with age. The mean of four hearing levels at 0.5, 1, 2, and 4 kHz, termed HTL4, produced the highest correlation coefficients in all three listening conditions. The correlation with age was smaller than with HTL4 and significant only for the degraded vowels. Further analyses were performed for subjects assigned to four groups on the basis of the amount of hearing loss. In noise, performance of all four groups was significantly different, whereas, in both quiet and reverberation, only the group with the greatest hearing loss performed differently from the other groups. The relationship among hearing loss, age, and number and type of errors is discussed in light of acoustic cues available for vowel identification.

20.
Research on the perception of vowels in the last several years has given rise to new conceptions of vowels as articulatory, acoustic, and perceptual events. Starting from a "simple" target model in which vowels were characterized articulatorily as static vocal tract shapes and acoustically as points in a first and second formant (F1/F2) vowel space, this paper briefly traces the evolution of vowel theory in the 1970s and 1980s in two directions. (1) Elaborated target models represent vowels as target zones in perceptual spaces whose dimensions are specified as formant ratios. These models have been developed primarily to account for perceivers' solution of the "speaker normalization" problem. (2) Dynamic specification models emphasize the importance of formant trajectory patterns in specifying vowel identity. These models deal primarily with the problem of "target undershoot" associated with the coarticulation of vowels with consonants in natural speech and with the issue of "vowel-inherent spectral change" or diphthongization of English vowels. Perceptual studies are summarized that motivate these theoretical developments.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号