Similar Documents
20 similar documents found (search time: 15 ms)
1.
An experiment investigated the effects of amplitude ratio (-35 to 35 dB in 10-dB steps) and fundamental frequency difference (0%, 3%, 6%, and 12%) on the identification of pairs of concurrent synthetic vowels. Vowels as weak as -25 dB relative to their competitor were easier to identify in the presence of a fundamental frequency difference (ΔF0). Vowels as weak as -35 dB were not. Identification was generally the same at ΔF0 = 3%, 6%, and 12% for all amplitude ratios: unfavorable amplitude ratios could not be compensated for by larger ΔF0s. Data for each vowel pair and each amplitude ratio, at ΔF0 = 0%, were compared to the spectral envelope of the stimulus at the same ratio, in order to determine which spectral cues determined identification. This information was then used to interpret the pattern of improvement with ΔF0 for each vowel pair, to better understand mechanisms of F0-guided segregation. Identification of a vowel was possible in the presence of strong cues belonging to its competitor, as long as cues to its own formants F1 and F2 were prominent. ΔF0 enhanced the prominence of a target vowel's cues, even when the spectrum of the target was up to 10 dB below that of its competitor at all frequencies. The results are incompatible with models of segregation based on harmonic enhancement, beats, or channel selection.
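A minimal Python sketch of the stimulus manipulation this abstract describes: two harmonic complexes stand in for the concurrent synthetic vowels, mixed at a given amplitude ratio with the target's F0 shifted by a percentage ΔF0. The flat harmonic complexes, sampling rate, and duration are illustrative stand-ins, not the original formant synthesis.

```python
import numpy as np

def harmonic_complex(f0, sr=16000, dur=0.2, n_harmonics=30):
    """A flat harmonic complex standing in for a synthetic vowel."""
    t = np.arange(int(sr * dur)) / sr
    return sum(np.sin(2 * np.pi * h * f0 * t) for h in range(1, n_harmonics + 1))

def vowel_pair(ratio_db, delta_f0_pct, f0=100.0):
    """Mix target and competitor at a dB amplitude ratio; shift target F0 by dF0%."""
    target = harmonic_complex(f0 * (1.0 + delta_f0_pct / 100.0))
    competitor = harmonic_complex(f0)
    return 10.0 ** (ratio_db / 20.0) * target + competitor  # ratio: target re competitor

mix = vowel_pair(ratio_db=-25.0, delta_f0_pct=3.0)  # weak target, small dF0
```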

2.
3.
Four experiments explored the relative contributions of spectral content and phonetic labeling in effects of context on vowel perception. Two 10-step series of CVC syllables ([bVb] and [dVd]) varying acoustically in F2 midpoint frequency and varying perceptually in vowel height from [delta] to [epsilon] were synthesized. In a forced-choice identification task, listeners more often labeled vowels as [delta] in [dVd] context than in [bVb] context. To examine whether spectral content predicts this effect, nonspeech-speech hybrid series were created by appending 70-ms sine-wave glides following the trajectory of CVC F2's to 60-ms members of a steady-state vowel series varying in F2 frequency. In addition, a second hybrid series was created by appending constant-frequency sine-wave tones equivalent in frequency to CVC F2 onset/offset frequencies. Vowels flanked by frequency-modulated glides or steady-state tones modeling [dVd] were more often labeled as [delta] than were the same vowels surrounded by nonspeech modeling [bVb]. These results suggest that spectral content is important in understanding vowel context effects. A final experiment tested whether spectral content can modulate vowel perception when phonetic labeling remains intact. Voiceless consonants, with lower-amplitude more-diffuse spectra, were found to exert less of an influence on vowel perception than do their voiced counterparts. The data are discussed in terms of a general perceptual account of context effects in speech perception.

4.
The identification of front vowels was studied in normal-hearing listeners using stimuli whose spectra had been altered to approximate the spectrum of vowels processed by auditory filters similar to those that might accompany sensorineural hearing loss. In the first experiment, front vowels were identified with greater than 95% accuracy when the first formant was specified in a normal manner and the higher frequency formants were represented by a broad, flat spectral plateau ranging from approximately 1600 to 3500 Hz. In the second experiment, the bandwidth of the first formant was systematically widened for stimuli with already flattened higher frequency formants. Normal vowel identification was preserved until the first formant was widened to six times its normal bandwidth. These results may account for the coexistence of abnormal vowel masking patterns (indicating flattened auditory spectra) and normal vowel recognition.

5.
Confusion matrices for seven synthetic steady-state vowels were obtained from ten normal and three hearing-impaired subjects. The vowels were identified at greater than 96% accuracy by the normals, and less accurately by the impaired subjects. Shortened versions of selected vowels then were used as maskers, and vowel masking patterns (VMPs), consisting of forward-masked thresholds for sinusoidal probes at all vowel masker harmonics, were obtained from the impaired subjects and from one normal subject. Vowel-masked probe thresholds were transformed using growth-of-masking functions obtained with flat-spectrum noise. VMPs of the impaired subjects, relative to those of the normal, were characterized by smaller dynamic range, poorer peak resolution, and poorer preservation of the vowel formant structure. These VMP characteristics, however, did not necessarily coincide with inaccurate vowel recognition. Vowel identification appeared to be related primarily to VMP peak frequencies rather than to the levels at the peaks or to between-peak characteristics of the patterns.

6.
Vowel identification was tested in quiet, noise, and reverberation with 20 normal-hearing subjects and 20 hearing-impaired subjects. Stimuli were 15 English vowels spoken in a /b-t/ context by six male talkers. Each talker produced five tokens of each vowel. In quiet, all stimuli were identified by two judges as the intended targets. The stimuli were degraded by reverberation or speech-spectrum noise. Vowel identification scores depended upon talker, listening condition, and subject type. The relationship between identification errors and spectral details of the vowels is discussed.

7.
The purpose of this paper is to propose and evaluate a new model of vowel perception which assumes that vowel identity is recognized by a template-matching process involving the comparison of narrow band input spectra with a set of smoothed spectral-shape templates that are learned through ordinary exposure to speech. In the present simulation of this process, the input spectra are computed over a sufficiently long window to resolve individual harmonics of voiced speech. Prior to template creation and pattern matching, the narrow band spectra are amplitude equalized by a spectrum-level normalization process, and the information-bearing spectral peaks are enhanced by a "flooring" procedure that zeroes out spectral values below a threshold function consisting of a center-weighted running average of spectral amplitudes. Templates for each vowel category are created simply by averaging the narrow band spectra of like vowels spoken by a panel of talkers. In the present implementation, separate templates are used for men, women, and children. The pattern matching is implemented with a simple city-block distance measure given by the sum of the channel-by-channel differences between the narrow band input spectrum (level-equalized and floored) and each vowel template. Spectral movement is taken into account by computing the distance measure at several points throughout the course of the vowel. The input spectrum is assigned to the vowel template that results in the smallest difference accumulated over the sequence of spectral slices. The model was evaluated using a large database consisting of 12 vowels in /hVd/ context spoken by 45 men, 48 women, and 46 children. The narrow band model classified vowels in this database with a degree of accuracy (91.4%) approaching that of human listeners.
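The processing chain in this abstract lends itself to a compact sketch. The Python below is a minimal, assumption-laden version: the FFT window length, the flooring window and center weight, and the template handling are illustrative placeholders, not the paper's actual parameter values.

```python
import numpy as np

def narrowband_spectrum(signal, n_fft=2048):
    """Log-magnitude spectrum from a window long enough to resolve harmonics."""
    windowed = signal[:n_fft] * np.hanning(n_fft)
    mag = np.abs(np.fft.rfft(windowed)) + 1e-10
    return 20.0 * np.log10(mag)

def level_normalize(spec_db):
    """Equalize overall spectrum level so matching depends on shape, not loudness."""
    return spec_db - spec_db.mean()

def floor_spectrum(spec_db, half_width=20, center_weight=4.0):
    """Zero out values below a center-weighted running average of the spectrum."""
    n = len(spec_db)
    floored = np.zeros(n)
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        weights = np.ones(hi - lo)
        weights[i - lo] = center_weight               # center-weighted average
        threshold = np.average(spec_db[lo:hi], weights=weights)
        if spec_db[i] > threshold:
            floored[i] = spec_db[i]                   # keep peaks, zero the rest
    return floored

def make_template(token_spectra):
    """Average the processed narrow-band spectra of like vowels across talkers."""
    return np.mean([floor_spectrum(level_normalize(s)) for s in token_spectra], axis=0)

def classify(slices, templates):
    """City-block distance accumulated over spectral slices; smallest total wins."""
    def dist(label):
        return sum(np.sum(np.abs(floor_spectrum(level_normalize(s)) - templates[label]))
                   for s in slices)
    return min(templates, key=dist)
```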

8.
To determine the minimum difference in amplitude between spectral peaks and troughs sufficient for vowel identification by normal-hearing and hearing-impaired listeners, four vowel-like complex sounds were created by summing the first 30 harmonics of a 100-Hz tone. The amplitudes of all harmonics were equal, except for two consecutive harmonics located at each of three "formant" locations. The amplitudes of these harmonics were equal and ranged from 1 to 8 dB above those of the remaining components. Normal-hearing listeners achieved greater than 75% accuracy when peak-to-trough differences were 1-2 dB. Normal-hearing listeners who were tested in a noise background sufficient to raise their thresholds to the level of a flat, moderate hearing loss needed a 4-dB difference for identification. Listeners with a moderate, flat hearing loss required a 6- to 7-dB difference for identification. The results suggest, for normal-hearing listeners, that the peak-to-trough amplitude difference required for identification of this set of vowels is very near the threshold for detection of a change in the amplitude spectrum of a complex signal. Hearing-impaired listeners may have difficulty using closely spaced formants for vowel identification due to abnormal smoothing of the internal representation of the spectrum by broadened auditory filters.
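A sketch of the stimulus construction described above: 30 equal-amplitude harmonics of a 100-Hz tone, with two consecutive harmonics raised at each "formant" location. The harmonic numbers chosen for the formant locations are illustrative; the abstract does not give the exact locations used.

```python
import numpy as np

def vowel_like_complex(peak_db, formant_harmonics=((4, 5), (10, 11), (22, 23)),
                       f0=100.0, n_harmonics=30, sr=16000, dur=0.5):
    """Sum 30 harmonics of f0, boosting two consecutive harmonics per 'formant'."""
    boosted = {h for pair in formant_harmonics for h in pair}
    gain = 10.0 ** (peak_db / 20.0)            # peak-to-trough difference in dB
    t = np.arange(int(sr * dur)) / sr
    signal = np.zeros_like(t)
    for h in range(1, n_harmonics + 1):
        amp = gain if h in boosted else 1.0
        signal += amp * np.sin(2 * np.pi * h * f0 * t)
    return signal / np.max(np.abs(signal))     # normalize to avoid clipping

stimulus = vowel_like_complex(peak_db=2.0)     # 2-dB peaks: near normal-hearing limit
```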

9.
Cross-generational and cross-dialectal variation in vowels among speakers of American English was examined in terms of vowel identification by listeners and vowel classification using pattern recognition. Listeners from Western North Carolina and Southeastern Wisconsin identified 12 vowel categories produced by 120 speakers stratified by age (old adults, young adults, and children), gender, and dialect. The vowels /ɪ, o, ʊ, u/ were well identified by both groups of listeners. The majority of confusions were for the front /i, ɪ, e, ɛ, æ/, the low back /ɑ, ɔ/ and the monophthongal North Carolina /aɪ/. For selected vowels, generational differences in acoustic vowel characteristics were perceptually salient, suggesting listeners' responsiveness to sound change. Female exemplars and native-dialect variants produced higher identification rates. Linear discriminant analyses which examined dialect and generational classification accuracy showed that sampling the formant pattern at vowel midpoint only is insufficient to separate the vowels. Two sample points near onset and offset provided enough information for successful classification. The models trained on one dialect classified the vowels from the other dialect with much lower accuracy. The results strongly support the importance of dynamic information in accurate classification of cross-generational and cross-dialectal variations.
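A sketch of the discriminant-analysis comparison described above, using scikit-learn's LDA on synthetic formant tracks. The two vowel classes and all numeric values here are fabricated purely to illustrate the midpoint-only versus onset+offset feature setup, not the study's data.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

def fake_tokens(f1_path, f2_path, n=50, sd=60.0):
    """Tokens as (F1, F2) pairs sampled near onset, midpoint, and offset."""
    base = np.array([f1_path, f2_path]).T.ravel()   # F1/F2 at each sample point
    return base + rng.normal(0.0, sd, size=(n, base.size))

# Two hypothetical vowel classes with similar midpoints but different dynamics.
a = fake_tokens(f1_path=[550, 600, 500], f2_path=[1700, 1800, 2000])
b = fake_tokens(f1_path=[650, 600, 650], f2_path=[1900, 1800, 1600])
X = np.vstack([a, b])
y = np.array([0] * len(a) + [1] * len(b))

midpoint_only = X[:, [2, 3]]        # F1, F2 at the midpoint only
onset_offset = X[:, [0, 1, 4, 5]]   # F1, F2 near onset and offset

for name, feats in [("midpoint", midpoint_only), ("onset+offset", onset_offset)]:
    acc = LinearDiscriminantAnalysis().fit(feats, y).score(feats, y)
    print(f"{name}: {acc:.2f}")     # dynamic features separate the classes
```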

10.
11.
As part of an investigation of the temporal implementation rules of English, measurements were made of voice-onset time for initial English stops and the duration of the following voiced vowel in monosyllabic words for New York City speakers. It was found that the VOT of a word-initial consonant was longer before a voiceless final cluster than before a single nasal, and longer before tense vowels than lax vowels. The vowels were also longer in environments where VOT was longer, but VOT did not maintain a constant ratio with the vowel duration, even for a single place of articulation. VOT was changed by a smaller proportion than the following voiced vowel in both cases. VOT changes associated with the vowel were consistent across place of articulation of the stop. In the final experiment, when vowel tensity and final consonant effects were combined, it was found that the proportion of vowel duration change that carried over to the preceding VOT is different for the two phonetic changes. These results imply that temporal implementation rules simultaneously influence several acoustic intervals including both VOT and the "inherent" interval corresponding to a segment, either by independent control of the relevant articulatory variables or by some unknown common mechanism.

12.
Standard continuous interleaved sampling processing and a modified processing strategy designed to enhance temporal cues to voice pitch were compared on tests of intonation perception and vowel perception, both in implant users and in acoustic simulations. In standard processing, 400-Hz low-pass envelopes modulated either pulse trains (implant users) or noise carriers (simulations). In the modified strategy, slow-rate envelope modulations, which convey the dynamic spectral variation crucial for speech understanding, were extracted by low-pass filtering (32 Hz). In addition, during voiced speech, higher-rate temporal modulation in each channel was provided by 100% amplitude modulation by a sawtooth-like waveform whose periodicity followed the fundamental frequency (F0) of the input. Channel levels were determined by the product of the lower- and higher-rate modulation components. Both in acoustic simulations and in implant users, the ability to use intonation information to identify sentences as question or statement was significantly better with modified processing. However, while there was no difference in vowel recognition in the acoustic simulation, implant users performed worse with modified processing both in vowel recognition and in formant frequency discrimination. It appears that, while enhancing pitch perception, the modified processing harmed the transmission of spectral information.
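A simplified single-channel sketch of the modified strategy described above, assuming a known, constant F0 and standard scipy Butterworth filters; real implant processing and F0 tracking are considerably more involved.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def slow_envelope(channel, sr, cutoff=32.0):
    """Extract the slow (<32 Hz) envelope carrying dynamic spectral variation."""
    b, a = butter(4, cutoff / (sr / 2))
    return np.maximum(filtfilt(b, a, np.abs(channel)), 0.0)

def f0_sawtooth(n, sr, f0):
    """Sawtooth-like modulator at F0, scaled 0..1 for 100% amplitude modulation."""
    t = np.arange(n) / sr
    return (t * f0) % 1.0

def modified_channel_level(channel, sr, f0, voiced=True):
    """Channel level = slow envelope x F0-rate modulator (voiced frames only)."""
    env = slow_envelope(channel, sr)
    if voiced:
        env = env * f0_sawtooth(len(channel), sr, f0)
    return env
```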

13.
A hybrid PARAFAC and principal-component model of tongue configuration in vowel production is presented, using a corpus of German vowels in multiple consonant contexts (fleshpoint data for seven speakers at two speech rates from electromagnetic articulography). The PARAFAC approach is attractive for explicitly separating speaker-independent and speaker-dependent effects within a parsimonious linear model. However, it proved impossible to derive a PARAFAC solution of the complete dataset (estimated to require three factors) due to complexities introduced by the consonant contexts. Accordingly, the final model was derived in two stages. First, a two-factor PARAFAC model was extracted. This succeeded; the result was treated as the basic vowel model. Second, the PARAFAC model error was subjected to a separate principal-component analysis for each subject. This revealed a further articulatory component mainly involving tongue-blade activity associated with the flanking consonants. However, the subject-specific details of the mapping from raw fleshpoint coordinates to this component were too complex to be consistent with the PARAFAC framework. The final model explained over 90% of the variance and gave a succinct and physiologically plausible articulatory representation of the German vowel space.
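A sketch of the two-stage modeling described above, assuming tensorly's CP-decomposition API for the PARAFAC stage and a per-speaker SVD for the residual principal component. The array shape and rank are illustrative stand-ins (speakers x tokens x fleshpoint coordinates), not the study's data.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(1)
# Placeholder array: speakers x vowel/context tokens x fleshpoint coordinates.
data = tl.tensor(rng.normal(size=(7, 24, 12)))

# Stage 1: two-factor PARAFAC gives speaker-independent articulatory factors
# with speaker-specific weights (the basic vowel model).
cp = parafac(data, rank=2)
residual = tl.to_numpy(data) - tl.to_numpy(tl.cp_to_tensor(cp))

# Stage 2: a separate principal component of the residual for each speaker,
# capturing the consonant-related (tongue-blade) component.
for s in range(residual.shape[0]):
    r = residual[s] - residual[s].mean(axis=0)
    _, sv, vt = np.linalg.svd(r, full_matrices=False)
    pc1 = vt[0]                              # leading residual component
    share = sv[0] ** 2 / np.sum(sv ** 2)     # variance it explains
```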

14.
A model of the vocal-tract area function is described that consists of four tiers. The first tier is a vowel substrate defined by a system of spatial eigenmodes and a neutral area function determined from MRI-based vocal-tract data. The input parameters to the first tier are coefficient values that, when multiplied by the appropriate eigenmode and added to the neutral area function, construct a desired vowel. The second tier consists of a consonant shaping function defined along the length of the vocal tract that can be used to modify the vowel substrate such that a constriction is formed. Input parameters consist of the location, area, and range of the constriction. Location and area roughly correspond to the standard phonetic specifications of place and degree of constriction, whereas the range defines the amount of vocal-tract length over which the constriction will influence the tract shape. The third tier allows length modifications for articulatory maneuvers such as lip rounding/spreading and larynx lowering/raising. Finally, the fourth tier provides control of the level of acoustic coupling of the vocal tract to the nasal tract. All parameters can be specified either as static or time varying, which allows for multiple levels of coarticulation or coproduction.
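A sketch of the first two tiers described above: an eigenmode vowel substrate plus a constriction superimposed on it. The neutral area function, the eigenmodes, and the triangular constriction window are illustrative stand-ins for the MRI-derived quantities in the actual model.

```python
import numpy as np

N_SECTIONS = 44                                   # tubelets from glottis to lips
x = np.linspace(0.0, 1.0, N_SECTIONS)             # normalized distance along tract
neutral = 3.0 * np.ones(N_SECTIONS)               # placeholder neutral area (cm^2)
modes = np.array([np.cos(np.pi * x),              # placeholder spatial eigenmodes
                  np.cos(2 * np.pi * x)])

def vowel_substrate(coeffs):
    """Tier 1: coefficients scale the eigenmodes added to the neutral function."""
    return neutral + coeffs @ modes

def apply_constriction(area, location, target_area, extent):
    """Tier 2: blend the area toward the constriction area inside its range."""
    w = np.clip(1.0 - np.abs(x - location) / extent, 0.0, 1.0)  # 1 at center
    return (1.0 - w) * area + w * target_area

area_fn = apply_constriction(vowel_substrate(np.array([1.2, -0.5])),
                             location=0.85, target_area=0.1, extent=0.1)
```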

15.
16.
Research on the perception of vowels in the last several years has given rise to new conceptions of vowels as articulatory, acoustic, and perceptual events. Starting from a "simple" target model in which vowels were characterized articulatorily as static vocal tract shapes and acoustically as points in a first and second formant (F1/F2) vowel space, this paper briefly traces the evolution of vowel theory in the 1970s and 1980s in two directions. (1) Elaborated target models represent vowels as target zones in perceptual spaces whose dimensions are specified as formant ratios. These models have been developed primarily to account for perceivers' solution of the "speaker normalization" problem. (2) Dynamic specification models emphasize the importance of formant trajectory patterns in specifying vowel identity. These models deal primarily with the problem of "target undershoot" associated with the coarticulation of vowels with consonants in natural speech and with the issue of "vowel-inherent spectral change" or diphthongization of English vowels. Perceptual studies are summarized that motivate these theoretical developments.  相似文献   

17.
Auditory-perceptual interpretation of the vowel

18.
The harmonic sieve has been proposed as a mechanism for excluding extraneous frequency components from the estimate of the pitch of a complex sound. The experiments reported here examine whether a harmonic sieve could also determine whether a particular harmonic contributes to the phonetic quality of a vowel. Mistuning a harmonic in the first formant region of vowels from an /I/-/e/ continuum gave shifts in the phoneme boundary that could be explained by (i) phase effects for small amounts of mistuning and (ii) a harmonic sievelike grouping mechanism for larger amounts of mistuning. Similar grouping criteria to those suggested for pitch may operate for the determination of first formant frequency in voiced speech.
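A minimal sketch of a harmonic sieve of the kind this abstract invokes, assuming a ±3% tolerance slot around integer multiples of each candidate F0; the tolerance and scoring are illustrative, not taken from the paper.

```python
import numpy as np

def sieve_score(components_hz, f0, tolerance=0.03):
    """Count components accepted by harmonic slots around f0; penalize misfit."""
    comp = np.asarray(components_hz, dtype=float)
    harm = np.maximum(np.round(comp / f0), 1.0)   # nearest harmonic number
    err = np.abs(comp / (harm * f0) - 1.0)        # relative mistuning
    hits = err < tolerance                        # falls inside a sieve slot?
    return hits.sum() - err[hits].sum()           # tie-break on goodness of fit

def sieve_pitch(components_hz, candidates):
    return max(candidates, key=lambda f0: sieve_score(components_hz, f0))

# A 155-Hz complex whose 3rd harmonic is mistuned by +8%: the sieve rejects
# the mistuned component and still recovers the 155-Hz pitch.
parts = [155, 310, 502, 620, 775]
print(sieve_pitch(parts, candidates=np.arange(100.0, 200.0, 1.0)))  # -> 155.0
```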

19.
20.
A quantitative perceptual model of human vowel recognition based upon psychoacoustic and speech perception data is described. At an intermediate auditory stage of processing, the specific bark difference level of the model represents the pattern of peripheral auditory excitation as the distance in critical bands (barks) between neighboring formants and between the fundamental frequency (F0) and first formant (F1). At a higher, phonetic stage of processing, represented by the critical bark difference level of the model, the transformed vowels may be dichotomously classified based on whether the difference between formants in each dimension falls within or exceeds the critical distance of 3 bark for the spectral center of gravity effect [Chistovich et al., Hear. Res. 1, 185-195 (1979)]. Vowel transformations and classifications correspond well to several major phonetic dimensions and features by which vowels are perceived and traditionally classified. The F1-F0 dimension represents vowel height, and high vowels have F1-F0 differences within 3 bark. The F3-F2 dimension corresponds to vowel place of articulation, and front vowels have F3-F2 differences of less than 3 bark. As an inherent, speaker-independent normalization procedure, the model provides excellent vowel clustering while it greatly reduces between-speaker variability. It offers robust normalization through feature classification because gross binary categorization allows for considerable acoustic variability. There was generally less formant and bark difference variability for closely spaced formants than for widely spaced formants. These findings agree with independently observed perceptual results and support Stevens' quantal theory of vowel production and perceptual constraints on production predicted from the critical bark difference level of the model.
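A sketch of the critical-bark-difference classification described above. The 3-bark criterion follows the abstract, but the hertz-to-bark conversion here is one common approximation (Traunmüller, 1990), not necessarily the one used in the original model.

```python
def hz_to_bark(f_hz):
    """Traunmueller's (1990) approximation of the bark scale."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def classify_vowel(f0, f1, f2, f3, critical=3.0):
    """Dichotomous height/place classification by the 3-bark critical distance."""
    b0, b1, b2, b3 = (hz_to_bark(f) for f in (f0, f1, f2, f3))
    height = "high" if (b1 - b0) < critical else "nonhigh"   # F1-F0 dimension
    place = "front" if (b3 - b2) < critical else "back"      # F3-F2 dimension
    return height, place

# Illustrative /i/-like formants classify as high and front:
print(classify_vowel(f0=120, f1=300, f2=2300, f3=3000))  # ('high', 'front')
```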
