期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Voiceless affricate/fricative distinction by frication duration and amplitude rise slope

Mitani S Kitama T Sato Y 《The Journal of the Acoustical Society of America》2006,120(3):1600-1607

Previous psychophysical studies have shown that the perceptual distinction between voiceless fricatives and affricates in consonant-vowel syllables depends primarily on frication duration, whereas amplitude rise slope was suggested as the cue in automatic classification experiments. The effects of both cues on the manner of articulation between /integral of/ and /t integral of/ were investigated. Subjects performed a forced-choice task (/integral of/ or /t integral of) in response to edited waveforms of Japanese fricatives /integral of i/, /integral of u/, and /integral of a/. We found that frication duration, onset slope, and the interaction between duration and onset slope influenced the perceptual distinction. That is, the percent of /integral of/ responses increased with an increase in frication duration (experiments 1-3). The percent of /integral of/ responses also increased with a decrease in slope steepness (experiment 3), and the relative importance between slope portions was not even but weighted at onset (experiments 1 and 2). There was an interaction between the two cues of frication duration and steepness. The relative importance of the slope cue was maximum at a frication duration of 150 ms (experiment 3). It is concluded that the frication duration and amplitude rise slope at frication onset are acoustic cues that discriminate between /integral of/ and /t integral of/, and that the two cues interact with each other. 相似文献

2.

Perception of familiar contrasts in unfamiliar positions

Broersma M 《The Journal of the Acoustical Society of America》2005,117(6):3890-3901

This paper investigates the perception of non-native phoneme contrasts which exist in the native language, but not in the position tested. Like English, Dutch contrasts voiced and voiceless obstruents. Unlike English, Dutch allows only voiceless obstruents in word-final position. Dutch and English listeners' accuracy on English final voicing contrasts and their use of preceding vowel duration as a voicing cue were tested. The phonetic structure of Dutch should provide the necessary experience for a native-like use of this cue. Experiment 1 showed that Dutch listeners categorized English final /z/-/s/, /v/-/f/, /b/-/p/, and /d/-/t/ contrasts in nonwords as accurately as initial contrasts, and as accurately as English listeners did, even when release bursts were removed. In experiment 2, English listeners used vowel duration as a cue for one final contrast, although it was uninformative and sometimes mismatched other voicing characteristics, whereas Dutch listeners did not. Although it should be relatively easy for them, Dutch listeners did not use vowel duration. Nevertheless, they attained native-like accuracy, and sometimes even outperformed the native listeners who were liable to be misled by uninformative vowel duration information. Thus, native-like use of cues for non-native but familiar contrasts in unfamiliar positions may hardly ever be attained. 相似文献

3.

Phonetic identification in quiet and in noise by listeners with cochlear implants

Munson B Nelson PB 《The Journal of the Acoustical Society of America》2005,118(4):2607-2617

This study examined the effect of noise on the identification of four synthetic speech continua (/ra/-/la/, /wa/-/ja/, /i/-/u/, and say-stay) by adults with cochlea implants (CIs) and adults with normal-hearing (NH) sensitivity in quiet and noise. Significant group-by-SNR interactions were found for endpoint identification accuracy for all continua except /i/-/u/. The CI listeners showed the least NH-like identification functions for the /ra/-/la/ and /wa/-/ja/ continua. In a second experiment, NH adults identified four- and eight-band cochlear implant stimulations of the four continua, to examine whether group differences in frequency selectivity could account for the group differences in the first experiment. Number of bands and SNR interacted significantly for /ra/-/la/, /wa/-/ja/, and say-stay endpoint identification; strongest effects were found for the /ra/-/la/ and say-stay continua. Results suggest that the speech features that are most vulnerable to misperception in noise by listeners with CIs are those whose acoustic cues are rapidly changing spectral patterns, like the formant transitions in the /wa/-/ja/ and /ra/-/la/ continua. However, the group differences in the first experiment cannot be wholly attributable to frequency selectivity differences, as the number of bands in the second experiment affected performance differently than suggested by group differences in the first experiment. 相似文献

4.

Speech coding in the auditory nerve: IV. Sounds with consonant-like dynamic characteristics

B Delgutte N Y Kiang 《The Journal of the Acoustical Society of America》1984,75(3):897-907

Discharge patterns of auditory-nerve fibers in anesthetized cats were obtained for two stimulus levels in response to synthetic stimuli with dynamic characteristics appropriate for selected consonants. A set of stimuli was constructed by preceding a signal that was identified as /da/by another sound that was systematically manipulated so that the entire complex would sound like either /da/, /ada/, /na/, /sa/, /sa/, or others. Discharge rates of auditory-nerve fibers in response to the common /da/-like formant transitions depended on the preceding context. Average discharge rates during these transitions decreased most for fibers whose CFs were in frequency regions where the context had considerable energy. Some effect of the preceding context on fine time patterns of response to the transitions was also found, but the identity of the largest response components (which often corresponded to the formant frequencies) was in general unaffected. Thus the response patterns during the formant transitions contain cues about both the nature of the transitions and the preceding context. A second set of stimuli sounding like /s/ and /c/ was obtained by varying the duration of the rise in amplitude at the onset of a filtered noise burst. At both 45 and 60 dB SPL, there were fibers which showed a more prominent peak in discharge rate at stimulus onset for /c/ than for /s/, but the CF regions that reflected the clearest distinctions depended on stimulus level. The peaks in discharge rate that occur in response to rapid changes in amplitude or spectrum might be used by the central processor as pointers to portions of speech signals that are rich in phonetic information. 相似文献

5.

Cortical auditory evoked potential correlates of categorical perception of voice-onset time.

A Sharma M F Dorman 《The Journal of the Acoustical Society of America》1999,106(2):1078-1083

The goal of this study was to examine the neural encoding of voice-onset time distinctions that indicate the phonetic categories /da/ and /ta/ for human listeners. Cortical Auditory Evoked Potentials (CAEP) were measured in conjunction with behavioral perception of a /da/-/ta/ continuum. Sixteen subjects participated in identification and discrimination experiments. A sharp category boundary was revealed between /da/ and /ta/ around the same location for all listeners. Subjects' discrimination of a VOT change of equal magnitude was significantly more accurate across the /da/-/ta/ categories than within the /ta/ category. Neurophysiologic correlates of VOT encoding were investigated using the N1 CAEP which reflects sensory encoding of stimulus features and the MMN CAEP which reflects sensory discrimination. The MMN elicited by the across-category pair was larger and more robust than the MMN which occurred in response to the within-category pair. Distinct changes in N1 morphology were related to VOT encoding. For stimuli that were behaviorally identified as /da/, a single negativity (N1) was apparent; however, for stimuli identified as /ta/, two distinct negativities (N1 and N1') were apparent. Thus the enhanced MMN responses and the morphological discontinuity in N1 morphology observed in the region of the /da/-/ta/ phonetic boundary appear to provide neurophysiologic correlates of categorical perception for VOT. 相似文献

6.

Vowel perception by adults and children with normal language and specific language impairment: based on steady states or transitions?

Sussman JE 《The Journal of the Acoustical Society of America》2001,109(3):1173-1180

The current investigation studied whether adults, children with normally developing language aged 4-5 years, and children with specific language impairment, aged 5-6 years identified vowels on the basis of steady-state or transitional formant frequencies. Four types of synthetic tokens, created with a female voice, served as stimuli: (1) steady-state centers for the vowels [i] and [ae]; (2) voweless tokens with transitions appropriate for [bib] and [baeb]; (3) "congruent" tokens that combined the first two types of stimuli into [bib] and [baeb]; and (4) "conflicting" tokens that combined the transitions from [bib] with the vowel from [baeb] and vice versa. Results showed that children with language impairment identified the [i] vowel more poorly than other subjects for both the voweless and congruent tokens. Overall, children identified vowels most accurately in steady-state centers and congruent stimuli (ranging between 94%-96%). They identified the vowels on the basis of transitions only from "voweless" tokens with 89% and 83.5% accuracy for the normally developing and language impaired groups, respectively. Children with normally developing language used steady-state cues to identify vowels in 87% of the conflicting stimuli, whereas children with language impairment did so for 79% of the stimuli. Adults were equally accurate for voweless, steady-state, and congruent tokens (ranging between 99% to 100% accuracy) and used both steady-state and transition cues for vowel identification. Results suggest that most listeners prefer the steady state for vowel identification but are capable of using the onglide/offglide transitions for vowel identification. Results were discussed with regard to Nittrouer's developmental weighting shift hypothesis and Strange and Jenkin's dynamic specification theory. 相似文献

7.

Perceptual effects of plosive feature modification

Kapoor A Allen JB 《The Journal of the Acoustical Society of America》2012,131(1):478-491

In the 1970-1980's, a number of papers explored the role of the transitional and burst features in consonant-vowel context. These papers left unresolved the relative importance of these two acoustic cues. This research takes advantage of refined signal processing methods, allowing for the visualization and modification of acoustic details. This experiment explores the impact of modifying the strength of the acoustic burst feature on the recognition scores P(c)(SNR) (function of the signal-to-noise ratio), for four plosive sounds /ta, ka, da, ga/. These results show high correlations between the relative burst intensity and the scores P(c)(SNR). Based on this correlation, one must conclude that these bursts are the primary acoustic cues used for the identification of these four consonants. This is in contrast to previous experiments, which used less precise methods to manipulate speech, and observe complex relationships between the scores, bursts and transition cues. In cases where the burst feature is removed entirely, it is shown that naturally existing conflicting acoustic features dominate the score. These observations seem directly inconsistent with transition cues playing a role: if the transition cues were important, they would dominate over low-level conflicting burst cues. These limited results arguably rule out the concept of redundant cues. 相似文献

8.

Effects of deafness on acoustic characteristics of American English tense/lax vowels in maternal speech to infants

MV Kondaurova TR Bergeson LC Dilley 《The Journal of the Acoustical Society of America》2012,132(2):1039-1049

Recent studies have demonstrated that mothers exaggerate phonetic properties of infant-directed (ID) speech. However, these studies focused on a single acoustic dimension (frequency), whereas speech sounds are composed of multiple acoustic cues. Moreover, little is known about how mothers adjust phonetic properties of speech to children with hearing loss. This study examined mothers' production of frequency and duration cues to the American English tense/lax vowel contrast in speech to profoundly deaf (N?=?14) and normal-hearing (N?=?14) infants, and to an adult experimenter. First and second formant frequencies and vowel duration of tense (/i/,?/u/) and lax (/I/,?/?/) vowels were measured. Results demonstrated that for both infant groups mothers hyperarticulated the acoustic vowel space and increased vowel duration in ID speech relative to adult-directed speech. Mean F2 values were decreased for the /u/ vowel and increased for the /I/ vowel, and vowel duration was longer for the /i/, /u/, and /I/ vowels in ID speech. However, neither acoustic cue differed in speech to hearing-impaired or normal-hearing infants. These results suggest that both formant frequencies and vowel duration that differentiate American English tense/lx vowel contrasts are modified in ID speech regardless of the hearing status of the addressee. 相似文献

9.

Production of the word-final English /t/-/d/ contrast by native speakers of English, Mandarin, and Spanish.

J E Flege M J Munro L Skelton 《The Journal of the Acoustical Society of America》1992,92(1):128-143

The primary aim of this study was to determine if adults whose native language permits neither voiced nor voiceless stops to occur in word-final position can master the English word-final /t/-/d/ contrast. Native English-speaking listeners identified the voicing feature in word-final stops produced by talkers in five groups: native speakers of English, experienced and inexperienced native Spanish speakers of English, and experienced and inexperienced native Mandarin speakers of English. Contrary to hypothesis, the experienced second language (L2) learners' stops were not identified significantly better than stops produced by the inexperienced L2 learners; and their stops were correctly identified significantly less often than stops produced by the native English speakers. Acoustic analyses revealed that the native English speakers made vowels significantly longer before /d/ than /t/, produced /t/-final words with a higher F1 offset frequency than /d/-final words, produced more closure voicing in /d/ than /t/, and sustained closure longer for /t/ than /d/. The L2 learners produced the same kinds of acoustic differences between /t/ and /d/, but theirs were usually of significantly smaller magnitude. Taken together, the results suggest that only a few of the 40 L2 learners examined in the present study had mastered the English word-final /t/-/d/ contrast. Several possible explanations for this negative finding are presented. Multiple regression analyses revealed that the native English listeners made perceptual use of the small, albeit significant, vowel duration differences produced in minimal pairs by the nonnative speakers. A significantly stronger correlation existed between vowel duration differences and the listeners' identifications of final stops in minimal pairs when the perceptual judgments were obtained in an "edited" condition (where post-vocalic cues were removed) than in a "full cue" condition. This suggested that listeners may modify their identification of stops based on the availability of acoustic cues. 相似文献

10.

The development of skill in producing word-final English stops: kinematic parameters

J E Flege 《The Journal of the Acoustical Society of America》1988,84(5):1639-1652

It was hypothesized that native English adults would be more skillful in producing word-final English /p/ and /b/ than native English children who, in turn, would be more skillful in doing so than adult native speakers of a language (Mandarin Chinese) that does not possess word-final stops. A video tracking system was used to monitor lip and jaw movements. The subjects in all three groups made vowels significantly longer before /b/ than /p/, but the effect seen for the English subjects was three times as large as the Chinese subjects' effect and depended less on differences in lip closing velocity for (b) and /p/. The English subjects also showed a difference in duration between /a/ and /i/ that was twice as large as the difference seen for the Chinese subjects. Of the three groups, only the English adults showed significantly greater displacement and peak movement velocity for the final stop consonant of /bap/ than /bab/. This suggested that their central phonetic representations specified a more forceful constriction of the lips for /p/ than /b/. The English adults seemed to compensate more effectively for a bite block in producing the final stops in /bip/ and /bib/. The results obtained for the English children were intermediate to those obtained for the English and Chinese adults, which is consistent with the hypothesized experience-based differences in level of skill. 相似文献

11.

Cue-specific effects of categorization training on the relative weighting of acoustic cues to consonant voicing in English

Francis AL Kaganovich N Driscoll-Huber C 《The Journal of the Acoustical Society of America》2008,124(2):1234-1251

In English, voiced and voiceless syllable-initial stop consonants differ in both fundamental frequency at the onset of voicing (onset F0) and voice onset time (VOT). Although both correlates, alone, can cue the voicing contrast, listeners weight VOT more heavily when both are available. Such differential weighting may arise from differences in the perceptual distance between voicing categories along the VOT versus onset F0 dimensions, or it may arise from a bias to pay more attention to VOT than to onset F0. The present experiment examines listeners' use of these two cues when classifying stimuli in which perceptual distance was artificially equated along the two dimensions. Listeners were also trained to categorize stimuli based on one cue at the expense of another. Equating perceptual distance eliminated the expected bias toward VOT before training, but successfully learning to base decisions more on VOT and less on onset F0 was easier than vice versa. Perceptual distance along both dimensions increased for both groups after training, but only VOT-trained listeners showed a decrease in Garner interference. Results lend qualified support to an attentional model of phonetic learning in which learning involves strategic redeployment of selective attention across integral acoustic cues. 相似文献

12.

Harmonics-to-noise ratios in normally speaking prepubescent girls and boys

Carole T. Ferrand 《Journal of voice》2000,14(1):17-21

This study examined harmonics-to-noise ratios (HNR) in 4 groups of normally speaking children. HNRs were calculated for the vowels /ai/ and /inverted v/, selected from conversational speech samples of 80 children aged 4, 6, 8, and 10 years (10 boys and 10 girls at each age level). HNR values for /inverted v/ were significantly higher than those for /ai/. Significant age differences emerged for /ai/ between ages 4 and 8, and ages 8 and 10. Girls obtained significantly higher HNRs than boys for the /ai/ vowel. Overall, the HNR values for these normally speaking children were lower than those reported for normally speaking adults. These findings suggest that acoustic values for children cannot be validly compared to those for adults, and that the child's gender and age should be taken into account when applying spectral analyses to research and/or clinical situations. 相似文献

13.

Training humans in non-native phoneme perception using a monkey psychoacoustic procedure

Sinnott JM Gonzales CL Masood AF Ishihara T 《The Journal of the Acoustical Society of America》2007,121(6):3846-3857

Humans were trained to categorize problem non-native phonemes using an animal psychoacoustic procedure that trains monkeys to greater than 90% correct in phoneme identification [Sinnott and Gilmore, Percept. Psychophys. 66, 1341-1350 (2004)]. This procedure uses a manual left versus right response on a lever, a continuously repeated stimulus on each trial, extensive feedback for errors in the form of a repeated correction procedure, and training until asymptotic levels of performance. Here, Japanese listeners categorized the English liquid contrast /r-l/, and English listeners categorized the Middle Eastern dental-retroflex contrast /d-D/. Consonant-vowel stimuli were constructed using four talkers and four vowels. Native listeners and phoneme contrasts familiar to all listeners were included as controls. Responses were analyzed using percent correct, response time, and vowel context effects as measures. All measures indicated nativelike Japanese perception of /r-l/ after 32 daily training sessions, but this was not the case for English perception of /d-D/. Results are related to the concept of "robust" (more easily recovered) versus "fragile" (more easily lost) phonetic contrasts [Burnham, Appl. Psycholing. 7, 207-240 (1986)]. 相似文献

14.

Stimulus factors influencing the identification of voiced stop consonants by normal-hearing and hearing-impaired adults

J M Lindholm M Dorman B E Taylor M T Hannley 《The Journal of the Acoustical Society of America》1988,83(4):1608-1614

The effects of mild-to-moderate hearing impairment on the perceptual importance of three acoustic correlates of stop consonant place of articulation were examined. Normal-hearing and hearing-impaired adults identified a stimulus set comprising all possible combinations of the levels of three factors: formant transition type (three levels), spectral tilt type (three levels), and abruptness of frequency change (two levels). The levels of these factors correspond to those appropriate for /b/, /d/, and /g/ in the /ae/ environment. Normal-hearing subjects responded primarily in accord with the place of articulation specified by the formant transitions. Hearing-impaired subjects showed less-than-normal reliance on formant transitions and greater-than-normal reliance on spectral tilt and abruptness of frequency change. These results suggest that hearing impairment affects the perceptual importance of cues to stop consonant identity, increasing the importance of information provided by both temporal characteristics and gross spectral shape and decreasing the importance of information provided by the formant transitions. 相似文献

15.

Acoustic modeling of American English /r/

Espy-Wilson CY Boyce SE Jackson M Narayanan S Alwan A 《The Journal of the Acoustical Society of America》2000,108(1):343-356

Recent advances in physiological data collection methods have made it possible to test the accuracy of predictions against speaker-specific vocal tracts and acoustic patterns. Vocal tract dimensions for /r/ derived via magnetic-resonance imaging (MRI) for two speakers of American English [Alwan, Narayanan, and Haker, J. Acoust. Soc. Am. 101, 1078-1089 (1997)] were used to construct models of the acoustics of /r/. Because previous models have not sufficiently accounted for the very low F3 characteristic of /r/, the aim was to match formant frequencies predicted by the models to the full range of formant frequency values produced by the speakers in recordings of real words containing /r/. In one set of experiments, area functions derived from MRI data were used to argue that the Perturbation Theory of tube acoustics cannot adequately account for /r/, primarily because predicted locations did not match speakers' actual constriction locations. Different models of the acoustics of /r/ were tested using the Maeda computer simulation program [Maeda, Speech Commun. 1, 199-299 (1982)]; the supralingual vocal-tract dimensions reported in Alwan et al. were found to be adequate at predicting only the highest of attested F3 values. By using (1) a recently developed adaptation of the Maeda model that incorporates the sublingual space as a side branch from the front cavity, and by including (2) the sublingual space as an increment to the dimensions of the front cavity, the mid-to-low values of the speakers' F3 range were matched. Finally, a simple tube model with dimensions derived from MRI data was developed to account for cavity affiliations. This confirmed F3 as a front cavity resonance, and variations in F1, F2, and F4 as arising from mid- and back-cavity geometries. Possible trading relations for F3 lowering based on different acoustic mechanisms for extending the front cavity are also proposed. 相似文献

16.

Nittrouer S 《The Journal of the Acoustical Society of America》2005,118(2):1072-1088

Because laboratory studies are conducted in optimal listening conditions, often with highly stylized stimuli that attenuate or eliminate some naturally occurring cues, results may have constrained applicability to the "real world." Such studies show that English-speaking adults weight vocalic duration greatly and formant offsets slightly in voicing decisions for word-final obstruents. Using more natural stimuli, Nittrouer [J. Acoust. Soc. Am. 115, 1777-1790 (2004)] found different results, raising questions about what would happen if experimental conditions were even more like the real world. In this study noise was used to simulate the real world. Edited natural words with voiced and voiceless final stops were presented in quiet and noise to adults and children (4 to 8 years) for labeling. Hypotheses tested were (1) Adults (and perhaps older children) would weight vocalic duration more in noise than in quiet; (2) Previously reported age-related differences in cue weighting might not be found in this real-world simulation; and (3) Children would experience greater masking than adults. Results showed: (1) no increase for any age listeners in the weighting of vocalic duration in noise; (2) age-related differences in the weighting of cues in both quiet and noise; and (3) masking effects for all listeners, but more so for children than adults. 相似文献

17.

Cues for intervocalic /t/ and /d/ in children and adults

G D Allen J A Norwood 《The Journal of the Acoustical Society of America》1988,84(3):868-875

One naturally spoken token of each of the words petal and pedal was computer edited to produce stimuli varying in voice onset time (VOT), silent closure duration, and initial /e/ vowel duration. These stimuli were then played, in the sentence frame "Push the button for the----," to four adult and four 6-year-old listeners who responded by pressing a button associated with a flower (petal) or a bicycle (pedal). Among the findings of interest were the following: (a) VOT was statistically the strongest cue for both listener groups, followed by closure duration and initial vowel duration; (b) VOT was relatively stronger for children than for adults, whereas closure and initial vowel durations were relatively stronger for adults than for children; (c) except for a probable ceiling/floor effect, there were no statistically significant interactions among the three acoustic cues, although there were interactions between those cues and both listener group (adults versus children) and the token for which the stimulus had been derived (petal versus pedal). 相似文献

18.

Speech discrimination in deaf subjects with cochlear implants 总被引：3，自引：0，他引：3

D K Eddington 《The Journal of the Acoustical Society of America》1980,68(3):885-891

Electrical stimulation of the auditory nerve is being investigated as a way to provide information useful for speech communication in the profoundly deaf. Single-channel systems that tend to stimulate all fibers alike have had little success in achieving this goal. Multichannel systems that allow excitation of more complex temporal-spatial patterns of activity are now being introduced. Psychoacoustical experiments providing evidence that electrodes of a multichannel implant are able to separately excite distinct groups of neural elements are reviewed. New results using multiple electrodes and speech-like stimuli are presented. The synthetic stimuli were vowels (/a/, /i/, /u/) and consonant-vowel (CV) syllables (/ba/, /da/, /ga/, /ta/). Vowels and CV syllables were presented in an AXB discrimination task with different signal processing schemes and electrode configurations. A four-channel, frequency-selective system produced faultless discrimination scores for all stimuli and spontaneous recognition of the vowels while the scores for the single-channel system were generally much lower. Although understanding free running speech by the profoundly deaf does not seem imminent, the results presented indicate that the multichannel system tested shows more promise of approaching this goal than the single-channel scheme. 相似文献

19.

Synchronized discharge rate representation of voice-onset time in the chinchilla auditory nerve

D G Sinex L P McDonald 《The Journal of the Acoustical Society of America》1989,85(5):1995-2004

Responses of chinchilla auditory nerve fibers to synthesized stop consonant syllables differing in voice-onset time (VOT) were obtained. The syllables, heard as /ga/-/ka/ or /da/-/ta/, were similar to those previously used by others in psychophysical experiments with human and chinchilla subjects. Synchronized discharge rates of neurons tuned to frequencies near the first formant increased at the onset of voicing for VOTs longer than 20 ms. Stimulus components near the formant or the neuron's characteristic frequency accounted for the increase. In these neurons, synchronized response changes were closely related to the same neuron's average discharge rates [D. G. Sinex and L. P. McDonald, J. Acoust. Soc. Am. 83, 1817-1827 (1988)]. Neurons tuned to frequency regions near the second and third formants usually responded to components near the second formant prior to the onset of voicing. These neurons' synchronized discharges could be captured by the first formant at the onset of voicing or with a latency of 50-60 ms, whichever was later. Since these neurons' average rate responses were unaffected by the onset of voicing, the latency of the synchronized response did provide as much additional neural cue to VOT. Overall, however, discharge synchrony did not provide as much information about VOT as was provided by the best average rate responses. The results are compared to other measurements of the peripheral encoding of speech sounds and to aspects of VOT perception. 相似文献

20.

Consonant and vowel confusions in speech-weighted noise

Phatak SA Allen JB 《The Journal of the Acoustical Society of America》2007,121(4):2312-2326

This paper presents the results of a closed-set recognition task for 64 consonant-vowel sounds (16 C X 4 V, spoken by 18 talkers) in speech-weighted noise (-22,-20,-16,-10,-2 [dB]) and in quiet. The confusion matrices were generated using responses of a homogeneous set of ten listeners and the confusions were analyzed using a graphical method. In speech-weighted noise the consonants separate into three sets: a low-scoring set C1 (/f/, /theta/, /v/, /d/, /b/, /m/), a high-scoring set C2 (/t/, /s/, /z/, /S/, /Z/) and set C3 (/n/, /p/, /g/, /k/, /d/) with intermediate scores. The perceptual consonant groups are C1: {/f/-/theta/, /b/-/v/-/d/, /theta/-/d/}, C2: {/s/-/z/, /S/-/Z/}, and C3: /m/-/n/, while the perceptual vowel groups are /a/-/ae/ and /epsilon/-/iota/. The exponential articulation index (AI) model for consonant score works for 12 of the 16 consonants, using a refined expression of the AI. Finally, a comparison with past work shows that white noise masks the consonants more uniformly than speech-weighted noise, and shows that the AI, because it can account for the differences in noise spectra, is a better measure than the wideband signal-to-noise ratio for modeling and comparing the scores with different noise maskers. 相似文献