Similar Articles
A total of 20 similar articles were found.
1.
This paper reports acoustic measurements and results from a series of perceptual experiments on the voiced-voiceless distinction for syllable-final stop consonants in absolute final position and in the context of a following syllable beginning with a different stop consonant. The focus is on temporal cues to the distinction, with vowel duration and silent closure duration as the primary and secondary dimensions, respectively. The main results are that adding a second syllable to a monosyllable increases the number of voiced stop consonant responses, as does shortening of the closure duration in disyllables. Both of these effects are consistent with temporal regularities in speech production: Vowel durations are shorter in the first syllable of disyllables than in monosyllables, and closure durations are shorter for voiced than for voiceless stops in disyllabic utterances of this type. While the perceptual effects thus may derive from two separate sources of tacit phonetic knowledge available to listeners, the data are also consistent with an interpretation in terms of a single effect: that of the temporal proximity of following context.

2.
Phoneme discrimination using connectionist networks
The application of connectionist networks to speech recognition is assessed using a set of eight representative phonetic discrimination problems chosen with respect to a theory of phonetics. A connectionist network model called the temporal flow model (TFM) is defined which represents temporal relationships using delay links and permits general patterns of connectivity. It is argued that the model has properties appropriate for time-varying signals such as speech. Networks are trained using gradient descent methods of iterative nonlinear optimization to reduce the mean-squared error between the actual and the desired response of the output units. Separate network solutions are demonstrated for all eight phonetic discrimination problems for one male speaker. The network solutions are analyzed carefully and are shown in every case to make use of known acoustic phonetic cues. The network solutions vary in the degree to which they make use of context-dependent cues to achieve phoneme recognition. The network solutions were tested on data not used for training and achieved an average accuracy of 99.5%. It is concluded that acoustic phonetic speech recognition can be accomplished using connectionist networks.
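The training procedure described here, gradient descent on the mean-squared error of networks with delay links, can be illustrated with a toy example. The sketch below is not the authors' TFM: the single logistic unit, the feature dimensions, and the synthetic data are all invented for illustration.

```python
# Illustrative sketch (not the authors' TFM): one logistic unit with
# delay links -- it sees the current feature frame plus D delayed
# frames -- trained by gradient descent to minimize squared error on a
# two-way phonetic discrimination. All data here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
T, F, D = 200, 12, 4             # frames, features per frame, delay taps
X = rng.normal(size=(T, F))      # stand-in for acoustic feature frames
y = (X[:, 0] > 0).astype(float)  # toy target: "voiced" vs "voiceless"

W = np.zeros((D + 1, F))         # one weight row per delay tap
b = 0.0
lr = 0.1

def window(X, t, D):
    """Stack frame t and its D predecessors (repeating frame 0 at the start)."""
    idx = [max(t - d, 0) for d in range(D + 1)]
    return X[idx]                # shape (D+1, F)

for epoch in range(200):
    for t in range(T):
        x = window(X, t, D)
        z = np.sum(W * x) + b
        out = 1.0 / (1.0 + np.exp(-z))   # unit activation
        err = out - y[t]
        g = err * out * (1.0 - out)      # grad of squared error w.r.t. z
        W -= lr * g * x                  # gradient descent step
        b -= lr * g

acc = np.mean([(1/(1+np.exp(-(np.sum(W*window(X,t,D))+b))) > .5) == y[t]
               for t in range(T)])
print(f"training accuracy: {acc:.3f}")
```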

3.
A probabilistic framework for a landmark-based approach to speech recognition is presented for obtaining multiple landmark sequences in continuous speech. The landmark detection module uses as input acoustic parameters (APs) that capture the acoustic correlates of some of the manner-based phonetic features. The landmarks include stop bursts, vowel onsets, syllabic peaks and dips, fricative onsets and offsets, and sonorant consonant onsets and offsets. Binary classifiers of the manner phonetic features (syllabic, sonorant, and continuant) are used for probabilistic detection of these landmarks. The probabilistic framework exploits two properties of the acoustic cues of phonetic features: (1) sufficiency of acoustic cues of a phonetic feature for a probabilistic decision on that feature, and (2) invariance of the acoustic cues of a phonetic feature with respect to other phonetic features. Probabilistic landmark sequences are constrained using manner class pronunciation models for isolated word recognition with known vocabulary. The performance of the system is compared with (1) the same probabilistic system but with mel-frequency cepstral coefficients (MFCCs), (2) a hidden Markov model (HMM) based system using APs, and (3) a HMM based system using MFCCs.
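As an illustration of the detection step, the sketch below posits a landmark wherever a frame-level manner-feature posterior crosses a threshold. It is a minimal stand-in, not the authors' system: the classifier output, the 0.5 threshold, and the landmark-probability heuristic are all assumptions.

```python
# Minimal sketch of landmark detection from binary manner-feature
# posteriors. Assumes some classifier has already produced a per-frame
# probability of the feature "sonorant"; a landmark is posited where
# that probability crosses 0.5, and the crossing's jump size serves as
# a crude (invented) stand-in for a landmark probability.
import numpy as np

def detect_landmarks(p_feature, thresh=0.5):
    """Return (frame, direction, probability) for each threshold crossing.

    direction +1 = onset (e.g., sonorant onset), -1 = offset.
    """
    landmarks = []
    for t in range(1, len(p_feature)):
        before, after = p_feature[t - 1], p_feature[t]
        if (before < thresh) != (after < thresh):
            direction = 1 if after >= thresh else -1
            landmarks.append((t, direction, abs(after - before)))
    return landmarks

# Toy posterior track: silence -> sonorant region -> silence
p = np.concatenate([np.full(20, 0.1), np.full(30, 0.9), np.full(20, 0.1)])
for frame, direction, prob in detect_landmarks(p):
    kind = "onset" if direction > 0 else "offset"
    print(f"frame {frame:3d}: sonorant {kind} (p={prob:.2f})")
```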

4.
Visual information from a speaker's face profoundly influences auditory perception of speech. However, relatively little is known about the extent to which visual influences may depend on experience, and the extent to which new sources of visual speech information can be incorporated into speech perception. In the current study, participants were trained on completely novel visual cues for phonetic categories. Participants learned to accurately identify phonetic categories based on the novel visual cues. These newly learned visual cues influenced identification responses to auditory speech stimuli, but not to the same extent as visual cues from a speaker's face. The novel methods and results of the current study raise theoretical questions about the nature of information integration in speech perception, and open up possibilities for further research on learning in multimodal perception, which may have applications in improving speech comprehension among the hearing-impaired.

5.
In tone languages there are potential conflicts in the perception of lexical tone and intonation, as both depend mainly on differences in fundamental frequency (F0) patterns. The present study investigated the acoustic cues associated with the perception of sentences as questions or statements in Cantonese, as a function of the lexical tone in sentence-final position. Cantonese listeners performed intonation identification tasks involving complete sentences, isolated final syllables, and sentences without the final syllable (carriers). Sensitivity (d') was similar for complete sentences and final syllables but was significantly lower for carriers. Sensitivity was also affected by tone identity. These findings show that the perception of questions and statements relies primarily on the F0 characteristics of the final syllables (local F0 cues). A measure of response bias (c) provided evidence for a general bias toward the perception of statements. Logistic regression analyses showed that utterances were accurately classified as questions or statements by using average F0 and F0 interval. Average F0 of carriers (a global F0 cue) was also found to be a reliable secondary cue. These findings suggest that the use of F0 cues for the perception of question intonation in tone languages is likely to be language-specific.
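The sensitivity and bias measures reported here are the standard signal-detection quantities, which can be computed from hit and false-alarm counts as sketched below; the trial counts in the example are invented.

```python
# Sketch of the signal-detection measures used in the study: sensitivity
# d' and response bias c, computed from hit and false-alarm rates in a
# question/statement identification task ("question" = signal).
from scipy.stats import norm

def dprime_and_c(hits, misses, fas, crs):
    """d' = z(H) - z(F);  c = -(z(H) + z(F)) / 2."""
    # Log-linear correction so z() stays finite at 0% or 100% rates.
    h = (hits + 0.5) / (hits + misses + 1.0)
    f = (fas + 0.5) / (fas + crs + 1.0)
    zh, zf = norm.ppf(h), norm.ppf(f)
    return zh - zf, -(zh + zf) / 2.0

# Invented example: 38 hits / 12 misses, 6 false alarms / 44 correct
# rejections. c > 0 indicates a bias toward the "statement" response.
d, c = dprime_and_c(38, 12, 6, 44)
print(f"d' = {d:.2f}, c = {c:.2f}")
```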

6.
On the role of spectral transition for speech perception
This paper examines the relationship between dynamic spectral features and the identification of Japanese syllables modified by initial and/or final truncation. The experiments confirm several main points. "Perceptual critical points," where the percent correct identification of the truncated syllable as a function of the truncation position changes abruptly, are related to maximum spectral transition positions. A speech wave of approximately 10 ms in duration that includes the maximum spectral transition position bears the most important information for consonant and syllable perception. Consonant and vowel identification scores simultaneously change as a function of the truncation position in the short period, including the 10-ms period for final truncation. This suggests that crucial information for both vowel and consonant identification is contained across the same initial part of each syllable. The spectral transition is more crucial than unvoiced and buzz bar periods for consonant (syllable) perception, although the latter features are of some perceptual importance. Also, vowel nuclei are not necessary for either vowel or syllable perception.

7.
Traditional accounts of speech perception generally hold that listeners use isolable acoustic "cues" to label phonemes. For syllable-final stops, duration of the preceding vocalic portion and formant transitions at syllable's end have been considered the primary cues to voicing decisions. The current experiment tried to extend traditional accounts by asking two questions concerning voicing decisions by adults and children: (1) What weight is given to vocalic duration versus spectral structure, both at syllable's end and across the syllable? (2) Does the naturalness of stimuli affect labeling? Adults and children (4, 6, and 8 years old) labeled synthetic stimuli that varied in vocalic duration and spectral structure, either at syllable's end or earlier in the syllable. Results showed that all listeners weighted dynamic spectral structure, both at syllable's end and earlier in the syllable, more than vocalic duration, and listeners performed with these synthetic stimuli as listeners had performed previously with natural stimuli. The conclusion for accounts of human speech perception is that rather than simply gathering acoustic cues and summing them to derive strings of phonemic segments, listeners are able to attend to global spectral structure, and use it to help recover explicitly phonetic structure.

8.
In a series of experiments, a variant of duplex perception was investigated. In its original form, duplex perception is created by presenting an isolated transition to one ear and the remainder of the syllable, the standard base, to the other ear. Listeners hear a chirp at the ear receiving the isolated transition, and a full syllable at the ear receiving the base. The new version of duplex perception was created by presenting a third-formant transition in isolation to one ear and the same transition electronically mixed with the base to the other ear; the modified base now has all the information necessary for syllabic perception. With the new procedure, listeners reported hearing a chirp centered in the middle of their head and, at the ear presented with the modified base, a syllable that was clearer than that produced by the isolated transition and standard base. They could also reliably choose the patterns that contained the additional transition in the base when attending to either the phonetic or nonphonetic sides of the duplex percept. In addition, when the fundamental frequency, onset time, and intensity of the isolated third-formant transition were varied relative to the base, the phonetic and nonphonetic (lateralization) percepts were differentially affected, although not always reliably. In general, nonphonetic fusion was more affected by large differences in these variables than was phonetic fusion. However, when two isolated third-formant transitions were presented dichotically, fusion and the resulting central location of the chirp failed markedly with relatively small differences in each variable. The results were discussed in terms of the role of fusion in the new version of duplex perception and the nature of the information that undergoes both phonetic and nonphonetic fusion.

9.
Classic non-native speech perception findings suggested that adults have difficulty discriminating segmental distinctions that are not employed contrastively in their own language. However, recent reports indicate a gradient of performance across non-native contrasts, ranging from near-chance to near-ceiling. Current theoretical models argue that such variations reflect systematic effects of experience with phonetic properties of native speech. The present research addressed predictions from Best's perceptual assimilation model (PAM), which incorporates both contrastive phonological and noncontrastive phonetic influences from the native language in its predictions about discrimination levels for diverse types of non-native contrasts. We evaluated the PAM hypotheses that discrimination of a non-native contrast should be near-ceiling if perceived as phonologically equivalent to a native contrast, lower though still quite good if perceived as a phonetic distinction between good versus poor exemplars of a single native consonant, and much lower if both non-native segments are phonetically equivalent in goodness of fit to a single native consonant. Two experiments assessed native English speakers' perception of Zulu and Tigrinya contrasts expected to fit those criteria. Findings supported the PAM predictions, and provided evidence for some perceptual differentiation of phonological, phonetic, and nonlinguistic information in perception of non-native speech. Theoretical implications for non-native speech perception are discussed, and suggestions are made for further research.

10.
The purpose of this study was to determine the role of static, dynamic, and integrated cues for perception in three adult age groups, and to determine whether age has an effect on both consonant and vowel perception, as predicted by the "age-related deficit hypothesis." Eight adult subjects in each of the age ranges of young (ages 20-26), middle aged (ages 52-59), and old (ages 70-76) listened to synthesized syllables composed of combinations of [b d g] and [i u a]. The synthesis parameters included manipulations of the following stimulus variables: formant transition (moving or straight), noise burst (present or absent), and voicing duration (10, 30, or 46 ms). Vowel perception was high across all conditions and there were no significant differences among age groups. Consonant identification showed a definite effect of age. Young and middle-aged adults were significantly better than older adults at identifying consonants from secondary cues only. Older adults relied on the integration of static and dynamic cues to a greater extent than younger and middle-aged listeners for identification of place of articulation of stop consonants. Duration facilitated correct stop-consonant identification in the young and middle-aged groups for the no-burst conditions, but not in the old group. These findings for the duration of stop-consonant transitions indicate reductions in processing speed with age. In general, the results did not support the age-related deficit hypothesis for adult identification of vowels and consonants from dynamic spectral cues.

11.
At a cocktail party, listeners must attend selectively to a target speaker and segregate their speech from distracting speech sounds uttered by other speakers. To solve this task, listeners can draw on a variety of vocal, spatial, and temporal cues. Recently, Vestergaard et al. [J. Acoust. Soc. Am. 125, 1114-1124 (2009)] developed a concurrent-syllable task to control temporal glimpsing within segments of concurrent speech, and this allowed them to measure the interaction of glottal pulse rate and vocal tract length and reveal how the auditory system integrates information from independent acoustic modalities to enhance recognition. The current paper shows how the interaction of these acoustic cues evolves as the temporal overlap of syllables is varied. Temporal glimpses as short as 25 ms are observed to improve syllable recognition substantially when the target and distracter have similar vocal characteristics, but not when they are dissimilar. The effect of temporal glimpsing on recognition performance is strongly affected by the form of the syllable (consonant-vowel versus vowel-consonant), but it is independent of other phonetic features such as place and manner of articulation.

12.
Cochlear implants provide users with limited spectral and temporal information. In this study, the amount of spectral and temporal information was systematically varied through simulations of cochlear implant processors using a noise-excited vocoder. Spectral information was controlled by varying the number of channels between 1 and 16, and temporal information was controlled by varying the lowpass cutoff frequencies of the envelope extractors from 1 to 512 Hz. Consonants and vowels processed using those conditions were presented to seven normal-hearing native-English-speaking listeners for identification. The results demonstrated that both spectral and temporal cues were important for consonant and vowel recognition, with the spectral cues having a greater effect than the temporal cues for the ranges of numbers of channels and lowpass cutoff frequencies tested. The lowpass cutoff for asymptotic performance in consonant and vowel recognition was 16 and 4 Hz, respectively. The number of channels at which performance plateaued for consonants and vowels was 8 and 12, respectively. Within the above-mentioned ranges of lowpass cutoff frequency and number of channels, the temporal and spectral cues showed a tradeoff for phoneme recognition. Information transfer analyses showed different relative contributions of spectral and temporal cues in the perception of various phonetic/acoustic features.
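A noise-excited vocoder of this general kind can be sketched as follows. The parameter choices (filter orders, log-spaced channel edges, 100-6000 Hz analysis range, the toy input signal) are generic assumptions for illustration, not the study's exact processor.

```python
# Sketch of a noise-excited vocoder used to simulate cochlear implant
# processing: split the signal into N analysis bands, extract each
# band's envelope by rectification + lowpass filtering at a chosen
# cutoff, then modulate bandlimited noise with the envelopes and sum.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def vocode(x, fs, n_channels=8, env_cutoff=16.0, f_lo=100.0, f_hi=6000.0):
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)  # log-spaced bands
    env_sos = butter(4, env_cutoff, btype="low", fs=fs, output="sos")
    out = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfiltfilt(band_sos, x)
        env = sosfiltfilt(env_sos, np.abs(band))      # rectify + lowpass
        env = np.clip(env, 0.0, None)
        noise = sosfiltfilt(band_sos, np.random.randn(len(x)))
        out += env * noise                            # noise carrier
    return out

fs = 16000
t = np.arange(fs) / fs
toy = np.sin(2 * np.pi * 150 * t) * (1 + np.sin(2 * np.pi * 3 * t))
sim = vocode(toy, fs, n_channels=8, env_cutoff=16.0)
print(sim.shape)
```

Lowering `env_cutoff` removes temporal detail and lowering `n_channels` removes spectral detail, which is how the two dimensions were varied independently in the study.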

13.
The contribution of the nasal murmur and vocalic formant transition to the perception of the [m]-[n] distinction by adult listeners was investigated for speakers of different ages in both consonant-vowel (CV) and vowel-consonant (VC) syllables. Three children in each of the speaker groups 3, 5, and 7 years old, and three adult females and three adult males produced CV and VC syllables consisting of either [m] or [n] and followed or preceded by [i ae u a], respectively. Two productions of each syllable were edited into seven murmur and transition segments. Across speaker groups, a segment including the last 25 ms of the murmur and the first 25 ms of the vowel yielded higher perceptual identification of place of articulation than any other segment edited from the CV syllable. In contrast, the corresponding vowel+murmur segment in the VC syllable position improved nasal identification relative to other segment types for only the adult talkers. Overall, the CV syllable was perceptually more distinctive than the VC syllable, but this distinctiveness interacted with speaker group and stimulus duration. As predicted by previous studies and the current results of perceptual testing, acoustic analyses of adult syllable productions showed systematic differences between labial and alveolar places of articulation, but these differences were only marginally observed in the youngest children's speech. Also as predicted by the current perceptual results, these acoustic properties differentiating place of articulation of nasal consonants were reliably different for CV syllables compared to VC syllables. A series of comparisons of perceptual data across speaker groups, segment types, and syllable shape provided strong support, in adult speakers, for the "discontinuity hypothesis" [K. N. Stevens, in Phonetic Linguistics: Essays in Honor of Peter Ladefoged, edited by V. A. Fromkin (Academic, London, 1985), pp. 243-255], according to which spectral discontinuities at acoustic boundaries provide critical cues to the perception of place of articulation. In child speakers, the perceptual support for the "discontinuity hypothesis" was weaker, and the results were indicative of developmental changes in speech production.

14.
This study examined the effect of linguistic experience on perception of the English /s/-/z/ contrast in word-final position. The durations of the periodic ("vowel") and aperiodic ("fricative") portions of stimuli, ranging from peas to peace, were varied in a 5 × 5 factorial design. Forced-choice identification judgments were elicited from two groups of native speakers of American English differing in dialect, and from two groups each of native speakers of French, Swedish, and Finnish differing in English-language experience. The results suggested that the non-native subjects used cues established for the perception of phonetic contrasts in their native language to identify fricatives as /s/ or /z/. Lengthening vowel duration increased /z/ judgments in all eight subject groups, although the effect was smaller for native speakers of French than for native speakers of the other languages. Shortening fricative duration, on the other hand, significantly decreased /z/ judgments only by the English and French subjects. It did not influence voicing judgments by the Swedish and Finnish subjects, even those who had lived for a year or more in an English-speaking environment. These findings raise the question of whether adults who learn a foreign language can acquire the ability to integrate multiple acoustic cues to a phonetic contrast which does not exist in their native language.

15.
This study presents a psychoacoustic analysis of the integration of spectral and temporal cues in the discrimination of simple nonspeech sounds. The experimental task was a same-different discrimination between a standard and a comparison pair of tones. Each pair consists of two 80-ms, 1500-Hz tone bursts separated by a 60-ms interval. The just-discriminable (d' = 2.0) increment in the duration, Δt, of one of the bursts was measured as a function of increments in the frequency, Δf, of the other burst. A trade-off between the values of Δt and Δf required to perform at d' = 2.0 was observed, which suggests that listeners integrate the evidence from the two dimensions. Integration occurred with both sub- and supra-threshold values of Δt or Δf, regardless of the order in which the cues were presented. The performance associated with the integration of cues was found to be determined by the discriminability of Δt plus that of Δf, and thus it is within the psychophysical limits of auditory processing. To a first approximation, the results agreed with the prediction of orthogonal vector summation of evidence stemming from signal detection theory. It is proposed that the ability to integrate spectral and temporal cues is in the repertoire of auditory processing capabilities. This integration does not appear to depend on perceiving sounds as members of phonetic classes.
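The orthogonal vector summation prediction mentioned above holds that when evidence from two independent dimensions is integrated, the combined sensitivity is the Euclidean sum of the per-dimension sensitivities, d'combined = sqrt(d't^2 + d'f^2). A minimal illustration, with invented values:

```python
# Orthogonal vector summation of evidence from signal detection theory:
# the combined d' predicted from independent duration and frequency cues.
import math

def combined_dprime(d_t, d_f):
    """Predicted d' for integrating duration and frequency cues."""
    return math.hypot(d_t, d_f)   # sqrt(d_t**2 + d_f**2)

# E.g., two individually sub-criterion cues (d' = 1.4 each) jointly
# approach the d' = 2.0 criterion used in the study:
print(f"{combined_dprime(1.4, 1.4):.2f}")   # ~1.98
```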

16.
The present study evaluated auditory-visual speech perception in cochlear-implant users as well as normal-hearing and simulated-implant controls to delineate the relative contributions of sensory experience and cues. Auditory-only, visual-only, or auditory-visual speech perception was examined in the context of categorical perception, in which an animated face mouthing ba, da, or ga was paired with synthesized phonemes from an 11-token auditory continuum. A three-alternative, forced-choice method was used to yield percent identification scores. Normal-hearing listeners showed sharp phoneme boundaries and strong reliance on the auditory cue, whereas actual and simulated implant listeners showed much weaker categorical perception but stronger dependence on the visual cue. The implant users were able to integrate both congruent and incongruent acoustic and optical cues to derive relatively weak but significant auditory-visual integration. This auditory-visual integration was correlated with the duration of implant experience but not the duration of deafness. Compared with the actual implant performance, acoustic simulations of the cochlear implant could predict the auditory-only performance but not the auditory-visual integration. These results suggest that both altered sensory experience and impoverished acoustic cues contribute to auditory-visual speech perception in cochlear-implant users.

17.
This paper formalizes and tests two key assumptions of the concept of suprasegmental timing: segmental independence and suprasegmental mediation. Segmental independence holds that the duration of a suprasegmental unit such as a syllable or foot is only minimally dependent on its segments. Suprasegmental mediation states that the duration of a segment is determined by the duration of its suprasegmental unit and its identity, but not directly by the specific prosodic context responsible for suprasegmental unit duration. Both assumptions are made by various versions of the isochrony hypothesis [I. Lehiste, J. Phonetics 5, 253-263 (1977)], and by the syllable timing hypothesis [W. Campbell, Speech Commun. 9, 57-62 (1990)]. The validity of these assumptions was studied using the syllable as suprasegmental unit in American English and Mandarin Chinese. To avoid unnatural timing patterns that might be induced when reading carrier phrase material, meaningful, nonrepetitive sentences were used with a wide range of lengths. Segmental independence was tested by measuring how the average duration of a syllable in a fixed prosodic context depends on its segmental composition. A strong association was found; in many cases the increase in average syllabic duration when one segment was substituted for another (e.g., bin versus pin) was the same as the difference in average duration between the two segments (i.e., [b] versus [p]). Thus, the [i] and [n] were not compressed to make room for the longer [p], which is inconsistent with segmental independence. Syllabic mediation was tested by measuring which locations in a syllable are most strongly affected by various contextual factors, including phrasal position, within-word position, tone, and lexical stress. Systematic differences were found between these factors in terms of the intrasyllabic locus of maximal effect. These and earlier results obtained by van Son and van Santen [R. J. J. H. van Son and J. P. H. van Santen, "Modeling the interaction between factors affecting consonant duration," Proceedings Eurospeech-97, 1997, pp. 319-322] showing a three-way interaction between consonantal identity (coronals vs labials), within-word position of the syllable, and stress of surrounding vowels, imply that segmental duration cannot be predicted by compressing or elongating segments to fit into a predetermined syllabic time interval. In conclusion, while there is little doubt that suprasegmental units play important predictive and explanatory roles as phonological units, the concept of suprasegmental timing is less promising.

18.
The multidimensional perceptual structure of Chinese syllables is studied through psychophysical experiments. The results indicate that F0 and duration correspond to the two main dimensions of the perceptual structure of the Chinese syllable, and that prosodic characteristics, such as a syllable's position in the prosodic hierarchy and its stress, induce different distributions of syllables in the perceptual space.

19.
This study examined whether cochlear implant users must perceive differences along phonetic continua in the same way as do normal-hearing listeners (i.e., sharp identification functions, poor within-category sensitivity, high between-category sensitivity) in order to recognize speech accurately. Adult postlingually deafened cochlear implant users, who were heterogeneous in terms of their implants and processing strategies, were tested on two phonetic perception tasks using a synthetic /da/-/ta/ continuum (phoneme identification and discrimination) and two speech recognition tasks using natural recordings from ten talkers (open-set word recognition and forced-choice /d/-/t/ recognition). Cochlear implant users tended to have identification boundaries and sensitivity peaks at voice onset times (VOT) that were longer than those found for normal-hearing individuals. Sensitivity peak locations were significantly correlated with individual differences in cochlear implant performance; individuals who had a /d/-/t/ sensitivity peak near normal-hearing peak locations were most accurate at recognizing natural recordings of words and syllables. However, speech recognition was not strongly related to identification boundary locations or to overall levels of discrimination performance. The results suggest that perceptual sensitivity affects speech recognition accuracy, but that many cochlear implant users are able to accurately recognize speech without having typical normal-hearing patterns of phonetic perception.

20.
This study assessed the extent to which second-language learners are sensitive to phonetic information contained in visual cues when identifying a non-native phonemic contrast. In experiment 1, Spanish and Japanese learners of English were tested on their perception of a labial/labiodental consonant contrast in audio (A), visual (V), and audio-visual (AV) modalities. Spanish students showed better performance overall, and much greater sensitivity to visual cues, than Japanese students. Both learner groups achieved higher scores in the AV than in the A test condition, thus showing evidence of audio-visual benefit. Experiment 2 examined the perception of the less visually salient /l/-/r/ contrast in Japanese and Korean learners of English. Korean learners obtained much higher scores in auditory and audio-visual conditions than in the visual condition, while Japanese learners generally performed poorly in both modalities. Neither group showed evidence of audio-visual benefit. These results show the impact of the language background of the learner and the visual salience of the contrast on the use of visual cues for a non-native contrast. Significant correlations between scores in the auditory and visual conditions suggest that increasing auditory proficiency in identifying a non-native contrast is linked with increasing proficiency in using visual cues to the contrast.
