Similar Documents
20 similar documents found.
1.
Integral processing of phonemes: evidence for a phonetic mode of perception
To investigate the extent and locus of integral processing in speech perception, a speeded classification task was used with a set of noise-tone analogs of the fricative-vowel syllables [fæ], [ʃæ], [fu], and [ʃu]. Unlike the stimuli used in previous studies of selective perception of syllables, these stimuli did not contain consonant-vowel transitions. Subjects were asked to classify on the basis of one of the two syllable components. Some subjects were told that the stimuli were computer-generated noise-tone sequences. These subjects processed the noise and tone separably: irrelevant variation of the noise did not affect reaction times (RTs) for classification of the tone, and vice versa. Other subjects were instructed to treat the stimuli as speech. For these subjects, irrelevant variation of the fricative increased RTs for classification of the vowel, and vice versa. A second experiment employed naturally spoken fricative-vowel syllables in the same task. Classification RTs showed a pattern of integrality in that irrelevant variation of either component increased RTs to the other. These results indicate that knowledge of coarticulation (or of its acoustic consequences) is a basic element of speech perception. Furthermore, the use of this knowledge in phonetic coding is mandatory, even when the stimuli contain no coarticulatory information.

2.
Stress is an important parameter for prosody processing in speech synthesis. In this paper, we compare the acoustic features of neutral-tone and strongly stressed syllables with those of moderately stressed syllables, including pitch, syllable duration, intensity, and the length of the pause following the syllable. The relation between duration and pitch, and between the Third Tone (T3) and pitch, is also studied. Three ANN-based stress prediction models, an acoustic model, a linguistic model, and a mixed model, are presented for predicting Chinese sentential stress. The results show that the mixed model outperforms the other two. To address the variability of manual stress labeling, an evaluation index, the support ratio, is proposed.
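The abstract gives no implementation details, so the following is only a rough sketch of what an ANN-based "mixed" stress predictor could look like: acoustic and linguistic features concatenated and fed to a small feed-forward network. The feature sets, dimensions, and data below are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of a "mixed" stress-prediction model: acoustic and
# linguistic features concatenated and classified by a small ANN.
# All feature choices, dimensions, and data are illustrative.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Acoustic features per syllable: pitch (Hz), duration (ms),
# intensity (dB), length of the following pause (ms).
acoustic = rng.normal(size=(500, 4))
# Linguistic features per syllable, e.g. one-hot tone category
# (T1-T4 plus neutral) and position in the prosodic word.
linguistic = rng.normal(size=(500, 7))
# Stress labels: 0 = neutral/weak, 1 = moderate, 2 = strong.
labels = rng.integers(0, 3, size=500)

mixed = np.hstack([acoustic, linguistic])
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(mixed, labels)
print("training accuracy:", model.score(mixed, labels))
```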

3.
Frequency resolution was evaluated for two normal-hearing and seven hearing-impaired subjects with moderate, flat sensorineural hearing loss by measuring percent-correct detection of a 2000-Hz tone as the width of a notch in band-reject noise increased. The level of the tone was fixed for each subject at a criterion performance level in broadband noise. Discrimination of synthetic speech syllables that differed in spectral content in the 2000-Hz region was evaluated as a function of notch width in the same band-reject noise. Recognition of natural consonant-vowel syllables in quiet was also tested; results were analyzed for percent-correct performance and for relative information transmitted for the voicing and place features. In the hearing-impaired subjects, frequency resolution at 2000 Hz was significantly correlated with discrimination of synthetic speech information in the 2000-Hz region, but it was related to recognition of natural nonsense syllables only when (a) the speech stimuli contained the vowel /i/ rather than /a/, and (b) the score reflected information transmitted for place of articulation rather than percent correct.
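As a rough illustration of the notched-noise method (a sketch under assumed parameters, not the study's stimulus code), broadband noise can be given a variable-width spectral notch around the 2000-Hz probe with a band-stop filter:

```python
# Sketch: band-reject ("notched") noise around a 2000-Hz probe tone.
# Notch width, filter order, and levels are illustrative choices.
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 16000
t = np.arange(int(0.5 * fs)) / fs
noise = np.random.default_rng(1).standard_normal(t.size)

def notched_noise(x, fs, center=2000.0, half_width=200.0, order=4):
    """Remove a band of +/- half_width Hz around `center` from noise x."""
    sos = butter(order, [center - half_width, center + half_width],
                 btype="bandstop", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

masker = notched_noise(noise, fs, half_width=200.0)   # widen to ease detection
probe = 0.1 * np.sin(2 * np.pi * 2000.0 * t)          # fixed-level probe tone
stimulus = masker + probe
```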

4.
Identification of multiple-electrode stimulus patterns was evaluated in nine adult subjects to assess the feasibility of providing additional speech information through the tactual display of an electrotactile speech processor. Absolute identification scores decreased from 97.8% for single electrodes to 61.9% for electrode pairs and 31.8% for electrode triplets. Although input information increased with paired- and triple-electrode stimuli, information transmission did not increase significantly for either electrode pairs (2.99 bits) or triplets (2.84 bits) compared with single electrodes (2.84 bits). These results suggest that speech coding strategies using stimulus patterns of electrode pairs or triplets would provide little improvement over the present single-electrode scheme. However, a higher absolute identification score (73.6%) and an increase in information transmission to 3.88 bits were recorded for test stimuli containing all combinations of paired and single electrodes. Based on this finding, two stimulus sets using a restricted number of combinations of paired and single electrodes were evaluated. The two stimulus sets simulated the spatial patterns of paired and single electrodes arising from alternative speech coding schemes intended to increase consonant voicing information. Both stimulus sets yielded higher electrode identification scores (79.7% and 90.4%) than paired-electrode stimuli. Although electrode identification was not as accurate as for single electrodes, information transmission increased to 3.31 bits for the VF2 stimulus set. Analysis of the responses also showed that identification of the simulated voicing information conveyed by the two stimulus sets was 99.4% and 90.4% correct. (Abstract truncated at 250 words.)
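Information-transmission figures like those quoted (e.g., 2.99 bits) are conventionally computed as the mutual information of a stimulus-response confusion matrix. A generic sketch of that calculation, with an invented toy matrix:

```python
# Sketch: information transmitted (mutual information, in bits) from a
# stimulus-response confusion matrix, as in identification studies.
import numpy as np

def information_transmitted(confusion):
    """Mutual information I(S;R) in bits for a count matrix [stimulus, response]."""
    p = confusion / confusion.sum()          # joint probabilities
    ps = p.sum(axis=1, keepdims=True)        # stimulus marginals
    pr = p.sum(axis=0, keepdims=True)        # response marginals
    nz = p > 0                               # avoid log(0) terms
    return float(np.sum(p[nz] * np.log2(p[nz] / (ps @ pr)[nz])))

# Toy 4-stimulus example: mostly correct responses, some confusions.
conf = np.array([[45, 3, 1, 1],
                 [4, 40, 4, 2],
                 [2, 5, 41, 2],
                 [1, 2, 3, 44]])
print(f"{information_transmitted(conf):.2f} bits")
```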

5.
An important problem in speech perception is to determine how humans extract perceptually invariant place-of-articulation information from the speech wave across variable acoustic contexts. Although analyses have been developed that attempt to classify the voiced stops /b/ versus /d/ from stimulus onset information, most human perceptual research to date suggests that formant transition information is more important than onset information. The purpose of the present study was to determine whether animal subjects, specifically Japanese macaque monkeys, can categorize /b/ versus /d/ in synthesized consonant-vowel (CV) syllables using only formant transition information. Three monkeys were trained to differentiate CV syllables with a "go-left" versus a "go-right" label. All monkeys first learned to differentiate a /za/ versus /da/ manner contrast and easily transferred to three new vowel contexts /[symbol], ɛ, ɪ/. Next, two of the three monkeys learned to differentiate a /ba/ versus /da/ stop place contrast but were unable to transfer it to the different vowel contexts. These results suggest that animals may not use the same mechanisms as humans for classifying place contrasts, and they call for further investigation of animal perception of formant transition information versus stimulus onset information in place contrasts.

6.
The goals of the present study were to measure acoustic temporal modulation transfer functions (TMTFs) in cochlear implant listeners and to examine the relationship between modulation detection and speech recognition abilities. The effects of automatic gain control, presentation level, and number of channels on modulation detection thresholds (MDTs) were examined using the listeners' clinical sound processors. The general form of the TMTF was low-pass, consistent with previous studies. The operation of automatic gain control had no effect on MDTs when the stimuli were presented at 65 dBA. MDTs depended neither on presentation level (50 to 75 dBA) nor on the number of channels. Significant correlations were found between MDTs and speech recognition scores, and the rates of decay of the TMTFs were predictive of speech recognition abilities. Spectral-ripple discrimination was evaluated to examine the relationship between temporal and spectral envelope sensitivities. No correlation was found between the two measures, and 56% of the variance in speech recognition was predicted jointly by the two tasks. The present study suggests that temporal modulation detection measured through the sound processor can serve as a useful measure of the ability of clinical sound processing strategies to deliver clinically pertinent temporal information.
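A low-pass TMTF of the kind reported is often summarized by fitting a first-order low-pass function to the modulation detection thresholds; the sketch below uses invented data points (not the study's analysis), with the 3-dB cutoff standing in for the rate of decay:

```python
# Sketch: fit a first-order low-pass function to a TMTF.
# Data points and starting values are invented for illustration.
import numpy as np
from scipy.optimize import curve_fit

fm = np.array([8, 16, 32, 64, 128, 256])        # modulation frequency (Hz)
mdt = np.array([-22, -21, -19, -15, -10, -6])   # threshold, 20*log10(m) dB

def lowpass(f, peak_sens, cutoff):
    """Sensitivity loss of 3 dB at `cutoff`, 6 dB/octave beyond it."""
    return peak_sens + 10 * np.log10(1 + (f / cutoff) ** 2)

(peak, cutoff), _ = curve_fit(lowpass, fm, mdt, p0=(-22.0, 50.0))
print(f"peak sensitivity {peak:.1f} dB, 3-dB cutoff {cutoff:.0f} Hz")
```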

7.
The present study investigated anticipatory labial coarticulation in the speech of adults and children. CV syllables composed of [s], [t], and [d] before [i] and [u] were produced by four adult speakers and eight child speakers aged 3-7 years. Each stimulus was computer-edited to include only the aperiodic portion of the fricative-vowel and stop-vowel syllables, and LPC spectra were computed for each excised segment. Analyses examined the effect of the following vowel on the spectral peak associated with the second formant frequency and on the characteristic spectral prominence of each consonant. Perceptual data were obtained by presenting the aperiodic consonantal segments to subjects who were instructed to identify the following vowel as [i] or [u]. Both the acoustic and the perceptual data show strong coarticulatory effects for the adults and comparable, though less consistent, coarticulation in the children's speech. The results are discussed in terms of the articulatory and perceptual aspects of coarticulation in language learning.
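For illustration only, the LPC envelope of an excised aperiodic segment can be computed roughly as follows; the segment here is random noise standing in for real data, librosa is an assumed dependency, and the paper's LPC settings are not given:

```python
# Sketch: LPC spectral envelope of an excised aperiodic segment.
# LPC order, sample rate, and the input segment are illustrative.
import numpy as np
import librosa
from scipy.signal import freqz

fs = 10000
segment = np.random.default_rng(2).standard_normal(fs // 20)  # 50-ms stand-in

a = librosa.lpc(segment, order=12)        # all-pole model coefficients
w, h = freqz(1.0, a, worN=512, fs=fs)     # envelope H(f) = 1/A(f)
envelope_db = 20 * np.log10(np.abs(h) + 1e-12)
peak_hz = w[np.argmax(envelope_db)]       # spectral peak (e.g., near F2)
print(f"strongest spectral peak near {peak_hz:.0f} Hz")
```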

8.

Background

Recent studies have shown that the human right-hemispheric auditory cortex is particularly sensitive to reductions in sound quality, with increased distortion resulting in an amplified auditory N1m response as measured with magnetoencephalography (MEG). Here, we examined whether this sensitivity is specific to the processing of the acoustic properties of speech or whether it can also be observed in the processing of sounds with a simple spectral structure. We degraded speech stimuli (the vowel /a/), complex non-speech stimuli (a composite of five sinusoids), and sinusoidal tones by decreasing the amplitude resolution of the signal waveform; the amplitude resolution was impoverished by reducing the number of bits used to represent the signal samples (a quantization sketch follows this abstract). Auditory evoked magnetic fields (AEFs) were measured over the left and right hemispheres of sixteen healthy subjects.

Results

We found that AEF amplitudes increased significantly with stimulus distortion for all stimulus types, indicating that the right-hemispheric N1m sensitivity is not related exclusively to the degradation of the acoustic properties of speech. In addition, the P1m and P2m responses were amplified with increasing distortion, similarly in both hemispheres. AEF latencies were not systematically affected by the distortion.

Conclusions

We propose that the increased AEF activity reflects cortical processing of acoustic properties common to speech and non-speech stimuli. More specifically, the enhancement is most likely caused by the spectral changes brought about by the decrease in amplitude resolution, in particular the introduction of periodic, signal-dependent distortion into the original sound. Converging evidence suggests that the observed AEF amplification could reflect cortical sensitivity to periodic sounds.
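As referenced in the Background, a minimal sketch of reducing amplitude resolution by re-quantizing a waveform to fewer bits (illustrative parameters and stimulus, not the authors' code):

```python
# Sketch: degrade a signal by re-quantizing it to n_bits of amplitude
# resolution, then report the resulting distortion as an SNR.
import numpy as np

def reduce_bits(signal, n_bits):
    """Uniformly quantize a [-1, 1] signal to 2**n_bits levels."""
    step = 2.0 / (2 ** n_bits)
    return np.clip(np.round(signal / step) * step, -1.0, 1.0 - step)

fs = 16000
t = np.arange(int(0.2 * fs)) / fs
tone = 0.8 * np.sin(2 * np.pi * 220 * t)       # sinusoid stand-in stimulus

for bits in (16, 8, 4, 2):
    err = reduce_bits(tone, bits) - tone
    snr = 10 * np.log10(np.mean(tone ** 2) / np.mean(err ** 2))
    print(f"{bits:2d} bits -> SNR {snr:5.1f} dB")
```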

9.
The speech perception of two multiple-channel cochlear implant patients was compared with that of three normally hearing listeners using an acoustic model of the implant on 22 different speech tests. The tests included a minimal auditory capabilities battery, closed-set and open-set word and sentence tests, speech tracking, and a 12-consonant confusion study using nonsense syllables. The acoustic model represented electrical current pulses by bursts of noise, and the effects of different electrodes were represented by bandpass filters with different center frequencies. All subjects used a speech processor that coded the fundamental voicing frequency of speech as a pulse rate and the second formant frequency as the electrode position in the cochlea, or equivalently the center frequency of the bandpass filter. Very good agreement was found between the two groups of subjects, indicating that the acoustic model is a useful tool for the development and evaluation of alternative cochlear implant speech processing strategies.
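A minimal sketch of such an acoustic model, assuming illustrative values for the voicing rate (F0, coded as the noise-burst rate) and second formant (F2, coded as the bandpass center frequency); this is a simplification, not the authors' processor:

```python
# Sketch: noise-burst acoustic model of a cochlear implant processor.
# F0 sets the burst rate; F2 sets the bandpass center. Values invented.
import numpy as np
from scipy.signal import butter, sosfilt

fs = 16000
f0, f2 = 120.0, 1800.0                    # assumed voicing rate and F2
rng = np.random.default_rng(3)

out = np.zeros(int(0.3 * fs))
burst_len = int(0.004 * fs)               # 4-ms noise burst per "pulse"
period = int(fs / f0)
for start in range(0, out.size - burst_len, period):
    out[start:start + burst_len] = rng.standard_normal(burst_len)

# Electrode place cue: band centered on F2 (bandwidth is illustrative).
sos = butter(4, [0.8 * f2, 1.2 * f2], btype="bandpass", fs=fs, output="sos")
modeled = sosfilt(sos, out)
```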

10.
The extent to which context influences speech categorization can inform theories of pre-lexical speech perception. Across three conditions, listeners categorized speech targets preceded by speech context syllables. These syllables were presented either as the sole context or paired with nonspeech tone contexts previously shown to affect speech categorization. Listeners' context-dependent categorization across these conditions provides evidence that speech and nonspeech context stimuli jointly influence speech processing. Specifically, when the spectral characteristics of the speech and nonspeech context stimuli are mismatched, such that they are expected to produce opposing effects on speech categorization, the influence of the nonspeech context may undermine, or even reverse, the expected effect of the adjacent speech context. Likewise, when spectrally matched, the two classes of context may combine to increase the effect of context. Similar effects are observed even when natural speech syllables, matched in source to the speech categorization targets, serve as the speech contexts. The results are well predicted by the spectral characteristics of the context stimuli.

11.
Both dyslexic and auditory neuropathy (AN) subjects show inferior consonant-vowel (CV) perception in noise relative to controls. To better understand these impairments, natural acoustic speech stimuli masked by speech-shaped noise at various intensities were presented to dyslexic, AN, and control subjects, either in isolation or accompanied by visual articulatory cues. AN subjects were expected to benefit from the pairing of visual articulatory cues with auditory CV stimuli, provided that their speech perception impairment reflects a relatively peripheral auditory disorder. Assuming that dyslexia reflects a general impairment of speech processing rather than a disorder of audition, dyslexics were not expected to benefit similarly from the introduction of visual articulatory cues. The results revealed an increased effect of noise masking on the perception of isolated acoustic stimuli in both the dyslexic and the AN subjects. More importantly, the dyslexics made less effective use of visual articulatory cues in identifying masked speech stimuli and showed lower visual baseline performance than the AN subjects and controls. Finally, a significant positive correlation was found between reading ability and the ameliorating effect of visual articulatory cues on speech perception in noise. These results suggest that some reading impairments may stem from a central deficit of speech processing.

12.
We present results from a pilot study directed at developing an anchorable subjective speech quality test. The test uses multidimensional scaling techniques to obtain quantitative information about the perceptual attributes of speech. In the first phase of the study, subjects ranked the perceptual distances between samples of speech produced by two talkers, one male and one female, processed by a variety of codecs. The resulting distance matrices were processed to obtain, for each talker, a stimulus space for the various speech samples. This stimulus space has the properties that distances between stimuli correspond to perceptual distances and that its dimensions correspond to the attributes subjects used in judging perceptual distance. Mean opinion scores (MOS) obtained in an earlier study were found to be highly correlated with position in the stimulus space, and the three dimensions of the space were found to have identifiable physical and perceptual correlates. In the second phase of the study, we developed techniques for fitting speech generated by a new codec under investigation into a previously established stimulus space. The user is provided with a collection of speech samples and with the stimulus space for those samples as determined by a large-scale listening test; the user then carries out a much smaller listening test to determine the position of the new stimulus in the established space. This system is anchorable, so that different versions of a codec under development can be compared directly, and it provides more detailed information than the single number produced by MOS testing. We suggest that this information could be used to advantage in algorithm development and in the development of objective measures of speech quality.
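For illustration, metric multidimensional scaling can recover a stimulus space from a perceptual-distance matrix. The 5x5 matrix below is invented, and the paper's exact MDS variant is not specified:

```python
# Sketch: recover a 3-D perceptual stimulus space from a dissimilarity
# matrix with multidimensional scaling. The matrix is invented.
import numpy as np
from sklearn.manifold import MDS

# Symmetric perceptual-distance matrix for five codec conditions.
d = np.array([[0.0, 1.2, 2.5, 3.1, 2.0],
              [1.2, 0.0, 1.8, 2.6, 1.5],
              [2.5, 1.8, 0.0, 1.1, 2.2],
              [3.1, 2.6, 1.1, 0.0, 2.8],
              [2.0, 1.5, 2.2, 2.8, 0.0]])

mds = MDS(n_components=3, dissimilarity="precomputed", random_state=0)
space = mds.fit_transform(d)   # one row of coordinates per codec
print(space.round(2))
```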

13.
Subjects presented with coherent auditory and visual streams generally fuse them into a single percept. This results in enhanced intelligibility in noise, or in visual modification of the auditory percept in the McGurk effect. It is classically assumed that processing proceeds independently in the auditory and visual systems before interaction occurs at some representational stage, resulting in an integrated percept. However, some behavioral and neurophysiological data suggest a two-stage process: a first stage would bind together the appropriate pieces of audio and video information, before fusion per se in a second stage. If so, it should be possible to design experiments that lead to unbinding. It is shown here that if a given McGurk stimulus is preceded by an incoherent audiovisual context, the size of the McGurk effect is greatly reduced. Various kinds of incoherent context (acoustic syllables dubbed onto video sentences, or phonetic or temporal modifications of the acoustic content of a regular sequence of audiovisual syllables) can significantly reduce the McGurk effect, even when they are short (less than 4 s). The data are interpreted in the framework of a two-stage "binding and fusion" model of audiovisual speech perception.

14.
Auditory event-related potentials (ERPs) to speech sounds were recorded in a demanding selective attention task to measure how the mismatch negativity (MMN) is affected by attention, deviant feature, and task relevance, i.e., whether the feature was of the target or nontarget type. With vowel-consonant-vowel (VCV) disyllables presented randomly to the right and left ears, subjects attended to the VCVs in one ear. In different conditions, the subjects responded to either intensity or phoneme deviance in the consonant. The position of the deviance within the VCV also varied, falling in the first (VC), the second (CV), or both (VC and CV) formant-transition regions. MMN amplitudes were larger for deviants in the attended ear. Task relevance affected the MMNs to intensity and phoneme deviants differently: target-type intensity deviants yielded larger MMNs than nontarget types, whereas for phoneme deviants there was no main effect of task relevance but a critical interaction with deviance position. The "both" position gave the largest MMN amplitudes for target-type phoneme deviants, as it did for target- and nontarget-type intensity deviants. The MMN for nontarget-type phoneme deviants, however, showed an inverse pattern: the "both" position had the smallest amplitude despite its greater spectro-temporal deviance and its greater detectability when it was the target. These data indicate that the MMN reflects differences in phonetic structure as well as differences in the acoustic spectral-energy structure of the deviant stimuli. Furthermore, the task relevance effects demonstrate that top-down control not only affects the amplitude of the MMN but can reverse the pattern of MMN amplitudes among different stimuli.

15.
The auditory mismatch negativity (MMN) has been considered a preattentive index of auditory processing and/or a signature of prediction-error computation. This study sought to demonstrate the presence of an MMN to deviant trials embedded in complex auditory stimulus sequences, and its possible relationship to predictive coding. Additionally, the transfer of information between trials was expected to be reflected in the stimulus-preceding negativity (SPN), which would also fit the predictive coding framework. To this end, the EEG of 31 subjects was recorded during an auditory paradigm in which trials composed of stimulus sequences with increasing or decreasing frequencies were intermingled with deviant trials presenting an unexpected ending. Our results showed an MMN in response to deviant trials. An SPN appeared during the intertrial interval, and its amplitude was reduced in response to deviant trials. The presence of an MMN to complex sequences of sounds, and the generation of an SPN component with different amplitudes on deviant and standard trials, support the predictive coding framework.

16.
The effects of six-channel compression and expansion amplification on the intelligibility of nonsense syllables embedded in speech-spectrum noise were examined in four hearing-impaired subjects. In one condition (linear), the stimulus was given six-channel amplification with frequency shaping to suit the subject's hearing loss. The other condition (nonlinear) was identical except that low-level inputs to any given channel received expansion amplification and high-level inputs received compression. In each condition, each subject received the nonsense syllables at three input levels representing low, average, and high intensity speech. The results of this study, like those of most other studies of multichannel compression, are mainly negative. Nonlinear processing (mainly expansion) of low intensity speech significantly degraded speech intelligibility for two subjects and produced no improvement for the others. One subject showed a significant improvement in intelligibility for nonlinearly processed average intensity speech, and another showed significant improvement for the high intensity input (mainly compression). Clearly, nonlinear processing benefits some subjects under some listening conditions, but further research is needed to identify the relevant characteristics of such subjects. An acoustic analysis of selected items revealed that the failure of expansion to improve intelligibility was primarily due to the very low intensity consonants /e/ and /k/ in final position being presented at an even lower intensity in the expansion condition than in the linear condition. Expansion may be worth further investigation with different parameters. Several other problems caused by the multichannel processing were also revealed, including alteration of spectral shapes and band-interaction effects. Ways of overcoming these problems, and of capitalizing on the likely advantages of multichannel amplification, are currently being investigated.
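A minimal sketch of the level mapping in one such channel, assuming illustrative knee points and ratios (the paper's actual parameters are not given): expansion below the lower knee, linear in between, compression above the upper knee.

```python
# Sketch: input/output level function for one compression/expansion
# channel. Knee points and ratios are illustrative assumptions.
def channel_output_db(level_db, knee_low=45.0, knee_high=65.0,
                      expansion_ratio=0.5, compression_ratio=3.0):
    """Output level (dB) for a given input level in one channel."""
    if level_db < knee_low:        # expansion: low levels pushed lower
        return knee_low + (level_db - knee_low) / expansion_ratio
    if level_db > knee_high:       # compression: high levels reined in
        return knee_high + (level_db - knee_high) / compression_ratio
    return level_db                # linear mid region

for lvl in (30, 50, 60, 80):
    print(f"in {lvl} dB -> out {channel_output_db(lvl):.1f} dB")
```

With an expansion ratio below one, very low-level inputs come out even lower, which matches the acoustic analysis above: weak final consonants end up less audible in the expansion condition than in the linear one.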

17.
Which acoustic properties of the speech signal differ between rhythmically prominent syllables and non-prominent ones? A production experiment was conducted to identify these properties. Subjects read repetitive text aloud to a metronome, trying to match stressed syllables to its beat. The analysis searched for the function of the speech signal that best predicts the timing of the metronome ticks. The most important factor in this function was found to be the contrast in loudness between a syllable and its neighbors: the prominence of a syllable can be deduced from the specific loudness in an approximately 360-ms window centered on the syllable, relative to an approximately 800-ms-wide symmetric window.
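The windowed loudness contrast described above can be sketched as the difference between a short-window and a long-window moving average of a loudness contour. The window lengths follow the abstract (~360 ms and ~800 ms), but the contour is invented and a plain frame average stands in for specific loudness:

```python
# Sketch: syllable prominence as short-window loudness relative to a
# longer symmetric window. The loudness contour is invented.
import numpy as np

def prominence_contrast(loudness, frame_rate, short_s=0.36, long_s=0.80):
    """Short-window minus long-window mean loudness, per frame (dB)."""
    def smooth(x, win_s):
        w = max(1, int(win_s * frame_rate))
        return np.convolve(x, np.ones(w) / w, mode="same")
    return smooth(loudness, short_s) - smooth(loudness, long_s)

frame_rate = 100                                   # 10-ms frames
t = np.arange(300) / frame_rate                    # 3 s of contour
loudness = 60 + 6 * np.sin(2 * np.pi * 2.0 * t)    # ~2 syllables/s
contrast = prominence_contrast(loudness, frame_rate)
print(f"most prominent frame at t = {t[np.argmax(contrast)]:.2f} s")
```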

18.
Systems designed to recognize continuous speech must adapt to many types of acoustic variation, including variation in stress. A speaker-dependent recognition study was conducted on a set of stressed and destressed syllables. These syllables, some containing the short vowel /ɪ/ and others the long vowel /æ/, were excised from continuous speech and transformed into arrays of cepstral coefficients at two levels of precision. From these data, four types of template dictionaries varying in size and stress composition were formed by a time-warping procedure. Recognition performance data were gathered from listeners and from a computer recognition algorithm that also employed time warping. It was found that, for a significant portion of the database, stressed and destressed versions of the same syllable are sufficiently different from one another to justify separate dictionary templates. Second, destressed syllables exhibit roughly the same acoustic variance as their stressed counterparts. Third, long vowels tend to be involved in proportionally fewer cross-vowel errors but tend to diminish the warping algorithm's ability to discriminate consonantal information. Finally, the pattern of consonant errors that listeners make as a function of vowel length differs significantly from that produced by the computer.
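The time-warping procedure referred to is of the dynamic time warping (DTW) family; below is a generic textbook DTW distance between sequences of cepstral frames, with invented data, not the study's implementation:

```python
# Sketch: DTW alignment cost between a test token and a template, each
# a (frames x coefficients) array of cepstral coefficients.
import numpy as np

def dtw_distance(a, b):
    """Classic DTW cost between frame sequences a (n,d) and b (m,d)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])    # frame distance
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return cost[n, m]

rng = np.random.default_rng(4)
template = rng.normal(size=(40, 12))   # 40 frames x 12 cepstral coeffs
token = rng.normal(size=(55, 12))      # longer, e.g. a stressed version
print(f"DTW cost: {dtw_distance(token, template):.1f}")
```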

19.
Many older people have greater difficulty processing speech at suprathreshold levels than standard audiometric configurations can explain. Some of this difficulty may involve the processing of temporal information, which can signal linguistic distinctions. The voicing distinction that separates word pairs such as "rapid" and "rabid," for example, can be carried by temporal information: a longer first vowel and shorter closure characterize "rabid," while a shorter vowel and longer closure characterize "rapid." In this study, naturally produced tokens of "rabid" were low-pass filtered at 3500 Hz and edited to create vowel-duration and (silent) closure-duration continua. Pure-tone audiograms and speech recognition scores were used to select the ten best-hearing subjects from 50 volunteers over age 55. Randomizations of the stimuli were presented for labeling at intensity levels of 60 and 80 dB HL to this group and to ten normal-hearing volunteers under age 25. The results showed highly significant interactions of age with the temporal factors and with intensity: the older subjects required longer silence durations before reporting "rapid," especially at the shorter vowel durations and at the higher intensity level. These data suggest that age may affect the relative salience of different acoustic cues in speech perception, and that age-related hearing loss may involve deficits in the processing of temporal information that are not measured by standard audiometry.

20.
This study investigated, first, the effect of stimulus frequency on the mismatch negativity (MMN), N1, and P2 components of the cortical auditory event-related potential (ERP) evoked during passive listening to an oddball sequence. The hypothesis was that these components would show frequency-related changes reflected in their latency and magnitude. Second, the effect of stimulus complexity on the same ERPs was investigated using words and consonant-vowel tokens (CVs) discriminated on the basis of formant change. Twelve normally hearing listeners were tested with tone bursts in the speech frequency range (400/440, 1500/1650, and 3000/3300 Hz), words (/bæd/ vs /dæd/), and CVs (/bæ/ vs /dæ/). N1 amplitude and latency decreased as frequency increased. P2 amplitude, but not latency, decreased as frequency increased. Frequency-related changes in the MMN were similar to those for N1, resulting in a larger MMN area for low-frequency contrasts. N1 amplitude and latency for speech sounds were similar to those for low tones, but the MMN had a smaller area. Overall, an MMN was present in 46%-71% of tests for tone contrasts but in only 25%-32% for speech contrasts. The magnitudes of N1 and of the MMN for tones appear to be closely related, and both reflect the tonotopicity of the auditory cortex.

