Similar documents (20 matches)
1.
Five commonly used methods for determining the onset of voicing of syllable-initial stop consonants were compared. The speech and glottal activity of 16 native speakers of Cantonese with normal voice quality were investigated during the production of consonant-vowel (CV) syllables in Cantonese. Syllables consisted of the initial consonants /ph/, /th/, /kh/, /p/, /t/, and /k/ followed by the vowel /a/. All syllables had a high level tone, and all were real words in Cantonese. Measurements of voicing onset were made based on the onset of periodicity in the acoustic waveform, and on spectrographic measures of the onset of a voicing bar (f0), the onset of the first formant (F1), second formant (F2), and third formant (F3). These measurements were then compared against the onset of glottal opening as determined by electroglottography. Both the accuracy and the variability of each measure were calculated. Results suggest that the presence of aspiration in a syllable decreased the accuracy and increased the variability of spectrogram-based measurements, but did not strongly affect measurements made from the acoustic waveform. Overall, the acoustic waveform provided the most accurate estimate of voicing onset; measurements made from the amplitude waveform were also the least variable of the five measures. These results can be explained as a consequence of differences in spectral tilt of the voicing source in breathy versus modal phonation.

2.
Perception of sine-wave analogs of voice onset time stimuli
It has been argued that perception of stop consonant voicing contrasts is based on auditory mechanisms responsible for the resolution of temporal order. As one source of evidence, category boundaries for nonspeech stimuli whose components vary in relative onset time are reasonably close to the labeling boundary for a labial stop voiced-voiceless continuum. However, voicing boundaries change considerably when the onset frequency of the first formant (F1) is varied, either directly or as a side effect of a change in F1 transition duration. In the present study, stimuli consisted of a midfrequency sinusoid that was initiated 0-50 ms prior to the onset of a low-frequency sinusoid. Results showed that the labeling boundary for relative onset time increased for longer durations of a low-frequency tone sweep. This effect is analogous to the F1 transition duration effect with synthetic speech. Further, the discrimination of differences in relative onset time was poorer for stimuli with longer frequency sweeps. However, unlike synthetic speech, there were no systematic effects when the frequency of a transitionless lower sinusoid was varied. These findings are discussed in relation to the potential contributions of auditory mechanisms and speech-specific processes in the perception of the voicing contrast.

3.
The voice onset time (VOT) of a stop consonant is the interval between its burst onset and voicing onset. Among the many research topics involving VOT, one that has been studied for years is how VOTs can be measured efficiently. Manual annotation is feasible, but it becomes time-consuming when the corpus is large. This paper proposes an automatic VOT estimation method based on an onset detection algorithm. First, forced alignment is applied to identify the locations of stop consonants. Then a random-forest-based onset detector searches each stop segment for its burst and voicing onsets to estimate a VOT. The proposed onset detection can locate the onsets efficiently and accurately with only a small amount of training data. The evaluation data, extracted from the TIMIT corpus, comprised 2344 words with a word-initial stop. The experimental results showed that 83.4% of the estimates deviate by less than 10 ms from their manually labeled values, and 96.5% deviate by less than 20 ms. Factors that influence the proposed estimation method, such as the place of articulation, the voicing of the stop consonant, and the quality of the succeeding vowel, were also investigated.
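To make the estimated quantity concrete, here is a minimal hypothetical sketch (not the paper's implementation) of the VOT definition and the tolerance criterion used in the evaluation; the onset times and the manual label in the example are invented:

```python
def vot_ms(burst_onset_s, voicing_onset_s):
    """Voice onset time: interval from burst onset to voicing onset, in ms.
    A negative value indicates voicing that begins before the burst."""
    return (voicing_onset_s - burst_onset_s) * 1000.0

def within_tolerance(estimated_ms, manual_ms, tol_ms):
    """Evaluation criterion: estimate deviates from the manual label by < tol_ms."""
    return abs(estimated_ms - manual_ms) < tol_ms

# Invented example: burst detected at 0.512 s, voicing onset at 0.545 s.
est = vot_ms(0.512, 0.545)                    # about 33 ms
hit_10ms = within_tolerance(est, 30.0, 10.0)  # counted correct at the 10-ms level
```

In the paper's terms, an estimate would contribute to the 83.4% figure if it passes the 10-ms check against its manual label, and to the 96.5% figure under the looser 20-ms check.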

4.
Responses of chinchilla auditory nerve fibers to synthesized stop consonant syllables differing in voice-onset time (VOT) were obtained. The syllables, heard as /ga/-/ka/ or /da/-/ta/, were similar to those previously used by others in psychophysical experiments with human and chinchilla subjects. Synchronized discharge rates of neurons tuned to frequencies near the first formant increased at the onset of voicing for VOTs longer than 20 ms. Stimulus components near the formant or the neuron's characteristic frequency accounted for the increase. In these neurons, synchronized response changes were closely related to the same neuron's average discharge rates [D. G. Sinex and L. P. McDonald, J. Acoust. Soc. Am. 83, 1817-1827 (1988)]. Neurons tuned to frequency regions near the second and third formants usually responded to components near the second formant prior to the onset of voicing. These neurons' synchronized discharges could be captured by the first formant at the onset of voicing or with a latency of 50-60 ms, whichever was later. Since these neurons' average rate responses were unaffected by the onset of voicing, the latency of the synchronized response did provide an additional neural cue to VOT. Overall, however, discharge synchrony did not provide as much information about VOT as was provided by the best average rate responses. The results are compared to other measurements of the peripheral encoding of speech sounds and to aspects of VOT perception.

5.
Responses of "high-spontaneous" single auditory-nerve fibers in anesthetized cat to nine different spoken stop and nasal consonant-vowel syllables presented in four different levels of speech-shaped noise are reported. The temporal information contained in the responses was analyzed using "composite" spectrograms and pseudo-3D spatial-frequency plots. Spectral characteristics of both consonant and vowel segments of the CV syllables were strongly encoded at S/N ratios of 30 and 20 dB. At S/N = 10 dB, formant information during the vowel segments was all that was reliably detectable in most cases. Even at S/N = 0 dB, most vowel formants were detectable, but only with relatively long analysis windows (40 ms). The increases (and decreases) in discharge rate during various phases of the responses were also determined. The rate responses to the "release" and to the voicing of the stop-consonant syllables were quite robust, being detectable at least half of the time, even at the highest noise level. Comparisons with psychoacoustic studies using similar stimuli are made.

6.
Previous studies [Lisker, J. Acoust. Soc. Am. 57, 1547-1551 (1975); Summerfield and Haggard, J. Acoust. Soc. Am. 62, 435-448 (1977)] have shown that voice onset time (VOT) and the onset frequency of the first formant are important perceptual cues of voicing in syllable-initial plosives. Most prior work, however, has focused on speech perception in quiet environments. The present study seeks to determine which cues are important for the perception of voicing in syllable-initial plosives in the presence of noise. Perceptual experiments were conducted using consonant-vowel syllables naturally spoken by four talkers and presented in various levels of additive white Gaussian noise. Plosives sharing the same place of articulation and vowel context (e.g., /pa,ba/) were presented to subjects in two-alternative forced-choice identification tasks, and a threshold signal-to-noise ratio (SNR) value (corresponding to the 79% correct classification score) was estimated for each voiced/voiceless pair. The threshold SNR values were then correlated with several acoustic measurements of the speech tokens. Results indicate that the onset frequency of the first formant is critical for perceiving voicing in syllable-initial plosives in additive white Gaussian noise, while the VOT duration is not.

7.
This study investigated whether F2 and F3 transition onsets could encode the vowel place feature as well as F2 and F3 "steady-state" measures [Syrdal and Gopal, J. Acoust. Soc. Am. 79, 1086-1100 (1986)]. Multiple comparisons were made using (a) scatterplots in multidimensional space, (b) critical band differences, and (c) linear discriminant function analyses. Four adult male speakers produced /b/(v)/t/, /d/(v)/t/, and /g/(v)/t/ tokens with medial vowel contexts /i, I, E, ey, ae, a, v, c, o, u/. Each token was repeated in a random order five times, yielding a total of 150 tokens per subject. Formant measurements were taken at four loci: F2 onset, F2 vowel, F3 onset, and F3 vowel. Onset points coincided with the first glottal pulse following the release burst, and steady-state measures were taken approximately 60-70 ms post-onset. Graphic analyses revealed two distinct, minimally overlapping subsets grouped by front versus back. This dichotomous grouping was also seen in two-dimensional displays using only "onset" data as coordinates. Conversion to a critical band (bark) scale confirmed that front vowels were characterized by F3-F2 bark differences within a critical 3-bark distance, while back vowels exceeded the 3-bark critical distance. Using the critical distance metric, onset values categorized front vowels as accurately as steady-state measures, but showed a 20% error rate for back vowels. Front vowels had less variability than back vowels. Statistical separability was quantified with linear discriminant function analysis. Percent correct classification into vowel place groups was 87.5% using F2 and F3 onsets as input variables, and 95.7% using F2 and F3 vowel measures. Acoustic correlates of the vowel place feature are thus already present at second and third formant transition onsets.
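The 3-bark critical-distance rule can be sketched as follows. This is a hedged illustration: the Hz-to-bark conversion used here is Traunmüller's formula, which is an assumption (the study may have used a different bark transform), and the formant values in the example are invented.

```python
def hz_to_bark(f_hz):
    """Traunmüller's approximation of the Bark scale (assumed, for illustration)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def vowel_place(f2_hz, f3_hz):
    """Classify front vs back: front vowels keep F3-F2 within 3 Bark."""
    return "front" if hz_to_bark(f3_hz) - hz_to_bark(f2_hz) < 3.0 else "back"

# Invented formant values for illustration:
place_close = vowel_place(2000.0, 2800.0)  # F3-F2 spacing typical of front vowels
place_far = vowel_place(1000.0, 2500.0)    # wide F3-F2 spacing typical of back vowels
```

The same rule can be applied either to the steady-state formant values or, as the study tested, to the transition-onset values.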

8.
Voice onset time (VOT) signifies the interval between consonant onset and the start of rhythmic vocal-cord vibrations. Differential perception of consonants such as /d/ and /t/ is categorical in American English, with the boundary generally lying at a VOT of 20-40 ms. This study tests whether previously identified response patterns that differentially reflect VOT are maintained in large-scale population activity within primary auditory cortex (A1) of the awake monkey. Multiunit activity and current source density patterns evoked by the syllables /da/ and /ta/ with variable VOTs are examined. Neural representation is determined by the tonotopic organization. Differential response patterns are restricted to lower best-frequency regions. Response peaks time-locked to both consonant and voicing onsets are observed for syllables with a 40- and 60-ms VOT, whereas syllables with a 0- and 20-ms VOT evoke a single response time-locked only to consonant onset. Duration of aspiration noise is represented in higher best-frequency regions. Representation of VOT and aspiration noise in discrete tonotopic areas of A1 suggests that integration of these phonetic cues occurs in secondary areas of auditory cortex. The findings are consistent with the evolving concept that complex stimuli are encoded by synchronized activity in large-scale neuronal ensembles.

9.
This study investigates cross-speaker differences in the factors that predict voicing thresholds during abduction-adduction gestures in six normal women. Measures of baseline airflow, pulse amplitude, subglottal pressure, and fundamental frequency were made at voicing offset and onset during intervocalic /h/, produced in varying vowel environments and at different loudness levels, and subjected to relational analyses to determine which factors were most strongly related to the timing of voicing cessation or initiation. The data indicate that (a) all speakers showed differences between voicing offsets and onsets, but the degree of this effect varied across speakers; (b) loudness and vowel environment have speaker-specific effects on the likelihood of devoicing during /h/; and (c) baseline flow measures significantly predicted times of voicing offset and onset in all participants, but other variables contributing to voice timing differed across speakers. Overall, the results suggest that individual speakers have unique methods of achieving phonatory goals during running speech. These data contribute to the literature on individual differences in laryngeal function, and serve as a means of evaluating how well laryngeal models can reproduce the range of voicing behavior used by speakers during running speech tasks.

10.
Responses of chinchilla auditory-nerve fibers to synthesized stop consonants differing in voice onset time (VOT) were obtained. The syllables, heard as /ga/-/ka/ or /da/-/ta/, were similar to those previously used by others in psychophysical experiments with human and with chinchilla subjects. Average discharge rates of neurons tuned to the frequency region near the first formant generally increased at the onset of voicing, for VOTs longer than 20 ms. These rate increases were closely related to spectral amplitude changes associated with the onset of voicing and with the activation of the first formant; as a result, they provided accurate information about VOT. Neurons tuned to frequency regions near the second and third formants did not encode VOT in their average discharge rates. Modulations in the average rates of these neurons reflected spectral variations that were independent of VOT. The results are compared to other measurements of the peripheral encoding of speech sounds and to psychophysical observations suggesting that syllables with large variations in VOT are heard as belonging to one of only two phonemic categories.

11.
The perception of voicing in final velar stop consonants was investigated by systematically varying vowel duration, change in offset frequency of the final first formant (F1) transition, and rate of frequency change in the final F1 transition for several vowel contexts. Consonant-vowel-consonant (CVC) continua were synthesized for each of three vowels, [i, I, ae], which represent a range of relatively low to relatively high F1 steady-state values. Subjects responded to the stimuli under both an open- and a closed-response condition. Results of the study show that both vowel duration and F1 offset properties influence perception of final consonant voicing, with the salience of the F1 offset property higher for vowels with high-F1 steady-state frequencies than for those with low-F1 steady-state frequencies, and the opposite holding for the vowel duration property. When F1 onset and offset frequencies were controlled, the rate of F1 transition change had inconsistent and minimal effects on perception of final consonant voicing. The findings thus suggest that it is the termination value of the F1 offset transition, rather than the rate or duration of frequency change, that cues voicing in final velar stop consonants during the transition period preceding closure.

12.
Effect of masker level on overshoot
Overshoot refers to the phenomenon whereby detectability of a short-duration signal improves as the onset of that signal is delayed relative to the onset of a longer-duration masker. A popular explanation for overshoot is that it reflects short-term adaptation in auditory-nerve fibers. In this study, overshoot was measured for a 10-ms, 4-kHz signal masked by a broadband noise. In the first experiment, masker duration was 400 ms and signal onset delay was 1 or 195 ms; masker spectrum level ranged from -10 to 50 dB SPL. Overshoot was negligible at the lowest masker levels, grew to about 10-15 dB at moderate masker levels, but declined and approached 0 dB at the highest masker levels. In the second experiment, the masker duration was reduced to 100 ms, and the signal was presented with a delay of 1 or 70 ms; masker spectrum level was 10, 30, or 50 dB SPL. Overshoot was about 10 dB for the two lower masker levels, but about 0 dB at the highest masker level. The results from the second experiment suggest that the decline in overshoot at high masker levels is probably not due to auditory fatigue. It is suggested, instead, that the decline may be attributable to the neural response at high levels being dominated by those auditory-nerve fibers that do not exhibit short-term adaptation (i.e., those with low spontaneous rates and high thresholds).

13.
There exists no clear understanding of the importance of spectral tilt for perception of stop consonants. It is hypothesized that spectral tilt may be particularly salient when formant patterns are ambiguous or degraded. Here, it is demonstrated that relative change in spectral tilt over time, not absolute tilt, significantly influences perception of /b/ vs /d/. Experiments consisted of burstless synthesized stimuli that varied in spectral tilt and onset frequency of the second formant. In Experiment 1, tilt of the consonant at voice onset was varied. In Experiment 2, tilt of the vowel steady state was varied. Results of these experiments were complementary and revealed a significant contribution of relative spectral tilt change only when formant information was ambiguous. Experiments 3 and 4 replicated Experiments 1 and 2 in an /aba/-/ada/ context. The additional tilt contrast provided by the initial vowel modestly enhanced effects. In Experiment 5, there was no effect for absolute tilt when consonant and vowel tilts were identical. Consistent with earlier studies demonstrating contrast between successive local spectral features, perceptual effects of gross spectral characteristics are likewise relative. These findings have implications for perception in nonlaboratory environments and for listeners with hearing impairment.

14.
Auditory feedback influences human speech production, as demonstrated by studies using rapid pitch and loudness changes. Feedback has also been investigated using the gradual manipulation of formants in adaptation studies with whispered speech. In the work reported here, the first formant of steady-state isolated vowels was unexpectedly altered within trials for voiced speech. This was achieved using a real-time formant tracking and filtering system developed for this purpose. The first formant of the vowel /epsilon/ was manipulated 100% toward either /ae/ or /I/, and participants responded by altering their production, with average F1 compensation as large as 16.3% and 10.6% of the applied formant shift, respectively. Compensation was estimated to begin less than 460 ms after stimulus onset. The rapid formant compensations found here suggest that auditory feedback control is similar for both F0 and formants.

15.
Although some cochlear implant (CI) listeners can show good word recognition accuracy, it is not clear how they perceive and use the various acoustic cues that contribute to phonetic perceptions. In this study, the use of acoustic cues was assessed for normal-hearing (NH) listeners in optimal and spectrally degraded conditions, and also for CI listeners. Two experiments tested the tense/lax vowel contrast (varying in formant structure, vowel-inherent spectral change, and vowel duration) and the word-final fricative voicing contrast (varying in F1 transition, vowel duration, consonant duration, and consonant voicing). Identification results were modeled using mixed-effects logistic regression. These experiments suggested that under spectrally degraded conditions, NH listeners decrease their use of formant cues and increase their use of durational cues. Compared to NH listeners, CI listeners showed decreased use of spectral cues such as formant structure, formant change, and consonant voicing, and greater use of durational cues (especially for the fricative contrast). The results suggest that although NH and CI listeners may show similar accuracy on basic tests of word, phoneme, or feature recognition, they may be using different perceptual strategies in the process.

16.
Many older people have greater difficulty processing speech at suprathreshold levels than can be explained by standard audiometric configurations. Some of the difficulty may involve the processing of temporal information. Temporal information can signal linguistic distinctions. The voicing distinction, for example, that separates pairs of words such as "rapid" and "rabid" can be signaled by temporal information: longer first vowel and shorter closure characterize "rabid"; shorter vowel and longer closure characterize "rapid." In this study, naturally produced tokens of "rabid" were low-pass filtered at 3500 Hz and edited to create vowel and (silent) closure duration continua. Pure-tone audiograms and speech recognition scores were used to select the ten best-hearing subjects among 50 volunteers over age 55. Randomizations of the stimuli were presented for labeling at intensity levels of 60 and 80 dB HL to this group and to ten normal-hearing volunteers under age 25. Results showed highly significant interactions of age with the temporal factors and with intensity: the older subjects required longer silence durations before reporting "rapid," especially for the shorter vowel durations and for the higher intensity level. These data suggest that age may affect the relative salience of different acoustic cues in speech perception, and that age-related hearing loss may involve deficits in the processing of temporal information, deficits that are not measured by standard audiometry.

17.
An important speech cue is voice onset time (VOT), a cue for the perception of voicing and aspiration in word-initial stops. Preaspiration, an [h]-like sound between a vowel and the following stop, can be cued by voice offset time (VOffT), a cue which in most respects mirrors VOT. In Icelandic, VOffT is much more sensitive to the duration of the preceding vowel than VOT is to the duration of the following vowel. This has been explained by noting that preaspiration can only follow a phonemically short vowel. Lengthening the vowel, either by increasing its duration or by moving its spectrum toward that appropriate for a long vowel, will thus demand a longer VOffT to cue preaspiration. An experiment is reported showing that this greater effect of vowel quantity on the perception of VOffT than on the perception of VOT cannot be explained by the effect of F1 frequency at vowel offset.

18.
In order to assess the limitations imposed on a cochlear implant system by a wearable speech processor, the parameters extracted from a set of 11 vowels and 24 consonants were examined. An estimate of the fundamental frequency, EF0, was derived from the zero crossings of the low-pass filtered envelope of the waveform. Estimates of the first and second formant frequencies, EF1 and EF2, were derived from the zero crossings of the waveform filtered in the ranges 300-1000 and 800-4000 Hz. Estimates of the formant amplitudes, EA1 and EA2, were derived by peak detectors operating on the outputs of the same filters. For vowels, these parameters corresponded well to the first and second formants and gave sufficient information to identify each vowel. For consonants, the relative levels and onset times of EA1 and EA2 and the EF0 values gave cues to voicing. The variation in time of EA1, EA2, EF1, and EF2 gave cues to the manner of articulation. Cues to the place of articulation were given by EF1 and EF2. When pink noise was added, the parameters were gradually degraded as the signal-to-noise ratio decreased. Consonants were affected more than vowels, and EF2 was affected more than EF1. Results for three good patients using a speech processor that coded EF0 as an electric pulse rate, EF1 and EF2 as electrode positions, and EA1 and EA2 as electric current levels confirmed that the parameters were useful for recognition of vowels and consonants. Average scores were 76% for recognition of 11 vowels and 71% for 12 consonants in the hearing-alone condition. The error rates were 4% for voicing, 12% for manner, and 25% for place.
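The zero-crossing principle behind the EF1/EF2 estimates can be sketched as follows. This is a hedged simplification, not the processor's actual code: the processor derived its estimates from band-filtered waveforms (300-1000 and 800-4000 Hz), whereas here a pure tone stands in for one band's output.

```python
import numpy as np

def zero_crossing_freq(x, fs):
    """Estimate the dominant frequency of a band-filtered signal from the
    rate of positive-going zero crossings, as in EF1/EF2-style extraction."""
    idx = np.where((x[:-1] < 0) & (x[1:] >= 0))[0]  # positive-going crossings
    if len(idx) < 2:
        return 0.0
    mean_period_samples = np.mean(np.diff(idx))     # samples per cycle
    return fs / mean_period_samples

fs = 16000
t = np.arange(0, 0.1, 1.0 / fs)
tone = np.sin(2 * np.pi * 440 * t)   # stands in for one band-filtered output
f_est = zero_crossing_freq(tone, fs) # close to 440 Hz
```

An amplitude estimate in the spirit of EA1/EA2 would simply track the envelope peak of the same filtered band.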

19.
The intelligibility of speech is sustained at lower signal-to-noise ratios when the speech has a different interaural configuration from the noise. This paper argues that the advantage arises in part because listeners combine evidence of the spectrum of speech in the across-frequency profile of interaural decorrelation with evidence in the across-frequency profile of intensity. To support the argument, three experiments examined the ability of listeners to integrate and segregate evidence of vowel formants in these two profiles. In experiment 1, listeners achieved accurate identification of the members of a small set of vowels whose first formant was defined by a peak in one profile and whose second formant was defined by a peak in the other profile. This result demonstrates that integration is possible. Experiment 2 demonstrated that integration is not mandatory, insofar as listeners could report the identity of a vowel defined entirely in one profile despite the presence of a competing vowel in the other profile. The presence of the competing vowel reduced accuracy of identification, however, showing that segregation was incomplete. Experiment 3 demonstrated that segregation of the binaural vowel, in particular, can be increased by the introduction of an onset asynchrony between the competing vowels. The results of experiments 2 and 3 show that the intrinsic cues for segregation of the profiles are relatively weak. Overall, the results are compatible with the argument that listeners can integrate evidence of spectral peaks from the two profiles.

20.
Spectral analysis of vowels during connected speech can be performed using the spectral intensity distribution within critical bands corresponding to a natural scale on the basilar membrane. Normalization of the spectra provides the opportunity to make objective comparisons independent of the recording level. An increasing envelope peak between 3,150 and 3,700 Hz has been confirmed statistically for a combination of seven vowels in three groups of male speakers with hoarse, normal, and professional voices. Each vowel is also analyzed individually. The local energy maximum is called “the speaker's formant” and can be found in the region of the fourth formant. The steepness of the spectral slope (i.e., the rate of decline) becomes less pronounced when the sonority or the intensity of the voice increases. The speaker's formant is connected with the sonorous quality of the voice. It increases gradually and is approximately 10 dB higher in professional male voices than in normal male voices at neutral loudness (60 dB at 0.3 min). The peak intensity becomes stronger (30 dB above normal voices) when the overall speaking loudness is increased to 80 dB. Shouting increases the spectral energy of the adjacent critical bands but not the speaker's formant itself.
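The level-normalized critical-band analysis can be sketched as follows. This is a hedged illustration, not the study's actual procedure: the band edges are the conventional Bark scale (on which the 3,150-3,700 Hz "speaker's formant" region is exactly one critical band), and normalizing by total power makes the band profile independent of recording level.

```python
import numpy as np

# Conventional Bark critical-band edges in Hz (an assumption; the study's
# exact band definitions may differ).
BARK_EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
              1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
              9500, 12000, 15500]

def critical_band_levels(x, fs):
    """Power per Bark band, normalized so the profile sums to 1
    (i.e., independent of recording level)."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    bands = np.array([power[(freqs >= lo) & (freqs < hi)].sum()
                      for lo, hi in zip(BARK_EDGES[:-1], BARK_EDGES[1:])])
    return bands / bands.sum()

fs = 16000
t = np.arange(fs) / fs                 # 1 s of signal
tone = np.sin(2 * np.pi * 3400 * t)    # energy in the 3150-3700 Hz band
levels = critical_band_levels(tone, fs)
# levels.argmax() identifies the band containing the energy peak; for this
# tone that is the 3150-3700 Hz band, the "speaker's formant" region.
```

On real vowel spectra, comparing such normalized band profiles across speakers is what allows the speaker's-formant peak to be measured independently of how loudly the recording was made.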
