Similar documents
20 similar documents found.
1.
Sound coming directly from a source is often accompanied by reflections arriving from different directions. However, the "precedence effect" occurs when listeners judge such a source's direction: information in the direct, first-arriving sound tends to govern the direction heard for the overall sound. This paper asks whether the spectral envelope of the direct sound has a similar, dominant influence on the spectral envelope perceived for the whole sound. A continuum between two vowels was produced and then a "two-part" filter distorted each step. The beginning of this filter's unit-sample response simulated a direct sound with no distortion of the spectral envelope. The second part simulated a reflection pattern that distorted the spectral envelope. The reflections' frequency response was designed to give the spectral envelope of one of the continuum's end-points to the other end-point. Listeners' identifications showed that the reflections in two-part filters had a substantial influence, because sounds tended to be identified as the vowel whose spectral envelope the reflection pattern imposed. This effect was not reduced when the interaural delays of the reflections and the direct sound were substantially different. Also, when the reflections were caused to precede the direct sound, the effects were much the same. By contrast, in measurements of lateralization the precedence effect was obtained. Here, the lateral position of the whole sound was largely governed by the interaural delay of the direct sound, and was hardly affected by the interaural delay of the reflections.
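The "two-part" filter described in this abstract can be pictured as a single unit-sample response: an initial impulse (the direct sound, leaving the spectral envelope undistorted) followed, after a delay, by a reflection pattern whose response distorts the envelope. A minimal sketch, assuming numpy and a small hypothetical reflection FIR standing in for the envelope-mapping filter the authors designed:

```python
import numpy as np

def two_part_filter(reflection_fir, delay_samples, reflection_gain=0.5):
    """Unit-sample response: direct impulse plus a delayed reflection pattern."""
    h = np.zeros(delay_samples + len(reflection_fir))
    h[0] = 1.0                                  # direct sound: flat, undistorted
    h[delay_samples:] += reflection_gain * np.asarray(reflection_fir)
    return h

# hypothetical reflection pattern (stand-in for the vowel-envelope-mapping FIR)
refl = np.array([0.6, 0.3, 0.1])
h = two_part_filter(refl, delay_samples=8)
distorted = np.convolve(np.r_[1.0, np.zeros(15)], h)  # filtering a unit impulse returns h
```

Filtering any continuum step with `h` adds the reflections' envelope distortion while leaving the first-arriving part of the sound intact.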

2.
Perceptual compensation for reverberation was measured by embedding test words in contexts that were either spoken phrases or processed versions of this speech. The processing gave steady-spectrum contexts with no changes in the shape of the short-term spectral envelope over time, but with fluctuations in the temporal envelope. Test words were from a continuum between "sir" and "stir." When the amount of reverberation in test words was increased, to a level above the amount in the context, they sounded more like "sir." However, when the amount of reverberation in the context was also increased, to the level present in the test word, there was perceptual compensation in some conditions so that test words sounded more like "stir" again. Experiments here found compensation with speech contexts and with some steady-spectrum contexts, indicating that fluctuations in the context's temporal envelope can be sufficient for compensation. Other results suggest that the effectiveness of speech contexts is partly due to the narrow-band "frequency-channels" of the auditory periphery, where temporal-envelope fluctuations can be more pronounced than they are in the sound's broadband temporal envelope. Further results indicate that for compensation to influence speech, the context needs to be in a broad range of frequency channels.

3.
Listeners were asked to identify modified recordings of the words "sir" and "stir," which were spoken by an adult male British-English speaker. Steps along a continuum between the words were obtained by a pointwise interpolation of their temporal envelopes. These test words were embedded in a longer "context" utterance, and played with different amounts of reverberation. Increasing only the test-word's reverberation shifts the listener's category boundary so that more "sir"-identifications are made. This effect reduces when the context's reverberation is also increased, indicating perceptual compensation that is informed by the context. Experiment 1 finds that compensation is more prominent in rapid speech, that it varies between rooms, that it is more prominent when the test-word's reverberation is high, and that it increases with the context's reverberation. Further experiments show that compensation persists when the room is switched between the context and the test word, when presentation is monaural, and when the context is reversed. However, compensation reduces when the context's reverberation pattern is reversed, as well as when noise-versions of the context are used. "Tails" that reverberation introduces at the ends of sounds and at spectral transitions may inform the compensation mechanism about the amount of reflected sound in the signal.
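The continuum construction in this abstract is a pointwise interpolation between the two words' temporal envelopes. A minimal sketch, assuming the two envelopes have already been extracted (e.g., by rectification and smoothing) and time-aligned to equal length; the toy arrays below are illustrative, not the authors' data:

```python
import numpy as np

def envelope_continuum(env_a, env_b, n_steps):
    """Pointwise interpolation between two equal-length temporal envelopes.

    Returns n_steps envelopes running from env_a ("sir") to env_b ("stir").
    """
    env_a = np.asarray(env_a, dtype=float)
    env_b = np.asarray(env_b, dtype=float)
    alphas = np.linspace(0.0, 1.0, n_steps)
    return [(1.0 - a) * env_a + a * env_b for a in alphas]

env_sir  = np.array([0.2, 0.8, 0.5, 0.1])   # toy envelopes; real ones would be
env_stir = np.array([0.2, 0.1, 0.9, 0.1])   # extracted from the recordings
steps = envelope_continuum(env_sir, env_stir, 5)
```

Each interpolated envelope would then modulate a common carrier to resynthesize one step of the "sir"-"stir" continuum.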

4.
The vowel in part-word repetitions in stuttered speech often sounds neutralized. In the present article, measurements of the excitatory source made during such episodes of dysfluency are reported. These measurements show that, compared with fluent utterances, the glottal volume velocities are lower in amplitude and shorter in duration and that the energy occurs more towards the low-frequency end of the spectrum. In a first perceptual experiment, the effects of varying the amplitude and duration of the glottal source were assessed. The glottal volume velocity recordings of the /ae/ vowels used in the analyses were employed as driving sources for an articulatory synthesizer so that judgments about the vowel quality could be made. With dysfluent glottal sources (either as spoken or by editing a fluent source so that it was low in amplitude and brief), the vowels sounded more neutralized than with fluent glottal sources (as spoken or by editing a dysfluent source to increase its amplitude and lengthen it). In a second perceptual experiment, synthetic glottal volume velocities were used to verify these findings and to assess the influence of the low-frequency emphasis in the dysfluent speech. This experiment showed that spectral bias and duration both cause stuttered vowels to sound neutralized.

5.
In the experiments reported here, perceived speaker identity was controlled by manipulating the fundamental frequency (F0) range of carrier phrases in which speech tokens were embedded. In the first experiment, words from two "hood"-"hud" continua were synthesized with different F0. The words were then embedded in synthetic carrier phrases with intonation contours which reduced perceived speaker identity differences for test items with different F0. The results indicated that when perceived speaker identity differences were reduced, the effect of F0 on vowel identification was also reduced. Experiment 2 indicated that when items presented in carrier phrases are matched for speaker identity and F0 with items in isolation, there is no effect for presentation in a carrier phrase. Experiment 3 involved the presentation of vowels from the "hood"-"hud" continuum in two different intonational contexts which were judged to have been produced by different speakers, even though the F0 of the test word was identical in the two contexts. There was a shift in identification as a result of the intonational context which was interpreted as evidence for the role of perceived identity in vowel normalization. Overall, the experiments suggest that perceived speaker identity is a better predictor of vowel normalization effects than is intrinsic F0. This indicates that the role of F0 in vowel normalization is mediated through perceived speaker identity.

6.
Research on the perception of vowels in the last several years has given rise to new conceptions of vowels as articulatory, acoustic, and perceptual events. Starting from a "simple" target model in which vowels were characterized articulatorily as static vocal tract shapes and acoustically as points in a first and second formant (F1/F2) vowel space, this paper briefly traces the evolution of vowel theory in the 1970s and 1980s in two directions. (1) Elaborated target models represent vowels as target zones in perceptual spaces whose dimensions are specified as formant ratios. These models have been developed primarily to account for perceivers' solution of the "speaker normalization" problem. (2) Dynamic specification models emphasize the importance of formant trajectory patterns in specifying vowel identity. These models deal primarily with the problem of "target undershoot" associated with the coarticulation of vowels with consonants in natural speech and with the issue of "vowel-inherent spectral change" or diphthongization of English vowels. Perceptual studies are summarized that motivate these theoretical developments.

7.
This study investigated the effects of age and hearing loss on perception of accented speech presented in quiet and noise. The relative importance of alterations in phonetic segments vs. temporal patterns in a carrier phrase with accented speech also was examined. English sentences recorded by a native English speaker and a native Spanish speaker, together with hybrid sentences that varied the native language of the speaker of the carrier phrase and the final target word of the sentence were presented to younger and older listeners with normal hearing and older listeners with hearing loss in quiet and noise. Effects of age and hearing loss were observed in both listening environments, but varied with speaker accent. All groups exhibited lower recognition performance for the final target word spoken by the accented speaker compared to that spoken by the native speaker, indicating that alterations in segmental cues due to accent play a prominent role in intelligibility. Effects of the carrier phrase were minimal. The findings indicate that recognition of accented speech, especially in noise, is a particularly challenging communication task for older people.

8.
On the role of spectral transition for speech perception   (total citations: 2; self-citations: 0; citations by others: 2)
This paper examines the relationship between dynamic spectral features and the identification of Japanese syllables modified by initial and/or final truncation. The experiments confirm several main points. "Perceptual critical points," where the percent correct identification of the truncated syllable as a function of the truncation position changes abruptly, are related to maximum spectral transition positions. A speech wave of approximately 10 ms in duration that includes the maximum spectral transition position bears the most important information for consonant and syllable perception. Consonant and vowel identification scores simultaneously change as a function of the truncation position in the short period, including the 10-ms period for final truncation. This suggests that crucial information for both vowel and consonant identification is contained across the same initial part of each syllable. The spectral transition is more crucial than unvoiced and buzz bar periods for consonant (syllable) perception, although the latter features are of some perceptual importance. Also, vowel nuclei are not necessary for either vowel or syllable perception.

9.
Natural spoken language processing includes not only speech recognition but also identification of the speaker's gender, age, emotional, and social status. Our purpose in this study is to evaluate whether temporal cues are sufficient to support both speech and speaker recognition. Ten cochlear-implant and six normal-hearing subjects were presented with vowel tokens spoken by three men, three women, two boys, and two girls. In one condition, the subject was asked to recognize the vowel. In the other condition, the subject was asked to identify the speaker. Extensive training was provided for the speaker recognition task. Normal-hearing subjects achieved nearly perfect performance in both tasks. Cochlear-implant subjects achieved good performance in vowel recognition but poor performance in speaker recognition. The level of the cochlear implant performance was functionally equivalent to normal performance with eight spectral bands for vowel recognition but only to one band for speaker recognition. These results show a dissociation between speech and speaker recognition with primarily temporal cues, highlighting the limitation of current speech processing strategies in cochlear implants. Several methods, including explicit encoding of fundamental frequency and frequency modulation, are proposed to improve speaker recognition for current cochlear implant users.
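The band-equivalence comparison in this abstract rests on noise-excited channel vocoding, the standard way of simulating cochlear-implant processing for normal-hearing listeners: each band's temporal envelope is kept while its spectral fine structure is replaced with noise. A minimal sketch, assuming numpy/scipy and illustrative band edges and filter orders (not the authors' exact processing parameters):

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def noise_vocode(x, fs, n_bands=8, lo=100.0, hi=6000.0):
    """Noise-excited channel vocoder: keep each band's temporal envelope,
    replace its fine structure with band-limited noise."""
    edges = np.geomspace(lo, hi, n_bands + 1)        # log-spaced band edges
    rng = np.random.default_rng(0)
    out = np.zeros(len(x))
    for f1, f2 in zip(edges[:-1], edges[1:]):
        sos = butter(4, [f1, f2], btype="band", fs=fs, output="sos")
        band = sosfilt(sos, x)
        env = np.abs(hilbert(band))                  # temporal envelope of the band
        noise = sosfilt(sos, rng.standard_normal(len(x)))
        out += env * noise                           # envelope-modulated carrier
    return out

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)                      # stand-in for a vowel token
y = noise_vocode(x, fs)
```

Varying `n_bands` (e.g., 1 vs. 8) is how such studies titrate the amount of spectral detail available alongside the temporal cues.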

10.
Spectral sharpness and vowel dissimilarity   (total citations: 1; self-citations: 0; citations by others: 1)
The effect of sharpening or smoothing the spectral envelopes of synthetic vowel-like sounds on the dissimilarities perceived among these sounds was investigated by means of triadic comparisons. When a spectral envelope (dB on a log-frequency scale) is considered the sum of a series of sinusoidal spectral modulations (or ripples) of different densities (the ripple spectrum), spectral sharpening or smoothing can be described as an amplification or attenuation of a part of the original ripple spectrum. For a set of nine sounds comprising different degrees of spectral sharpening of a single vowel, the perceived dissimilarities were found to be dominated by a specific part of the ripple spectrum, i.e., by spectral modulations with a density of about 2 ripples/oct. The possible role of lateral suppression in relation to this dominant region is discussed. For a set of 18 sounds comprising six vowels, each in three different versions (sharpened, normal, or smoothed), the dissimilarities were found to be determined mainly by the global shape of the spectral envelopes, i.e., by spectral modulations up to about 1.5-2 ripples/oct. Details of the spectral envelope (including the region of 2 ripples/oct where lateral suppression is effective) appear to be of minor influence on vowel dissimilarities.
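The ripple-spectrum manipulation described above amounts to Fourier analysis of the dB envelope along a log-frequency axis: transform, scale the modulation components in a chosen density band, and transform back. A minimal sketch, assuming numpy, an envelope already sampled uniformly in log frequency, and illustrative band limits and gain:

```python
import numpy as np

def sharpen_envelope(env_db, pts_per_oct, r_lo=1.5, r_hi=2.5, gain=2.0):
    """Amplify one band of the 'ripple spectrum' of a dB spectral envelope
    sampled uniformly on a log-frequency axis (gain>1 sharpens, gain<1 smooths)."""
    ripple = np.fft.rfft(env_db)
    density = np.fft.rfftfreq(len(env_db), d=1.0 / pts_per_oct)  # ripples/oct
    band = (density >= r_lo) & (density <= r_hi)
    ripple[band] *= gain
    return np.fft.irfft(ripple, n=len(env_db))

pts_per_oct = 16                                   # assumed sampling density
n = np.arange(64)                                  # 4 octaves of envelope samples
env = np.cos(2 * np.pi * 2.0 * n / pts_per_oct)    # pure 2-ripples/oct envelope
sharp = sharpen_envelope(env, pts_per_oct)
```

Because the toy envelope is a pure 2-ripples/oct modulation, it falls entirely inside the amplified band and is simply doubled.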

11.
A recent study [Smith and Patterson, J. Acoust. Soc. Am. 118, 3177-3186 (2005)] demonstrated that both the glottal-pulse rate (GPR) and the vocal-tract length (VTL) of vowel sounds have a large effect on the perceived sex and age (or size) of a speaker. The vowels for all of the "different" speakers in that study were synthesized from recordings of the sustained vowels of one adult male speaker. This paper presents a follow-up study in which a range of vowels were synthesized from recordings of four different speakers--an adult man, an adult woman, a young boy, and a young girl--to determine whether the sex and age of the original speaker would have an effect upon listeners' judgments of whether a vowel was spoken by a man, woman, boy, or girl, after they were equated for GPR and VTL. The sustained vowels of the four speakers were scaled to produce the same combinations of GPR and VTL, which covered the entire range normally encountered in everyday life. The results show that listeners readily distinguish children from adults based on their sustained vowels but that they struggle to distinguish the sex of the speaker.

12.
Detection thresholds for gaps and overlaps, that is, acoustic and perceived silences and stretches of overlapping speech in speaker changes, were determined. Subliminal gaps and overlaps were categorized as no-gap-no-overlaps. The established gap and overlap detection thresholds both corresponded to the duration of a long vowel, or about 120 ms. These detection thresholds are valuable for mapping the perceptual speaker change categories gaps, overlaps, and no-gap-no-overlaps into the acoustic domain. Furthermore, the detection thresholds allow generation and understanding of gaps, overlaps, and no-gap-no-overlaps in human-like spoken dialogue systems.

13.
Dynamic specification of coarticulated vowels   (total citations: 1; self-citations: 0; citations by others: 1)
An adequate theory of vowel perception must account for perceptual constancy over variations in the acoustic structure of coarticulated vowels contributed by speakers, speaking rate, and consonantal context. We modified recorded consonant-vowel-consonant syllables electronically to investigate the perceptual efficacy of three types of acoustic information for vowel identification: (1) static spectral "targets," (2) duration of syllabic nuclei, and (3) formant transitions into and out of the vowel nucleus. Vowels in /b/-vowel-/b/ syllables spoken by one adult male (experiment 1) and by two females and two males (experiment 2) served as the corpus, and seven modified syllable conditions were generated in which different parts of the digitized waveforms of the syllables were deleted and the temporal relationships of the remaining parts were manipulated. Results of identification tests by untrained listeners indicated that dynamic spectral information, contained in initial and final transitions taken together, was sufficient for accurate identification of vowels even when vowel nuclei were attenuated to silence. Furthermore, the dynamic spectral information appeared to be efficacious even when durational parameters specifying intrinsic vowel length were eliminated.

14.
It is well known that the spectral envelope is a perceptually salient attribute in musical instrument timbre perception. While a number of studies have explored discrimination thresholds for changes to the spectral envelope, the question of how sensitivity varies as a function of center frequency and bandwidth for musical instruments has yet to be addressed. In this paper a two-alternative forced-choice experiment was conducted to observe perceptual sensitivity to modifications made on trumpet, clarinet and viola sounds. The experiment involved attenuating 14 frequency bands for each instrument in order to determine discrimination thresholds as a function of center frequency and bandwidth. The results indicate that perceptual sensitivity is governed by the first few harmonics and sensitivity does not improve when extending the bandwidth any higher. However, sensitivity was found to decrease if changes were made only to the higher frequencies and continued to decrease as the distorted bandwidth was widened. The results are analyzed and discussed with respect to two other spectral envelope discrimination studies in the literature as well as what is predicted from a psychoacoustic model.

15.
This paper presents a parameter for objectively evaluating singing voice quality. The power spectrum of the vowel sound /a/ was analyzed by Fast Fourier Transform. The greatest harmonic peak between 2 and 4 kHz and the greatest harmonic peak between 0 and 2 kHz were identified. The power ratio of these peaks, termed the singing power ratio (SPR), was calculated in 37 singers and 20 nonsingers. SPR of sung /a/ in singers was significantly greater than in nonsingers. In singers, SPR of sung /a/ was significantly greater than that of spoken /a/. By digital signal processing, the power spectrum of sung /a/ was varied, and the processed sounds were perceptually analyzed. SPR had a significant relationship with perceptual scores of "ringing" quality. SPR provides an important quantitative measurement for evaluating singing voice quality for all voice types, including soprano.
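The SPR defined above is directly computable: take the power spectrum of a sustained vowel and compare the strongest peak in the 2-4 kHz band with the strongest peak below 2 kHz. A minimal sketch, assuming numpy, a Hann window, and a synthetic harmonic "vowel" whose upper partials are 20 dB down (so its SPR should come out near -20 dB):

```python
import numpy as np

def singing_power_ratio(x, fs):
    """SPR: level difference (dB) between the strongest spectral peak in
    2-4 kHz and the strongest peak in 0-2 kHz of a sustained vowel."""
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2   # power spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    p_low = spec[(freqs > 0) & (freqs < 2000)].max()
    p_high = spec[(freqs >= 2000) & (freqs < 4000)].max()
    return 10.0 * np.log10(p_high / p_low)

# toy "vowel": a 220 Hz harmonic series whose partials above 2 kHz are 20 dB down
fs, dur = 16000, 0.5
t = np.arange(int(fs * dur)) / fs
x = sum((1.0 if k * 220 < 2000 else 0.1) * np.sin(2 * np.pi * k * 220 * t)
        for k in range(1, 18))
spr = singing_power_ratio(x, fs)
```

A trained singer's "ringing" voice boosts the 2-4 kHz peak (the singer's formant region), raising SPR toward 0 dB; the toy signal instead sits near -20 dB by construction.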

16.
Two methods are described for speaker normalizing vowel spectral features: one is a multivariable linear transformation of the features and the other is a polynomial warping of the frequency scale. Both normalization algorithms minimize the mean-square error between the transformed data of each speaker and vowel target values obtained from a "typical speaker." These normalization techniques were evaluated both for formants and a form of cepstral coefficients (DCTCs) as spectral parameters, for both static and dynamic features, and with and without fundamental frequency (F0) as an additional feature. The normalizations were tested with a series of automatic classification experiments for vowels. For all conditions, automatic vowel classification rates increased for speaker-normalized data compared to rates obtained for nonnormalized parameters. Typical classification rates for vowel test data for nonnormalized and normalized features respectively are as follows: static formants--69%/79%; formant trajectories--76%/84%; static DCTCs--75%/84%; DCTC trajectories--84%/91%. The linear transformation methods increased the classification rates slightly more than the polynomial frequency warping. The addition of F0 improved the automatic recognition results for nonnormalized vowel spectral features as much as 5.8%. However, the addition of F0 to speaker-normalized spectral features resulted in much smaller increases in automatic recognition rates.
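The second method above, polynomial warping of the frequency scale, reduces to an ordinary least-squares fit: choose polynomial coefficients that minimize the mean-square error between a speaker's warped formant frequencies and the "typical speaker's" targets. A minimal sketch, assuming numpy and hypothetical formant values (the degree and the data are illustrative, not the authors'):

```python
import numpy as np

def fit_frequency_warp(speaker_hz, target_hz, degree=2):
    """Least-squares polynomial warp of the frequency axis mapping one
    speaker's formant measurements onto a 'typical speaker's' targets."""
    coefs = np.polyfit(speaker_hz, target_hz, degree)  # minimizes mean-square error
    return np.poly1d(coefs)

# hypothetical F1/F2 measurements (this speaker vs. typical-speaker targets)
speaker = np.array([310.0, 620.0, 850.0, 1100.0, 1900.0, 2300.0])
target  = np.array([280.0, 560.0, 760.0, 1000.0, 1750.0, 2100.0])
warp = fit_frequency_warp(speaker, target)
normalized = warp(speaker)
```

Because the identity mapping is itself a polynomial, the fitted warp can never do worse than leaving the frequencies unnormalized, which is why classification on warped features should not degrade.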

17.
Coarticulatory influences on the perceived height of nasal vowels   (total citations: 1; self-citations: 0; citations by others: 1)
Certain of the complex spectral effects of vowel nasalization bear a resemblance to the effects of modifying the tongue or jaw position with which the vowel is produced. Perceptual evidence suggests that listener misperceptions of nasal vowel height arise as a result of this resemblance. Whereas previous studies examined isolated nasal vowels, this research focused on the role of phonetic context in shaping listeners' judgments of nasal vowel height. Identification data obtained from native American English speakers indicated that nasal coupling does not necessarily lead to listener misperceptions of vowel quality when the vowel's nasality is coarticulatory in nature. The perceived height of contextually nasalized vowels (in a [bVnd] environment) did not differ from that of oral vowels (in a [bVd] environment) produced with the same tongue-jaw configuration. In contrast, corresponding noncontextually nasalized vowels (in a [bVd] environment) were perceived as lower in quality than vowels in the other two conditions. Presumably the listeners' lack of experience with distinctive vowel nasalization prompted them to resolve the spectral effects of noncontextual nasalization in terms of tongue or jaw height, rather than velic height. The implications of these findings with respect to sound changes affecting nasal vowel height are also discussed.

18.
The four experiments reported here measure listeners' accuracy and consistency in adjusting a formant frequency of one- or two-formant complex sounds to match the timbre of a target sound. By presenting the target and the adjustable sound on different fundamental frequencies, listeners are prevented from performing the task by comparing the absolute or relative levels of resolved spectral components. Experiment 1 uses two-formant vowellike sounds. When the two sounds have the same F0, the variability of matches (within-subject standard deviation) for either the first or the second formant is around 1%-3%, which is comparable to existing data on formant frequency discrimination thresholds. With a difference in F0, variability increases to around 8% for first-formant matches, but to only about 4% for second-formant matches. Experiment 2 uses sounds with a single formant at 1100 or 1200 Hz with both sounds on either low or high fundamental frequencies. The increase in variability produced by a difference in F0 is greater for high F0's (where the harmonics close to the formant peak are resolved) than it is for low F0's (where they are unresolved). Listeners also showed systematic errors in their mean matches to sounds with different high F0's. The direction of the systematic errors was towards the most intense harmonic. Experiments 3 and 4 showed that introduction of a vibratolike frequency modulation (FM) on F0 reduces the variability of matches, but does not reduce the systematic error. The experiments demonstrate, for the specific frequencies and FM used, that there is a perceptual cost to interpolating a spectral envelope across resolved harmonics.

19.
Current speech perception models propose that relative perceptual difficulties with non-native segmental contrasts can be predicted from cross-language phonetic similarities. Japanese (J) listeners performed a categorical discrimination task in which nine contrasts (six adjacent height pairs, three front/back pairs) involving eight American English (AE) vowels [iː, ɪ, ɛ, æ, ɑː, ʌ, ʊ, uː] in /hVbə/ disyllables were tested. The listeners also completed a perceptual assimilation task (categorization as J vowels with category goodness ratings). Perceptual assimilation patterns (quantified as categorization overlap scores) were highly predictive of discrimination accuracy (r_s = 0.93). Results suggested that J listeners used both spectral and temporal information in discriminating vowel contrasts.

20.
The length of the vocal tract is correlated with speaker size, and so speech sounds carry information about the size of the speaker in a form that is interpretable by the listener. A wide range of different vocal tract lengths exist in the population and humans are able to distinguish speaker size from the speech. Smith et al. [J. Acoust. Soc. Am. 117, 305-318 (2005)] presented vowel sounds to listeners and showed that the ability to discriminate speaker size extends beyond the normal range of speaker sizes, which suggests that information about the size and shape of the vocal tract is segregated automatically at an early stage in the processing. This paper reports an extension of the size discrimination research using a much larger set of speech sounds, namely, 180 consonant-vowel and vowel-consonant syllables. Despite the pronounced increase in stimulus variability, there was actually an improvement in discrimination performance over that supported by vowel sounds alone. Performance with vowel-consonant syllables was slightly better than with consonant-vowel syllables. These results support the hypothesis that information about the length of the vocal tract is segregated at an early stage in auditory processing.
