Similar Articles
1.
The benefit of supplementing speechreading with frequency-selective sound-pressure information was studied by auditorily presenting this information to normal-hearing listeners. The sound-pressure levels in one or two frequency bands of the speech signal with center frequencies of 500, 1600, and 3160 Hz, respectively, and with 1- or 1/3-oct bandwidth were used to amplitude-modulate pure-tone carriers with frequencies equal to the center frequencies of the filter bands. Short sentences were presented to 18 normal-hearing listeners under the conditions of speechreading-only and speechreading combined with the sound-pressure information. The mean percentage of correctly perceived syllables increased from 22.8% for speechreading-only to 65.7% when sound-pressure information was supplied in a single 1-oct band at 500 Hz and to 86.7% with two 1-oct bands at 500 and 3160 Hz, respectively. The latter signal scored only 26.7% correct syllables without accompanying visual information.
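A minimal Python sketch of the supplement described above: the level envelope of a 1-oct speech band amplitude-modulates a pure-tone carrier at the band's center frequency. The filter orders, the 50-Hz envelope-smoothing cutoff, and the function name are illustrative assumptions, not details from the study.

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def band_modulated_carrier(speech, fs, fc=500.0, octaves=1.0):
    """Amplitude-modulate a tone at fc with the level of a 1-oct band of speech."""
    lo, hi = fc * 2 ** (-octaves / 2), fc * 2 ** (octaves / 2)
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    band = sosfilt(sos, speech)
    env = np.abs(hilbert(band))                      # instantaneous band level
    sos_lp = butter(2, 50.0, btype="lowpass", fs=fs, output="sos")
    env = np.maximum(sosfilt(sos_lp, env), 0.0)      # smoothed, non-negative
    t = np.arange(len(speech)) / fs
    return env * np.sin(2 * np.pi * fc * t)          # level-modulated carrier
```

Summing two such carriers, one at 500 Hz and one at 3160 Hz, would reproduce the two-band condition.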

2.
The ability of five profoundly hearing-impaired subjects to "track" connected speech and to make judgments about the intonation and stress in spoken sentences was evaluated under a variety of auditory-visual conditions. These included speechreading alone, speechreading plus speech (low-pass filtered at 4 kHz), and speechreading plus a tone whose frequency, intensity, and temporal characteristics were matched to the speaker's fundamental frequency (F0). In addition, several frequency transfer functions were applied to the normal F0 range, resulting in new ranges that were both transposed and expanded with respect to the original F0 range. Three of the five subjects were able to use several of the tonal representations of F0 nearly as well as speech to improve their speechreading rates and to make appropriate judgments concerning sentence intonation and stress. The remaining two subjects greatly improved their identification performance for intonation and stress patterns when expanded F0 signals were presented alone (i.e., without speechreading), but had difficulty integrating visual and auditory information at the connected discourse level, despite intensive training in the connected discourse tracking procedure lasting from 27.8 to 33.8 h.

3.
The ability to combine speechreading (i.e., lipreading) with prosodic information extracted from the low-frequency regions of speech was evaluated with three normally hearing subjects. The subjects were tested in a connected discourse tracking procedure which measures the rate at which spoken text can be repeated back without any errors. Receptive conditions included speechreading alone (SA), speechreading plus amplitude envelope cues (AM), speechreading plus fundamental frequency cues (FM), and speechreading plus intensity-modulated fundamental frequency cues (AM + FM). In a second experiment, one subject was further tested in a speechreading plus voicing duration cue condition (DUR). Speechreading performance was best in the AM + FM condition (83.6 words per minute) and worst in the SA condition (41.1 words per minute). Tracking levels in the AM, FM, and DUR conditions were 73.7, 73.6, and 65.4 words per minute, respectively. The average tracking rate obtained when subjects were allowed to listen to the talker's normal (unfiltered) speech (NS condition) was 108.3 words per minute. These results demonstrate that speechreaders can use information related to the rhythm, stress, and intonation patterns of speech to improve their speechreading performance.
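A sketch of how the AM + FM supplement above might be generated, assuming per-frame F0 and amplitude-envelope tracks are already available (e.g., from a pitch tracker); the sampling rate and frame hop are illustrative assumptions.

```python
import numpy as np

def am_fm_supplement(f0_track, env_track, fs=16000, hop=160):
    """Sinusoid following the talker's F0, intensity-modulated by the envelope."""
    f0 = np.repeat(f0_track, hop)            # per-sample F0 in Hz (0 = unvoiced)
    env = np.repeat(env_track, hop)          # per-sample amplitude envelope
    phase = 2 * np.pi * np.cumsum(f0) / fs   # integrate instantaneous frequency
    return env * np.sin(phase) * (f0 > 0)    # silent during unvoiced frames
```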

4.
Speechreading supplemented with auditorily presented speech parameters
Results are reported from two experiments in which the benefit of supplementing speechreading with auditorily presented information about the speech signal was investigated. In experiment I, speechreading was supplemented with information about the prosody of the speech signal. For ten normal-hearing subjects with no experience in speechreading, the intelligibility score for sentences increased significantly when speechreading was supplemented with information about the overall amplitude of the speech signal, information about the fundamental frequency, or both. Binary information about voicing appeared not to be a significant supplement. In experiment II, the best-scoring supplements of experiment I were compared with two supplementary signals from our previous studies, i.e., information about the sound-pressure levels in two 1-oct filter bands centered at 500 and 3160 Hz, or information about the frequencies of the first and second formants from voiced speech segments. Sentence-intelligibility scores were measured for 24 normal-hearing subjects with no experience in speechreading, and for 12 normal-hearing experienced speechreaders. For the inexperienced speechreaders, the sound-pressure levels appeared to be the best supplement (87.1% correct syllables). For the experienced speechreaders, the formant-frequency information (88.6% correct) and the fundamental-frequency plus amplitude information (86.0% correct) were supplements as efficient as the sound-pressure information (86.1% correct). Discrimination of phonemes (both consonants and vowels) was measured for the group of 24 inexperienced speechreaders. Percentage correct responses, confusions among phonemes, and the percentage of transmitted information about different types of manner and place of articulation and about the feature voicing are presented.

5.
Fundamental frequency (F0) information extracted from low-pass-filtered speech and aurally presented as frequency-modulated sinusoids can greatly improve speechreading performance [Grant et al., J. Acoust. Soc. Am. 77, 671-677 (1985)]. To use this source of information, listeners must be able to detect the presence or absence of F0 (i.e., voicing), discriminate changes in frequency, and make judgments about the linguistic meaning of perceived variations in F0. In the present study, normally hearing and hearing-impaired subjects were required to locate the stressed peak of an intonation contour according to the extent of frequency transition at the primary peak. The results showed that listeners with profound hearing impairments required frequency transitions that were 1.5-6 times greater than those required by normally hearing subjects. These results were consistent with the subjects' identification performance for intonation and stress patterns in natural speech, and suggest that natural variations in F0 may be too small for some impaired listeners to perceive and follow accurately.

6.
Two experiments compared the effect of supplying visual speech information (e.g., lipreading cues) on the ability to hear one female talker's voice in the presence of steady-state noise or a masking complex consisting of two other female voices. In the first experiment intelligibility of sentences was measured in the presence of the two types of maskers with and without perceived spatial separation of target and masker. The second study tested detection of sentences in the same experimental conditions. Results showed that visual cues provided more benefit for both recognition and detection of speech when the masker consisted of other voices (versus steady-state noise). Moreover, visual cues provided greater benefit when the target speech and masker were spatially coincident versus when they appeared to arise from different spatial locations. The data obtained here are consistent with the hypothesis that lipreading cues help to segregate a target voice from competing voices, in addition to the established benefit of supplementing masked phonetic information.

7.
Although both perceived vocal effort and intensity are known to influence the perceived distance of speech, little is known about the processes listeners use to integrate these two parameters into a single estimate of talker distance. In this series of experiments, listeners judged the distances of prerecorded speech samples presented over headphones in a large open field. In the first experiment, virtual synthesis techniques were used to simulate speech signals produced by a live talker at distances ranging from 0.25 to 64 m. In the second experiment, listeners judged the apparent distances of speech stimuli produced over a 60-dB range of different vocal effort levels (production levels) and presented over a 34-dB range of different intensities (presentation levels). In the third experiment, the listeners judged the distances of time-reversed speech samples. The results indicate that production level and presentation level influence distance perception differently for each of three distinct categories of speech. When the stimulus was high-level voiced speech (produced above 66 dB SPL 1 m from the talker's mouth), the distance judgments doubled with each 8-dB increase in production level and each 12-dB decrease in presentation level. When the stimulus was low-level voiced speech (produced at or below 66 dB SPL at 1 m), the distance judgments doubled with each 15-dB increase in production level but were relatively insensitive to changes in presentation level at all but the highest intensity levels tested. When the stimulus was whispered speech, the distance judgments were unaffected by changes in production level and only decreased with increasing presentation level when the intensity of the stimulus exceeded 66 dB SPL. The distance judgments obtained in these experiments were consistent across a range of different talkers, listeners, and utterances, suggesting that voice-based distance cueing could provide a robust way to control the apparent distances of speech sounds in virtual audio displays.
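The high-level voiced-speech result above amounts to a simple scaling rule: judged distance doubles per 8-dB increase in production level and per 12-dB decrease in presentation level. A back-of-the-envelope restatement, with the reference distance and level chosen purely for illustration:

```python
def judged_distance(prod_db, pres_db, d_ref=1.0, ref_db=66.0):
    """Predicted distance for high-level voiced speech under the doubling rules."""
    return d_ref * 2 ** ((prod_db - ref_db) / 8.0) * 2 ** ((ref_db - pres_db) / 12.0)

print(judged_distance(74.0, 66.0))  # +8 dB production level  -> 2.0 (doubled)
print(judged_distance(66.0, 78.0))  # +12 dB presentation level -> 0.5 (halved)
```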

8.
The focus of this study was the release from informational masking that could be obtained in a speech task by viewing a video of the target talker. A closed-set speech recognition paradigm was used to measure informational masking in 23 children (ages 6-16 years) and 10 adults. An audio-only condition required attention to a monaural target speech message that was presented to the same ear with a time-synchronized distracter message. In an audiovisual condition, a synchronized video of the target talker was also presented to assess the release from informational masking that could be achieved by speechreading. Children required higher target/distracter ratios than adults to reach comparable performance levels in the audio-only condition, reflecting a greater extent of informational masking in these listeners. There was a monotonic age effect, such that even the children in the oldest age group (12-16.9 years) demonstrated performance somewhat poorer than adults. Older children and adults improved significantly in the audiovisual condition, producing a release from informational masking of 15 dB or more in some adult listeners. Audiovisual presentation produced no informational masking release for the youngest children. Across all ages, the benefit of a synchronized video was strongly associated with speechreading ability.

9.
A key problem for telecommunication or human-machine communication systems concerns speech enhancement in noise. In this domain, a number of techniques exist, all of them based on an acoustic-only approach, i.e., processing the corrupted audio signal using audio information alone (from the corrupted signal itself or from additional audio sources). In this paper, an audio-visual approach to the problem is considered, since it has been demonstrated in several studies that viewing the speaker's face improves message intelligibility, especially in noisy environments. A speech enhancement prototype system that takes advantage of visual inputs is developed. A filtering approach is proposed that uses enhancement filters estimated with the help of lip-shape information. The estimation process is based on linear regression or simple neural networks, using a training corpus. A set of experiments assessed by Gaussian classification and perceptual tests demonstrates that it is indeed possible to enhance simple stimuli (vowel-plosive-vowel sequences) embedded in white Gaussian noise.
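A sketch of the regression idea above: learn a linear map from lip-shape features to per-band enhancement-filter gains on a training corpus, then predict gains for new frames. The feature set, band count, and gain clipping are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def train_lip_to_gain(lip_feats, target_gains):
    """lip_feats: (n_frames, n_feats); target_gains: (n_frames, n_bands)."""
    X = np.hstack([lip_feats, np.ones((len(lip_feats), 1))])  # add bias column
    W, *_ = np.linalg.lstsq(X, target_gains, rcond=None)      # linear regression
    return W

def enhance_frame(noisy_band_spectrum, lip_feat, W):
    """Apply predicted per-band gains to one frame of the noisy spectrum."""
    x = np.append(lip_feat, 1.0)
    gains = np.clip(x @ W, 0.0, 1.0)   # predicted enhancement-filter gains
    return noisy_band_spectrum * gains
```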

10.
Subjects presented with coherent auditory and visual streams generally fuse them into a single percept. This results in enhanced intelligibility in noise, or in visual modification of the auditory percept in the McGurk effect. It is classically considered that processing is done independently in the auditory and visual systems before interaction occurs at a certain representational stage, resulting in an integrated percept. However, some behavioral and neurophysiological data suggest the existence of a two-stage process. A first stage would involve binding together the appropriate pieces of audio and video information before fusion per se in a second stage. Then it should be possible to design experiments leading to unbinding. It is shown here that if a given McGurk stimulus is preceded by an incoherent audiovisual context, the amount of McGurk effect is largely reduced. Various kinds of incoherent contexts (acoustic syllables dubbed on video sentences or phonetic or temporal modifications of the acoustic content of a regular sequence of audiovisual syllables) can significantly reduce the McGurk effect even when they are short (less than 4 s). The data are interpreted in the framework of a two-stage "binding and fusion" model for audiovisual speech perception.

11.
Information about the acoustic properties of a talker's voice is available in optical displays of speech, and vice versa, as evidenced by perceivers' ability to match faces and voices based on vocal identity. The present investigation used point-light displays (PLDs) of visual speech and sinewave replicas of auditory speech in a cross-modal matching task to assess perceivers' ability to match faces and voices under conditions when only isolated kinematic information about vocal tract articulation was available. These stimuli were also used in a word recognition experiment under auditory-alone and audiovisual conditions. The results showed that isolated kinematic displays provide enough information to match the source of an utterance across sensory modalities. Furthermore, isolated kinematic displays can be integrated to yield better word recognition performance under audiovisual conditions than under auditory-alone conditions. The results are discussed in terms of their implications for describing the nature of speech information and current theories of speech perception and spoken word recognition.

12.
The benefit of supplementing speechreading with information about the frequencies of the first and second formants from the voiced sections of the speech signal was studied by presenting short sentences to 18 normal-hearing listeners under the following three conditions: (a) speechreading combined with listening to the formant-frequency information, (b) speechreading only, and (c) formant-frequency information only. The formant frequencies were presented either as pure tones or as a complex speechlike signal, obtained by filtering a periodic pulse sequence of 250 Hz by a cascade of four second-order bandpass filters (with constant bandwidth); the center frequencies of two of these filters followed the frequencies of the first and second formants, whereas the frequencies of the others remained constant. The percentage of correctly identified syllables increased from 22.8 in the case of speechreading only to 82.0 in the case of speechreading while listening to the complex speechlike signal. Listening to the formant information only scored 33.2% correct. However, comparison with the best-scoring condition of our previous study [Breeuwer and Plomp, J. Acoust. Soc. Am. 76, 686-691 (1984)] indicates that information about the sound-pressure levels in two one-octave filter bands with center frequencies of 500 and 3160 Hz is a more effective supplement to speechreading than the formant-frequency information.
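A sketch of the complex speechlike signal described above: a 250-Hz pulse sequence passed through a cascade of four second-order band-pass sections, two of them tracking F1 and F2 frame by frame. The fixed center frequencies, bandwidth, and frame length are illustrative assumptions, and filter-state carry-over between frames is omitted for brevity.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def speechlike_supplement(f1_track, f2_track, fs=16000, frame=160, bw=100.0):
    n = len(f1_track) * frame
    pulses = np.zeros(n)
    pulses[:: fs // 250] = 1.0                   # 250-Hz periodic pulse sequence
    out = np.zeros(n)
    for i, (f1, f2) in enumerate(zip(f1_track, f2_track)):
        seg = pulses[i * frame:(i + 1) * frame]
        for fc in (f1, f2, 2500.0, 3500.0):      # two tracking + two fixed bands
            sos = butter(1, [fc - bw / 2, fc + bw / 2],
                         btype="bandpass", fs=fs, output="sos")
            seg = sosfilt(sos, seg)              # one second-order section per band
        out[i * frame:(i + 1) * frame] = seg
    return out
```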

13.
At a cocktail party, listeners must attend selectively to a target speaker and segregate that talker's speech from distracting speech sounds uttered by other speakers. To solve this task, listeners can draw on a variety of vocal, spatial, and temporal cues. Recently, Vestergaard et al. [J. Acoust. Soc. Am. 125, 1114-1124 (2009)] developed a concurrent-syllable task to control temporal glimpsing within segments of concurrent speech, and this allowed them to measure the interaction of glottal pulse rate and vocal tract length and reveal how the auditory system integrates information from independent acoustic modalities to enhance recognition. The current paper shows how the interaction of these acoustic cues evolves as the temporal overlap of syllables is varied. Temporal glimpses as short as 25 ms are observed to improve syllable recognition substantially when the target and distracter have similar vocal characteristics, but not when they are dissimilar. The effect of temporal glimpsing on recognition performance is strongly affected by the form of the syllable (consonant-vowel versus vowel-consonant), but it is independent of other phonetic features such as place and manner of articulation.

14.
The present experiments examine the effects of listener age and hearing sensitivity on the ability to understand temporally altered speech in quiet when the proportion of a sentence processed by time compression is varied. Additional conditions in noise investigate whether or not listeners are affected by alterations in the presentation rate of background speech babble, relative to the presentation rate of the target speech signal. Younger and older adults with normal hearing and with mild-to-moderate sensorineural hearing losses served as listeners. Speech stimuli included sentences, syntactic sets, and random-order words. Presentation rate was altered via time compression applied to the entire stimulus or to selected phrases within the stimulus. Older listeners performed more poorly than younger listeners in most conditions involving time compression, and their performance decreased progressively with the proportion of the stimulus that was processed with time compression. Older listeners also performed more poorly than younger listeners in all noise conditions, but both age groups demonstrated better performance in conditions incorporating a mismatch in the presentation rate between target signal and background babble compared to conditions with matched rates. The age effects in quiet are consistent with the generalized slowing hypothesis of aging. Performance patterns in noise tentatively support the notion that altered rates of speech signal and background babble may provide a cue to enhance auditory figure-ground perception by both younger and older listeners.

15.
A blind method for suppressing late reverberation from speech and audio signals is presented. The proposed technique operates in both the spectral and sub-band domains, employing a single input channel. First, a preliminary rough estimate of the clean signal is required; any standard technique may be applied for this step, but here the estimate is obtained through spectral subtraction. Then, an auditory masking model is employed in sub-bands to extract the reverberation masking index (RMI), which identifies signal regions with perceived alterations due to late reverberation. Using a selective signal-processing technique, only these regions are suppressed, through sub-band temporal envelope filtering based on analytical expressions. Objective and subjective measures indicate that the proposed method achieves significant late-reverberation suppression for both speech and music signals over a wide range of reverberation time (RT) scenarios.
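A minimal sketch of the first stage only: magnitude spectral subtraction as a rough clean-signal estimate, assuming the late-reverberation spectrum can be approximated from a speech-free lead-in. The STFT length, lead-in duration, and spectral floor are illustrative assumptions; the RMI-driven sub-band stage is not shown.

```python
import numpy as np
from scipy.signal import stft, istft

def rough_clean_estimate(x, fs, noise_secs=0.25, floor=0.05, nperseg=512):
    """Preliminary clean-signal estimate via magnitude spectral subtraction."""
    _, _, X = stft(x, fs=fs, nperseg=nperseg)
    n_frames = max(1, int(noise_secs * fs / (nperseg // 2)))    # lead-in frames
    noise_mag = np.abs(X[:, :n_frames]).mean(axis=1, keepdims=True)
    mag = np.maximum(np.abs(X) - noise_mag, floor * np.abs(X))  # spectral floor
    _, y = istft(mag * np.exp(1j * np.angle(X)), fs=fs, nperseg=nperseg)
    return y
```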

16.
We investigated the effects that the visual field has on the perception of heading speed. The stimulus was a radial flow pattern simulating a translational motion through a cylindrical tunnel. Observers evaluated the perception of heading speed by using a temporal two-alternative forced choice (2AFC) staircase method. In the first experiment, we manipulated the stimulus area by cutting the visual field along the longitudinal direction. The results showed that the perceived heading speed increases with the stimulus area. In the second experiment, we manipulated both the stimulus area and the eccentricity by cutting the visual field along the longitudinal direction. The results showed that the perception of heading speed increases when the stimulus occupies a large portion of the peripheral visual field. These findings suggest that the effect of eccentricity is a consequence of an incorrect translation of two-dimensional visual information into three-dimensional scaling.

17.
In phonemic restoration, intelligibility of interrupted speech is enhanced when noise fills the speech gaps. When the broadband envelope of missing speech amplitude modulates the intervening noise, intelligibility is even better. However, this phenomenon represents a perceptual failure: The amplitude modulation, a noise feature, is misattributed to the speech. Experiments explored whether object formation influences how information in the speech gaps is perceptually allocated. Experiment 1 replicates the finding that intelligibility is enhanced when speech-modulated noise rather than unmodulated noise is presented in the gaps. In Experiment 2, interrupted speech was presented diotically, but intervening noises were presented either diotically or with an interaural time difference leading in the right ear, causing the noises to be perceived to the side of the listener. When speech-modulated noise and speech are perceived from different directions, intelligibility is no longer enhanced by the modulation. However, perceived location has no effect for unmodulated noise, which contains no speech-derived information. Results suggest that enhancing object formation reduces misallocation of acoustic features across objects, and demonstrate that our ability to understand noisy speech depends on a cascade of interacting processes, including glimpsing sensory inputs, grouping sensory inputs into objects, and resolving ambiguity through top-down knowledge.
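A sketch of the stimulus manipulation above: gaps in the speech are filled with noise amplitude-modulated by the broadband envelope of the removed speech. The envelope-smoothing cutoff and the gap specification are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def fill_gaps_with_modulated_noise(speech, fs, gap_mask, cutoff=30.0):
    """gap_mask: boolean array, True where the speech is interrupted."""
    env = np.abs(hilbert(speech))                      # broadband speech envelope
    sos = butter(2, cutoff, btype="lowpass", fs=fs, output="sos")
    env = np.maximum(sosfilt(sos, env), 0.0)           # smoothed, non-negative
    noise = np.random.default_rng(0).standard_normal(len(speech))
    return np.where(gap_mask, env * noise, speech)     # modulated-noise filler
```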

18.
Many studies suggest that the color category influences visual perception. It is also well known that oculomotor control and visual attention are closely linked. In order to clarify the temporal characteristics of color categorization, we investigated eye movements during color visual search. Eight color disks were presented briefly for 20–320 ms, and the subject was instructed to gaze at a target shown prior to the trial. We found that the color category of the target modulated eye movements significantly when the stimulus was displayed for more than 40 ms, and that the categorization could be completed within 80 ms. With the 20 ms presentation, search performance was at chance level; however, the first saccadic latency suggested that the color category had an effect on visual attention. These results suggest that color categorization affects the guidance of visual attention rapidly and implicitly.

19.
The idea that listeners are able to "glimpse" the target speech in the presence of competing noise has been supported by many studies, and is based on the assumption that listeners are able to glimpse pieces of the target speech occurring at different times and somehow patch them together to hear out the target speech. The factors influencing glimpsing in noise are not well understood and are examined in the present study. Specifically, the effects of the frequency location, spectral width, and duration of the glimpses are examined. Stimuli were constructed using an ideal time-frequency (T-F) masking technique that ensures that the target is stronger than the masker in certain T-F regions of the mixture, thereby rendering certain regions easier to glimpse than others. Sentences were synthesized using this technique with glimpse information placed in several frequency regions while varying the glimpse window duration and total duration of glimpsing. Results indicated that the frequency location and total duration of the glimpses had a significant effect on speech recognition, with the highest performance obtained when the listeners were able to glimpse information in the F1/F2 frequency region (0-3 kHz) for at least 60% of the utterance.
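A sketch of ideal time-frequency masking as it is commonly implemented: STFT cells where the target exceeds the masker by a local criterion are retained, guaranteeing that the target dominates (is glimpsable) in those regions. The 0-dB criterion and STFT size are common defaults assumed here, not necessarily the study's parameters.

```python
import numpy as np
from scipy.signal import stft, istft

def ideal_tf_masked_mixture(target, masker, fs, criterion_db=0.0, nperseg=512):
    _, _, T = stft(target, fs=fs, nperseg=nperseg)
    _, _, M = stft(masker, fs=fs, nperseg=nperseg)
    local_snr = 20 * (np.log10(np.abs(T) + 1e-12) - np.log10(np.abs(M) + 1e-12))
    mask = local_snr > criterion_db                # keep target-dominated cells
    _, y = istft((T + M) * mask, fs=fs, nperseg=nperseg)   # STFT is linear
    return y
```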

20.
Reverberation usually degrades speech intelligibility for spatially separated speech and noise sources since spatial unmasking is reduced and late reflections decrease the fidelity of the received speech signal. The latter effect could not satisfactorily be predicted by a recently presented binaural speech intelligibility model [Beutelmann et al. (2010). J. Acoust. Soc. Am. 127, 2479-2497]. This study therefore evaluated three extensions of the model to improve its predictions: (1) an extension of the speech intelligibility index based on modulation transfer functions, (2) a correction factor based on the room acoustical quantity "definition," and (3) a separation of the speech signal into useful and detrimental parts. The predictions were compared to results of two experiments in which speech reception thresholds were measured in a reverberant room in quiet and in the presence of a noise source for listeners with normal hearing. All extensions yielded better predictions than the original model when the influence of reverberation was strong, while predictions were similar for conditions with less reverberation. Although model (3) differed substantially in the assumed interaction of binaural processing and early reflections, its predictions were very similar to model (2) that achieved the best fit to the data.
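The second extension relies on the standard room-acoustical quantity "definition" (D50, ISO 3382): the fraction of impulse-response energy arriving within the first 50 ms of the direct sound. A minimal computation, assuming the impulse response has been trimmed to start at the direct sound:

```python
import numpy as np

def definition_d50(ir, fs, early_ms=50.0):
    """Fraction of impulse-response energy within the first early_ms milliseconds."""
    n_early = int(early_ms * 1e-3 * fs)
    energy = np.square(ir)
    return energy[:n_early].sum() / energy.sum()
```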
