Similar Documents
20 similar documents found.
1.
This article describes a model in which the acoustic speech signal is processed to yield a discrete representation of the speech stream in terms of a sequence of segments, each of which is described by a set (or bundle) of binary distinctive features. These distinctive features specify the phonemic contrasts that are used in the language, such that a change in the value of a feature can potentially generate a new word. This model is a part of a more general model that derives a word sequence from this feature representation, the words being represented in a lexicon by sequences of feature bundles. The processing of the signal proceeds in three steps: (1) Detection of peaks, valleys, and discontinuities in particular frequency ranges of the signal leads to identification of acoustic landmarks. The type of landmark provides evidence for a subset of distinctive features called articulator-free features (e.g., [vowel], [consonant], [continuant]). (2) Acoustic parameters are derived from the signal near the landmarks to provide evidence for the actions of particular articulators, and acoustic cues are extracted by sampling selected attributes of these parameters in these regions. The selection of cues that are extracted depends on the type of landmark and on the environment in which it occurs. (3) The cues obtained in step (2) are combined, taking context into account, to provide estimates of "articulator-bound" features associated with each landmark (e.g., [lips], [high], [nasal]). These articulator-bound features, combined with the articulator-free features in (1), constitute the sequence of feature bundles that forms the output of the model. Examples of cues that are used, and justification for this selection, are given, as well as examples of the process of inferring the underlying features for a segment when there is variability in the signal due to enhancement gestures (recruited by a speaker to make a contrast more salient) or due to overlap of gestures from neighboring segments.
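The landmark-detection step lends itself to a compact illustration. Below is a minimal sketch of step (1) in Python, locating candidate landmarks as peaks and abrupt rises in band-limited log energy; the band edges, frame size, and thresholds are illustrative assumptions, not the parameters used in the article.

```python
import numpy as np
from scipy.signal import stft, find_peaks

def band_energy(x, fs, f_lo, f_hi, frame_ms=10):
    """Log energy of x in [f_lo, f_hi) Hz, one value per analysis frame."""
    nper = int(fs * frame_ms / 1000)
    f, t, Z = stft(x, fs=fs, nperseg=nper, noverlap=nper // 2)
    band = (f >= f_lo) & (f < f_hi)
    e = (np.abs(Z[band]) ** 2).sum(axis=0)
    return t, 10.0 * np.log10(e + 1e-12)

def landmark_candidates(x, fs):
    """Times of energy peaks (vowel-like landmarks) and abrupt energy
    rises (consonantal onsets) in a low-frequency band."""
    t, e = band_energy(x, fs, 300.0, 1000.0)   # roughly the F1 region
    peaks, _ = find_peaks(e, prominence=6.0)   # local maxima -> vowels
    rises = np.where(np.diff(e) > 9.0)[0] + 1  # > 9 dB jump per frame
    return t[peaks], t[rises]

# Stand-in signal: noise with vowel-like energy humps.
fs = 16000
t = np.arange(fs) / fs
x = np.random.default_rng(0).standard_normal(fs) * (np.sin(2 * np.pi * 3 * t) ** 2)
print(landmark_candidates(x, fs))
```

In the full model, the type of each detected landmark would then determine which cues are extracted around it in step (2).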

2.
Discrimination of speech-sound pairs drawn from a computer-generated continuum in which syllables varied along the place of articulation phonetic feature (/b,d,g/) was tested with macaques. The acoustic feature that was varied along the two-formant 15-step continuum was the starting frequency of the second-formant transition. Discrimination of stimulus pairs separated by two steps was tested along the entire continuum in a same-different task. Results demonstrated that peaks in the discrimination functions occur for macaques at the "phonetic boundaries" which separate the /b-d/ and /d-g/ categories for human listeners. The data support two conclusions. First, although current theoretical accounts of place perception by human adults suggest that isolated second-formant transitions are "secondary" cues, learned by association with primary cues, the animal data are more compatible with the notion that second-formant transitions are sufficient to allow the appropriate partitioning of a place continuum in the absence of associative pairing with other more complex cues. Second, we discuss two potential roles played by audition in the evolution of the acoustics of language. One is that audition provided a set of "natural psychophysical boundaries," based on rather simple acoustic properties, which guided the selection of the phonetic repertoire but did not solely determine it; the other is that audition provided a set of rules for the formation of "natural classes" of sound and that phonetic units met those criteria. The data provided in this experiment provide support for the former. Experiments that could more clearly differentiate the two hypotheses are described.

3.
Determining the onset of transient signals like seismograms, acoustic emissions, or ultrasound signals is very time consuming if the onset is picked manually. Therefore, different automatic approaches exist, especially in seismology. The concepts of the most popular approaches are summarized. A new approach adapted to ultrasound signals and acoustic emissions, based on the Akaike Information Criterion (AIC), is presented. The AIC picker is compared to an automatic onset detection algorithm based on the Hinkley criterion, likewise adapted to acoustic emissions. Manual picks performed by an analyst are used as reference values. Both automatic onset detection algorithms are applied to ultrasound signals used to monitor the setting and hardening of concrete, as well as to acoustic emissions recorded during a pull-out test. The AIC picker produces sufficiently reliable results for ultrasound signals, with deviations from the manual picks between 2% and 4%. For the acoustic emissions, only 10% of the events result in a mislocation vector greater than 5 mm. It can be shown that the AIC picker is a reliable tool for automatic onset detection for ultrasound signals and acoustic emissions of varying signal-to-noise ratio.
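The abstract does not spell out the exact AIC formulation, but a common variance-based variant (often attributed to Maeda) makes the idea concrete: the onset is the sample index that minimizes the summed log-variance of the presumed noise part before it and the presumed signal part after it. The sketch below is a hedged illustration of that variant, not the authors' implementation.

```python
import numpy as np

def aic_pick(x):
    """Return the index of the most likely signal onset in x,
    as the minimizer of the variance-based AIC function."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    aic = np.full(N, np.inf)
    for k in range(1, N - 1):
        v1 = np.var(x[:k])   # variance of the noise part before sample k
        v2 = np.var(x[k:])   # variance of the signal part after sample k
        aic[k] = k * np.log(v1 + 1e-12) + (N - k - 1) * np.log(v2 + 1e-12)
    return int(np.argmin(aic))

# Example: noise followed by a damped oscillation starting at sample 500.
rng = np.random.default_rng(0)
t = np.arange(1000)
x = 0.05 * rng.standard_normal(1500)
x[500:] += np.exp(-t / 300.0) * np.sin(0.3 * t)
print(aic_pick(x))  # close to 500
```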

4.
Two recent accounts of the acoustic cues which specify place of articulation in syllable-initial stop consonants claim that they are located in the initial portions of the CV waveform and are context-free. Stevens and Blumstein [J. Acoust. Soc. Am. 64, 1358-1368 (1978)] have described the perceptually relevant spectral properties of these cues as static, while Kewley-Port [J. Acoust. Soc. Am. 73, 322-335 (1983)] describes these cues as dynamic. Three perceptual experiments were conducted to test predictions derived from these accounts. Experiment 1 confirmed that acoustic cues for place of articulation are located in the initial 20-40 ms of natural stop-vowel syllables. Next, short synthetic CV's modeled after natural syllables were generated using either a digital, parallel-resonance synthesizer in experiment 2 or linear prediction synthesis in experiment 3. One set of synthetic stimuli preserved the static spectral properties proposed by Stevens and Blumstein. Another set of synthetic stimuli preserved the dynamic properties suggested by Kewley-Port. Listeners in both experiments identified place of articulation significantly better from stimuli which preserved dynamic acoustic properties than from those based on static onset spectra. Evidently, the dynamic structure of the initial stop-vowel articulatory gesture can be preserved in context-free acoustic cues which listeners use to identify place of articulation.

5.
Knowledge-based speech recognition systems extract acoustic cues from the signal to identify speech characteristics. For channel-deteriorated telephone speech, acoustic cues, especially those for stop consonant place, are expected to be degraded or absent. To investigate the use of knowledge-based methods in degraded environments, feature extrapolation of acoustic-phonetic features based on Gaussian mixture models is examined. This process is applied to a stop place detection module that uses burst release and vowel onset cues for consonant-vowel tokens of English. Results show that classification performance is enhanced in telephone channel-degraded speech, with extrapolated acoustic-phonetic features reaching or exceeding performance using estimated Mel-frequency cepstral coefficients (MFCCs). Results also show acoustic-phonetic features may be combined with MFCCs for best performance, suggesting these features provide information complementary to MFCCs.
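The abstract does not detail the extrapolation itself; one standard construction consistent with it is to fit a Gaussian mixture model to joint vectors of reliable and degraded acoustic-phonetic features, then estimate the degraded ones from the reliable ones by the GMM conditional mean. The sketch below follows that construction; the variable names, dimensions, and data are assumptions, not the paper's procedure.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def extrapolate(gmm, x_rel, d_rel):
    """E[degraded cues | reliable cues = x_rel] under a fitted joint GMM;
    d_rel is the number of reliable (observed) leading dimensions."""
    K = len(gmm.weights_)
    lik = np.empty(K)
    cond = np.empty((K, gmm.means_.shape[1] - d_rel))
    for k in range(K):
        mu, S = gmm.means_[k], gmm.covariances_[k]
        mu_r, mu_d = mu[:d_rel], mu[d_rel:]
        S_rr, S_dr = S[:d_rel, :d_rel], S[d_rel:, :d_rel]
        # Likelihood of the reliable cues under component k ...
        lik[k] = gmm.weights_[k] * multivariate_normal(mu_r, S_rr).pdf(x_rel)
        # ... and that component's conditional mean for the degraded cues.
        cond[k] = mu_d + S_dr @ np.linalg.solve(S_rr, x_rel - mu_r)
    return lik @ cond / lik.sum()

rng = np.random.default_rng(0)
# Stand-in joint training data: 3 reliable cues + 2 degraded cues.
Z = rng.standard_normal((500, 5))
Z[:, 3] = Z[:, 0] + 0.1 * rng.standard_normal(500)
Z[:, 4] = -Z[:, 1] + 0.1 * rng.standard_normal(500)
gmm = GaussianMixture(4, covariance_type="full", random_state=0).fit(Z)
print(extrapolate(gmm, np.array([1.0, -0.5, 0.0]), d_rel=3))  # ~ [1.0, 0.5]
```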

6.
This study investigates the effects of sentential context, lexical knowledge, and acoustic cues on the segmentation of connected speech. Listeners heard near-homophonous phrases (e.g., [plʌmpaɪ] for "plum pie" versus "plump eye") in isolation, in a sentential context, or in a lexically biasing context. The sentential context and the acoustic cues were piloted to provide strong versus mild support for one segmentation alternative (plum pie) or the other (plump eye). The lexically biasing context favored one segmentation or the other (e.g., [skʌmpaɪ] for "scum pie" versus *"scump eye," and [lʌmpaɪ] for "lump eye" versus *"lum pie," with the asterisk denoting a lexically unacceptable parse). A forced-choice task, in which listeners indicated which of two words they thought they heard (e.g., "pie" or "eye"), revealed compensatory mechanisms between the sources of information. The effect of both sentential and lexical contexts on segmentation responses was larger when the acoustic cues were mild than when they were strong. Moreover, lexical effects were accompanied by a reduction in sensitivity to the acoustic cues. Sentential context only affected the listeners' response criterion. The results highlight the graded, interactive, and flexible nature of multicue segmentation, as well as functional differences between sentential and lexical contributions to this process.

7.
A probabilistic framework for a landmark-based approach to speech recognition is presented for obtaining multiple landmark sequences in continuous speech. The landmark detection module uses as input acoustic parameters (APs) that capture the acoustic correlates of some of the manner-based phonetic features. The landmarks include stop bursts, vowel onsets, syllabic peaks and dips, fricative onsets and offsets, and sonorant consonant onsets and offsets. Binary classifiers of the manner phonetic features (syllabic, sonorant, and continuant) are used for probabilistic detection of these landmarks. The probabilistic framework exploits two properties of the acoustic cues of phonetic features: (1) sufficiency of the acoustic cues of a phonetic feature for a probabilistic decision on that feature, and (2) invariance of the acoustic cues of a phonetic feature with respect to other phonetic features. Probabilistic landmark sequences are constrained using manner-class pronunciation models for isolated word recognition with a known vocabulary. The performance of the system is compared with (1) the same probabilistic system but with mel-frequency cepstral coefficients (MFCCs), (2) a hidden Markov model (HMM) based system using APs, and (3) an HMM based system using MFCCs.
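One natural reading of the two properties is that each manner feature gets its own binary posterior from feature-specific APs (invariance), and a landmark posterior is then the product of the required feature posteriors (sufficiency). The sketch below illustrates that reading with stand-in classifiers and data; it is a hedged interpretation, not the authors' system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = ["syllabic", "sonorant", "continuant"]
clf = {}
for f in features:
    # Stand-in training data: feature-specific APs -> binary feature value.
    X = rng.standard_normal((200, 4))
    y = (X[:, 0] + 0.2 * rng.standard_normal(200)) > 0
    clf[f] = LogisticRegression().fit(X, y)

def landmark_posterior(ap, spec):
    """P(landmark | APs) as the product of required manner-feature
    posteriors; spec gives the required value of each feature, e.g.
    a fricative onset as {'sonorant': False, 'continuant': True}."""
    p = 1.0
    for f, v in spec.items():
        p_true = clf[f].predict_proba(ap[f].reshape(1, -1))[0, 1]
        p *= p_true if v else 1.0 - p_true
    return p

ap = {f: rng.standard_normal(4) for f in features}
print(landmark_posterior(ap, {"sonorant": False, "continuant": True}))
```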

8.
The contribution of the nasal murmur and vocalic formant transition to the perception of the [m]-[n] distinction by adult listeners was investigated for speakers of different ages in both consonant-vowel (CV) and vowel-consonant (VC) syllables. Three children in each of the speaker groups 3, 5, and 7 years old, and three adult females and three adult males produced CV and VC syllables consisting of either [m] or [n] and followed or preceded by [i æ u a], respectively. Two productions of each syllable were edited into seven murmur and transition segments. Across speaker groups, a segment including the last 25 ms of the murmur and the first 25 ms of the vowel yielded higher perceptual identification of place of articulation than any other segment edited from the CV syllable. In contrast, the corresponding vowel+murmur segment in the VC syllable position improved nasal identification relative to other segment types for only the adult talkers. Overall, the CV syllable was perceptually more distinctive than the VC syllable, but this distinctiveness interacted with speaker group and stimulus duration. As predicted by previous studies and the current results of perceptual testing, acoustic analyses of adult syllable productions showed systematic differences between labial and alveolar places of articulation, but these differences were only marginally observed in the youngest children's speech. Also predicted by the current perceptual results, these acoustic properties differentiating place of articulation of nasal consonants were reliably different for CV syllables compared to VC syllables. A series of comparisons of perceptual data across speaker groups, segment types, and syllable shape provided strong support, in adult speakers, for the "discontinuity hypothesis" [K. N. Stevens, in Phonetic Linguistics: Essays in Honor of Peter Ladefoged, edited by V. A. Fromkin (Academic, London, 1985), pp. 243-255], according to which spectral discontinuities at acoustic boundaries provide critical cues to the perception of place of articulation. In child speakers, the perceptual support for the "discontinuity hypothesis" was weaker and the results indicative of developmental changes in speech production.

9.
To identify the physical attributes of impacted plates, this study investigated the extraction of invariant acoustic cues for size identification and the role these cues play in auditory perception. Three subjective evaluation experiments were designed and carried out. Experiment 1 analyzed the effect of using recordings versus synthesized sounds as acoustic stimuli on size identification. Experiments 2 and 3 applied dissimilarity-rating experiments to the impact sounds of aluminum and wooden plates, respectively, to obtain the perceptual space for size identification and the corresponding mechanical space dimensions, and computed the information accuracy of different acoustic features. Finally, a comparison of the results of Experiments 2 and 3 yielded the invariant acoustic cues for size identification, and listeners' size-perception strategies were analyzed through the correlation between the acoustic cues and the perceptual results. The results show that listeners achieved good size identification from both recordings and synthesized sounds, and that listeners tended to rely on size-related invariant acoustic cues to complete the perception task while ignoring acoustic information that is easily affected by other source attributes.

10.
This paper presents a bimodal (audio-visual) study of speech loudness. The same acoustic stimuli (three sustained vowels of the articulatory qualities "effort" and "noneffort") are first presented in isolation, and then simultaneously together with an appropriate optical stimulus (the speaker's face on a video screen, synchronously producing the vowels). By the method of paired comparisons (law of comparative judgment) subjective loudness differences could be represented by different intervals between scale values. By this method previous results of effort-dependent speech loudness could be verified. In the bimodal study the optical cues have a measurable effect, but the acoustic cues are still dominant. Visual cues act most effectively if they are presented naturally, i.e., if acoustic and optical effort cues vary in the same direction. The experiments provide some evidence that speech loudness can be influenced by other than acoustic variables.

11.
The distinctive features, which are one of the important research subjects in phonetics and phonology as well as in speech technology, are the ultimate units of speech. First, a phoneme system of standard Chinese (Putonghua) was determined based on the results of cluster analysis of perceptual confusions among the speech sounds of Putonghua. Then, according to the principle of binary opposition proposed by Jakobson, Fant, and Halle, and considering the characteristics of Putonghua, the distinctive feature values for the Initials, Finals, and Tones were determined in this paper. The features have been formulated at both the acoustic level and the genetic level. In addition to the feature tables, distinctive feature trees of the Chinese Initials and Finals were drawn, so that the distinctive features of individual phonemes can be understood easily.

12.
Bouts of vocalizations given by seven red deer stags were recorded over the rutting period, and homomorphic analysis and hidden Markov models (two techniques typically used for the automatic recognition of human speech utterances) were used to investigate whether the spectral envelope of the calls was individually distinctive. Bouts of common roars (the most common call type) were highly individually distinctive, with an average recognition percentage of 93.5%. A "temporal" split-sample approach indicated that although in most individuals these identity cues held over the rutting period, the ability of the models trained with the bouts of roars recorded early in the rut to correctly classify later vocalizations decreased as the recording date increased. When Markov models trained using the bouts of common roars were used to classify other call types according to their individual membership, the classification results indicated that the cues to identity contained in the common roars were also present in the other call types. This is the first demonstration in mammals other than primates that individuals have vocal cues to identity that are common to the different call types that compose their vocal repertoire.
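As a concrete illustration of the homomorphic step: the spectral envelope of a call frame can be recovered by low-quefrency liftering of the real cepstrum, after which per-stag hidden Markov models can be trained on the envelope features. The frame windowing and lifter cutoff below are illustrative assumptions, not the study's settings.

```python
import numpy as np

def spectral_envelope(frame, n_ceps=30):
    """Smooth log-magnitude spectral envelope of one frame via
    homomorphic (cepstral) liftering."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.abs(spec) + 1e-12)
    ceps = np.fft.irfft(log_mag)        # real cepstrum of the frame
    ceps[n_ceps:-n_ceps] = 0.0          # keep low quefrencies only
    return np.fft.rfft(ceps).real       # liftered -> smooth envelope

# Example on a stand-in 32-ms frame of noise at 16 kHz.
frame = np.random.default_rng(0).standard_normal(512)
print(spectral_envelope(frame).shape)   # one envelope value per FFT bin
```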

13.
In the 1970s and 1980s, a number of papers explored the role of the transitional and burst features in consonant-vowel context. These papers left unresolved the relative importance of these two acoustic cues. This research takes advantage of refined signal-processing methods that allow the visualization and modification of acoustic details. The experiment explores the impact of modifying the strength of the acoustic burst feature on the recognition scores Pc(SNR) (probability correct as a function of the signal-to-noise ratio) for four plosive sounds /ta, ka, da, ga/. The results show high correlations between the relative burst intensity and the scores Pc(SNR). Based on this correlation, one must conclude that these bursts are the primary acoustic cues used for the identification of these four consonants. This is in contrast to previous experiments, which used less precise methods to manipulate speech and observed complex relationships between the scores, bursts, and transition cues. In cases where the burst feature is removed entirely, it is shown that naturally occurring conflicting acoustic features dominate the score. These observations seem directly inconsistent with transition cues playing a role: if the transition cues were important, they would dominate over low-level conflicting burst cues. These limited results arguably rule out the concept of redundant cues.

14.
Thorax disease classification is a challenging task due to complex pathologies and subtle texture changes. It has been extensively studied for years largely because of its wide application in computer-aided diagnosis. Most existing methods directly learn global feature representations from whole Chest X-ray (CXR) images, without considering in depth the richer visual cues lying around informative local regions. Thus, these methods often produce sub-optimal thorax disease classification performance because they ignore the very informative pathological changes around organs. In this paper, we propose a novel Part-Aware Mask-Guided Attention Network (PMGAN) that learns complementary global and local feature representations from the all-organ region and multiple single-organ regions simultaneously for thorax disease classification. Specifically, multiple innovative soft attention modules are designed to progressively guide feature learning toward the globally informative regions of the whole CXR image. A mask-guided attention module is designed to further search for informative regions and visual cues within the all-organ or single-organ images, where attention is elegantly regularized by automatically generated organ masks, without introducing extra computation during the inference stage. In addition, a multi-task learning strategy is designed that effectively maximizes the learning of complementary local and global representations. The proposed PMGAN has been evaluated on the ChestX-ray14 dataset, and the experimental results demonstrate its superior thorax disease classification performance against state-of-the-art methods.
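A minimal sketch of the mask-guided attention idea follows, in PyTorch: a soft attention map predicted from CNN features is regularized toward an automatically generated organ mask during training only, so inference adds no mask computation. The shapes, backbone, and loss form are illustrative assumptions, not the PMGAN specifics.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskGuidedAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.att = nn.Conv2d(channels, 1, kernel_size=1)  # 1x1 attention head

    def forward(self, feat, organ_mask=None):
        a = torch.sigmoid(self.att(feat))   # soft attention map in [0, 1]
        out = feat * a + feat               # residual attention re-weighting
        reg = feat.new_zeros(())
        if organ_mask is not None and self.training:
            # Regularize attention toward the organ mask; applied during
            # training only, so inference carries no extra cost.
            m = F.interpolate(organ_mask, size=a.shape[-2:], mode="nearest")
            reg = F.binary_cross_entropy(a, m)
        return out, reg

# Stand-in feature map and binary organ mask.
feat = torch.randn(2, 256, 14, 14)
mask = (torch.rand(2, 1, 224, 224) > 0.5).float()
mod = MaskGuidedAttention(256).train()
out, reg = mod(feat, mask)
print(out.shape, float(reg))
```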

15.
The purpose of this study is to specify the contribution of certain frequency regions to consonant place perception for normal-hearing listeners and listeners with high-frequency hearing loss, and to characterize the differences in stop-consonant place perception among these listeners. Stop-consonant recognition and error patterns were examined at various speech-presentation levels and under conditions of low- and high-pass filtering. Subjects included 18 normal-hearing listeners and a homogeneous group of 10 young, hearing-impaired individuals with high-frequency sensorineural hearing loss. Differential filtering effects on consonant place perception were consistent with the spectral composition of acoustic cues. Differences in consonant recognition and error patterns between normal-hearing and hearing-impaired listeners were observed when the stimulus bandwidth included regions of threshold elevation for the hearing-impaired listeners. Thus place-perception differences among listeners are, for the most part, associated with stimulus bandwidths corresponding to regions of hearing loss.

16.
17.
This paper describes acoustic cues for the classification of consonant voicing in a distinctive-feature-based speech recognition system. Initial acoustic cues are selected by studying consonant production mechanisms. Spectral representations, band-limited energies, and correlation values, along with Mel-frequency cepstral coefficient (MFCC) features, are also examined. Analysis of variance is performed to assess the relative significance of the features. Overall, classification rates of 82.2%, 80.6%, and 78.4% are obtained on the TIMIT database for stops, fricatives, and affricates, respectively. Combining the acoustic parameters with MFCCs improves performance in all cases. Performance on NTIMIT telephone-channel speech also shows that the acoustic parameters are more robust than MFCCs.
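The analysis-of-variance step can be illustrated by ranking candidate cues by their one-way ANOVA F-statistic across the voicing classes, as in the sketch below; the cue names and data are stand-ins, not the paper's measurements.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Stand-in cue values for voiced vs. voiceless tokens (80 each).
cues = {
    "band_energy_low": (rng.normal(0.0, 1.0, 80), rng.normal(0.9, 1.0, 80)),
    "correlation":     (rng.normal(0.0, 1.0, 80), rng.normal(0.1, 1.0, 80)),
}
# Higher F (lower p) marks a cue that separates the classes better.
for name, (voiced, voiceless) in cues.items():
    F, p = f_oneway(voiced, voiceless)
    print(f"{name:16s} F={F:6.2f} p={p:.3g}")
```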

18.
张珑  李海峰  马琳  王建华 《声学学报》2014,39(5):639-646
This paper proposes a method for the automatic detection and evaluation of retroflex (erhua) finals in Putonghua proficiency testing. Within the framework of existing computer-assisted pronunciation evaluation systems, the pronunciation patterns and acoustic characteristics of erhua are analyzed in depth, and its detection and evaluation are cast as a standard classification problem. After selecting a number of representative acoustic features and trying several different classification algorithms, the results show that a boosted classification and regression tree (Boosting CART) ensemble model can make full use of the various acoustic features of erhua, reaching a classification accuracy of 92.41%. Further analysis of the acoustic feature groups shows that formants, pronunciation confidence scores, and duration are the most important cues for characterizing erhua; with these cues, automatic detection and evaluation of erhua can be realized effectively.
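A boosted-CART classifier of the kind reported can be sketched as AdaBoost over shallow decision trees applied to per-token acoustic features such as formants, pronunciation confidence, and duration. The feature columns and the stand-in label rule below are illustrative assumptions, not the paper's data or model settings.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Columns: F1, F2, F3 (Hz), pronunciation confidence, duration (ms).
X = rng.normal([500, 1500, 2500, 0.7, 150],
               [80, 200, 300, 0.15, 40], (400, 5))
# Stand-in label: retroflexion lowers F3, so threshold F3 as a toy rule.
y = (X[:, 2] < 2400).astype(int)

# AdaBoost over shallow CARTs, i.e., a boosted classification tree model.
model = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                           n_estimators=100).fit(X, y)
print(model.score(X, y))
```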

19.
Research on children's speech perception and production suggests that consonant voicing and place contrasts may be acquired early in life, at least in word-onset position. However, little is known about the development of the acoustic correlates of later-acquired, word-final coda contrasts. This is of particular interest in languages like English where many grammatical morphemes are realized as codas. This study therefore examined how various non-spectral acoustic cues vary as a function of stop coda voicing (voiced vs. voiceless) and place (alveolar vs. velar) in the spontaneous speech of 6 American-English-speaking mother-child dyads. The results indicate that children as young as 1;6 exhibited many adult-like acoustic cues to voicing and place contrasts, including longer vowels and more frequent use of voice bar with voiced codas, and a greater number of bursts and longer post-release noise for velar codas. However, 1;6-year-olds overall exhibited longer durations and more frequent occurrence of these cues compared to mothers, with decreasing values by 2;6. Thus, English-speaking 1;6-year-olds already exhibit adult-like use of some of the cues to coda voicing and place, though implementation is not yet fully adult-like. Physiological and contextual correlates of these findings are discussed.

20.
This study focuses on the extraction of robust acoustic cues of labial and alveolar voiceless obstruents in German, and on their acoustic differences in the speech signal, in order to distinguish them in place and manner of articulation. The investigated obstruents comprise the affricates [pf] and [ts], the fricatives [f] and [s], and the stops [p] and [t]. The target sounds were analyzed in word-initial and word-medial positions. The speech data for the analysis were recorded in a natural environment, deliberately containing background noise, so that only robust cues would be extracted. Three methods of acoustic analysis were chosen: (1) temporal measurements to distinguish the respective obstruents in manner of articulation, (2) static spectral characteristics in terms of a logarithmic distance measure to distinguish place of articulation, and (3) amplitude analysis of discrete frequency bands as a dynamic approach to place distinction. The results reveal that the duration of the target phonemes distinguishes them in manner of articulation, while the logarithmic distance measure, as well as the relative amplitude analysis of discrete frequency bands, identifies place of articulation. The present results contribute to the question of which properties are robust with respect to variation in the speech signal.
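Method (2), the logarithmic distance measure, can be illustrated as an RMS distance between a token's log-magnitude spectrum and per-place spectral templates; the templates and frequency range below are stand-ins, not the study's measurements.

```python
import numpy as np

def log_spectral_distance(spec, template, eps=1e-12):
    """RMS distance between two log-magnitude spectra (same frequency bins)."""
    d = 20 * np.log10(spec + eps) - 20 * np.log10(template + eps)
    return np.sqrt(np.mean(d ** 2))

def classify_place(spec, templates):
    """Assign the place label whose template is nearest in log distance."""
    return min(templates, key=lambda p: log_spectral_distance(spec, templates[p]))

# Stand-in spectra: labial energy concentrated low, alveolar high.
f = np.linspace(0, 8000, 257)
templates = {"labial": np.exp(-f / 2000), "alveolar": np.exp(-(8000 - f) / 2000)}
token = templates["alveolar"] * (1 + 0.1 * np.random.default_rng(0).random(257))
print(classify_place(token, templates))  # -> "alveolar"
```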
