Similar Literature
 Found 20 similar documents; search time: 0 ms
1.
The many-to-one mapping from representations in the speech articulatory space to acoustic space renders the associated acoustic-to-articulatory inverse mapping non-unique. Among various techniques, imposing smoothness constraints on the articulator trajectories is one of the common approaches to handle the non-uniqueness in the acoustic-to-articulatory inversion problem. This is because articulators typically move smoothly during speech production. A standard smoothness constraint is to minimize the energy of the difference of the articulatory position sequence so that the articulator trajectory is smooth and low-pass in nature. Such a fixed definition of smoothness is not always realistic or adequate for all articulators because different articulators have different degrees of smoothness. In this paper, an optimization formulation is proposed for the inversion problem, which includes a generalized smoothness criterion. Under such generalized smoothness settings, the smoothness parameter can be chosen depending on the specific articulator in a data-driven fashion. In addition, this formulation allows estimation of articulatory positions recursively over time without any loss in performance. Experiments with the MOCHA TIMIT database show that the estimated articulator trajectories obtained using such a generalized smoothness criterion have lower RMS error and higher correlation with the actual measured trajectories compared to those obtained using a fixed smoothness constraint.
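The standard smoothness constraint mentioned in this abstract (minimizing the energy of the first difference of the position sequence) can be illustrated with a minimal sketch. It is not the paper's generalized formulation: the quadratic data term, the weight `lam`, and the function names are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def smooth_trajectory(y, lam):
    """Return x minimizing ||x - y||^2 + lam * ||D x||^2.

    y   : 1-D array of raw per-frame position estimates (hypothetical input)
    lam : non-negative smoothness weight (assumed; larger means smoother)
    D is the first-difference operator, so ||D x||^2 is the energy of the
    difference of the position sequence.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    # D has shape (n-1, n); row i computes x[i+1] - x[i].
    D = np.diff(np.eye(n), axis=0)
    # Setting the gradient to zero gives (I + lam * D^T D) x = y.
    A = np.eye(n) + lam * D.T @ D
    return np.linalg.solve(A, y)

def difference_energy(x):
    """Energy of the first difference of a sequence."""
    return float(np.sum(np.diff(x) ** 2))
```

With `lam = 0` the input is returned unchanged; increasing `lam` lowers the difference energy of the result, which is the low-pass behaviour the abstract describes. A per-articulator choice of `lam` would be one simple way to mimic articulator-specific smoothness, though the paper's criterion may differ.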

2.
3.
In this paper, a quantitative study of acoustic-to-articulatory inversion for vowel speech sounds by analysis-by-synthesis using the Maeda articulatory model is performed. For chain matrix calculation of vocal tract (VT) acoustics, the chain matrix derivatives with respect to area function are calculated and used in a quasi-Newton method for optimizing articulatory trajectories. The cost function includes a distance measure between natural and synthesized first three formants, and parameter regularization and continuity terms. Calibration of the Maeda model to two speakers, one male and one female, from the University of Wisconsin x-ray microbeam (XRMB) database, using a cost function, is discussed. Model adaptation includes scaling the overall VT and the pharyngeal region and modifying the outer VT outline using measured palate and pharyngeal traces. The inversion optimization is initialized by a fast search of an articulatory codebook, which was pruned using XRMB data to improve inversion results. Good agreement between estimated midsagittal VT outlines and measured XRMB tongue pellet positions was achieved for several vowels and diphthongs for the male speaker, with average pellet-VT outline distances around 0.15 cm, smooth articulatory trajectories, and less than 1% average error in the first three formants.

4.
5.
Native and nonnative listeners categorized final /v/ versus /f/ in English nonwords. Fricatives followed phonetically long (originally /v/-preceding) or short (originally /f/-preceding) vowels. Vowel duration was constant for each participant and sometimes mismatched other voicing cues. Previous results showed that English but not Dutch listeners (whose L1 has no final voicing contrast) nevertheless used the misleading vowel duration for /v/-/f/ categorization. New analyses showed that Dutch listeners did use vowel duration initially, but quickly reduced its use, whereas the English listeners used it consistently throughout the experiment. Thus, nonnative listeners adapted to the stimuli more flexibly than native listeners did.

6.
We propose an approach for the exact dynamic inversion of singularly perturbed second-order linear systems through asymptotic expansion in a singular parameter. We show that the inversion solution, corresponding to the invariant slow manifold, can be expressed as a converging infinite series under desired output constraints composed of exponential support functions in the complex domain. We provide systematic mathematical procedures to obtain the closed-form invariant slow manifold, along with required admissible boundary conditions. Numerical examples are given to validate the proposed approach.

7.
Previous research shows that listeners are sensitive to talker differences in phonetic properties of speech, including voice-onset-time (VOT) in word-initial voiceless stop consonants, and that learning how a talker produces one voiceless stop transfers to another word with the same voiceless stop [Allen, J. S., and Miller, J. L. (2004). J. Acoust. Soc. Am. 115, 3171-3183]. The present experiments examined whether transfer extends to words that begin with different voiceless stops. During training, listeners heard two talkers produce a given voiceless-initial word (e.g., pain). VOTs were manipulated such that one talker produced the voiceless stop with relatively short VOTs and the other with relatively long VOTs. At test, listeners heard a short- and long-VOT variant of the same word (e.g., pain) or a word beginning with a different voiceless stop (e.g., cane or coal), and were asked to select which of the two VOT variants was most representative of a given talker. In all conditions, which variant was selected at test was in line with listeners' exposure during training, and the effect was equally strong for the novel word and the training word. These findings suggest that accommodating talker-specific phonetic detail does not require exposure to each individual phonetic segment.

8.
Accented speech recognition is more challenging than standard speech recognition due to the effects of phonetic and acoustic confusions. Phonetic confusion in accented speech occurs when an expected phone is pronounced as a different one, which leads to erroneous recognition. Acoustic confusion occurs when the pronounced phone is found to lie acoustically between two baseform models and can be equally recognized as either one. We propose that it is necessary to analyze and model these confusions separately in order to improve accented speech recognition without degrading standard speech recognition. Since low phonetic confusion units in accented speech do not give rise to automatic speech recognition errors, we focus on analyzing and reducing phonetic and acoustic confusability under high phonetic confusion conditions. We propose using a likelihood ratio test to measure phonetic confusion, and an asymmetric acoustic distance to measure acoustic confusion. Only accent-specific phonetic units with low acoustic confusion are used in an augmented pronunciation dictionary, while phonetic units with high acoustic confusion are reconstructed using decision tree merging. Experimental results show that our approach is effective and superior to methods modeling phonetic confusion or acoustic confusion alone in accented speech, with a significant 5.7% absolute WER reduction, without degrading standard speech recognition.

9.
The results of several experiments demonstrate that silence is an important cue for the perception of stop-consonant and affricate manner. In some circumstances, silence is necessary; in others, it is sufficient. But silence is not the only cue to these manners. There are other cues that are more or less equivalent in their perceptual effects, though they are quite different acoustically. Finally, silence is effective as a cue when it separates utterances produced by male and female speakers. These findings are taken to imply that, in these instances, perception is constrained as if by some abstract conception of what vocal tracts do when they make linguistically significant gestures.

10.
The perception of subphonemic differences between vowels was investigated using multidimensional scaling techniques. Three experiments were conducted with natural-sounding synthetic stimuli generated by linear predictive coding (LPC) formant synthesizers. In the first experiment, vowel sets near the pairs (i-I), (epsilon-ae), or (u-U) were synthesized containing 11 vowels each. Listeners judged the dissimilarities between all pairs of vowels within a set several times. These perceptual differences were mapped into distances between the vowels in an n-dimensional space using two-way multidimensional scaling. Results for each vowel set showed that the physical stimulus space, which was specified by the two parameters F1 and F2, was always mapped into a two-dimensional perceptual space. The best metric for modeling the perceptual distances was the Euclidean distance between F1 and F2 in barks. The second experiment investigated the perception of the same vowels from the first experiment, but embedded in a consonantal context. Following the same procedures as experiment 1, listeners' perception of the (bv) dissimilarities was not different from their perception of the isolated vowel dissimilarities. The third experiment investigated dissimilarity judgments for the three vowels (ae-alpha-lambda) located symmetrically in the F1 X F2 vowel space. While the perceptual space was again two dimensional, the influence of phonetic identity on vowel difference judgments was observed. Implications for determining metrics for subphonemic vowel differences using multidimensional scaling are discussed.
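The distance metric named in this abstract, a Euclidean distance over F1 and F2 expressed in barks, can be sketched as follows. The abstract does not say which Hz-to-Bark conversion was used, so the Traunmüller (1990) approximation here is an assumption, and the function names are illustrative only.

```python
import math

def hz_to_bark(f_hz):
    """Convert frequency in Hz to Bark (Traunmüller 1990 approximation).

    Assumption: the abstract does not specify the conversion formula.
    """
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_distance(vowel_a, vowel_b):
    """Euclidean distance between two vowels given as (F1, F2) pairs in Hz,
    measured after converting each formant to Bark."""
    d_f1 = hz_to_bark(vowel_a[0]) - hz_to_bark(vowel_b[0])
    d_f2 = hz_to_bark(vowel_a[1]) - hz_to_bark(vowel_b[1])
    return math.hypot(d_f1, d_f2)
```

For example, `bark_distance((300.0, 2300.0), (400.0, 2000.0))` compares two hypothetical vowel tokens; the formant values are made up for illustration and are not taken from the study.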

11.
12.
13.
We derive the torsion constraints for superspace versions of supergravity theories by means of the theory of G-structures. We also discuss superconformal geometry and superKähler geometry. Permanent address as of September 1, 1990: Department of Mathematics, University of Michigan, Ann Arbor, MI 48109, USA

14.
Discrimination of speech-sound pairs drawn from a computer-generated continuum in which syllables varied along the place of articulation phonetic feature (/b,d,g/) was tested with macaques. The acoustic feature that was varied along the two-formant 15-step continuum was the starting frequency of the second-formant transition. Discrimination of stimulus pairs separated by two steps was tested along the entire continuum in a same-different task. Results demonstrated that peaks in the discrimination functions occur for macaques at the "phonetic boundaries" which separate the /b-d/ and /d-g/ categories for human listeners. The data support two conclusions. First, although current theoretical accounts of place perception by human adults suggest that isolated second-formant transitions are "secondary" cues, learned by association with primary cues, the animal data are more compatible with the notion that second-formant transitions are sufficient to allow the appropriate partitioning of a place continuum in the absence of associative pairing with other more complex cues. Second, we discuss two potential roles played by audition in the evolution of the acoustics of language. One is that audition provided a set of "natural psychophysical boundaries," based on rather simple acoustic properties, which guided the selection of the phonetic repertoire but did not solely determine it; the other is that audition provided a set of rules for the formation of "natural classes" of sound and that phonetic units met those criteria. The data provided in this experiment provide support for the former. Experiments that could more clearly differentiate the two hypotheses are described.

15.
The utility of phonetic features versus acoustic properties for describing perceptual relations among speech sounds was evaluated with a multidimensional scaling analysis of Miller and Nicely's [J. Acoust. Soc. Am. 27, 338-352 (1955)] consonant confusions data. The INDSCAL method and program were employed with the original data log transformed to enhance consistency with the linear INDSCAL model. A four-dimensional solution accounted for 69% of the variance and was best characterized in terms of acoustic properties of the speech signal, viz., temporal relationship of periodicity and burst onset, shape of voiced first formant transition, shape of voiced second formant transition, and amount of initial spectral dispersion, rather than in terms of phonetic features. The amplitude and spectral location of acoustic energy specifying each perceptual dimension were found to determine a dimension's perceptual effect as the signal was degraded by masking noise and bandpass filtering. Consequently, the perceptual bases of identification confusions between pairs of syllables were characterized in terms of the shared acoustic properties which remained salient in the degraded speech. Implications of these findings for feature-based accounts of perceptual relationships between phonemes are considered.

16.
Incorporation of loudness measures in active noise control   Total citations: 3, self-citations: 0, citations by others: 3
An attempt has been made to use a modified version of a standard active noise control algorithm in order to take into account the unique response of the human auditory system. It has been shown in the past that decreasing the sound pressure level at a location does not guarantee a similar decrease in the perceived loudness at that location. Typically, active noise control is based on minimizing the "error signal" from a mechanical device such as a microphone, whose response is nominally flat across the frequency response range of the human ear. However, if the response of the ear can be approximated by digitally filtering the error signal before it reaches the adaptive controller, one can, in effect, minimize the more subjective loudness level, as opposed to the sound pressure level. The work reported here entails simulating active noise control based upon minimizing perceived loudness for a collection of input noise signals. A comparison of the loudness of the resulting error signal is made to the loudness of that resulting from standard sound pressure level minimization. It has been found that the effectiveness of this technique is largely dependent upon the nature of the input noise signal. Furthermore, this technique is judged to be worth considering for use with applications of active noise control where the uncontrolled noise more prominently constitutes low range audio frequencies (approximately 30 Hz-100 Hz) than medium range audio frequencies (approximately 300 Hz-600 Hz).

17.
18.
Incorporation of peptides in phospholipid aggregates using ultrasound   Total citations: 1, self-citations: 0, citations by others: 1
This study presents the highlights of ultrasonic effects on peptides incorporated in phospholipid aggregates (liposomes). These liposomes or vesicles are known as transport agents in skin drug delivery and for hair treatment. They might be a good model to deliver larger peptides into hair to restore fibre strength after hair coloration, modelling, permanent wave and/or straightening. The 1,2-dipalmitoyl-sn-glycero-3-phosphocholine (DPPC) liposomes with peptides (LLLLK LLLLK LLLLK LLLLK; LLLLL LCLCL LLKAK AK) were prepared by the thin film hydration method. The LUVs (uni-lamellar vesicles) were obtained by sonication, applying different experimental conditions, such as depth (mm) and power intensity (%). Photon-correlation spectroscopy (PCS) and electron microscopy (EM) results confirmed that the incorporation of these peptides, which have different amino acid sequences, produced differences in the diameter, the zeta-potential of the membrane surface, and the shape of the liposomes. The liposomes that included peptide LLLLK LLLLK LLLLK LLLLK presented an increase in zeta-potential values after using ultrasound and an "amorphous" aspect. Conversely, the liposomes that incorporated the peptide LLLLL LCLCL LLKAK AK presented a defined shape (rod shape), and the surface potential of the liposomes did not change significantly with the use of ultrasound.

19.
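蔡德和 (Cai Dehe), 《应用声学》 (Applied Acoustics), 1996, 15(1): 39-45
This paper studies the recognition of speaker-independent Mandarin compound vowels, using the phoneme as the recognition unit and applying phonetic knowledge. Its distinguishing feature is that phoneme recognition is performed by a neural network (NN); to make it easier to exploit phonetic knowledge, the stimulus presented to the NN input layer is the power spectrum of the speech. An NN trained on single vowels recognizes speaker-independent Mandarin compound vowels with a recognition rate of 54%. After phonetic knowledge is applied, the recognition rate rises to 90%.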

20.
We have observed an epilayer-thickness-dependent polarity inversion for the growth of CdTe on Sb(Bi)/CdTe(111)B. For films with Sb(Bi) thicknesses of less than 40 Å (15 Å), the CdTe layer shows a B (Te-terminated) face, but it switches to an A (Cd-terminated) face for thicker layers. On the other hand, a CdTe layer grown on Bi(Sb)/CdTe(111)A always shows the A face regardless of Sb or Bi layer thicknesses. In order to address the observations we have performed ab initio calculations, which suggest that the polarity of a polar material on a nonpolar one results from the binding energy difference between the two possible surface configurations.
