20 similar documents found (search time: 0 ms)
1.
Mitra V Nam H Espy-Wilson C Saltzman E Goldstein L 《The Journal of the Acoustical Society of America》2012,131(3):2270-2287
Studies have shown that supplementary articulatory information can help to improve the recognition rate of automatic speech recognition systems. Unfortunately, articulatory information is not directly observable, necessitating its estimation from the speech signal. This study describes a system that recognizes articulatory gestures from speech and uses the recognized gestures in a speech recognition system. Recognizing gestures for a given utterance involves recovering the set of underlying gestural activations and their associated dynamic parameters. This paper proposes a neural network architecture for recognizing articulatory gestures from speech and presents ways to incorporate articulatory gestures into a digit recognition task. The lack of a natural speech database containing gestural information prompted us to use three stages of evaluation. First, the proposed gestural annotation architecture was tested on a synthetic speech dataset, which showed that the use of estimated tract-variable time functions improved gesture recognition performance. In the second stage, gesture-recognition models were applied to natural speech waveforms, and word recognition experiments revealed that the recognized gestures can improve the noise robustness of a word recognition system. In the final stage, a gesture-based dynamic Bayesian network was trained, and the results indicate that incorporating gestural information can improve word recognition performance compared to acoustic-only systems.
2.
Ghosh PK Goldstein LM Narayanan SS 《The Journal of the Acoustical Society of America》2011,129(6):4014-4022
Understanding how the human speech production system is related to the human auditory system has been a perennial subject of inquiry. To investigate the production-perception link, in this paper, a computational analysis has been performed using the articulatory movement data obtained during speech production with concurrently recorded acoustic speech signals from multiple subjects in three different languages: English, Cantonese, and Georgian. The form of articulatory gestures during speech production varies across languages, and this variation is considered to be reflected in the articulatory position and kinematics. The auditory processing of the acoustic speech signal is modeled by a parametric representation of the cochlear filterbank which allows for realizing various candidate filterbank structures by changing the parameter value. Using mathematical communication theory, it is found that the uncertainty about the articulatory gestures in each language is maximally reduced when the acoustic speech signal is represented using the output of a filterbank similar to the empirically established cochlear filterbank in the human auditory system. Possible interpretations of this finding are discussed.
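The criterion in this abstract — how much a representation reduces uncertainty about the gestures — is, in essence, mutual information. As an illustration only (the paper works with continuous signals and a parametric cochlear filterbank, not the toy discrete variables below), a plug-in estimator from paired observations looks like:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete
    observations, using the empirical joint distribution."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# A representation that determines the gesture perfectly removes all
# uncertainty (1 bit here); an unrelated one removes none.
gestures = [i % 2 for i in range(1000)]
good_rep = gestures                              # fully informative
poor_rep = [(i // 2) % 2 for i in range(1000)]   # independent of gestures
```

A filterbank comparison in this spirit would sweep the filterbank parameter and keep the setting that maximizes the estimated mutual information.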
3.
An automatic speech recognition approach is presented which uses articulatory features estimated by a subject-independent acoustic-to-articulatory inversion. The inversion allows estimation of articulatory features from any talker's speech acoustics using only an exemplary subject's articulatory-to-acoustic map. Results are reported on a broad-class phonetic classification experiment on speech from English talkers, using data from three distinct English talkers as exemplars for inversion. Results indicate that the inclusion of the articulatory information improves classification accuracy, but the improvement is more significant when the speaking styles of the exemplar and the talker are matched than when they are mismatched.
4.
Finding the control parameters of an articulatory model that result in given acoustics is an important problem in speech research. However, one should also be able to derive the same parameters from measured articulatory data. In this paper, a method is presented to estimate the control parameters of Maeda's model from electromagnetic articulography (EMA) data, which allows the derivation of full sagittal vocal tract slices from sparse flesh-point information. First, the articulatory grid system involved in the model's definition is adapted to the speaker involved in the experiment, and EMA data are registered to it automatically. Then, articulatory variables that correspond to measurements defined by Maeda on the grid are extracted. An initial solution for the articulatory control parameters is found by a least-squares method, under constraints ensuring vocal tract shape naturalness. Dynamic smoothness of the parameter trajectories is then imposed by a variational regularization method. Generated vocal tract slices for vowels are compared with slices appearing in magnetic resonance images of the same speaker or found in the literature. Formants synthesized on the basis of these generated slices are adequately close to those tracked in real speech recorded concurrently with EMA.
5.
J S Perkell M H Cohen M A Svirsky M L Matthies I Garabieta M T Jackson 《The Journal of the Acoustical Society of America》1992,92(6):3078-3096
This paper describes two electromagnetic midsagittal articulometer (EMMA) systems that were developed for transducing articulatory movements during speech production. Alternating magnetic fields are generated by transmitter coils that are mounted in an assembly that fits on the head of a speaker. The fields induce alternating voltages in a number of small transducer coils that are attached to articulators in the midline plane, inside and outside the vocal tract. The transducers are connected by fine lead wires to receiver electronics whose output voltages are processed to yield measures of transducer locations as a function of time. Measurement error can arise with this method, because as the articulators move and change shape, the transducers can undergo a varying amount of rotational misalignment with respect to the transmitter axes; both systems are designed to correct for transducer misalignment. For this purpose, one system uses two transmitters and biaxial transducers; the other uses three transmitters and single-axis transducers. The systems have been compared with one another in terms of their performance, human-subject compatibility, and ease of use. Both systems can produce useful midsagittal-plane data on articulator movement, and each one has a specific set of advantages and limitations. (Two commercially available systems are also described briefly for comparison purposes.) If appropriate experimental controls are used, the three-transmitter system is preferable for practical reasons.
6.
Kendrick P Cox TJ Li FF Zhang Y Chambers JA 《The Journal of the Acoustical Society of America》2008,124(1):278-287
This paper compares two methods for extracting room acoustic parameters from reverberated speech and music. An approach which uses statistical machine learning, previously developed for speech, is extended to work with music. For speech, reverberation time estimations are within a perceptual difference limen of the true value. For music, virtually all early decay time estimations are within a difference limen of the true value. The estimation accuracy is not good enough in other cases due to differences between the simulated data set used to develop the empirical model and real rooms. The second method carries out a maximum likelihood estimation on decay phases at the end of notes or speech utterances. This paper extends the method to estimate parameters relating to the balance of early and late energies in the impulse response. For reverberation time and speech, the method provides estimations which are within the perceptual difference limen of the true value. For other parameters such as clarity, the estimations are not sufficiently accurate due to the natural reverberance of the excitation signals. Speech is a better test signal than music because of the greater periods of silence in the signal, although music is needed for low frequency measurement.
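The decay-phase estimation described here builds on the classical Schroeder backward-integration approach to reverberation time. A minimal sketch of that baseline (not the paper's maximum-likelihood method; a synthetic exponential decay stands in for a measured impulse response):

```python
import math

def schroeder_rt(h, fs, top=-5.0, bottom=-25.0):
    """Estimate RT60 from an impulse response h via Schroeder backward
    integration and a line fit on the decay curve (T20 range)."""
    energy = [x * x for x in h]
    edc, acc = [], 0.0
    for e in reversed(energy):           # backward-integrated energy decay
        acc += e
        edc.append(acc)
    edc.reverse()
    edc_db = [10.0 * math.log10(e / edc[0]) for e in edc]
    # Least-squares line over the -5 dB .. -25 dB span of the curve
    pts = [(i / fs, db) for i, db in enumerate(edc_db) if bottom <= db <= top]
    n = len(pts)
    sx = sum(t for t, _ in pts); sy = sum(db for _, db in pts)
    sxx = sum(t * t for t, _ in pts); sxy = sum(t * db for t, db in pts)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # dB per second
    return -60.0 / slope                 # time to decay by 60 dB

# Synthetic exponentially decaying "impulse response" with RT60 = 0.5 s
fs, rt_true = 8000, 0.5
tau = rt_true / (6.0 * math.log(10.0))   # energy decay constant
h = [math.exp(-(i / fs) / (2.0 * tau)) for i in range(fs)]
```

On the synthetic decay, `schroeder_rt(h, fs)` recovers the 0.5 s reverberation time; real signals require first isolating a free-decay segment, which is what the blind methods in the paper address.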
7.
The paper examines physical mechanisms of frequency modulation in the acoustics of the vocal tract and methods for estimating these modulations in the speech signal. It has been found that vibrations of the tract walls have a negligibly small effect on modulations of its resonance frequencies. A model of the speech production process that accounts for the subglottal cavity shows that a change in boundary conditions at the open glottis produces noticeable variations in resonance frequencies. Along with this type of modulation, modulations determined by the shape of the excitation source also arise in the speech signal. They depend substantially on the ratio of the fundamental frequency to the resonance frequency and on the parameters of the modulation-estimation methods and speech-analysis methods. Overall, this may sometimes cause unstable and unpredictable modulations of the formant frequencies estimated from the speech signal.
8.
Fractal dimensions of speech sounds: computation and application to automatic speech recognition
The dynamics of airflow during speech production may often result in some small or large degree of turbulence. In this paper, the geometry of speech turbulence, as reflected in the fragmentation of the time signal, is quantified by using fractal models. An efficient algorithm for estimating the short-time fractal dimension of speech signals based on multiscale morphological filtering is described, and its potential for speech segmentation and phonetic classification is discussed. Also reported are experimental results on using the short-time fractal dimension of speech signals at multiple scales as additional features in an automatic speech recognition system using hidden Markov models; these features provide a modest improvement in speech recognition performance.
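A minimal sketch of the morphological-covering idea behind such estimators (the `fractal_dimension` helper and the smooth-sine check below are illustrative stand-ins, not the paper's multiscale filtering algorithm): flat dilations and erosions at increasing scales yield a cover whose area scales as r^(2-D).

```python
import math

def fractal_dimension(signal, scales):
    """Minkowski-Bouligand dimension of a 1-D signal via morphological
    covers: flat dilation/erosion (windowed max/min) at each scale r,
    then a log-log slope fit of the cover area A(r) ~ r^(2 - D)."""
    logs = []
    for r in scales:
        dil = [max(signal[max(0, i - r):i + r + 1]) for i in range(len(signal))]
        ero = [min(signal[max(0, i - r):i + r + 1]) for i in range(len(signal))]
        logs.append((math.log(r), math.log(sum(d - e for d, e in zip(dil, ero)))))
    # Least-squares slope of log A(r) vs log r
    n = len(logs)
    sx = sum(x for x, _ in logs); sy = sum(y for _, y in logs)
    sxx = sum(x * x for x, _ in logs); sxy = sum(x * y for x, y in logs)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return 2.0 - slope

# Sanity check: a smooth signal should have dimension close to 1
N = 2048
smooth = [math.sin(2 * math.pi * 3 * i / N) for i in range(N)]
dim = fractal_dimension(smooth, list(range(1, 9)))
```

Turbulent, fragmented signals push the estimate above 1, which is what makes the short-time dimension usable as a phonetic feature.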
9.
We show how it is possible to realize quantum computations on a system in which most of the parameters are practically unknown. We illustrate our results with a novel implementation of a quantum computer by means of bosonic atoms in an optical lattice. In particular, we show how a universal set of gates can be carried out even if the number of atoms per site is uncertain.
10.
V. A. Zverev 《Acoustical Physics》2008,54(2):261-268
Possibilities to eliminate the reverberation from a speech signal are investigated by applying the method based on the determination of the parameters of the reverberation frequency response from the cepstrum of the reverberation-distorted signal. The delays of reverberating signals and, for the case of a weak reverberation, their amplitudes are determined from the cepstrum of the signal with reverberation. For the cases of medium and strong reverberation, the levels of reverberating signals are refined by adjusting a certain factor. The criterion used for the adjustment of the factor is based on the shape of the speech signal amplitude distribution. By numerical modeling, it is demonstrated that the proposed method can reduce the reverberation level by 30 dB.
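A sketch of the underlying cepstral principle for the weak-reverberation case (a toy single-echo signal, not the paper's medium/strong-reverberation refinement): an echo with gain a and delay d adds a peak of height about a/2 at quefrency d in the real cepstrum, from which both parameters can be read off.

```python
import math, cmath

def dft(x):
    # Naive O(N^2) DFT; fine for a short illustrative signal
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def real_cepstrum(x):
    """Real cepstrum: inverse DFT of the log-magnitude spectrum."""
    N = len(x)
    logmag = [math.log(abs(X)) for X in dft(x)]
    return [sum(logmag[k] * math.cos(2 * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

# Direct sound: a decaying burst; one reflection with delay 32, gain 0.6
N, d, a = 256, 32, 0.6
s = [0.9 ** n for n in range(64)] + [0.0] * (N - 64)
x = [s[n] + (a * s[n - d] if n >= d else 0.0) for n in range(N)]
ceps = real_cepstrum(x)
delay = max(range(10, N // 2), key=lambda n: ceps[n])   # echo delay
gain = 2.0 * ceps[delay]                                # echo amplitude
```

This recovers the delay exactly and the gain to within a few percent; with stronger reverberation the cepstral peaks interact, which is why the paper refines the levels with an adjusted factor.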
11.
In this paper, a quantitative study of acoustic-to-articulatory inversion for vowel speech sounds by analysis-by-synthesis using the Maeda articulatory model is performed. For chain matrix calculation of vocal tract (VT) acoustics, the chain matrix derivatives with respect to area function are calculated and used in a quasi-Newton method for optimizing articulatory trajectories. The cost function includes a distance measure between natural and synthesized first three formants, and parameter regularization and continuity terms. Calibration of the Maeda model to two speakers, one male and one female, from the University of Wisconsin x-ray microbeam (XRMB) database, using a cost function, is discussed. Model adaptation includes scaling the overall VT and the pharyngeal region and modifying the outer VT outline using measured palate and pharyngeal traces. The inversion optimization is initialized by a fast search of an articulatory codebook, which was pruned using XRMB data to improve inversion results. Good agreement between estimated midsagittal VT outlines and measured XRMB tongue pellet positions was achieved for several vowels and diphthongs for the male speaker, with average pellet-VT outline distances around 0.15 cm, smooth articulatory trajectories, and less than 1% average error in the first three formants.
12.
A method for synthesizing vocal-tract spectra from phoneme sequences by mimicking the speech production process of humans is presented. The model consists of four main processes and is particularly characterized by an adaptive formation of articulatory movements. First, the model determines the time when each phoneme is articulated. Next, it generates articulatory constraints that must be met for the production of each phoneme, and then it generates trajectories of the articulatory movements that satisfy the constraints. Finally, the time sequence of spectra is estimated from the produced articulatory trajectories. The articulatory constraint of each phoneme does not change with the phonemic context, but the contextual variability of speech is reproduced because of the dynamic articulatory model. The accuracy of the synthesis model was evaluated using data collected by the simultaneous measurement of speech and articulatory movements. The accuracy of the phonemic timing estimates was measured, and the synthesized results were compared with the measured ones. Experimental results showed that the model captured the contextual variability of both the articulatory movements and speech acoustics.
13.
Zong-Wen Li 《International Journal of Infrared and Millimeter Waves》1996,17(12):2175-2183
A millimeter-wave (MMW) Doppler radar with grating structures for detecting speech signals has been developed in our laboratory. The operating principle of detecting acoustic wave signals, based on wave propagation theory and the wave equations governing the propagation, scattering, reflection, and interaction of the electromagnetic wave (EMW) and acoustic wave (AW), has been investigated. Experimental and observational results are provided to verify that a 40 GHz CW MMW dielectric integrated radar can accurately detect and identify speech signals in free space from a speaking person. The received sound signal has been reproduced by DSP and a reproducer. The research project was supported financially by the NSFC (National Natural Science Foundation of China).
14.
Codebook-based single-microphone noise suppressors, which exploit prior knowledge about speech and noise statistics, provide better performance in nonstationary noise. However, the enhancement involves a joint optimization over the speech and noise codebooks, which results in high computational complexity. A codebook-based method is proposed that uses a reference signal observed by a bone-conduction microphone and a mapping between air- and bone-conduction codebook entries generated during an offline training phase. A smaller subset of air-conducted speech codebook entries that accurately models the clean speech signal is selected using this reference signal. Experiments support the expected improvement in performance at low computational complexity.
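A toy sketch of the subset-selection step (the 2-D codebooks and the `nearest_indices` helper are invented for illustration; the actual method uses trained speech codebooks and statistical models): the bone-conducted reference picks the nearest bone-codebook entries, and the training-phase index mapping then restricts the air-codebook search to their counterparts.

```python
def nearest_indices(query, codebook, k):
    """Indices of the k codebook entries closest to query (Euclidean)."""
    dist = lambda u, v: sum((x - y) ** 2 for x, y in zip(u, v))
    return sorted(range(len(codebook)), key=lambda i: dist(codebook[i], query))[:k]

# Paired air-/bone-conduction codebooks (toy 2-D entries; the shared
# index plays the role of the mapping learned during offline training)
bone_cb = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 2.0)]
air_cb  = [(0.1, 0.0), (1.1, 0.0), (0.1, 1.0), (1.1, 1.0), (2.1, 2.0)]

ref = (0.9, 0.1)                              # bone-conducted observation
subset = nearest_indices(ref, bone_cb, k=2)   # restrict the search
candidates = [air_cb[i] for i in subset]      # reduced air codebook
```

The joint speech-noise optimization then runs over `candidates` instead of the full air codebook, which is where the complexity saving comes from.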
15.
A. I. Tsyplikhin 《Acoustical Physics》2007,53(1):105-118
An algorithm for estimating the positions and durations of vocal pulses in an actual speech signal is described. Testing shows that the algorithm outperforms the best competing algorithms in accuracy, on average, by a factor of two. The algorithm is less sensitive to spectrum distortions in telephone channels, to various types of noise, and to instability in the duration and amplitude of pulses produced by the voice source. The accuracy of the pulse position estimate is sufficient for synchronous speech signal analysis, while the speed of signal processing makes the algorithm suitable for real-time operation.
16.
Physical task stress is known to affect the fundamental frequency and other measurements of the speech signal. A corpus of physical task stress speech is analyzed using a spectrum F-ratio and frame score distribution divergences. The measurements differ between phone classes, and are greater for vowels and nasals than for plosives and fricatives. In further analysis, frame score distribution divergences are used to measure the spectral dissimilarity between neutral and physical task stress speech. Frame scores are the log likelihood ratios between Gaussian mixture models (GMMs) of physical task stress and of neutral speech. Mel-frequency cepstral coefficients are used as the acoustic feature inputs to the GMMs. A Laplacian distribution is fitted to the frame scores for each of ten phone classes, and the symmetric Kullback-Leibler divergence is employed to measure the change in distribution from neutral to physical task stress. The results suggest that the spectral dissimilarity is greatest for the second level of a four level exertion measurement, and that spectral dissimilarity is greater for nasal phones than for plosives and fricatives. Further, the results suggest that different phone classes are affected differently by physical task stress.
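The Laplacian fit and the symmetric Kullback-Leibler step both have closed forms, so the divergence measure can be sketched compactly (synthetic frame scores below stand in for the GMM log-likelihood ratios of the corpus):

```python
import math, random, statistics

def laplace_fit(samples):
    """ML fit of a Laplacian: location = median, scale = mean |x - mu|."""
    mu = statistics.median(samples)
    b = sum(abs(x - mu) for x in samples) / len(samples)
    return mu, b

def kl_laplace(p, q):
    """Closed-form KL divergence between two Laplacian densities."""
    (m1, b1), (m2, b2) = p, q
    d = abs(m1 - m2)
    return math.log(b2 / b1) + d / b2 + (b1 / b2) * math.exp(-d / b1) - 1.0

def sym_kl_laplace(p, q):
    return kl_laplace(p, q) + kl_laplace(q, p)

# Synthetic frame scores: "neutral" ~ Laplace(0, 1), "stress" ~ Laplace(1, 2)
rng = random.Random(0)
neutral = [rng.expovariate(1.0) * rng.choice([-1, 1]) for _ in range(5000)]
stress = [1.0 + 2.0 * rng.expovariate(1.0) * rng.choice([-1, 1])
          for _ in range(5000)]
div = sym_kl_laplace(laplace_fit(neutral), laplace_fit(stress))
```

In the paper this divergence is computed per phone class, so a larger `div` for nasals than for plosives is how "nasal phones are affected more" is quantified.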
17.
A. M. Sazonov 《Russian Physics Journal》1984,27(12):1013-1016
The amplitude spectrum of the harmonics of a spin-echo signal is determined. The possibilities of obtaining information about the parameters of molecular motion of nuclear spin systems from the spectrum are discussed. Translated from Izvestiya Vysshikh Uchebnykh Zavedenii, Fizika, No. 12, pp. 13–16, December, 1984.
18.
Results are reported from two experiments in which the benefit of supplementing speechreading with auditorily presented information about the speech signal was investigated. In experiment I, speechreading was supplemented with information about the prosody of the speech signal. For ten normal-hearing subjects with no experience in speechreading, the intelligibility score for sentences increased significantly when speechreading was supplemented with information about the overall amplitude of the speech signal, information about the fundamental frequency, or both. Binary information about voicing appeared not to be a significant supplement. In experiment II, the best-scoring supplements of experiment I were compared with two supplementary signals from our previous studies, i.e., information about the sound-pressure levels in two 1-oct filter bands centered at 500 and 3160 Hz, or information about the frequencies of the first and second formants from voiced speech segments. Sentence-intelligibility scores were measured for 24 normal-hearing subjects with no experience in speechreading, and for 12 normal-hearing experienced speechreaders. For the inexperienced speechreaders, the sound-pressure levels appeared to be the best supplement (87.1% correct syllables). For the experienced speechreaders, the formant-frequency information (88.6% correct), and the fundamental-frequency plus amplitude information (86.0% correct), were equally efficient supplements as the sound-pressure information (86.1% correct). Discrimination of phonemes (both consonants and vowels) was measured for the group of 24 inexperienced speechreaders. Percentage correct responses, confusion among phonemes, and the percentage of transmitted information about different types of manner and place of articulation and about the feature voicing are presented.
19.
A. V. Nikolaev 《Acoustical Physics》2002,48(4):497-501
A practical application of linear prediction methods for calculating the pulse function that models the functioning of the vocal cords is described. Some characteristics of the pulses of this function enable one to draw conclusions about the speaker's individual features (and, possibly, about the quality of sound). The first part of the paper is devoted to the theoretical background of the described method. The second part presents a detailed algorithm for the software implementation of the method in the MATLAB 5.2 environment and analyzes the results of an experiment performed on Russian vowels.
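A minimal sketch of the linear-prediction inverse-filtering idea behind such glottal-pulse analysis (an AR(2) toy signal stands in for real speech; this is not the paper's MATLAB implementation): fit LPC coefficients via Levinson-Durbin, then inverse-filter so the residual exposes the excitation pulses.

```python
def levinson(r, p):
    """Levinson-Durbin recursion: solve the Yule-Walker equations for
    LPC coefficients a, where A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p."""
    a, err = [1.0] + [0.0] * p, r[0]
    for i in range(1, p + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        a = [a[j] + k * a[i - j] if 1 <= j < i else a[j] for j in range(p + 1)]
        a[i] = k
        err *= 1.0 - k * k
    return a

def residual(x, a):
    """Inverse-filter with A(z); for voiced speech the residual
    approximates the glottal excitation pulses."""
    return [x[n] + sum(a[j] * x[n - j] for j in range(1, len(a)) if n >= j)
            for n in range(len(x))]

# Synthetic "voiced" signal: an AR(2) tract model driven by a pulse
# train with an 80-sample pitch period
N, period = 800, 80
x = [0.0] * N
for n in range(N):
    u = 1.0 if n % period == 0 else 0.0
    x[n] = u + 1.3 * (x[n - 1] if n >= 1 else 0.0) \
             - 0.64 * (x[n - 2] if n >= 2 else 0.0)

r = [sum(x[n] * x[n + k] for n in range(N - k)) for k in range(3)]
a = levinson(r, 2)                     # recovers ~[1, -1.3, 0.64]
pulses = [n for n, e in enumerate(residual(x, a)) if abs(e) > 0.5]
```

On the toy signal the residual peaks land exactly on the pitch pulses, which is the property such algorithms exploit to characterize the voice source.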
20.
A method for the analysis of vocal tract parameters is developed, aimed at quantitative analysis of rigidity from the speech signals of Parkinsonian patients. The cross-sectional area function of the vocal tract is calculated using pitch-synchronous autoregressive moving average (ARMA) analysis. The changes in the cross-sectional area in Parkinsonian subjects during the utterance of sustained sounds are attributed to both Parkinsonian tremor and rigidity. In order to isolate the effects of rigidity on the vocal tract from those of tremor, an adaptive tremor cancellation (ATC) algorithm is developed, based on the correlation of tremor signals extracted from different locations of the speech production system.