Similar Literature
20 similar documents found.
1.
A hidden Markov model (HMM) system is presented for automatically classifying African elephant vocalizations. The development of the system is motivated by successful models from human speech analysis and recognition. Classification features include frequency-shifted Mel-frequency cepstral coefficients (MFCCs) and log energy, spectrally motivated features which are commonly used in human speech processing. Experiments, including vocalization type classification and speaker identification, are performed on vocalizations collected from captive elephants in a naturalistic environment. The system classified vocalizations with accuracies of 94.3% and 82.5% in the type classification and speaker identification experiments, respectively. Classification accuracy, statistical significance tests on the model parameters, and qualitative analysis support the effectiveness and robustness of this approach for vocalization analysis in nonhuman species.
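To make the pipeline concrete, here is a minimal sketch of the approach the abstract describes: frame-level MFCC-plus-log-energy features scored against one Gaussian HMM per vocalization class. The sample rate, model sizes, and the use of librosa/hmmlearn are assumptions for illustration, not the authors' implementation.

```python
# Sketch (assumed pipeline): MFCC + log-energy features, one HMM per class.
import numpy as np
import librosa
from hmmlearn import hmm

def features(path, sr=8000, n_mfcc=13):
    # MFCCs plus log energy per frame; both use librosa's default 2048/512
    # framing, so the frame counts line up.
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)
    log_e = np.log(librosa.feature.rms(y=y) + 1e-10)         # (1, T)
    return np.vstack([mfcc, log_e]).T                        # (T, n_mfcc + 1)

def train_class_models(wavs_by_class, n_states=3):
    # Fit one Gaussian HMM per vocalization type from its training files.
    models = {}
    for label, paths in wavs_by_class.items():
        X = [features(p) for p in paths]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        m.fit(np.vstack(X), [len(x) for x in X])
        models[label] = m
    return models

def classify(models, path):
    # Pick the class whose HMM gives the observation sequence the highest log-likelihood.
    x = features(path)
    return max(models, key=lambda label: models[label].score(x))
```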

2.
Knowledge-based speech recognition systems extract acoustic cues from the signal to identify speech characteristics. For channel-deteriorated telephone speech, acoustic cues, especially those for stop consonant place, are expected to be degraded or absent. To investigate the use of knowledge-based methods in degraded environments, feature extrapolation of acoustic-phonetic features based on Gaussian mixture models is examined. This process is applied to a stop place detection module that uses burst release and vowel onset cues for consonant-vowel tokens of English. Results show that classification performance is enhanced in telephone channel-degraded speech, with extrapolated acoustic-phonetic features reaching or exceeding performance using estimated Mel-frequency cepstral coefficients (MFCCs). Results also show acoustic-phonetic features may be combined with MFCCs for best performance, suggesting these features provide information complementary to MFCCs.

3.
Mammalian vocal production mechanisms are still poorly understood despite their significance for theories of human speech evolution. In particular, it is still unclear to what degree mammals are capable of actively controlling vocal-tract filtering, a defining feature of human speech production. To address this issue, a detailed acoustic analysis of the alarm vocalizations of free-ranging Diana monkeys was conducted. These vocalizations are especially interesting because they convey semantic information about two of the monkeys' natural predators, the leopard and the crowned eagle. Here, vocal tract and sound source parameters in Diana monkey alarm vocalizations are described. It is found that a vocalization-initial downward formant transition distinguishes most reliably between eagle and leopard alarm vocalizations. This finding is discussed as an indication of articulation and, alternatively, as the result of a strong nasalization effect. It is suggested that the formant modulation is the result of active vocal filtering used by the monkeys to encode semantic information, an ability previously thought to be restricted to human speech.

4.
This work proposes a method to reconstruct an acoustic speech signal solely from a stream of mel-frequency cepstral coefficients (MFCCs), as may be encountered in a distributed speech recognition (DSR) system. Previous methods for speech reconstruction have required, in addition to the MFCC vectors, fundamental frequency and voicing components. In this work the voicing classification and fundamental frequency are predicted from the MFCC vectors themselves using two maximum a posteriori (MAP) methods. The first method enables fundamental frequency prediction by modeling the joint density of MFCCs and fundamental frequency using a single Gaussian mixture model (GMM). The second scheme uses a set of hidden Markov models (HMMs) to link together a set of state-dependent GMMs, which enables a more localized modeling of the joint density of MFCCs and fundamental frequency. Experimental results on speaker-independent male and female speech show that accurate voicing classification and fundamental frequency prediction are attained when compared to hand-corrected reference fundamental frequency measurements. The use of the predicted fundamental frequency and voicing for speech reconstruction is shown to give very similar speech quality to that obtained using the reference fundamental frequency and voicing.
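A minimal sketch of the first method's core idea: fit a single GMM to joint MFCC-F0 vectors, then predict F0 for new MFCC frames. For simplicity the prediction below is the conditional mixture mean (an MMSE estimate) rather than the paper's exact MAP formulation; all parameter choices are assumptions.

```python
# Sketch: predict fundamental frequency from MFCCs via a joint GMM.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(mfcc, f0, n_components=16):
    # mfcc: (T, D) frames of voiced speech; f0: (T,) reference pitch in Hz.
    joint = np.hstack([mfcc, f0[:, None]])
    return GaussianMixture(n_components=n_components, covariance_type="full").fit(joint)

def predict_f0(gmm, mfcc):
    D = mfcc.shape[1]
    mu_x, mu_y = gmm.means_[:, :D], gmm.means_[:, D]
    S = gmm.covariances_
    Sxx, Sxy = S[:, :D, :D], S[:, :D, D]
    preds = np.empty(len(mfcc))
    for t, x in enumerate(mfcc):
        # Posterior probability of each mixture component given the MFCC vector.
        resp = np.array([w * multivariate_normal.pdf(x, m, c)
                         for w, m, c in zip(gmm.weights_, mu_x, Sxx)])
        resp /= resp.sum()
        # Per-component conditional mean of F0 given the MFCC vector.
        cond = np.array([mu_y[k] + Sxy[k] @ np.linalg.solve(Sxx[k], x - mu_x[k])
                         for k in range(len(mu_y))])
        preds[t] = resp @ cond
    return preds
```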

5.
Current automatic acoustic detection and classification of microchiroptera utilize global features of individual calls (i.e., duration, bandwidth, frequency extrema), an approach that stems from expert knowledge of call sonograms. This approach parallels the acoustic phonetic paradigm of human automatic speech recognition (ASR), which relied on expert knowledge to account for variations in canonical linguistic units. ASR research eventually shifted from acoustic phonetics to machine learning, primarily because of the superior ability of machine learning to account for signal variation. To compare machine learning with conventional methods of detection and classification, nearly 3000 search-phase calls were hand labeled from recordings of five species: Pipistrellus bodenheimeri, Molossus molossus, Lasiurus borealis, L. cinereus semotus, and Tadarida brasiliensis. The hand labels were used to train two machine learning models: a Gaussian mixture model (GMM) for detection and classification and a hidden Markov model (HMM) for classification. The GMM detector produced 4% error compared to 32% error for a baseline broadband energy detector, while the GMM and HMM classifiers produced errors of 0.6 ± 0.2% compared to 16.9 ± 1.1% error for a baseline discriminant function analysis classifier. The experiments showed that machine learning algorithms produced errors an order of magnitude smaller than those for conventional methods.

6.
The American English phoneme /r/ has long been associated with large amounts of articulatory variability during production. This paper investigates the hypothesis that the articulatory variations used by a speaker to produce /r/ in different contexts exhibit systematic tradeoffs, or articulatory trading relations, that act to maintain a relatively stable acoustic signal despite the large variations in vocal tract shape. Acoustic and articulatory recordings were collected from seven speakers producing /r/ in five phonetic contexts. For every speaker, the different articulator configurations used to produce /r/ in the different phonetic contexts showed systematic tradeoffs, as evidenced by significant correlations between the positions of transducers mounted on the tongue. Analysis of acoustic and articulatory variabilities revealed that these tradeoffs act to reduce acoustic variability, thus allowing relatively large contextual variations in vocal tract shape for /r/ without seriously degrading the primary acoustic cue. Furthermore, some subjects appeared to use completely different articulatory gestures to produce /r/ in different phonetic contexts. When viewed in light of current models of speech movement control, these results appear to favor models that utilize an acoustic or auditory target for each phoneme over models that utilize a vocal tract shape target for each phoneme.

7.
Scientists have made great strides toward understanding the mechanisms of speech production and perception. However, the complex relationships between the acoustic structures of speech and the resulting psychological percepts have yet to be fully and adequately explained, especially in speech produced by younger children. Thus, this study examined the acoustic structure of voiceless fricatives (/f, θ, s, ʃ/) produced by adults and typically developing children from 3 to 6 years of age in terms of multiple acoustic parameters (durations, normalized amplitude, spectral slope, and spectral moments). It was found that spectral slope and spectral variance (commonly excluded from previous studies of child speech) were important parameters in the differentiation and classification of the voiceless fricatives, with spectral variance being the only measure to separate all four places of articulation. It was further shown that the sibilant contrast between /s/ and /ʃ/ was less distinct in children than in adults, characterized by a dramatic change in several spectral parameters at approximately five years of age. Discriminant analysis revealed evidence that classification models based on adult data were sensitive to these spectral differences in the five-year-old age group.
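The spectral moments used here are standard and easy to state precisely; a minimal sketch of computing them for one analysis frame follows (the Hann taper and framing are assumptions):

```python
# Sketch: spectral moments of a fricative frame (centroid, variance,
# skewness, excess kurtosis), computed from the normalized power spectrum.
import numpy as np

def spectral_moments(x, sr):
    x = x * np.hanning(len(x))                    # taper the analysis frame
    p = np.abs(np.fft.rfft(x)) ** 2               # power spectrum
    f = np.fft.rfftfreq(len(x), d=1.0 / sr)       # frequency axis (Hz)
    p = p / p.sum()                               # treat spectrum as a distribution
    m1 = np.sum(f * p)                            # centroid (Hz)
    m2 = np.sum((f - m1) ** 2 * p)                # variance (Hz^2)
    m3 = np.sum((f - m1) ** 3 * p) / m2 ** 1.5    # skewness (dimensionless)
    m4 = np.sum((f - m1) ** 4 * p) / m2 ** 2 - 3  # excess kurtosis
    return m1, m2, m3, m4
```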

8.
9.
Numerous attempts have been made to find low-dimensional, formant-related representations of speech signals that are suitable for automatic speech recognition. However, it is often not known how these features behave in comparison with true formants. The purpose of this study was to compare two sets of automatically extracted formant-like features, i.e., robust formants and HMM2 features, to hand-labeled formants. The robust formant features were derived by means of the split Levinson algorithm, while the HMM2 features correspond to the frequency segmentation of speech signals obtained by two-dimensional hidden Markov models. Mel-frequency cepstral coefficients (MFCCs) were also included in the investigation as an example of state-of-the-art automatic speech recognition features. The feature sets were compared in terms of their performance on a vowel classification task. The speech data and hand-labeled formants used in this study are a subset of the American English vowels database presented in Hillenbrand et al. [J. Acoust. Soc. Am. 97, 3099-3111 (1995)]. Classification performance was measured on the original, clean data and in noisy acoustic conditions. When using clean data, the classification performance of the formant-like features compared very well to the performance of the hand-labeled formants in a gender-dependent experiment, but was inferior to the hand-labeled formants in a gender-independent experiment. The results obtained in noisy acoustic conditions indicated that the formant-like features used in this study are not inherently noise robust. For clean and noisy data, as well as for the gender-dependent and gender-independent experiments, the MFCCs achieved results equal or superior to those of the formant features, but at the price of a much higher feature dimensionality.
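For orientation, the common baseline idea behind formant-like features is an all-pole (LPC) fit whose pole angles approximate formant frequencies. The sketch below shows that standard root-finding approach; it is not the paper's split-Levinson robust-formant method or the HMM2 scheme, and the gating thresholds are assumptions.

```python
# Sketch: standard LPC root-finding formant estimation (a baseline idea,
# NOT the paper's split-Levinson method).
import numpy as np
import librosa

def lpc_formants(frame, sr, order=12):
    a = librosa.lpc(frame * np.hamming(len(frame)), order=order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]             # keep one of each conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)    # pole angles -> Hz
    bw = -np.log(np.abs(roots)) * sr / np.pi      # pole radii -> bandwidths (Hz)
    keep = (freqs > 90) & (bw < 400)              # heuristic formant gating
    return np.sort(freqs[keep])
```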

10.
11.
Previous research has shown that a region of the midbrain, the periaqueductal gray matter (PAG), is critical for vocalization. In this review, we describe the results of previous investigations in which we sought to find out how PAG neurons coordinate the activity and precise timing of respiratory, laryngeal, and oral muscles for natural-sounding vocalization, using the technique of excitatory amino acid microinjections in cats. In these studies, all surgical procedures were carried out under deep anaesthesia. In the precollicular decerebrate cat, two general types of vocalization, classified as voiced and unvoiced, could be evoked by exciting neurons in the lateral part of the intermediate PAG. The patterns of evoked electromyographic activity were strikingly similar to previously reported patterns of human muscle activity. Coordinated patterns of activity were evoked with just-threshold excitation, leading to the conclusion that patterned muscle activity corresponding to the major categories of voiced and voiceless sound production is represented in the PAG. In a parallel series of human and animal experiments, we also determined that the speech and vocalization respiratory patterns are integrated and coordinated with afferent signals related to lung volume. These data have led to the proposal of a new hypothesis for the neural control of vocalization: that the PAG is a crucial brain site for mammalian voice production, not only in the production of emotional or involuntary sounds, but also as a generator of the specific respiratory and laryngeal motor patterns essential for human speech and song.

12.
Multichannel sound reproduction systems can enhance a listener's sense of presence and spatial impression, but under hands-free communication conditions they inevitably suffer from noise and echo interference, which severely degrades communication quality. To address this problem, this paper proposes a multichannel acoustic echo cancellation and noise suppression method based on a gated convolutional recurrent neural network. The method takes the compressed complex spectra of the microphone signal and of the reproduction channels as network inputs, and the compressed complex spectrum of the near-end speech as the training target, recovering clean near-end speech directly from the microphone signal. Because no decorrelation of the reproduced signals is required, the method avoids the non-unique-solution problem of conventional adaptive filtering while preserving multichannel reproduction quality. Experiments in both simulated and real acoustic environments show that the proposed method substantially suppresses the noise and echo of multichannel reproduction systems and outperforms conventional algorithms in both speech quality and echo return loss enhancement.
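A minimal sketch of the compressed-complex-spectrum representation described above: compress the STFT magnitude with a power law, keep the phase, and stack real and imaginary parts as network input channels. The exponent 0.3 and the STFT parameters are common choices assumed here, not taken from the paper.

```python
# Sketch: power-law "compressed complex spectrum" input/target representation.
import numpy as np
from scipy.signal import stft

def compressed_complex_spectrum(x, fs, power=0.3, nfft=512):
    _, _, X = stft(x, fs=fs, nperseg=nfft)
    mag, phase = np.abs(X), np.angle(X)
    Xc = (mag ** power) * np.exp(1j * phase)      # compress magnitude, keep phase
    # Real and imaginary parts form two input channels for the network.
    return np.stack([Xc.real, Xc.imag], axis=0)
```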

13.
Speaker recognition is an important classification task, which can be solved using several approaches. Although building a speaker recognition model on a closed set of speakers under neutral speaking conditions is a well-researched task with solutions that provide excellent performance, the classification accuracy of such models decreases significantly when they are applied to emotional speech or in the presence of interference. Furthermore, deep models may require a large number of parameters, so constrained solutions are desirable for implementation on edge devices in Internet of Things systems for real-time detection. The aim of this paper is to propose a simple and constrained convolutional neural network for speaker recognition tasks and to examine its robustness for recognition in emotional speech conditions. We examine three quantization methods for developing a constrained network: the eight-bit floating-point format, ternary scalar quantization, and binary scalar quantization. The results are demonstrated on the recently recorded SEAC dataset.
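Minimal sketches of the ternary and binary scalar quantizers examined here; the per-tensor scale and the 0.7·mean|w| ternary threshold follow common practice and are assumptions, not necessarily the paper's choices.

```python
# Sketch: binary and ternary scalar quantization of a weight tensor.
import numpy as np

def binarize(w):
    # Binary scalar quantization: sign(w) scaled by the mean magnitude.
    alpha = np.mean(np.abs(w))
    return alpha * np.sign(w)

def ternarize(w):
    # Ternary scalar quantization: values in {-a, 0, +a} with a per-tensor scale.
    delta = 0.7 * np.mean(np.abs(w))              # common TWN-style threshold
    mask = np.abs(w) > delta
    alpha = np.mean(np.abs(w[mask])) if mask.any() else 0.0
    return alpha * np.sign(w) * mask
```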

14.
Optimum data windows make it possible to determine accurately the amplitude, phase, and frequency of one or more tones (sinusoidal components) in a signal. Procedures presented in this paper can be applied to noisy signals, signals having moderate nonstationarity, and tones close in frequency. They are relevant to many areas of acoustics where sounds are quasistationary. Among these are acoustic probes transmitted through media and natural sounds, such as animal vocalization, speech, and music. The paper includes criteria for multitone FFT block design and an example of application to sound transmission in the atmosphere.
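As a point of reference, here is the generic windowed-FFT tone estimator that such window designs refine: window the block, locate the spectral peak, and interpolate. The Blackman-Harris window and the parabolic log-magnitude interpolation below are standard choices assumed for illustration, not the paper's optimum windows.

```python
# Sketch: estimate a tone's frequency and amplitude from one windowed FFT block.
import numpy as np
from scipy.signal.windows import blackmanharris

def tone_estimate(x, sr):
    w = blackmanharris(len(x))
    X = np.fft.rfft(x * w)
    mag = np.abs(X)
    k = np.argmax(mag[1:-1]) + 1                  # coarse peak bin
    # Parabolic interpolation on log magnitude around the peak.
    a, b, c = np.log(mag[k - 1 : k + 2] + 1e-12)
    d = 0.5 * (a - c) / (a - 2 * b + c)           # fractional bin offset
    freq = (k + d) * sr / len(x)
    # Amplitude: peak magnitude corrected for the window's coherent gain.
    amp = mag[k] / (w.sum() / 2)
    return freq, amp
```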

15.
Research on mixed bilingual speech recognition
With the globalization of information in modern society, bilingual and multilingual code-mixing has become increasingly common, and bilingual or multilingual speech recognition has accordingly become a popular research topic. Bilingual code-mixed speech recognition faces two main problems: first, controlling system complexity while maintaining recognition accuracy for both languages; second, effectively handling the non-native accent introduced when embedded-language words are spoken by speakers of the matrix language. To handle code-mixing and reduce the amount of data required for statistical modeling, a unified bilingual recognition system was built through phone clustering. Within the clustering algorithm, a new two-pass phone clustering method based on the confusion matrix is proposed and compared with the traditional clustering method based on an acoustic-likelihood criterion. To address the low recognition performance on non-native speech, a new bilingual model modification algorithm is proposed. Experimental results show that the resulting Chinese-English bilingual speech recognition system recognizes both languages simultaneously while keeping the model size effectively under control, and maintains recognition performance on both monolingual and code-mixed speech.
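The abstract does not detail the two-pass confusion-matrix clustering, so the sketch below shows only the core idea under stated assumptions: row-normalize the phone confusion matrix, symmetrize it into a confusability measure, and cluster phones agglomeratively.

```python
# Sketch: cluster bilingual phones by confusability (core idea only; the
# paper's method is a two-pass algorithm with details not given here).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_phones(confusion, n_clusters):
    # confusion[i, j]: how often phone i was recognized as phone j.
    P = confusion / confusion.sum(axis=1, keepdims=True)    # row-normalize
    sim = 0.5 * (P + P.T)                                   # symmetric confusability
    dist = 1.0 - sim / sim.max()                            # turn into a distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")  # cluster id per phone
```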

16.
In this study, the problem of sparse enrollment data for in-set versus out-of-set speaker recognition is addressed. The challenge is that both the training speaker data (5 s) and the test material (2-6 s) are of limited duration. The limited enrollment data result in a sparse acoustic model space for the desired speaker model. The focus of this study is on filling these acoustic holes by harvesting neighbor speaker information to leverage overall system performance. Acoustically similar speakers are selected from a separate available corpus via three different methods for speaker similarity measurement. The selected data from these acoustically similar speakers are exploited to fill the lack of phone coverage caused by the original sparse enrollment data. The proposed speaker modeling process mimics the naturally distributed acoustic space for conversational speech. The Gaussian mixture model (GMM) tagging process allows simulated natural conversational speech to be included for in-set speaker modeling, which maintains the original system requirement of text-independent speaker recognition. A human listener evaluation was also performed to compare machine versus human speaker recognition performance, with machine accuracy of 95% compared to 72.2% for human in-set/out-of-set decisions. Results show that for extremely sparse train/reference audio streams, human speaker recognition is not nearly as reliable as machine-based speaker recognition. The proposed acoustic hole filling solution (MRNC) produces an average 7.42% relative improvement over a GMM-Cohort UBM baseline and a 19% relative improvement over the Eigenvoice baseline using the FISHER corpus.
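A minimal sketch of one plausible neighbor-selection step (the paper compares three similarity measures it does not specify here, so this single GMM-likelihood measure is an assumption): score each candidate speaker's features under the in-set speaker's small enrollment GMM and keep the top matches.

```python
# Sketch: pick acoustically similar "neighbor" speakers to fill sparse
# enrollment data (assumed similarity measure, not necessarily the paper's).
import numpy as np
from sklearn.mixture import GaussianMixture

def nearest_speakers(enroll_feats, corpus, n_neighbors=5):
    # enroll_feats: (T, D) frames from ~5 s of enrollment speech.
    # corpus: dict mapping speaker id -> (T_s, D) feature matrix.
    gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(enroll_feats)
    scores = {spk: gmm.score(X) for spk, X in corpus.items()}  # mean log-likelihood
    return sorted(scores, key=scores.get, reverse=True)[:n_neighbors]
```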

17.
Yin Hui, Xie Xiang, Kuang Jingming. 《声学学报》 (Acta Acustica), 2012, 37(1): 97-103
The fractional Fourier transform (FrFT) has unique advantages in processing nonstationary signals, especially chirp signals, while the human auditory system exhibits performance that automatic speech recognition systems can hardly match. In this paper, a Gammatone auditory filterbank is used for front-end time-domain filtering of the speech signal, and acoustic features are then extracted from each subband output using the fractional Fourier transform. Since the transform order strongly affects FrFT performance, a method is proposed that determines the order for each subband signal by linear fitting of the instantaneous frequency curve, and it is compared with an ambiguity-function-based method. Recognition results on clean and noisy Mandarin isolated-digit corpora show that the proposed acoustic features significantly improve recognition accuracy over an MFCC baseline system; compared with the ambiguity-function method, searching for the order via the instantaneous frequency curve greatly reduces computation, and the features extracted with it achieve the highest average recognition accuracy.
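A sketch of the order-selection idea under stated assumptions: estimate the subband's instantaneous-frequency curve from the analytic signal, fit a line to get the chirp rate, and map that rate to an FrFT rotation angle. The dimensional normalization S = sqrt(N)/fs used in the mapping is a common convention and may differ from the paper's formulation.

```python
# Sketch: choose the FrFT order for one Gammatone subband from a linear fit
# to the instantaneous-frequency (IF) curve.
import numpy as np
from scipy.signal import hilbert

def frft_order_from_if(x, fs):
    analytic = hilbert(x)
    phase = np.unwrap(np.angle(analytic))
    inst_f = np.diff(phase) * fs / (2 * np.pi)    # IF curve in Hz
    t = np.arange(len(inst_f)) / fs
    c = np.polyfit(t, inst_f, 1)[0]               # chirp rate (Hz/s)
    S = np.sqrt(len(x)) / fs                      # assumed normalization scale
    alpha = np.arctan2(1.0, -c * S**2)            # rotation angle: cot(a) = -c*S^2
    return 2 * alpha / np.pi                      # FrFT order p
```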

18.
An efficient robust sound classification algorithm for hearing aids
An efficient robust sound classification algorithm based on hidden Markov models is presented. The system would enable a hearing aid to automatically change its behavior for differing listening environments according to the user's preferences. This work attempts to distinguish between three listening-environment categories: speech in traffic noise, speech in babble, and clean speech, regardless of the signal-to-noise ratio. The classifier uses only the modulation characteristics of the signal, ignoring the absolute sound pressure level and the absolute spectrum shape, which makes the algorithm robust against irrelevant acoustic variations. The measured classification hit rate was 96.7%-99.5% when the classifier was tested with sounds representing one of the three environment categories included in the classifier. False-alarm rates were 0.2%-1.7% in these tests. The algorithm is robust and efficient, requiring few instructions and little memory, and can readily be implemented in a DSP-based hearing instrument.
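A minimal sketch of level-independent modulation features of the kind the classifier relies on: per subband, extract the amplitude envelope, normalize out absolute level, and sum modulation-spectrum energy in a few bands. All band edges and filter choices here are assumptions.

```python
# Sketch: level- and spectrum-shape-independent modulation features.
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def modulation_features(x, fs, bands=((100, 500), (500, 2000), (2000, 6000)),
                        mod_edges=(0.5, 2, 8, 32)):
    feats = []
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        env = np.abs(hilbert(sosfilt(sos, x)))     # subband amplitude envelope
        env = env / (np.mean(env) + 1e-10)         # discard absolute level
        M = np.abs(np.fft.rfft(env - env.mean())) ** 2
        fm = np.fft.rfftfreq(len(env), 1 / fs)     # modulation-frequency axis
        for m_lo, m_hi in zip(mod_edges[:-1], mod_edges[1:]):
            feats.append(M[(fm >= m_lo) & (fm < m_hi)].sum())
    return np.log(np.asarray(feats) + 1e-10)
```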

19.
Surface behavior and concurrent underwater vocalizations were recorded for Pacific white-sided dolphins in the Southern California Bight (SCB) over multiple field seasons spanning 3 years. Clicks, click trains, and pulsed calls were counted and classified based on acoustic measurements, leading to the identification of 19 key call features used for analysis. Kruskal-Wallis tests indicated that call features differ significantly across behavioral categories. Previous work had discovered two distinctive click types (A and B), which may correspond to known subpopulations of Pacific white-sided dolphins in the Southern California Bight; this study revealed that animals producing these different click types also differ in both their behavior and vocalization patterns. Click Type A groups were predominantly observed slow traveling and milling, with little daytime foraging, while click Type B groups were observed traveling and foraging. These behavioral differences may be characteristic of niche partitioning by overlapping populations; coupled with the differences in vocalization patterns, they may signify that these subpopulations are cryptic species. Finally, random forest decision trees were used to classify behavior based on vocalization data, with rates of correct classification up to 86%, demonstrating the potential for the use of vocalization patterns to predict behavior.
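The final classification step maps call features to behavioral categories with a random forest; a minimal sklearn sketch with placeholder data follows (the real inputs would be the 19 measured call features, which are not listed in the abstract).

```python
# Sketch: random forest classification of behavior from call features,
# evaluated with cross-validation. The data below are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 19))                    # placeholder: 19 call features per call
y = rng.integers(0, 4, size=300)                  # placeholder behavioral categories

clf = RandomForestClassifier(n_estimators=500, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())    # mean correct-classification rate
```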

20.
Acoustic scene analysis and automatic classification help humans formulate appropriate strategies for specific environments and are therefore of considerable research value. With the development of convolutional neural networks (CNNs), many CNN-based acoustic scene classification methods have emerged. Among them, the time-frequency convolutional neural network (TS-CNN), which employs a time-frequency attention module, is currently one of the best-performing networks for acoustic scene classification. To further improve classification performance without increasing network complexity, this paper proposes a TS-CNN model based on collaborative learning (TSCNN-CL). Specifically, an auxiliary branch with a homogeneous structure is first built to participate in training. A collaborative loss function based on KL divergence is then proposed to share knowledge between the branch and the backbone. Finally, so that inference cost is not increased, the proposed model uses only the backbone network's predictions at test time. Comprehensive experiments on the ESC-10, ESC-50, and UrbanSound8k datasets show that the model outperforms the TS-CNN model and most current mainstream methods.
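A minimal PyTorch sketch of a KL-based collaborative loss of the kind described: both branches are trained with cross-entropy plus a KL term between their softened predictions, and only the backbone is used at inference. The symmetric form, temperature, and weighting are assumptions, not the paper's exact formulation.

```python
# Sketch: collaborative loss coupling the backbone and the homogeneous
# auxiliary branch via (symmetric) KL divergence.
import torch
import torch.nn.functional as F

def collaborative_loss(logits_main, logits_aux, targets, lam=1.0, T=1.0):
    # Supervised cross-entropy for both branches.
    ce = F.cross_entropy(logits_main, targets) + F.cross_entropy(logits_aux, targets)
    # Symmetric KL between the two branches' softened predictions.
    p = F.log_softmax(logits_main / T, dim=1)
    q = F.log_softmax(logits_aux / T, dim=1)
    kl = F.kl_div(p, q.exp(), reduction="batchmean") + \
         F.kl_div(q, p.exp(), reduction="batchmean")
    return ce + lam * (T * T) * kl
```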
