Similar Articles
20 similar articles found (search time: 15 ms)
1.
2.
A new method for identifying fish vocalizations, based on auditory analysis and support vector machine (SVM) classification, is presented. High-resolution features are extracted from fish vocalization data using the amplitude modulation spectrogram (AMS) of the input signals to facilitate the identification of grunts and growls made by a highly vocal wild fish, Porichthys notatus. Comparison results on ocean audio recordings verify the effectiveness of the proposed method in identifying various types of fish vocalizations. The relationships of signal-to-noise ratio (SNR) and ocean temperature to the accuracy of the proposed method are also quantified. Moreover, a context-aware prediction algorithm is introduced for estimating the continuous data.
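As a rough illustration of the feature pipeline described above, the numpy sketch below computes a simple amplitude modulation spectrogram: a short-time magnitude spectrogram, followed by an FFT of each band's envelope. The frame sizes, the toy "grunt"/"growl" signals, and the omission of the SVM back end are all simplifying assumptions, not details from the paper.

```python
import numpy as np

def ams_features(x, frame=256, hop=128, n_mod=32):
    """Amplitude modulation spectrogram (AMS) sketch:
    1) short-time magnitude spectrogram,
    2) per-band amplitude envelope over time,
    3) FFT of each band's envelope -> modulation spectrum."""
    n_frames = 1 + (len(x) - frame) // hop
    win = np.hanning(frame)
    spec = np.array([np.abs(np.fft.rfft(win * x[i*hop:i*hop+frame]))
                     for i in range(n_frames)])          # (frames, bins)
    env = spec.T                                         # per-band envelope, (bins, frames)
    mod = np.abs(np.fft.rfft(env, n=2*n_mod, axis=1))[:, :n_mod]
    return mod.ravel()                                   # flattened AMS feature vector

# Toy "grunt" (slow AM) vs "growl" (fast AM) signals -- invented for illustration
t = np.arange(0, 1.0, 1/2000)
grunt = (1 + np.sin(2*np.pi*5*t)) * np.sin(2*np.pi*100*t)
growl = (1 + np.sin(2*np.pi*40*t)) * np.sin(2*np.pi*100*t)
f_grunt, f_growl = ams_features(grunt), ams_features(growl)
```

In a full system, vectors like these would be fed to an SVM; the two call types differ mainly in their modulation rates, which is exactly what the AMS exposes.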

3.
Mammalian vocal production mechanisms are still poorly understood despite their significance for theories of human speech evolution. In particular, it is still unclear to what degree mammals are capable of actively controlling vocal-tract filtering, a defining feature of human speech production. To address this issue, a detailed acoustic analysis of the alarm vocalizations of free-ranging Diana monkeys was conducted. These vocalizations are especially interesting because they convey semantic information about two of the monkeys' natural predators, the leopard and the crowned eagle. Here, vocal-tract and sound-source parameters in Diana monkey alarm vocalizations are described. A vocalization-initial downward formant transition is found to distinguish most reliably between eagle and leopard alarm vocalizations. This finding is discussed as an indication of articulation and, alternatively, as the result of a strong nasalization effect. It is suggested that the formant modulation is the result of active vocal filtering used by the monkeys to encode semantic information, an ability previously thought to be restricted to human speech.
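Formant analysis of the kind described above is commonly done with linear prediction. The sketch below is a generic LPC formant estimator (autocorrelation method), not the authors' actual pipeline; the model order, sampling rate, and toy resonant signal are illustrative assumptions.

```python
import numpy as np

def lpc(x, order=8):
    """LPC coefficients by the autocorrelation method (minimal sketch)."""
    x = x * np.hamming(len(x))
    r = np.correlate(x, x, 'full')[len(x)-1:len(x)+order]   # r[0..order]
    R = np.array([[r[abs(i-j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order+1])
    return np.concatenate(([1.0], -a))       # A(z) = 1 - sum a_k z^-k

def formants(x, fs, order=8):
    """Candidate formant frequencies from the angles of LPC polynomial roots."""
    roots = np.roots(lpc(x, order))
    roots = roots[np.imag(roots) > 0]        # keep one of each conjugate pair
    return np.sort(np.angle(roots) * fs / (2*np.pi))

# Toy vowel-like signal: a damped resonance near 700 Hz plus slight noise
fs = 8000
t = np.arange(1024) / fs
x = np.sin(2*np.pi*700*t) * np.exp(-30*t) \
    + 0.001 * np.random.default_rng(0).standard_normal(1024)
f = formants(x, fs)
```

Tracking such estimates frame by frame over the first part of a call is one way a formant downward transition like the one reported above could be measured.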

4.
Text-Independent Speaker Identification Using Neural Array Networks
A neural array network for speaker identification is proposed. Small subnetworks, each performing only a two-class discrimination, are combined into an array to discriminate among multiple classes. The construction of the array network and its search algorithm are given, and text-independent speaker identification is studied using a radial basis function (RBF) array network. Experiments show that, for 20 speakers, with 5 s of speech for training and 2 s for identification, the method achieves a 98% correct identification rate.
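The idea of combining small two-class subnetworks into an array can be sketched as one-vs-one voting. The sketch below uses tiny logistic units as stand-ins for the paper's RBF subnetworks; the three-speaker toy data and all names are illustrative assumptions.

```python
import numpy as np

def train_pair(X, y, steps=200, lr=0.5):
    """One small two-class 'subnetwork' (a logistic unit standing in
    for an RBF subnetwork), trained by gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def array_network(X, labels):
    """Combine pairwise subnetworks into an array for multi-class decisions."""
    classes = np.unique(labels)
    pairs = {}
    for i, a in enumerate(classes):
        for b in classes[i+1:]:
            m = (labels == a) | (labels == b)
            pairs[(a, b)] = train_pair(X[m], (labels[m] == b).astype(float))
    def predict(x):
        votes = {c: 0 for c in classes}
        xb = np.append(x, 1.0)
        for (a, b), w in pairs.items():
            votes[b if xb @ w > 0 else a] += 1   # each subnetwork casts one vote
        return max(votes, key=votes.get)
    return predict

# Toy feature clusters for three "speakers"
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (20, 2)) for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 20)
predict = array_network(X, y)
```

For k speakers this builds k(k-1)/2 small subnetworks, mirroring the array construction described above.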

5.
Surface behavior and concurrent underwater vocalizations were recorded for Pacific white-sided dolphins in the Southern California Bight (SCB) over multiple field seasons spanning 3 years. Clicks, click trains, and pulsed calls were counted and classified based on acoustic measurements, leading to the identification of 19 key call features used for analysis. Kruskal-Wallis tests indicated that call features differ significantly across behavioral categories. Previous work had discovered two distinctive click types (A and B), which may correspond to known subpopulations of Pacific white-sided dolphins in the Southern California Bight; this study revealed that animals producing these different click types also differ in both their behavior and vocalization patterns. Click type A groups were predominantly observed slow traveling and milling, with little daytime foraging, while click type B groups were observed traveling and foraging. These behavioral differences may be characteristic of niche partitioning by overlapping populations; coupled with differences in vocalization patterns, they may signify that these subpopulations are cryptic species. Finally, random forest decision trees were used to classify behavior based on vocalization data, with rates of correct classification up to 86%, demonstrating the potential for the use of vocalization patterns to predict behavior.
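The random-forest classification step can be illustrated with a miniature bagged ensemble of randomized decision stumps, a deliberately simplified stand-in for a full random forest. The two "call features" in the toy table are invented for illustration, not measurements from the study.

```python
import numpy as np

def fit_stump(X, y, rng):
    """One randomized stump: pick a random feature, choose the threshold
    that best separates the two majority labels."""
    f = rng.integers(X.shape[1])
    best = None
    for t in np.unique(X[:, f]):
        left, right = y[X[:, f] <= t], y[X[:, f] > t]
        if len(left) == 0 or len(right) == 0:
            continue
        la, ra = np.bincount(left).argmax(), np.bincount(right).argmax()
        score = (left == la).sum() + (right == ra).sum()
        if best is None or score > best[0]:
            best = (score, f, t, la, ra)
    return best[1:]

def random_forest(X, y, n_trees=25, seed=0):
    """Bagged ensemble of randomized stumps -- a miniature stand-in
    for the random forest used in the study."""
    rng = np.random.default_rng(seed)
    stumps = []
    for _ in range(n_trees):
        idx = rng.integers(len(X), size=len(X))      # bootstrap sample
        stumps.append(fit_stump(X[idx], y[idx], rng))
    def predict(x):
        votes = [la if x[f] <= t else ra for (f, t, la, ra) in stumps]
        return np.bincount(votes).argmax()
    return predict

# Toy call-feature table: [duration (s), peak frequency (kHz)] per call,
# for behaviors 0 = "travel" and 1 = "forage"
rng = np.random.default_rng(2)
travel = np.column_stack([rng.normal(0.2, 0.05, 30), rng.normal(8, 1, 30)])
forage = np.column_stack([rng.normal(0.8, 0.05, 30), rng.normal(12, 1, 30)])
X = np.vstack([travel, forage]); y = np.repeat([0, 1], 30)
predict = random_forest(X, y)
```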

6.
According to classical concepts, the relationship between the first two formants is the feature that determines the identification of long vowels in speech. However, the characteristics of vowels may vary considerably depending on the conditions of their production. Thus, the aforementioned features that are valid for adult speech cannot be extended to speech signals with high fundamental frequencies, such as infant speech or singing. On the basis of studies of preverbal infant vocalizations, singing, and speech imitation by talking birds, it is shown that the stable features of vowel-like sounds are the positions and amplitude ratios of the most pronounced spectral maxima (including those corresponding to the fundamental frequency). The results of the studies suggest that precisely these features determine the categorical identification of vowels. The role of the relationship between the frequency and amplitude characteristics in vowel identification, irrespective of the way the vowel is produced and the age and state of the speaker, as well as in the case of speech imitation by talking birds, is discussed.
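The features discussed above, positions and amplitude ratios of the most pronounced spectral maxima, can be sketched directly in numpy. The peak-picking scheme and the toy signal (a 400 Hz fundamental with two strong partials) are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def spectral_maxima(x, fs, n_peaks=3):
    """Frequencies and amplitude ratios of the strongest spectral maxima
    (including the one at the fundamental frequency)."""
    mag = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    freqs = np.fft.rfftfreq(len(x), 1/fs)
    # local maxima of the magnitude spectrum
    peaks = np.where((mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:]))[0] + 1
    top = peaks[np.argsort(mag[peaks])[::-1][:n_peaks]]
    order = np.sort(top)                                   # ascending frequency
    return freqs[order], mag[order] / mag[order].max()

# Toy vowel-like sound: 400 Hz fundamental with partials at 800 and 1200 Hz
fs = 8000
t = np.arange(2048) / fs
x = np.sin(2*np.pi*400*t) + 0.6*np.sin(2*np.pi*800*t) + 0.3*np.sin(2*np.pi*1200*t)
pos, ratio = spectral_maxima(x, fs)
```

The returned (position, ratio) pairs are the kind of frequency-amplitude description the abstract argues is stable across production conditions.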

7.
A method for analyzing whispered-speech speaker state factors using global spectral parameters is proposed. First, based on the results of whispered-speech listening experiments, arousal and pleasure (valence) factors are introduced to rate speaker state on a three-level scale. Second, spectral parameters under a sinusoidal model of whispered speech and a human auditory model are extracted and, together with other short-time spectral parameters, tracked over time; the global statistics of each parameter are then computed and used as features to classify the whispered speaker's state. Experimental results show that the global spectral parameters of the sinusoidal and auditory models raise the accuracy of the whispered-speaker state factor classification system to 90%. The classification method and the state-factor description scheme provide an effective approach to analyzing speaker state in whispered speech.
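The "global statistics of a tracked parameter" step described above can be sketched as follows; the particular statistics and the toy spectral-centroid trajectory are illustrative assumptions.

```python
import numpy as np

def global_stats(traj):
    """Global statistics over a per-frame parameter trajectory, of the kind
    used as utterance-level features (the exact statistic set is assumed)."""
    traj = np.asarray(traj, float)
    delta = np.diff(traj)                      # frame-to-frame dynamics
    return {
        "mean": traj.mean(), "std": traj.std(),
        "min": traj.min(), "max": traj.max(), "range": np.ptp(traj),
        "delta_mean": delta.mean(), "delta_std": delta.std(),
    }

# Toy trajectory: a rising then falling spectral-parameter track (Hz)
traj = np.concatenate([np.linspace(1000, 1400, 50), np.linspace(1400, 1100, 50)])
stats = global_stats(traj)
```

A classifier would then operate on one such fixed-length statistics vector per utterance rather than on the variable-length trajectory itself.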

8.
Recognition of deviant speech is a challenging research topic. One solution is to classify the deviation at the front end and then apply processing algorithms tailored to each type of deviation. Among the various forms of speech deviation, the case of a speaker whose body is under applied stress (g-force), as in the cockpit of a fighter jet or a space shuttle, is particularly distinctive. The resulting pronunciation deviations differ from those caused by psychological, perceptual, or physiological factors, and dedicated studies on classifying stress-affected deviant speech are still rare at home and abroad. Starting from an analysis of several pitch-based features of speech under applied stress, this paper proposes a method for classifying stress-affected deviant speech versus normal speech. On small-vocabulary test samples collected in a flight simulator, the average classification accuracy reached 93.3% for speaker-dependent tests and 85.8% for multi-speaker tests.
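A minimal sketch of one pitch-based feature of the kind analyzed above: frame-level F0 estimated by the autocorrelation method. The "stressed" toy signal with a raised F0 is an assumption for illustration, not data from the paper.

```python
import numpy as np

def f0_autocorr(frame, fs, fmin=60, fmax=400):
    """Frame-level F0 estimate: the lag of the autocorrelation maximum
    within the plausible pitch-period range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, 'full')[len(frame)-1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

fs = 8000
t = np.arange(int(0.04 * fs)) / fs            # one 40 ms frame
normal = np.sign(np.sin(2*np.pi*120*t))       # ~120 Hz pulse-train-like voicing
stressed = np.sign(np.sin(2*np.pi*190*t))     # toy: raised F0 under g-load
f0_n, f0_s = f0_autocorr(normal, fs), f0_autocorr(stressed, fs)
```

Statistics of such F0 tracks (level, range, contour shape) are the kind of features a stressed-vs-normal front-end classifier could use.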

9.
A new feature extraction model, generalized perceptual linear prediction (gPLP), is developed to calculate a set of perceptually relevant features for digital signal analysis of animal vocalizations. The gPLP model is a generalized adaptation of the perceptual linear prediction (PLP) model, popular in human speech processing, which incorporates perceptual information such as frequency warping and equal loudness normalization into the feature extraction process. Since such perceptual information is available for a number of animal species, this new approach integrates that information into a generalized model to extract perceptually relevant features for a particular species. To illustrate, qualitative and quantitative comparisons are made between the species-specific gPLP model and the original PLP model using a set of vocalizations collected from captive African elephants (Loxodonta africana) and wild beluga whales (Delphinapterus leucas). The models that incorporate perceptual information outperform the original human-based models in both visualization and classification tasks.
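The species-specific frequency warping at the heart of gPLP can be sketched as a triangular filterbank spaced uniformly on a warped frequency axis. The warp function and its constants below are illustrative (a Greenwood-style map with human-like constants), not measured values for any species.

```python
import numpy as np

def warped_filterbank(n_filters, n_fft, fs, warp, inv_warp):
    """Triangular filters spaced uniformly on a species-specific warped
    frequency axis -- the frequency-warping idea behind gPLP."""
    fmax = fs / 2
    edges = inv_warp(np.linspace(warp(0), warp(fmax), n_filters + 2))
    bins = np.floor(edges / fmax * (n_fft // 2)).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i+1], bins[i+2]
        if c > l: fb[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising edge
        if r > c: fb[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling edge
    return fb

# Example warp: Greenwood-style frequency map (constants are illustrative)
warp = lambda f: np.log10(f / 165.4 + 1)
inv_warp = lambda w: 165.4 * (10**w - 1)
fb = warped_filterbank(n_filters=12, n_fft=512, fs=16000, warp=warp, inv_warp=inv_warp)
```

Swapping in a warp fitted to a species' audiogram, plus an equal-loudness weighting, is what turns this generic filterbank into a species-specific gPLP front end.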

10.
In this study, the problem of sparse enrollment data for in-set versus out-of-set speaker recognition is addressed. The challenge here is that both the training speaker data (5 s) and the test material (2-6 s) are of limited duration. The limited enrollment data result in a sparse acoustic model space for the desired speaker model. The focus of this study is on filling these acoustic holes by harvesting neighbor speaker information to leverage overall system performance. Acoustically similar speakers are selected from a separate available corpus via three different methods of speaker similarity measurement. The selected data from these similar acoustic speakers are exploited to fill the lack of phone coverage caused by the original sparse enrollment data. The proposed speaker modeling process mimics the naturally distributed acoustic space of conversational speech. The Gaussian mixture model (GMM) tagging process allows simulated natural conversational speech to be included for in-set speaker modeling, which maintains the original system requirement of text-independent speaker recognition. A human listener evaluation is also performed to compare machine versus human speaker recognition performance, with machine accuracy of 95% compared to 72.2% for human in-set/out-of-set decisions. Results show that for extremely sparse train/reference audio streams, human speaker recognition is not nearly as reliable as machine-based speaker recognition. The proposed acoustic hole filling solution (MRNC) produces an average 7.42% relative improvement over a GMM-Cohort UBM baseline and a 19% relative improvement over the Eigenvoice baseline using the FISHER corpus.
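The in-set/out-of-set decision described above is typically a likelihood-ratio test of a speaker model against a background model. The sketch below uses single diagonal Gaussians as stand-ins for the GMMs, with invented toy "frames"; it shows the scoring logic only, not the paper's hole-filling (MRNC) method.

```python
import numpy as np

def gaussian_loglik(X, mu, var):
    """Per-frame log-likelihood under a diagonal Gaussian
    (a one-component stand-in for a GMM speaker model)."""
    return -0.5 * np.sum(np.log(2*np.pi*var) + (X - mu)**2 / var, axis=1)

def inset_score(test_X, speaker_model, background_model):
    """In-set/out-of-set score: speaker log-likelihood minus a UBM-like
    background log-likelihood, averaged over test frames."""
    (mu_s, var_s), (mu_b, var_b) = speaker_model, background_model
    return np.mean(gaussian_loglik(test_X, mu_s, var_s)
                   - gaussian_loglik(test_X, mu_b, var_b))

rng = np.random.default_rng(3)
enroll = rng.normal([1.0, -1.0], 0.5, (50, 2))        # sparse enrollment frames
background = rng.normal([0.0, 0.0], 2.0, (500, 2))    # pooled background frames
spk = (enroll.mean(0), enroll.var(0) + 1e-3)
ubm = (background.mean(0), background.var(0) + 1e-3)

inset_test = rng.normal([1.0, -1.0], 0.5, (30, 2))    # same "speaker"
outset_test = rng.normal([-2.0, 2.0], 0.5, (30, 2))   # unseen "speaker"
s_in = inset_score(inset_test, spk, ubm)
s_out = inset_score(outset_test, spk, ubm)
```

A positive score accepts the claimed in-set speaker, a negative one rejects; harvesting neighbor-speaker data, as the study proposes, effectively makes the speaker model's covered region less sparse before this test is applied.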

11.
蒿晓阳  张鹏远 《声学学报》2022,47(3):405-416
Two common approaches to multi-speaker speech synthesis are parameter adaptation and adding speaker labels. A model obtained by parameter adaptation can synthesize speech only for the speakers to which it has been adapted, so it is not robust. The traditional speaker-label approach obtains the speaker information of the speech in a supervised way and does not learn speaker labels from the speech signal itself without supervision. To address these problems, an autoregressive multi-speaker speech synthesis method based on a variational autoencoder is proposed. The method first uses a variational autoencoder to learn speaker information without supervision and encode it implicitly as a speaker label, which is then fed, together with the linguistic features of the text, into an autoregressive acoustic parameter prediction network. In addition, to suppress the F0-prediction overfitting caused by multi-speaker speech data, the acoustic parameter network adopts multi-task learning of the fundamental frequency. Preliminary experiments show that adding the autoregressive structure reduces the spectral error by 1.018 dB, and F0 multi-task learning reduces the F0 root-mean-square error by 6.861 Hz. In subsequent multi-speaker comparison experiments, the proposed method reaches mean opinion scores (MOS) of 3.71, 3.55, and 3.15 in three multi-speaker experiments, with pinyin error rates of 6.71%, 7.54%, and 9.87%, improving the quality of multi-speaker synthesized speech.
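Two pieces of textbook VAE machinery underlie the unsupervised speaker encoding described above: the reparameterization trick (so the sampled latent code stays differentiable) and the KL regularizer that shapes the latent space. The numpy sketch below shows these two computations only; it is not the authors' network.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """VAE reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
    so sampling the latent speaker code remains differentiable in mu, logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL(q(z|x) || N(0, I)) -- the regularizer that shapes the
    latent speaker space of a VAE."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

rng = np.random.default_rng(4)
mu, logvar = np.array([0.5, -0.2]), np.array([-1.0, -1.0])   # toy encoder outputs
z = reparameterize(mu, logvar, rng)          # implicit "speaker label"
kl = kl_to_standard_normal(mu, logvar)
```

In the proposed system, a code like `z` plays the role of the speaker label and is concatenated with the text's linguistic features at the input of the autoregressive acoustic model.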

12.
俞一彪  王朔中 《声学学报》2005,30(6):536-541
A full feature vector set model for text-independent speaker recognition and a mutual information evaluation method are proposed. The model is formed by clustering a speaker's speech data in feature space and comprehensively reflects the individual characteristics of the speaker's voice. For likelihood computation and decision, a mutual information evaluation method is proposed that jointly analyzes likelihoods in the distance space and the information space and makes the recognition decision under a maximum mutual information criterion. Experiments analyzed the speaker recognition performance of the full feature vector set model and the mutual information evaluation algorithm with both linear prediction cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC), and compared them with a Gaussian mixture model. The results show that the full feature vector set model and the mutual information evaluation algorithm fully reflect speaker characteristics, effectively assess the similarity of speakers' voices, and achieve good recognition performance.
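The "cluster a speaker's frames in feature space, then score test frames against the resulting model" idea can be sketched with k-means codebooks and a distance-space score (a vector-quantization-style stand-in; the paper's mutual information evaluation is not reproduced here, and all data are toy).

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Cluster a speaker's feature frames into a small codebook -- a sketch
    of forming a model by clustering in feature space."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        assign = d.argmin(1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(0)
    return centers

def distortion(X, centers):
    """Average distance from test frames to the nearest codeword;
    the claimed speaker is the model with the smallest distortion."""
    return np.linalg.norm(X[:, None] - centers[None], axis=2).min(1).mean()

rng = np.random.default_rng(5)
spk_a = rng.normal([0, 0], 0.4, (100, 2))     # toy "cepstral" frames, speaker A
spk_b = rng.normal([2, 2], 0.4, (100, 2))     # speaker B
model_a, model_b = kmeans(spk_a, 4), kmeans(spk_b, 4)
test_a = rng.normal([0, 0], 0.4, (40, 2))     # unseen utterance from speaker A
```

The paper's contribution is to combine such a distance-space likelihood with an information-space one under a maximum mutual information decision rule; the sketch shows only the distance half.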

13.
To address the degraded robustness of missing-data feature methods in low-SNR speaker recognition, a missing-data feature extraction method based on perceptual auditory scene analysis is proposed. First, the missing-data feature spectrum of the speech is computed, and the perceptual speech content is derived from the perceptual properties of speech. After perceptually motivated speech enhancement of the noisy speech and two-dimensional enhancement of its spectrogram, the speech distribution is estimated, and a perceptual auditory factor is extracted by combining the perceptual speech content with a missing-intensity parameter. Combined with the missing-data feature spectrum, the feature extraction process is then decomposed into different auditory scenes that are analyzed and processed separately, enhancing the robustness of the speaker recognition system. Experimental results show that, in low-SNR environments from -10 dB to 10 dB and for four different noise types, the proposed method is more robust than five reference methods, improving the average recognition rate by 26.0%, 19.6%, 12.7%, 4.6%, and 6.5%, respectively. The proposed method searches for robust speech features in the time-frequency domain and is well suited to speaker recognition in low-SNR environments.
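The starting point of any missing-data feature method is a time-frequency reliability mask. The sketch below computes a standard SNR-thresholded binary mask on a toy spectrogram; the threshold and the assumed-known noise floor are illustrative, and the paper's perceptual auditory factors are not modeled here.

```python
import numpy as np

def missing_data_mask(noisy_spec, noise_est, theta_db=0.0):
    """Binary missing-data mask: keep time-frequency cells whose local SNR
    exceeds a threshold; the rest are marked unreliable ('missing')."""
    snr = 10 * np.log10(np.maximum(noisy_spec, 1e-12)
                        / np.maximum(noise_est, 1e-12))
    return (snr > theta_db).astype(float)

# Toy power spectrogram: a speech-dominated band over a flat noise floor
rng = np.random.default_rng(6)
noise = np.full((64, 50), 1.0)                  # assumed-known noise estimate
spec = noise + rng.uniform(0, 0.1, (64, 50))
spec[20:30, :] += 20.0                          # "speech-dominated" band
mask = missing_data_mask(spec, noise, theta_db=3.0)
```

Downstream, recognition marginalizes over (or reconstructs) the zero-valued cells; the proposed method's contribution is to make this mask and its processing depend on the detected auditory scene.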

14.
Electrical stimulation of the midbrain was used to elicit a variety of vocalizations from six anesthetized dogs. This study was conducted to investigate the ranges of, and relationships between, the fundamental frequency of the vocalizations (F0) and the tracheal pressure (Pt) produced during the vocalizations. The vocalizations were described according to type (growl, howl, and whine); F0 and Pt, as well as patterns of laryngeal muscle activity, were examined for each vocalization type. Natural-sounding growl and howl vocalizations were elicited from five dogs; three dogs also produced whines. With few exceptions, F0 was categorically different for the three vocalization types (low for growls, intermediate for howls, very high for whines). Pt values overlapped for the three vocalization types, although, on average, howls were produced with greater Pt than growls. Patterns and degrees of laryngeal muscle activity varied across and within vocalization types, but the general findings were consistent with the presumed function of most of the muscles. Laryngeal muscle activity may help explain some of the variability in the acoustic and aerodynamic data.

15.
万伊  杨飞然  杨军 《应用声学》2023,42(1):26-33
Automatic speaker verification is a common means of authenticating a target speaker's identity, but it is vulnerable to attacks using synthesized speech; synthetic speech detection systems attempt to solve this problem. This paper proposes a synthetic speech detection method based on a Transformer encoder, using the self-attention mechanism to learn long-term dependencies within the input features. Synthetic speech detection does not depend on abstract semantic features of the sentence, so good detection performance can be obtained even with a model of modest size. Four commonly used synthetic-speech-detection features were tested with the Transformer encoder. On the logical access dataset of the international ASVspoof2019 challenge, the system based on linear frequency cepstral coefficient features and the Transformer encoder achieved an equal error rate of 3.13% and a tandem detection cost function of 0.0708, with only 0.082 M model parameters, giving good detection performance at a small parameter count.
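The self-attention mechanism at the core of the Transformer encoder can be sketched in a few lines of numpy: each output frame is a softmax-weighted mixture of all input frames, which is how long-term dependencies across the feature sequence are captured. The single head, random weights, and frame/dimension sizes below are toy assumptions, not the paper's configuration.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence of
    feature frames (the core operation of a Transformer encoder layer)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])          # (frames, frames)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)               # softmax over frames
    return w @ V, w                                 # mixed frames, attention map

rng = np.random.default_rng(7)
T, d = 10, 8                                        # 10 feature frames (e.g. LFCC), dim 8
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.3 for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
```

A full encoder stacks this with residual connections, layer normalization, and a position-wise feed-forward network; a pooled encoder output then feeds the bona-fide/spoof classifier.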

16.
The ability to identify delphinid vocalizations to species in real-time would be an asset during shipboard surveys. An automated system, Real-time Odontocete Call Classification Algorithm (ROCCA), is being developed to allow real-time acoustic species identification in the field. This Matlab-based tool automatically extracts ten variables (beginning, end, minimum and maximum frequencies, duration, slope of the beginning and end sweep, number of inflection points, number of steps, and presence/absence of harmonics) from whistles selected from a real-time scrolling spectrograph (ISHMAEL). It uses classification and regression tree analysis (CART) and discriminant function analysis (DFA) to identify whistles to species. Schools are classified based on running tallies of individual whistle classifications. Overall, 46% of schools were correctly classified for seven species and one genus (Tursiops truncatus, Stenella attenuata, S. longirostris, S. coeruleoalba, Steno bredanensis, Delphinus species, Pseudorca crassidens, and Globicephala macrorhynchus), with correct classification as high as 80% for some species. If classification success can be increased, this tool will provide a method for identifying schools that are difficult to approach and observe, will allow species distribution data to be collected when visual efforts are compromised, and will reduce the time necessary for post-cruise data analysis.
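Several of the whistle variables listed above can be computed directly from a frequency contour. The sketch below extracts a few of them (begin/end/min/max frequency, duration, inflection-point count) from a toy contour; it is an illustration of the variable definitions, not ROCCA's Matlab implementation.

```python
import numpy as np

def whistle_variables(contour, dt):
    """A few ROCCA-style whistle variables from a frequency contour (Hz),
    sampled every dt seconds."""
    d2 = np.diff(np.sign(np.diff(contour)))
    inflections = int(np.count_nonzero(d2))        # slope sign changes
    return {
        "begin": contour[0], "end": contour[-1],
        "min": contour.min(), "max": contour.max(),
        "duration": (len(contour) - 1) * dt,
        "inflections": inflections,
    }

# Toy whistle contour: one up-down-up cycle around 9 kHz, 10 ms hop
t = np.linspace(0, 1, 101)
contour = 9000 + 2000 * np.sin(2 * np.pi * t)      # two inflection points
v = whistle_variables(contour, dt=0.01)
```

Per-whistle vectors like `v` are what the CART and DFA classifiers operate on, with school labels formed from running tallies of the individual whistle decisions.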

17.
Spectro-temporal modulations of speech encode speech structures and speaker characteristics. An algorithm which distinguishes speech from non-speech based on spectro-temporal modulation energies is proposed and evaluated in robust text-independent closed-set speaker identification simulations using the TIMIT and GRID corpora. Simulation results show the proposed method produces much higher speaker identification rates in all signal-to-noise ratio (SNR) conditions than the baseline system using mel-frequency cepstral coefficients. In addition, the proposed method also outperforms the system, which uses auditory-based nonnegative tensor cepstral coefficients [Q. Wu and L. Zhang, "Auditory sparse representation for robust speaker recognition based on tensor structure," EURASIP J. Audio, Speech, Music Process. 2008, 578612 (2008)], in low SNR (≤ 10 dB) conditions.
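Spectro-temporal modulation energies of the kind used above can be sketched as the magnitude of the 2-D Fourier transform of a log-spectrogram: speech concentrates energy at low spectral and temporal modulation rates, while broadband noise does not. The toy "speech-like" pattern and the low-modulation-region size below are illustrative assumptions.

```python
import numpy as np

def modulation_energy(spectrogram):
    """Spectro-temporal modulation magnitudes: 2-D FFT of the mean-removed
    log-spectrogram (axis 0 = frequency bands, axis 1 = time frames)."""
    S = np.log(np.maximum(spectrogram, 1e-12))
    S = S - S.mean()
    return np.abs(np.fft.fft2(S))

def low_mod_fraction(E, kf=5, kt=5):
    """Fraction of modulation magnitude in the lowest modulation bins
    (both signs of each modulation axis)."""
    low = (E[:kf, :kt].sum() + E[-kf:, :kt].sum()
           + E[:kf, -kt:].sum() + E[-kf:, -kt:].sum())
    return low / E.sum()

# Toy comparison: slowly modulated "speech-like" pattern vs broadband noise
rng = np.random.default_rng(8)
f = np.arange(64)[:, None]; t = np.arange(100)[None, :]
speechlike = np.exp(np.cos(2*np.pi*f/32) + np.cos(2*np.pi*t/25))
noise = rng.uniform(0.5, 1.5, (64, 100))
E_sp, E_no = modulation_energy(speechlike), modulation_energy(noise)
```

A speech/non-speech detector can threshold a statistic like `low_mod_fraction`, keeping only speech-dominated segments for the downstream speaker identifier.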

18.
Vocal fundamental frequency (Fo) characteristics were sampled for a group of seven young children. The children were followed longitudinally for a 12-month period, spanning preword, single-word, and multiword vocalizations. The Fo characteristics were analyzed with reference to chronological age, vocalization length, and lexicon size. Measures of average Fo and Fo variability changed little during the 12-month period for each child. A rising-falling intonation contour was the most prevalent Fo contour among the children. In general, the influence of vocalization length and language acquisition on measures of Fo was negligible. It is suggested that relative uniformity in vocal Fo exists in early vocalizations across preword and meaningful speech periods.

19.
A new method of voice conversion in a cepstrum eigenspace, based on a structured Gaussian mixture model, is proposed for non-parallel corpora without joint training. For each speaker, cepstral features are extracted from the speech and mapped to the eigenspace formed by the eigenvectors of their scatter matrix, and the Structured Gaussian Mixture Model in the EigenSpace (SGMM-ES) is then trained. The source and target speakers' SGMM-ES models are matched according to the Acoustic Universal Structure (AUS) principle to obtain the spectrum transform function. Experimental results show that the speaker identification rate of the converted speech reaches 95.25%, and the average cepstral distortion is 1.25, improvements of 0.8% and 7.3%, respectively, over the SGMM method. ABX and MOS evaluations indicate that the conversion performance is very close to that of the traditional method under parallel-corpus conditions. These results show that the eigenspace-based structured Gaussian mixture model is effective for voice conversion with non-parallel corpora.
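The eigenspace mapping step described above can be sketched directly: form the scatter matrix of the mean-removed cepstral frames, take its leading eigenvectors, and project the frames onto them. The toy "cepstrum" data and the choice of two retained dimensions are illustrative; the SGMM training and AUS matching are not shown.

```python
import numpy as np

def eigenspace(X, k=2):
    """Map feature frames into the eigenspace spanned by the leading
    eigenvectors of their scatter matrix."""
    mu = X.mean(0)
    Xc = X - mu
    scatter = Xc.T @ Xc                          # (dim, dim) scatter matrix
    vals, vecs = np.linalg.eigh(scatter)         # ascending eigenvalues
    basis = vecs[:, ::-1][:, :k]                 # top-k eigenvectors
    return Xc @ basis, basis, mu

rng = np.random.default_rng(9)
# Toy "cepstral" frames with most variance along the first coordinate
X = rng.standard_normal((200, 6)) * np.array([3.0, 1.0, 0.3, 0.3, 0.1, 0.1])
Z, basis, mu = eigenspace(X, k=2)
```

In the proposed method, each speaker's SGMM-ES is trained on projections like `Z`, and conversion works by matching mixture structure between the two speakers' eigenspaces rather than by time-aligning parallel utterances.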

20.
戴明扬  徐柏龄 《应用声学》2001,20(6):6-12,44
Based on a model of human auditory perception, this paper proposes a robust method for extracting speaker features. In this method, an auditory correlogram characterizing auditory-nerve activity is first obtained from a gammatone auditory filterbank and the Meddis inner-hair-cell firing model. Motivated by the phase-locking and two-tone suppression properties of auditory-nerve firing, the frequency component with the largest amplitude in each band of the correlogram is taken as the feature for that band, so that the features of all bands form the feature vector of the current speech segment. A DCT is then applied to remove the correlation between the feature components and compress the dimensionality of the feature vector. Validation experiments show that this feature vector essentially captures the spectral envelope of the input speech. Noise-robustness experiments show that, under Gaussian white noise and car noise, these features suffer less relative distortion than LPCC and MFCC. Text-independent speaker identification based on vector quantization shows that, for three types of noise interference, these features achieve good recognition results at low SNRs.
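The "dominant frequency per band, then DCT decorrelation" idea can be sketched with an FFT-based band analysis standing in for the gammatone/Meddis correlogram front end. The band edges, toy signal, and the unnormalized DCT-II matrix below are illustrative assumptions.

```python
import numpy as np

def dominant_band_features(x, fs, band_edges):
    """For each band, take the frequency of the largest spectral component
    as that band's feature (a simplified stand-in for the correlogram
    analysis), then decorrelate the feature vector with a DCT."""
    mag = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    freqs = np.fft.rfftfreq(len(x), 1/fs)
    feats = []
    for lo, hi in band_edges:
        m = (freqs >= lo) & (freqs < hi)
        feats.append(freqs[m][mag[m].argmax()])   # dominant frequency in band
    feats = np.array(feats)
    n = len(feats)
    D = np.cos(np.pi * np.outer(np.arange(n), np.arange(n) + 0.5) / n)  # DCT-II
    return feats, D @ feats

fs = 8000
t = np.arange(1024) / fs
x = np.sin(2*np.pi*300*t) + 0.5*np.sin(2*np.pi*1200*t)
bands = [(0, 500), (500, 1000), (1000, 2000), (2000, 4000)]
peaks, feat = dominant_band_features(x, fs, bands)
```

Keeping only each band's dominant component discards low-level noise energy within the band, which is the intuition behind the reported robustness to white and car noise.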
