Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
Feature selection and feature extraction are the most important steps in classification and regression systems. Feature selection is commonly used to reduce the dimensionality of datasets with tens or hundreds of thousands of features, which would otherwise be impractical to process. A recent example is a quantitative structure–activity relationship (QSAR) dataset with 1226 features. A major problem in QSAR is the high dimensionality of the feature space; feature selection is therefore the most important step in this study. This paper presents a novel feature selection algorithm based on entropy. The performance of the proposed algorithm is compared with that of a genetic algorithm method and a stepwise regression method. Using a multiple linear regression model, the root mean square errors of prediction in the QSAR study for the training set and test set were 0.3433 and 0.3591 (entropy), 0.5500 and 0.4326 (genetic algorithm), and 0.6373 and 0.6672 (stepwise regression), respectively.

2.
Raman spectroscopy has the potential to significantly aid the research and diagnosis of cancer. The information-dense, complex spectra generate massive datasets in which subtle correlations may provide critical clues for biological analysis and pathological classification. Implementing advanced data mining techniques is therefore imperative for complete, rapid and accurate spectral processing. Numerous recent studies have applied various data mining methods to Raman spectra for classification and biochemical analysis. However, because Raman datasets from biological specimens are often characterized by high dimensionality and low sample numbers, many of these classification models are subject to overfitting. Furthermore, attempts to reduce dimensionality result in transformed feature spaces, which makes the biological evaluation of significant and discriminative spectral features problematic. We have developed a novel data mining framework optimized for Raman datasets, called Fisher-based Feature Selection Support Vector Machines (FFS-SVM). This framework provides simultaneous supervised classification and user-defined Fisher criterion-based feature selection, reducing overfitting and directly yielding significant wavenumbers from the original feature space. Herein, we investigate five cancerous and non-cancerous breast cell lines using Raman microspectroscopy and our FFS-SVM framework. The classification performance of our framework is then compared with that of several other frequently employed classification methods on four classification tasks. The four tasks were constructed by an unsupervised clustering method that yielded the four categories of cell line groupings studied (e.g. cancer vs non-cancer). FFS-SVM achieves both high classification accuracy and the extraction of biologically significant features. The top ten most discriminative features are discussed in terms of cell-type-specific biological relevance. Our framework provides comprehensive cellular-level characterization and could potentially lead to the discovery of cancer biomarker-type information, which we have informally termed 'Raman-based spectral biomarkers'. The FFS-SVM framework, along with Raman spectroscopy, will be used in future studies to investigate in-situ dynamic biological phenomena. Copyright © 2013 John Wiley & Sons, Ltd.
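To make the Fisher criterion mentioned in this abstract concrete, here is a minimal sketch of two-class Fisher-score feature ranking. It is not the authors' FFS-SVM implementation: the function name `fisher_scores`, the score definition (squared mean difference over summed class variances) and the toy data are assumptions chosen for illustration.

```python
from statistics import mean, pvariance


def fisher_scores(class_a, class_b):
    """Return one Fisher score per feature for a two-class problem.

    class_a, class_b: lists of samples, each sample a list of feature values.
    Score = (mean_a - mean_b)^2 / (var_a + var_b); higher is more discriminative.
    """
    scores = []
    # zip(*samples) transposes the sample list into per-feature columns.
    for col_a, col_b in zip(zip(*class_a), zip(*class_b)):
        between = (mean(col_a) - mean(col_b)) ** 2
        within = pvariance(col_a) + pvariance(col_b)
        scores.append(between / within if within > 0 else float("inf"))
    return scores


# Hypothetical toy "spectra": feature 0 separates the classes, feature 1 does not.
cancer = [[5.0, 1.0], [5.2, 0.8], [4.8, 1.2]]
normal = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0]]

scores = fisher_scores(cancer, normal)
# Feature indices ordered from most to least discriminative.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Because the score is computed per original feature, the top-ranked indices map directly back to wavenumbers, which is the property the abstract highlights over transformed feature spaces.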

3.
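Near-infrared (NIR) technology is widely used for monitoring production processes and product quality in the food and pharmaceutical industries, among others. Its advantages include no sample pretreatment, low cost, non-destructive measurement and fast determination. However, full-spectrum data are high-dimensional and contain much redundant information, so using them directly for modeling leads to high model complexity and poor stability. siPLS is the most common dimensionality reduction method for spectral data, but it has difficulty handling collinearity in the spectra. LASSO is a relatively new dimensionality reduction method, but it is unstable in small-sample applications. To address the problems of siPLS and LASSO when applied to NIR spectral data, an NIR characteristic wavelength selection method based on siPLS-LASSO is proposed and applied to pH monitoring in the solid-state fermentation of straw feed protein. The method first uses the siPLS algorithm to select the optimal combination of spectral subintervals; it then applies the LASSO algorithm to the selected combined subintervals to choose characteristic wavelengths, and a PLS calibration model is built on that basis. The siPLS-LASSO method is also compared with other traditional characteristic wavelength selection methods. The results show that the PLS model built on the 33 characteristic wavelengths selected by siPLS-LASSO gives better predictions, with a root mean square error of prediction (RMSEP) of 0.0711 and a correlation coefficient (Rp) of 0.9808; the proposed siPLS-LASSO method effectively selects characteristic wavelengths and improves the predictive performance of the model.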

4.
This paper investigates the possibility of obtaining accurate parameter estimates for selected peaks in the presence of unknown or uninteresting spectral features in biomedical magnetic resonance spectroscopy (MRS) signals. This problem is referred to as frequency-selective parameter estimation. A new time-domain technique based on maximum-phase finite impulse response (FIR) filters is presented. The proposed method is compared with a number of existing approaches: the application of a weighting function in the time domain, frequency-domain fitting using a polynomial baseline, and the time-domain HSVD filter method. The ease of use and low computational complexity of the FIR filter method make it an attractive approach for frequency-selective parameter estimation. The methods are validated using simulations of relevant ¹³C and ³¹P MRS examples.

5.
With the widespread use of intelligent information systems, massive amounts of data containing many irrelevant, noisy and redundant features are collected; moreover, a large number of features must be handled. Introducing an efficient feature selection (FS) approach is therefore a challenging goal. Over the past decade, various artificial methods and swarm models inspired by biological and social systems have been proposed to solve different problems, including FS. In this paper, an innovative approach is proposed based on a hybrid of two intelligent algorithms, electric fish optimization (EFO) and the arithmetic optimization algorithm (AOA), which boosts the exploration stage of EFO so that it can handle high-dimensional FS problems with a remarkable convergence speed. The proposed EFOAOA is examined on eighteen datasets from different real-life applications. The EFOAOA results are compared with those of a set of recent state-of-the-art optimizers using a set of statistical metrics and the Friedman test. The comparisons show the positive impact of integrating the AOA operator into EFO: the proposed EFOAOA identifies the most important features with high accuracy and efficiency. Compared with the other FS methods, it obtained the smallest number of features and the highest accuracy in 50% and 67% of the datasets, respectively.

6.
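A new method is proposed for rapidly discriminating the storage period of rice using near-infrared spectroscopy combined with manifold dimensionality reduction. Reflectance spectra of aged rice and fresh rice were acquired with a near-infrared spectrometer. The raw spectra were preprocessed by direct orthogonal signal correction (DOSC), which filters out signals unrelated to the dependent-variable matrix Y and thus removes the influence of irrelevant information on the accuracy of subsequent modeling. The Durbin-Watson test and the runs test were used to analyze the nonlinearity of the spectral data structure qualitatively, and the augmented partial residual plot was used to quantify the degree of nonlinearity of the rice spectra. Linear manifold dimensionality reduction methods, namely principal component analysis (PCA) and multidimensional scaling (MDS), and nonlinear manifold dimensionality reduction methods, namely isometric mapping (ISOMAP), locally linear embedding (LLE) and Laplacian eigenmaps (LE), were used to extract intrinsic variables from the preprocessed spectra. These were combined with kernel partial least squares (KPLS) to build a coupled model between the intrinsic variables and the storage time attribute. The experiment used 200 samples each of aged and fresh rice, randomly divided into a training set of 300 samples and a test set of 100 samples. A comparison of the prediction results of the models shows that the regression model built on 40 intrinsic variables extracted by the nonlinear ISOMAP method performs best, with a prediction coefficient of determination (R2P) of 0.917, a root mean square error of prediction (RMSEP) of 0.187 and a residual predictive deviation (RPD) of 2.698. The results indicate that the proposed method discriminates rice storage period well, and the study provides a scientific means for rapid, non-destructive detection of rice storage period.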
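Several abstracts in this list report RMSEP and RPD. The sketch below shows how these two figures are commonly computed from reference and predicted values. It assumes RPD is the sample standard deviation of the reference values divided by RMSEP; definitions vary between papers, and the toy values are invented for illustration.

```python
from math import sqrt
from statistics import stdev


def rmsep(y_true, y_pred):
    """Root mean square error of prediction over a test set."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(y_true))


def rpd(y_true, y_pred):
    """Assumed definition: sample SD of reference values divided by RMSEP."""
    return stdev(y_true) / rmsep(y_true, y_pred)


# Hypothetical reference values and model predictions.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]

error = rmsep(y_true, y_pred)
ratio = rpd(y_true, y_pred)
```

A larger RPD means the prediction error is small relative to the natural spread of the reference values, which is why it is reported alongside RMSEP.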

7.
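Theoretical and experimental study of holography based on the combined wave superposition method   Cited: 2 (self-citations: 0; citations by others: 2)
Li Weibing (李卫兵), Chen Jian (陈剑), Bi Chuanxing (毕传兴), Chen Xinzhao (陈心昭). Acta Physica Sinica (《物理学报》), 2006, 55(3): 1264-1270
When several coherent sound sources are present in a sound field at the same time, conventional near-field acoustic holography cannot reconstruct the acoustic information on the surface of each coherent source, nor can it predict the sound field produced by each source alone. Holographic reconstruction and prediction of coherent sound fields has therefore become an urgent problem in the wider application of holography. Building on the previously proposed combined wave superposition method and its application to sound field transformation, this paper studies the method experimentally. Holographic reconstruction and prediction of a real coherent sound field verify the limitations of the conventional wave superposition method in coherent sound field reconstruction, and the feasibility and accuracy of the combined wave superposition method in the holographic reconstruction and prediction of coherent sound fields. The effectiveness of Tikhonov regularization in suppressing the ill-posedness of the acoustic inverse problem, and the feasibility of the principle for choosing the filter coefficient, are also studied with a view to improving the accuracy of holographic reconstruction and prediction. Keywords: near-field acoustic holography; combined wave superposition; coherent sound field; Tikhonov regularization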

8.
One of the central problems in computational biology is the automated identification of protein function. A key step towards this is predicting the subcellular location to which a protein belongs, since protein localization correlates closely with function. A wide variety of methods for predicting protein subcellular localization have been proposed in recent years. Linear dimensionality reduction (DR) methods have been introduced to address the high-dimensionality problem by transforming the representation of protein sequences. However, this approach is not suitable for some complex biological systems that have nonlinear characteristics. Herein, we use nonlinear DR methods such as the kernel DR method to capture the nonlinear characteristics of a high-dimensional space. The K-nearest-neighbor (K-NN) classifier is then employed to identify the subcellular localization of Gram-negative bacterial proteins based on their reduced low-dimensional features. The experimental results are quite encouraging, indicating that the applied nonlinear DR method is effective in dealing with the complicated problem of predicting the subcellular localization of Gram-negative bacterial proteins. An online web server for predicting subcellular location of Gram-negative bacterial proteins is available at .

9.
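Hyperspectral inversion of soil potassium content using a hybrid random forest   Cited: 1 (self-citations: 0; citations by others: 1)
Mining key features from the spectra of soil available potassium is difficult, which leads to low prediction accuracy in hyperspectral inversion models. To address this problem, a hybrid random forest feature selection algorithm is proposed. A wrapper feature selection method is first used for preselection, quickly removing redundancy while retaining relevant features. An improved random forest feature selection algorithm then refines the preselected features. By increasing the separation between key features and redundant features and by selecting features iteratively, the refined features become more robust and more discriminative, which largely resolves the low accuracy of hyperspectral inversion models for soil available potassium. To verify the effectiveness of the proposed algorithm, 124 representative soil samples from the Dagu River basin in Qingdao were used. The algorithm selected an optimal spectral subset of 13 sensitive bands from 2051 original bands to build an inversion model for soil available potassium, and the model was compared with models built using existing feature selection algorithms. The results show that the regression model built with this algorithm has a low root mean square error of prediction (RMSEP, 9.6615), a high correlation coefficient (0.9369) and a high residual predictive deviation (RPD, 2.14). The hybrid random forest feature selection algorithm achieves good prediction with a small number of characteristic wavelengths and can provide a theoretical basis for the design of real-time spectral sensors for soil nutrients.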

10.
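Li Yongjun (李勇军), Yin Chao (尹超), Yu Hui (于会), Liu Zun (刘尊). Acta Physica Sinica (《物理学报》), 2016, 65(2): 020501
Microblogging is a real-time, media-like social platform for sharing information, built on follow relationships between users. Information diffusion on microblogs is fast, bursty and time-sensitive. Understanding how information spreads and predicting retweet behavior is important for studying how public opinion forms on microblogs, how products are promoted, and so on. This paper analyzes microblog retweet records to study the factors or features that affect retweeting, abstracts the retweet prediction problem as a link prediction problem, and proposes a link prediction algorithm based on the maximum entropy model. Validation on real cases shows that: 1) the algorithm based on the maximum entropy model has a clear advantage in running time; 2) in terms of prediction results, the maximum entropy model outperforms other comparable algorithms; 3) the predictions of the maximum entropy model remain stable when the training set size and the number of features change. When predicting links, the method avoids the constraint that features be mutually independent, and its accuracy is better than that of comparable methods. It can serve as a reference for solving other types of prediction problems in complex networks.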

11.
Multi-label learning aims to learn functions that assign each sample its true set of labels. As data accumulate, feature dimensionality keeps increasing. However, high-dimensional information may contain noisy data, which makes multi-label learning difficult. Feature selection is a technique that can effectively reduce the data dimension. In feature selection research, multi-objective optimization algorithms have shown excellent global optimization performance, and the Pareto relationship handles contradictory objectives in multi-objective problems well. Therefore, a Shapley value-fused feature selection algorithm for multi-label learning (SHAPFS-ML) is proposed. The method takes multi-label criteria as the optimization objectives, and the proposed crossover and mutation operators based on the Shapley value help to identify relevant, redundant and irrelevant features. A comparison of experimental results on real-world datasets shows that SHAPFS-ML is an effective feature selection method for multi-label classification: it reduces the computational complexity of the classification algorithm and improves classification accuracy.

12.
Existing manifold learning algorithms use Euclidean distance to measure the proximity of data points. However, in high-dimensional space, Minkowski metrics are no longer stable because the ratio of the distances from a given query to its nearest and farthest neighbors is close to one. This degrades the performance of manifold learning algorithms when they are applied to dimensionality reduction of high-dimensional data. We introduce to manifold learning a new distance function named shrinkage-divergence-proximity (SDP), which is meaningful in any high-dimensional space. An improved locally linear embedding (LLE) algorithm named SDP-LLE is proposed in light of the theoretical result. Experiments are conducted on a hyperspectral data set and an image segmentation data set. The results show that the proposed method reduces dimensionality efficiently while achieving higher classification accuracy.
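The distance concentration effect described in this abstract can be observed with a short experiment. The sketch below is only an illustration of the phenomenon, not of the SDP distance itself; the function name `nearest_to_farthest_ratio`, the uniform random data, the sample size and the seed are all assumptions.

```python
import math
import random


def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of nearest to farthest Euclidean distance from a random query.

    Points and the query are drawn uniformly from the unit hypercube.
    A ratio near 1 means all points look almost equally far away.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return min(dists) / max(dists)


ratio_low = nearest_to_farthest_ratio(dim=2)      # low-dimensional case
ratio_high = nearest_to_farthest_ratio(dim=1000)  # high-dimensional case
```

In this setup the ratio in 1000 dimensions is much closer to one than in 2 dimensions, so nearest-neighbor relations based on Euclidean distance carry little contrast, which is the instability the abstract refers to.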

13.
In previous work, boosting aggregation of classifier outputs from discrete brain areas has been shown to reduce dimensionality and improve the robustness and accuracy of functional magnetic resonance imaging (fMRI) classification. However, dimensionality reduction and classification of mixed activation patterns of multiple classes remain challenging. In the present study, the goals were (a) to reduce dimensionality by combining feature reduction at the voxel level with backward elimination of optimally aggregated classifiers at the region level, (b) to compare region selection for spatially aggregated classification using boosting and partial least squares regression methods and (c) to resolve mixed activation patterns using probabilistic prediction of individual tasks. Brain activation maps from interleaved visual, motor, auditory and cognitive tasks were segmented into 144 functional regions. Feature selection reduced the number of feature voxels by more than 50%, leaving 95 regions. The two aggregation approaches further reduced the number of regions to 30, resulting in a reduction of more than 75% in classification time and misclassification rates of less than 3%. Boosting and partial least squares (PLS) were compared for selecting the most discriminative and the most task-correlated regions, respectively. Successful task prediction in mixed activation patterns was feasible within the first block of task activation in real-time fMRI experiments. This methodology is suitable for sparsifying activation patterns in real-time fMRI and for neurofeedback from distributed networks of brain activation.

14.
Human activity recognition (HAR) plays a vital role in real-world applications such as tracking activities in elderly care services, assisted living environments, smart home interactions, healthcare monitoring, electronic games and various human–computer interaction (HCI) applications, and it is an essential part of Internet of Healthcare Things (IoHT) services. However, the high dimensionality of the data collected from these applications has the largest influence on the quality of the HAR model. In this paper, we therefore propose an efficient HAR system that uses a lightweight feature selection (FS) method to enhance the HAR classification process. The developed FS method, called GBOGWO, aims to improve the performance of the gradient-based optimizer (GBO) algorithm by using the operators of the grey wolf optimizer (GWO). GBOGWO is first used to select the appropriate features; a support vector machine (SVM) is then used to classify the activities. To assess the performance of GBOGWO, extensive experiments were conducted on the well-known UCI-HAR and WISDM datasets. Overall, GBOGWO improved classification accuracy, reaching an average accuracy of 98%.

15.
End-point prediction is one of the most difficult problems in the basic oxygen furnace (BOF) steelmaking process. To address this problem, researchers have proposed methods based on flame image processing and pattern classification. Because the flame changes dynamically and the blowing process imposes real-time requirements, some issues remain unsolved. We propose a novel method based on accurate, fast extraction of multiple flame features and a general regression neural network (GRNN). First, flame images were acquired and the background of each image was removed with a color similarity determination algorithm. Second, color, texture and boundary features were extracted; the boundary and texture features were extracted quickly and robustly using the proposed methods, and their validity for end-point prediction was tested by comparing them with other similar methods. Finally, the prediction model was built using the multiple features and the GRNN. The experimental results demonstrate that the proposed method predicts the BOF end point accurately and quickly.

16.
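Band selection algorithm for hyperspectral remote sensing based on orthogonal projection divergence   Cited: 2 (self-citations: 0; citations by others: 2)
Because hyperspectral data are massive and high-dimensional, reducing their dimensionality has become an important problem in hyperspectral remote sensing research. Band selection algorithms effectively preserve the information in the original data, so they have clear advantages for hyperspectral dimensionality reduction and for subsequent remote sensing recognition and classification. This paper proposes a band selection method based on orthogonal projection divergence (OPD). The method inherits the characteristics of the orthogonal subspace projection (OSP) algorithm; by projecting the original data into the feature space...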

17.
With the rapid growth of the Internet, the curse of dimensionality caused by massive multi-label data has attracted extensive attention. Feature selection plays an indispensable role in dimensionality reduction, and many researchers have approached the subject from the perspective of information theory. Here, to evaluate feature relevance, we design a novel feature relevance term (FR) that employs three incremental information terms to consider three key aspects comprehensively: candidate features, selected features and label correlations. Examining all three aspects thoroughly makes it easier to capture the optimal features. Moreover, we employ label-related feature redundancy as the label-related feature redundancy term (LR) to reduce unnecessary redundancy. We therefore propose a multi-label feature selection method that integrates FR with LR, namely Feature Selection combining three types of Conditional Relevance (TCRFS). Numerous experiments indicate that TCRFS outperforms six other state-of-the-art multi-label approaches on 13 multi-label benchmark data sets from 4 domains.

18.
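Rank aggregation combines multiple ranking lists into a single consensus ranking and can be applied to recommender systems, link prediction, metasearch, proposal evaluation and other tasks. Existing work has reviewed and compared rank aggregation algorithms from different perspectives, but it has limitations: few types of algorithm are covered, the statistical properties of the data are unclear, and the evaluation metrics are not always reasonable. Each rank aggregation algorithm was claimed to outperform existing algorithms when it was proposed, but the methods compared, the test data and the application scenarios all differ, so in many cases it remains unclear which algorithm best suits a given task. Based on the Mallows model, this paper proposes a set of algorithms for generating different types of ranking lists with controllable statistical properties, uses a general evaluation metric applicable to different types of ranking lists, and presents nine rank aggregation algorithms together with their performance when aggregating a small number of long lists. The results show that heuristic methods, although simple, can approach or even surpass some optimization-based methods when the ranking lists are highly similar and relatively simple; that an increase in the number of ties in the lists reduces the consistency of the aggregated ranking and increases its fluctuation; and that the effect of the number of lists on aggregation quality is non-monotonic. Overall, the branch-and-bound method based on distance optimization (FAST) outperforms the other algorithms, performs very stably across different types of ranking lists, and handles the aggregation of a small number of long lists well.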
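As an illustration of the kind of simple heuristic rank aggregation this abstract refers to, the sketch below implements the classic Borda count. The abstract does not name its nine algorithms, so choosing Borda count here is an assumption; the function name `borda_aggregate` and the example lists are also hypothetical, and ties within an input list are not handled.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Aggregate full rankings (best item first) over the same items.

    Each list awards n-1 points to its first item, n-2 to the second, and so on.
    Items are returned by descending total points; ties in the total are broken
    alphabetically so the result is deterministic.
    """
    points = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            points[item] += n - 1 - position
    return sorted(points, key=lambda item: (-points[item], item))


# Three hypothetical input rankings of the same four items.
rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "c", "d"],
    ["a", "c", "b", "d"],
]
consensus = borda_aggregate(rankings)
```

Here the totals are a = 8, b = 6, c = 4 and d = 0, so the consensus order is a, b, c, d. Optimization-based methods instead search for the ranking that minimizes a total distance to the input lists, which is more expensive but can behave better when the lists disagree strongly.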

19.
Feature selection (FS) is a vital step in data mining and machine learning, especially when analyzing data in a high-dimensional feature space. Gene expression data usually consist of a few samples characterized by a high-dimensional feature space. As a result, they are not well suited to simple methods such as filter-based methods. In this study, we propose a novel feature selection algorithm based on the Explosion Gravitation Field Algorithm, called EGFAFS. To reduce the feature space to an acceptable number of dimensions, we constructed a recommended feature pool using a series of Random Forests based on the Gini index. By paying more attention to the features in the recommended feature pool, we can find the best subset more efficiently. To verify the performance of EGFAFS for FS, we tested it on eight gene expression datasets and compared it with four heuristic-based FS methods (GA, PSO, SA and DE) and four other FS methods (Boruta, HSICLasso, DNN-FS and EGSG). The results show that EGFAFS outperforms the other eight FS algorithms on gene expression data in terms of the evaluation metrics. The genes selected by EGFAFS play an essential role in the differential co-expression network and in some biological functions, which further demonstrates the success of EGFAFS in solving FS problems on gene expression data.
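The Gini index used by random forests to rank features can be sketched as follows. This is a generic illustration of Gini impurity and impurity decrease for a single split, not the authors' EGFAFS pipeline; the function names and the toy labels are assumptions.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity decrease obtained by splitting parent into left and right."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Hypothetical sample labels before splitting.
parent = ["tumor"] * 3 + ["normal"] * 3

# A split on an informative gene separates the classes perfectly.
perfect = gini_decrease(parent, ["tumor"] * 3, ["normal"] * 3)

# A split on an uninformative gene leaves both children mixed.
mixed = gini_decrease(
    parent,
    ["tumor", "normal", "tumor"],
    ["normal", "tumor", "normal"],
)
```

A forest typically accumulates such decreases over all splits that use a feature, and features with larger totals are ranked higher; a pool built from the top-ranked features is one plausible reading of the "recommended feature pool" in the abstract.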

20.
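Using X-ray fluorescence spectroscopy, a regression model based on a deep convolutional neural network is proposed for predicting the content of the heavy metal Zn in soil. The raw soil is pretreated, soil pellets are prepared by the pressed-powder method, and soil spectra are acquired by X-ray fluorescence (XRF). Compared with traditional detection methods, XRF offers fast detection, high precision and simple operation, does not damage the properties of the sample, and can achieve...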
