Similar Articles
20 similar articles retrieved (search time: 27 ms)
1.
Research on the selection of spectral preprocessing methods   Cited by: 1 (self-citations: 0, external citations: 1)
Spectral signals of complex samples are often disturbed by stray light, noise, baseline drift, and other factors, which degrade the final qualitative and quantitative results; the raw spectra therefore usually need to be preprocessed before modeling. Many preprocessing methods are available, and finding a suitable one is a thorny problem. One approach is to choose the method by visual inspection of the spectral signal; another is to choose it according to modeling performance (a trial-and-error strategy). The former requires no modeling and is more interpretable, but subjective judgment can sometimes lead to wrong choices; the latter requires no inspection of the spectra but must evaluate a large number of preprocessing methods, which is time-consuming for large datasets. It is therefore worth examining which selection strategy is more scientific and reasonable. In this study, nine datasets and 120 permutations and combinations of ten preprocessing methods were used to examine the necessity of preprocessing and the selection of preprocessing methods. First, the number of partial least squares (PLS) factors, the window parameters of the first derivative, second derivative, and Savitzky–Golay (SG) smoothing, and the wavelet function and decomposition scale of the continuous wavelet transform (CWT) were optimized. Then, no preprocessing and ten preprocessing methods — first derivative, second derivative, CWT, multiplicative scatter correction (MSC), standard normal variate (SNV), SG smoothing, mean centering, Pareto scaling, min–max normalization, and autoscaling — were arranged in the order background correction, scatter correction, smoothing, and scaling, yielding 120 preprocessing methods and combinations. Finally, all 120 schemes were applied to the different datasets and to different components of the same dataset, and the spectral characteristics and the root-mean-square error of prediction (RMSEP) of the resulting PLS models were analyzed. The results show that the best preprocessing method can be selected more accurately from the modeling performance for the predicted component than from visual inspection of the spectral signal. For most datasets, a suitable preprocessing method improves the modeling performance; for different datasets, the best preprocessing method differs because the information content and complexity of the datasets differ; and for the same dataset, even with identical spectra, the best preprocessing differs between components. Hence there is no universally optimal preprocessing method: the best method depends not only on the spectra but also on the component to be predicted. Classifying existing preprocessing methods by purpose and then enumerating their combinations is an effective way to select the best preprocessing method.
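As a minimal sketch of the trial-and-error strategy, the snippet below enumerates a small subset of such preprocessing combinations (SNV × SG first derivative only, not the full 120) and keeps the one with the lowest PLS RMSEP; the matrix X (samples × wavelengths), reference values y, and all parameter settings are illustrative assumptions.

```python
import numpy as np
from itertools import product
from scipy.signal import savgol_filter
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

def snv(X):
    # standard normal variate: per-spectrum centering and scaling
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def deriv1(X):
    # Savitzky-Golay first derivative (window and order are assumptions)
    return savgol_filter(X, window_length=11, polyorder=2, deriv=1, axis=1)

def rmsep(X, y, n_factors=8):
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
    pls = PLSRegression(n_components=n_factors).fit(Xtr, ytr)
    return float(np.sqrt(np.mean((pls.predict(Xte).ravel() - yte) ** 2)))

def best_pipeline(X, y):
    steps = {'snv': snv, 'deriv1': deriv1, 'none': lambda Z: Z}
    scores = {}
    for scatter, deriv in product(['snv', 'none'], ['deriv1', 'none']):
        scores[(scatter, deriv)] = rmsep(steps[deriv](steps[scatter](X)), y)
    return min(scores, key=scores.get), scores  # combination with lowest RMSEP
```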

2.
In NMR-based metabolomics data analysis, scaling is one of the key preprocessing steps; its main purpose is to adjust the variance structure of the data to improve the results of subsequent multivariate statistical analysis. From an information-entropy perspective, this work uses the Kullback-Leibler (K-L) divergence to measure the degree of difference between the 1H NMR spectral data of biological samples from different experimental groups and, combined with unit-variance scaling, proposes a K-L-divergence-based scaling method. The method first uses unit-variance scaling to bring the standard deviations of all variables to the same level, and then uses the K-L divergence to weight the variables in a supervised manner, enhancing important variables and suppressing irrelevant ones. Because the K-L divergence measures differences between data in the sense of probability distributions and applies to both Gaussian and non-Gaussian data, it can more accurately quantify the differences between the 1H NMR spectra of different experimental groups and thus more effectively identify and weight the important spectral variables. Analysis of 1H NMR spectra of human urine shows that the K-L-divergence-based scaling method effectively suppresses noise variables while clearly separating characteristic from non-characteristic variables, improves the discriminative ability of principal component regression (PCR) models, and improves the explanatory power, predictive ability, and characteristic-metabolite identification of partial least-squares discriminant analysis (PLS-DA) models.
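A minimal sketch of the described scaling, assuming two experimental groups and a per-variable Gaussian approximation for the K-L divergence (the paper's density estimate may differ):

```python
import numpy as np

def kl_scale(X, groups):
    # unit-variance scaling first
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    a, b = Xs[groups == 0], Xs[groups == 1]
    m1, m2 = a.mean(axis=0), b.mean(axis=0)
    v1, v2 = a.var(axis=0) + 1e-12, b.var(axis=0) + 1e-12
    # closed-form KL between per-variable Gaussians, symmetrized
    kl12 = 0.5 * (v1 / v2 + (m2 - m1) ** 2 / v2 - 1 + np.log(v2 / v1))
    kl21 = 0.5 * (v2 / v1 + (m1 - m2) ** 2 / v1 - 1 + np.log(v1 / v2))
    w = kl12 + kl21
    return Xs * w   # up-weight variables that differ between groups
```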

3.
The solution of the Navier–Stokes equations requires that data about the solution be available along the boundary. In some situations, such as particle imaging velocimetry, additional data is available along a single plane within the domain, and there is a desire to incorporate this data into the approximate solution of the Navier–Stokes equations as well. The question we seek to answer in this paper is whether two-dimensional velocity data containing noise can be incorporated into a full three-dimensional solution of the Navier–Stokes equations in an appropriate and meaningful way. To address this problem, we examine the potential of least-squares finite element methods (LSFEM) because of their flexibility in the enforcement of various boundary conditions. Further, by weighting the boundary conditions in a manner that properly reflects the accuracy with which the boundary values are known, we develop the weighted LSFEM. Its potential is explored on three test problems: the first uses randomly generated Gaussian noise to create artificial 'experimental' data in a controlled manner, and the second and third use particle imaging velocimetry data. In all test problems, weighted LSFEM produces accurate results even when there is significant noise in the experimental data.
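The row-weighting idea behind weighted LSFEM can be illustrated, stripped of the finite-element machinery, as a plain weighted least-squares solve in which each equation or boundary/in-plane condition carries a weight reflecting how accurately it is known:

```python
import numpy as np

def weighted_lstsq(A, b, w):
    # rows of (A, b) encode equations and measured conditions; w holds
    # per-row confidences (large = well known, small = noisy)
    sw = np.sqrt(w)[:, None]
    x, *_ = np.linalg.lstsq(sw * A, np.sqrt(w) * b, rcond=None)
    return x
```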

4.
The aim of this study was to evaluate the contribution of diffusion and perfusion MR metrics to the discrimination of intracranial brain lesions at 3T MRI, and to investigate the potential diagnostic and predictive value that pattern recognition techniques may provide in tumor characterization using these metrics as classification features. Conventional MRI, diffusion weighted imaging (DWI), diffusion tensor imaging (DTI) and dynamic susceptibility contrast imaging (DSCI) were performed on 115 patients with newly diagnosed intracranial tumors (low- and high-grade gliomas, meningiomas, solitary metastases). The Mann–Whitney U test was employed to identify statistical differences in the diffusion and perfusion parameters for different tumor comparisons in the intra- and peritumoral regions. To assess the diagnostic contribution of these parameters, two different methods were used: the commonly used receiver operating characteristic (ROC) analysis and the more sophisticated SVM classification; accuracy, sensitivity and specificity levels were obtained for both. The combination of all metrics provided the optimum diagnostic outcome. The highest predictive performance was obtained using the SVM classification, although ROC analysis yielded high accuracies as well. It is evident that DWI/DTI and DSCI are useful techniques for tumor grading. Nevertheless, cellularity and vascularity are factors closely correlated in a non-linear way and thus difficult to evaluate and interpret through conventional methods of analysis. Hence, the combination of diffusion and perfusion metrics into a sophisticated classification scheme may provide the optimum diagnostic outcome. In conclusion, machine learning techniques may be used as an adjunctive diagnostic tool that can be implemented into the clinical routine to optimize decision making.
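A minimal sketch of the SVM evaluation step for one binary tumor comparison, assuming a feature matrix X of diffusion/perfusion metrics and labels y (kernel and cross-validation settings are assumptions):

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

def evaluate(X, y):
    # cross-validated predictions from an RBF-kernel SVM
    pred = cross_val_predict(SVC(kernel='rbf', C=1.0), X, y, cv=5)
    tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity
```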

5.
Access to healthcare data such as electronic health records (EHR) is often restricted by laws established to protect patient privacy. These restrictions hinder the reproducibility of existing results based on private healthcare data and also limit new research. Synthetically generated healthcare data solve this problem by preserving privacy and enabling researchers and policymakers to drive decisions and methods based on realistic data. Healthcare data can include information about multiple in- and out-patient visits, making it a time-series dataset that is often influenced by protected attributes such as age, gender and race. The COVID-19 pandemic has exacerbated health inequities, with certain subgroups experiencing poorer outcomes and less access to healthcare. To combat these inequities, synthetic data must "fairly" represent diverse minority subgroups so that the conclusions drawn on synthetic data are correct and the results generalize to real data. In this article, we develop two fairness metrics for synthetic data and apply them to all subgroups defined by protected attributes to analyze the bias in three published synthetic research datasets. These covariate-level disparity metrics revealed that synthetic data may not be representative at the univariate and multivariate subgroup levels; thus, fairness should be addressed when developing data generation methods. We discuss the need to measure fairness in synthetic healthcare data so that robust machine learning models and more equitable synthetic healthcare datasets can be developed.
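The paper's two fairness metrics are not reproduced here; as an illustrative proxy, a covariate-level disparity check can compare each protected subgroup's share in the real versus the synthetic data:

```python
import pandas as pd

def subgroup_disparity(real: pd.DataFrame, synth: pd.DataFrame, attrs):
    # subgroup proportions in each dataset, aligned on the same subgroups
    p_real = real.groupby(attrs).size() / len(real)
    p_synth = synth.groupby(attrs).size() / len(synth)
    p_real, p_synth = p_real.align(p_synth, fill_value=0.0)
    return (p_synth - p_real).abs()   # per-subgroup representation gap

# usage (column names are assumptions):
# gaps = subgroup_disparity(real_df, synth_df, ['race', 'gender'])
```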

6.
Users of social networks have a variety of social statuses and roles. For example, the users of Weibo include celebrities, government officials, and social organizations; at the same time, these users may be senior managers, middle managers, or workers in companies. Previous studies on this topic have mainly used the categorical, textual and topological data of a social network to predict users' social statuses and roles, which cannot fully reflect their overall characteristics. In this paper, we consider which social network structures reflect users' social statuses and roles, since social networks are designed to connect people. Taking the Enron email dataset as an example, we analyzed a preprocessing mechanism for social network datasets that extracts users' dynamic behavior features. We further designed a novel social network representation learning algorithm that infers users' social statuses and roles through an attention and gate mechanism applied to users' neighbors. Extensive experimental results on four publicly available datasets indicate that our solution achieves an average accuracy improvement of 2% over GraphSAGE-Mean, the best applicable inductive representation learning method.
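A toy sketch of attention-and-gate aggregation over a user's neighbors (single head, NumPy); the weight matrices and gating form are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend_and_gate(h_u, H_nbrs, Wa, Wg):
    # h_u: (d,) user embedding; H_nbrs: (k, d) neighbor embeddings
    scores = H_nbrs @ (Wa @ h_u)          # attention logit per neighbor
    alpha = softmax(scores)               # attention weights
    m = alpha @ H_nbrs                    # attended neighborhood message
    g = 1.0 / (1.0 + np.exp(-(Wg @ np.concatenate([h_u, m]))))  # gate
    return g * h_u + (1.0 - g) * m        # gated update of the embedding
```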

7.
The most common machine-learning methods solve supervised and unsupervised problems based on datasets whose features belong to a numerical space. However, many problems include data where numerical and categorical values coexist, which makes them challenging to manage. Preprocessing is compulsory to transform categorical data into numeric form. Methods such as one-hot and feature-hashing have been the most widely used encoding approaches, at the expense of a significant increase in the dimensionality of the dataset. This expansion introduces further challenges in dealing with an overabundance of variables and/or noisy data. In this regard, we propose a novel encoding approach that maps mixed-type data into an information space, using Shannon's information theory to model the amount of information contained in the original data. We evaluated our proposal with ten mixed-type datasets from the UCI repository and two datasets representing real-world problems, obtaining promising results. To demonstrate its performance, the approach was applied to prepare these datasets for classification, regression, and clustering tasks. We show that our encoding proposal is markedly superior to one-hot and feature-hashing encoding in terms of memory efficiency while preserving the information conveyed by the original data.
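One hedged reading of such an information-space mapping is to encode each categorical value by its Shannon information content, −log2 p(value), yielding a single numeric column per feature rather than a one-hot expansion; the paper's actual mapping may differ:

```python
import numpy as np
import pandas as pd

def information_encode(col: pd.Series) -> pd.Series:
    # empirical value probabilities, then self-information -log2 p(value)
    p = col.value_counts(normalize=True)
    return col.map(-np.log2(p))

# usage: df['city_info'] = information_encode(df['city'])
```

Rare categories map to large information values and frequent ones to small values, so the dimensionality stays constant regardless of cardinality.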

8.
The analysis of ambient organic aerosols by nuclear magnetic resonance (NMR) spectroscopy is limited because of the large number of organic compounds present at low concentrations. Here, we demonstrate the integrative analysis of NMR spectra of airborne pollen particles using reference spectra, spectral similarity metrics, principal components analysis (PCA), and a chemical mass balance model to determine the predominant types of organic compounds. Strong associations among glucose, fucose, specific amino acids, and airborne pollen particles were observed through the spectral similarity metrics and PCA. Carbohydrates accounted for about 51% of the airborne pollen particle signal in the spectrum, followed by amino acids (42%) and other compounds (7%). Overall, our investigations showed that analyzing NMR spectral data of mixtures of environmental organic compounds with pattern recognition methods can yield information on the chemical characteristics of the mixture.
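The chemical mass balance step can be sketched as a non-negative least-squares fit of the mixture spectrum against reference compound spectra (illustrative only):

```python
from scipy.optimize import nnls

def mass_balance(mixture, references):
    # mixture: (n_points,) spectrum; references: (n_refs, n_points)
    coeffs, _ = nnls(references.T, mixture)
    return coeffs / coeffs.sum()   # fractional contribution per reference
```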

9.
Exploring the mobility of mobile phone users   Cited by: 2 (self-citations: 0, external citations: 2)
Mobile phone datasets allow for the analysis of human behavior on an unprecedented scale. The social network, temporal dynamics and mobile behavior of mobile phone users have often been analyzed independently of each other using such datasets. In this article, we explore the connections between various features of human behavior extracted from a large mobile phone dataset. Our observations are based on the analysis of communication data of 100,000 anonymized and randomly chosen individuals in a dataset of communications in Portugal. We show that clustering and principal component analysis allow for a significant dimension reduction with limited loss of information. The most important features are related to geographical location; in particular, we observe that most people spend most of their time at only a few locations. With the help of clustering methods, we then robustly identify home and office locations and compare the results with official census data. Finally, we analyze the geographic spread of users' frequent locations and show that commuting distances can be reasonably well explained by a gravity model.
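The gravity-model check can be sketched as a log-linear least-squares fit of flows against population masses and distances (variable names and functional form are assumptions):

```python
import numpy as np

def fit_gravity(T, m, D):
    # T: (n, n) flows between locations; m: (n,) masses; D: (n, n) distances
    i, j = np.triu_indices(len(m), k=1)
    mask = T[i, j] > 0
    y = np.log(T[i, j][mask])
    X = np.column_stack([np.ones(mask.sum()),
                         np.log(m[i][mask] * m[j][mask]),
                         np.log(D[i, j][mask])])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta   # [log k, mass exponent, -gamma (distance decay)]
```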

10.
Uncertainty quantification for complex deep learning models is increasingly important as these techniques see growing use in high-stakes, real-world settings. Currently, the quality of a model's uncertainty is evaluated using point-prediction metrics such as the negative log-likelihood (NLL), expected calibration error (ECE) or the Brier score on held-out data. Marginal coverage of prediction intervals or sets, a well-known concept in the statistical literature, is an intuitive alternative to these metrics but has yet to be systematically studied for many popular uncertainty quantification techniques for deep learning models. With marginal coverage and the complementary notion of the width of a prediction interval, downstream users of deployed machine learning models can better understand uncertainty quantification both at the global dataset level and on a per-sample basis. In this study, we provide the first large-scale evaluation of the empirical frequentist coverage properties of well-known uncertainty quantification techniques on a suite of regression and classification tasks. We find that, in general, some methods do achieve desirable coverage properties on in-distribution samples, but that coverage is not maintained on out-of-distribution data. Our results demonstrate the failings of current uncertainty quantification techniques as dataset shift increases and reinforce coverage as an important metric in developing models for real-world applications.
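Empirical marginal coverage and interval width are simple to compute once prediction intervals are available; a minimal sketch for the regression case:

```python
import numpy as np

def coverage_and_width(y, lo, hi):
    # y: true targets; lo/hi: per-sample interval bounds (all np arrays)
    covered = (y >= lo) & (y <= hi)
    return covered.mean(), (hi - lo).mean()  # marginal coverage, mean width
```

A well-calibrated 90% interval should yield coverage near 0.90; the width shows how informative the intervals are at that coverage level.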

11.
12.
Yuangang Lu  Xiangzhao Wang 《Optik》2007,118(2):62-66
Among the various phase unwrapping approaches, weighted least-squares minimization methods are gaining attention. In these algorithms, the weighting coefficients are generated from a quality map. The intrinsic drawbacks of existing quality maps constrain the application of these algorithms: they often fail on wrapped phase data that contain error sources such as phase discontinuities, noise and undersampling. To deal with such intractable wrapped phase data, a new weighted least-squares phase unwrapping algorithm based on a derivative variance correlation map is proposed. In the algorithm, the derivative variance correlation map, a novel quality map, truly reflects the quality of the wrapped phase, ensuring a more reliable unwrapped result. The definition of the derivative variance correlation map and the principle of the proposed algorithm are presented in detail. The performance of the new algorithm has been tested on simulated wrapped data of a spherical surface and on experimental interferometric synthetic aperture radar (IFSAR) wrapped data. Computer simulation and experimental results verify that the proposed algorithm works effectively even when a wrapped phase map contains intractable error sources.
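A simplified derivative-variance quality measure (not the paper's full derivative variance correlation map) can be sketched as the local standard deviation of the wrapped phase derivatives, mapped to weights in (0, 1]:

```python
import numpy as np

def wrap(p):
    # rewrap values into (-pi, pi]
    return np.angle(np.exp(1j * p))

def quality_weights(psi, k=3):
    # psi: 2D wrapped phase; k: odd local window size
    dx = wrap(np.diff(psi, axis=1, append=psi[:, -1:]))
    dy = wrap(np.diff(psi, axis=0, append=psi[-1:, :]))
    pad = k // 2
    Dx = np.pad(dx, pad, mode='edge')
    Dy = np.pad(dy, pad, mode='edge')
    q = np.zeros_like(psi)
    for i in range(psi.shape[0]):
        for j in range(psi.shape[1]):
            q[i, j] = Dx[i:i + k, j:j + k].std() + Dy[i:i + k, j:j + k].std()
    return 1.0 / (1.0 + q)   # smooth phase -> weight near 1, noisy -> near 0
```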

13.
People nowadays use the internet to project their assessments, impressions, ideas, and observations about various subjects or products on numerous social networking sites. These sites serve as a great source of data for data analytics, sentiment analysis, natural language processing, etc. Conventionally, the true sentiment of a customer review matches its corresponding star rating, but there are exceptions in which the star rating of a review is the opposite of its true nature; these are labeled as the outliers in a dataset in this work. State-of-the-art methods for anomaly detection involve manual searching, predefined rules, or traditional machine learning techniques to detect such instances. This paper conducts a sentiment analysis and outlier detection case study on Amazon customer reviews and proposes a statistics-based outlier detection and correction method (SODCM), which identifies such reviews and rectifies their star ratings to enhance the performance of a sentiment analysis algorithm without any data loss. The paper applies SODCM to datasets of customer reviews of various products that are (a) scraped from Amazon.com and (b) publicly available, and evaluates the effect of SODCM on the performance of a sentiment analysis algorithm. The results show that SODCM achieves higher accuracy and recall than other state-of-the-art anomaly detection algorithms.
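A hedged sketch in the spirit of SODCM: treat reviews whose sentiment score is a z-score outlier within their star-rating group as mislabeled and reassign them to the nearest group; the scoring and threshold are assumptions, not the paper's exact statistics:

```python
import numpy as np

def correct_ratings(scores, stars, z_thresh=2.5):
    # scores: per-review sentiment scores; stars: per-review star ratings
    stars = stars.copy().astype(int)
    levels = np.unique(stars)
    means = {s: scores[stars == s].mean() for s in levels}
    stds = {s: scores[stars == s].std() + 1e-12 for s in levels}
    for i, (sc, st) in enumerate(zip(scores, stars)):
        if abs(sc - means[st]) / stds[st] > z_thresh:       # outlier review
            stars[i] = min(means, key=lambda s: abs(means[s] - sc))
    return stars
```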

14.
We introduce a novel noniterative algorithm for the fast and accurate reconstruction of nonuniformly sampled MRI data. The proposed scheme derives the reconstructed image as the nonuniform inverse Fourier transform of a compensated dataset. Each sample in the compensated dataset is derived as a weighted linear combination of a few measured k-space samples; the specific k-space samples and the weights involved in the linear combination are chosen such that the reconstruction error is minimized. The computational complexity of the proposed scheme is comparable to that of gridding, while it provides significantly improved accuracy and is considerably more robust to noise and undersampling. These advantages make the proposed scheme ideally suited for the fast reconstruction of large multidimensional datasets, which routinely arise in applications such as fMRI and MR spectroscopy. Comparisons with state-of-the-art algorithms on numerical phantoms and MRI data clearly demonstrate the performance improvement.
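A 1D toy version of deriving such weights: the combination of measured non-Cartesian samples is chosen by least squares so that its Fourier encoding matches that of the target location over an assumed finite image support (the paper's exact formulation may differ):

```python
import numpy as np

def interp_weights(k_measured, k_target, N=64):
    # k_measured: (M,) non-uniform sample locations; k_target: scalar
    x = np.arange(N) - N // 2                                 # image support
    E = np.exp(-2j * np.pi * np.outer(k_measured, x) / N)     # measured encodings
    e = np.exp(-2j * np.pi * k_target * x / N)                # target encoding
    # w minimizes ||sum_i w_i E_i - e||^2 over the support
    w, *_ = np.linalg.lstsq(E.T, e, rcond=None)
    return w
```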

15.
Recently, there has been a huge rise in malware growth, which creates a significant security threat to organizations and individuals. Despite the incessant efforts of cybersecurity research to defend against malware threats, malware developers keep discovering new ways to evade these defense techniques. Traditional static and dynamic analysis methods are ineffective in identifying new malware and impose high overhead in terms of memory and time. Typical machine learning approaches that train a classifier on handcrafted features are also not sufficiently potent against these evasive techniques and require more effort for feature engineering. Recent malware detectors also suffer performance degradation due to class imbalance in malware datasets. To resolve these challenges, this work adopts a visualization-based method in which malware binaries are depicted as two-dimensional images and classified by a deep learning model. We propose an efficient malware detection system based on deep learning. The system uses a reweighted class-balanced loss function in the final classification layer of a DenseNet model to achieve significant performance improvements in classifying malware under imbalanced data. Comprehensive experiments on four benchmark malware datasets show that the proposed approach detects new malware samples with higher accuracy (98.23% for the Malimg dataset, 98.46% for the BIG 2015 dataset, 98.21% for the MaleVis dataset, and 89.48% for the unseen Malicia dataset) and reduced false-positive rates compared with conventional malware mitigation techniques, while maintaining low computational time. The proposed malware detection solution is also reliable and effective against obfuscation attacks.
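A plausible form of a reweighted class-balanced loss is the "effective number of samples" weighting (Cui et al., 2019) plugged into cross-entropy; whether the paper uses exactly this scheme is an assumption:

```python
import torch
import torch.nn as nn

def class_balanced_ce(samples_per_class, beta=0.9999):
    # weight each class by (1 - beta) / (1 - beta^n_c), then normalize
    counts = torch.tensor(samples_per_class, dtype=torch.float)
    w = (1.0 - beta) / (1.0 - beta ** counts)
    w = w * len(samples_per_class) / w.sum()
    return nn.CrossEntropyLoss(weight=w)

# usage (illustrative counts): criterion = class_balanced_ce([9000, 400, 80])
# loss = criterion(logits, targets)
```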

16.
Background: Achieving inter-site/inter-scanner reproducibility of diffusion-weighted magnetic resonance imaging (DW-MRI) metrics has been challenging given differences in acquisition protocols, analysis models, and hardware factors. Purpose: Magnetic field gradients impart scanner-dependent spatial variations in the applied diffusion weighting that can be corrected if the gradient nonlinearities are known. However, retrieving manufacturer nonlinearity specifications is not well supported and may introduce errors in the interpretation of units or coordinate systems. We propose an empirical approach to mapping the gradient nonlinearities with sequences that are supported across the major scanner vendors. Study type: Prospective observational study. Subjects: A spherical isotropic diffusion phantom and a single human control volunteer. Field strength/sequence: 3 T (two scanners); Stejskal-Tanner spin-echo sequence with b-values of 1000 and 2000 s/mm2 and 12, 32, and 384 diffusion gradient directions per shell. Assessment: We compare the proposed correction and the prior approach using manufacturer specifications against typical diffusion preprocessing pipelines (i.e., ignoring spatial gradient nonlinearities). In phantom data, we evaluate metrics against the ground truth; in human and phantom data, we evaluate reproducibility across scans, sessions, and hardware. Statistical tests: Wilcoxon rank-sum test between uncorrected and corrected data. Results: In phantom data, our correction method reduces variation in mean diffusivity across sessions relative to uncorrected data (p < 0.05). In human data, we show that this method can also reduce variation in mean diffusivity across scanners (p < 0.05). Conclusion: Our method is relatively simple, fast, and can be applied retroactively. We advocate that voxel-specific b-value and b-vector maps be incorporated in DW-MRI harmonization preprocessing pipelines to improve the quantitative accuracy of measured diffusion parameters.
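The voxelwise correction itself is compact: given the local gradient-nonlinearity tensor L, the effective gradient direction is L·g and the b-value scales with |g|²; a minimal sketch (estimation of L not shown):

```python
import numpy as np

def effective_b(b_nom, g_nom, L):
    # b_nom: nominal b-value; g_nom: (3,) unit b-vector; L: (3, 3) local tensor
    g_eff = L @ g_nom
    scale = np.linalg.norm(g_eff)
    b_eff = b_nom * scale ** 2            # b is proportional to |g|^2
    return b_eff, g_eff / max(scale, 1e-12)
```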

17.
Application of modulation analysis to phase unwrapping with equal-step phase-shifting methods   Cited by: 7 (self-citations: 4, external citations: 3)
蒋震宇  缪泓  张青川  伍小平 《光学学报》2004,24(8):1032-1038
The modulation expressions of two commonly used equal-step phase-shifting algorithms are derived, and a new modulation analysis method is proposed. Applied to weighted least-squares phase unwrapping in equal-step phase-shifting methods, the method makes full use of the modulation information to construct binary and fractional weights, thereby strengthening the immunity of the phase unwrapping process to a variety of disturbances. Experimental results demonstrate the effectiveness and practicality of the method. Finally, the performance of binary and fractional weights in weighted least-squares phase unwrapping is compared.
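A minimal sketch of deriving the two weight types from a modulation map (threshold and exponent are illustrative):

```python
import numpy as np

def modulation_weights(mod, thresh=0.2):
    # mod: 2D modulation map from the phase-shifting algorithm
    m = mod / mod.max()
    w_binary = (m > thresh).astype(float)   # keep / reject pixels outright
    w_fractional = m ** 2                   # graded per-pixel confidence
    return w_binary, w_fractional
```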

18.
The partial separability (PS) of spatiotemporal signals has been exploited to accelerate dynamic cardiac MRI by sampling two datasets (training and imaging datasets) without breath-holding or ECG triggering. According to the theory of partially separable functions, the wider the range of spatial frequency components covered by the training dataset, the more accurate the temporal constraint imposed by the PS model. It is therefore necessary to develop a new sampling scheme for the PS model that covers a wider range of spatial frequency components. In this paper, we propose the use of radial sampling trajectories for collecting the training dataset and Cartesian sampling trajectories for collecting the imaging dataset. In vivo high-resolution cardiac MRI experiments demonstrate that the proposed data sampling scheme can significantly improve image quality. The image quality using the PS model with the proposed sampling scheme is comparable to that of a commercial method using retrospective cardiac gating and breath-holding. The success of this study demonstrates great potential for high-quality, high-resolution dynamic cardiac MRI without ECG gating or breath-holding through use of the PS model and the novel data sampling scheme.
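A toy sketch of the PS reconstruction idea: estimate an order-L temporal basis from the training data by SVD, then fit spatial coefficients to the imaging data by least squares; for simplicity it assumes the imaging data are sampled at all time points, which the actual scheme does not require:

```python
import numpy as np

def ps_reconstruct(C_train, C_img, L=16):
    # C_train: (k_train, t) training data, densely sampled in time
    # C_img:   (k_img, t)  imaging data sharing the same time axis (toy case)
    _, _, Vt = np.linalg.svd(C_train, full_matrices=False)
    V = Vt[:L]                                       # temporal basis (L, t)
    coeffs, *_ = np.linalg.lstsq(V.T, C_img.T, rcond=None)
    return coeffs.T @ V                              # rank-L estimate (k_img, t)
```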

19.
In this study, the potential of visible and near-infrared reflectance spectroscopy to estimate soil organic matter was assessed. Six preprocessing methods were implemented to process the original spectra. The partial least-squares regression approach was applied to construct predictive models and to identify the optimal spectral preprocessing method. The significant wavelengths for soil organic matter were determined using correlation analysis and partial least-squares regression analysis. The results were: (i) visible and near-infrared reflectance spectroscopy proved to be an effective approach for estimating soil organic matter; (ii) the preprocessed spectra showed improved correlation with soil organic matter, with the combination of first-order derivative and Savitzky–Golay smoothing outperforming the other preprocessing methods; (iii) the predictive models based on spectra processed by derivatives together with Savitzky–Golay smoothing achieved satisfactory accuracy, yielding determination coefficient and root mean square error values of 0.986 and 0.077, respectively, for the first-order derivative, and 0.973 and 0.105, respectively, for the second-order derivative; the combination of first-order derivative and Savitzky–Golay smoothing was therefore recommended as the preferable preprocessing method; and (iv) the wavelengths of 417, 1853, 1000, and 2412 nm were identified as the significant wavelengths associated with soil organic matter. The study provides a reference for the site-specific management of agricultural inputs using visible and near-infrared reflectance spectroscopy.
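The correlation-analysis step for significant wavelengths can be sketched by ranking bands by their absolute correlation with soil organic matter (a simplification of the paper's combined correlation/PLS analysis):

```python
import numpy as np

def significant_wavelengths(X, y, wavelengths, top=4):
    # X: (samples, bands) preprocessed spectra; y: SOM content;
    # wavelengths: (bands,) array of band centers in nm
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12)
    idx = np.argsort(-np.abs(r))[:top]
    return wavelengths[idx], r[idx]   # most correlated bands and their r
```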

20.
Determination of suitable techniques and analyses that can be implemented by NMR well logging can greatly improve the characterization of underground petroleum reservoirs and aquifers. In this paper, the feasibility of using various NMR methods for the detection and characterization of fractures is explored, and analyses of experimental data obtained with a variety of samples are presented. It is shown that relaxation contrasts are useful for separating the signal contributions of fluids in the fractures from those in the porous matrix, and that relaxation weighting can be used in combination with other NMR techniques to enhance fracture characterization.
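Relaxation-contrast separation can be illustrated by a biexponential T2 fit to a CPMG decay, with the fast component attributed to matrix pores and the slow component to open fractures (the assignment and all numbers are illustrative):

```python
import numpy as np
from scipy.optimize import curve_fit

def biexp(t, a_fast, t2_fast, a_slow, t2_slow):
    # two-pool T2 decay: fast pool (matrix) + slow pool (fracture)
    return a_fast * np.exp(-t / t2_fast) + a_slow * np.exp(-t / t2_slow)

t = np.linspace(0.001, 1.0, 200)                    # echo times (s)
signal = biexp(t, 0.7, 0.02, 0.3, 0.4)              # synthetic decay
signal += 0.01 * np.random.default_rng(0).normal(size=t.size)

popt, _ = curve_fit(biexp, t, signal, p0=[0.5, 0.05, 0.5, 0.5])
fracture_fraction = popt[2] / (popt[0] + popt[2])   # slow-component share
```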
