Feature selection method based on self-adaptive LASSO

A feature selection method based on adaptive LASSO, applied in mathematics and computer science, that addresses problems such as LASSO's inability to guarantee the consistency of the selected features.

Inactive Publication Date: 2020-10-27
EAST CHINA NORMAL UNIV

AI Technical Summary

Problems solved by technology

Zou [11] pointed out that LASSO cannot guarantee the consistency of the selected features in some cases, and proposed an adaptive LASSO method that adds a weight coefficient to the regularization term of each feature.
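
The per-feature weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: it assumes a squared-error loss and solves (1/2n)||y - Xb||^2 + lam * sum_j w_j |b_j| by cyclic coordinate descent. The function names, the synthetic data, and the weight values are all hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by L1-penalised coordinate descent."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def adaptive_lasso(X, y, weights, lam, n_iter=200):
    """Minimise (1/2n)||y - X b||^2 + lam * sum_j weights[j] * |b_j|.

    Squared-error loss is an assumption made for this sketch.
    """
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ b
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant-zero column: coefficient stays at 0
            # Correlation of feature j with the residual that excludes feature j
            rho = X[:, j] @ resid / n + col_sq[j] * b[j]
            # A larger weight means a larger threshold, so the feature is dropped more easily
            new_bj = soft_threshold(rho, lam * weights[j]) / col_sq[j]
            resid += X[:, j] * (b[j] - new_bj)
            b[j] = new_bj
    return b

# Synthetic data: only the first two features carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(60)

# Hypothetical weights: small penalty on features believed relevant, large on the rest.
weights = np.array([0.5, 0.5, 5.0, 5.0, 5.0])
coef = adaptive_lasso(X, y, weights, lam=0.1)
selected = np.flatnonzero(coef)
```

With all weights equal to 1 the function reduces to ordinary LASSO, so the weights are the only place where prior knowledge about each feature enters.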

Examples

Embodiment 1

[0079] The data in this example come from The Cancer Genome Atlas (TCGA) database and consist of methylation expression data for liver cancer. The cancer samples are taken from cells of the cancerous organ, and the normal samples are taken from tissue at some distance from the cancerous site. The data set has 485577 features and 100 samples: 50 cancer samples and 50 normal samples. The data set is split into a 70% training set and a 30% test set, and feature selection is carried out on the training set. First, a Student's t-test is performed on the training data and the 1000 features with the smallest p-values are kept; then the proposed method is applied to these 1000 features, and 8 features are selected. Linear SVM models were trained on the 8 features and on the 1000 features and evaluated on the test set, and finally the same classification accur...
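
The pre-filtering step in this embodiment (a 70/30 split followed by a t-test on the training data) might look roughly like the sketch below. It is an illustration under stated assumptions: the data are synthetic, the helper `top_k_by_t_test` is a hypothetical name, and the split is a plain random permutation because the source does not say whether it was stratified. For a pooled-variance Student's t-test every feature has the same degrees of freedom, so ranking by the largest |t| gives the same order as ranking by the smallest p-value.

```python
import numpy as np

def top_k_by_t_test(X, y, k):
    """Indices of the k features with the largest |t| (pooled-variance Student's t-test).

    All features share the same degrees of freedom, so this ordering matches
    an ordering by ascending p-value.
    """
    a, b = X[y == 1], X[y == 0]
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(axis=0, ddof=1)
                  + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    se = np.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    # Features with zero variance get t = 0 instead of a division by zero
    t = (a.mean(axis=0) - b.mean(axis=0)) / np.where(se > 0, se, np.inf)
    return np.argsort(-np.abs(t))[:k]

# Synthetic stand-in for the methylation matrix (the real data has 485577 columns).
rng = np.random.default_rng(0)
n, p = 100, 2000
y = np.repeat([1, 0], 50)            # 50 "cancer" and 50 "normal" labels
X = rng.standard_normal((n, p))
X[y == 1, :5] += 3.0                 # hypothetical signal in the first five features

# 70% training / 30% test split by random permutation (stratification is not specified).
perm = rng.permutation(n)
train, test = perm[:70], perm[70:]

# The filter is fitted on the training rows only, then applied to both parts.
keep = top_k_by_t_test(X[train], y[train], k=20)
X_train_small, X_test_small = X[train][:, keep], X[test][:, keep]
```

Computing the t-statistics on the training rows alone keeps the test set untouched by the selection step, which matches the order of operations described in the embodiment.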

Embodiment 2

[0081] The data in this example come from the UCI Machine Learning Repository, using the Sentiment Labeled Sentences Data Set. The data are randomly sampled from Amazon shopping reviews, and the task is to determine whether each review is positive. The data set has 1000 samples: 500 positive samples and 500 negative samples. The text is vectorized with a bag-of-words model to obtain 1897-dimensional data. The data set is split into a 70% training set and a 30% test set, and feature selection is carried out on the training set. Because this data set is discrete, the Relief algorithm cannot be used to calculate the degree of difference within and between classes, so only the adaptive LASSO method weighted by symmetric uncertainty is used for feature selection, and 224 features are selected. The 224 features and 1897 features were used to train the linear SVM model to verify the test set, and ...
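
As a small illustration of the bag-of-words vectorization mentioned in this embodiment, the sketch below turns a list of sentences into word-count vectors using only the Python standard library. The tokenization (lowercasing and splitting on whitespace) and the two example sentences are assumptions; the source does not describe the preprocessing that produced the 1897 dimensions.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of text documents."""
    tokenised = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenised for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    rows = []
    for doc in tokenised:
        row = [0] * len(vocab)
        for word, count in Counter(doc).items():
            row[index[word]] = count
        rows.append(row)
    return vocab, rows

# Made-up sentences, not taken from the data set.
reviews = ["great phone great battery", "bad battery"]
vocab, X = bag_of_words(reviews)
```

Each column of `X` is a discrete count feature, which is why entropy-based scores such as symmetric uncertainty can be computed on it directly.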

Abstract

The invention provides a feature selection method based on self-adaptive LASSO. The method is suitable for feature selection on gene microarray data, which are characterized by high dimensionality and low sample size. The method comprises the following steps: first, calculating the information entropy of each feature vector, the information entropy of the classification labels, and the conditional entropy between the features and the labels, to obtain the symmetric uncertainty between each feature vector and the classification labels; then, according to the principle that a feature's expression should differ little between samples of the same class and differ greatly between samples of different classes, using the ReliefF algorithm to calculate the between-class difference degree of each feature; and finally, taking each of the two evaluation indexes in turn as the feature weights of an adaptive LASSO algorithm to perform feature selection, and combining the two resulting feature subsets to generate the final selected feature set.
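
The first step of the abstract can be made concrete with the standard definition of symmetric uncertainty, SU(X, Y) = 2 * (H(X) - H(X|Y)) / (H(X) + H(Y)). The sketch below is a minimal illustration that assumes discrete feature values (continuous data would need to be discretized first, which the source does not describe). The final line models "combining" the two subsets as a set union, which is also an assumption, and the two index lists are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H(X) in bits for a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(X | Y): entropy of x within each group of y, weighted by group size."""
    n = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * (H(X) - H(X|Y)) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    return 2.0 * (hx - conditional_entropy(x, y)) / (hx + hy)

labels = [0, 0, 1, 1]
su_informative = symmetric_uncertainty([0, 0, 1, 1], labels)    # feature mirrors the label
su_uninformative = symmetric_uncertainty([0, 1, 0, 1], labels)  # feature independent of the label

# Hypothetical index sets returned by the two adaptive LASSO runs.
subset_from_su = [3, 7, 12]
subset_from_relieff = [7, 12, 40]
final_features = sorted(set(subset_from_su) | set(subset_from_relieff))
```

How a score is converted into an adaptive LASSO weight is not stated in this summary, so the sketch stops at the score itself.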

Description

Technical field

[0001] The invention belongs to the technical field of feature selection in feature engineering, relates to mathematics and computer science, and can be applied in machine learning, including gene microarray data processing, text analysis, pattern recognition and the like.

Background

[0002] As a carrier of gene expression data, DNA microarrays are widely used in disease diagnosis [1-3]. DNA microarray data have two defining characteristics: high dimensionality and low sample size. With the continuous development of biochip technology, the dimensionality of the data has grown further, bringing the challenge of the "curse of dimensionality" [4]. To deal with this problem, data preprocessing is unavoidable. Feature selection and feature extraction are two commonly used feature preprocessing methods. The difference is that the former screens out important feature subsets from the original feature set, while the la...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G16B40/00, G06K9/62
CPC: G16B40/00, G06F18/213
Inventor: 李海晟, 赵炳君
Owner: EAST CHINA NORMAL UNIV