
53 results about "Majority class" patented technology

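Majority Class: A classifier that always predicts the majority class achieves an accuracy of \((1-x_i)\), the fraction of instances belonging to the majority class, so \(x_i\) is the fraction of instances outside it (the negative label is assumed to be the majority here).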
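
The baseline described above can be checked with a few lines of Python. This is a minimal sketch; the function name `majority_baseline_accuracy` and the toy label list are illustrative assumptions, not part of the source page.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    if not labels:
        raise ValueError("labels must be non-empty")
    # most_common(1) returns [(label, count)] for the majority class.
    _, majority_count = Counter(labels).most_common(1)[0]
    return majority_count / len(labels)


if __name__ == "__main__":
    # Hypothetical data: 95 negatives (majority) and 5 positives (minority).
    toy_labels = [0] * 95 + [1] * 5
    print(majority_baseline_accuracy(toy_labels))
```

On strongly unbalanced data this baseline is already high, which is why the entries below report minority-class recognition rates rather than raw accuracy alone.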

Technique for classifying data

Provided is a system that generates models for classifying input data into a plurality of classes on the basis of training data previously classified into the plurality of classes. The system includes a sampling unit and a learning unit. The sampling unit samples, from the training data, a plurality of datasets each including a predetermined number of elements classified into a minority class and a corresponding number of elements classified into a majority class, the corresponding number being determined in accordance with the predetermined number. The learning unit learns each of a plurality of models for classifying the input data into the plurality of classes, by using a machine learning technique on the basis of each of the plurality of sampled datasets.
Owner:IBM CORP
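
The sampling unit in this abstract could be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the "corresponding number" is derived, so the `ratio` parameter, the function name `sample_balanced_datasets`, and sampling without replacement inside each dataset are all assumptions.

```python
import random


def sample_balanced_datasets(minority, majority, n_datasets, n_minority,
                             ratio=1.0, seed=0):
    """Draw several datasets, each with n_minority minority elements and
    a majority count derived from it (here: n_minority * ratio, an assumption)."""
    rng = random.Random(seed)
    n_majority = int(round(n_minority * ratio))
    if n_minority > len(minority) or n_majority > len(majority):
        raise ValueError("not enough elements to sample without replacement")
    datasets = []
    for _ in range(n_datasets):
        picked_minority = rng.sample(minority, n_minority)
        picked_majority = rng.sample(majority, n_majority)
        datasets.append(picked_minority + picked_majority)
    return datasets


if __name__ == "__main__":
    # Hypothetical (features, label) pairs.
    minority = [((i,), 1) for i in range(5)]
    majority = [((i,), 0) for i in range(50)]
    for dataset in sample_balanced_datasets(minority, majority, 3, 4, ratio=2.0):
        print(len(dataset))
```

One model would then be trained per sampled dataset, as the learning unit in the abstract describes.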

Unbalanced data sampling method in improved C4.5 decision tree algorithm

The invention relates to an unbalanced data sampling method in an improved C4.5 decision tree algorithm. The method comprises the steps as follows: firstly, initial weights of various samples are determined according to the number of various samples; the weights of the samples are modified through the training result of the improved C4.5 decision tree algorithm in each round; the information gain ratio and misclassified sample weights are taken into account by a division standard of the improved C4.5 algorithm; the final weights of the samples are obtained after T iterations; the samples in minority class boundary regions and majority class center regions are found out according to the sample weights; over-sampling is carried out on the samples in the minority class boundary regions by an SMOTE algorithm; and under-sampling is carried out on majority class samples by a weight sampling method, so that the samples in the center regions are relatively easily selected to improve the balance degree of different classes of data, and the recognition rates of the minority class and the overall data set are improved. According to the unbalanced data sampling method in the improved C4.5 decision tree algorithm, weight modification is carried out through the improved C4.5 decision tree algorithm; and over-sampling and under-sampling are specifically carried out according to the sample weights, so that the phenomena of classifier over-fitting, loss of useful information of the majority class and the like are effectively avoided.
Owner:CHONGQING UNIV OF POSTS & TELECOMM
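
Several entries on this page rely on SMOTE for the over-sampling step. The sketch below shows plain SMOTE interpolation only, not the weight-guided selection of boundary samples that this patent describes; the function name `smote` and its parameters are illustrative assumptions.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (plain SMOTE)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        # k nearest minority neighbours of the base point (Euclidean distance).
        neighbours = sorted(others, key=lambda q: math.dist(base, q))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    for p in smote(points, n_new=3):
        print(p)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region the minority class already occupies.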

Weight clustering and under-sampling-based unbalanced data classification method

Inactive | CN106778853A | To achieve the effect of automatic clustering | Improve classification accuracy | Character and pattern recognition | Data dredging | Majority class
Classifying unbalanced data sets has become one of the most challenging problems in data mining. Because minority class samples are far fewer than majority class samples, conventional algorithms classify the minority classes with low accuracy and poor generalization performance. Ensemble algorithms have become an important way of dealing with the problem, and ensemble algorithms based on random under-sampling or on clustering can effectively improve classification performance. However, the former easily causes information loss, and the latter is computationally complex and difficult to popularize. The invention provides an improved ensemble classification algorithm based on weight clustering and fused with under-sampling, specifically a weight clustering and under-sampling-based unbalanced data classification method. According to the algorithm, clusters are formed according to the weights of the samples, a certain proportion of majority class samples and all minority class samples are extracted from each cluster according to the weight values of the samples to form a balanced data set, and classifiers are integrated within the AdaBoost algorithm framework, so that the classification effect is improved. Experimental results show that the algorithm is accurate, simple and highly stable.
Owner:CENT SOUTH UNIV

Self-adaptive oversampling method based on HDBSCAN clustering

The invention discloses a self-adaptive oversampling method based on HDBSCAN clustering, and mainly solves the problem of unbalanced data classification by using complete data information in an existing method. The technology comprises the following steps: (1) inputting a training data set; (2) clustering the minority class samples in the training set to obtain clusters of different scales which do not intersect with each other; (3) calculating the number of samples needing to be synthesized in each minority class cluster; (4) adaptively synthesizing new samples according to the number of samples needing to be synthesized by each cluster to obtain a new minority class data set; (5) forming a new balanced data set from the majority class data set and the new minority class data set; and (6) training and testing the classifier by using the new balanced data set. According to the technology, noise can be effectively prevented from being generated in an unbalanced data set, while the problem of inter-class and intra-class unbalance is solved, and a brand-new oversampling strategy is provided for unbalanced learning.
Owner:重庆信科设计有限公司 +1

Construction method for classifier

The invention relates to a construction method for a classifier. The construction method includes the following steps that a part of majority class training samples in a training sample set are removed through an undersampling method, and a current training sample set is updated through the undersampled training sample set, wherein the training sample set comprises the majority class training samples and minority class training samples, and the classes of all the training samples in the training sample set are known; oversampling is conducted on the minority class training samples in the training sample set, and the classifier is constructed through the oversampled training sample set. According to the construction method for the classifier, noise in the training samples is removed effectively, the problem of data imbalance can be solved effectively, the accuracy rate of training sample data classification is greatly increased, the calculation amount is small, and the method is simple.
Owner:HARBIN INST OF TECH

Sampling method for unbalanced transaction data of fictitious assets

The invention discloses a sampling method for unbalanced transaction data of fictitious assets. The method includes the following steps that abnormal transaction data in fictitious asset transaction are defined as a minority class, and oversampling is carried out on samples of the minority class by means of an improved SMOTE method in order to increase the number of the samples of the minority class; normal transaction data in fictitious asset transaction are defined as a majority class, and undersampling is conducted on samples of the majority class by means of a distance-based DUS method in order to decrease the number of the samples of the majority class; a scaling factor is set to adjust the proportion of the oversampling number and the undersampling number. The sampling method for unbalanced transaction data is applied to abnormal transaction detection of the fictitious assets, the calculated amount of abnormal transaction detection can be greatly reduced, and a high accuracy rate can be reached.
Owner:NAT UNIV OF DEFENSE TECH

Software defect prediction optimization method based on differential evolution algorithm

The present invention discloses a software defect prediction optimization method based on a differential evolution algorithm, and belongs to the field of quality assurance in software engineering. The method comprises the following steps: arranging modules in the software project, cleaning annotations and the like in the code, and establishing a software defect data code set; arranging the given defect set, including the defect metric design, the defect data marks, and the like, to generate a software defect data set; and with a differential evolution algorithm, creating a ratio of a majority class to a minority class as 2:1 for a defect prediction data set by using a minority class oversampling method, determining an optimal value of the neural network hyper-parameter, using a trained neural network classification model to test in a test set, and if the performance indicators are satisfied, representing that a software defect prediction model is successfully established. According to the method disclosed by the present invention, corresponding parameter factors in the classification model construction can be automatically classified according to the difference of the data sets, a parameter combination most suitable for the current data set and the classification model can be found, the performance of the software defect prediction model can be improved, and the workload of parameter searching in the model construction can be reduced.
Owner:JIANGSU COLLEGE OF ENG & TECH

Extremely unbalanced data classification method based on EasyEnsemble algorithm and SMOTE algorithm

Inactive | CN108596199A | Solve the problem of extreme deficiency | Improve reliability | Character and pattern recognition | Data set | Majority class
The invention provides an extremely unbalanced data classification method based on an EasyEnsemble algorithm and an SMOTE algorithm. The method comprises: a plurality of minority class subsets are constructed by using an SMOTE algorithm and minority class samples are increased; random undersampling is carried out on majority classes, and all majority class subsets and minority class subsets are combined to obtain a plurality of training subsets with a fixed sample proportion; noise reduction is carried out on each training subset; AdaBoost classifiers are trained by using the training subset after noise reduction; and then all AdaBoost classifiers are integrated to obtain a final classifier. According to the invention, a problem of shortage of minority class samples is solved; and the unbalancing state of the sample is changed by combining random undersampling. With the noise reduction technology, reliability of a new data set is improved; the classification boundary is smoothened; and majority class information losses are reduced by using an integration method, so that the performance of the classifier is improved.
Owner:BEIJING JIAOTONG UNIV

Automatic incident detection method based on under-sampling and used for unbalanced data set

The invention discloses an automatic incident detection method based on under-sampling and used for an unbalanced data set. The automatic incident detection method comprises the steps of (1) using a maximum and minimum normalization method to carry out normalization processing on actually-measured traffic flow data, carrying out under-sampling processing on a majority class in a training set on the basis of a neighborhood cleaning rule to obtain a new training set which is relatively balanced, (2) selecting a radial basis function as a kernel function of a support vector machine, using an improved grid search algorithm to optimize a penalty factor C and a kernel parameter g of the support vector machine, and (3) training the support vector machine through the training set which is relatively balanced so as to obtain an automatic incident detection model used for the unbalanced data set. According to the automatic incident detection method based on under-sampling and used for the unbalanced data set, the problem that an existing traffic incident detection algorithm is not applicable to unbalanced traffic data in reality is solved, detection performance of the traffic incident detection algorithm is remarkably improved, the average detection time is shortened, and the requirement of traffic incident detection for real-time performance is met.
Owner:SOUTHEAST UNIV

Multi-target evolutionary fuzzy rule classification method based on decomposition

The invention discloses a multi-target evolutionary fuzzy rule classification method based on decomposition, which mainly solves the problem of poor classification effect of an existing classification method on unbalanced data. The multi-target evolutionary fuzzy rule classification method comprises the steps of: obtaining a training data set and a test data set; normalizing and dividing the training data set into a majority class and a minority class; initializing an ignoring probability, a fuzzy partition number and a membership degree function; initializing an original group, and determining weight by adopting a fuzzy rule weight formula with a weighting factor; determining stopping criteria for iteration, iteration times, a step size and an ideal point; dividing direction vectors according to groups; performing evolutionary operation on the original group, and updating the original group by adopting a Chebyshev update mode until the criteria for iteration is stopped; obtaining classification results of the test data set; then projecting to obtain AUCH and output. The multi-target evolutionary fuzzy rule classification method has the advantages of high operating speed and good classification effect and can be applied in the technical fields of tumor detection, error detection, credit card fraud detection, spam messages recognition and the like.
Owner:XIDIAN UNIV

Method for keeping balance of implementation class data through local mean

The invention discloses a method for keeping balance of implementation class data through a local mean, which comprises the following steps: (1) distinguishing a minority class through acquiring training data; and calculating the number of majority class data and minority class data, and calculating an integer of the ratio of the number of the majority class data to the number of the minority class data; (2) calculating k neighbors in the minority class for each data point in the minority class, and generating new data through weighting the k neighbors; (3) repeatedly generating new data for each data point through adjusting parameters in the weights and utilizing the weighted summation of the k neighbors of each data point; (4) marking the new data as the minority class, and merging the new data and the original data to obtain balanced two-class data; and (5) further processing the balanced two-class data, i.e. training a classification algorithm, and realizing classification of new unmarked data. According to the invention, the accuracy of medical diagnosis can be improved, the recognition rate of network attacks is improved, the recognition rate of server failures is improved, the recognition of garbage pages is improved, and the like.
Owner:SHANDONG NORMAL UNIV
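
Steps (1) to (4) of this abstract can be sketched as below. The abstract does not specify how the neighbor weights are chosen, so drawing random weights and normalizing them to sum to one is an assumption, as are the function name `local_mean_oversample` and the choice to create `ratio - 1` new points per minority point.

```python
import math
import random


def local_mean_oversample(minority, n_majority, k=3, seed=0):
    """Generate synthetic minority points as weighted means of each point's
    k nearest minority neighbours (weights are random here, an assumption)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    ratio = n_majority // len(minority)  # integer ratio from step (1)
    synthetic = []
    for i, point in enumerate(minority):
        others = minority[:i] + minority[i + 1:]
        neighbours = sorted(others, key=lambda q: math.dist(point, q))[:k]
        for _ in range(max(ratio - 1, 0)):
            raw = [rng.uniform(0.1, 1.0) for _ in neighbours]
            total = sum(raw)
            weights = [w / total for w in raw]  # weights sum to 1
            synthetic.append(tuple(
                sum(w * q[d] for w, q in zip(weights, neighbours))
                for d in range(len(point))
            ))
    return synthetic


if __name__ == "__main__":
    minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    new_points = local_mean_oversample(minority, n_majority=12)
    print(len(minority) + len(new_points))
```

With 4 minority points and 12 majority points the integer ratio is 3, so each minority point contributes two synthetic points and the merged minority set reaches 12.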

Subway fault data classification method based on unbalanced data set

The invention discloses a subway fault data classification method based on an unbalanced data set. The method comprises the following steps: inputting an original unbalanced data set, and dividing the unbalanced data set into a training data set and a test data set; dividing the training data set into a positive class sample set and a negative class sample set, wherein the positive class sample set contains the minority class samples and the negative class sample set contains the majority class samples; dividing the positive class sample set into K different clusters by using a K-Means clustering algorithm; for each cluster, sampling the data set by using an improved SMOTE algorithm to finally obtain a balanced data set; taking the SVM as a weak classifier, and constructing an integrated classifier by using an AdaBoost algorithm; and evaluating the performance of the integrated classifier by using the test data set. The method can effectively improve the recognition rate of minority class samples in the unbalanced data set while guaranteeing the overall accuracy, and has a better effect in the classification prediction of the unbalanced data set.
Owner:NANJING UNIV OF SCI & TECH

A multi-classification method based on adaptive balanced integration and dynamic hierarchical decision-making

Inactive | CN109359704A | Reduce dependence | Solve the problem of unbalanced number of positive and negative samples | Character and pattern recognition | Majority class | Data set
The embodiment of the invention provides a multi-classification method based on adaptive balance integration and dynamic hierarchical decision, which includes converting the original data set into a plurality of two-class data sets according to a one-to-many decomposition strategy, taking the number of the majority class samples and the minority class samples in each two-class data set as the upper and lower limits of the parameter interval respectively, taking the average accuracy rate of each class as the scoring standard, and obtaining the sampling number of each subset by a grid search method. Based on this, the over-sampling and under-sampling techniques are combined to balance the two-class data sets to establish a plurality of binary classification sub-models, and the binary classification model is obtained by integrating the sub-models through the averaging method. According to the output results of all the binary classification models, the spatial position information of the test samples is obtained under the one-to-many framework, and the classification strategies for the blank area, the intersecting area and the normal area are established to determine the final category of the test samples. The technical proposal provided by the embodiment of the invention can improve the overall recognition rate of the classification model for each category under the one-to-many framework.
Owner:BEIJING UNIV OF POSTS & TELECOMM
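
The first step of this abstract, one-to-many decomposition with the class counts used as search bounds, might look like the sketch below. The function name `one_vs_rest_bounds` and the returned dictionary layout are illustrative assumptions; the grid search and the ensemble itself are not shown.

```python
from collections import Counter


def one_vs_rest_bounds(labels):
    """For each class, build a binary (class vs. rest) label vector and report
    the minority and majority counts, used as lower and upper search limits."""
    result = {}
    for cls in sorted(set(labels)):
        binary = [1 if y == cls else 0 for y in labels]
        counts = Counter(binary)
        lower, upper = sorted([counts[0], counts[1]])
        result[cls] = {"binary": binary, "lower": lower, "upper": upper}
    return result


if __name__ == "__main__":
    # Hypothetical three-class label vector.
    toy = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
    for cls, info in one_vs_rest_bounds(toy).items():
        print(cls, info["lower"], info["upper"])
```

A grid search over sampling numbers between `lower` and `upper` would then be run separately for each binary problem.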

Network intrusion detection model SGM-CNN based on class imbalance processing

For the data class imbalance problem, the present invention provides an effective network intrusion detection model SGM-CNN based on a Synthetic Minority Over-Sampling Technique (SMOTE) and a Gaussian Mixture Model (GMM) based on a data flow. According to the technical scheme, the method comprises the steps of firstly obtaining a to-be-identified network data flow; and preprocessing the data stream, inputting the preprocessed data stream into a pre-established network intrusion detection model based on a one-dimensional convolutional neural network (1D CNN), and outputting a detection result of the network data stream. The invention provides a class imbalance processing technology, namely an SGM, for large-scale data. The SGM firstly uses SMOTE to perform oversampling on minority class samples, then uses GMM to perform clustering-based downsampling on majority class samples, and finally balances data of each class. According to the SGM method, expensive time and space cost caused by oversampling is avoided, the situation that important samples are lost due to random downsampling is avoided, and the detection rate of minority classes is remarkably increased.
Owner:ZHENGZHOU UNIV

Down's syndrome screening method based on machine learning at progestational stage and pregnant metaphase

The invention relates to a Down's syndrome screening method based on machine learning at the progestational stage and the pregnant metaphase. The method comprises steps that ns fields of pregnant women's Down's screening result data at the pregnant metaphase are selected as training characteristics; Ns samples are added to a data set A; the samples of the data set A are preprocessed to make the number of samples in a minority class set be balanced with the number of samples in a majority class set to obtain a synthetic data set; samples in the synthetic data set are processed to obtain a prediction model for determining whether a fetus has the Down's syndrome, and the prediction model is utilized to predict a tested sample to obtain the prediction result. The method is advantaged in that the process of artificially dividing the indicator threshold is avoided, human resources are saved, and relatively high accuracy and relatively low false positive rate can be achieved.
Owner:JILIN UNIV

Unbalanced data classification method based on mixed sampling and machine learning

The invention discloses an unbalanced data classification method based on mixed sampling and machine learning. The method comprises the steps of: step 1, generating a training set; step 2, for the minority class sample set P in the training set, copying P to generate P', combining P and P' into PP', using the SMOTE algorithm to generate S on the basis of PP', so that P, P' and S together form PP'S; step 3, for the majority class sample set N in the training set, randomly undersampling without replacement to obtain t subsets Ni; step 4, repeating step 2 t times to obtain t different PP'Si, and combining each Ni with the corresponding PP'Si into a new training set to obtain t subsets; step 5, training to generate t classifiers Hi; and step 6, integrating the t classifiers Hi to obtain a final classifier H, and using H to classify the unbalanced data set. According to the method, more attention is paid to minority class samples while majority class information is not excessively lost; the possibility of over-fitting and over-generalization is reduced; the training effect is good, overfitting is unlikely, and training is fast.
Owner:CENT SOUTH UNIV
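
Steps 2 to 4 of this abstract can be sketched as follows. Two simplifications are assumptions of this sketch and not taken from the source: the SMOTE step is replaced by interpolation between random pairs of points (`smote_like`), and "undersampling without replacement" is read as splitting a shuffled N into t disjoint chunks. All function and parameter names are hypothetical.

```python
import random


def smote_like(points, n_new, rng):
    """Simplified stand-in for SMOTE: interpolate between random pairs."""
    out = []
    for _ in range(n_new):
        a, b = rng.choice(points), rng.choice(points)
        gap = rng.random()
        out.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
    return out


def build_subsets(P, N, t, n_synth, seed=0):
    """Return t training subsets, each PP'S_i (label 1) plus N_i (label 0)."""
    rng = random.Random(seed)
    PP = list(P) + list(P)          # P together with its copy P'
    shuffled = list(N)
    rng.shuffle(shuffled)
    size = len(N) // t              # disjoint chunks: no sample is reused
    subsets = []
    for i in range(t):
        Ni = shuffled[i * size:(i + 1) * size]
        S = smote_like(PP, n_synth, rng)   # fresh synthetic set per subset
        subset = [(p, 1) for p in PP + S] + [(q, 0) for q in Ni]
        subsets.append(subset)
    return subsets


if __name__ == "__main__":
    P = [(0.0, 0.0), (1.0, 1.0), (0.5, 0.2)]
    N = [(float(i), 5.0) for i in range(12)]
    for subset in build_subsets(P, N, t=3, n_synth=2):
        print(len(subset))
```

Each subset would then train one classifier Hi, and the t classifiers would be combined into H as in steps 5 and 6.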

Oversampling method for unbalanced data set

The invention relates to the technical field of data mining, and provides an oversampling method of an unbalanced data set. The method comprises the following steps: firstly, collecting an unbalanced data set, clustering the unbalanced data set based on a K-means method, and dividing the unbalanced data set into a minority class and a majority class according to the number of elements in the data set of each class; then, on the basis of an SMOTE method, carrying out oversampling on the minority class data set to obtain a synthesized minority class data set; then, performing oversampling with replacement on the synthesized minority class data set to obtain a new minority class data set, and forming a new data set; and finally, based on a CCA method, cleaning the new data set: clustering the new data set, calculating and sorting Euclidean distances between each sample in each class cluster and other samples in the class cluster, and deleting the sample corresponding to the farthest Euclidean distance to obtain the cleaned data set. According to the method, more minority class samples can be effectively synthesized, the learnability of the samples is improved, and the effectiveness of the samples is improved.
Owner:NORTHEASTERN UNIV

Imbalanced data industrial fault classification method based on k-means

The invention discloses an imbalanced data industrial fault classification method based on k-means. The method comprises the following steps: first, utilizing k-means, based on the imbalance degrees, to cluster the classes with relatively large amounts of data; dividing the majority classes into N sub-classes; combining the N sub-classes with M minority classes to form an (M+N)-class multi-classification problem; and finally, performing learning with a naive Bayes classifier. Compared with other existing methods in the prior art, the method of the invention keeps the information of the original data to the largest extent and better resolves the problem of imbalanced class data classification while preventing over-fitting. Therefore, compared with other methods, the classification precision is increased, and the phenomenon of over-fitting can be reduced.
Owner:ZHEJIANG UNIV
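
The relabeling idea in this abstract, splitting a majority class into sub-classes so the task becomes an (M+N)-class problem, can be illustrated with a tiny k-means written in plain Python. The deterministic initialization from the first k points, the fixed iteration count, and the names `kmeans` and `split_majority` are assumptions of this sketch; the naive Bayes step is not shown.

```python
import math


def kmeans(points, k, iters=20):
    """Very small k-means; centres start at the first k points (assumption)."""
    centres = [points[i] for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: math.dist(p, centres[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def split_majority(samples, labels, majority_label, n_sub):
    """Replace the majority label by n_sub cluster-specific sub-labels."""
    idx = [i for i, y in enumerate(labels) if y == majority_label]
    assign = kmeans([samples[i] for i in idx], n_sub)
    new_labels = list(labels)
    for i, a in zip(idx, assign):
        new_labels[i] = f"{majority_label}_sub{a}"
    return new_labels


if __name__ == "__main__":
    samples = [(0, 0), (0, 1), (10, 10), (10, 11), (5, 5)]
    labels = ["a", "a", "a", "a", "b"]
    print(split_majority(samples, labels, "a", n_sub=2))
```

In the toy data the majority class "a" has two well-separated groups, so the relabeled problem has three classes (N = 2 sub-classes plus M = 1 minority class).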

Unbalanced data classification undersampling method, device and equipment and medium

Inactive | CN108647727A | Low resolution | Solve the problem of deleting samples that should not be deleted | Character and pattern recognition | Majority class | Near neighbor
The invention discloses an unbalanced data classification undersampling method, comprising the steps of: obtaining all majority samples in the to-be-processed unbalanced data; according to the K nearest neighbor algorithm, obtaining the number of minority samples among the k nearest neighbor samples of each majority sample; according to the number of minority samples, determining the category corresponding to each majority sample; and according to the category of each majority sample, performing the operation corresponding to that category. The method addresses the low precision of classification learning algorithms caused by having many majority samples and few minority samples in unbalanced big data classification, and improves the accuracy of unbalanced big data classification.
Owner:GUANGZHOU UNIVERSITY
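
The neighbor-counting step of this abstract can be sketched as below. The abstract does not name the categories or the operation applied to each, so the three labels used here ("safe", "boundary", "noise") and their thresholds are assumptions, as is the function name `categorize_majority`.

```python
import math


def categorize_majority(samples, labels, minority_label, k=3):
    """Map each majority sample index to a category based on how many of its
    k nearest neighbours are minority samples (category names are assumed)."""
    categories = {}
    for i, (point, label) in enumerate(zip(samples, labels)):
        if label == minority_label:
            continue
        others = [j for j in range(len(samples)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(point, samples[j]))[:k]
        n_minority = sum(labels[j] == minority_label for j in nearest)
        if n_minority == 0:
            categories[i] = "safe"       # surrounded by majority samples
        elif n_minority == len(nearest):
            categories[i] = "noise"      # surrounded by minority samples
        else:
            categories[i] = "boundary"   # mixed neighbourhood
    return categories


if __name__ == "__main__":
    samples = [(0, 0), (0, 1), (1, 0), (1, 1),
               (10, 10), (10, 11), (11, 10), (10.5, 10.5)]
    labels = [0, 0, 0, 0, 1, 1, 1, 0]
    print(categorize_majority(samples, labels, minority_label=1))
```

An undersampler built on this could, for example, delete only the "noise" samples and keep the rest, which is one way to avoid deleting samples that should not be deleted.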

A hierarchical nearest neighbor undersampling method based on clustering

The embodiment of the invention provides a hierarchical nearest neighbor undersampling method based on clustering, which comprises the following steps: elbow diagrams of the majority class samples are obtained by using a K-means clustering algorithm, and an optimal cluster number k is selected according to the relationship between the cluster number and the sum of the distortion degrees of the clusters; the K-means clustering algorithm is used to cluster the majority class samples into k clusters so as to obtain the center point and the number of sample points of each cluster. According to the number of sample points in each cluster, stratified sampling is carried out, and the nearest neighbors of the center point of each cluster are combined with the minority class samples as the sampling result. The technical proposal provided by the embodiment of the invention fully utilizes the distribution characteristics of the majority class samples, better retains the useful information of the majority class samples, and can effectively improve the classification effect of the subsequent classification algorithm.
Owner:BEIJING UNIV OF POSTS & TELECOMM
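
The stratified selection step can be sketched on its own, assuming the majority class has already been clustered. The proportional quota rule (with at least one sample per non-empty cluster) and the function name `stratified_nearest_to_center` are assumptions; the elbow-based choice of k is not shown.

```python
import math


def stratified_nearest_to_center(clusters, n_total):
    """From each cluster keep the points closest to its centroid, with a
    quota proportional to the cluster size (quota rule is an assumption)."""
    total = sum(len(c) for c in clusters)
    selected = []
    for cluster in clusters:
        if not cluster:
            continue
        quota = max(1, round(n_total * len(cluster) / total))
        centre = tuple(sum(col) / len(cluster) for col in zip(*cluster))
        ranked = sorted(cluster, key=lambda p: math.dist(p, centre))
        selected.extend(ranked[:quota])
    return selected


if __name__ == "__main__":
    clusters = [
        [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)],
        [(10, 10), (10, 12)],
    ]
    print(stratified_nearest_to_center(clusters, n_total=4))
```

The selected majority points would then be merged with all minority samples to form the training set.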

Under-sampling classification integration method and device for credit scoring and storage medium

The invention provides an under-sampling classification integration method and device for credit scoring and a storage medium. The method comprises the steps of obtaining a user training set, and dividing sample data in the training set into a majority class data set and a minority class data set; randomly undersampling k majority class data subsets from the majority class data set by using an undersampling algorithm, wherein each majority class data subset comprises a majority class data subset of n first data samples, and mn first data samples left after each time of undersampling form k pure majority class data subsets; mixing the k majority class data subsets with second data samples in the minority class data set to form k balanced data subsets; learning k CART tree dichotomy base classifiers by using the k balanced data subsets; utilizing the k pure majority class data subsets to learn k OnClassSVM classification base classifiers; and integrating the base classifiers through a bagging algorithm to output a final result. The problem of data imbalance in credit scoring is solved, and data samples are fully utilized to improve the classification performance.
Owner:CHANGSHA UNIVERSITY OF SCIENCE AND TECHNOLOGY +1

A multi-classification oriented unbalanced data preprocessing method and device and an apparatus

Inactive | CN109033148A | Improve classification accuracy | Resolve problems that arise when conflicts arise | Special data processing applications | Majority class | Algorithm
The invention discloses a multi-classification oriented unbalanced data preprocessing method and device and an apparatus. The method comprises the following steps: receiving the final sample set size and the unbalanced ratio of the sample set, and obtaining the ideal sample number of each class; according to the number of ideal samples and the number of actual samples, determining the sample sets of minority classes and majority classes; for the samples in a minority class sample set, calculating the number of other-class samples and minority class samples among the k nearest neighbors to classify the samples; for a minority class sample set, performing deleting, saving, copying or synthesizing according to the markers of the samples to obtain the final minority class sample set; for the samples in a majority class sample set, calculating the number of same-class samples and other samples among the k nearest neighbors to classify the samples; the samples in the majority class sample sets are deleted or saved according to the markers of the samples to obtain the final majority class sample sets; and the final sample set is generated. The invention enables the final sample set to effectively improve the accuracy of the multi-classification algorithm.
Owner:GUANGZHOU UNIVERSITY

Unbalanced data set conversion method and system based on sampling and feature reduction

The invention provides an unbalanced data set conversion method and system based on sampling and feature reduction, and the method comprises the steps: carrying out the sampling of samples in an unbalanced data set through a sampling method, and enabling the number of minority class samples to be close to the number of majority class samples; sorting the features from large to small by utilizing the correlation between the features and the category labels; sequentially deleting one-dimensional features from the last dimension of the features according to a sequence; inputting the sample data set of which the one-dimensional features are reduced into the random forest model every time when the one-dimensional features are deleted, calculating ACC values corresponding to the samples, comparing all the ACC values, and selecting the feature dimension corresponding to the maximum ACC value as a target feature dimension of feature reduction. New unbalanced data obtained through the conversion method is input into the multi-classification SVM for training, and the classification accuracy can be remarkably improved.
Owner:COMP NETWORK INFORMATION CENT CHINESE ACADEMY OF SCI
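
The feature-reduction loop in this abstract can be sketched as below. The abstract scores each candidate feature set with the ACC of a random forest; here that step is represented by a caller-supplied `evaluate` callback, which is a hypothetical placeholder, and Pearson correlation is assumed as the correlation measure. The function names are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def select_feature_count(X, y, evaluate):
    """Rank features by |correlation| with the label, drop them one at a time
    from the weakest end, and keep the feature set with the best score."""
    n_features = len(X[0])
    order = sorted(range(n_features),
                   key=lambda j: abs(pearson([row[j] for row in X], y)),
                   reverse=True)
    best_score, best_features = float("-inf"), order
    for n_keep in range(n_features, 0, -1):
        features = order[:n_keep]
        score = evaluate(features)  # e.g. accuracy of a model on these columns
        if score > best_score:
            best_score, best_features = score, features
    return best_features, best_score


if __name__ == "__main__":
    X = [[0, 1, 5], [1, 1, 5], [0, 0, 5], [1, 1, 5]]
    y = [0, 1, 0, 1]
    # Hypothetical scorer standing in for the random forest accuracy.
    print(select_feature_count(X, y, lambda f: 1.0 if f == [0] else 0.5))
```

In practice `evaluate` would train and validate a model restricted to the given feature indices and return its accuracy.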

Method for extracting sensitive data from unbalanced data based on SVM-forest

Active | CN107728476A | Reduce imbalance | Classification effect balance | Adaptive control | Majority class | Test sample
The invention discloses a method for extracting sensitive data from unbalanced data based on SVM forest. The method comprises the steps that a part of labeled samples are taken as test samples, and the rest of the samples are used as training samples; k-Means is used to divide a normal working condition class into subclasses, and the subclasses are mixed with fault working condition type data to form N training subsets; an SVM-tree method is used to train SVM-Forest, and the test samples are used to test the SVM-forest; L trees with the highest fault working condition misclassification rate are selected; some data with a great influence on the classification effect are kept; according to a selection classification algorithm, a classifier T is trained through the minority classes and the remaining majority classes in a test set; and a temporary test sample is used to test the classification effect of T until the effect meets requirements. According to the sensitive data extracting method provided by the invention, samples with a great influence on the classification effect in a majority of sample sets are selected through multiple iterations to reduce the degree of unbalance; and the classification effect is close to or up to an equal classification effect under the same condition.
Owner:ZHEJIANG UNIV

Data category imbalance processing method, device and system and storage medium

The invention provides a data category imbalance processing method, which is applied to the technical field of data processing and comprises the following steps: clustering minority class samples to obtain a plurality of clusters; calculating the k neighbor samples of the minority class samples in each cluster; obtaining the number of majority class samples among the k neighbor samples of the minority class samples in each cluster; calculating the ratio of the number of majority class samples in the cluster to the number of k neighbor samples; and processing the samples in the cluster according to that ratio. The invention also provides data category imbalance processing equipment, a semi-supervised generative adversarial network training method and equipment, an abnormal transaction detection method and equipment, a system and a storage medium.
Owner:INDUSTRIAL AND COMMERCIAL BANK OF CHINA
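
The neighbor-ratio step in the abstract above can be illustrated with a short sketch. It assumes Euclidean distance and assumes the clustering has already been done; averaging the per-sample ratios over a cluster is one possible reading of the abstract, not something it states. The function names are hypothetical.

```python
import math


def majority_neighbor_ratio(sample, minority, majority, k):
    """Fraction of the k nearest neighbours of `sample` that are majority class.

    `sample` is expected to be an element of `minority`; it is skipped by
    identity so that it is not counted as its own neighbour. k must be >= 1.
    """
    candidates = [(math.dist(sample, p), 0) for p in minority if p is not sample]
    candidates += [(math.dist(sample, p), 1) for p in majority]
    candidates.sort(key=lambda pair: pair[0])
    nearest = candidates[:k]
    return sum(is_majority for _, is_majority in nearest) / len(nearest)


def cluster_ratio(cluster, minority, majority, k):
    """Mean majority-neighbour ratio over the minority samples of one cluster."""
    ratios = [majority_neighbor_ratio(s, minority, majority, k) for s in cluster]
    return sum(ratios) / len(ratios)
```

A ratio near 1 suggests a cluster surrounded by majority class samples, while a ratio near 0 suggests a cluster lying well inside minority class territory. How each case is then processed is left to the method itself.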

SAMME.RCW algorithm based face recognition optimization method

The invention relates to a SAMME.RCW algorithm based face recognition optimization method, which comprises the steps of firstly carrying out feature extraction on a face image, and carrying out recognition classification by using an image feature vector according to a SAMME.RCW algorithm. Modification is carried out on a weight adjustment process of the SAMME.RCW algorithm, thereby ensuring the weight of every class of samples not to be too small when re-sampling occurs, also enabling weight adjustment after re-sampling to be more partial to minority-class samples, and ensuring classification effects of the samples. A requirement of the SAMME.RCW algorithm for the performance of a weak classifier is that the weight of correctly classified samples in each class is greater than the weight of any other class of samples, and a requirement for the accuracy is performed on each class independently. Through modification carried out on weight allocation in re-sampling, the probability of being selected of each class of samples is ensured to be basically the same, and classification effects of the minority-class samples and majority-class samples in the weak classifier are ensured at the same time. The accuracy of face recognition is effectively improved by a finally acquired strong classifier.
Owner:BEIJING UNIV OF TECH

Data equalization method, system and equipment

The invention discloses a data equalization method. The method comprises the following steps: receiving an input unbalanced data set; calculating a boundary judgment factor for each sample in the minority class sample set; sampling the boundary set of the majority class sample set through a sparse sampling strategy to obtain a new majority class sample set; carrying out interpolation through a partition neighborhood interpolation strategy to construct a new minority class sample set; and combining the new majority class sample set with the new minority class sample set to obtain a processed unbalanced data set. According to the invention, the decision space of the majority class boundary samples is shrunk, and the boundary between majority class and minority class samples becomes clearer; meanwhile, the partition neighborhood interpolation strategy makes the sample boundary clearer still, and the mining effect on the unbalanced data is improved. The invention further provides a data equalization system, equipment, and a computer readable storage medium which have the above beneficial effects.
Owner:GUANGDONG UNIV OF TECH

Text feature selection method based on unbalanced data sets

The invention relates to a text feature selection method based on unbalanced data sets. Feature sets of unbalanced documents are calculated on a computer, and modelling is carried out by selecting a classification algorithm model. The text feature selection method specifically comprises the following steps: (1) dividing the data sets into majority classes and minority classes, stipulating the minority classes as positive classes represented by ci, and stipulating the majority classes as negative classes represented by a formula shown in the specification; (2) pre-processing the texts in the data sets by executing operations such as word segmentation and stop word removal, so as to form a set T of features t; (3) calculating the parameters A, B, C, D and N corresponding to each feature t in the unbalanced class documents; (4) calculating the new χ²(t, ci) of each feature t under the different classes in the unbalanced class documents; (5) setting threshold values for screening features in the unbalanced class documents, sorting the features by the magnitude of their χ²(t, ci) values, and taking out, for each class, a feature set T' containing a specified number of features; and (6) selecting a suitable classification algorithm model (such as a decision tree, a support vector machine or Bayes) and modelling on the selected feature set T'.
Owner:ZHEJIANG UNIV OF TECH
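
For orientation, steps (3) to (5) above can be sketched with the conventional chi-square statistic. The abstract does not give the "new" χ² formula or define A, B, C and D, so the sketch below uses the textbook form and the usual contingency-table meanings as assumptions; it does not reproduce the patent's modified statistic. The function names are hypothetical.

```python
def chi_square(a, b, c, d):
    """Textbook chi-square score of a term for a class (2x2 contingency table).

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # the term or the class is degenerate, so no evidence either way
    return n * (a * d - c * b) ** 2 / denominator


def top_features(docs, target_class, limit):
    """docs: list of (set_of_terms, class_label). Returns the `limit` best terms."""
    vocabulary = set().union(*(terms for terms, _ in docs))
    scores = {}
    for term in vocabulary:
        a = sum(1 for terms, label in docs if label == target_class and term in terms)
        b = sum(1 for terms, label in docs if label != target_class and term in terms)
        c = sum(1 for terms, label in docs if label == target_class and term not in terms)
        d = len(docs) - a - b - c
        scores[term] = chi_square(a, b, c, d)
    # Highest score first; ties are broken alphabetically for reproducibility.
    return sorted(scores, key=lambda t: (-scores[t], t))[:limit]
```

Note that the textbook statistic also scores highly for terms that are strongly associated with the absence of the class, which is one reason variants are proposed for unbalanced data.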

Minority-class identification-oriented multi-strategy joint fault diagnosis method

Inactive | CN112067053A | Improve fault recognition rate | Solve the problem of difficult fault identification | Measurement devices | Character and pattern recognition | Data class | Majority class
The invention discloses a minority-class identification-oriented multi-strategy joint fault diagnosis method, which comprises the following steps: performing equalization processing on the sample data; training the constructed multi-strategy joint fault diagnosis model step by step with the equalized sample data; and constructing a DBN-based feature extractor, so that the deep features of majority class samples can be extracted, the shallow and deep features of minority class samples are fused, and the minority class fault recognition rate is improved. Working at the multiple levels of data, features and classifiers, the method makes full use of the powerful data representation and feature extraction capacity of deep learning, solves the problem that minority class faults are difficult to recognize due to data class imbalance, and comprehensively improves the recognition of minority class faults.
Owner:BEIJING INSTITUTE OF TECHNOLOGY