The invention discloses a hybrid sampling method for imbalanced data based on neighborhood density on the Spark big data platform, and relates to computer information acquisition and processing technology. In the method, the data are loaded into an RDD through Spark and normalized. The RDD is then divided into a positive domain, a negative domain and a boundary domain according to neighborhood density in combination with three-way decision theory; data in the boundary domain are sampled with the SMOTE algorithm, data in the negative domain are sampled with a hybrid sampling algorithm, and the data in the three domains are finally merged to obtain the final data set. By assigning each data record to a domain and processing it according to the characteristics of that domain, minority-class data can be appropriately increased while majority-class data are appropriately reduced. Finally, the MLlib algorithm library is called and the effect is evaluated with a machine learning classifier. The method can effectively alleviate the inter-class proportion imbalance of imbalanced data and improve the precision of the algorithm.
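
The steps summarized above (normalization, division into three domains by neighborhood density, SMOTE on the boundary domain, resampling of the negative domain, merging) can be illustrated with a minimal single-machine sketch in plain Python rather than on Spark RDDs. Everything specific in it is an assumption made for illustration and is not taken from the invention: the density definition (share of same-class samples within a radius delta), the thresholds alpha and beta, the rule for isolated samples, the SMOTE rate of one synthetic sample per boundary minority sample, and the stand-in used for the negative-domain hybrid sampling (keep minority samples, drop majority samples). The function names are hypothetical as well.

```python
import math
import random

def min_max_normalize(rows):
    """Scale every feature column to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

def partition_by_density(points, labels, delta, alpha, beta):
    """Assign each sample index to the 'pos', 'bnd' or 'neg' domain.

    Assumed density: fraction of same-label samples among the neighbours
    within distance delta (the sample itself is excluded).
    """
    domains = {"pos": [], "bnd": [], "neg": []}
    for i, p in enumerate(points):
        neigh = [j for j, q in enumerate(points)
                 if j != i and math.dist(p, q) <= delta]
        if neigh:
            density = sum(labels[j] == labels[i] for j in neigh) / len(neigh)
        else:
            density = 0.0  # assumption: isolated samples go to the negative domain
        if density >= alpha:
            domains["pos"].append(i)
        elif density <= beta:
            domains["neg"].append(i)
        else:
            domains["bnd"].append(i)
    return domains

def smote(seeds, pool, k, n_new, rng):
    """Interpolate between a seed and one of its k nearest points in pool."""
    synthetic = []
    for _ in range(n_new):
        s = rng.choice(seeds)
        nearest = sorted((q for q in pool if q != s),
                         key=lambda q: math.dist(s, q))[:k]
        if not nearest:
            continue  # no distinct neighbour to interpolate towards
        n = rng.choice(nearest)
        t = rng.random()
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(s, n)))
    return synthetic

def resample(rows, labels, minority, delta=0.3, alpha=0.7, beta=0.3, k=3, seed=0):
    """Return (resampled list of (point, label), domains)."""
    rng = random.Random(seed)
    pts = min_max_normalize(rows)
    dom = partition_by_density(pts, labels, delta, alpha, beta)
    minority_pts = [p for p, y in zip(pts, labels) if y == minority]
    # Positive and boundary domains: keep every original sample.
    out = [(pts[i], labels[i]) for i in dom["pos"] + dom["bnd"]]
    # Boundary domain: SMOTE, one synthetic sample per minority seed (assumed rate).
    seeds = [pts[i] for i in dom["bnd"] if labels[i] == minority]
    if seeds:
        out += [(p, minority) for p in smote(seeds, minority_pts, k, len(seeds), rng)]
    # Negative domain: stand-in for the hybrid sampling step (assumption):
    # minority samples are kept, majority samples are dropped.
    out += [(pts[i], labels[i]) for i in dom["neg"] if labels[i] == minority]
    return out, dom
```

Calling resample(rows, labels, minority=1) on a list of numeric feature tuples returns the merged data set together with the domain assignment of the original samples. In a Spark setting the same stages would operate on RDD partitions, and the merged result would then be passed to an MLlib classifier for evaluation; that part is not shown here.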