Spark big data platform-based neighborhood density imbalance data mixed sampling method

A mixed sampling technique for a big data platform, applied in electrical digital data processing, digital data information retrieval, special data processing applications, etc. The method aims to improve classification accuracy, modeling efficiency, and recognition rate.

Pending Publication Date: 2019-04-05
CHONGQING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

However, the EEG signal is very weak and is easily disturbed by various environmental and individual factors, which seriously degrades the performance of EEG identification systems.
Existing EEG signal preprocessing techniques have many limitations: they are suitable only for multi-electrode EEG signals, the decomposition process is time-consuming and lacks practical guidance, and the denoising process requires manual intervention.



Examples

Embodiment Construction

[0030] The technical solutions in the embodiments of the present invention will be described clearly and in detail below with reference to the drawings in the embodiments of the present invention. The described embodiments are only some of the embodiments of the invention.

[0031] The technical solution by which the present invention solves the above technical problems is as follows:

[0032] A neighborhood-density-based mixed sampling method for imbalanced data on the Spark big data platform comprises the following steps:

[0033] Upload the local imbalanced data set to the big data platform, normalize the data with z-score normalization, and store it in HDFS. Then use the Spark distributed computing framework to read the data file from HDFS into an RDD and convert each record into a LabeledPoint object. The specific steps include:

[0034] First, create an RDD data set in a distributed manner through the textFile method of the SparkContext object (for parallel computing in...
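
The loading and normalization step can be illustrated with a minimal sketch in plain Python rather than Spark. It assumes a comma-separated record layout with the class label in the last column, which the excerpt does not specify, and `LabeledRow`, `parse_line`, and `zscore_normalize` are hypothetical names standing in for the LabeledPoint conversion and the z-score step. On a cluster the same per-record logic would be applied to the RDD produced by textFile.

```python
import math
from typing import List, NamedTuple


class LabeledRow(NamedTuple):
    """Hypothetical stand-in for MLlib's LabeledPoint: a label plus features."""
    label: float
    features: List[float]


def parse_line(line: str) -> LabeledRow:
    # Assumed layout: comma-separated features with the class label last.
    values = [float(v) for v in line.strip().split(",")]
    return LabeledRow(label=values[-1], features=values[:-1])


def zscore_normalize(rows: List[LabeledRow]) -> List[LabeledRow]:
    # Per-feature mean and population standard deviation.
    n = len(rows)
    dims = len(rows[0].features)
    means = [sum(r.features[j] for r in rows) / n for j in range(dims)]
    stds = [
        math.sqrt(sum((r.features[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(dims)
    ]
    # A constant feature has zero spread; map it to 0.0 instead of dividing by zero.
    return [
        LabeledRow(
            r.label,
            [
                (r.features[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
                for j in range(dims)
            ],
        )
        for r in rows
    ]


# Three toy records: two features followed by the class label.
lines = ["1.0,10.0,0", "2.0,10.0,0", "3.0,10.0,1"]
normalized = zscore_normalize([parse_line(l) for l in lines])
```

After normalization each non-constant feature has zero mean and unit variance, so distance-based steps such as neighborhood search are not dominated by features with large numeric ranges.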


Abstract

The invention discloses a neighborhood-density-based mixed sampling method for imbalanced data on the Spark big data platform, and relates to computer information acquisition and processing technology. The data are stored in an RDD through Spark and normalized. Combining three-way decision theory, the RDD is divided into a positive region, a negative region, and a boundary region according to neighborhood density; boundary-region data are sampled with the SMOTE algorithm, negative-region data are sampled with a mixed sampling algorithm, and the data in the three regions are finally merged to obtain the final data set. By assigning each record to a region and processing it according to that region's characteristics, minority-class data can be appropriately increased while majority-class data are appropriately reduced. Finally, the MLlib algorithm library is called and the effect is evaluated with a machine learning classifier. The method effectively alleviates the inter-class proportion imbalance of imbalanced data and improves the precision of the algorithm.
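
The three-way split and the SMOTE interpolation described above can be sketched in plain Python. This is an illustration under stated assumptions, not the patented procedure: the excerpt does not give the density formula, so density is assumed here to be the fraction of a point's k nearest neighbors that share its label, and the thresholds `alpha` and `beta` as well as all function names are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Point = Tuple[Sequence[float], int]  # (features, label); label 1 = minority (assumed)


def neighborhood_density(i: int, data: List[Point], k: int) -> float:
    """Assumed density: share of the k nearest neighbours with the same label."""
    xi, yi = data[i]
    others = sorted(
        (math.dist(xi, x), y) for j, (x, y) in enumerate(data) if j != i
    )
    nearest = others[:k]
    return sum(1 for _, y in nearest if y == yi) / len(nearest)


def three_way_partition(
    data: List[Point], k: int = 2, alpha: float = 0.75, beta: float = 0.25
) -> Dict[str, List[int]]:
    """Assign each index to the positive, boundary, or negative partition."""
    regions: Dict[str, List[int]] = {"positive": [], "boundary": [], "negative": []}
    for i in range(len(data)):
        d = neighborhood_density(i, data, k)
        if d >= alpha:        # surrounded by its own class
            regions["positive"].append(i)
        elif d <= beta:       # surrounded by the other class
            regions["negative"].append(i)
        else:                 # mixed neighbourhood
            regions["boundary"].append(i)
    return regions


def smote_sample(
    x: Sequence[float], neighbor: Sequence[float], rng: random.Random
) -> List[float]:
    """SMOTE interpolation: a random point on the segment from x to neighbor."""
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


# Toy 1-D data: a minority cluster, one minority point next to a majority
# point, and one minority point isolated inside a majority cluster.
data: List[Point] = [
    ([0.0], 1), ([0.1], 1), ([0.2], 1), ([1.0], 1), ([1.1], 0),
    ([5.0], 0), ([5.1], 0), ([5.2], 0), ([5.05], 1),
]
regions = three_way_partition(data)

# Oversample the boundary minority point (index 3) toward a minority neighbour.
synthetic = smote_sample(data[3][0], data[2][0], random.Random(0))
```

In this toy run the three clustered minority points fall in the positive partition, the minority point beside a majority point falls in the boundary, and the isolated minority point falls in the negative partition. The negative-partition mixed sampling and the final merge are not shown because the excerpt does not describe them in detail.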

Description

Technical field

[0001] The invention belongs to the technical fields of computer data mining and computer information processing.

Background art

[0002] "Big data" has long existed in physics, biology, environmental ecology, the military, finance, communications, and other fields, but it has attracted widespread attention only in recent years with the development of the Internet and the information industry. Big data is another disruptive technological revolution in the IT industry after cloud computing and the Internet of Things. Cloud computing mainly provides storage and access places and channels for data assets, while data is the truly valuable asset. The amount of business transaction information within the enterprise, commodity logistics information in the Internet world, and human-to-human interaction information and location information in the Internet world will far exceed the carrying capacity of the existing enterprise IT architecture and i...

Application Information

IPC(8): G06F16/2458
Inventor: 胡峰, 余春霖, 代劲, 刘柯, 于洪, 张清华
Owner CHONGQING UNIV OF POSTS & TELECOMM