
A large-scale data distributed clustering processing method based on MapReduce

A large-scale data distributed clustering technology, applied in special data processing applications, database models, relational databases, etc. It addresses the problems of reduced overall parallel clustering efficiency, time-consuming similarity computation, and high computational overhead, and achieves improved parallel clustering efficiency, fewer clustering iterations, and fast convergence.

Active Publication Date: 2019-06-25
北京点为信息科技有限公司

AI Technical Summary

Problems solved by technology

Canopy-Kmeans combines the K-Means method with the Canopy method: it uses Canopy to compute the similarity between objects and to preprocess the data. Its advantage is that it supplies initial cluster center points, which helps avoid falling into a local optimum. Its disadvantage is that computing the distances between objects to obtain the similarity is relatively time-consuming.
The method based on data density calculation computes the density of all data and then selects the data with the highest density as the cluster center points, which avoids the problems of random selection and is more accurate. However, the traditional density computation is also costly and tends to place a heavy load on the clustering nodes, reducing the overall efficiency of parallel clustering.


Examples

Embodiment Construction

[0037] Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

[0038] As shown in Figure 1, the Hadoop distributed cluster environment in this embodiment has 3 servers, which form 3 nodes: one master node (Master), which issues commands and distributes tasks, and 2 slave nodes (slaves), which receive the tasks distributed by the master node and run them as the master node requests. All nodes are connected through high-speed Ethernet. The master node starts the entire cluster environment in response to the user's application request. The slave nodes and the master node form the main body of the Hadoop distributed cluster's parallel system and are responsible for the processing and operation of the entire Hadoop distributed cluster. As shown in Figure 2, in this embodiment: 1) receive data to be processed according to u...

Abstract

The invention provides a large-scale data distributed clustering processing method based on MapReduce. The method includes: sampling the large-scale data in equal proportions without repetition; inputting the sampled data into the MapReduce distributed parallel framework and calculating the local density and the average density of the sampled data; finding all sampled data whose local density is greater than the average density as the set of candidate points for the initial cluster center of each cluster, and feeding this set back to the master node; selecting as the initial cluster center points all candidate points whose distance to every adjacent candidate point is greater than 2 times the set range; using the MapReduce distributed parallel framework to perform the parallel clustering task, in which the average distance between data is calculated for each cluster to update the cluster center point; applying the error sum of squares criterion function on the child nodes to determine whether to continue iterating; and having each child node cluster the large-scale data based on the cluster center points. The invention realizes parallel clustering, reduces the number of clustering iterations, and improves clustering accuracy and parallel clustering efficiency.
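
The steps in the abstract can be sketched in a few lines of Python. This is a minimal single-process illustration, not the patented implementation: plain functions stand in for the MapReduce map and reduce phases, and every name (`sample_without_repetition`, `initial_centers`, `map_assign`, `reduce_update`, `cluster`, `radius`, `tol`) is hypothetical. Several details are assumptions, because the abstract does not define them: local density is taken to be the number of other points closer than `radius` (standing in for the "set range"), the separation rule is read as a greedy pass that keeps a candidate only if it lies more than `2 * radius` from every center already kept, and the center update uses the ordinary k-means mean of the cluster's points.

```python
import math
import random
from collections import defaultdict


def sample_without_repetition(data, ratio, seed=0):
    """Draw a fixed fraction of the data, never picking a point twice."""
    k = max(1, int(len(data) * ratio))
    return random.Random(seed).sample(data, k)


def local_densities(points, radius):
    """Assumed density: number of other points closer than `radius`."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) < radius)
        for i, p in enumerate(points)
    ]


def initial_centers(points, radius):
    """Pick well-separated, denser-than-average points as starting centers."""
    dens = local_densities(points, radius)
    avg = sum(dens) / len(dens)
    # Candidates: points whose local density exceeds the average density,
    # visited from densest to sparsest.
    candidates = sorted(
        (i for i, d in enumerate(dens) if d > avg), key=lambda i: -dens[i]
    )
    centers = []
    for i in candidates:
        # Assumed reading of the rule: keep a candidate only if it is more
        # than 2 * radius away from every center kept so far.
        if all(math.dist(points[i], c) > 2 * radius for c in centers):
            centers.append(points[i])
    return centers


def map_assign(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    idx = min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))
    return idx, point


def reduce_update(assigned):
    """Reduce step: new center = coordinate-wise mean of the cluster."""
    n = len(assigned)
    return tuple(sum(coord) / n for coord in zip(*assigned))


def sse(groups, centers):
    """Error sum of squares: squared distance of each point to its center."""
    return sum(
        math.dist(p, centers[i]) ** 2 for i, pts in groups.items() for p in pts
    )


def cluster(data, centers, tol=1e-6, max_iter=100):
    """Iterate assign/update until the SSE changes by less than `tol`."""
    prev = float("inf")
    groups = defaultdict(list)
    for _ in range(max_iter):
        groups = defaultdict(list)
        for p in data:
            idx, point = map_assign(p, centers)
            groups[idx].append(point)
        # An empty cluster keeps its previous center.
        centers = [
            reduce_update(groups[i]) if groups[i] else centers[i]
            for i in range(len(centers))
        ]
        current = sse(groups, centers)
        if abs(prev - current) < tol:
            break
        prev = current
    return centers, groups
```

In a real Hadoop job the density computation and `map_assign` would run on the slave nodes over data splits and `reduce_update` would run per cluster key; here they are called in a loop only to show the data flow. Choosing initial centers from dense, well-separated samples is what lets the loop start near a good solution and stop after few iterations.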

Description

Technical field

[0001] The invention belongs to the technical field of parallel clustering, and in particular relates to a large-scale data distributed clustering processing method based on MapReduce.

Background

[0002] With the rapid development of information technology, the scale of data continues to grow, and using parallel mechanisms to mine and analyze large-scale data sets effectively can promote the development and progress of Internet technology. Cluster analysis is an important data processing technology and one of the important topics in the field of machine learning and artificial intelligence. It is widely used in data mining, information retrieval and other research. Its main task is to divide a data set into multiple subsets so that the data objects within a subset are highly similar, while the data objects in different subsets differ considerably. Due to the increase of data scale, the traditional stand-alone cluste...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/26
CPC: G06F16/285
Inventors: 高天寒, 孔雪
Owner: 北京点为信息科技有限公司