Mass data clustering analysis method and device

A cluster analysis technology for massive data, applied in the field of data analysis. It addresses problems such as the inability to identify targets in massive data, and it ensures load balancing and improves computing efficiency.

Inactive Publication Date: 2020-01-21
CHENGDU SEFON SOFTWARE CO LTD

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to provide a massive data clustering analysis method and device. In the era of big data, the number of features and the volume of data generated by people's behavior have grown rapidly and now far exceed what traditional methods can handle. As a result, traditional target recognition methods cannot identify targets quickly and effectively in a big data environment, which is the problem the invention addresses.


Examples

Embodiment 1

[0059] A massive data clustering analysis method, comprising the following steps:

[0060] S1. Process the original data with a GeoHash encoding algorithm based on overlapping partitions, and determine the partition corresponding to each data item in the original data;

[0061] S2. Cluster the data within each partition, processing the partitions in parallel, and save the cluster IDs;

[0062] S3. Merge the partition results to obtain the global cluster IDs.

[0063] The purpose of the present invention is to realize a DBSCAN algorithm based on parallel computing and to solve the problem that the traditional density clustering algorithm cannot analyze massive data. The invention proposes an efficient overlapping-partition and cluster-merging strategy that can quickly split data and merge clusters. The method takes load balancing fully into account and can run efficiently under a distributed framework. It thus supports clustering of massive data and efficiently solves the problem t...
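Step S3 can be illustrated with a minimal sketch. The source does not spell out the merging rule, so the rule used here is an assumption: when a point in an overlap region received a non-noise label in more than one partition, those local clusters are treated as one global cluster, tracked with a union-find structure. The function name `merge_partition_clusters`, the input layout, and the `NOISE` marker are hypothetical. A stricter variant would merge only on shared core points, since a shared border point can bridge two clusters that DBSCAN would keep apart.

```python
NOISE = -1  # assumed marker for points that belong to no cluster


def merge_partition_clusters(local_labels):
    """Map per-partition cluster labels to global cluster IDs.

    local_labels: {partition_id: {point_id: local_label}}; a point in an
    overlap region appears under several partition_ids (assumed layout).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    first_key = {}      # point_id -> first (partition, label) seen
    all_points = set()
    for part, labels in sorted(local_labels.items()):
        for pid, lab in labels.items():
            all_points.add(pid)
            if lab == NOISE:
                continue
            key = (part, lab)
            find(key)  # register the local cluster
            if pid in first_key:
                # Same point clustered in two partitions: merge clusters.
                union(first_key[pid], key)
            else:
                first_key[pid] = key

    global_ids = {}
    result = {}
    for pid, key in sorted(first_key.items()):
        root = find(key)
        result[pid] = global_ids.setdefault(root, len(global_ids))
    for pid in all_points - first_key.keys():
        result[pid] = NOISE  # noise in every partition that saw it
    return result
```

In this sketch a point is noise globally only if every partition that contains it labeled it as noise.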

Embodiment 2

[0065] This embodiment builds on Embodiment 1. The GeoHash encoding based on overlapping partitions is named the OverLap-GeoHash algorithm. Across the whole procedure, the DBSCAN step has the highest time complexity and space complexity. By the barrel principle (overall speed is limited by the slowest partition), ensuring efficient parallel clustering requires dividing the data into regions as evenly as possible.

[0066] The GeoHash algorithm is a spatial encoding algorithm. It is often used for two-dimensional latitude and longitude data, which it maps to one-dimensional values or strings. Here it is extended to multi-dimensional data and improved in combination with the overlapping partition strategy. Each data item is mapped to a one-dimensional value, which is the ID code of its partition. If the point to be encoded is an overlapping point, it is mapped to multiple values, and...
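The idea that an overlapping point maps to several partition IDs can be sketched as follows. The source text is truncated before it defines what makes a point "overlapping", so this sketch assumes a margin `eps` around each split midpoint: a point within `eps` of the midpoint is assigned to both halves. The function name `overlap_geohash_codes` and the parameter `eps` are hypothetical, not taken from the patent.

```python
def overlap_geohash_codes(point, lower, upper, n_iter, eps):
    """Return every partition code a point falls into.

    Each iteration bisects one dimension (iteration index modulo the
    number of dimensions). A point within eps of the midpoint is kept
    in both halves (assumed overlap rule), so it can receive several
    codes.
    """
    results = set()

    def visit(it, code, lo, hi):
        if it == n_iter:
            results.add(code)
            return
        dim = it % len(point)
        mid = (lo[dim] + hi[dim]) / 2.0
        if point[dim] <= mid + eps:
            # Lower half (extended by eps): append bit 0.
            new_hi = list(hi)
            new_hi[dim] = mid
            visit(it + 1, code << 1, lo, new_hi)
        if point[dim] > mid - eps:
            # Upper half (extended by eps): append bit 1.
            new_lo = list(lo)
            new_lo[dim] = mid
            visit(it + 1, (code << 1) | 1, new_lo, hi)

    visit(0, 0, list(lower), list(upper))
    return sorted(results)
```

With `eps = 0` every point receives exactly one code, which reduces to plain bisection encoding. For DBSCAN, a natural choice would be to tie `eps` to the neighborhood radius, though the source does not state this.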

Embodiment 3

[0076] This embodiment builds on Embodiment 2. In step S1, the GeoHash encoding algorithm processes the original data and determines the partition corresponding to each data item through the following steps:

[0077] S101. Initialize the Hash value to binary 0 and the iteration count to 0; the total number of iterations N and the upper and lower bounds of each dimension are given;

[0078] S102. For any data item D, the selected dimension is the iteration count modulo the number of dimensions. When the value of D in this dimension is not greater than the midpoint between the dimension's upper and lower bounds, shift the Hash value left by one bit, update the upper bound of the dimension to that midpoint, and add 1 to the iteration count; when the value of data D in this dimension is greater than the midpoint between the upper and lower bounds of the...
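Steps S101 and S102 can be sketched as a short bisection loop. The text of S102 is cut off before it describes the "greater than the midpoint" case, so that branch below follows the standard GeoHash convention (shift left, set the low bit to 1, raise the lower bound to the midpoint) and is an assumption rather than the patent's wording. The function name `geohash_encode` is hypothetical.

```python
def geohash_encode(point, lower, upper, n_iter):
    """Encode a multi-dimensional point as an integer partition code."""
    lo = list(lower)   # per-dimension lower bounds (copied, not mutated)
    hi = list(upper)   # per-dimension upper bounds
    code = 0           # S101: Hash value starts at binary 0
    for it in range(n_iter):
        dim = it % len(point)          # S102: iteration modulo dimensions
        mid = (lo[dim] + hi[dim]) / 2.0
        code <<= 1                     # shift left by one bit
        if point[dim] <= mid:
            hi[dim] = mid              # keep the lower half
        else:
            # Assumed from standard GeoHash; this case is truncated
            # in the source text.
            code |= 1
            lo[dim] = mid              # keep the upper half
    return code


# Usage: four iterations over the unit square give a 4-bit code.
print(geohash_encode((0.2, 0.8), [0.0, 0.0], [1.0, 1.0], 4))  # prints 5
```

Points that share a code fall in the same cell, so the code can serve directly as the partition ID described in S1.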

Abstract

The invention discloses a mass data clustering analysis method and device. It aims to realize a DBSCAN algorithm based on parallel computing and to solve the problem that a traditional density clustering algorithm cannot perform mass data analysis. The invention provides an efficient overlapping-partition and cluster-merging strategy, so that data splitting and cluster merging can be carried out quickly. The method takes load balancing fully into account and can run efficiently under a distributed framework. It therefore supports clustering of mass data and efficiently solves the problem that mass data analysis cannot be conducted with traditional DBSCAN, which gives the method high performance and practical value.

Description

Technical field

[0001] The invention relates to the field of data analysis, in particular to a massive data clustering analysis method and device.

Background technique

[0002] With social and economic development and the spread of telephone and Internet access, the rate of telecommunication fraud continues to rise. Because telecommunication fraud relies on cross-border means of communication, the social harm it causes is more widespread. Unlike general criminal cases, telecom fraud has a certain barrier to entry and is usually committed by gangs. Identifying criminal gangs through suspects' phone call and network behavior data has therefore become an effective way for public security organs to curb telecom fraud.

[0003] With the advent of the era of big data, data mining has become a powerful tool in the field of public security. By mining the distribution patterns in criminal suspects' data, t...


Application Information

IPC(8): G06F16/906, G06F16/901
CPC: G06F16/9014, G06F16/9024, G06F16/906
Inventor: 查文宇, 曾理, 徐浩, 王纯斌, 赵神州, 张艳清
Owner CHENGDU SEFON SOFTWARE CO LTD