The invention provides an OPTICS clustering algorithm based on the Spark big data platform and relates to computer information acquisition and processing technology. The data is structurally partitioned in parallel to obtain the optimal partitioning of the data set, and a corresponding RDD is generated. The number of neighbor samples and the core distance of each sample are calculated in parallel, the OPTICS algorithm is executed on the partitions in parallel to obtain the cluster sequence of each partition, and the cluster sequences are persisted. Clusters are assigned within each partition according to its cluster sequence, and the partitions are then combined so that every sample obtains a global cluster number. Using Spark's distributed parallel technology, the method finds the optimal partitioning structure and obtains the cluster sequences of the partitions through parallel calculation. From the OPTICS cluster sequence, a user can observe the inherent clustering structure of a data set at different levels. The method can process large data sets that a serial algorithm cannot handle, and the time needed to obtain the clustering result is greatly shortened.
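
The per-partition steps described above (core distance, cluster sequence, cluster assignment) can be illustrated with a minimal sketch in plain Python. It is not the claimed implementation: it runs serially without Spark, and all function names, the parameters eps, min_pts and eps_prime, and the brute-force neighbor search are assumptions made for illustration. In particular, the final step only offsets local cluster ids so that they are unique across partitions. It does not merge clusters that span a partition boundary, which the abstract does not describe in enough detail to reproduce.

```python
import heapq
import math

NOISE = -1

def core_distance(points, i, eps, min_pts):
    """Distance to the min_pts-th nearest point within eps (the point itself counts), else None."""
    near = sorted(d for d in (math.dist(points[i], q) for q in points) if d <= eps)
    return near[min_pts - 1] if len(near) >= min_pts else None

def optics_order(points, eps, min_pts):
    """Serial OPTICS on one partition: list of (index, reachability, core_distance)."""
    n = len(points)
    core = [core_distance(points, i, eps, min_pts) for i in range(n)]
    reach = [math.inf] * n
    done = [False] * n
    order = []
    for start in range(n):
        if done[start]:
            continue
        heap = [(math.inf, start)]
        while heap:
            _, i = heapq.heappop(heap)
            if done[i]:
                continue  # stale heap entry for an already processed sample
            done[i] = True
            order.append((i, reach[i], core[i]))
            if core[i] is None:
                continue  # non-core samples do not expand the ordering
            for j in range(n):
                if done[j]:
                    continue
                d = math.dist(points[i], points[j])
                if d <= eps:
                    new_reach = max(core[i], d)
                    if new_reach < reach[j]:
                        reach[j] = new_reach
                        heapq.heappush(heap, (new_reach, j))
    return order

def extract_clusters(order, eps_prime):
    """Cut the ordering at eps_prime (<= eps): {index: local cluster id or NOISE}."""
    labels, current, next_id = {}, NOISE, 0
    for i, reach, core in order:
        if reach > eps_prime:
            if core is not None and core <= eps_prime:
                current = next_id  # a core sample starts a new cluster
                next_id += 1
                labels[i] = current
            else:
                labels[i] = NOISE
        else:
            labels[i] = current
    return labels

def cluster_partitions(partitions, eps, min_pts, eps_prime):
    """Cluster each partition, then offset local ids so they are globally unique.
    Assumption: a placeholder for the merge step; cross-partition clusters are not joined."""
    result, offset = [], 0
    for part in partitions:
        local = extract_clusters(optics_order(part, eps, min_pts), eps_prime)
        ids = [local[i] for i in range(len(part))]
        result.append([NOISE if c == NOISE else c + offset for c in ids])
        offset += max(ids, default=NOISE) + 1
    return result

partitions = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10), (50, 50)]]
print(cluster_partitions(partitions, eps=2.0, min_pts=2, eps_prime=1.5))
```

On Spark, the body of the loop in cluster_partitions would be a natural candidate for a per-partition transformation such as mapPartitions on the RDD, with the ordering persisted before cluster extraction. Cutting the same persisted ordering at different eps_prime values is what lets a user inspect the clustering structure at different levels without rerunning OPTICS.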