Dask cluster-oriented dynamic data partitioning method based on local weighted linear regression

A locally weighted linear regression technique, applied in electrical digital data processing, special data processing applications, digital data information retrieval, etc. It addresses block size selection that is time-consuming, labor-intensive, and hard to adapt to changes in the data set, the parallel application, and the cluster environment. It avoids offline training and heavy dependence on manual experience, and improves processing efficiency.

Pending Publication Date: 2022-04-08
HUNAN UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0007] The technical problem to be solved by the present invention is to provide a Dask cluster-oriented dynamic data blocking method based on locally weighted linear regression.

Examples

Embodiment 1

[0049] In the dynamic data blocking method based on locally weighted linear regression for Dask clusters, the following notation is used:
  • X: the large-scale data set, containing N sub-datasets.
  • X_profiling: the set of sub-datasets used for block size optimization, containing n sub-datasets; its i-th sub-dataset is X_profiling.i.
  • X_rest: the set of remaining sub-datasets to be processed, containing m sub-datasets; its j-th sub-dataset is X_rest.j.
  • M_init: the initial block size.
  • M_profiling.i: the block size corresponding to the i-th sub-dataset in X_profiling.
  • M_rest.j: the block size corresponding to the j-th sub-dataset in X_rest.
  • T_profiling.i: the time taken to process the i-th sub-dataset in X_profiling.
  • T_rest.j: the time taken to process the j-th sub-dataset in X_rest.
  • T_total: the total time spent processing X.

[0050] The method includes the following steps:

[0051] S1. Divide the la...

Embodiment 2

[0063] This embodiment gives the specific algorithm for the method described in Embodiment 1. The specific steps are:

[0064]

[0065]

[0066] Here, "←" denotes assignment in the algorithm and is equivalent to "=".

Embodiment 3

[0068] Following the algorithm described in Embodiment 2, this embodiment performs static and dynamic calculations on a data set. The hardware environment used is:

[0069]

[0070] The software environment is:

[0071]

[0072] (1) Static calculation

[0073] Divide a data set of size 8.7 GB into n blocks, and then process the n blocks simultaneously with a parallel application.
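
The static scheme can be illustrated with a minimal sketch. It uses Python's standard thread pool as a stand-in for a Dask cluster, and the names split_into_blocks, process_block, and static_run are hypothetical helpers introduced here, not part of the patent or of Dask. The point is only that the block count n is fixed before any processing starts.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, n):
    """Split records into n contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(len(records), n)
    blocks, start = [], 0
    for i in range(n):
        # The first `extra` blocks receive one additional record.
        end = start + base + (1 if i < extra else 0)
        blocks.append(records[start:end])
        start = end
    return blocks


def process_block(block):
    # Placeholder workload (an assumption): sum the records in the block.
    return sum(block)


def static_run(records, n):
    """Process all n blocks at the same time; n never changes during the run."""
    blocks = split_into_blocks(records, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(process_block, blocks))


partial_sums = static_run(list(range(10)), 3)
```

Because n is chosen once, a poor choice cannot be corrected while the job runs, which is the limitation the dynamic calculation is meant to address.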

[0074] (2) Dynamic calculation

[0075] Divide a data set with a size of 8.7 GB into m sub-datasets, and split the m sub-datasets into X_profiling and X_rest in a ratio of 2:8. Each sub-dataset in X_profiling is divided into blocks, with the initial block size M_init, the block size variation d, and the Gaussian kernel parameter σ as parameters. The number of blocks for each sub-dataset in the remaining set X_rest is selected dynamically by the program. The m sub-datasets are then processed sequentially with a parallel application.
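
The 2:8 split can be sketched as follows. The function names are hypothetical, and the profiling schedule shown (block size starting at M_init and changing by d for each profiling sub-dataset) is an assumption made for illustration; the text names M_init and d but does not state here how they are combined.

```python
def split_profiling_rest(sub_datasets, profiling_ratio=0.2):
    """Split sub-datasets into (X_profiling, X_rest); 0.2 matches the 2:8 ratio."""
    k = max(1, round(len(sub_datasets) * profiling_ratio))
    return sub_datasets[:k], sub_datasets[k:]


def profiling_block_sizes(m_init, d, count):
    """Assumed schedule: the i-th profiling sub-dataset uses M_init + i * d."""
    return [m_init + i * d for i in range(count)]


# Ten hypothetical sub-dataset names.
sub_datasets = [f"part-{i}" for i in range(10)]
x_profiling, x_rest = split_profiling_rest(sub_datasets)
sizes = profiling_block_sizes(m_init=64, d=16, count=len(x_profiling))
```

With ten sub-datasets this yields two profiling sub-datasets and eight remaining ones, so only a small share of the data is processed with block sizes that were not chosen by the regression.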

[0076] The parameters of th...

Abstract

The invention discloses a Dask cluster-oriented dynamic data blocking method based on locally weighted linear regression. The method comprises the following steps: a large-scale data set to be processed is divided into a set of sub-datasets used for block size optimization and a set of remaining sub-datasets to be processed; each sub-dataset used for block size optimization is processed with its corresponding block size; and a locally weighted linear regression algorithm is used to estimate online, accurately and dynamically, the block size for each remaining sub-dataset from the block sizes and processing times of the sub-datasets already processed. The method removes the heavy dependence on manual experience and the time-consuming, labor-intensive offline training, adapts better to changes in the data set, the parallel application, and the cluster environment, and improves to a certain extent the efficiency of processing large-scale data sets in a Dask cluster.
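
A minimal sketch of how locally weighted linear regression could turn observed (block size, time) pairs into a block size estimate is given here. It is an illustration under stated assumptions, not the patented algorithm: it assumes that processing time is modeled as a function of block size, that a Gaussian kernel with parameter sigma weights the observations, and that the next block size is the candidate with the lowest predicted time. The function names and the sample numbers are hypothetical.

```python
import math


def lwlr_predict(x_query, xs, ys, sigma):
    """Predict y at x_query with 1-D locally weighted linear regression."""
    # Gaussian kernel: observations near x_query get larger weights.
    w = [math.exp(-((x - x_query) ** 2) / (2.0 * sigma ** 2)) for x in xs]
    sw = sum(w)
    # Weighted means of x and y.
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    # Closed-form weighted least squares for a straight line.
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    if sxx == 0.0:
        return my  # every observation has the same x: use the weighted mean
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    return my + (sxy / sxx) * (x_query - mx)


def choose_block_size(candidates, block_sizes, times, sigma):
    """Assumed selection rule: pick the candidate with the lowest predicted time."""
    return min(candidates, key=lambda m: lwlr_predict(m, block_sizes, times, sigma))


# Hypothetical observations: block size in MB and processing time in seconds.
block_sizes = [32, 64, 96, 128, 160]
times = [50.0, 38.0, 34.0, 37.0, 45.0]
best = choose_block_size(range(32, 161, 16), block_sizes, times, sigma=32.0)
```

Each time a further sub-dataset is processed, its block size and time can be appended to the two lists, so the next estimate reflects the most recent behavior of the data, the application, and the cluster without any offline training.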

Description

Technical field

[0001] The present invention relates to the technical field of performance optimization for parallel processing of big data, and more specifically to a Dask cluster-oriented dynamic data blocking method based on locally weighted linear regression.

Background technique

[0002] When a parallel application is executed in a Dask cluster to process a large-scale data set, the data set needs to be divided into blocks. In CN201410836567.2, "big data parallel computing method and device", the data set is divided into blocks according to the size of the data set, the size of the cluster memory, and the degree of parallelism, giving a block data set composed of multiple data blocks. The block data set is used as the training data set of the logistic regression classification algorithm, and the optimal weight vector of the logistic regression function is solved to obtain the logistic regression classifier. The embodiment of the present invention di...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F9/50, G06F16/2453
Inventor: 万烂军, 张根, 赵昊鑫, 李长云, 王志兵, 张潇云
Owner: HUNAN UNIV OF TECH