Column calculation optimization method based on Spark SQL

An optimization method and execution-plan technology applied to Spark SQL-based column calculation, with the effect of reducing overhead, accelerating computation, and shortening calculation time.

Pending Publication Date: 2022-03-04
西安烽火软件科技有限公司

AI Technical Summary

Problems solved by technology

CPU optimization and GPU optimization capabilities cannot be used at the same time within a single business workload.




Embodiment

[0041] Embodiment: A column calculation optimization method based on Spark SQL, comprising the following steps:

[0042] S1, unified memory management: provide an Arrow-based unified data management mechanism. After file data is loaded from disk into the in-memory Arrow structure, it can be accessed and computed on by multiple plug-ins; computations such as Shuffle are also implemented on Arrow. Arrow serves as the carrier of RDD memory, so that in-memory data is shared across multiple tasks and multiple plug-ins;

[0043] S2, unified scheduling of heterogeneous computing resources: extend the optimizer and plug-ins in Spark SQL to implement a data-based heterogeneous resource scheduling mechanism, comprising the steps:

[0044] S2-1, scheduling optimization based on field data characteristics: numerical calculations are scheduled preferentially to the CPU, while feature-vector data and long-string computations are scheduled to the GPU;

[0045] S2-2, scheduling optimization combined with calculation characteristics: tasks with a large amount...
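The field-based preference in S2-1 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the `Field` class, the `choose_device` function, and the `LONG_STRING_THRESHOLD` value are hypothetical names and numbers chosen for illustration, and sending short strings to the CPU is an assumption the text does not state.

```python
from dataclasses import dataclass

# Hypothetical cutoff: strings at or above this average length count as "long".
LONG_STRING_THRESHOLD = 256


@dataclass(frozen=True)
class Field:
    name: str
    kind: str            # "numeric", "string", or "vector"
    avg_length: int = 0  # average length in characters; only used for strings


def choose_device(field: Field) -> str:
    """Return "CPU" or "GPU" for a field, following the S2-1 preference."""
    if field.kind == "vector":
        return "GPU"  # feature-vector data goes to the GPU
    if field.kind == "string" and field.avg_length >= LONG_STRING_THRESHOLD:
        return "GPU"  # long-string computation goes to the GPU
    return "CPU"      # numeric data (and, by assumption, short strings)


schema = [
    Field("price", "numeric"),
    Field("embedding", "vector"),
    Field("title", "string", avg_length=40),
    Field("document_body", "string", avg_length=4096),
]
placement = {f.name: choose_device(f) for f in schema}
print(placement)
```

A real scheduler would presumably derive the field kind and length from table statistics instead of hard-coded values; the sketch only shows the shape of the decision.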


Abstract

The invention discloses a column calculation optimization method based on Spark SQL (Structured Query Language), comprising the following steps: S1, unified memory management: establish an Arrow-based unified data management mechanism, so that after file data is loaded from disk into the in-memory Arrow structure it can be accessed and computed on by multiple plug-ins; S2, unified scheduling of heterogeneous computing resources: extend the optimizer and plug-ins in Spark SQL to implement a heterogeneous resource scheduling mechanism based on data features; and S3, perform rule matching on the logical execution plan of Spark SQL and generate a physical execution plan that mixes CPU and GPU operators on top of the unified Arrow memory structure. With this method, the Arrow columnar format compresses memory usage and avoids the GC overhead of JVM in-memory computation, improving computational efficiency; Spark SQL operators are arranged into an optimal mixed execution plan according to a cost-based optimization method, reducing overall computation time.
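Step S3 can be modeled, very roughly, as walking a logical plan and applying the first matching placement rule to each operator. The sketch below is plain Python and does not use Spark's Catalyst API. `LogicalOp`, `PhysicalOp`, `RULES`, and `to_physical` are hypothetical names, the rules shown are illustrative assumptions, and the cost-based selection mentioned in the abstract is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class LogicalOp:
    name: str        # e.g. "Scan", "Filter", "Project", "Aggregate"
    input_kind: str  # dominant data kind: "numeric", "long_string", "vector"


@dataclass(frozen=True)
class PhysicalOp:
    name: str
    device: str      # "CPU" or "GPU"


# A rule is a (predicate, device) pair; the first matching rule wins.
Rule = Tuple[Callable[[LogicalOp], bool], str]

# Illustrative rules only; a real rule set would be far richer.
RULES: List[Rule] = [
    (lambda op: op.input_kind == "vector", "GPU"),
    (lambda op: op.input_kind == "long_string", "GPU"),
]
DEFAULT_DEVICE = "CPU"


def to_physical(plan: List[LogicalOp]) -> List[PhysicalOp]:
    """Map each logical operator to a physical operator tagged with a device."""
    physical = []
    for op in plan:
        device = next((dev for pred, dev in RULES if pred(op)), DEFAULT_DEVICE)
        physical.append(PhysicalOp(op.name, device))
    return physical


logical_plan = [
    LogicalOp("Scan", "numeric"),
    LogicalOp("Filter", "numeric"),
    LogicalOp("Project", "vector"),
    LogicalOp("Aggregate", "numeric"),
]
physical_plan = to_physical(logical_plan)
for op in physical_plan:
    print(f"{op.name} -> {op.device}")
```

Because every operator reads the same Arrow-format memory in the described design, the sketch inserts no explicit copy or format-conversion step between a CPU operator and a GPU operator; whether the actual method needs one is not stated in the text.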

Description

Technical field

[0001] The present invention relates to the technical field of cluster computing systems, and specifically to a column calculation optimization method based on Spark SQL.

Background technique

[0002] Apache Spark is a fast, general-purpose cluster computing system. It provides high-level APIs for Java, Scala, Python, and R, and supports a rich set of higher-level tools, including the Spark SQL module for SQL and structured data processing, as shown in the architecture diagram in Figure 1.

[0003] Spark SQL is the SQL engine built on the open-source parallel computing framework Spark; it provides SQL-based data query and analysis in big-data environments.

[0004] a) Spark SQL is built on RDDs, whose data processing can express the SQL data model; Spark SQL provides row-based data processing and analysis capabilities;

[0005] b) Spark SQL relies on Spark's parallel computing model for scheduling and processing; Spark is scheduled to perform ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/242, G06F16/245, G06F9/48, G06F9/50
CPC: G06F16/2433, G06F16/24569, G06F9/4881, G06F9/5016, G06F9/505
Inventor: 李华蓉, 赵智峰, 李岩, 苏锋, 陈芒芒
Owner: 西安烽火软件科技有限公司