Spark-based data processing method and system

A technology for data processing and connection relationships, applied in database management systems, electronic digital data processing, special data processing applications, and similar fields. It addresses the lack of versatility and low efficiency of existing big data preprocessing systems, and their inability to adapt to flexible application scenarios.

Status: Inactive
Publication Date: 2017-12-12
INST OF INFORMATION ENG CAS
5 Cites, 23 Cited by
AI Technical Summary

Problems solved by technology

The invention addresses the low efficiency and lack of versatility of existing big data preprocessing systems.
[0004] Most existing systems of this kind are not general-purpose. Users can only use the functional operators provided by the system and cannot customize operators to their own needs, so these systems cannot be applied to some flexible application scenarios.

Examples

Embodiment Construction

[0071] The present invention will be described in further detail below in conjunction with specific examples, but the scope of the present invention is not limited in any way.

[0072] Two student table files are processed. Table 1 contains id, name, and Chinese score; Table 2 contains id, name, and math score. The required result is: add a "grade" column to the Table 1 file with the value "second grade", add 3 points to the math score of every student in the Table 2 file, and finally merge the two tables into one.

[0073] The user drags and drops the following operators onto the interface: two adaptation input operators, one operator that implements "add column", one operator that implements "add points", one operator that implements "merge two tables", and one adaptation output operator.

[0074] The user configures the parameters of related operators on the interface: "Add column" operator: the added column is "grade" and the content ...
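
The processing described in [0072] can be illustrated with a minimal sketch in plain Python. It is not the patented system and does not use Spark; it only shows what the three operators are expected to do to the data. The sample rows, the function names, and the choice to merge on both id and name are assumptions made for illustration.

```python
# Sample rows are invented for illustration; only the column layout comes from the text.
table1 = [
    {"id": 1, "name": "Student A", "chinese": 85},
    {"id": 2, "name": "Student B", "chinese": 90},
]
table2 = [
    {"id": 1, "name": "Student A", "math": 77},
    {"id": 2, "name": "Student B", "math": 92},
]


def add_column(rows, column, value):
    """'Add column' operator: append a constant-valued column to every row."""
    return [{**row, column: value} for row in rows]


def add_points(rows, column, points):
    """'Add points' operator: increase a numeric column by a fixed amount."""
    return [{**row, column: row[column] + points} for row in rows]


def merge_tables(left, right, keys=("id", "name")):
    """'Merge two tables' operator: inner join on the shared key columns (assumed)."""
    index = {tuple(row[k] for k in keys): row for row in right}
    merged = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            merged.append({**row, **match})
    return merged


with_grade = add_column(table1, "grade", "second grade")
with_bonus = add_points(table2, "math", 3)
result = merge_tables(with_grade, with_bonus)
```

Each function returns new rows instead of modifying its input, which mirrors how chained operators pass results downstream without altering the source files.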

Abstract

The invention discloses a Spark-based data processing method and system. The method comprises the following steps: 1) the user selects operators according to the requirements of the document to be processed, configures the parameters of the selected operators, and establishes the connection relationships between them, generating an XML file of the scene that contains the XML content of each selected operator and the connection relationships between the operators; 2) a corresponding directed acyclic graph (DAG) is generated from the XML file of the scene; and 3) the DAG is segmented into a plurality of subtasks (subJob) that can be executed in a distributed computing environment, and the segmented subtasks are executed under the Spark computing framework to process the document. The method and system can interface with many kinds of data, which improves the flexibility of data processing.
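
Steps 1) and 2) of the abstract can be sketched as follows, using only the Python standard library. The XML layout (the `scene`, `operator`, and `connection` elements and their attributes) is a hypothetical schema, since the actual file format is not given here. The rule for segmenting the DAG into subJob units is also not given, so the sketch stops at a valid execution order.

```python
import xml.etree.ElementTree as ET
from graphlib import TopologicalSorter

# Hypothetical scene file: element and attribute names are assumptions.
SCENE_XML = """
<scene name="student-scores">
  <operator id="in1" type="input"/>
  <operator id="in2" type="input"/>
  <operator id="addcol" type="add_column"/>
  <operator id="addpts" type="add_points"/>
  <operator id="merge" type="merge"/>
  <operator id="out" type="output"/>
  <connection from="in1" to="addcol"/>
  <connection from="in2" to="addpts"/>
  <connection from="addcol" to="merge"/>
  <connection from="addpts" to="merge"/>
  <connection from="merge" to="out"/>
</scene>
"""


def build_dag(xml_text):
    """Return {operator id: set of upstream operator ids} for a scene."""
    root = ET.fromstring(xml_text.strip())
    dag = {op.get("id"): set() for op in root.iter("operator")}
    for edge in root.iter("connection"):
        src, dst = edge.get("from"), edge.get("to")
        if src not in dag or dst not in dag:
            raise ValueError(f"connection references unknown operator: {src} -> {dst}")
        dag[dst].add(src)
    return dag


def execution_order(dag):
    """Topologically sort the operators; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dag).static_order())


dag = build_dag(SCENE_XML)
order = execution_order(dag)
```

Because `TopologicalSorter` rejects cyclic input, a scene whose connections form a loop fails early instead of producing an invalid plan.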

Description

Technical Field

[0001] The invention relates to a Spark-based data processing method and system, belonging to the technical field of computer software.

Background

[0002] Most existing big data preprocessing systems are developed on Hadoop. Hadoop stores intermediate processing results in the HDFS file system, which causes a lot of additional overhead. Spark instead uses the concept of the RDD, which allows data to be kept in memory transparently and greatly reduces disk reads and writes during data processing. In addition, some big data preprocessing systems are developed on Spark, but they are not universal.

[0003] The system of the present invention is characterized in that it provides a large number of operator interfaces, and users can customize scenes to realize corresponding processing of specific files; users can customize operators according to their own needs; this system is a further layer of encapsulation on top of Spark, and u...
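
The idea of an operator interface that users can extend with their own operators might look like the sketch below. It is plain Python, not Spark, and does not reproduce the system's actual interface: the `Operator` base class, the `register` decorator, the `run_chain` helper, and both example operators are hypothetical names chosen for illustration.

```python
from abc import ABC, abstractmethod


class Operator(ABC):
    """Hypothetical operator interface: takes rows in, returns rows out."""

    def __init__(self, **params):
        self.params = params

    @abstractmethod
    def apply(self, rows):
        """Transform a list of row dictionaries and return the new list."""


# Maps an operator name (as a scene would reference it) to its class.
REGISTRY = {}


def register(name):
    """Class decorator that makes an operator available under the given name."""
    def decorator(cls):
        REGISTRY[name] = cls
        return cls
    return decorator


@register("add_points")
class AddPoints(Operator):
    """Stand-in for a system-provided operator."""

    def apply(self, rows):
        column, points = self.params["column"], self.params["points"]
        return [{**row, column: row[column] + points} for row in rows]


@register("uppercase_name")
class UppercaseName(Operator):
    """Stand-in for a user-defined operator added without touching the core."""

    def apply(self, rows):
        return [{**row, "name": row["name"].upper()} for row in rows]


def run_chain(rows, steps):
    """Apply the named operators in sequence, each with its own parameters."""
    for name, params in steps:
        rows = REGISTRY[name](**params).apply(rows)
    return rows


rows = [{"id": 1, "name": "Student A", "math": 77}]
processed = run_chain(
    rows,
    [("add_points", {"column": "math", "points": 3}), ("uppercase_name", {})],
)
```

Looking operators up by name keeps the scene description independent of the code that implements each operator, so a custom operator only has to register itself to become usable.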

Application Information

IPC(8): G06F17/30
CPC: G06F16/27, G06F16/21, G06F16/25
Inventor: 木伟民, 张云, 李名扬, 张明诚, 王伟平
Owner: INST OF INFORMATION ENG CAS