Spark-based Cassandra data import method and device, equipment and medium
A data import technology, applied in the field of data processing, that distributes data evenly, reduces the number of small files, and prevents imbalance
Examples
Embodiment 1
[0050] Embodiment 1 provides a Spark-based Cassandra data import method that prevents data imbalance and data skew by dividing the data to be imported evenly across the partitions, thereby reducing the probability of memory overflow.
[0051] Spark is a unified analytics engine for large-scale data processing. It provides a comprehensive, unified framework for big-data processing across data sets and data sources with different properties (text data, graph data, streaming data), whether batch or real-time.
[0052] Referring to Figure 1, a Spark-based Cassandra data import method includes the following steps:
[0053] S110. Obtain the data volume of the data to be imported and the size of a single SSTable file, and calculate the required number of partitions N from the data volume and the single-file size;
[0054] The SSTable in S110 is the basic storage unit of Cassandra. The size of a sing...
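The patent text truncates before giving the exact formula for N, but the natural reading of S110 is that each partition should produce roughly one SSTable of the target size. A minimal sketch under that assumption (the function name and the ceiling-division formula are illustrative, not quoted from the patent):

```python
def required_partitions(data_volume_bytes: int, sstable_file_bytes: int) -> int:
    """Assumed formula: N = ceil(total data volume / single-SSTable size),
    so each partition yields roughly one SSTable file of the target size."""
    if sstable_file_bytes <= 0:
        raise ValueError("SSTable file size must be positive")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return max(1, -(-data_volume_bytes // sstable_file_bytes))

# 100 GiB of data with a 1 GiB SSTable target -> N = 100 partitions
print(required_partitions(100 * 1024**3, 1 * 1024**3))  # -> 100
```

Sizing N this way is what keeps small files from proliferating: too many partitions would each flush an undersized SSTable.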
Embodiment 2
[0081] Embodiment 2 builds on Embodiment 1 and mainly improves the parallel processing procedure.
[0082] After Spark completes the partition-interval calculation and shuffle-partition sorting, that is, after steps S110-S130, directly importing data into Cassandra with CQLSSTableWriter + SSTableLoader makes the degree of parallelism and the traffic difficult to control, which degrades the performance of the Cassandra cluster.
[0083] Therefore, in this embodiment, after the SSTable file is generated in step S140 of Embodiment 1, a step is added that copies the SSTable file to a distributed file system and records the copy path, so that the number of parallel import tasks can be controlled.
[0084] The distributed file system selected in this embodiment is HDFS.
[0085] Referring to Figure 4, the procedure specifically includes the following steps:
[0086] S210. Calculate the parallel number M according to the number of Cassandra...
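Step S210 is truncated in the source, so the exact formula for M is unknown; the sentence suggests M is derived from the number of Cassandra nodes. A hedged sketch of that idea, with hypothetical function names and a hypothetical per-node stream budget, plus the wave-batching over recorded HDFS copy paths that the added step enables:

```python
def parallel_number(num_nodes: int, streams_per_node: int = 2) -> int:
    # Hypothetical formula: cap concurrent loader tasks at a small
    # per-node stream budget so import traffic on the cluster stays bounded.
    return max(1, num_nodes * streams_per_node)

def loader_waves(sstable_paths: list, m: int) -> list:
    # Split the recorded copy paths into waves of at most m files;
    # importing one wave at a time keeps at most m loaders running.
    return [sstable_paths[i:i + m] for i in range(0, len(sstable_paths), m)]
```

Recording the copy paths first is what makes this throttling possible: the driver can hand SSTableLoader one bounded wave of files at a time instead of launching every loader at once.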
Embodiment 3
[0096] Embodiment 3 discloses a device corresponding to the Spark-based Cassandra data import method of the above embodiments; it is the virtual device structure of those embodiments. Referring to Figure 5, it includes:
[0097] The partition calculation module 310, used to obtain the data volume and the size of a single SSTable file, and calculate the required number of partitions N from them;
[0098] The partition allocation module 320, used to read the data, calculate a token value from each datum's Key, distribute the data into the N partitions according to the token value, and sort the data within each partition by token value;
[0099] The file generation module 330, used to read the sorted data with CQLSSTableWriter and generate SSTable files;
[0100] The file import module 340, configured to import the SSTable file in...
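The token-based shuffle performed by the partition allocation module 320 can be sketched as follows. Cassandra's default Murmur3Partitioner maps each Key onto the signed 64-bit token ring; the sketch below uses MD5 as a stand-in hash (the patent does not specify the hash, and all function names here are illustrative), splits the ring into N equal intervals, and sorts each partition by token:

```python
import bisect
import hashlib
from collections import defaultdict

# Cassandra's Murmur3Partitioner token ring spans the signed 64-bit range.
TOKEN_MIN, TOKEN_MAX = -(2**63), 2**63 - 1

def token_of(key: bytes) -> int:
    # Stand-in for Murmur3: any uniform 64-bit hash mapped into the
    # signed token range serves this sketch.
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big", signed=True)

def partition_bounds(n: int) -> list:
    # Split the token ring into n roughly equal intervals (n - 1 boundaries).
    width = (TOKEN_MAX - TOKEN_MIN) // n
    return [TOKEN_MIN + width * i for i in range(1, n)]

def shuffle_and_sort(keys, n: int) -> dict:
    # Group rows by the partition that owns their token, then sort each
    # partition by token so the SSTable writer receives rows in token order.
    bounds = partition_bounds(n)
    parts = defaultdict(list)
    for k in keys:
        parts[bisect.bisect_right(bounds, token_of(k))].append(k)
    return {p: sorted(ks, key=token_of) for p, ks in parts.items()}
```

Because the intervals cover the ring evenly and the hash is uniform, each partition receives a comparable share of rows, which is the mechanism behind the claimed prevention of data imbalance and skew.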