Looking for breakthrough ideas for innovation challenges? Try Patsnap Eureka!

Method for creating ETL scripts from relational database to Hive

A database and relational technology, applied in the database field, can solve problems such as high time cost and low efficiency of ETL scripts, and achieve the effect of improving efficiency

Inactive Publication Date: 2018-02-16
CTRIP TRAVEL NETWORK TECH SHANGHAI0
View PDF4 Cites 9 Cited by
  • Summary
  • Abstract
  • Description
  • Claims
  • Application Information

AI Technical Summary

Problems solved by technology

[0003] The technical problem to be solved by the present invention is: in view of the defects of low efficiency and high time cost of manually writing corresponding ETL scripts when performing batch relational database to Hive ETL operations in the prior art, a kind of relational database to Hive is provided ETL script creation method

Method used

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine
View more

Image

Smart Image Click on the blue labels to locate them in the text.
Viewing Examples
Smart Image
  • Method for creating ETL scripts from relational database to Hive

Examples

Experimental program
Comparison scheme
Effect test

Embodiment 1

[0023] Relational database of the present invention to the ETL script method of Hive, as figure 1 shown, including the following steps:

[0024] S101. Identify feature classifications of data in a relational database.

[0025] The data of relational data tables can be divided into two types according to whether it is updated or not: one type is log type data that will not be updated after being inserted into the table (that is, the data in the table will not be deleted or updated); the other type is inserted into the table The business data that will be updated later (that is, the data in the table will be deleted or updated). For these two types of data, the methods of ETL to Hive are different. For log-type data, the daily ETL task is to migrate the data generated the previous day to the corresponding partition of Hive; for business-type data, the daily task is to migrate all the changed data from the previous day (that is, new data, data update data) to Hive, and merge a...

Embodiment 2

[0034] The database script creation method of the present embodiment, such as figure 1 shown, including the following steps:

[0035] S101. Identify feature classifications of data in a relational database.

[0036] The data write frequency of relational data tables can be divided into static data and dynamic data. Static data is characterized by slow data update iterations. For example, supplier data, once entered, is not updated frequently; while dynamic data will increase with the increase in business system traffic. This causes the Hive ETL methods corresponding to the two types of data to be different. Static data needs to always save the latest full amount of data in Hive, while dynamic data itself only needs to save the latest hot data.

[0037]According to the different characteristics of the above-mentioned dynamic data and static data, according to the statistical information of the dynamic management system tables of the relational database, the classification ch...

Embodiment 3

[0040] The relational database of the present invention is to the ETL script creation method of Hive such as figure 1 As shown, the difference between it and Embodiment 1 lies in the addition of the time field identification for incremental fetching.

[0041] After identifying the feature classification of the data table in the relational database, it is also necessary to specify the time range for each ETL operation. Therefore, identifying the time field of the incremental access of the data table is one of the necessary steps. Further, in S101, the time field of the relational database is obtained, and it is judged whether the time field is updated following the change of the data, and if so, whether the time field can use an index (in the index column or the joint index's first One column), if the above two principles are met, specify the above time field as the time field for incremental fetching, and generate the time range for fetching.

[0042] The remaining steps of t...

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine
Login to View More

PUM

No PUM Login to View More

Abstract

The invention discloses a method for creating ETL scripts from a relational database to Hive. The method comprises the following steps of: obtaining database metadata information; rapidly recognizinga feature class of each data table in the relational database; automatically creating an ETL script for each data table according to the class; and registering the generated ETL scripts in a scheduling platform through a simulated access technology or a service calling manner. According to the method, two dimensionality features, related to ETL, of the data tables are self-defined. Features of thedata tables can be rapidly classified, and when ETL operation from the relational database to the Hive is carried out in batches, a defined script template is utilized to efficiently and automatically create a corresponding ETL script for each data table and the ETL scripts are submitted to the scheduling platform, so that the data migration efficiency is greatly improved.

Description

technical field [0001] The invention belongs to the field of databases, in particular to a method for creating an ETL script from a relational database to Hive. Background technique [0002] With the advent of the information age and the rapid expansion of data scale, distributed big data storage and processing technologies have also emerged as the times require. As the backbone of building an open source enterprise data warehouse, Hive (a Hadoop-based database artifact) is one of the most important data sources for data workers because of its powerful storage and computing capabilities. The data landed in Hive generally comes from relational databases, NoSQL databases (generally referring to non-relational databases), message queues, etc. Among them, relational databases are ideal for application business data due to their transaction support characteristics and mature connection with application systems. best choice. Therefore, ETL (Extract, Transform, Load, Extract, Tra...

Claims

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine
Login to View More

Application Information

Patent Timeline
no application Login to View More
IPC IPC(8): G06F17/30
CPCG06F16/214G06F16/254G06F16/27
Inventor 余达明许鹏张振华
Owner CTRIP TRAVEL NETWORK TECH SHANGHAI0
Who we serve
  • R&D Engineer
  • R&D Manager
  • IP Professional
Why Patsnap Eureka
  • Industry Leading Data Capabilities
  • Powerful AI technology
  • Patent DNA Extraction
Social media
Patsnap Eureka Blog
Learn More
PatSnap group products