Method for quickly cleaning and importing data into Hive

A data cleaning and fast-import technique, applied in the software field, that addresses the inability to distinguish invalid data during loading, eliminating later inspection steps and improving data processing efficiency.

Pending Publication Date: 2022-03-22
四川启睿克科技有限公司

AI Technical Summary

Problems solved by technology

[0026] The invention solves the problem that invalid (erroneous) data cannot be distinguished while data is loaded into a Hive table, which forces a separate data-cleaning pass before subsequent calculations. The method cleans data during the loading process and automatically routes erroneous data into a partition dedicated to storing it, which simplifies the cleaning process and improves its efficiency.
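The idea of cleaning during the load, rather than in a later pass, can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `load_with_cleaning`, the partition labels `normal` and `error`, and the three-field rule are all hypothetical.

```python
def load_with_cleaning(lines, is_valid):
    """Single pass over raw rows: valid rows go to the normal bucket,
    invalid rows go to a bucket reserved for erroneous data."""
    partitions = {"normal": [], "error": []}
    for line in lines:
        target = "normal" if is_valid(line) else "error"
        partitions[target].append(line)
    return partitions


def has_three_fields(line):
    """Hypothetical verification rule, for illustration only:
    a row must contain exactly three "|"-separated fields."""
    return len(line.split("|")) == 3


result = load_with_cleaning(["a|b|c", "broken", "1|2|3"], has_three_fields)
```

Because every row is checked once as it is read, no second cleaning pass over the loaded table is needed in this sketch.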

Examples

Embodiment Construction

[0042] To make the purpose, technical solution, and advantages of the present invention clearer, the technical solution is described in detail below. The described embodiments are only some of the embodiments of the present invention, not all of them. All other implementations that a person of ordinary skill in the art obtains from these embodiments without creative effort fall within the protection scope of the present invention.

[0043] In any embodiment, as shown in Figure 4, the method of quickly cleaning and importing data into Hive according to the present invention first marks the data. Assume there is a piece of original data named Z whose format is like "2021-09-09 09:50:36|02| ams|30|A02|504980|504980|chiq1|imei|192.168.1.4|keyword=crazy bird 2773|kyw=2|recodnum=2". The whole record is separated by "|", and the interval between every two vertical bars represents one field, use...
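As a small illustration of the record layout described above, the sample row can be split on "|" into its fields. The helper name `split_fields` and the whitespace trimming are assumptions made for this sketch; the source does not say how stray spaces are handled.

```python
# Sample record copied from the description.
SAMPLE = ("2021-09-09 09:50:36|02| ams|30|A02|504980|504980|chiq1|imei|"
          "192.168.1.4|keyword=crazy bird 2773|kyw=2|recodnum=2")


def split_fields(line, sep="|"):
    """Split one raw row into fields, trimming whitespace around each field."""
    return [field.strip() for field in line.split(sep)]


fields = split_fields(SAMPLE)

# Fields of the form key=value can be collected into a dictionary.
key_values = dict(f.split("=", 1) for f in fields if "=" in f)
```

Here `fields` holds the thirteen values of the sample row in order, and `key_values` holds the three trailing `key=value` pairs.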

Abstract

The invention discloses a method for quickly cleaning and importing data into Hive, comprising the following steps. During data serialization and deserialization, the original data is read row by row and each row is checked against a data verification rule. For invalid rows, a new separator and the error mark 'EE' are appended to the data, and a row is later judged valid or invalid according to that new separator and the error mark 'EE'. Marking erroneous data as 'EE' in this way distinguishes correct data from erroneous data. The distinguished correct and erroneous data are then classified again: the original data is classified by date, and then by interface ID under each date, to form the cleaned data provided to external consumers. Because the original data is marked and cleaned quickly and automatically, the inspection step during subsequent data use is omitted and data processing efficiency is improved.
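The marking and classification steps in the abstract might look roughly like the sketch below. Several details are assumptions, not taken from the source: the appended separator is taken to be "|" (the separator of the sample record), the verification rule checks only the field count and the timestamp, and the interface ID is assumed to sit at field index 4. The sketch also assumes a valid row never ends in "|EE".

```python
from collections import defaultdict
from datetime import datetime

SEP = "|"               # assumption: same separator as the sample record
ERROR_MARK = "EE"       # error mark named in the abstract
EXPECTED_FIELDS = 13    # assumption: field count of the sample record
INTERFACE_ID_INDEX = 4  # assumption: position of the interface ID field


def is_valid(line):
    """Hypothetical verification rule: correct field count and a parseable timestamp."""
    fields = line.split(SEP)
    if len(fields) != EXPECTED_FIELDS:
        return False
    try:
        datetime.strptime(fields[0].strip(), "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return False
    return True


def mark(line):
    """Append the separator and error mark to an invalid row; return valid rows unchanged."""
    return line if is_valid(line) else line + SEP + ERROR_MARK


def classify(marked_lines):
    """Group marked rows by status, then by date, then by interface ID."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for line in marked_lines:
        status = "error" if line.endswith(SEP + ERROR_MARK) else "valid"
        fields = line.split(SEP)
        date = fields[0].strip()[:10]
        if len(fields) > INTERFACE_ID_INDEX:
            interface_id = fields[INTERFACE_ID_INDEX].strip()
        else:
            interface_id = "unknown"  # row too short to contain the assumed field
        tree[status][date][interface_id].append(line)
    return tree


good = ("2021-09-09 09:50:36|02|ams|30|A02|504980|504980|chiq1|imei|"
        "192.168.1.4|keyword=crazy bird 2773|kyw=2|recodnum=2")
bad = "2021-09-09 09:50:36|02|ams"
tree = classify([mark(good), mark(bad)])
```

In this example the complete row lands under `tree["valid"]["2021-09-09"]["A02"]`, while the truncated row is marked and lands under `tree["error"]["2021-09-09"]["unknown"]`.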

Description

Technical field

[0001] The invention relates to the field of software technology, in particular to a method for quickly cleaning and importing data into Hive.

Background art

[0002] Hive is a distributed, fault-tolerant data warehouse system that enables large-scale analysis. A data warehouse centrally stores information that can be easily analyzed to make informed, data-driven decisions. Hive lets users read, write, and manage petabytes of data using SQL.

[0003] Hive is built on top of Apache Hadoop, an open source framework for efficiently storing and processing large data sets. Hive is therefore tightly integrated with Hadoop and is designed to operate on petabytes of data quickly. What sets Hive apart is that it uses Apache Tez or MapReduce to query large datasets through a SQL-like interface.

[0004] The existing Hive architecture, as shown in Figure 1, is as follows:

[0005] 1. The underlying storage of Hive

[0006] Hive data is stored o...

Application Information

IPC(8): G06F16/215, G06F16/22, G06F16/2458, G06F16/27, G06F16/28
CPC: G06F16/215, G06F16/2282, G06F16/2471, G06F16/27, G06F16/284
Inventor: 任治州
Owner: 四川启睿克科技有限公司