
HDFS (Hadoop Distributed File System)-based small file combination tool and method

A small file merging technology, applied in the field of HDFS-based small file merging tools, that addresses problems such as reduced file processing efficiency.

Status: Inactive
Publication Date: 2016-06-08
Owner: INSPUR QILU SOFTWARE IND
Cites: 5, Cited by: 17

AI Technical Summary

Problems solved by technology

The NameNode keeps the file system's metadata in memory. Although each small file is much smaller than a block, it still occupies a block of its own, and each block takes about 150 bytes of NameNode memory. With 10,000,000 small files, the NameNode therefore needs about 2G of memory; with 100 million files, it needs about 20G. In addition, when accessing small files, HDFS has to jump continually from one small file to the next. As a result, file processing efficiency decreases as the number of small files stored in HDFS grows.
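The arithmetic behind these figures can be sketched as follows. This is only a rough illustration, assuming the roughly 150 bytes per block quoted above; the function name and the `objects_per_file` parameter are hypothetical, introduced to show how the total changes if the file entry itself is counted alongside its block. The text's figures of about 2G and 20G are rounded, so the sketch mainly shows that the estimate grows linearly with the number of files.

```python
# Approximate per-object cost quoted in the text (an estimate, not a measured value).
BYTES_PER_OBJECT = 150


def namenode_memory_bytes(num_small_files, objects_per_file=1,
                          bytes_per_object=BYTES_PER_OBJECT):
    """Estimate NameNode memory used by metadata for small files.

    objects_per_file is an assumption: 1 counts only the block entry,
    2 also counts the file entry itself.
    """
    return num_small_files * objects_per_file * bytes_per_object


def to_gib(num_bytes):
    """Convert a byte count to GiB for display."""
    return num_bytes / 2**30


if __name__ == "__main__":
    for files in (10_000_000, 100_000_000):
        for objects in (1, 2):
            estimate = namenode_memory_bytes(files, objects)
            print(f"{files:>11,} files, {objects} object(s) per file: "
                  f"{to_gib(estimate):.2f} GiB")
```

Whichever counting assumption is used, ten times as many files means ten times as much metadata, which is the scaling problem that merging small files is meant to relieve.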



Embodiment Construction

[0041] The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on these embodiments, without creative effort, fall within the protection scope of the present invention.

[0042] As shown in Figure 1, an embodiment of the present invention provides an HDFS-based small file merging tool, applied to a server in the cluster that can exchange information with each node contained in HDFS, including:

[0043] A setting unit 101, configured to set small file merging rules;

[0044] A directory generation unit 102, configured to determine directory parameters and generate an input/output directory according ...


Abstract

The invention provides an HDFS (Hadoop Distributed File System)-based small file merging tool and method, applied to a server in a cluster that can exchange information with each node contained in HDFS. The tool comprises a setting unit, a directory generation unit, a transmission unit, and a reading and merging unit. The setting unit sets a small file merging rule. The directory generation unit determines directory parameters and generates an input/output directory according to those parameters. The transmission unit sets a small file threshold, identifies the small files on each external node according to that threshold, and stores the identified small files in the input directory generated by the directory generation unit. The reading and merging unit traverses the input directory, reads at least two small files that conform to the merging rule set by the setting unit, merges them into at least one data file, and stores that data file in the output directory generated by the directory generation unit. File processing efficiency can thereby be effectively improved.
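The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it uses local directories in place of HDFS and external nodes, represents the merging rule as a hypothetical predicate function, and merges by plain byte concatenation, since the source does not specify the rule format or the layout of the merged data file. All function and parameter names are invented for the example.

```python
import shutil
import tempfile
from pathlib import Path


def stage_small_files(node_dirs, input_dir, threshold_bytes):
    """Transmission step: copy every file smaller than the threshold into input_dir."""
    input_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for index, node_dir in enumerate(node_dirs):
        for path in sorted(Path(node_dir).iterdir()):
            if path.is_file() and path.stat().st_size < threshold_bytes:
                # Prefix with the node index so equal names from different nodes do not collide.
                target = input_dir / f"node{index}_{path.name}"
                shutil.copyfile(path, target)
                staged.append(target)
    return staged


def merge_small_files(input_dir, output_dir, rule, files_per_output):
    """Reading and merging step: pack files matching `rule` into data files."""
    output_dir.mkdir(parents=True, exist_ok=True)
    matching = [p for p in sorted(input_dir.iterdir()) if p.is_file() and rule(p)]
    if len(matching) < 2:  # the abstract merges at least two small files
        return []
    outputs = []
    for start in range(0, len(matching), files_per_output):
        out_path = output_dir / f"merged-{len(outputs):05d}.dat"
        with open(out_path, "wb") as out:
            # Plain concatenation (assumption): original file boundaries are not recorded.
            for path in matching[start:start + files_per_output]:
                out.write(path.read_bytes())
        outputs.append(out_path)
    return outputs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        node = root / "node_a"  # stands in for one external node
        node.mkdir()
        (node / "a.log").write_bytes(b"alpha\n")
        (node / "b.log").write_bytes(b"beta\n")
        (node / "big.log").write_bytes(b"x" * 4096)  # above the threshold, not staged
        stage_small_files([node], root / "in", threshold_bytes=1024)
        merged = merge_small_files(root / "in", root / "out",
                                   rule=lambda p: p.suffix == ".log",
                                   files_per_output=100)
        print([p.name for p in merged])
```

Because plain concatenation discards file boundaries, a real tool would need a container format or an index to recover individual files; that choice is left open here because the source does not describe it.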

Description

Technical field

[0001] The invention relates to the technical field of computer applications, in particular to an HDFS-based small file merging tool and method.

Background

[0002] The Hadoop Distributed File System (HDFS) consists of a NameNode and several DataNodes. It is an important part of the cluster and has been widely used in large-scale computing because of its reliability, efficiency, and scalability. The NameNode keeps the file system's metadata in memory. Although each small file is much smaller than a block, it still occupies a block of its own, and each block takes about 150 bytes of NameNode memory. With 10,000,000 small files, the NameNode therefore needs about 2G of memory; with 100 million files, it needs about 20G. In addition, when accessing small files, HDFS has to jump continually from one small file to the next. Then, with the number of small...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/11, G06F16/182
Inventors: 杨胜华, 王传超, 崔乐
Owner: INSPUR QILU SOFTWARE IND