PDF (Portable Document Format) file information extraction method and device and computer equipment

The technology relates to file information extraction and is applied in the field of computer equipment and PDF file information extraction. It addresses the low efficiency and long running time of existing extraction methods and achieves fast, efficient extraction.

Pending Publication Date: 2021-09-03
湖南四方天箭信息科技有限公司

AI Technical Summary

Problems solved by technology

[0005] 1. Because every intersection point between the lines must be found, the lines have to be matched pairwise, which complicates the program structure and reduces extraction speed and efficiency. For PDF files that contain many complex nested structures in particular, pairwise matching between lines takes a long time and greatly reduces extraction efficiency.
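
To make the cost described in [0005] concrete, here is a minimal sketch of pairwise intersection matching. It is not the code of any particular tool: the `Segment` type and the function names are hypothetical, and lines are assumed to be axis-aligned. With n lines, the loop performs n(n-1)/2 comparisons.

```python
from itertools import combinations
from typing import NamedTuple, Optional, Set, Tuple


class Segment(NamedTuple):
    """Hypothetical axis-aligned line segment from (x0, y0) to (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float


def intersect(a: Segment, b: Segment) -> Optional[Tuple[float, float]]:
    """Return the crossing point of a horizontal/vertical pair, else None."""
    for h, v in ((a, b), (b, a)):
        if h.y0 == h.y1 and v.x0 == v.x1:  # h is horizontal, v is vertical
            in_x = min(h.x0, h.x1) <= v.x0 <= max(h.x0, h.x1)
            in_y = min(v.y0, v.y1) <= h.y0 <= max(v.y0, v.y1)
            if in_x and in_y:
                return (v.x0, h.y0)
    return None


def pairwise_intersections(segments) -> Tuple[Set[Tuple[float, float]], int]:
    """Compare every pair of segments; return the points and the pair count."""
    points = set()
    comparisons = 0
    for a, b in combinations(segments, 2):
        comparisons += 1
        p = intersect(a, b)
        if p is not None:
            points.add(p)
    return points, comparisons
```

The number of comparisons grows quadratically with the number of lines, which is the inefficiency the text attributes to the bottom-up approach on pages with many nested structures.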
[0006] 2. Because the method works bottom-up, from micro to macro (it first finds the cells and then merges them into a table), it cannot recover the logical structure of the table itself. It is therefore only suitable for simple tables. For complex nested tables, the hierarchical nesting relationship cannot be obtained, so the complete logical structure of such tables cannot be extracted, which hinders their subsequent analysis.
[0007] In summary, the traditional bottom-up method for extracting tables from PDF files is slow and inefficient, yields an unclear logical structure, cannot recover the hierarchical relationship of nested tables, and is unsuitable for complex nested tables. There is therefore an urgent need for a method of extracting tables from PDF files that handles complex nested tables, improves extraction efficiency, and preserves the internal logical relationships of complex tables so that their complete logical structure can be obtained.
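
As an illustration of what "retaining the hierarchical relationship" could mean in practice, the following sketch models a nested table as a tree. The `Table` and `Cell` classes and the `nesting_depth` helper are hypothetical names chosen for this example; the source does not specify a data structure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (left, top, right, bottom)


@dataclass
class Cell:
    """A cell holds its own text plus any tables nested inside it."""
    bbox: BBox
    text: str = ""
    tables: List["Table"] = field(default_factory=list)


@dataclass
class Table:
    """A table is a bounding box and a list of cells."""
    bbox: BBox
    cells: List[Cell] = field(default_factory=list)


def nesting_depth(table: Table) -> int:
    """Depth 1 means the table contains no nested tables."""
    child_depths = [nesting_depth(t) for c in table.cells for t in c.tables]
    return 1 + max(child_depths, default=0)
```

A flat list of cells, as a bottom-up method would produce, has no place to record that one table sits inside a cell of another; a tree of this kind does.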


Examples


Embodiment Construction

[0048] The present invention is further described below with reference to the accompanying drawings and specific embodiments, which do not limit the scope of the invention.

[0049] As shown in Figure 1, the PDF file information extraction method of this embodiment includes:

[0050] Step S1: Obtain the PDF file to be extracted, and extract the characters and lines in the PDF file;

[0051] Step S2: Extract the simplest tables in the PDF file according to the coordinate positions of the extracted lines and the positional relationships between the lines, where a simplest table is the outermost table in which all lines are connected pairwise;

[0052] Step S3: Determine the cells of each simplest table according to its table lines, and recursively extract the simplest tables nested in each cell;

[0053] Step S4: Extract the table characters of each simplest table from the extracted characters based on the coordinate positions of the table lines of each o...
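
Steps S2 and S3 can be sketched as follows. This is one possible reading, not the patented implementation: it assumes axis-aligned lines, treats each group of mutually connected lines as one candidate table, derives cells from a full grid (no merged cells), and recurses into each cell. All names (`components`, `extract_tables`, and so on) are hypothetical, and the naive pairwise touch test is used only for brevity; the source does not say how connectivity is determined.

```python
from typing import Dict, List, Tuple

Line = Tuple[float, float, float, float]  # (x0, y0, x1, y1), axis-aligned
BBox = Tuple[float, float, float, float]  # (left, top, right, bottom)


def bbox_of(lines: List[Line]) -> BBox:
    xs = [x for l in lines for x in (l[0], l[2])]
    ys = [y for l in lines for y in (l[1], l[3])]
    return (min(xs), min(ys), max(xs), max(ys))


def touches(a: Line, b: Line) -> bool:
    """Axis-aligned lines share a point iff their bounding boxes overlap."""
    al, at, ar, ab = bbox_of([a])
    bl, bt, br, bb = bbox_of([b])
    return al <= br and bl <= ar and at <= bb and bt <= ab


def components(lines: List[Line]) -> List[List[Line]]:
    """Group lines into connected sets with a small union-find."""
    parent = list(range(len(lines)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            if touches(lines[i], lines[j]):
                parent[find(i)] = find(j)
    groups: Dict[int, List[Line]] = {}
    for i, line in enumerate(lines):
        groups.setdefault(find(i), []).append(line)
    return list(groups.values())


def inside(inner: BBox, outer: BBox) -> bool:
    """Strict containment of one box in another."""
    return (outer[0] < inner[0] and outer[1] < inner[1]
            and inner[2] < outer[2] and inner[3] < outer[3])


def extract_tables(lines: List[Line]) -> List[dict]:
    """Return the outermost tables; nested tables are found by recursion."""
    groups = components(lines)
    boxes = [bbox_of(g) for g in groups]
    tables = []
    for group, box in zip(groups, boxes):
        if any(inside(box, other) for other in boxes):
            continue  # lies inside another group: handled when recursing
        xs = sorted({l[0] for l in group if l[0] == l[2]})  # vertical lines
        ys = sorted({l[1] for l in group if l[1] == l[3]})  # horizontal lines
        if len(xs) < 2 or len(ys) < 2:
            continue  # not enough lines to enclose a cell
        cells = []
        for top, bottom in zip(ys, ys[1:]):
            for left, right in zip(xs, xs[1:]):
                cell_box = (left, top, right, bottom)
                inner = [l for l in lines if inside(bbox_of([l]), cell_box)]
                cells.append({"bbox": cell_box,
                              "tables": extract_tables(inner)})
        tables.append({"bbox": box, "cells": cells})
    return tables
```

Because each recursive call only receives lines that lie strictly inside a cell, the table's own border lines are never passed down, so the recursion terminates and the result mirrors the nesting of the tables on the page.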


Abstract

The invention discloses a PDF file information extraction method, device, and computer equipment. The method comprises the following steps: obtaining the PDF file to be extracted, and extracting the characters and lines in it; extracting the simplest tables in the PDF file according to the coordinate positions of the extracted lines and the positional relationships between them, where a simplest table is the outermost table in which all lines are connected pairwise; determining the cells of each simplest table according to its table lines, and recursively extracting the simplest tables nested in each cell; and extracting the table characters of each simplest table from the extracted characters according to the coordinate positions of its table lines and the coordinate position of each character. The method is simple to implement, extracts quickly and efficiently, and preserves the internal logical relationships of complex tables.
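
The last step of the abstract, assigning characters to tables by coordinate position, might be sketched like this. The rule used here (a character belongs to the smallest table box that contains its centre point) is an assumption made for illustration; the source does not state how overlap between an outer table and a nested table is resolved. `assign_chars` and the `(text, x, y)` character tuple are hypothetical.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (left, top, right, bottom)
Char = Tuple[str, float, float]           # (text, x, y), centre of the glyph


def contains(box: BBox, x: float, y: float) -> bool:
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def area(box: BBox) -> float:
    left, top, right, bottom = box
    return (right - left) * (bottom - top)


def assign_chars(chars: List[Char], table_boxes: Dict[str, BBox]) -> Dict[str, str]:
    """Append each character to the smallest table box containing it."""
    text = {name: "" for name in table_boxes}
    for ch, x, y in chars:
        hits = [n for n, b in table_boxes.items() if contains(b, x, y)]
        if hits:
            best = min(hits, key=lambda n: area(table_boxes[n]))
            text[best] += ch
    return text
```

Characters that fall outside every table box are left unassigned, which corresponds to text that belongs to ordinary paragraphs rather than to a table.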

Description

Technical field

[0001] The present invention relates to document information extraction, and more particularly to a PDF file information extraction method, device, and computer equipment.

Background technique

[0002] The information in a PDF file falls mainly into text paragraphs, tables, and pictures. Extracting pictures is relatively simple, whereas extracting text paragraphs and tables is more complex, especially for complex nested tables. Bordered (wireframe) tables in a PDF file are usually extracted in a bottom-up manner, which works as follows:

[0003] First, the characters, images, lines, rectangles, etc. of the PDF are parsed with an underlying open-source parsing tool. All the lines are then traversed to find the position coordinates of their intersection points. Then, following the bottom-up idea, the intersections and lines are used to find the cells that may exist, where t...


Application Information

IPC(8): G06F40/18, G06F40/30
CPC: G06F40/18, G06F40/30
Inventor: 阳建仁, 周忠诚, 段炼, 张圣栋, 黄九鸣
Owner 湖南四方天箭信息科技有限公司