Method, device, electronic device and storage medium for merging tables across pages of PDF document

A cross-page table merging technique, applied in the field of text processing, that judges with high accuracy whether tables split across pages should be merged

Active Publication Date: 2022-07-15
PING AN TECH (SHENZHEN) CO LTD
5 cites, 0 cited by

AI Technical Summary

Problems solved by technology

In the prior art, cross-page table merging in PDF documents mainly relies on rules that check whether the two tables on either side of a page break contain the same number of columns. For complex tables that span pages, such rules do not judge reliably.
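
The column-count rule described above can be sketched in a few lines. This is an illustration of the prior-art rule only, not code from the patent; the names `Table`, `column_count` and `rule_says_merge`, and all sample data, are hypothetical.

```python
from typing import List, Optional

# A table is a list of rows; each row is a list of cell strings (None = empty cell).
Table = List[List[Optional[str]]]


def column_count(table: Table) -> int:
    """Width of the widest row; 0 for an empty table."""
    return max((len(row) for row in table), default=0)


def rule_says_merge(upper: Table, lower: Table) -> bool:
    """Prior-art style rule: merge when both fragments have the same column count."""
    return column_count(upper) == column_count(lower)


# Hypothetical data: two unrelated tables that happen to share a column count.
balance_sheet = [["Item", "Amount"], ["Cash", "120"]]
staff_list = [["Name", "Role"], ["Li", "Engineer"]]

# Hypothetical data: a real continuation whose second fragment was parsed
# with an extra column, as can happen with merged cells.
upper_part = [["Item", "2020", "2021"], ["Revenue", "10", "12"]]
lower_part = [["Cost", "", "7", "8"]]
```

On this sample data the rule merges the two unrelated tables and rejects the genuine continuation, which is the kind of misjudgment on complex tables that the text describes.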

Examples

Embodiment 1

[0053] Figure 1 is a flow chart of a method for merging tables across pages in a PDF document in an embodiment of the present invention. Depending on requirements, the order of the steps in the flow chart may be changed, and some steps may be omitted.

[0054] As shown in Figure 1, the method for merging tables across pages in a PDF document includes the following steps:

[0055] Step S11: Acquire at least two PDF documents containing tables, collect location information and text information of at least one table in each of the PDF documents, and obtain a table data set according to the location information of the tables.

[0056] Specifically, in at least one embodiment of the present invention, collecting the location information and text information of at least one table in each of the PDF documents, and obtaining the table data set according to the location information of the tables, includes:

[0057] Use the pdfplumber library to parse each of ...
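
The paragraph above is cut off, so the exact pdfplumber calls used in the patent are not known. As a minimal sketch of step S11, assuming pdfplumber's `Page.find_tables()`, `Table.bbox` and `Table.extract()` are used to collect location and text information, it might look like the following. `TableRecord`, `collect_tables` and `sort_reading_order` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TableRecord:
    """Location and text information for one table (hypothetical structure)."""
    page_number: int                          # 1-based page index
    bbox: Tuple[float, float, float, float]   # (x0, top, x1, bottom)
    rows: List[List[Optional[str]]]           # cell text, row by row


def collect_tables(pdf_path: str) -> List[TableRecord]:
    """Collect every table pdfplumber detects in one PDF document."""
    import pdfplumber  # third-party; imported lazily so the helpers below work without it

    records: List[TableRecord] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.find_tables():
                records.append(
                    TableRecord(page.page_number, tuple(table.bbox), table.extract())
                )
    return records


def sort_reading_order(records: List[TableRecord]) -> List[TableRecord]:
    """Order tables by page, then top to bottom (smaller `top` is higher on the page)."""
    return sorted(records, key=lambda r: (r.page_number, r.bbox[1]))
```

Sorting by page and vertical position makes it easy to find the last table on one page and the first table on the next, which are the candidates for a cross-page merge.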

Embodiment 2

[0118] Figure 2 is a structural diagram of an apparatus 30 for merging tables across pages in a PDF document in an embodiment of the present invention.

[0119] In some embodiments, the PDF document cross-page table merging apparatus 30 runs in an electronic device. The PDF document cross-page table merging apparatus 30 may include a plurality of functional modules composed of program code segments. The program codes of each program segment in the PDF document cross-page table merging apparatus 30 may be stored in the memory and executed by at least one processor to perform the PDF document cross-page table merging function.

[0120] In this embodiment, the PDF document cross-page table merging apparatus 30 may be divided into a plurality of functional modules according to the functions it performs. As shown in Figure 2, the PDF document cross-page table merging apparatus 30 may include a table data acquisition module 301, a training data set constr...

Embodiment 3

[0168] Figure 3 is a schematic diagram of the electronic device 6 in an embodiment of the present invention.

[0169] The electronic device 6 includes a memory 61, a processor 62 and computer-readable instructions stored in the memory 61 and executable on the processor 62. When the processor 62 executes the computer-readable instructions, the steps in the above embodiments of the PDF document cross-page table merging method are implemented, for example, steps S11 to S16 shown in Figure 1. Alternatively, when the processor 62 executes the computer-readable instructions, the functions of each module/unit in the above embodiment of the apparatus for merging tables across pages in a PDF document are implemented, for example, modules 301 to 306 in Figure 2.

[0170] Exemplarily, the computer-readable instructions may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 61 and executed by the processor 62 to The pres...

Abstract

The invention relates to the technical field of artificial intelligence and provides a method, apparatus, electronic device and storage medium for merging tables across pages of a PDF document. The method includes: acquiring at least two PDF documents and collecting at least one table from each to obtain a table data set; generating a cross-page table training data set from the table data set; training a deep learning model on the cross-page table training data set to obtain a table merge model; obtaining a PDF test document, removing its headers and footers, and constructing cross-page table test data; obtaining a binary classification prediction value for the test data from the table merge model; determining from that prediction value whether the cross-page tables in the test data need to be merged; and merging and outputting those that do. The invention can effectively handle the extraction of complex tables that span pages in a PDF document, and judges with high accuracy whether such tables need to be merged.
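
The final decision-and-merge stage of the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the trained table merge model is represented by a caller-supplied `predict` function returning a score in [0, 1], the threshold of 0.5 is an assumed default, and only the last table on one page and the first table on the next are treated as a candidate pair. The names `merge_cross_page_tables`, `predict` and `threshold` are hypothetical; the model's architecture and input features are not given in the text shown here.

```python
from typing import Callable, List, Optional, Sequence

# A table is a list of rows; each row is a list of cell strings (None = empty cell).
Table = List[List[Optional[str]]]


def merge_cross_page_tables(
    pages: Sequence[Sequence[Table]],
    predict: Callable[[Table, Table], float],
    threshold: float = 0.5,
) -> List[Table]:
    """Merge tables that continue across a page break.

    `pages[i]` holds the tables on page i in top-to-bottom order, with headers
    and footers already removed. `predict(upper, lower)` stands in for the
    binary classifier and returns the score that `lower` continues `upper`.
    """
    merged: List[Table] = []
    prev_last: Optional[Table] = None  # last table fragment on the previous page
    for page_tables in pages:
        for index, table in enumerate(page_tables):
            continues_previous = (
                index == 0
                and prev_last is not None
                and predict(prev_last, table) >= threshold
            )
            if continues_previous:
                # Append this fragment's rows to the table it continues.
                merged[-1].extend(row[:] for row in table)
            else:
                merged.append([row[:] for row in table])
        # A page without tables breaks the chain of continuations.
        prev_last = page_tables[-1] if page_tables else None
    return merged
```

Because a page holding a single table makes that table both the first and the last on its page, the same loop also handles a table that runs over three or more pages.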

Description

Technical field

[0001] The invention relates to the technical field of text processing in artificial intelligence, and in particular to a method, apparatus, electronic device and storage medium for merging tables across pages of a PDF document.

Background

[0002] The PDF format is widely used for storing and transmitting files, and information often has to be extracted from PDF documents. Tables appear frequently in PDF documents, but the PDF format has no table construct, so a table obtained by parsing a PDF document consists only of text and drawn lines. When one table appears at the bottom of a page and another at the top of the next page, it is necessary to judge whether they are the same table. In the prior art, cross-page table merging in PDF documents mainly uses rules that check whether the two tables on either side of the page break contain the same number of columns. For complex tables that span pages, the rule met...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/174, G06N20/00, G06F16/16
CPC: G06F40/174, G06N20/00, G06F16/16
Inventors: 王文浩, 徐国强
Owner: PING AN TECH (SHENZHEN) CO LTD