PDF text extraction method combined with OCR technology

A PDF text extraction method combined with OCR technology, applied in the field of pattern recognition, solves the problem that the character content of some PDF files cannot be extracted and improves the accuracy of extraction.

Active Publication Date: 2011-11-30
武汉融冠科技发展有限公司

AI Technical Summary

Problems solved by technology

[0007] The purpose of the present invention is to provide a PDF text extraction method combined with OCR technology that overcomes the defects of the prior art, improves the accuracy of PDF text extraction, and solves the problem that the character content of some PDF files cannot be extracted.

Examples

Embodiment Construction

[0032] Figure 1 is a schematic diagram of the idea of the present invention. As shown in Figure 1, information is extracted from the PDF file to obtain the encoding information, image information, and coordinate information of the characters in the document. OCR recognition is performed on the character images, producing a recognition result in the form of character encoding information. The extracted encoding information and the recognition result are then evaluated together to obtain reliable character encoding information, which is combined with the character coordinate information to output text in a specific format.
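The idea in paragraph [0032] can be illustrated with a minimal Python sketch. Everything named here is hypothetical: the `Glyph` record, the `is_reliable` heuristic, and the `ocr` callable are assumptions made for illustration. The source does not state the rule used to combine the two results, so the sketch simply trusts the code extracted from the PDF when it looks like a usable character and falls back to OCR otherwise.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Glyph:
    """One character as extracted from a PDF page (hypothetical record)."""
    code: Optional[str]  # character decoded from the PDF encoding, None if unmappable
    image: bytes         # rendered image of the character, passed to OCR
    x: float             # coordinate information
    y: float


def is_reliable(ch: Optional[str]) -> bool:
    """Assumed heuristic: reject missing, replacement, private-use and control codes."""
    if ch is None or len(ch) != 1:
        return False
    cp = ord(ch)
    if ch == "\ufffd":            # Unicode replacement character
        return False
    if 0xE000 <= cp <= 0xF8FF:    # private-use area, common with custom font encodings
        return False
    if cp < 0x20:                 # control characters
        return False
    return True


def resolve(glyph: Glyph, ocr: Callable[[bytes], str]) -> str:
    """Combine the extracted code and the OCR result into one character."""
    if is_reliable(glyph.code):
        return glyph.code
    # The extracted code is unusable, so fall back to recognizing the image.
    return ocr(glyph.image)
```

Passing the OCR engine in as a callable keeps the sketch independent of any particular OCR library; a real engine would replace it without changing `resolve`.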

[0033] Figure 2 is a flow chart of the steps of the method of the present invention, which is explained in detail below with reference to this figure. As shown in Figure 2, the PDF text extraction method of the...

Abstract

The invention provides a PDF text extraction method combined with OCR technology, belonging to the technical field of pattern recognition. The method includes: (1) extracting data from the PDF; (2) confirming the character content in combination with OCR technology; (3) processing the character encoding; (4) exporting the character encoding processed in step (3) according to the position, font, and font size of each character. Because OCR technology is used when confirming the computer internal code of each character, the invention effectively improves the accuracy of PDF text extraction and solves the problem that the character content of some PDF files cannot be extracted.
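As a rough illustration of step (4), the following Python sketch orders characters by position and uses the font size to decide when a new line starts. The `PlacedChar` record, the `export_text` function, and the `line_tolerance` parameter are hypothetical names, and the grouping rule is an assumption, not the procedure claimed in the patent. The sketch also assumes that `y` grows downward; PDF user space usually has `y` growing upward, so real coordinates may need to be flipped first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlacedChar:
    """A confirmed character with its layout attributes (hypothetical record)."""
    char: str
    x: float
    y: float
    size: float  # font size, used to scale the line-grouping tolerance


def export_text(chars: List[PlacedChar], line_tolerance: float = 0.5) -> str:
    """Group characters into lines by y, then read each line left to right."""
    ordered = sorted(chars, key=lambda c: (c.y, c.x))
    lines: List[List[PlacedChar]] = []
    for c in ordered:
        # Same line if the vertical offset from the line's first character
        # is within a fraction of the font size (assumed rule).
        if lines and abs(c.y - lines[-1][0].y) <= line_tolerance * c.size:
            lines[-1].append(c)
        else:
            lines.append([c])
    return "\n".join(
        "".join(c.char for c in sorted(line, key=lambda c: c.x))
        for line in lines
    )
```

Scaling the tolerance by font size is one way to keep small baseline jitter from splitting a line while still separating lines of large and small text.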

Description

Technical field

[0001] The invention relates to the technical field of pattern recognition, in particular to a method for extracting text from PDF files.

Background art

[0002] PDF is the abbreviation of Portable Document Format, an open electronic file format developed by Adobe. PDF was developed from the PostScript programming language, and PostScript is still widely used in professional publishing as the mainstream printer programming language. PDF largely retains the page description method of PostScript and adopts the character encoding method defined in PostScript.

[0003] The advantage of the PDF file format is that it is independent of software, hardware, and operating system platforms. It can be used without barriers on Windows, Unix, or Apple's Mac OS, with the same display result on each. This feature makes PDF the main electronic document format on the Internet, and it plays an important role in the dissemin...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/22, G06K9/34
Inventor: 江世盛, 刘强
Owner: 武汉融冠科技发展有限公司