Document text extraction method and device

A document text extraction technology, applied in character and pattern recognition, electrical digital data processing, and special data processing applications. It addresses low OCR recognition accuracy, recognition errors repeated at corresponding positions, and poor noise resistance, with the effects of reducing processing workload, reducing manual intervention, and preserving format and logical information.

Inactive Publication Date: 2011-11-30

HANVON CORP

Cites: 2. Cited by: 20.


Problems solved by technology

In the existing process, where each page is rendered to an image and passed to an OCR recognition engine, some engines have poor noise resistance: when the document layout is chaotic or contains background text, OCR recognition accuracy is low. In layout proofreading, and especially text proofreading, when the same character appears many times, a recognition error for that character produces errors at every corresponding position in the document, and correcting it requires modifying the text multiple times.


Embodiment Construction

[0035] In order to make the above objects, features and advantages of the present invention more comprehensible, the present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.

[0036] The invention discloses a text extraction method for a document which, as shown in Figure 1, includes the following steps:

[0037] Step 1: Parse the document, obtain the corresponding information of each font in the document, and derive the character mapping table from that information;

[0038] The corresponding information of the font includes the baseline, original code, font name, Ascent (ascending part), Descent (descending part), and EM Square. As shown in Figure 2, Ascent is the vertical distance above the baseline; in this embodiment, Ascent is 4/5 of the character height. Descent is the vertical distance below the baseline; in this embodiment, Descent is 1/5 of the character height. EM S...
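The 4/5 and 1/5 proportions above can be made concrete with a short Python sketch. The names `GlyphMetrics`, `metrics_for_height`, and `glyph_box` are hypothetical, and the y-up coordinate system (larger y is higher on the page) is an assumption for illustration; the embodiment only fixes the proportions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GlyphMetrics:
    """Vertical font metrics measured from the baseline (hypothetical type)."""
    ascent: float   # distance above the baseline
    descent: float  # distance below the baseline


def metrics_for_height(char_height: float) -> GlyphMetrics:
    """Split a character height using the 4/5 and 1/5 proportions of this embodiment."""
    return GlyphMetrics(ascent=char_height * 4 / 5, descent=char_height / 5)


def glyph_box(baseline_y: float, char_height: float) -> Tuple[float, float]:
    """Return (top, bottom) of the glyph box around the baseline.

    Assumes a y-up coordinate system; flip the signs for y-down rasters.
    """
    m = metrics_for_height(char_height)
    return baseline_y + m.ascent, baseline_y - m.descent
```

Under these assumptions, a character of height 10 placed on a baseline at y = 100 occupies the vertical range from 98 to 108.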


Abstract

The invention discloses a document text extraction method and device, belonging to the field of data processing. The method includes: Step 1: parsing the document, obtaining the corresponding information of the fonts in the document, and obtaining a character mapping table according to that information; Step 2: obtaining the font image corresponding to each character according to the font information; Step 3: cropping the font image to obtain the inked area corresponding to the font image; Step 4: performing character recognition on the inked area to obtain the recognition result of each character; Step 5: updating the character mapping table according to the recognition results, and extracting text information from the document according to the updated character mapping table. The invention improves the data processing flow and reduces its workload, so that fonts packaged with random codes no longer obstruct data processing. For a given layout document, the correct text information can be obtained without recognizing the page images, which minimizes manual intervention and preserves the format and logical information of the document.
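Steps 3 to 5 of the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: glyph images are modelled as 0/1 bitmaps, the character mapping table as a dict from original code to character, and the recognizer is a toy exact-match lookup standing in for a real recognition engine. All names (`crop_inked_area`, `update_mapping`, `extract_text`, `TEMPLATES`, `GLYPHS`) are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Bitmap = List[List[int]]  # assumed representation: 1 = ink, 0 = background


def crop_inked_area(image: Bitmap) -> Bitmap:
    """Step 3: crop a glyph image to the bounding box of its inked pixels."""
    if not image:
        return []
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    if not rows or not cols:
        return []  # blank glyph, nothing to recognize
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]


def update_mapping(
    mapping: Dict[int, str],
    glyphs: Dict[int, Bitmap],
    recognize: Callable[[Bitmap], Optional[str]],
) -> Dict[int, str]:
    """Steps 4 and 5a: recognize each glyph once and update its table entry."""
    updated = dict(mapping)
    for code, image in glyphs.items():
        inked = crop_inked_area(image)
        result = recognize(inked) if inked else None
        if result is not None:
            updated[code] = result
    return updated


def extract_text(codes: List[int], mapping: Dict[int, str]) -> str:
    """Step 5b: decode the document's code sequence with the updated table."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


# Toy data (assumption): two randomly coded glyphs and an exact-match recognizer.
TEMPLATES = {((1, 1), (1, 1)): "A", ((1,), (1,)): "B"}


def toy_recognize(inked: Bitmap) -> Optional[str]:
    return TEMPLATES.get(tuple(tuple(row) for row in inked))


GLYPHS = {
    0xE001: [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]],
    0xE002: [[0, 0, 0], [0, 1, 0], [0, 1, 0]],
}
garbled = {0xE001: "\x03", 0xE002: "\x07"}  # mapping as exported: unreadable
updated = update_mapping(garbled, GLYPHS, toy_recognize)
text = extract_text([0xE001, 0xE002, 0xE001], updated)
```

Because recognition runs once per glyph in the font rather than once per occurrence on a page image, correcting a single table entry fixes every position where that character appears, which is the workload reduction the abstract describes.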

Description

Technical field

[0001] The invention belongs to the field of data processing and relates to a text extraction method and device for documents.

Background art

[0002] When packaging fonts during the creation of a layout document, some manufacturers encode the document with random codes to prevent its text from being copied, so the text obtained when this type of document is exported is garbled. At present, this type of layout document is processed as follows: the entire document is rendered page by page into layout images, an OCR recognition engine recognizes the images, the text is proofread after layout proofreading, and the resulting text is exported. During this process, some OCR recognition engines have poor noise resistance: when the document layout is chaotic or contains background text, OCR recognition accuracy is low, and in the layout proofreading, especially...


Application Information


IPC(8): G06F17/22, G06K9/20

Inventors: 楼永植, 陈峻峰

Owner: HANVON CORP
