Extracting a character string from a document and partitioning the character string into words by inserting space characters where appropriate

A document and character string technology, applied in the field of document processing methods and document processing apparatuses. It addresses the problem that neither the exact word-match search nor the phrase search can correctly find words or phrases, by accurately detecting the width of space characters and accurately determining whether to insert space characters.

Active Publication Date: 2012-07-17
CANON KK

AI Technical Summary

Benefits of technology

The method improves the accuracy of character string searches by correctly inserting space characters, enabling successful word and phrase searches in documents with layout information, even when spaces are represented by character position changes.

Problems solved by technology

However, the exact word-match search cannot find “the” in the “theory” as a hit.
If characters are extracted from such a PDF document or PDL document, a character code of a space character is not included in the characters, resulting in a problem that neither the exact word-match search nor the phrase search can correctly find a word or phrase.
Also, the exact word-match search cannot find any word.
However, the technique as discussed in Japanese Patent Application Laid-Open No. 5-67237 aims at processing a scanned image, not at processing a document described in character code.
Further, the technique as discussed in Japanese Patent Application Laid-Open No. 5-67237 may not obtain a correct space character width, so that a space determination accuracy is lowered.
In addition, a character image may be erroneously recognized upon character recognition to provide wrong character code as a recognition result.
This causes a problem that the exact word-match search and the phrase search may rarely find a word or phrase as a hit.

Embodiment Construction

[0036]Various exemplary embodiments, features, and aspects of the invention will be described in detail below with reference to the drawings.

[0037]FIG. 1A is a block diagram of an example of a configuration of a document processing apparatus according to an exemplary embodiment of the present invention.

[0038]The document processing apparatus includes an arithmetic-control central processing unit (CPU) 1, a keyboard 2 for inputting data and instructions, a display 3 for displaying a document image, a hard disk 4 for storing a document, a read-only memory (ROM) 5 storing programs for controlling the apparatus or necessary information, a random access memory (RAM) 6 used as various work areas, a document layout determination unit 7 corresponding to an analysis unit for analyzing a document structure and configured to determine a document layout for determination as to space / tab, a space determination unit 8 for determining a space / tab upon extracting characters from a document and inse...

Abstract

An apparatus includes a character extraction unit configured to extract a character string from a document including layout information, a character width acquisition unit configured to acquire space character width information, and a spacing amount determination unit configured to determine a spacing amount of each inter-character space based on the character string extracted by the character extraction unit and the layout information. The apparatus further includes an insertion unit configured to determine whether a space character is to be included in each inter-character space based on the spacing amount of each inter-character space determined by the spacing amount determination unit and the space character width information acquired by the character width acquisition unit, and to insert a space character code into an inter-character space in which a space character is determined to be included.
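The decision described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the `Glyph` record, the `insert_spaces` function, and the `ratio` threshold are all hypothetical names and values introduced here, and the source does not say how the spacing amount is compared against the space character width.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one extracted character and its layout data."""
    char: str     # character code extracted from the document
    x: float      # left edge of the glyph on its line
    width: float  # advance width of the glyph


def insert_spaces(glyphs, space_width, ratio=0.5):
    """Rebuild a string, inserting a space where the gap is wide enough.

    Assumption: a gap counts as a space when it is at least
    `ratio * space_width`. The ratio is illustrative only.
    """
    if not glyphs:
        return ""
    out = [glyphs[0].char]
    for prev, cur in zip(glyphs, glyphs[1:]):
        # Spacing amount of the inter-character space.
        gap = cur.x - (prev.x + prev.width)
        if gap >= ratio * space_width:
            out.append(" ")  # insert a space character code
        out.append(cur.char)
    return "".join(out)


# "in" and "the" are separated only by a position change, not a space code.
sample = [
    Glyph("i", 0.0, 2.0),
    Glyph("n", 2.0, 5.0),
    Glyph("t", 10.0, 3.0),
    Glyph("h", 13.0, 5.0),
    Glyph("e", 18.0, 5.0),
]
print(insert_spaces(sample, space_width=3.0))
```

Here the only non-zero gap is the 3.0 units between "n" and "t", which exceeds the assumed threshold of 1.5, so a single space is inserted there.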

Description

BACKGROUND OF THE INVENTION

[0001]1. Field of the Invention

[0002]The present invention relates to a document processing method and a document processing apparatus, and more particularly to a document processing method and a document processing apparatus for extracting a character string from a document including document layout information.

[0003]2. Description of the Related Art

[0004]Conventional methods may be used to search a document for a character string by extracting a character string included in the document and determining whether the character string includes a search key. For example, in a general searching method, it is determined whether at least a part of the extracted character string includes a search key (hereinafter referred to as a “normal search”). Specific search examples include an “exact word-match search” for searching for a full-matched word and a “phrase search” for searching for a phrase including a plurality of words including a space or spaces.

[0005]For ex...
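The three search types named in the background can be contrasted with a short Python sketch. The function names and the whitespace-based tokenization are assumptions made for illustration; the source does not specify how words are delimited beyond the presence of space characters.

```python
def normal_search(text, key):
    """Hit if any part of the extracted string contains the search key."""
    return key in text


def exact_word_match(text, key):
    """Hit only if the key equals a whole word (assumes space-delimited words)."""
    return key in text.split()


def phrase_search(text, phrase):
    """Hit if the words of the phrase appear consecutively in the text."""
    words = text.split()
    target = phrase.split()
    if not target:
        return False
    last_start = len(words) - len(target)
    return any(words[i:i + len(target)] == target for i in range(last_start + 1))


with_spaces = "the theory of spacing"
without_spaces = "thetheoryofspacing"  # space codes lost during extraction

print(normal_search("a theory", "the"))            # substring hit inside "theory"
print(exact_word_match("a theory", "the"))         # no whole word "the"
print(phrase_search(with_spaces, "theory of"))     # consecutive words found
print(phrase_search(without_spaces, "theory of"))  # fails without space codes
```

When the extracted string carries no space character codes, both the exact word-match search and the phrase search see a single long token, which is why they fail to find words that a reader can plainly see on the page.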

Application Information

Patent Type & Authority: Patents (United States)
IPC: IPC(8): G06F17/00
CPC: G06F17/214, G06F17/2294, G06F17/25, G06F40/109, G06F40/163, G06F40/189
Inventor: NAKATSUKA, TADANORI
Owner: CANON KK