
Chinese ancient book character recognition method, Chinese ancient book character segmentation, layout reconstruction method, medium and equipment

A character recognition and character classification technology, applied in character recognition, character and pattern recognition, and neural learning methods. It addresses problems such as misjudgment, omission, and uneven distribution of character categories, with the effects of a more uniform character size distribution, reduced negative interference, and improved accuracy.

Active Publication Date: 2021-07-23
SOUTH CHINA UNIV OF TECH

AI Technical Summary

Problems solved by technology

However, Chinese ancient book documents contain special elements such as icons, seals, and double-column notes. Traditional text-line detection algorithms, and methods that focus only on simple layouts such as single-line text, are not suitable for ancient book documents with complex layout structures and diverse content.
At the same time, Chinese ancient book documents contain handwritten, variant, or uncommon character forms. Traditional algorithms that focus on common printed Chinese characters can handle only a limited set of character types and cannot accurately identify these uncommon characters, which leads to misjudgments and omissions when recognizing ancient book documents.
In addition, there is currently a lack of ancient book document datasets that cover a wide range of character categories and font styles, and the character categories in the annotations of existing ancient book document image datasets are unevenly distributed.



Examples


Embodiment 1

[0081] This embodiment discloses a method for character recognition in ancient Chinese books, which can be executed by smart devices such as computers. As shown in Figure 1, it specifically includes the following steps:

[0082] Step 1. Obtain Chinese ancient book document images annotated with character bounding boxes and character categories as the original training samples; at the same time, obtain the annotation file of the original training samples. The annotation file includes the character bounding box size, character position, and character category.

[0083] The character position can be obtained from the character bounding box. Specifically, the character position is given by the coordinates of two opposite corners of the bounding box, for example (x_left, y_top, x_right, y_bottom), where (x_left, y_top) is the coordinate of the upper left corner of the bounding box and (x_right, y_bottom) is the coordinate of the lower right corner.
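The corner-coordinate convention in [0083] can be illustrated with a small data structure. This is a minimal sketch only: the class name CharBox, its field layout, and the sample values are hypothetical and are not taken from the patent or its annotation file format. It assumes image pixel coordinates in which y grows downward.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CharBox:
    """Hypothetical annotation record: one character and its bounding box."""
    x_left: float    # upper-left corner, x
    y_top: float     # upper-left corner, y
    x_right: float   # lower-right corner, x
    y_bottom: float  # lower-right corner, y
    category: str    # character label

    @property
    def width(self) -> float:
        return self.x_right - self.x_left

    @property
    def height(self) -> float:
        return self.y_bottom - self.y_top

    @property
    def center(self) -> Tuple[float, float]:
        return ((self.x_left + self.x_right) / 2,
                (self.y_top + self.y_bottom) / 2)


# Illustrative values only.
box = CharBox(x_left=120, y_top=40, x_right=168, y_bottom=90, category="書")
print(box.width, box.height, box.center)
```

Storing the two corners rather than a center plus size means the bounding box size mentioned in Step 1 can always be derived, as the width and height properties do here.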

[00...

Embodiment 2

[0118] This embodiment discloses a method for grouping characters in ancient Chinese books, comprising the following steps:

[0119] Step 7. For the acquired Chinese ancient book document image, obtain the predicted bounding box and predicted category of each character in the image through the method described in Embodiment 1;

[0120] Step 8. The predicted bounding boxes of the characters are clustered according to the reading order of ancient books and the semantic sentence groups of the characters, and the reading order is restored to obtain the text content of the ancient book without punctuation marks. As shown in Figure 2, the specific steps are as follows:

[0121] S1. Taking the predicted bounding box of each character as input, spatially sort the boxes according to the reading order of ancient books, and calculate the geometric feature information of the character bounding boxes. The details are as follows:

[0122] S1a. Sorting the predicted bounding boxes of each...
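The text of S1a is cut off, so the patented sorting procedure is not shown. As a rough illustration of spatial sorting by reading order, the sketch below assumes the common layout of traditional Chinese books: vertical columns read from top to bottom, with columns ordered from right to left. The function name sort_reading_order, the column_tolerance parameter, and the greedy column grouping are all assumptions made for this example, not the patent's method.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_left, y_top, x_right, y_bottom)


def x_center(box: Box) -> float:
    return (box[0] + box[2]) / 2


def sort_reading_order(boxes: List[Box], column_tolerance: float = 0.5) -> List[Box]:
    """Order boxes column by column: right-to-left, then top-to-bottom.

    Boxes are visited from right to left. A box joins the current column
    when its x-center lies within column_tolerance * (its own width) of the
    column's mean x-center; otherwise it starts a new column.
    """
    columns: List[List[Box]] = []
    for box in sorted(boxes, key=x_center, reverse=True):
        if columns:
            current = columns[-1]
            mean_x = sum(x_center(b) for b in current) / len(current)
            if abs(x_center(box) - mean_x) <= column_tolerance * (box[2] - box[0]):
                current.append(box)
                continue
        columns.append([box])

    ordered: List[Box] = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b[1]))  # top to bottom
    return ordered


# Two columns of two characters each, given in shuffled order.
shuffled = [(40, 50, 80, 90), (100, 0, 140, 40), (42, 0, 82, 40), (98, 50, 138, 90)]
ordered = sort_reading_order(shuffled)
print(ordered)
```

A purely geometric rule like this would likely mis-order the double-column notes mentioned earlier, since their smaller characters sit side by side inside one main column. This may be one reason the method also clusters characters by semantic sentence group.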

Embodiment 3

[0171] This embodiment discloses a method for reconstructing the layout of ancient Chinese books, comprising the following steps:

[0172] Step 9. For the acquired Chinese ancient book document image, first group the characters recognized in the image and restore their reading order using the Chinese ancient book character grouping method described in Embodiment 2, obtaining the text content of the ancient book without punctuation marks;

[0173] Step 10. Build a language-model-based layout reconstruction algorithm for ancient book documents, comprising an error correction language model and a sentence segmentation and punctuation language model, and perform error correction and sentence segmentation on the unpunctuated text content of the ancient book. As shown in Figure 5, the details are as follows:

[0174] (1) Starting from the BERT-base-chinese language model pre-trained on modern texts, using the Yizhige ancient text dataset as the domain cor...


Abstract

The invention discloses a Chinese ancient book character recognition method, a Chinese ancient book character segmentation method, a layout reconstruction method, a medium, and equipment. The Chinese ancient book character recognition method comprises the following steps: first, obtaining Chinese ancient book document images annotated with character bounding boxes and character categories, and taking these images as original training samples; acquiring the annotation file of the original training samples; randomly selecting a plurality of original training samples and processing them to obtain new training samples; processing the original training samples and the new training samples by online random cropping to obtain a training sample set; training a character-level detection and classification model with the samples in the training sample set; and inputting a Chinese ancient book document image whose characters are to be recognized into the character-level detection and classification model to obtain the predicted bounding box and predicted category of each character in the image. With this method, common characters can be recognized, uncommon special characters in ancient books can also be recognized accurately, and the problems of misjudgment and omission found in prior-art ancient book document recognition are addressed.
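The abstract does not give the details of the online random cropping step. The sketch below shows one common way such a step can be implemented for detection data: a crop window is drawn at random each time a sample is requested, and the character boxes are moved into the crop's coordinate frame. The function name random_crop_boxes, the rule of keeping a box only when its center falls inside the window, and the clipping behavior are assumptions for illustration, not the patent's procedure.

```python
import random
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_left, y_top, x_right, y_bottom)


def random_crop_boxes(
    img_w: int,
    img_h: int,
    boxes: List[Box],
    crop_w: int,
    crop_h: int,
    rng: Optional[random.Random] = None,
) -> Tuple[Tuple[int, int, int, int], List[Box]]:
    """Pick a random crop window and remap the boxes into it.

    Returns the window (x0, y0, x1, y1) in original image coordinates and
    the surviving boxes in crop coordinates. A box survives if its center
    lies inside the window; surviving boxes are clipped to the window.
    """
    rng = rng or random.Random()
    x0 = rng.randint(0, img_w - crop_w)  # randint is inclusive on both ends
    y0 = rng.randint(0, img_h - crop_h)
    x1, y1 = x0 + crop_w, y0 + crop_h

    kept: List[Box] = []
    for left, top, right, bottom in boxes:
        cx, cy = (left + right) / 2, (top + bottom) / 2
        if x0 <= cx < x1 and y0 <= cy < y1:
            kept.append((max(left, x0) - x0, max(top, y0) - y0,
                         min(right, x1) - x0, min(bottom, y1) - y0))
    return (x0, y0, x1, y1), kept


# Illustrative boxes on a 200 x 100 image.
page_boxes = [(10, 10, 50, 50), (150, 60, 190, 95)]
window, cropped = random_crop_boxes(200, 100, page_boxes, 120, 80, random.Random(0))
print(window, cropped)
```

Because the window is drawn anew on every call, the same page yields different crops across training epochs, which is the usual meaning of "online" augmentation. The pixel crop itself would be taken with the returned window.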

Description

Technical field

[0001] The invention relates to the technical field of ancient Chinese book research, and in particular to a method, medium, and equipment for character recognition, grouping, and layout reconstruction of ancient Chinese books.

Background

[0002] With the research and development of deep learning, image text detection and recognition technology based on computer vision plays an increasingly important role in daily life, business activities, and scientific research, and has made good progress. The research results cover document recognition, bill recognition, and text recognition in specific scenes. However, existing research targets only text images with clear handwriting, obvious background contrast, and limited character categories, in which the characters follow the modern typesetting style of left to right and top to bottom. There are deficiencies in the research work on the recognition of ancient book documents that follow the arran...


Application Information

IPC(8): G06K9/00, G06K9/62, G06N3/04, G06N3/08, G06K9/03, G06F40/232, G06F40/30
CPC: G06N3/08, G06F40/232, G06F40/30, G06V30/414, G06V30/40, G06V10/98, G06V30/287, G06V30/10, G06N3/045, G06F18/23, G06F18/24, G06F18/214, Y02D10/00
Inventor: 薛洋, 李智豪
Owner: SOUTH CHINA UNIV OF TECH