
Recognition method and device for page headers and page footers of format electronic document

An electronic document identification technology, applied in the field of header and footer identification. It addresses the difficulty of ensuring recognition accuracy when existing methods are useful only for documents whose headers and footers appear at the top and bottom of the page, and it achieves increased coverage and high identification accuracy.

Inactive Publication Date: 2015-09-30
ALIBABA GRP HLDG LTD
Cites: 5; Cited by: 10

AI Technical Summary

Problems solved by technology

When the above two methods are used to identify the headers and footers of a document, they place very high demands on the document's feature values. If the document lacks the corresponding feature value, recognition accuracy is difficult to guarantee.
For example, determining the header and footer from the horizontal line at the top of the page only works for documents that have this characteristic. Such methods are useful only for documents in which the header and footer appear at the top and bottom of the page.



Examples


Embodiment 1

[0034] Referring to Figure 1, Embodiment 1 of the present application provides a flow chart of a method for identifying headers and footers of a format electronic document. As shown in the figure, the method may include the following steps:

[0035] S101: Parse multiple pages of the format electronic document separately, and obtain the text content of each text line contained in each page;

[0036] In the embodiment of the present application, when a format electronic document needs pre-display processing, the document may first be parsed to obtain the text content of each text line contained in each of its pages. Usually, the content of a format electronic document page is mostly text, as in electronic novels; but some text content is contained in pictures, as in format electronic documents generated by scanning. In this case, you can first edit the text content in the picture ...
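Step S101 can be illustrated with a minimal Python sketch. It assumes the text of each page has already been extracted into one plain string per page by a separate tool that is not shown here; the function name `split_pages_into_lines` is hypothetical, and text held inside pictures is not handled.

```python
def split_pages_into_lines(page_texts):
    """Turn one text string per page into a list of text lines per page.

    Assumption: page_texts[i] holds the already-extracted text of page i.
    Blank lines are dropped and surrounding whitespace is stripped.
    """
    return [
        [line.strip() for line in text.splitlines() if line.strip()]
        for text in page_texts
    ]


if __name__ == "__main__":
    pages = split_pages_into_lines(["My Novel\n\nIt was a dark night.\nPage 1"])
    print(pages)
```

The result is a list of pages, each a list of text lines, which is the input shape the later judging steps would need.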


Abstract

The invention discloses a method and device for recognizing page headers and page footers of a format electronic document. The method comprises: parsing multiple pages of the format electronic document to obtain the text content of every text line on every page; traversing the text content of the text lines on each page and judging whether each text line meets the characteristics of a page header or page footer; and determining the text lines in which the page headers and page footers are located according to the judgment results. With this method and device, whether a line in the document is a page header or page footer is inferred backwards from the similarity of the content of a given line across multiple pages and from the pages obtained based on that similarity. The method places no rigid restrictions on the feature values or positions of page headers and page footers, so coverage of existing documents is greatly improved and high recognition accuracy is achieved.
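The idea of judging a line by its similarity across pages can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented procedure: the similarity measure (`difflib.SequenceMatcher`), the thresholds, the digit normalization that lets "Page 1" match "Page 2", and all function names are choices made for this example.

```python
import re
from difflib import SequenceMatcher


def normalize(line):
    """Collapse digit runs so lines differing only by a number compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def is_similar(a, b, threshold=0.9):
    """Assumed similarity test: difflib ratio at or above the threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def find_repeated_lines(pages, min_fraction=0.6, threshold=0.9):
    """Return, per page, indices of lines that recur on most other pages.

    pages: list of pages, each a list of text lines.
    No line position is assumed; every line on every page is checked.
    """
    normalized = [[normalize(line) for line in page] for page in pages]
    result = []
    for p, page in enumerate(normalized):
        others = [q for q in range(len(normalized)) if q != p]
        hits = []
        for i, line in enumerate(page):
            if not line or not others:
                continue
            # Count the other pages that contain at least one similar line.
            matches = sum(
                any(is_similar(line, cand, threshold) for cand in normalized[q])
                for q in others
            )
            if matches / len(others) >= min_fraction:
                hits.append(i)
        result.append(hits)
    return result


if __name__ == "__main__":
    pages = [
        ["My Novel", "Chapter one begins here.", "It was a dark night.", "Page 1"],
        ["My Novel", "The storm grew worse.", "Nobody slept.", "Page 2"],
        ["My Novel", "Morning came at last.", "The end.", "Page 3"],
    ]
    print(find_repeated_lines(pages))
```

Because the check does not depend on where a line sits on the page, a running title in the middle of a page would be flagged in the same way as one at the top, which matches the abstract's point that positions are not rigidly restricted.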

Description

Technical field

[0001] The present application relates to the technical field of document identification, in particular to a method and device for identifying headers and footers of format electronic documents.

Background technique

[0002] With the popularity of handheld terminal devices, demand for reading on them keeps growing. As content carriers, most current electronic documents are format electronic files, mainly in PDF format, converted from typesetting tools and typesetting files. The pages of such files are usually large, which makes them unsuitable for reading on handheld terminals or other small-screen devices. At present, the file formats better suited to reading on handheld devices are streaming-based formats, such as epub (Electronic Publication) files. In this file format, the page count and layout of the document are rearranged during reading, and the reader also n...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06K9/20
Inventor: 吴运俊
Owner: ALIBABA GRP HLDG LTD