General method for extracting document structural information

A technology for extracting document structure and information, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the loss of document resources, overly complex concepts and operation methods, and poor usability, and it offers simple, compact structure definitions and operation interface calls that are easy to use.

Status: Inactive
Publication Date: 2013-11-20
BEIHANG UNIV
Cites: 4; Cited by: 8

AI Technical Summary

Problems solved by technology

However, these research efforts focus only on text content and the semantics it represents, ignoring the structure information and chart information of the original document. In addition, because their architectures are too general and offer no field-specific technology or method for extracting operable isomorphic documents, they cannot meet practical engineering needs.

[0006] To sum up, existing research on document integration has many limitations:
1) Document information extraction focuses only on text and ignores document structure information. The extracted plain text is convenient for applications such as text retrieval and classification, but because important structural information is missing, it cannot meet the needs of specific engineering fields.
2) Document information extraction ignores the important picture and chart information in the document. This simplifies the definition of a general isomorphic document format, but the original document resources are not fully utilized.
3) Work on defining an open document isomorphic structure proposes the concept of an open document hierarchical model and introduces key technologies and methods for obtaining text information in multiple formats, but it gives neither a domain-specific isomorphic document format and its definition method nor a practical process and method for establishing an open document isomorphism for a specific domain.
4) Open document isomorphism mainly studies text information extraction, processing, and semantic understanding. There is no general document information description method, so the result cannot be understood and operated on by people and cannot meet practical engineering needs.
5) [...] easy to popularize in engineering practice.
6) The extraction methods have poor versatility and cannot guarantee portability.

Embodiment Construction

[0026] The present invention will be further described in detail in conjunction with the accompanying drawings and implementation examples.

[0027] The purpose of the present invention is to provide a general method for extracting document structure information in a specific field. Based on the concept of document extraction, it extracts important document structure information while preserving the picture and chart information in the document. The extraction method is simple, easy to use, and versatile. The method of the invention can establish an isomorphic information model for documents in the specific field, realize isomorphic operation on document information, and facilitate integrated document management. In the embodiment of the present invention, the implementation is described using domain knowledge from one specific field: earthquake emergency plans.

[0028] The earthquake emergency plan management information syste...

Abstract

The invention provides a general method for extracting document structural information and belongs to the field of document integration engineering. According to the domain knowledge of a specific domain, an isomorphic information format for documents is defined. The isomorphic information format comprises at least a text node defining the text information format, a structure node defining the document structural information format, a picture node defining the picture information format, and a table node defining the table information format. An extraction and conversion method from the original document to the isomorphic information format, a unified operation interface for use by the upper layer, and an isomorphic information description format are then built, and the isomorphic information format is converted into the isomorphic information description format for display. The method extracts important document structural information while keeping the pictures and table information in the document, and it is simple, easy to use, and highly general. With this method, a document isomorphic information model for the specific domain can be built, isomorphic interoperability of document information is achieved, and integrated document management becomes convenient.
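
The four node types and the conversion steps named in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the patent's actual format or interface: the class names (TextNode, PictureNode, TableNode, StructureNode), the toy extractor extract_plain_text (which treats lines starting with '#' as section titles), and the indented text produced by describe as a stand-in for the isomorphic information description format.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class TextNode:
    """Text information: a run of body text."""
    text: str


@dataclass
class PictureNode:
    """Picture information: a reference to an image kept from the source."""
    source: str
    caption: str = ""


@dataclass
class TableNode:
    """Table information: rows of cell strings."""
    rows: List[List[str]]


@dataclass
class StructureNode:
    """Document structure information: a titled section with child nodes."""
    title: str
    children: List["Node"] = field(default_factory=list)


Node = Union[TextNode, PictureNode, TableNode, StructureNode]


def extract_plain_text(lines: List[str]) -> StructureNode:
    """Hypothetical extractor for one source format.

    Assumption: a line starting with '#' opens a new section; any other
    non-empty line becomes a text node in the current section.
    """
    root = StructureNode(title="document")
    current = root
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            current = StructureNode(title=stripped.lstrip("#").strip())
            root.children.append(current)
        else:
            current.children.append(TextNode(text=stripped))
    return root


def describe(node: Node, depth: int = 0) -> List[str]:
    """Render a node tree as indented, human-readable description lines."""
    pad = "  " * depth
    if isinstance(node, StructureNode):
        out = [f"{pad}[section] {node.title}"]
        for child in node.children:
            out.extend(describe(child, depth + 1))
        return out
    if isinstance(node, TextNode):
        return [f"{pad}[text] {node.text}"]
    if isinstance(node, PictureNode):
        return [f"{pad}[picture] {node.source} {node.caption}".rstrip()]
    return [f"{pad}[table] {len(node.rows)} row(s)"]
```

In this sketch, upper-layer code works only with the node classes and describe, so supporting another source format would mean adding another extractor that returns a StructureNode, without changing the code that consumes the tree.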

Description

Technical field

[0001] The invention belongs to the field of document integration engineering, and relates to a general document information format definition, a method for realizing the conversion and operation process between a group of document information formats, and a general definition of document isomorphic information description; in particular, it relates to a general document structure information extraction method.

Background technique

[0002] With the development of related technologies, the status of document resources in engineering practice has become more and more prominent. The concept of document engineering put forward in the new century puts the status of document resources at the center of engineering practice. Document resources are a kind of knowledge accumulation and the crystallization of experience in engineering practice. Making full use of existing document resources can reduce mistakes made in current engineering practice, provide reference fo...

Application Information

IPC(8): G06F17/30
Inventor: 李新然, 吕江花, 马世龙
Owner: BEIHANG UNIV