A Configurable Information Extraction Method Based on Relational Table

An information extraction and configuration technology, applied in the computer field, that addresses problems such as failing to consider the implicit content of web pages.

Active Publication Date: 2017-04-05
FOCUS TECH +1

AI Technical Summary

Problems solved by technology

[0013] 1. Limited versatility: some existing techniques can only extract information from the plain-text content of web pages; some are suitable only for semi-structured data; and some rely on the inherent structure of web pages and can therefore only extract from structurally similar pages.
[0014] 2. Existing extraction techniques mainly target information that appears explicitly on the webpage and do not consider information that is implicit in it.


Examples

Embodiment Construction

[0045]-[0046] The information extraction method proposed by the present invention is divided into three parts: the information extraction user interface, extraction rule set generation, and data extraction.

[0047] 1. Information extraction user interface

[0048]-[0049] Through this interface, users configure the information to be extracted as relational tables using an SQL-like language, defining each attribute of the extracted content and its extraction method. For attributes whose extraction rules are constructed manually, a CSS selector directly defines the rule; for attributes whose rules are constructed automatically by machine learning, the user gives a definition of the attribute's features.
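
The patent's actual SQL-like syntax is not reproduced in this text, so the following Python sketch only illustrates the idea of an extraction table whose attributes each carry either a manual CSS-selector rule or a feature definition for a learned rule. Every name in it (Attribute, ExtractionTable, the CSS and LEARN keywords, the column names and selectors) is a hypothetical stand-in, not the notation used by the invention.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attribute:
    """One column of the extraction table (all field names are hypothetical)."""
    name: str
    sql_type: str = "TEXT"
    css_selector: Optional[str] = None                  # manually constructed rule
    features: List[str] = field(default_factory=list)   # inputs for a learned rule

    def mode(self) -> str:
        # Exactly one extraction method must be configured per attribute.
        if self.css_selector and not self.features:
            return "manual"
        if self.features and not self.css_selector:
            return "learned"
        raise ValueError(f"attribute {self.name!r} needs exactly one extraction method")


@dataclass
class ExtractionTable:
    name: str
    attributes: List[Attribute]

    def to_statement(self) -> str:
        """Render an illustrative SQL-like definition; the syntax is invented here."""
        cols = []
        for a in self.attributes:
            if a.mode() == "manual":
                rule = f"CSS '{a.css_selector}'"
            else:
                rule = "LEARN (" + ", ".join(a.features) + ")"
            cols.append(f"  {a.name} {a.sql_type} {rule}")
        return f"CREATE TABLE {self.name} (\n" + ",\n".join(cols) + "\n);"


# Hypothetical table for travel website businesses.
travel_business = ExtractionTable(
    name="travel_business",
    attributes=[
        Attribute("name", css_selector="h1.title"),
        Attribute("phone", css_selector="span.tel"),
        Attribute("category", features=["page_text", "title_keywords"]),
    ],
)

print(travel_business.to_statement())
```

Keeping the extraction method on the attribute itself mirrors the description above: the table definition is the single place where a user states both what to extract and how each column is obtained.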

[0050] The following is an example of defining an information extraction table through the information extraction user interface. It creates a table for travel website businesses, a typical application of the method of the invention:

[0051]

[0052]

[0...


Abstract

A configurable information extraction method based on a relational table. The information extraction task is first defined in a structured form, and extraction rules are then constructed by a combination of manual work and machine learning and applied to the target pages. The method has three parts: 1) Information extraction user interface: the interface lets users express information extraction requirements in tabular form, including the main body of the information extraction; 2) Extraction rule set generation: the rule set for the required information is divided into two parts, manually constructed extraction rules and rules generated automatically by machine learning; 3) Data extraction: information is extracted from the webpage and the results are persisted. When a webpage is processed according to the information extraction table configured by the user, the content of each attribute is extracted, and the trained model is then used to classify it.
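
The data extraction step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the tiny "tag.class" matcher stands in for a real CSS selector engine, classify_category is a keyword placeholder for the trained model, and the rule names, sample page, and travel_business table are all hypothetical.

```python
import sqlite3
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collects text inside the first element matching a 'tag.class' selector.

    A deliberately small stand-in for a CSS selector engine; it does not
    handle void elements such as <br> inside the matched element.
    """

    def __init__(self, selector: str):
        super().__init__()
        self.tag, self.cls = selector.split(".", 1)
        self.depth = 0       # greater than 0 while inside the matched element
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.depth += 1
        elif tag == self.tag and self.cls in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def select_text(html: str, selector: str) -> str:
    collector = ClassTextCollector(selector)
    collector.feed(html)
    return "".join(collector.parts).strip()


def classify_category(page_text: str) -> str:
    """Placeholder for the trained model: a keyword rule, not machine learning."""
    return "hotel" if "hotel" in page_text.lower() else "other"


# Hypothetical manually constructed rules: attribute name -> selector.
MANUAL_RULES = {"name": "h1.title", "phone": "span.tel"}


def extract_row(html: str) -> dict:
    row = {attr: select_text(html, sel) for attr, sel in MANUAL_RULES.items()}
    row["category"] = classify_category(html)   # learned attribute
    return row


def persist(conn: sqlite3.Connection, row: dict) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS travel_business (name TEXT, phone TEXT, category TEXT)"
    )
    conn.execute("INSERT INTO travel_business VALUES (:name, :phone, :category)", row)
    conn.commit()


PAGE = (
    '<html><body><h1 class="title">Lakeside Hotel</h1>'
    '<p>Call <span class="tel">555-0100</span></p></body></html>'
)

conn = sqlite3.connect(":memory:")
persist(conn, extract_row(PAGE))
rows = conn.execute("SELECT name, phone, category FROM travel_business").fetchall()
print(rows)  # [('Lakeside Hotel', '555-0100', 'hotel')]
```

Persisting into a relational table whose columns match the configured attributes keeps the stored result aligned with the user's extraction table definition.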

Description

[0001] 1. Technical field

[0002] The invention belongs to the Internet data extraction in the computer field, and in particular relates to a configurable information extraction framework technology based on a relational table.

[0003] 2. Background technology

[0004] With the rapid development of the Internet, people's lives are increasingly inseparable from the Internet, and the amount of information on the Internet is also increasing.

[0005] The explosive growth of the Internet has made the Internet a huge information source, capable of providing massive amounts of valuable information. For users, how to effectively obtain and utilize such information has become particularly urgent and important. At present, most of the data on the Internet appear in the form of HTML. The information in HTML documents is mainly display-oriented, lacking a description of the data itself, and does not contain semantic information. Most of it is unstructured or semi-structured data. This...


Application Information

Patent Type & Authority Patents(China)
IPC IPC(8): G06F17/30
CPC G06F16/33
Inventor 滕晓程, 陈茂榕, 邵明路, 周晔, 孟凡军
Owner FOCUS TECH