Relational table-based extraction method of configurable information

An information extraction technology with configurable extraction targets, applied in the computer field, that addresses the problem of existing techniques not considering the information implicit in web pages.

Active Publication Date: 2015-09-02
FOCUS TECH +1

AI Technical Summary

Problems solved by technology

[0011] 1. Versatility: some existing techniques can only extract information from the plain-text content of web pages; some are suitable only for semi-structured data; and some rely on the inherent structure of web pages and so can only extract from similar pages.
[0012] 2. Existing extraction techniques mainly target information that appears explicitly on the web page and do not consider the information implicit in the page.



Examples


Embodiment Construction

[0040]-[0041] The information extraction method proposed by the present invention has three main parts: the information extraction user interface, extraction rule set generation, and data extraction.

[0042] 1. Information extraction user interface

[0043]-[0044] Through this interface, users configure the information to be extracted as relational tables using an SQL-like language, defining each attribute of the extracted content and its extraction method. For attributes whose extraction rules are built manually, a CSS selector defines the extraction rule directly; for attributes whose rules are built automatically by machine learning, the user defines the features to use.
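The configuration described above can be pictured as a small data model. The following is a minimal Python sketch, not the SQL-like syntax of the invention (which is not reproduced here). The class names `Attribute` and `ExtractionTable`, the table name `example_business`, and the attribute names, selector, and feature label are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Attribute:
    """One column of a hypothetical extraction table."""
    name: str
    css_selector: Optional[str] = None                  # set for manually built rules
    features: List[str] = field(default_factory=list)   # set for learned rules

    def method(self) -> str:
        """Return 'manual' or 'learned'; reject empty or ambiguous definitions."""
        if self.css_selector and not self.features:
            return "manual"
        if self.features and not self.css_selector:
            return "learned"
        raise ValueError(
            f"attribute {self.name!r} needs exactly one of css_selector or features"
        )


@dataclass
class ExtractionTable:
    """A relational-table style description of what to extract."""
    name: str
    attributes: List[Attribute]

    def rules_by_method(self) -> Dict[str, List[str]]:
        """Group attribute names by how their extraction rule is built."""
        grouped: Dict[str, List[str]] = {"manual": [], "learned": []}
        for attr in self.attributes:
            grouped[attr.method()].append(attr.name)
        return grouped


# Hypothetical table: one manually defined attribute, one learned attribute.
table = ExtractionTable(
    name="example_business",
    attributes=[
        Attribute("title", css_selector="h1.title"),
        Attribute("category", features=["page_text_keywords"]),
    ],
)
print(table.rules_by_method())
```

Keeping the selector and the feature definition on the same attribute record mirrors the idea that each column of the table carries its own extraction method.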

[0045] Below is an example of defining an information extraction table through the information extraction user interface. It creates a table for travel website businesses (a typical application of the method of the invention):

[0046]

[0047]

[0...


Abstract

A relational table-based method for extracting configurable information. An information extraction task is defined in structured form, and extraction rules are built by combining manual construction with machine learning in order to extract from target pages. The method comprises three steps: (1) an information extraction user interface lets the user express an information extraction requirement in tabular form, including the subject of the extraction; (2) an information extraction rule set is generated, containing both manually built rules and rules generated automatically via machine learning; (3) data extraction extracts information from web pages and persists the results: when extracting from a given web page according to the user-configured information extraction table, the content of each attribute is extracted, and the content is classified by a trained model.
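The three steps in the abstract can be sketched end to end as follows. This is an illustrative Python sketch under several assumptions, not the implementation of the invention: `TagClassText` is a stand-in that supports only `tag.class` selectors rather than full CSS, `toy_model` is a keyword check standing in for a trained classifier, the in-memory list `store` stands in for persistence, and all names and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List


class TagClassText(HTMLParser):
    """Collect the text of the first element matching a 'tag.class' selector."""

    def __init__(self, selector: str):
        super().__init__()
        self.tag, self.cls = selector.split(".", 1)
        self.depth = 0      # nesting depth of same-name tags inside the match
        self.done = False   # True once the first match has been closed
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag != self.tag:
            return
        if self.depth:
            self.depth += 1
        elif self.cls in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def select_text(html: str, selector: str) -> str:
    """Apply a manually built rule (a 'tag.class' selector) to a page."""
    parser = TagClassText(selector)
    parser.feed(html)
    return "".join(parser.chunks).strip()


def extract_page(
    html: str,
    manual_rules: Dict[str, str],
    learned_rules: Dict[str, Callable[[str], str]],
    store: List[Dict[str, str]],
) -> Dict[str, str]:
    """Extract every configured attribute from one page and persist the row."""
    row = {name: select_text(html, sel) for name, sel in manual_rules.items()}
    for name, model in learned_rules.items():
        row[name] = model(html)   # classification by the (stand-in) model
    store.append(row)             # stand-in for persisting the result
    return row


def toy_model(text: str) -> str:
    """Keyword check used here in place of a trained classifier."""
    return "hotel" if "hotel" in text.lower() else "other"


store: List[Dict[str, str]] = []
page = '<html><body><h1 class="title">Sea View Hotel</h1><p>Rooms</p></body></html>'
row = extract_page(page, {"title": "h1.title"}, {"category": toy_model}, store)
print(row)
```

Passing the rule sets in as plain dictionaries keeps the extraction step independent of how the rules were produced, whether written by hand or generated by training.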

Description

1. Technical field

[0001] The invention belongs to the Internet data extraction in the computer field, and in particular relates to a configurable information extraction framework technology based on a relational table.

2. Background technology

[0002] With the rapid development of the Internet, people's lives are increasingly inseparable from the Internet, and the amount of information on the Internet is also increasing.

[0003] The explosive growth of the Internet has made the Internet a huge information source, capable of providing massive amounts of valuable information. For users, how to effectively obtain and utilize such information has become particularly urgent and important. At present, most of the data on the Internet appear in the form of HTML. The information in HTML documents is mainly display-oriented, lacking a description of the data itself, and does not contain semantic information. Most of it is unstructured or semi-structured data. This prevents appli...


Application Information

IPC(8): G06F17/30
CPC: G06F16/33
Inventor: 滕晓程陈茂榕邵明路周晔孟凡军
Owner: FOCUS TECH