Webpage information extraction method and system

A webpage information extraction technology, applied in the field of information extraction, that addresses the lack of global awareness, low generality, and low robustness of existing extraction methods.

Active Publication Date: 2014-06-18

INST OF COMPUTING TECH CHINESE ACAD OF SCI

Cites: 4; Cited by: 18


AI Technical Summary
This helps you quickly interpret patents by identifying the three key elements:
Problems solved by technology
Method used
Benefits of technology

Problems solved by technology

This method is sensitive to the structure of the web page and generalizes poorly. Ensuring recall requires a large number of rules and manual intervention, and a large rule set makes conflicts between rules more likely: for example, a rule that matches data nodes in one web page may match noise nodes in another, slightly different web page.

Existing methods therefore often trade off accuracy, recall, and manual cost against one another.
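
The rule conflict described above can be illustrated with a minimal Python sketch. The two toy pages and the positional rule ("the second div under body") are hypothetical and are not taken from the patent; the point is only that one rule can select a data node in one page and a noise node in a slightly different page.

```python
import xml.etree.ElementTree as ET

# Two toy pages (hypothetical); PAGE_B has an extra banner before the content.
PAGE_A = "<html><body><div>nav</div><div>Article text</div></body></html>"
PAGE_B = ("<html><body><div>nav</div><div>Ad banner</div>"
          "<div>Article text</div></body></html>")

# A hand-written positional rule: "the second <div> under <body>".
RULE = "./body/div[2]"


def apply_rule(page: str, rule: str) -> str:
    """Return the text of the first node the rule matches, or '' if none."""
    root = ET.fromstring(page)
    node = root.find(rule)
    return node.text if node is not None and node.text else ""


result_a = apply_rule(PAGE_A, RULE)  # the rule lands on the data node
result_b = apply_rule(PAGE_B, RULE)  # the same rule lands on a noise node
print(result_a)
print(result_b)
```

Fixing the second page by adding another rule is exactly what grows the rule set and raises the chance of further conflicts.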

[0012] 2. Single feature rule

In some web pages, data and noise differ greatly in the features that an existing method relies on, and the method achieves good results. In other web pages, the difference between data and noise in those features is not obvious, and the method cannot extract well.

The method therefore has low generality.

[0013] 3. Does not support complex data schema (semantic structure)

Existing methods often only support simple flat data schemas and cannot adequately express more complex data schemas
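
As a rough illustration of the difference, the Python sketch below contrasts a flat record with a nested one. The field names (`title`, `price`, `reviews`) are invented for this example and do not come from the patent.

```python
# Flat schema: one level of fields, one value each.
flat_record = {"title": "Phone X", "price": "199"}

# Nested schema: a product holds a repeated list of reviews,
# and each review has its own fields.
nested_record = {
    "title": "Phone X",
    "price": "199",
    "reviews": [
        {"author": "Ann", "text": "Good"},
        {"author": "Bob", "text": "Fine"},
    ],
}


def depth(value) -> int:
    """Nesting depth: 0 for a plain value, plus 1 for every dict or list level."""
    if isinstance(value, dict):
        return 1 + max((depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((depth(v) for v in value), default=0)
    return 0


print(depth(flat_record), depth(nested_record))
```

A method that can only emit records like `flat_record` has no place to put the repeated, structured `reviews` entries.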

[0014] 4. Extraction methods are not globally aware

After part of a web page has been matched successfully, existing methods usually do not consider whether the matching position is optimal or how that match affects the subsequent matching of other rules. A single wrong or failed partial match can cause a series of side effects in later extraction, so the extraction method is not robust.


Embodiment Construction

[0072] The technical solutions of the present invention will be described in detail below in conjunction with the embodiments and the accompanying drawings.

[0073] First, the application scenarios and concepts used in the present invention are described.

[0074] The content of a web page is composed of semantic units, and each semantic unit corresponds to a semantic attribute. Semantic attributes can be combined to form a new semantic attribute, which is called the parent semantic attribute. The semantic attributes directly contained in a parent semantic attribute are its sub-semantic attributes, and the sub-semantic attributes under the same parent semantic attribute are sibling semantic attributes. Each specific value of a semantic attribute is a subtree forest in the DOM tree of the web page, and the subtrees in the subtree forest are continuous and non-overlapping, that is, there are no adjacent subtrees in the subtree forest. If there are other...
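
The parent, sub- and sibling relationships in [0074] can be sketched as a small tree structure in Python. The class name `SemanticAttribute`, its methods, and the example attribute names are assumptions made for illustration; the patent text shown here does not specify a data structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SemanticAttribute:
    """One node of a semantic structure tree (hypothetical representation)."""
    name: str
    # Back-reference to the parent; excluded from repr/compare to avoid cycles.
    parent: Optional["SemanticAttribute"] = field(
        default=None, repr=False, compare=False
    )
    children: List["SemanticAttribute"] = field(default_factory=list)

    def add_child(self, name: str) -> "SemanticAttribute":
        """Create a sub-semantic attribute directly under this one."""
        child = SemanticAttribute(name, parent=self)
        self.children.append(child)
        return child

    def siblings(self) -> List["SemanticAttribute"]:
        """Other sub-semantic attributes that share this node's parent."""
        if self.parent is None:
            return []
        return [c for c in self.parent.children if c is not self]


# A parent attribute formed by combining two attributes.
product = SemanticAttribute("product")
title = product.add_child("title")
price = product.add_child("price")
sibling_names = [s.name for s in title.siblings()]
print(sibling_names)
```

Here `product` is the parent semantic attribute, and `title` and `price` are sibling sub-semantic attributes under it.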


Abstract

The invention discloses a webpage information extraction method and system. The method includes the following steps: acquiring a marked webpage, generating a semantic structure tree, building an information mode pattern, generating the semantic attribute node information of each semantic attribute node in the information mode pattern, generating a wrapper, and exporting the wrapper as a wrapper document; building an extractor for extracting webpages similar to the marked webpage; acquiring the webpages to be extracted and, in the DOM (document object model) tree of each webpage to be extracted, having the extractor recursively extract, layer by layer and starting from the root semantic attribute node of the information mode pattern, the data extraction area or iterative data extraction area corresponding to each semantic attribute node; and exporting the data in the data extraction area or iterative data extraction area corresponding to each semantic attribute node as the extraction result. The method has high generality, generalization ability, fault tolerance, and extensibility, requires little manual involvement, and preserves online extraction efficiency, so it is highly practical.
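
The layer-by-layer recursive extraction described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the `SchemaNode` class, the `iterative` flag, and locating regions by a `class` attribute are all hypothetical stand-ins, since the abstract does not say how the wrapper locates each data extraction area.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class SchemaNode:
    """A semantic attribute node of a toy schema (hypothetical)."""
    name: str
    css_class: str            # assumed locator: match on the class attribute
    iterative: bool = False   # True if the attribute repeats within its parent
    children: List["SchemaNode"] = field(default_factory=list)


def find_regions(region: ET.Element, node: SchemaNode) -> List[ET.Element]:
    """Locate the extraction area(s) for `node` inside the parent's area."""
    matches = [e for e in region.iter() if e.get("class") == node.css_class]
    return matches if node.iterative else matches[:1]


def extract(region: ET.Element, node: SchemaNode):
    """Recurse from `node` downward, searching only inside `region`."""
    results = []
    for sub in find_regions(region, node):
        if node.children:
            results.append({c.name: extract(sub, c) for c in node.children})
        else:
            results.append("".join(sub.itertext()).strip())
    if node.iterative:
        return results
    return results[0] if results else None


PAGE = """<html><body>
<div class="list">
  <div class="item"><span class="title">A</span><span class="price">1</span></div>
  <div class="item"><span class="title">B</span><span class="price">2</span></div>
</div>
</body></html>"""

SCHEMA = SchemaNode("products", "list", children=[
    SchemaNode("item", "item", iterative=True, children=[
        SchemaNode("title", "title"),
        SchemaNode("price", "price"),
    ]),
])

result = extract(ET.fromstring(PAGE), SCHEMA)
print(result)
```

Each child attribute is searched only inside the area already found for its parent, which is what keeps the two `title` values attached to the correct `item` in this toy page.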

Description

Technical field

[0001] The invention belongs to the field of information extraction, and in particular relates to wrapper generation based on the DOM tree of a web page and to web page information extraction technology.

Background

[0002] Since the 1990s, the World Wide Web (WWW) has developed rapidly, and the amount of information it contains has exploded. As the Internet has become a widely used tool, it has also become a huge store of knowledge containing a large amount of valuable information. How to make full use of the massive information on the Internet to provide better services has long been a focus of attention. Web pages are an important information carrier on the Internet and the main way to obtain information from it. How to extract the needed information from web pages, that is, web page information extraction, has become an important research topic. Webpage information extrac...


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G06F17/30

CPC: G06F16/835, G06F16/951

Inventor: 程学旗, 万圣贤, 余钧, 郭岩, 刘悦, 张瑾, 余智华

Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI
