Method and system for extracting webpage information

A web page information extraction technology, applied in the network field, that addresses invalid positioning information, overly simple positioning information, and poor identification of repetitive structures.

Active Publication Date: 2012-12-19

ALIBABA GRP HLDG LTD

3 Cites, 34 Cited by


Problems solved by technology

Automatically generated XPATH usually records only tag names and offsets. Such positioning information is too simple to cope with continual changes in web page structure.

Once the content of a web page is updated and the elements along the XPATH path change, the XPATH either fails to locate the content or locates content that was not meant to be extracted.

Moreover, because XPATH records so little information, it cannot by itself identify repeated structures; additional algorithms are needed to identify and extract them.

[0005] In the process of implementing the present application, the inventor found at least the following problems in the prior art: web page information extraction usually uses a semi-automatic method in which the information to be extracted is located by analyzing the page structure. Because web page information is dynamic data that is updated in real time, once the page content is updated and the page structure changes, the location information tends to become invalid, causing extraction to fail or to return inaccurate results.

[0006] In addition, existing techniques do not handle repetitive structure identification well: automatic XPATH generation methods cannot use XPATH itself to identify repeated structures, so additional algorithms must be added to identify and extract them.


Examples


Embodiment 1

[0095] In the technical solution provided by this application, the first step is to obtain the location of the information to be extracted in the sample page, that is, the location of the target node. That location is then used to derive multiple paths from the target node to the root node; this is a reverse positioning method. The sample page is generally provided by the user and is a web page that uses the same page template as the pages to be extracted. In one possible implementation, the user enters the address of a web page containing the information to be extracted, and that page is downloaded as the sample page. Sample pages may be downloaded from different sites, in which case the pages to be extracted are the collection of web pages that share a page template with the corresponding sample page. Of course, the sample page can also be obtained in other ways, and this application does not restri...
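The reverse positioning idea in [0095] can be illustrated with a minimal sketch. It is not the patent's implementation: the sample page, the helper names `step_options` and `candidate_paths`, and the choice of `id` and `class` as extra positioning features are all assumptions made for illustration. The sketch walks from the target node up to the root, collects several alternative descriptions of each step, and combines them into multiple candidate paths.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical sample page (well-formed so the stdlib XML parser can read it).
SAMPLE = """<html><body>
<div id="nav">menu</div>
<div id="main"><p>intro</p><p class="price">42 USD</p></div>
</body></html>"""


def step_options(node, parent):
    """Alternative path steps that match `node` under `parent`."""
    same_tag = [child for child in parent if child.tag == node.tag]
    options = [f"{node.tag}[{same_tag.index(node) + 1}]"]  # tag name + offset
    for attr in ("id", "class"):  # assumed extra positioning features
        if node.get(attr):
            options.append(f"{node.tag}[@{attr}='{node.get(attr)}']")
    return options


def candidate_paths(root, target):
    """Walk from the target node up to the root, then combine the step options."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    levels = []  # collected in target -> root order
    node = target
    while node in parent_of:
        levels.append(step_options(node, parent_of[node]))
        node = parent_of[node]
    levels.reverse()  # paths are applied from the root downwards
    return ["./" + "/".join(steps) for steps in itertools.product(*levels)]


root = ET.fromstring(SAMPLE)
target = root.find(".//p[@class='price']")
paths = candidate_paths(root, target)
for path in paths:
    print(path, "->", root.find(path).text)
```

On this sample, every candidate path resolves to the same target node, but the paths differ in how much they depend on sibling offsets, which is what makes some of them more likely than others to survive a change in page structure.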

Embodiment 2

[0192] In a preferred embodiment of the present application, instead of taking all paths from the target node to the root node as the path set, a reliability judgment rule is used to find the N paths from the target node to the root node with the fewest deductions, and those paths form the path set. The more robust a path, the fewer points are deducted. The path set therefore no longer contains every path from the target node to the root node, only the least-deducted ones.

[0193] The second embodiment of the present application is described below with reference to the drawings. Figure 6 is a schematic diagram of the method in Embodiment 2 of this application.

[0194] S601: Select information to be extracted from a sample page.

[0195] S602: Analyze the DOM structure of the sample page, construct a DOM tree, and obtain the position of the information to be extracted in the DOM structure.

[0196] S603: Travers...

Embodiment 3

[0228] In another preferred embodiment of the present application, the reliability judgment rule is likewise used to find the least-deducted paths from the target node to the root node as the path set. The main difference from the second embodiment is that the third embodiment first finds all paths from the target node to the root node and then deducts points from every path found according to the reliability judgment rule, so as to select the top N paths with the fewest deductions. The second embodiment instead calculates each path's deduction during expansion according to the reliability judgment rule, and stops expanding a path once its deduction exceeds the threshold.

[0229] Figure 8 is a schematic flow diagram of the method of the third embodiment of the present application, which is described below with reference to the drawings.
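The difference between the two embodiments can be sketched as follows. This excerpt does not state the actual reliability judgment rule, so the per-step deductions in `LEVELS`, the threshold values, and the function names are hypothetical. The only property the sketch relies on is that deductions are non-negative, so a partial path that has already exceeded the threshold can never recover.

```python
import heapq
from itertools import product

# Step options per level, ordered target -> root, each paired with an
# assumed deduction (offset-based steps are penalised more than id/class).
LEVELS = [
    [("p[2]", 3), ("p[@class='price']", 1)],
    [("div[2]", 3), ("div[@id='main']", 0)],
    [("body[1]", 3)],
]


def exhaustive_top_n(levels, n):
    """Embodiment 3 style: enumerate every path, then score and keep the best n."""
    scored = []
    for combo in product(*levels):
        steps = tuple(step for step, _ in reversed(combo))  # root -> target
        scored.append((sum(d for _, d in combo), steps))
    return heapq.nsmallest(n, scored)


def pruned_top_n(levels, n, threshold):
    """Embodiment 2 style: score while expanding; drop a branch past the threshold."""
    frontier = [(0, ())]
    for options in levels:
        expanded = []
        for cost, steps in frontier:
            for step, deduction in options:
                if cost + deduction <= threshold:  # otherwise stop expanding
                    expanded.append((cost + deduction, (step,) + steps))
        frontier = expanded
    return heapq.nsmallest(n, frontier)


print(exhaustive_top_n(LEVELS, 2))
print(pruned_top_n(LEVELS, 2, threshold=6))
```

With a threshold of 6 both strategies return the same two paths here; a tighter threshold makes the pruned search return fewer paths, in exchange for never expanding branches that are already too costly.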

[0230] S801: Select information to be extracted from a sample page.

[0231] In the emb...


Abstract

A method of extracting web page information includes analyzing a document object model (DOM) structure of a sample page to obtain a position of information to be extracted. A node corresponding to the position of the information to be extracted is rendered in the DOM structure as a target node. Starting from the target node, relative position information is traversed recursively until the root node is found to create candidate paths. The candidate paths are rendered as a path set. A DOM structure of a page to be extracted is analyzed, information is located in the DOM structure of the page starting from the root node in the path set, and an extracted node candidate set is obtained. A node having highest robustness from the extracted node candidate set is selected to be a final extracted node and extracted information is obtained using the extracted node.
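The extraction phase described in the abstract can be sketched in the same way. The abstract does not say how robustness is measured for the extracted node candidates, so this sketch assumes a simple stand-in: the node reached by the largest number of paths in the path set wins. `PATH_SET`, `PAGE`, and `extract` are illustrative names, not part of the patent.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical path set learned from a sample page.
PATH_SET = [
    "./body[1]/div[@id='main']/p[@class='price']",
    "./body[1]/div[@id='main']/p[2]",
    "./body[1]/div[2]/p[2]",
]

# Page to be extracted: same template, but a banner div has been inserted,
# which shifts the sibling offsets under <body>.
PAGE = """<html><body>
<div id="banner">ad</div>
<div id="nav">menu</div>
<div id="main"><p>intro</p><p class="price">57 USD</p></div>
</body></html>"""


def extract(page_root, path_set):
    """Apply each path from the root and return the text of the most-voted node."""
    votes = Counter()
    for path in path_set:
        node = page_root.find(path)
        if node is not None:  # a stale path simply contributes no candidate
            votes[node] += 1
    if not votes:
        return None
    best_node, _ = votes.most_common(1)[0]
    return best_node.text


print(extract(ET.fromstring(PAGE), PATH_SET))
```

In this example the purely offset-based path no longer matches anything after the banner is inserted, while the two paths that rely on the `id` attribute still agree on the price node.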

Description

Technical field

[0001] This application relates to the field of network technology, in particular to a method and system for extracting webpage information.

Background art

[0002] With the rapid development of the Internet, the Internet has become the most important platform for publishing information. However, in the face of the explosive growth of Internet information, how to quickly and effectively obtain the information users need has become an urgent problem. Traditional search engines help people find web pages by keyword search, but they only return links to related pages, and users still need to browse those pages manually to find the information of interest. Moreover, because precise queries cannot be customized, many search results are not what users want, and precise, professional search results cannot be provided. Ideally, the Internet as an information source could be queried like a database. Thus, web informa...


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G06F17/30, G06F40/143

CPC: G06F17/30, G06F17/30908, G06F16/80, G06F40/103, G06F40/143, G06F16/972

Inventor: 蔡波洋强琦

Owner: ALIBABA GRP HLDG LTD
