Method and system for extracting webpage information

A webpage information extraction technology, applied in the network field, that addresses the problems of positioning information becoming invalid, positioning information being too simple, and repeated structures being poorly identified.

Active Publication Date: 2012-12-19
ALIBABA GRP HLDG LTD
Cites: 3 | Cited by: 34

AI Technical Summary

Problems solved by technology

Automatically generated XPath expressions usually record only tag names and offsets. Such positioning information is too simple to cope with continual changes in web page structure: once the page content is updated and the elements along the XPath change, the expression either fails to locate the content or locates content that was not meant to be extracted.
Because the information recorded by XPath is so limited, XPath also cannot be used on its own to identify repeated structures; additional algorithms are needed to identify and extract them.
[0005] In the course of implementing the present application, the inventors found at least the following problems in the prior art. Web page information extraction usually relies on a semi-automatic method in which the information to be extracted is located by analyzing the page structure. Because web page information is dynamic, real-time updated data, an update to the page content that changes the page structure can invalidate the location information, leading to extraction failures or inaccurate results.
[0006] In addition, existing techniques do not handle repeated structure identification well.
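
The fragility of a path built only from tag names and offsets can be shown with a small Python sketch. The two page snippets, the path expression, and the `extract` helper are hypothetical and only illustrate the failure mode; they are not taken from the patent.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page and an updated version of it in which
# an extra <div> has been inserted before the target.
OLD_PAGE = "<html><body><div>nav</div><div><span>19.99</span></div></body></html>"
NEW_PAGE = ("<html><body><div>nav</div><div>ad</div>"
            "<div><span>19.99</span></div></body></html>")

# A path that records only tag names and offsets.
OFFSET_PATH = "./body/div[2]/span[1]"

def extract(page, path):
    """Return the text of the first node matched by path, or None."""
    node = ET.fromstring(page).find(path)
    return None if node is None else node.text

old_result = extract(OLD_PAGE, OFFSET_PATH)  # finds the target
new_result = extract(NEW_PAGE, OFFSET_PATH)  # div[2] is now the inserted block
print(old_result, new_result)
```

On the updated page the second `div` is the inserted block, which has no `span` child, so the same path no longer reaches the target.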



Examples


Embodiment 1

[0095] In the technical solution provided by this application, the first step is to obtain the location of the information to be extracted in the sample page, that is, the location of the target node, and then to use that location to obtain multiple paths from the target node to the root node. This is a reverse positioning method. The sample page is generally provided by the user and uses the same page template as the pages to be extracted. In one possible implementation, the user enters the address of a web page containing the information to be extracted, and that page is downloaded as the sample page. Sample pages may be downloaded from different sites; in that case, the pages to be extracted are the collection of web pages that share a page template with the corresponding sample pages. The sample page can also be obtained in other ways, and this application does not restri...
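
As a rough illustration of walking from the target node back to the root, the following Python sketch records one descriptor per step. The sample markup, the helper names `parent_map` and `path_to_root`, and the choice of recorded fields (tag, offset among same-tag siblings, `class` attribute) are assumptions made for this example; the relative position information actually used by the application is not visible in this excerpt.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page.
SAMPLE = ('<html><body><div class="nav"/>'
          '<div class="item"><span class="price">19.99</span></div>'
          '</body></html>')

def parent_map(root):
    """ElementTree has no parent pointers, so build a child -> parent map."""
    return {child: parent for parent in root.iter() for child in parent}

def path_to_root(root, target):
    """Walk upward from target to root, describing each node on the way."""
    parents = parent_map(root)
    steps = []
    node = target
    while node is not root:
        parent = parents[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        # 1-based offset among siblings that share the node's tag
        steps.append((node.tag, same_tag.index(node) + 1, node.get("class")))
        node = parent
    steps.append((root.tag, 1, root.get("class")))
    return steps

root = ET.fromstring(SAMPLE)
target = root.find(".//span")
steps = path_to_root(root, target)
print(steps)
```

Recording more than one attribute per step is what would allow several alternative paths to be built for the same target, for example one based on offsets and one based on class names.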

Embodiment 2

[0192] In a preferred embodiment of the present application, instead of taking all paths from the target node to the root node as the path set, a reliability judgment rule is used to find the N paths from the target node to the root node with the fewest points deducted, and these paths form the path set. The more robust a path is, the fewer points are deducted. The path set therefore no longer contains every path from the target node to the root node, only the paths with the smallest deductions.

[0193] The second embodiment of the present application is described below with reference to the drawings. Figure 6 is a schematic diagram of the method in Embodiment 2 of this application.
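
Selecting the N paths with the smallest deduction can be sketched as follows. The penalty table and the candidate paths are hypothetical; the actual reliability judgment rule is not given in this excerpt, so the scoring here only illustrates the selection step.

```python
import heapq

# Hypothetical deductions per step type: the more fragile the step,
# the more points are deducted.
PENALTY = {"id": 0, "class": 1, "offset": 3}

def deduction(path):
    """Total points deducted for a path given as (step_type, value) pairs."""
    return sum(PENALTY[kind] for kind, _ in path)

def top_n_paths(paths, n):
    """Return the n paths with the smallest total deduction."""
    return heapq.nsmallest(n, paths, key=deduction)

# Hypothetical candidate paths from a target node to the root.
CANDIDATES = [
    [("offset", "span[1]"), ("offset", "div[2]")],  # deduction 6
    [("class", "price"), ("id", "item")],           # deduction 1
    [("class", "price"), ("offset", "div[2]")],     # deduction 4
]

path_set = top_n_paths(CANDIDATES, 2)
print(path_set)
```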

[0194] S601: Select information to be extracted from a sample page.

[0195] S602: Analyze the DOM structure of the sample page, construct a DOM tree, and obtain the position of the information to be extracted in the DOM structure.

[0196] S603: Travers...

Embodiment 3

[0228] In another preferred embodiment of the present application, the reliability judgment rule is likewise used to find the paths from the target node to the root node with the fewest points deducted, and these paths form the path set. The main difference from the second embodiment is that the third embodiment first finds all paths from the target node to the root node and then deducts points for every path found according to the reliability judgment rule, selecting the top N paths with the fewest points deducted. The second embodiment instead computes each path's deduction according to the reliability judgment rule while the path is being expanded, and stops expanding a path once its deduction exceeds the threshold.

[0229] Figure 8 is a schematic flow diagram of the method of the third embodiment of the present application, which is described below with reference to the drawings.
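
The contrast between the two embodiments can be sketched in Python under simplifying assumptions. Here each level from target to root offers a few alternative step descriptors with a hypothetical penalty; `rank_all` enumerates every combination and ranks afterwards (as in the third embodiment), while `expand_with_pruning` abandons a partial path as soon as its deduction exceeds a threshold (as in the second embodiment). The data, names, and penalty values are illustrative only.

```python
from itertools import product

# LEVELS[i] lists alternative (descriptor, penalty) pairs for the i-th
# step on the way from the target node to the root. Hypothetical data.
LEVELS = [
    [("span.price", 1), ("span[1]", 3)],
    [("div#item", 0), ("div[2]", 3)],
    [("body", 0)],
]

def rank_all(levels, n):
    """Enumerate every full path, score each one, then keep the best n."""
    scored = [
        (sum(p for _, p in combo), [d for d, _ in combo])
        for combo in product(*levels)
    ]
    scored.sort(key=lambda item: item[0])
    return scored[:n]

def expand_with_pruning(levels, threshold):
    """Score while expanding; stop a branch once it exceeds the threshold."""
    results = []

    def expand(depth, cost, prefix):
        if cost > threshold:
            return  # this partial path is already too unreliable
        if depth == len(levels):
            results.append((cost, prefix))
            return
        for descriptor, penalty in levels[depth]:
            expand(depth + 1, cost + penalty, prefix + [descriptor])

    expand(0, 0, [])
    return sorted(results, key=lambda item: item[0])

ranked = rank_all(LEVELS, 2)
pruned = expand_with_pruning(LEVELS, 3)
print(ranked)
print(pruned)
```

With this data both strategies keep the same two paths; the pruning variant simply never finishes building the branches whose deduction has already passed the threshold.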

[0230] S801: Select information to be extracted from a sample page.

[0231] In the emb...


Abstract

A method of extracting web page information includes analyzing a document object model (DOM) structure of a sample page to obtain the position of the information to be extracted. The node in the DOM structure that corresponds to this position is taken as the target node. Starting from the target node, relative position information is traversed recursively until the root node is reached, creating candidate paths. The candidate paths are taken as a path set. The DOM structure of a page to be extracted is then analyzed, information is located in that DOM structure starting from the root node using the paths in the path set, and an extracted node candidate set is obtained. The node with the highest robustness in the extracted node candidate set is selected as the final extracted node, and the extracted information is obtained from that node.
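
The extraction stage described in the abstract can be sketched in Python as follows. The path set, its deduction scores, the page markup, and the `extract` helper are all hypothetical; treating "highest robustness" as "smallest deduction" is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical path set learned from a sample page:
# (path expression applied from the root, points deducted).
PATH_SET = [
    (".//span[@class='price']", 1),
    ("./body/div[2]/span[1]", 6),
]

# Hypothetical page to be extracted; its layout differs from the sample.
PAGE = ('<html><body><div>nav</div><div>ad</div>'
        '<div><span class="price">19.99</span></div></body></html>')

def extract(page, path_set):
    """Apply every path from the root, then pick the most robust match."""
    root = ET.fromstring(page)
    candidates = []
    for expression, points_deducted in path_set:
        for node in root.findall(expression):
            candidates.append((points_deducted, node))
    if not candidates:
        return None
    # Assumption: the smallest deduction means the highest robustness.
    best_node = min(candidates, key=lambda item: item[0])[1]
    return best_node.text

result = extract(PAGE, PATH_SET)
print(result)
```

Keeping several paths in the set means that when the offset-based path stops matching after a layout change, the attribute-based path can still supply a candidate.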

Description

Technical field

[0001] This application relates to the field of network technology, in particular to a method and system for extracting webpage information.

Background technique

[0002] With the rapid development of the Internet, the Internet has become the most important information release platform. However, in the face of the explosive growth of Internet information, how to quickly and effectively obtain the information users need has become an urgent problem to be solved. Traditional search engines can help people get web pages by searching keywords, but they only give links to related pages, and users still need to manually browse the web to find interesting information. On the other hand, due to the inability to customize precise queries, a large number of search results are not what users want, and precise and professional search results cannot be provided. An ideal approach is that the Internet as an information source can be queried like a database. Thus, web informa...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F40/143
CPC: G06F17/30, G06F17/30908, G06F16/80, G06F40/103, G06F40/143, G06F16/972
Inventor: 蔡波洋强琦
Owner: ALIBABA GRP HLDG LTD