Method, device, equipment and storage medium for extracting webpage information

A webpage information extraction method, applied in the field of computer networks, that addresses the long training time, heavy consumption of computing resources, and high labor costs of existing approaches, and that reduces both resource consumption and the amount of node information required.

Active Publication Date: 2022-01-28
SHANGHAI SINITEK COMPUTER TECH CO LTD

AI Technical Summary

Problems solved by technology

[0004] Although both existing methods can extract text from web pages, each has drawbacks when faced with a large number of pages. With the method that builds XPath path templates from preset rules, there is no unified template, so a large number of XPath templates must be written by hand. To keep each template applicable to its web page, the template has to be modified continually, or even rewritten, as the page changes, which greatly increases labor costs. Even then, a page change may go unnoticed and the template may not be updated in time, so the extracted information is inaccurate or cannot be extracted at all. With the method that performs a preorder traversal of the multi-way tree built from all nodes and analyzes each traversed root node with a pre-built node analysis model, every node has to be analyzed, so both the analysis and the construction of the model require a very large amount of node information. The scheme as a whole therefore consumes large amounts of computing and graphics processing unit (GPU) resources, the program implementing it may be killed by the system because of memory problems, training takes a long time, and convergence is slow.
[0005] In addition, for the many web pages whose text resides in leaf nodes, the second method has a further weakness: if the node analysis model misjudges an intermediate node as one that does not need to be retained, pruning that node also removes the leaf nodes beneath it that should have been kept. The text that should have been retained is lost, so the text finally extracted from the page is incomplete and of low accuracy.



Embodiment Construction

[0047] To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those skilled in the art will understand that the embodiments provide many technical details so that readers can better understand the present application. The technical solutions claimed in this application can nevertheless be realized without these details, and with various changes and modifications based on the following embodiments. The embodiments are divided as they are for convenience of description only; the division should not limit the specific implementation of the present invention, and the embodiments can be combined and cross-referenced wherever they do not contradict one another.

[0048] The first embodiment of the present invention relates to a metho...


Abstract

The invention discloses a method, device, equipment and storage medium for extracting webpage information, which address the heavy workload, difficult maintenance, and low accuracy of existing webpage information extraction. The method includes: obtaining the leaf node path of each leaf node in the webpage to be extracted; according to each leaf node path, obtaining the leaf node information of the corresponding leaf node and the parent node information of that leaf node's parent, which together form the node information of the leaf node; constructing a DOM tree from the leaf node paths and the node information; traversing the nodes of the DOM tree and analyzing each traversed leaf node with a pre-trained neural network recognition model to obtain an analysis result for each leaf node; determining the extraction path of the information to be extracted according to the analysis results; and extracting the information from the webpage according to the extraction path.
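The steps in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `LeafCollector`, `stub_model`, `extraction_path`, and `extract` are hypothetical names, the pre-trained neural network recognition model is replaced by a trivial word-count rule, and the explicit DOM tree is simplified to a flat list of leaf records that carry their path and parent tag.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class LeafCollector(HTMLParser):
    """Records path, text, and parent tag for each text-bearing leaf element."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open tags from the root to the current element
        self.has_child = []  # parallel to stack: did the element contain a child?
        self.text = []       # parallel to stack: text chunks directly inside
        self.leaves = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.has_child:
            self.has_child[-1] = True
        self.stack.append(tag)
        self.has_child.append(False)
        self.text.append([])

    def handle_endtag(self, tag):
        if not self.stack or self.stack[-1] != tag:
            return  # this sketch ignores mismatched closing tags
        path = "/" + "/".join(self.stack)
        is_leaf = not self.has_child.pop()
        text = " ".join(self.text.pop())
        self.stack.pop()
        if is_leaf and text:
            parent = self.stack[-1] if self.stack else None
            self.leaves.append({"path": path, "text": text, "parent": parent})

    def handle_data(self, data):
        if self.stack and data.strip():
            self.text[-1].append(data.strip())


def stub_model(leaf):
    # ASSUMPTION: stand-in for the pre-trained neural network recognition model.
    return len(leaf["text"].split()) >= 5


def extraction_path(leaves):
    """Pick the parent path shared by the most leaves the model accepted."""
    kept = [leaf for leaf in leaves if stub_model(leaf)]
    if not kept:
        return None
    parents = Counter(leaf["path"].rsplit("/", 1)[0] for leaf in kept)
    return parents.most_common(1)[0][0]


def extract(html):
    collector = LeafCollector()
    collector.feed(html)
    collector.close()
    target = extraction_path(collector.leaves)
    texts = [leaf["text"] for leaf in collector.leaves
             if target and leaf["path"].rsplit("/", 1)[0] == target]
    return target, texts


SAMPLE_HTML = (
    "<html><body><div><a>Home</a><a>About</a></div>"
    "<article><p>The first paragraph has several words here.</p>"
    "<p>The second paragraph also has enough words.</p></article>"
    "</body></html>"
)
```

Calling `extract(SAMPLE_HTML)` returns the chosen parent path together with the text of the leaves beneath it. Only leaf nodes are scored, which is the property the abstract relies on to reduce the amount of node information handled.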

Description

Technical field

[0001] The embodiments of the present invention relate to the technical field of computer networks, and in particular to a method, device, equipment and storage medium for extracting webpage information.

Background

[0002] In World Wide Web (Web) data mining, extracting the information carried in web pages is usually a basic early step. How to extract high-quality information from web pages efficiently and accurately has therefore become a hot research topic in recent years.

[0003] In the prior art, a common way to extract webpage information is rule-based extraction: different XML Path Language (XPath) path templates are constructed according to preset rules, and each template is then used to extract the text of the corresponding web pages. Another common way is to build a document object model based on all nodes in the ...
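For illustration, rule-based extraction with an XPath path template might look like the sketch below. The template string, the page markup, and the function name `extract_with_template` are assumptions made for this example; it uses the limited XPath subset supported by Python's standard `xml.etree.ElementTree`, so the input must be well-formed XML.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: a hand-written template for one hypothetical page layout.
TEMPLATE = "./body/div[@class='content']/p"


def extract_with_template(page, template=TEMPLATE):
    """Return the text of every element matched by the XPath template."""
    root = ET.fromstring(page)
    return [node.text for node in root.findall(template)]


PAGE_V1 = (
    "<html><body><div class='content'>"
    "<p>Hello</p><p>World</p></div></body></html>"
)
# Same content after a hypothetical redesign that renames the class.
PAGE_V2 = (
    "<html><body><div class='main'>"
    "<p>Hello</p><p>World</p></div></body></html>"
)
```

With these inputs, `extract_with_template(PAGE_V1)` matches both paragraphs, while `extract_with_template(PAGE_V2)` matches nothing until the template is rewritten for the new layout.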


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/951, G06F16/957, G06K9/62, G06N3/04, G06N3/08
CPC: G06F16/951, G06F16/9577, G06N3/084, G06N3/044, G06N3/045, G06F18/24323, G06F18/214
Inventor: 张学哲, 张浩波
Owner: SHANGHAI SINITEK COMPUTER TECH CO LTD