Web page content extraction method and apparatus, and computing device

A web page content extraction technology for computing devices, applied in the field of the Internet, which addresses problems such as the high time cost of extraction

Active Publication Date: 2017-07-14
QILIN HESHENG NETWORK TECH INC
5 Cites, 13 Cited by

AI Technical Summary

Problems solved by technology

In existing DOM-tree-based approaches, a complete DOM tree must be created and traversed every time the content of a web page is extracted, so the time cost is too high.

Examples

Embodiment Construction

[0035] Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided for more thorough understanding of the present disclosure and to fully convey the scope of the present disclosure to those skilled in the art.

[0036] Figure 1 shows a schematic diagram of a webpage content extraction system 100 according to an embodiment of the present invention. As shown in Figure 1, the webpage content extraction system 100 includes a computing device 200, a server 310 and a server 320. The webpage content extraction system 100 in Figure 1 is only exemplary; in the specific practical situation...

Abstract

The present invention discloses a web page content extraction method and apparatus, and a computing device. The method is suitable for execution in the computing device, which comprises a data storage device. The method comprises: obtaining an HTML document of a to-be-processed web page; according to the domain name of the to-be-processed web page, obtaining the node matching rule corresponding to that domain name from the data storage device, wherein the node matching rule is generated based on a DOM tree of a source web page associated with the to-be-processed web page; constructing a target DOM tree, wherein the target DOM tree is initialized to be empty; processing the HTML document by using the node matching rule so as to update the target DOM tree; and obtaining each node in the updated target DOM tree so as to extract the content of the to-be-processed web page.
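The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The rule format (pairs of tag name and class), the `RULE_STORE` dictionary standing in for the data storage device, the domain `news.example.com`, and all class and function names are assumptions made for illustration. The rules are hard-coded here, whereas the abstract says they are generated from the DOM tree of a source web page. The sketch uses the standard-library streaming parser so that only matching nodes are ever added to the target tree.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical stand-in for the "data storage device":
# domain name -> set of (tag, class) pairs that identify content nodes.
RULE_STORE = {
    "news.example.com": {("h1", "title"), ("div", "article")},
}

# Elements that never have an end tag, so they must not stay "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A node of the target tree; only matched elements become nodes."""
    def __init__(self, tag):
        self.tag = tag
        self.text = []
        self.children = []


class RuleDrivenBuilder(HTMLParser):
    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.root = Node("#root")   # target tree starts with no content nodes
        self.open_tags = []         # (tag, Node or None) for each open element

    def _nearest_kept(self):
        # Innermost open element that matched a rule, if any.
        for _, node in reversed(self.open_tags):
            if node is not None:
                return node
        return None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        node = None
        if any((tag, c) in self.rules for c in classes):
            node = Node(tag)
            parent = self._nearest_kept() or self.root
            parent.children.append(node)   # update the target tree
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, node))

    def handle_endtag(self, tag):
        # Close the innermost open element with this tag name.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        node = self._nearest_kept()
        if node is not None and data.strip():
            node.text.append(data.strip())


def extract_content(url, html_doc):
    """Return (tag, text) for every node of the target tree, in document order."""
    rules = RULE_STORE.get(urlparse(url).hostname, set())
    builder = RuleDrivenBuilder(rules)
    builder.feed(html_doc)
    builder.close()
    results, pending = [], list(builder.root.children)
    while pending:
        node = pending.pop(0)
        results.append((node.tag, " ".join(node.text)))
        pending = node.children + pending
    return results
```

In this sketch, elements that match no rule are tracked only as lightweight stack entries and never become tree nodes, so the final traversal touches only the nodes that carry content.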

Description

Technical field

[0001] The invention relates to the technical field of the Internet, in particular to a web page content extraction method, apparatus and computing device.

Background technique

[0002] Each website on the Internet has its own web pages, and the structure and layout of these pages differ considerably, so parsing a web page and extracting its content is tedious and time-consuming. At present, most methods for extracting web page content are based on DOM trees: the web page content is organized into a DOM tree, and the tree is traversed to obtain the information in the required nodes, which together form the content to be extracted.

[0003] DOM stands for Document Object Model. It uses the tag information of an HTML document, such as Table and List, to logically parse the document into a tree structure whose nodes are individual objects. After the DOM tree is built, it traverses each nod...
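The conventional approach described in the background can be sketched as follows. This is an illustrative assumption of how a full build-and-traverse extractor might look, not code from the patent; `Element`, `FullDomBuilder`, `extract_by_full_traversal` and the class-based selection are hypothetical names and choices. The function also returns how many nodes were visited, to make the traversal cost visible.

```python
from html.parser import HTMLParser

# Elements that never have an end tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Element:
    def __init__(self, tag, attrs):
        self.tag = tag
        self.attrs = dict(attrs)
        self.children = []   # child Elements and text strings, in order


class FullDomBuilder(HTMLParser):
    """Builds a tree node for every element in the document."""
    def __init__(self):
        super().__init__()
        self.root = Element("#document", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        el = Element(tag, attrs)
        self.stack[-1].children.append(el)
        if tag not in VOID_TAGS:
            self.stack.append(el)

    def handle_endtag(self, tag):
        # Close the innermost open element with this tag; never pop the root.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(data.strip())


def text_of(el):
    # Concatenate all text below an element.
    parts = [c if isinstance(c, str) else text_of(c) for c in el.children]
    return " ".join(p for p in parts if p)


def extract_by_full_traversal(html_doc, wanted_classes):
    builder = FullDomBuilder()
    builder.feed(html_doc)
    builder.close()
    visited, results = 0, []
    stack = [builder.root]
    while stack:                       # every node is visited, wanted or not
        el = stack.pop()
        visited += 1
        classes = (el.attrs.get("class") or "").split()
        if any(c in wanted_classes for c in classes):
            results.append((el.tag, text_of(el)))
        stack.extend(c for c in reversed(el.children) if isinstance(c, Element))
    return results, visited
```

Every element is allocated and visited on each extraction even when only a few nodes hold the wanted content, which is the time cost the invention aims to reduce.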

Application Information

IPC(8): G06F17/30
CPC: G06F16/9577, G06F16/986
Inventor: 李涛
Owner: QILIN HESHENG NETWORK TECH INC