Infinite layer collection method based on Web page

A collection method and Web page technology, applied in store-and-forward switching systems, electrical components, transmission systems, and similar fields. It addresses the problems that existing collection programs consume large amounts of computer resources and cannot use multi-threading, with the effects of reducing server load, saving network bandwidth, and making accuracy easier to ensure.

Status: Inactive; Publication Date: 2009-04-08
赵洪宇
Cites: 0; Cited by: 18

AI Technical Summary

Problems solved by technology

Although a recursive collection program is simple, when the page at a URL contains many links, each recursive call pushes its unfinished execution state onto the program's call stack, so the program consumes a large amount of computer resources while it runs.
In addition, such a program cannot use multi-threading.
Recursion is therefore not used in efficient collection programs.
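
The stack cost described above can be illustrated with a minimal sketch. It is not taken from the patent: the link graph, `build_chain`, `visit_recursive`, `visit_iterative`, and `recursive_overflows` are all hypothetical names introduced here, and the graph is a toy chain in which each page links to the next. The recursive visitor keeps one stack frame per page still being processed, while the queue-based visitor keeps only a queue of pending pages.

```python
from collections import deque


def build_chain(length):
    """Hypothetical link graph: page i links only to page i + 1."""
    graph = {i: [i + 1] for i in range(length)}
    graph[length] = []  # the last page has no outgoing links
    return graph


def visit_recursive(graph, page, seen):
    """Recursive traversal: every followed link adds a call-stack frame."""
    seen.add(page)
    for link in graph[page]:
        if link not in seen:
            visit_recursive(graph, link, seen)


def visit_iterative(graph, start):
    """Queue-based traversal: pending pages wait in a queue, not on the stack."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in graph[page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


def recursive_overflows(graph, start):
    """Report whether the recursive traversal exhausts the interpreter stack."""
    try:
        visit_recursive(graph, start, set())
    except RecursionError:
        return True
    return False


graph = build_chain(50_000)
print("recursive traversal overflowed:", recursive_overflows(graph, 0))
print("pages reached iteratively:", len(visit_iterative(graph, 0)))
```

Under CPython's default recursion limit, the recursive version fails on a chain this long, while the queue-based version reaches every page. A queue of pending URLs can also be shared between worker threads, which a single recursive call chain cannot.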



Examples

Embodiment Construction

[0019] The present invention uses the entry address of a given website as the initial URL for traversal. Based on a Web page acquisition model, it traverses all links in each Web page that conform to the model and follows those links to expand continuously to the required Web pages. It distinguishes the characteristics of the pages these links point to, filters out noise according to the acquisition model, and then performs multi-level link analysis to extract the content that users care about.

[0020] Before network information is collected, the website entry address is specified and used as the starting URL of the traversal. When the acquisition program encounters a Web page, it analyzes the page according to the acquisition model and adds the relevant links to the link queue; at the same time, it analyzes the content of the page and puts the page into the page library. Program framework such as...
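
The link queue and page library described in [0020] might be sketched as follows. This is an illustrative sketch only, not the patent's program framework (which is truncated above). The names `collect`, `LinkExtractor`, `fetch`, `matches_model`, and the in-memory `SITE` are assumptions made for the example; the acquisition model is reduced to a simple URL predicate, and relative links are resolved against the page being parsed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def collect(start_url, fetch, matches_model):
    """Traverse pages from start_url using an explicit link queue.

    fetch(url) returns HTML text or None, and matches_model(url) decides
    whether a link is relevant; both are hypothetical stand-ins.
    """
    link_queue = deque([start_url])
    seen = {start_url}
    page_library = {}
    while link_queue:
        url = link_queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        page_library[url] = html  # store the page in the page library
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen and matches_model(absolute):
                seen.add(absolute)
                link_queue.append(absolute)  # queue the relevant link
    return page_library


# Toy in-memory site standing in for real HTTP requests (assumption).
SITE = {
    "http://example.com/news/": '<a href="a.html">A</a> <a href="/about.html">About</a>',
    "http://example.com/news/a.html": '<a href="b.html">B</a> <a href="./">Home</a>',
    "http://example.com/news/b.html": "<p>leaf page</p>",
    "http://example.com/about.html": "<p>outside the model</p>",
}

library = collect(
    "http://example.com/news/",
    fetch=SITE.get,
    matches_model=lambda u: u.startswith("http://example.com/news/"),
)
print(sorted(library))
```

In this toy run the three pages under `/news/` end up in the page library, and `/about.html` is filtered out by the predicate. The `seen` set is an addition of this sketch that keeps pages linking back to each other from being queued twice.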


Abstract

The invention relates to an unlimited-layer collection method based on Web pages, comprising the following steps: (1) specify the entry page address StartURL for Web page acquisition; (2) analyze each URL on the page and, if the URL is a relative path, complete it using the entry address StartURL to convert it into an absolute path; and (3) judge whether the entry address StartURL is the superior of the URL: if so, start downward acquisition and keep expanding downward; if not, stop expanding. During acquisition and expansion, for each URL the method cyclically matches and extracts the text in the Web page, searches for links on the page, and extracts and stores the link text and the text of the page each link points to, so that all links of the Web page are traversed and Web pages are acquired to an unlimited number of layers. With this method, multi-level link analysis can be carried out according to user requirements, the content users care about can be extracted, and network information can be collected efficiently.
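
Steps (2) and (3) can be sketched with Python's standard `urllib.parse` module. The function names `to_absolute` and `is_under` are hypothetical, and reading "StartURL is the superior of the URL" as "same scheme and host, with the URL's path inside StartURL's directory" is an assumption of this sketch; the abstract does not define the test precisely.

```python
from urllib.parse import urljoin, urlsplit


def to_absolute(start_url, href):
    """Step (2): complete a relative path using the entry address."""
    return urljoin(start_url, href)


def is_under(start_url, url):
    """Step (3), as assumed here: is start_url the parent of url?

    True when both share scheme and host and url's path lies inside
    the directory of start_url.
    """
    start, target = urlsplit(start_url), urlsplit(url)
    if (start.scheme, start.netloc) != (target.scheme, target.netloc):
        return False
    base_dir = start.path.rsplit("/", 1)[0] + "/"
    return (target.path or "/").startswith(base_dir)


start = "http://example.com/news/index.html"
inside = to_absolute(start, "2009/a.html")
outside = to_absolute(start, "../sports/b.html")

print(inside, is_under(start, inside))    # expand downward
print(outside, is_under(start, outside))  # stop expanding
```

Links for which `is_under` returns true would be expanded further; the others end the expansion along that branch.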

Description

Technical field

[0001] The invention relates to a method for collecting Web pages.

Background technique

[0002] The collection of network information is usually accomplished with the help of search engines. A typical commercial search engine consists of four parts: a searcher, an indexer, a retriever, and a user interface. Generally speaking, a search engine relies on a network robot, a computer program called a Robot. It traverses the Internet starting from the URL of an initial page or site to discover Web page information automatically. When it enters a hypertext page, it uses the HTML markup structure to search for information and to obtain URL links pointing to other hypertext pages, selects the next site to visit through a certain algorithm, and then moves on to that site to continue collecting information. The function of the indexer is to understand the data information searched by the searcher, extract index items from it, and establish an index library for representing data documents and ...


Application Information

IPC(8): H04L29/08, H04L12/54
Inventor: 赵洪宇, 袁青霞, 李闻, 阮振中
Owner: 赵洪宇