
Network crawling method, terminal and storage medium

A network crawling technology applied in the field of web crawlers, which solves the problem that the same proxy IP is limited in crawling count or frequency, and achieves the effect of avoiding waste.

Active Publication Date: 2018-09-18
PING AN TECH (SHENZHEN) CO LTD

AI Technical Summary

Problems solved by technology

[0005] In view of the above, it is necessary to propose a web crawling method, terminal and storage medium that construct a proxy IP pool and select proxy IPs from the pool according to preset selection rules or strategies for crawling, effectively solving the problem that the same proxy IP is limited in crawling count or frequency.

Method used

Figure 1 is a flow chart of the web crawler method provided by Embodiment 1 of the present invention; Figure 2 is a flow chart of the web crawler method provided by Embodiment 2; Figure 3 is a functional block diagram of a preferred embodiment of the web crawler device.


Examples


Embodiment 1

[0053] Figure 1 is a flow chart of the web crawler method provided by Embodiment 1 of the present invention. According to different requirements, the execution order of the steps in the flow chart may be changed, and some steps may be omitted.

[0054] 101: Store multiple proxy IPs acquired at preset time intervals in a preset proxy IP pool.

[0055] In this embodiment, a proxy IP pool is preset in the local database, and the multiple acquired proxy IPs are added to the proxy IP pool for use by the crawler. Proxy IPs can be found on proxy IP websites on the Internet, and the specific list can be collected manually or automatically by a separate small crawler. It is also possible to purchase multiple proxy IPs from a third-party service provider and add the obtained proxy IPs to the preset proxy IP pool.

[0056] In this embodiment, the proxy information of a proxy IP may include, but is not limited to, the IP address, name and port.
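To make the pool-construction step concrete, the following is a minimal Python sketch of paragraphs [0054]-[0056]: proxy information (IP address, name, port) is modelled as a small record, and proxies acquired at a preset time interval are added to a local pool. The provider function, refresh interval and sample address are illustrative assumptions, not taken from the patent.

```python
import time
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProxyInfo:
    # Proxy information per [0056]: IP address, name and port.
    ip: str
    name: str
    port: int

@dataclass
class ProxyPool:
    proxies: List[ProxyInfo] = field(default_factory=list)

    def add(self, proxy: ProxyInfo) -> None:
        # Skip proxies that are already in the pool.
        if not any(p.ip == proxy.ip and p.port == proxy.port for p in self.proxies):
            self.proxies.append(proxy)

def fetch_proxies_from_provider() -> List[ProxyInfo]:
    # Placeholder for either acquisition channel mentioned in [0055]: scraping a
    # public proxy-list site with a small crawler, or calling a paid third-party API.
    return [ProxyInfo(ip="203.0.113.10", name="example-provider", port=8080)]

def refresh_pool(pool: ProxyPool, interval_seconds: int = 600) -> None:
    # Step 101 / [0054]: acquire proxy IPs at a preset time interval and
    # store them in the preset proxy IP pool.
    while True:
        for proxy in fetch_proxies_from_provider():
            pool.add(proxy)
        time.sleep(interval_seconds)
```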

[0057] In this embodiment, it is possi...

Embodiment 2

[0075] Figure 2 is a flow chart of the web crawler method provided by Embodiment 2 of the present invention. According to different requirements, the execution order of the steps in the flow chart may be changed, and some steps may be omitted.

[0076] 201: Store multiple proxy IPs acquired at preset time intervals in a preset proxy IP pool.

[0077] Step 201 in this embodiment is the same as step 101 in Embodiment 1, and will not be described in detail here.

[0078] 202: Verify each proxy IP in the proxy IP pool one by one, and judge whether the obtained proxy IP has the first validity.

[0079] In this embodiment, the proxy IP undergoing the first validity verification is referred to as the proxy IP to be verified. The proxy IP to be verified is used to access a search engine (e.g., Google, Baidu) to verify whether a response from the search engine is obtained. If a response from the search engine is obtained, it indicates that the proxy IP to be verified has the first validity...
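A hedged sketch of this first-validity check follows, assuming a requests-based HTTP client, a Baidu test URL and a 5-second timeout; none of these specifics are prescribed by the patent, which only requires accessing a search engine through the proxy and checking for a response.

```python
import requests

def has_first_validity(ip: str, port: int,
                       test_url: str = "https://www.baidu.com",
                       timeout: float = 5.0) -> bool:
    # Route a request to a search engine through the proxy IP to be verified.
    proxy_url = f"http://{ip}:{port}"
    try:
        response = requests.get(test_url,
                                proxies={"http": proxy_url, "https": proxy_url},
                                timeout=timeout)
        # Obtaining a response indicates the proxy IP has the first validity.
        return response.ok
    except requests.RequestException:
        # Timeout or connection error: no response from the search engine.
        return False
```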

Embodiment 3

[0137] Figure 3 is a functional block diagram of a preferred embodiment of the web crawler device of the present invention.

[0138] In some embodiments, the web crawler device 30 runs in a terminal. The web crawler device 30 may include a plurality of functional modules composed of program code segments. The program code of each segment in the web crawler device 30 can be stored in a memory and executed by at least one processor to perform the web crawling method (see Figure 1 and its related description for details).

[0139] In this embodiment, the web crawler device 30 of the terminal can be divided into multiple functional modules according to the functions it performs. The functional modules may include: a storage module 301, a judging module 302, a recording module 303, a selection module 304 and a crawling module 305. A module referred to in the present invention is a series of computer program segments that can be executed by at least one processor...
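As an illustration only, the five modules of [0139] could be organised as methods of a single device class. The names below mirror the module names in the text, while the bodies are placeholders rather than the patent's actual program segments.

```python
class WebCrawlerDevice:
    """Sketch of the functional split in [0139]; not the patented implementation."""

    def __init__(self):
        self.white_list = []   # proxy IPs judged valid
        self.black_list = []   # proxy IPs judged invalid
        self.current_proxy = None

    def storage_module(self, proxies):
        # Module 301: store acquired proxy IPs in the proxy IP pool.
        self.unverified = list(proxies)

    def judging_module(self, proxy) -> bool:
        # Module 302: judge proxy validity (e.g. the first-validity check above).
        raise NotImplementedError

    def recording_module(self, proxy, valid: bool):
        # Module 303: record the proxy in the white list or the black list.
        (self.white_list if valid else self.black_list).append(proxy)

    def selection_module(self):
        # Module 304: select a proxy IP from the white list.
        return self.white_list[0] if self.white_list else None

    def crawling_module(self, url):
        # Module 305: crawl data through the currently selected proxy.
        raise NotImplementedError
```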


Abstract

The invention discloses a network crawling method, comprising the following steps: storing multiple proxy IPs acquired at a preset time interval in a preset proxy IP pool; verifying each proxy IP in the proxy IP pool one by one and judging the validity of the acquired proxy IPs; recording the proxy IPs determined to be valid in a white list in the proxy IP pool, and recording the proxy IPs determined to be invalid in a black list in the proxy IP pool; when it is detected that the current proxy IP satisfies a preset proxy substitution condition, selecting one proxy IP from the white list in the proxy IP pool; and taking the selected proxy IP as a new proxy IP and performing data crawling. The invention also provides a terminal and a storage medium. With the method, terminal and storage medium provided by the invention, the problem that the same proxy IP is restricted when crawling large amounts of data quickly, frequently and over a long period can be effectively solved.
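Read as pseudocode, the abstract describes the control flow sketched below. In this sketch the "preset proxy substitution condition" is assumed to be a fixed per-proxy request budget, and the is_valid and fetch callables stand in for the validity check and the actual crawling request; both are illustrative assumptions rather than the claimed method.

```python
def crawl_with_proxy_pool(urls, proxy_pool, is_valid, fetch,
                          max_requests_per_proxy: int = 50):
    # Verify each proxy one by one; valid proxies go to the white list,
    # invalid proxies to the black list.
    white_list = [p for p in proxy_pool if is_valid(p)]
    black_list = [p for p in proxy_pool if not is_valid(p)]

    current_proxy, used = None, 0
    results = []
    for url in urls:
        # Preset proxy substitution condition (assumed here: no proxy selected
        # yet, or the per-proxy request budget is exhausted).
        if current_proxy is None or used >= max_requests_per_proxy:
            if not white_list:
                raise RuntimeError("white list exhausted: no valid proxy left")
            current_proxy, used = white_list.pop(0), 0
        results.append(fetch(url, current_proxy))
        used += 1
    return results, black_list
```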

Description

Technical Field

[0001] The invention relates to the technical field of web crawlers, in particular to a web crawler method, a terminal and a storage medium.

Background Technique

[0002] The web crawler is a very important part of a search engine system. It is responsible for collecting web pages and information from the Internet; this web page information is used to build the index that supports the search engine, so its performance directly affects the effectiveness of the search engine. As the amount of network information grows geometrically, the requirements for the performance and efficiency of web crawler page collection are also getting higher and higher.

[0003] We always hope to obtain more data in a shorter period of time, but this places a very high load on the website, and it also brings problems such as increased network traffic and leakage of private data. Many websites therefore use crawler detection technology, which analyzes the web access...

Claims


Application Information

Patent Type & Authority Applications(China)
IPC IPC(8): H04L29/06H04L12/24G06F17/30
CPCH04L41/5009H04L63/0876H04L63/101
Inventor: 阮晓雯, 徐亮, 肖京
Owner: PING AN TECH (SHENZHEN) CO LTD