Method and devices of intercepting crawler

A crawler-interception technique, applied in the network field, that addresses two problems: inefficient interception of web crawlers, and normal users being mistaken for web crawlers. Its stated effects are high concurrency, reduced pressure, and an improved interception rate.

Active Publication Date: 2017-11-10
BEIJING JINGDONG SHANGKE INFORMATION TECH CO LTD
Cites: 7; Cited by: 6

AI Technical Summary

Problems solved by technology

[0003] In the prior art, to protect access for normal users, some websites intercept web crawlers by filtering on the client IP address or on a specific User-Agent header in the HTTP request. However, when many normal users share the same IP address, those users may be mistaken for web crawlers and filtered out.
On the other hand, under the HTTP protocol specification the value of the User-Agent header can be set arbitrarily, so many web crawlers set their User-Agent header to match that of an ordinary browser and evade filtering. As a result, interception of web crawlers is inefficient.
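The two weaknesses described in [0003] can be illustrated with a minimal sketch. The blocklist entries, the request threshold, and the function names below are hypothetical and chosen only for illustration; they are not taken from the patent.

```python
from collections import Counter

# Hypothetical values, assumed for illustration only.
BLOCKED_AGENT_MARKERS = ("python-requests", "curl", "scrapy")
MAX_REQUESTS_PER_IP = 3


def is_blocked_by_user_agent(headers: dict) -> bool:
    """Return True if the User-Agent header contains a blocklisted marker."""
    agent = headers.get("User-Agent", "").lower()
    return any(marker in agent for marker in BLOCKED_AGENT_MARKERS)


def blocked_ips(request_ips: list) -> set:
    """Return the IP addresses that exceed the per-IP request threshold."""
    counts = Counter(request_ips)
    return {ip for ip, count in counts.items() if count > MAX_REQUESTS_PER_IP}


# A crawler that copies a browser's User-Agent passes the header filter.
honest_crawler = {"User-Agent": "python-requests/2.31"}
spoofing_crawler = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
print(is_blocked_by_user_agent(honest_crawler))    # True
print(is_blocked_by_user_agent(spoofing_crawler))  # False

# Five normal users behind one shared address are blocked together.
shared_ip_requests = ["203.0.113.7"] * 5 + ["198.51.100.2"]
print(blocked_ips(shared_ip_requests))  # {'203.0.113.7'}
```

The header filter misses any crawler that sets a browser-like User-Agent, while the IP filter penalizes every user behind a shared address.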

Examples

Embodiment 1

[0036] Embodiment 1. In one embodiment:

[0037] 1) The browser sends an HTTP request to the server, requesting the first page of the current category;

[0038] The server generates an image URL path containing the cookie value and saves it in the first page;

[0039] The server presets pages 1-10 as the range of pages that may be accessed directly. The server determines that the first page falls within this range, so it returns the first page, including the image URL path, to the browser;

[0040] The browser automatically downloads the image from the image URL path contained in the returned first page of the current category, parses the image with a JS method, extracts the cookie value, and saves it. The browser carries this cookie value on subsequent page turns.
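Step 1) can be sketched roughly as follows. This is a simplified illustration, not the patent's implementation: it assumes the cookie value is carried directly in the image URL path and is read back from the page markup, whereas the source describes the browser downloading the image and parsing it with a JS method. The `/img/` path layout and all function names are hypothetical.

```python
import re
import secrets


def new_cookie_value() -> str:
    """Generate a random value (32 hex characters) for this response."""
    return secrets.token_hex(16)


def build_image_url_path(cookie_value: str) -> str:
    """Build an image URL path that contains the cookie value (assumed layout)."""
    return f"/img/{cookie_value}.png"


def render_page(page_number: int, cookie_value: str) -> str:
    """Save the image URL path into the page returned to the browser."""
    image_path = build_image_url_path(cookie_value)
    return (
        f"<html><body><h1>Page {page_number}</h1>"
        f'<img src="{image_path}"></body></html>'
    )


def extract_cookie_value(html: str):
    """Stand-in for the browser side: recover the value from the image URL path."""
    match = re.search(r'<img src="/img/([0-9a-f]{32})\.png"', html)
    return match.group(1) if match else None


cookie_value = new_cookie_value()
page = render_page(1, cookie_value)
# The value recovered on the client side matches the one the server issued.
print(extract_cookie_value(page) == cookie_value)
```

A client that never loads and processes the image never obtains the value, which is the property the later page-turn check relies on.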

[0041] 2) The browser sends an HTTP request carrying a cookie value to the server, requesting page 10 of the current category;

[0042] The serv...

Embodiment 2

[0050] Embodiment 2. In another embodiment:

[0051] If the browser receives a link to page 10 of the category, then:

[0052] The browser sends an HTTP request to the server, requesting page 10 of the current category;

[0053] The server generates an image URL path containing the cookie value and saves it in page 10;

[0054] The server presets pages 1-10 as the range of pages that may be accessed directly, and determines that page 10 falls within this range. Therefore, although the HTTP request does not contain a cookie value at this point, the server returns page 10, including the image URL path, directly to the browser.

[0055] The browser automatically downloads the image from the image URL path contained in the returned page 10 of the current category; parses the image with a JS method, extracts the cookie value, and saves it; carries the cookie value when turning the pa...

Embodiment 3

[0056] Embodiment 3. In another embodiment:

[0057] If the browser receives a link to page 11 of the category, then:

[0058] The browser sends an HTTP request to the server, requesting page 11 of the current category;

[0059] The server generates an image URL path containing the cookie value and saves it in page 11;

[0060] The server determines that page 11 does not fall within the direct access range, so it further checks whether the HTTP request contains a cookie value. Because the link was received directly by the browser, the HTTP request does not contain a cookie value. The server therefore returns the first page of the current category to the browser.

[0061] To continue visiting other pages, the user can repeat the operations in Embodiment 1 to browse pages normally.
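The server-side decision shared by the three embodiments can be sketched as a single function. The direct access range of pages 1-10 comes from the embodiments; the function name and the use of an in-memory set to track issued cookie values are assumptions made for illustration.

```python
# Pages 1-10 may be accessed directly, as in the embodiments.
DIRECT_ACCESS_PAGES = range(1, 11)


def page_to_return(requested_page: int, cookie_value, issued_values: set) -> int:
    """Return the page number the server sends back for a request.

    issued_values is a hypothetical store of cookie values the server has
    handed out; the source does not say how validity is tracked.
    """
    if requested_page in DIRECT_ACCESS_PAGES:
        return requested_page  # direct access range: no cookie needed
    if cookie_value is not None and cookie_value in issued_values:
        return requested_page  # valid cookie value: treated as a normal user
    return 1  # missing or invalid cookie value: fall back to the first page


issued = {"abc123"}  # hypothetical value issued with an earlier page
print(page_to_return(1, None, issued))       # 1  (Embodiment 1, step 1)
print(page_to_return(10, None, issued))      # 10 (Embodiment 2)
print(page_to_return(11, None, issued))      # 1  (Embodiment 3)
print(page_to_return(11, "abc123", issued))  # 11 (page turn with a cookie value)
```

Returning the first page rather than an error means a normal user who followed a deep link can still start browsing and pick up a cookie value, as paragraph [0061] describes.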


Abstract

The invention discloses a method and devices for intercepting a crawler. In the method, after receiving a client's access request for a page, the server generates a field value currently used for identifying crawlers, generates a picture attribute value that saves the field value into a picture, and saves a picture uniform resource locator (URL) path containing the picture attribute value into the requested page. The server then judges whether the requested page belongs to the pages that may be accessed directly. If so, it returns the requested page to the client. Otherwise, it further judges whether the access request contains a valid field value used for identifying crawlers: if it does, the server returns the requested page to the client; if the field value is missing or invalid, the server determines that the access request was sent by a crawler and returns the first page of the category of the requested page to the client. With this method, crawler access can be effectively intercepted.
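The abstract distinguishes valid from invalid field values but, in the text available here, does not say how validity is decided. One possible approach, shown purely as an assumption, is a signed and time-limited value: the server signs an issue timestamp with a secret key and later accepts only values whose signature matches and whose age is within a limit. The key, the age limit, and the function names are hypothetical.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"example-secret"  # hypothetical server-side key
MAX_AGE_SECONDS = 600           # assumed validity window


def _sign(issued_text: str) -> str:
    """Compute an HMAC-SHA256 signature over the issue timestamp."""
    return hmac.new(SECRET_KEY, issued_text.encode(), hashlib.sha256).hexdigest()


def issue_field_value(now=None) -> str:
    """Create a field value of the form '<issue time>.<signature>'."""
    issued_text = str(int(time.time() if now is None else now))
    return f"{issued_text}.{_sign(issued_text)}"


def is_valid_field_value(value, now=None) -> bool:
    """Accept only well-formed, correctly signed, unexpired values."""
    if not value or "." not in value:
        return False
    issued_text, signature = value.split(".", 1)
    if not (issued_text.isascii() and issued_text.isdigit()):
        return False
    # Constant-time comparison on bytes avoids leaking signature prefixes.
    if not hmac.compare_digest(signature.encode(), _sign(issued_text).encode()):
        return False
    current = time.time() if now is None else now
    return 0 <= current - int(issued_text) <= MAX_AGE_SECONDS


value = issue_field_value(now=1000)
print(is_valid_field_value(value, now=1100))  # True: signed and fresh
print(is_valid_field_value(value, now=2000))  # False: older than the window
print(is_valid_field_value("1000.forged"))    # False: bad signature
```

A signed value avoids keeping per-client state on the server, which may suit the high-concurrency goal stated in the summary, though the patent may use a different mechanism.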

Description

Technical field

[0001] The invention relates to network technology, in particular to a method and device for intercepting web crawlers.

Background

[0002] Web crawlers are a fundamental part of search engine technology. A web crawler starts from the URL (Uniform Resource Locator) of one or several initial webpages and obtains the URLs on those pages. It keeps extracting new URLs from the pages it fetches and putting them in a queue until some stopping condition is met, then stores the captured webpage information on the search engine's server.

[0003] In the prior art, in order to ensure the access of normal users, some websites adopt the method of filtering the client IP, or the method of filtering the specific User-Agent header of the HTTP request to intercept the access from the web crawler. Under normal circumstances, when many normal users share the same IP, these normal users will be mistaken for web crawlers and filtered out. On t...
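The crawl loop described in [0002] can be sketched as a queue-driven traversal. To stay self-contained, the example walks a small hypothetical in-memory link graph instead of fetching real pages; the page limit stands in for the unspecified stopping condition.

```python
from collections import deque

# Hypothetical site: each URL maps to the URLs found on that page.
LINKS = {
    "/a": ["/b", "/c"],
    "/b": ["/c", "/d"],
    "/c": ["/a"],
    "/d": [],
}


def crawl(seed_urls: list, max_pages: int) -> list:
    """Visit pages breadth-first from the seeds until the page limit is hit."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # stands in for storing the captured page
        for link in LINKS.get(url, []):
            if link not in seen:  # queue each newly discovered URL once
                seen.add(link)
                queue.append(link)
    return visited


print(crawl(["/a"], max_pages=10))  # ['/a', '/b', '/c', '/d']
print(crawl(["/a"], max_pages=2))   # ['/a', '/b']
```

Because such a loop follows links mechanically and typically does not process embedded images the way a browser does, it is the kind of client the interception method is designed to catch.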


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/951, G06F16/9566, G06F16/00
Inventors: 王向维, 韩笑跃, 王飞, 谢刚, 费艳茹, 韩勇, 马顺风
Owner: BEIJING JINGDONG SHANGKE INFORMATION TECH CO LTD