General distributed crawler system capable of automatically detecting shielding

An automatic shielding-detection technology for crawler systems, applied in the fields of distributed systems and artificial intelligence, that addresses problems such as the high cost of manual detection.

Inactive Publication Date: 2014-01-01
FUDAN UNIV
Cites: 3. Cited by: 17.

AI Technical Summary

Problems solved by technology

However, when data must be downloaded from a large number of source websites, manually collecting samples of blocked pages for each website is very expensive.



Embodiment Construction

[0067] The present invention will be further elaborated below in conjunction with the accompanying drawings and embodiments.

[0068] In this embodiment, the original system is implemented in C# on the .NET Framework 4.0. The recommended cluster size for stable operation is up to 100 machines, although more machines can theoretically be supported. The system can run on clusters running Windows XP or later, or on Linux clusters with Mono 3.0 or later installed. One machine in the cluster is designated as the core node, which controls the operation of the entire cluster. The machines in the cluster do not have to be on the same LAN, as long as they can communicate with each other. Extension modules may require the C++ runtime library or the Java runtime library. There are no special requirements for the hardware configuration of the machines, and the configurations may differ from one machine to another.

[0069] As shown in Figure 4, the system consists of two executable programs, Master and Slave. Each exe...
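The Master/Slave split can be illustrated with a small sketch. It is not the patented implementation (which is written in C#); it only shows how a core node might use each Slave's shielding status, as the abstract describes, when handing out download tasks. The class names, the per-site granularity of the blocked status, and the shortest-queue policy are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlparse


@dataclass
class Slave:
    """Hypothetical worker record kept by the core node."""
    name: str
    blocked_sites: set = field(default_factory=set)  # sites that shield this Slave
    queue: list = field(default_factory=list)        # URLs assigned so far


class Master:
    """Hypothetical core node of the star-type network."""

    def __init__(self, slaves):
        self.slaves = list(slaves)

    def report_blocked(self, slave_name: str, site: str) -> None:
        # Record that a Slave has been detected as shielded by a site.
        for slave in self.slaves:
            if slave.name == slave_name:
                slave.blocked_sites.add(site)

    def assign(self, url: str) -> Optional[Slave]:
        # Assumed policy: pick the unblocked Slave with the shortest queue.
        site = urlparse(url).netloc
        candidates = [s for s in self.slaves if site not in s.blocked_sites]
        if not candidates:
            return None  # every Slave is shielded for this site
        chosen = min(candidates, key=lambda s: len(s.queue))
        chosen.queue.append(url)
        return chosen
```

A task for a site is never routed to a Slave already known to be shielded by that site, which is one simple way to avoid wasting Slave and network resources.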


Abstract

The invention belongs to the technical field of distributed systems and artificial intelligence, and in particular relates to a general distributed crawler system capable of automatically detecting shielding, that is, detecting when a crawler has been blocked by a target website. The system has a star-type network structure and comprises a core node, the Master, and a plurality of Slaves. The Master controls the Slaves in the cluster. The system uses a fully automatic algorithm for detecting shielded pages: it examines the size of the pages downloaded by the crawlers and the randomness of the distribution of token edit distances to detect abnormal situations, and thereby automatically determines whether a page just obtained is valid data. The system can also automatically determine whether each Slave in the cluster is currently shielded, which allows better task scheduling and fuller use of Slave and network resources.
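The two signals named in the abstract, page size and the distribution of token edit distances, can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented algorithm: whitespace tokenization, comparing consecutive pages, using the coefficient of variation as a stand-in for "randomness", and the thresholds `min_size` and `min_spread` are all hypothetical choices.

```python
from statistics import mean, pstdev


def tokenize(page: str) -> list:
    # Assumption: whitespace tokens; the source does not define "Token".
    return page.split()


def token_edit_distance(a: list, b: list) -> int:
    # Levenshtein distance computed over token sequences.
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, 1):
        cur = [i]
        for j, tok_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                     # delete tok_a
                           cur[j - 1] + 1,                  # insert tok_b
                           prev[j - 1] + (tok_a != tok_b)))  # substitute
        prev = cur
    return prev[-1]


def looks_blocked(pages: list, min_size: int = 200, min_spread: float = 0.1) -> bool:
    """Heuristic guess at whether recent pages from one site are block pages.

    min_size and min_spread are hypothetical thresholds, not from the source.
    """
    if len(pages) < 3:
        return False  # too few pages to judge a distribution
    # Signal 1: pages that are abnormally small on average.
    if mean(len(p) for p in pages) < min_size:
        return True
    # Signal 2: edit distances between consecutive pages show little variation.
    tokens = [tokenize(p) for p in pages]
    dists = [token_edit_distance(tokens[i], tokens[i + 1])
             for i in range(len(tokens) - 1)]
    avg = mean(dists)
    if avg == 0:
        return True  # every page is identical to the next
    return pstdev(dists) / avg < min_spread
```

The intuition is that genuine pages differ from one another by varying amounts, while a site that shields a crawler tends to return the same short notice again and again, so the distances collapse toward a single value.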

Description

Technical field

[0001] The invention belongs to the technical field of distributed systems and artificial intelligence, and in particular relates to a general distributed crawler system capable of automatically detecting shielding.

Background

[0002] A crawler is a program that automatically browses and downloads data from the Internet. It is widely used by major Internet companies and data analysis departments as an extremely important source of data. In general, a single machine is far from sufficient to obtain the variety of information available on the Internet. Most crawlers therefore run on clusters (that is, multiple computers) and download the required information from the Internet in parallel through different network egress points.

[0003] Because requirements keep changing, crawling is not a simple problem. Crawling tasks often require searching and downloading on the Internet according to a certain strategy, and there are different downloading and analysis methods for diffe...


Application Information

IPC(8): H04L29/08, H04L12/44, G06F9/46
Inventors: 肖仰华, 梁家卿, 汪卫
Owner: FUDAN UNIV