Method and system for realizing streaming crawler

An implementation method and crawler technology, applied in the field of web crawlers, that addresses problems such as low CPU utilization, wasted CPU time, and high memory consumption, and achieves simpler crawler development, higher crawling efficiency, and higher resource utilization.

Pending Publication Date: 2021-08-24
NANJING UNIV
Cites: 0, Cited by: 2

AI Technical Summary

Problems solved by technology

Under heavy load, however, this mode (one thread per crawler request) consumes a large amount of memory, because each crawler request maintains its own thread stack.
Continuous context switching also wastes a great deal of CPU time, so CPU resources cannot be fully used.
[0005] In practice, beyond the low CPU utilization caused by heavy network I/O, crawlers face various other problems, such as login verification, exception handling, customized data parsing, and switching between crawler configurations.
These problems greatly reduce the efficiency of crawler development as well as the efficiency and quality of data crawling.
However, current crawler frameworks rarely address both crawler development efficiency and crawler runtime efficiency.
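
The memory and context-switching cost described above comes from dedicating one thread to each request. As a minimal sketch of the alternative, the following Python example keeps many simulated requests in flight on a single thread. The `fetch`, `crawl_all`, and `run_demo` names and the `example.invalid` URLs are hypothetical, and `asyncio.sleep` stands in for real network I/O; this is an illustration, not the patent's implementation.

```python
import asyncio
import threading


async def fetch(url: str) -> tuple[str, int]:
    """Stand-in for a network request; asyncio.sleep simulates I/O wait."""
    await asyncio.sleep(0.01)  # the event loop serves other requests meanwhile
    return url, threading.get_ident()  # record which thread handled the request


async def crawl_all(urls: list[str]) -> list[tuple[str, int]]:
    # All requests are in flight at once, yet no thread is created per request.
    return await asyncio.gather(*(fetch(u) for u in urls))


def run_demo(n: int = 100) -> list[tuple[str, int]]:
    # Hypothetical URLs; nothing is actually downloaded.
    urls = [f"https://example.invalid/page/{i}" for i in range(n)]
    return asyncio.run(crawl_all(urls))
```

Calling `run_demo(50)` returns 50 results that all carry the same thread identifier, so only one thread stack is kept for all of the requests.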

Embodiment Construction

[0039] The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments only illustrate the invention and do not limit its scope. After reading this disclosure, modifications in various equivalent forms made by those skilled in the art all fall within the scope defined by the appended claims of the present application.

[0040] A method for implementing a streaming crawler based on reactive programming includes the following steps:

[0041] Step 1. Construct the data object model and the initial request data stream: map the website structure onto a hierarchical tree template, then build the mapping between the website data structure and the object model. Generate the asynchronous request data stream from this mapping, and co...
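
Step 1 can be pictured with a small Python sketch, given here under stated assumptions rather than as the patent's actual design. The `Node` and `Request` classes, the `request_stream` and `collect` functions, and the URL patterns are all hypothetical. A tree of template nodes mirrors the site hierarchy, each node names the data object it maps to, and an asynchronous generator walks the tree to emit the initial request stream.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator


@dataclass
class Node:
    """One level of the site hierarchy (hypothetical template node)."""
    name: str           # data object this level maps to
    url_template: str   # URL pattern for pages at this level
    children: list["Node"] = field(default_factory=list)


@dataclass
class Request:
    url: str
    model: str  # name of the data object the response will populate


async def request_stream(root: Node, base: str) -> AsyncIterator[Request]:
    """Walk the tree depth-first and emit one request per node."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield Request(url=base + node.url_template, model=node.name)
        stack.extend(reversed(node.children))  # preserve declared child order
        await asyncio.sleep(0)  # hand control back to the event loop


async def collect(root: Node, base: str) -> list[Request]:
    return [req async for req in request_stream(root, base)]
```

For a tree such as Forum > (Thread > Post, User), `asyncio.run(collect(site, base))` yields requests in the order Forum, Thread, Post, User.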


Abstract

The invention discloses a method and a system for implementing a streaming crawler. The method comprises the following steps: constructing an initial request data stream and configuring the construction of the crawler's data stream transformation graph; constructing the request data stream from the object model, using the mapping among the hierarchical tree model, the website structure, and the data object model, and configuring the request data stream to bypass the website's anti-crawling strategy; and having crawler components apply transformation operations to the data streams, building a transformation graph that runs from the request data stream to the downloaded page data stream and then to the result data stream. The method adopts reactive programming: it constructs the data stream transformation graph of the whole crawler, produces an asynchronous streaming crawl model based on reactive programming, and handles the blocking operations of the crawl asynchronously. Compared with traditional crawler schemes, it improves development efficiency, system throughput, and resource utilization, and has high application value.
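
The chain from request data stream to page data stream to result data stream can be sketched with Python asynchronous generators. This is only an illustration of how the stages compose. `FAKE_WEB`, `make_requests`, `download`, `parse`, and `run` are hypothetical names, the in-memory dictionary replaces real downloads, and the stages here pull items one at a time rather than fetching concurrently.

```python
import asyncio
from typing import AsyncIterator

FAKE_WEB = {  # hypothetical in-memory "site" so the sketch needs no network
    "/a": "<title>Alpha</title>",
    "/b": "<title>Beta</title>",
}


async def make_requests(urls: list[str]) -> AsyncIterator[str]:
    """Source stage: the request stream."""
    for url in urls:
        yield url


async def download(reqs: AsyncIterator[str]) -> AsyncIterator[tuple[str, str]]:
    """Request stream -> page stream."""
    async for url in reqs:
        await asyncio.sleep(0)  # placeholder for non-blocking network I/O
        yield url, FAKE_WEB.get(url, "")


async def parse(pages: AsyncIterator[tuple[str, str]]) -> AsyncIterator[dict]:
    """Page stream -> result stream."""
    async for url, html in pages:
        title = html.removeprefix("<title>").removesuffix("</title>")
        yield {"url": url, "title": title}


async def run(urls: list[str]) -> list[dict]:
    # Compose the stages into one pipeline and drain the result stream.
    return [item async for item in parse(download(make_requests(urls)))]
```

Each stage only consumes the stream before it, so a further component (for example a login or retry step) could be inserted between two stages without changing the others.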

Description

Technical field

[0001] The invention relates to a streaming crawler implementation method and system, and belongs to the technical field of web crawlers.

Background technique

[0002] With the explosive growth of network data, the demand for personalized data search in various fields has also increased, and obtaining useful information from the network has become an important task in many fields. To acquire network data at scale, crawler technology is commonly used to crawl web pages, analyze the structure of specific pages, and finally obtain the required structured or semi-structured data.

[0003] The general crawling method is to start from an initial URL set and download the page corresponding to each URL, then analyze the page content to extract the relevant page elements and the links on the page. The extracted page elements are the required data. The page links are then filtered to obtain the links th...
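
The general crawling procedure in paragraph [0003] can be sketched in Python as a breadth-first loop. This is a plain sequential illustration, not the streaming method of the invention. The `crawl` function, its `fetch` and `keep` parameters, the regular expressions, and the `PAGES` dictionary are all hypothetical, and the link filter is an assumed predicate because the source text is cut off at that point.

```python
import re
from collections import deque
from typing import Callable

PAGES = {  # hypothetical in-memory pages standing in for downloads
    "/": '<h1>Home</h1><a href="/a"></a><a href="/ext"></a>',
    "/a": '<h1>A</h1><a href="/"></a><a href="/b"></a>',
    "/b": "<h1>B</h1>",
}


def crawl(
    seeds: list[str],
    fetch: Callable[[str], str],
    keep: Callable[[str], bool],
) -> dict[str, list[str]]:
    """Start from the seed URLs, download, extract elements and links, filter."""
    seen = set(seeds)
    queue = deque(seeds)
    data: dict[str, list[str]] = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)  # download the page for this URL
        data[url] = re.findall(r"<h1>(.*?)</h1>", html)  # extracted elements
        for link in re.findall(r'href="(.*?)"', html):   # extracted links
            if link not in seen and keep(link):           # assumed link filter
                seen.add(link)
                queue.append(link)
    return data
```

With `fetch=lambda u: PAGES.get(u, "")` and `keep=lambda u: u in PAGES`, a crawl seeded with `"/"` visits `/`, `/a`, and `/b` once each and skips `/ext`.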


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/951, G06F16/955
CPC: G06F16/951, G06F16/9566
Inventor: 曹春, 马晓星, 何城贤, 徐经纬
Owner: NANJING UNIV