Distributed internet data acquisition system and method based on event-driven model

An event-driven data acquisition system, applied in the field of network search, that addresses shortcomings of existing collectors such as the lack of a rapid scale-down mechanism, high technical requirements for users, and no support for a visual user interface.

Active Publication Date: 2019-10-18
北京熵简科技有限公司

AI Technical Summary

Problems solved by technology

[0005] 1. The train collector is client software, so it is not suitable for distributed deployment.
[0006] 2. The performance of the train collector is limited by the performance of the physical machine on which the client runs.
[0007] 3. Because of points 1 and 2, the train collector system cannot meet the needs of real-time, large-scale data collection.
[0009] 1. Although the Archer collection system is a distributed crawler system, expanding it requires the system to be specially deployed and configured on each newly added node machine, so expansion is costly and cumbersome.
[0010] 2. There is no rapid scale-down mechanism.
[0011] 3. Each machine node is relatively independent, so the operation and maintenance burden is high.
[0012] 4. It does not support a visual user interface and places high technical requirements on users.
[0015] 1. It is only suitable for writing small-scale crawler projects. For large-scale crawling tasks, the code logic has to be adjusted at the lowest level of the system.
[0016] 2. It mainly provides code execution management and lacks support for distributed scheduling.
[0017] 3. It does not support a visual user interface and places high technical requirements on users.

Examples

Embodiment 1

[0080] As shown in Figure 4, a distributed Internet data acquisition system based on an event-driven model includes a console module, a data acquisition engine module, a data storage module, and a log service module.

[0081] The entire system runs on a container orchestration engine.

[0082] The console module configures data collection: it configures crawler scheduling and parsing rules, triggers events such as starting and stopping crawlers, and completes the related configuration of data storage.

[0083] The data acquisition engine module completes data acquisition according to the configuration of the console module; the data acquisition engine module captures and parses relevant web pages from the corresponding website according to the rules configured by the user, and outputs structured data and parsed pages.
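
The interaction between the console module and the data acquisition engine module described above can be pictured as a small publish/subscribe loop. The following is a minimal sketch under stated assumptions, not the patented implementation: the names EventBus and AcquisitionEngine and the event types "crawler.start" and "crawler.stop" are hypothetical, and an in-process dictionary stands in for whatever messaging layer the real system uses.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-process stand-in for the system's event channel."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Deliver the event to every handler registered for this type.
        for handler in self._handlers.get(event_type, []):
            handler(payload)


class AcquisitionEngine:
    """Tracks which crawlers are running, driven purely by events."""

    def __init__(self, bus):
        self.running = {}  # crawler_id -> user-configured rules
        bus.subscribe("crawler.start", self.on_start)
        bus.subscribe("crawler.stop", self.on_stop)

    def on_start(self, payload):
        self.running[payload["crawler_id"]] = payload["rules"]

    def on_stop(self, payload):
        self.running.pop(payload["crawler_id"], None)


bus = EventBus()
engine = AcquisitionEngine(bus)
# The console side only publishes events; it never calls the engine directly.
bus.publish("crawler.start", {"crawler_id": "news-1", "rules": {"title": "h1"}})
bus.publish("crawler.stop", {"crawler_id": "news-1"})
```

Because the console only publishes events, the engine could in principle be replaced or replicated without changing console code, which is the decoupling the embodiment relies on.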

[0084] The data storage module is connected with the data acquisition engine module, and completes the data storage according to the ...

Embodiment 2

[0092] As shown in Figure 5, on the basis of Embodiment 1, the data acquisition engine module includes a scheduling component, a download analysis component, and a data verification component. The scheduling component cooperates with the download analysis component to complete data collection; the download analysis component cooperates with the data verification component to verify the collected data.

[0093] The scheduling component generates the tasks to be crawled and manages task status. The download analysis component calls various services to download and parse pages efficiently and, depending on the configuration, may continue to download newly discovered links. The data verification component performs a conformity check on the data before it enters the database, in order to improve data quality.
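
As an illustration of the conformity check performed by the data verification component, the sketch below validates a parsed record against a simple field-to-type schema before it would be stored. The function name verify_record, the schema format, and the sample fields are assumptions made for this example; the source does not specify which checks the component applies.

```python
def verify_record(record, schema):
    """Return a list of problems; an empty list means the record conforms.

    `schema` maps each required field name to the Python type expected
    for its value (a hypothetical format chosen for this sketch).
    """
    problems = []
    for field, expected_type in schema.items():
        value = record.get(field)
        if value is None or value == "":
            problems.append(f"missing field: {field}")
        elif not isinstance(value, expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems


schema = {"title": str, "price": float}
good = {"title": "Widget", "price": 9.5}
bad = {"title": "", "price": "9.5"}

good_problems = verify_record(good, schema)
bad_problems = verify_record(bad, schema)
```

Returning the list of problems rather than a bare boolean lets a caller log why a record was rejected, which fits a system that also has a log service module.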

Embodiment 3

[0095] As shown in Figure 6, on the basis of Embodiment 2, the scheduling component includes a crawler scheduling service and a link scheduling service.

[0096] A scheduling event message queue is provided between the crawler scheduling service and the link scheduling service; the link scheduling service is also connected to the crawling event message queue.

[0097] When the crawler scheduling service finds that the current time matches the execution time and cycle that the user configured for a project, it uses the task's unique identification number to fetch the crawling task meta-information from the corresponding configuration database. This meta-information includes the target website, data parsing rules and storage fields, data verification method, project execution time and cycle, database configuration, and other information. The service packages the meta-information as a data scheduling event and puts it in the scheduling event message queue.
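
The behavior of the crawler scheduling service in paragraph [0097] can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: the names Schedule and emit_due_events, the dictionary used as the configuration database, the sample task, and the use of Python's in-memory queue.Queue in place of a real message queue are all hypothetical.

```python
import queue
import time
from dataclasses import dataclass


@dataclass
class Schedule:
    next_run_at: float  # epoch seconds of the next execution
    period_s: float     # execution cycle in seconds


def emit_due_events(schedules, config_db, events, now=None):
    """Queue one scheduling event for each task whose execution time has arrived."""
    now = time.time() if now is None else now
    emitted = []
    for task_id, sched in schedules.items():
        if now < sched.next_run_at:
            continue  # not yet due
        # Look up the task meta-information by its unique identifier.
        meta = config_db[task_id]
        events.put({"type": "schedule", "task_id": task_id, "meta": dict(meta)})
        # Advance to the next cycle so the task is not emitted twice.
        sched.next_run_at = now + sched.period_s
        emitted.append(task_id)
    return emitted


# Hypothetical configuration database entry and schedule.
config_db = {
    "task-42": {
        "target_site": "https://example.com",
        "parse_rules": {"title": "h1"},
        "storage_fields": ["title"],
        "verification": "required-fields",
        "db_config": {"table": "pages"},
    }
}
schedules = {"task-42": Schedule(next_run_at=100.0, period_s=3600.0)}
scheduling_events = queue.Queue()
emit_due_events(schedules, config_db, scheduling_events, now=100.0)
```

In this sketch a downstream consumer, playing the role of the link scheduling service, would read events from scheduling_events; passing now explicitly keeps the function easy to test.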

[0098] The l...


Abstract

The invention discloses a distributed Internet data acquisition system and method based on an event-driven model, and relates to the technical field of network search. The system comprises a console module, a data acquisition engine module, a data storage module, and a log service module, and runs on a container orchestration engine. The console module configures data acquisition and data storage; the data acquisition engine module completes data acquisition according to the configuration of the console module; the data storage module is connected with the data acquisition engine module and completes data storage according to the configuration of the console module. Each of the four modules comprises one or more services; the services are decoupled from one another and deployed on the container orchestration engine as independent Docker images. The system can scale its capacity up and down quickly and dynamically, supports daily acquisition of TB-scale data, and supports simultaneous data acquisition from thousands of websites with different sources.

Description

Technical field

[0001] The invention relates to the technical field of network search, in particular to a distributed Internet data collection system and method based on an event-driven model.

Background art

[0002] The rapid development of modern information technology has resulted in explosive growth of the data and information on the Internet. In recent years, the emergence and application of big data have made people further realize the value of Internet data. Internet data is increasingly regarded as digital oil, which can provide underlying information for governments, financial institutions, banks, traditional enterprises, and other institutions. Therefore, for information scattered across the Internet, professional Internet data collection technology (also known as web crawling) is needed to collect this massive data in a timely and large-scale manner.

[0003] The types of data on the Internet are abundant, and the pres...

Application Information

IPC(8): G06F16/951, G06F16/9535, G06F9/48
CPC: G06F9/4881, G06F16/951, G06F16/9535
Inventor: 孔逸飞, 段毅飞, 王亮亮, 薛彦文, 刘博, 李渔
Owner: 北京熵简科技有限公司