An Object Storage Based Crawler Network Path Tracing Method

A technique combining object storage and network path tracing, applied in the field of path tracing research in software engineering. It addresses the problem of heavy disk I/O load, improves I/O efficiency, decouples the systems involved, and preserves retrieval efficiency.

Active Publication Date: 2020-06-09
广州探迹科技有限公司

AI Technical Summary

Problems solved by technology

The existing approach places a heavy I/O load on the disk because it must read from and write to the disk frequently. In addition, the same data is still jointly maintained by two systems, so read-write conflicts cannot be fundamentally avoided.

Embodiment

[0022] As shown in Figure 1, in this embodiment a crawler network path tracing method based on object storage includes the following steps:

[0023] 1. Deploy the object storage system and log processor.

[0024] The object storage system is a file storage system based on the HBase distributed file system and can store PT-level files. Files on the object storage system are deleted (DELETE), created (POST), and rewritten (PUT) by calling its HTTP interface with the corresponding parameters. Note that the object storage system of the present invention provides only file deletion, creation, and rewriting; it does not support incremental (append) writes to files.
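The mapping from the three supported operations to HTTP methods can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the endpoint `http://object-store.example/files`, the operation names, and the helper `build_request` are all assumptions, since the text does not give the URL layout or parameters of the real interface. The sketch only builds the request objects and rejects anything other than create, rewrite, and delete, which mirrors the absence of incremental writes.

```python
import urllib.request
from typing import Optional
from urllib.parse import quote

# Hypothetical endpoint; the source does not describe the real URL layout.
BASE_URL = "http://object-store.example/files"

# The three operations the text says the store supports, and their HTTP methods.
METHODS = {"create": "POST", "rewrite": "PUT", "delete": "DELETE"}


def build_request(op: str, key: str, body: Optional[bytes] = None) -> urllib.request.Request:
    """Build (but do not send) the HTTP request for one file operation."""
    if op not in METHODS:
        # No "append": the store has no incremental-write operation.
        raise ValueError(f"unsupported operation: {op!r}")
    if op == "delete":
        body = None  # a delete carries no file content
    elif body is None:
        raise ValueError(f"{op} needs the full file content")
    url = f"{BASE_URL}/{quote(key)}"
    return urllib.request.Request(url, data=body, method=METHODS[op])
```

A caller would pass the returned object to `urllib.request.urlopen` to perform the operation. Because a rewrite always sends the whole file, a writer that wants to extend a file has to upload the complete new content.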

[0025] The log processor is a single instance that processes the received results and separates them by crawler. The results of each crawler are written to that crawler's own result path log file, which makes it convenient for downstream systems to read and index.
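A rough Python sketch of such a log processor is given below. Several details are assumptions made for illustration: the class name `LogProcessor`, the file naming `<crawler>.result_path.log`, the choice of a local directory for the logs, and the record format (one JSON object per line with `source_url` and `object_path` fields). The source only states that results are separated by crawler and written to one result path log file per crawler.

```python
import json
from pathlib import Path


class LogProcessor:
    """Writes one result path log per crawler (illustrative sketch)."""

    def __init__(self, log_dir: str) -> None:
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def handle(self, crawler: str, source_url: str, object_path: str) -> Path:
        """Append one record to the log file that belongs to `crawler`."""
        log_file = self.log_dir / f"{crawler}.result_path.log"
        # Assumed record format: source URL plus the location of the
        # crawler result file on the object storage system.
        record = {"source_url": source_url, "object_path": object_path}
        with log_file.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        return log_file
```

Creating exactly one `LogProcessor` per deployment matches the single-instance description, so that only one writer ever appends to a given log file.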

...

Abstract

The invention discloses a crawler network path tracing method based on object storage. The method comprises the steps that an object storage system and a log recorder are established, wherein the log recorder generates a result path log, and indexes from the source URL of a crawling result to a crawler result file on the object storage system are recorded in the result path log; when an external system needs to call the data in the database, the crawler result file on the object storage system is directly obtained through the indexes. According to the method, the object storage system is introduced so that the file reading and writing speed can be increased; the result path log is established so that the data can be retrieved in the log when the external system calls the data and does not need to be searched in the database, and accordingly the possibility of reading and writing conflicts is avoided.
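The retrieval path described in the abstract, where an external system goes from a source URL to the crawler result file without querying the database, can be sketched in Python as follows. The log format is an assumption (one JSON object per line with hypothetical `source_url` and `object_path` fields), as are the function names `build_index` and `locate`; the abstract only says that the result path log records an index from source URL to result file.

```python
import json
from typing import Dict, Iterable, Optional


def build_index(log_lines: Iterable[str]) -> Dict[str, str]:
    """Map each source URL to the path of its result file."""
    index: Dict[str, str] = {}
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        # A later record for the same URL replaces the earlier one.
        index[record["source_url"]] = record["object_path"]
    return index


def locate(index: Dict[str, str], source_url: str) -> Optional[str]:
    """Return the result file path for a URL, or None if it was never crawled."""
    return index.get(source_url)
```

The external system only reads the log and the object storage system, so it never modifies state that the crawler also writes.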

Description

Technical field

[0001] The invention belongs to the research field of path tracing in software engineering, and in particular relates to a crawler network path tracing method based on object storage.

Background

[0002] A web crawler is a program or script that automatically captures information on the World Wide Web according to certain rules. At present, most crawler network path tracing takes the crawler task as the basic unit. For example, the default behavior of the open-source crawler framework pyspider is to store results in a database. If an external system needs to retrieve data from the database, there is no convenient retrieval method: it can only scan the database, and it must modify the status of the result data in the database so that data already processed is excluded from the next round of processing. As a result, the data in the database needs to be maintained by the two systems together, causing great uncer...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/18, G06F16/953, G06F16/182
CPC: G06F16/1815, G06F16/182, G06F16/951
Inventor: 陈开冉, 邓楚健
Owner: 广州探迹科技有限公司