A method for collecting batch encrypted data for news media

A technique for collecting encrypted data, applicable to network data indexing, network data retrieval, and other database retrieval. It addresses increased data collection difficulty, poor collection stability, and high collection cost, with the effects of reducing data collection workload, running quickly, and improving data collection efficiency.

Active Publication Date: 2022-03-01
成都橙视传媒科技股份公司

AI Technical Summary

Problems solved by technology

[0003] At present, existing news data collection technology faces the following problems:
1. Many websites use CSS encryption, character encryption, AJAX, dynamic page loading and anti-crawler detection, which makes data collection more difficult.
2. Traditional data collection, in which each website is analyzed, cracked and subjected to content extraction individually, is often inefficient.
3. Website anti-crawling mechanisms and page styles change ever faster, so existing data collection solutions suffer from poor collection stability and are difficult to maintain.
4. Collection cost is high.

Examples

Embodiment 1

[0039] Embodiment 1: A method for collecting batch encrypted data of news media. First, the website URLs and site names to be collected are added to a database. The method further comprises the following steps:

[0040] S1: Set up a URL deduplication set and a URL queue, both implemented with Redis, and add the website URLs and site names from the database to both;
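
Step S1 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: a Python `set` and `deque` stand in for the Redis set and Redis list (a real deployment would use Redis commands such as SADD, LPUSH and RPOP), and the names `UrlFrontier`, `seed` and `next_url`, as well as the sample rows, are hypothetical.

```python
from collections import deque


class UrlFrontier:
    """In-memory stand-in for the Redis dedup set and Redis URL queue of S1."""

    def __init__(self):
        self.seen = set()      # plays the role of the Redis deduplication set
        self.queue = deque()   # plays the role of the Redis URL queue

    def seed(self, url, site_name):
        """Enqueue (url, site_name) only if the URL has not been seen before."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append((url, site_name))
        return True

    def next_url(self):
        """Pop the oldest queued entry, or return None when the queue is empty."""
        return self.queue.popleft() if self.queue else None


# Rows as they might be read from the database (hypothetical sample data).
rows = [
    ("https://example.com/news", "Example News"),
    ("https://example.org/daily", "Example Daily"),
    ("https://example.com/news", "Example News"),  # duplicate row
]

frontier = UrlFrontier()
added = [frontier.seed(url, name) for url, name in rows]
```

Checking the set before enqueueing means a URL that appears twice in the database is fetched only once.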

[0041] S2: The processor spawns multiple puppeteer processes to consume the data in the Redis URL queue from step S1;
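
The consumption pattern of step S2 might look like the sketch below. It makes several assumptions: threads and `queue.Queue` stand in for the puppeteer processes and the Redis URL queue, and `fetch_html` is a hypothetical stub where a real worker would render the page in a headless browser.

```python
import queue
import threading


def fetch_html(url):
    """Hypothetical stub: a real worker would render the page in a headless browser."""
    return f"<html><body>rendered {url}</body></html>"


def worker(url_queue, results, lock):
    """Consume URLs until the queue is empty, storing the rendered HTML."""
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return  # nothing left to consume
        html = fetch_html(url)
        with lock:
            results[url] = html
        url_queue.task_done()


# Stand-in for the URL queue filled in S1 (hypothetical URLs).
url_queue = queue.Queue()
for i in range(10):
    url_queue.put(f"https://example.com/page/{i}")

results = {}
lock = threading.Lock()
workers = [
    threading.Thread(target=worker, args=(url_queue, results, lock))
    for _ in range(4)
]
for t in workers:
    t.start()
for t in workers:
    t.join()
```

Because every worker pulls from the same queue, each URL is handled by exactly one worker, and adding workers raises throughput without changing the queue logic.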

[0042] S3: Set up an HTML queue implemented with Redis. After the web page HTML is obtained, add it to the Redis HTML queue, and apply a marking step in the queue that distinguishes list-page HTML from content-page HTML;
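
One possible shape for the marked HTML queue of step S3 is sketched below. The source does not say how a page is recognized as a list page or a content page, so the caller supplies the mark here. The JSON record layout, the `deque` standing in for the Redis list, and the names `push_html` and `pop_html` are all assumptions made for illustration.

```python
import json
from collections import deque

LIST_PAGE = "list"
CONTENT_PAGE = "content"

html_queue = deque()  # stand-in for the Redis HTML queue


def push_html(url, html, page_type):
    """Serialize the page with its mark and append it to the HTML queue."""
    if page_type not in (LIST_PAGE, CONTENT_PAGE):
        raise ValueError(f"unknown page type: {page_type!r}")
    record = {"url": url, "type": page_type, "html": html}
    html_queue.append(json.dumps(record))


def pop_html():
    """Return the oldest record as a dict, or None when the queue is empty."""
    if not html_queue:
        return None
    return json.loads(html_queue.popleft())


push_html("https://example.com/news", "<ul><li>headline</li></ul>", LIST_PAGE)
push_html("https://example.com/news/1", "<article>body</article>", CONTENT_PAGE)

first = pop_html()
second = pop_html()
```

Storing the mark alongside the HTML lets a downstream parser choose link extraction for list pages and article extraction for content pages without inspecting the markup again.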

[0043] S4, analyze the data in the html queue implemented by redi...

Embodiment 2

[0045] Embodiment 2: On the basis of Embodiment 1, in step S2 the multiple puppeteer processes are kept alive. When a process is idle, its browser state information is saved to a text file that is marked as "to be called". When a URL in the Redis URL queue needs to be parsed, the text file of a randomly chosen puppeteer process marked "to be called" is read, and its status is then changed to "being called". This reduces memory usage and speeds up browser startup.
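
The state-file bookkeeping described in Embodiment 2 could be sketched as follows. The source does not specify the file format or what the browser state information contains, so the JSON layout, the placeholder `state` payload, the status strings, and the function names `save_idle`, `acquire` and `release` are all hypothetical.

```python
import json
import random
import tempfile
from pathlib import Path

WAITING = "waiting"   # corresponds to "to be called"
IN_USE = "in_use"     # corresponds to "being called"


def save_idle(state_dir, worker_id, browser_state):
    """Write an idle worker's browser state to a text file marked as waiting."""
    path = Path(state_dir) / f"worker-{worker_id}.json"
    path.write_text(json.dumps({"status": WAITING, "state": browser_state}))
    return path


def acquire(state_dir):
    """Pick a random waiting worker, mark its file in use, return (path, state)."""
    candidates = []
    for path in sorted(Path(state_dir).glob("worker-*.json")):
        record = json.loads(path.read_text())
        if record["status"] == WAITING:
            candidates.append((path, record))
    if not candidates:
        return None
    path, record = random.choice(candidates)
    record["status"] = IN_USE
    path.write_text(json.dumps(record))
    return path, record["state"]


def release(path):
    """Mark a worker's state file as waiting again once its URL is done."""
    record = json.loads(path.read_text())
    record["status"] = WAITING
    path.write_text(json.dumps(record))


state_dir = tempfile.mkdtemp()
for i in range(3):
    save_idle(state_dir, i, {"session": f"placeholder-{i}"})

first = acquire(state_dir)
second = acquire(state_dir)
```

This sketch does no file locking, so two schedulers running at once could pick the same file. A production version would need an atomic claim step.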

Embodiment 3

[0046] Embodiment 3: On the basis of Embodiment 1, in step S4 a marking process is set, specifically an HTML mark, to monitor whether the Redis HTML queue contains data to be parsed. If it does, the processor calls the HTML tag parsing process to parse the HTML tags.
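
The monitor-then-parse loop of Embodiment 3 might be sketched as below. A `deque` stands in for the Redis HTML queue, the parser simply collects `<a href>` links with Python's standard `html.parser`, and the names `LinkCollector`, `parse_links` and `monitor` and the polling parameters are assumptions. The source does not describe what the tag parsing program extracts.

```python
import time
from collections import deque
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def parse_links(html):
    """Parse one HTML document and return its links in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def monitor(html_queue, max_idle_polls=3, poll_interval=0.01):
    """Poll the queue, parse whatever arrives, stop after several empty polls."""
    parsed = []
    idle = 0
    while idle < max_idle_polls:
        if html_queue:
            idle = 0
            parsed.append(parse_links(html_queue.popleft()))
        else:
            idle += 1
            time.sleep(poll_interval)
    return parsed


html_queue = deque([
    '<ul><li><a href="/a/1">One</a></li><li><a href="/a/2">Two</a></li></ul>',
    "<p>No links here</p>",
])
parsed = monitor(html_queue)
```

With a real Redis list, a blocking pop such as BRPOP could replace the polling loop, so the monitor would wake only when data arrives.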

Abstract

The invention discloses a method for collecting batch encrypted data of news media, belonging to the field of news media data collection. The method comprises the steps of: S1, adding the website URLs and site names in the database to a URL deduplication set and a URL queue, both implemented with Redis; S2, using puppeteer processes to consume the data in the Redis URL queue; S3, obtaining the web page HTML, adding it to an HTML queue implemented with Redis, and marking each entry in that queue as list-page HTML or content-page HTML; S4, parsing and processing the data in the Redis HTML queue. The invention makes batch collection of encrypted data easier to implement and offers high efficiency, low cost and easy maintenance.

Description

Technical field

[0001] The invention relates to the field of news media data collection, and more specifically to a method for collecting batch encrypted data of news media.

Background

[0002] News and public opinion media need to collect relevant news data.

[0003] At present, existing news data collection technology faces the following problems:
1. Many websites use CSS encryption, character encryption, AJAX, dynamic page loading and anti-crawler detection, which makes data collection more difficult.
2. Traditional data collection, in which each website is analyzed, cracked and subjected to content extraction individually, is often inefficient.
3. Website anti-crawling mechanisms and page styles change ever faster, so existing data collection solutions suffer from poor collection stability and are difficult to maintain...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/951, G06F16/955
CPC: G06F16/951, G06F16/955
Inventor: 李林, 吴雷, 孙于扬
Owner: 成都橙视传媒科技股份公司