Web text data crawler method and system

A web text data crawler method and system in the technical field of web crawlers. It addresses the problems of a single data source, a reduced amount of acquired network text data, and low crawling accuracy, with the effect of expanding the range of data sources and improving crawling accuracy and crawling speed.

Active Publication Date: 2022-05-06
TIBET UNIVERSITY FOR NATIONALITIES

AI Technical Summary

Problems solved by technology

[0004] The above network text data crawler technologies have the following disadvantages: 1. The data source is single: page content is usually crawled with one particular website as the root directory, which greatly reduces the amount of network text data obtained, and the crawling accuracy is not high. At the same time, useless web content such as pictures, videos, and maps is also processed, which reduces the efficiency of data crawling.

Examples

Embodiment Construction

[0065] The following will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.

[0066] One purpose of the present invention is to have a search engine return the search result pages for a target keyword and to collect URLs from those pages. Obtaining a large number of different URLs expands the data sources. The URLs are then filtered by URL type and by the degree of relevance between the keyword and each result's description text, so that only high-value URLs are crawled for text data.
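The two URL filters described in [0066] can be sketched as follows. This is only an illustration under stated assumptions: the excerpt does not say which website types are excluded or how the relevance degree is computed, so the blocklists, the term-overlap score, the threshold, and the function names are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical blocklists: the source does not enumerate the excluded site types.
BLOCKED_HOST_LABELS = ("video", "image", "map")
BLOCKED_EXTENSIONS = (".jpg", ".png", ".gif", ".mp4", ".pdf")


def is_allowed_url(url: str) -> bool:
    """URL type filter: reject hosts or paths that look like non-text content."""
    parsed = urlparse(url)
    host_labels = parsed.netloc.lower().split(".")
    if any(label in host_labels for label in BLOCKED_HOST_LABELS):
        return False
    return not parsed.path.lower().endswith(BLOCKED_EXTENSIONS)


def relevance(keyword: str, description: str) -> float:
    """Assumed metric: fraction of keyword terms found in the description."""
    terms = keyword.lower().split()
    if not terms:
        return 0.0
    text = description.lower()
    return sum(term in text for term in terms) / len(terms)


def filter_results(keyword, results, threshold=0.5):
    """Keep URLs that pass both filters. `results` is a list of
    (url, description) pairs taken from a search result page."""
    return [
        url
        for url, description in results
        if is_allowed_url(url) and relevance(keyword, description) >= threshold
    ]
```

Whitespace tokenization is used here for brevity; text in languages without word spacing would need a proper word segmenter before scoring.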

[0067] The second object of the present invention is to propose a text...

Abstract

The invention discloses a web text data crawler method and system. The method comprises the following steps: a target keyword is entered into the keyword search bar of a crawler system, and the crawler system automatically calls a search engine to obtain a search result page; the entry websites in the search result page are analyzed, and the entry websites that belong to a set website type are deleted according to the analysis result to obtain a target object list; the entry websites remaining in the target object list are filtered according to the degree of association between the target keyword and the description text in the target object list; the webpages corresponding to the filtered entry websites are parsed with a webpage parsing library to obtain a text content list for each webpage, and the text content in the list is filtered according to the text probability distribution and the text length; and the paragraphs and sentences of the text content in the filtered list are screened according to the target keyword to obtain the web text data for the target keyword. The invention can improve crawling precision and crawling efficiency.
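The last two steps of the abstract, filtering text blocks by length and screening paragraphs and sentences by the target keyword, might look roughly like the sketch below. It is a simplification under assumptions: the `min_length` value, the sentence-splitting rule, and the function name are hypothetical, and the text probability distribution filter is omitted because this excerpt does not define it.

```python
import re

# A sentence is a run of non-terminator characters plus an optional
# terminator (ASCII or full-width punctuation). This rule is an assumption.
SENTENCE_RE = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def screen_text(keyword, blocks, min_length=20):
    """Return the sentences that mention `keyword`, taken from text blocks
    that are long enough and that mention the keyword themselves."""
    needle = keyword.lower()
    kept = []
    for block in blocks:
        if len(block) < min_length:
            continue  # length filter: drops menus, buttons, short captions
        if needle not in block.lower():
            continue  # paragraph-level screening
        for sentence in SENTENCE_RE.findall(block):
            sentence = sentence.strip()
            if needle in sentence.lower():
                kept.append(sentence)  # sentence-level screening
    return kept
```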

Description

Technical field

[0001] The invention relates to the technical field of web crawlers, in particular to a web text data crawler method and system.

Background technique

[0002] A web crawler, also known as a web spider, is a program or script that automatically grabs information on the Internet according to certain rules. A huge amount of information is now stored on the World Wide Web, and extracting and using this information effectively has become a major challenge. Web crawlers access network resources by imitating browsers, and then use regular expressions, XPath and other data extraction libraries, together with specific rules, to automatically obtain specific target information from web pages. Web crawlers are currently widely used in search engines, public opinion monitoring, and data analysis.

[0003] At present, there are several network text data crawler methods: 1. A method, device and computer program for collecting data from ...
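As a rough illustration of the extraction step mentioned in the background (pulling target text out of a fetched page), the sketch below collects paragraph text with Python's standard-library `html.parser`. The class and function names are hypothetical, and the choice to keep only `<p>` content is an assumption made for brevity, not the parsing library or rule set used by the invention.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the visible text of <p> elements, ignoring script and style."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._skip_depth = 0
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p" and self._in_paragraph:
            # Collapse runs of whitespace and drop empty paragraphs.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph and not self._skip_depth:
            self._buffer.append(data)


def extract_paragraphs(html: str) -> list:
    """Return the text content list for one page (paragraph text only)."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Images, videos, and other non-text elements contribute nothing here because only character data inside paragraphs is kept.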

Application Information

IPC(8): G06F16/951, G06F16/9532, G06F16/332
CPC: G06F16/951, G06F16/9532, G06F16/332
Inventors: 赵尔平, 王禹皓, 张雅坤, 王通辉
Owner TIBET UNIVERSITY FOR NATIONALITIES