Topic crawler system based on social labels

A topic crawler based on social annotation, applied in the field of topic crawling. It addresses topic drift (the search deviating from the topic), neglect of page-topic relevance, and growing computational complexity, and aims for high program operating efficiency, improved crawler efficiency, and high network bandwidth utilization.

Inactive Publication Date: 2009-10-21
HUAZHONG UNIV OF SCI & TECH
0 Cites 50 Cited by

AI Technical Summary

Problems solved by technology

A search strategy based on link-structure evaluation takes the structural characteristics of links into account and works well when searching for websites related to a topic. However, because it ignores the correlation between page content and the topic, it is prone to "topic drift", where the search deviates from the topic. In addition, the search process must iteratively calculate PageRank values or Authority and Hub weights; as the number of pages and links continues to grow, the computational complexity also increases exponentially.
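To make the iterative cost concrete, here is a minimal sketch of the standard PageRank power iteration. It is not taken from the patent; the function name `pagerank`, the damping factor of 0.85, and the fixed iteration count are illustrative assumptions. Each pass visits every link once, and the whole graph is revisited on every iteration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [linked pages]}.

    Illustrative only: damping and iteration count are assumed defaults.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new_rank = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    print(pagerank({"a": ["b"], "b": ["a"]}))
```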

Examples

Embodiment Construction

[0024] The present invention is described in further detail below with reference to the accompanying drawings and examples.

[0025] As shown in figure 1, the present invention proposes a web crawler strategy based on social labels, and a multi-threaded crawler system with asynchronous IO is designed around this strategy. The system includes a page acquisition module 100, a page processing module 200, a correlation calculation module 300, a storage module 400, a link extraction module 500, and a link analysis module 600.
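One possible way to picture how the six modules fit together is sketched below. Only the module names and numbers come from the text; the call order after page processing, the callable signatures, and the names `CrawlerModules` and `crawl_one` are assumptions made for illustration, and the sketch is single-threaded rather than asynchronous.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class CrawlerModules:
    """Hypothetical stand-ins for modules 100 to 600, injected as callables."""
    acquire: Callable[[str], str]                    # 100: URL -> raw page
    process: Callable[[str], dict]                   # 200: raw page -> parsed page
    relevance: Callable[[dict], float]               # 300: parsed page -> score
    store: Callable[[str, dict, float], None]        # 400: persist page and score
    extract_links: Callable[[dict], Iterable[str]]   # 500: parsed page -> links
    analyze_links: Callable[[Iterable[str], float], List[Tuple[str, float]]]  # 600


def crawl_one(url: str, m: CrawlerModules) -> List[Tuple[str, float]]:
    """Run one URL through the assumed pipeline and return prioritized links."""
    raw = m.acquire(url)
    page = m.process(raw)
    score = m.relevance(page)
    m.store(url, page, score)
    links = m.extract_links(page)
    return m.analyze_links(links, score)
```

Passing the stages in as callables keeps the sketch independent of any particular fetcher, parser, or storage backend.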

[0026] The page acquisition module 100 is responsible for acquiring web pages. It acquires pages according to the robots.txt file (the robots exclusion protocol file) of the target website, the network bandwidth limit, and the priority of each web page, and then hands each acquired page to the page processing module 200 for processing.
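The robots.txt check described above can be sketched with Python's standard `urllib.robotparser`. This is an illustration, not the patent's implementation: the helper name `build_robots_filter`, the user agent string `topic-crawler`, and the sample rules are all assumptions. The rules are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def build_robots_filter(robots_txt: str, user_agent: str = "topic-crawler"):
    """Return a function that tells whether a URL may be fetched.

    robots_txt is the already-downloaded file content; user_agent is a
    hypothetical crawler name used only for this example.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed


# Sample rules, invented for the example.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

allowed = build_robots_filter(SAMPLE_ROBOTS)
```

A page acquisition step would call `allowed(url)` before issuing a request and skip any URL for which it returns `False`.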

[0027] The page acquisition module 100 starts from the list of seed URL...

Abstract

The invention relates to a topic crawler system based on social labels, which comprises a page acquisition module, a page processing module, a correlation calculation module, a storage module, a link extraction module and a link analysis module. The system makes full use of the social labels of web pages: because social labels are a commonly accepted description of page content and are close to what the pages actually describe, the system uses them to judge the relevance of web pages. It applies this relevance in a topic crawler to guide the crawling direction and to provide high-quality web page data for a topic search engine. The system makes good use of network bandwidth to reduce unnecessary overhead during web page acquisition, adopts different storage modes for different requirements to reduce IO consumption, and adopts a multi-level cache mechanism to reduce blocking and improve crawler efficiency. With the support of social labels, the system optimizes the crawler framework and provides an optimal web page data set for the subsequent processing stages of the topic search engine.
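The visible text does not give the relevance formula, so the following is only a sketch under a stated assumption: relevance is taken to be the Jaccard overlap between a page's social labels and a set of topic labels. The function name `tag_relevance` and the sample labels are hypothetical.

```python
def tag_relevance(page_tags, topic_tags):
    """Jaccard overlap between page labels and topic labels (assumed metric).

    Returns a value in [0, 1]; 0.0 if either label set is empty.
    """
    page = {t.strip().lower() for t in page_tags}
    topic = {t.strip().lower() for t in topic_tags}
    if not page or not topic:
        return 0.0
    return len(page & topic) / len(page | topic)


if __name__ == "__main__":
    topic = ["python", "crawler"]
    print(tag_relevance(["Python", "web", "crawler"], topic))
    print(tag_relevance(["cooking", "recipes"], topic))
```

Other set-similarity or weighted measures would fit the same interface; Jaccard is used here only because it is simple to verify by hand.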

Description

Technical field

[0001] The invention belongs to computer data mining technology, and specifically relates to a topic crawler system based on social annotations. The system proposes a new crawling strategy that guides the crawler according to the correlation between the social annotations of web pages and predetermined topics. It enables the topic crawler to crawl relevant pages accurately and effectively, and dynamically adjusts the priority of the pages to be crawled according to the relevance of the pages already crawled.

Background technique

[0002] With the rapid development of the Internet, people rely more and more on computer networks to find the information they need, and the Internet has become an important source of information in people's lives. The emergence of search engines enables people to use keywords to quickly query relevant web page information, avoid searching aimlessly, save time in obtaining information, and thus greatly improve work efficiency, suc...
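The dynamic priority adjustment mentioned in [0001] can be illustrated with a small priority queue of URLs waiting to be crawled. This is a sketch under assumptions, not the patented design: the class name `Frontier`, the rule that a URL keeps the highest score it has been offered, and the sample URLs are all invented for the example.

```python
import heapq
import itertools


class Frontier:
    """Max-priority queue of URLs; a URL's priority can be raised later."""

    def __init__(self):
        self._heap = []                    # entries: (-score, tie_breaker, url)
        self._best = {}                    # url -> highest score offered so far
        self._done = set()                 # urls already handed out
        self._counter = itertools.count()  # keeps ordering stable on ties

    def push(self, url, score):
        # Ignore crawled URLs and offers that do not improve the priority.
        if url in self._done or score <= self._best.get(url, -1.0):
            return
        self._best[url] = score
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        while self._heap:
            neg_score, _, url = heapq.heappop(self._heap)
            if url in self._done:
                continue                   # stale entry with an older, lower score
            self._done.add(url)
            return url, -neg_score
        raise IndexError("frontier is empty")


if __name__ == "__main__":
    f = Frontier()
    f.push("http://example.com/a", 0.2)
    f.push("http://example.com/b", 0.9)
    f.push("http://example.com/a", 0.5)    # a relevant page also links to /a
    print(f.pop())
    print(f.pop())
```

Here the score passed to `push` would come from the relevance of the page on which the link was found, so links discovered on more relevant pages are crawled first.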

Application Information

IPC(8): G06F17/30
Inventor 李瑞轩, 文坤梅, 赵勇, 辜希武, 卢正鼎, 靳延安, 丁益斌
Owner HUAZHONG UNIV OF SCI & TECH