User access content-based real-time personalized information collection method

An information collection technology based on user access content, applied in special data processing applications, instruments, computing, etc. It addresses problems such as heavy consumption of system resources and neglect of link information.

Inactive Publication Date: 2015-12-09
SHANDONG UNIV
2 Cites 19 Cited by

AI Technical Summary

Problems solved by technology

[0003] The technology commonly used today for topic-oriented crawling of web resources is the focused crawler. A general focused crawler relies on manually set topic keywords and seed links for pre-selected topics, and aims to collect as many relevant pages as possible. This consumes substantial system resources and network bandwidth, and processing is slow.
Moreover, current focused crawlers mainly adopt topic crawling strategies based on content evaluation, which ignore link information and predict the value of links poorly.


Examples

Embodiment Construction

[0076] The present invention will be further described below with reference to the accompanying drawings and embodiments.

[0077] Focused crawler: Also known as a web spider, it is a program or script that automatically crawls information from the World Wide Web according to set rules.

[0078] As shown in Figure 1:

[0079] 1. First obtain the user's network request in real time through the intelligent gateway, extract the URL from the network request, and download the corresponding page as the current seed page according to the URL. Filter noise content such as navigation, advertisements and copyright notices, extract title, introduction and body content respectively, and save them in a custom web page information structure WebText.
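The page-structuring part of step 1 could look roughly like the following Python sketch. It assumes the page HTML has already been downloaded (the gateway capture and download are left out), and the field layout of `WebText`, the `NOISE_TAGS` set, and the `build_webtext` helper are all illustrative assumptions, since the source names the WebText structure but does not define it or the filtering rules.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Tags treated as noise (navigation, footers, scripts). This list is an
# assumption; the source does not say how noise content is detected.
NOISE_TAGS = {"nav", "footer", "aside", "script", "style"}


@dataclass
class WebText:
    """Hypothetical layout for the WebText structure named in step 1."""
    url: str
    title: str = ""
    introduction: str = ""
    body: str = ""


class _Extractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noise_depth = 0   # greater than 0 while inside a noise element
        self.in_title = False
        self.title_parts = []
        self.body_parts = []
        self.introduction = ""

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag == "title":
            self.in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Assumption: the meta description stands in for the introduction.
            if attr.get("name") == "description":
                self.introduction = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self.noise_depth:
            return  # skip whitespace and anything inside a noise element
        if self.in_title:
            self.title_parts.append(text)
        else:
            self.body_parts.append(text)


def build_webtext(url, html):
    """Parse already-downloaded HTML into a WebText record."""
    parser = _Extractor()
    parser.feed(html)
    parser.close()
    return WebText(url, " ".join(parser.title_parts),
                   parser.introduction, " ".join(parser.body_parts))


sample = """<html><head><title>Focused crawlers</title>
<meta name="description" content="A short intro."></head>
<body><nav>Home | About</nav><p>Main text.</p>
<footer>Copyright 2015</footer></body></html>"""
page = build_webtext("http://example.com/a", sample)
```

Filtering by tag name is only one simple heuristic; a production system would likely need content-based noise detection for pages that do not use semantic tags.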

[0080] 2. Multi-angle analysis to extract the topic keywords of the seed page, as shown in Figure 2.

[0081] (2.1) First, perform word segmentation on the title, introduction, and body content, remove words such as stop words, merge numbe...

Abstract

The invention discloses a real-time personalized information collection method based on user access content, comprising the following steps: obtaining a current seed page by analyzing a user network request in real time, and extracting the structural information of the web page; extracting topic keywords from multiple angles according to that structural information and building a topic keyword lexicon; extracting the anchor text of each sub-link of the current seed page, segmenting the anchor text into words according to the topic keyword lexicon, building a vector space model from the segmentation result, and calculating the topic relevance between each sub-link and the current seed page by cosine similarity over the vector space model; judging sub-links whose topic relevance is greater than a set threshold to be effective sub-links; building a link topic classification base, setting seed link priorities, and classifying the current seed link by topic; and calculating the importance of all sub-links in the link topic classification base, ranking the sub-links by importance, and downloading and storing the corresponding page information in ranked order.
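The topic relevance step in the abstract can be illustrated with a minimal Python sketch. It assumes the anchor text is already segmented into tokens and uses raw term counts as vector weights; the source does not state the weighting scheme, and the function names, sample URLs, and threshold value below are hypothetical.

```python
import math
from collections import Counter


def term_vector(tokens, vocabulary):
    """Count how often each vocabulary term occurs in tokens."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def effective_sublinks(topic_keywords, sublinks, threshold):
    """Keep sub-links whose anchor text is relevant enough to the seed page topic.

    sublinks is a list of (url, anchor_tokens) pairs; the vocabulary is the
    set of topic keywords, so anchor words outside it are ignored.
    """
    vocabulary = sorted(set(topic_keywords))
    topic_vec = term_vector(topic_keywords, vocabulary)
    kept = []
    for url, anchor_tokens in sublinks:
        score = cosine(topic_vec, term_vector(anchor_tokens, vocabulary))
        if score > threshold:  # strictly greater, as in the abstract
            kept.append((url, score))
    return kept


topic_keywords = ["crawler", "topic", "link"]
sublinks = [
    ("http://example.com/a", ["topic", "crawler", "guide"]),
    ("http://example.com/b", ["contact", "us"]),
]
result = effective_sublinks(topic_keywords, sublinks, threshold=0.5)
```

Here the first sub-link shares two of the three topic keywords and passes the threshold, while the second shares none and scores zero. The later importance ranking is a separate step that this sketch does not cover.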

Description

Technical field

[0001] The invention relates to a real-time personalized information collection method based on user access content.

Background technique

[0002] With the increase of terminal products such as smart phones and tablet computers in the home environment and the enrichment of various multimedia data, users have gradually established the habit of using smart terminal devices. However, with the increase of terminal products, the growth of network information is also extremely rapid. Massive information can provide users with rich information resources, and it also poses a challenge to how users can quickly obtain the required information from the information ocean. Real-time personalized information collection based on user access content has become an important topic in the context of big data, which is of great significance for subsequent data analysis and mining.

[0003] The commonly used technology for crawling web resources according to the theme is the foc...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/9535
Inventor: 曹叶文, 王鹏达
Owner: SHANDONG UNIV