Automatic webpage classification method based on network hot word identification

An automatic classification and hot word technology, applied in special data processing applications, instruments, and electrical digital data processing, addresses the problems of multiple manual review steps and lagging hot word response, improving accuracy and reducing workload.

Status: Inactive; Publication date: 2013-07-03
南京安讯科技有限责任公司
3 Cites, 55 Cited by

AI Technical Summary

Problems solved by technology

Compared with the statistics-based classification method, this method achieves higher accuracy when classifying texts. Its disadvantages are that updating the thesaurus often requires additional manual review steps, and that its response to newly emerging hot words on the Internet lags.

Examples

Embodiment approach

[0026] Step 1, use a customized crawler to obtain the content information of the webpage.

[0027] Step 2, perform a word segmentation operation on the extracted webpage content.

[0028] Step 3, compare the word segmentation results of the webpage content with the established Internet keyword category library, and denote the total number of categories that the webpage may belong to as M.

[0029] Step 4, if the value of M is greater than or equal to 2, go to step 5, otherwise go to step 7.

[0030] Step 5, randomly select two categories from the M categories each time, and use formula (2) to determine which of the two the content of the web page belongs to, so that a total of [...] comparison results is obtained.

[0031] Step 6, analyze the [...] comparison results to obtain the category information of the webpage content, and write the word segmentation results into the hot word association data thesaurus according to the category.

[0032] Step 7, if the value of ...
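
Steps 3 to 6 can be sketched roughly as follows. This is only an illustration under stated assumptions: formula (2) is not reproduced in this text, so the `score` function below is a stand-in that simply counts keyword hits; the library contents and all names (`KEYWORD_LIBRARY`, `candidate_categories`, `classify`) are hypothetical; and the sketch enumerates every pair of categories rather than drawing pairs at random. The case M < 2 is left unhandled because step 7 is truncated.

```python
from collections import Counter
from itertools import combinations

# Hypothetical Internet keyword category library: category -> keywords.
KEYWORD_LIBRARY = {
    "sports": {"football", "league", "goal"},
    "finance": {"stock", "market", "goal"},
    "tech": {"phone", "chip"},
}


def candidate_categories(tokens, library):
    """Step 3: categories that share at least one keyword with the tokens."""
    token_set = set(tokens)
    return sorted(cat for cat, words in library.items() if token_set & words)


def score(tokens, category, library):
    """Stand-in for formula (2), which is an assumption here:
    the number of tokens that are keywords of the category."""
    words = library[category]
    return sum(1 for t in tokens if t in words)


def classify(tokens, library, thesaurus):
    """Steps 4 to 6: pairwise comparison, then record tokens by category."""
    candidates = candidate_categories(tokens, library)
    if len(candidates) < 2:
        # Step 4 sends this case to step 7, which is not available here.
        return candidates, None
    wins = Counter()
    for a, b in combinations(candidates, 2):  # step 5: compare two at a time
        winner = a if score(tokens, a, library) >= score(tokens, b, library) else b
        wins[winner] += 1
    best = max(candidates, key=lambda c: wins[c])  # step 6: most pairwise wins
    thesaurus.setdefault(best, []).extend(tokens)  # write tokens by category
    return candidates, best


thesaurus = {}
tokens = ["football", "goal", "league", "stock"]
print(classify(tokens, KEYWORD_LIBRARY, thesaurus))
```

Picking the category with the most pairwise wins is one simple way to aggregate the comparison results; the patent's own aggregation rule may differ.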

Abstract

The invention relates to an automatic webpage classification method based on network hot word identification. The method mainly comprises the following steps: acquiring webpage content information with a customized crawler, and automatically performing word classification on the acquired webpage content using an Internet keyword base and an Internet stopword base; calculating a hot value from each keyword's frequency of appearance and its time distance degree, and performing an initial classification of the webpage content according to the hot values of the words through a Bayesian multidimensional classification model; performing relevance identification, through a relevance algorithm, on the word classification items that were not matched in the classified webpages, finding hot words not yet collected in the Internet keyword base, and adding them to that base; and reclassifying, with the updated Internet word base, the webpage content that could not be classified during the initial classification.
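
The abstract states that the hot value depends on how often a keyword appears and on its time distance, but the formula itself is not given here. The sketch below shows one possible reading, assuming that each occurrence contributes a weight that halves every `half_life_days`. The exponential decay, the function name `hot_value`, and the default half-life are all assumptions made for illustration, not the patent's definition.

```python
from datetime import datetime, timedelta


def hot_value(timestamps, now, half_life_days=7.0):
    """Sum of per-occurrence weights; each weight halves every
    half_life_days (an assumed decay, not the patent's formula)."""
    total = 0.0
    for ts in timestamps:
        # Age of this occurrence in days, clamped so future dates count as new.
        age_days = max((now - ts).total_seconds() / 86400.0, 0.0)
        total += 0.5 ** (age_days / half_life_days)
    return total


now = datetime(2013, 7, 3)
occurrences = [now, now - timedelta(days=7)]
print(hot_value(occurrences, now))  # one fresh hit plus one week-old hit
```

Under this assumption, a word seen often and recently scores higher than one seen equally often long ago, which matches the stated goal of responding quickly to newly emerging hot words.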

Description

Technical field

[0001] The invention relates to a method for automatically classifying webpages, in particular to a method for automatically classifying webpages based on network hot word recognition, which belongs to the technical field of data mining.

Background technique

[0002] With the rapid development of the Internet and Web technology, the number of web pages on the Internet is constantly increasing. The increasing popularity of the Internet and the explosive growth of the number of Internet users have resulted in complex and diverse network behaviors. In order to effectively organize and analyze massive web information resources and help users quickly obtain the knowledge and information they need, it is necessary to automatically classify web pages.

[0003] There are two main types of traditional text classification methods: one is the classification method based on statistics, and the other is the classification method based on knowledge. The idea of the clas...

Application Information

IPC(8): G06F17/30, G06F17/27
Inventor 邵伟昂卫武黄汇
Owner 南京安讯科技有限责任公司