An Automatic Hot Topic Mining System Based on Internet Corpus
An automatic hot topic mining technology, applied in unstructured text data retrieval, instrumentation, computing, etc., that addresses problems such as poor scalability and non-reusable matching templates.
Active Publication Date: 2019-01-22
北京一览群智数据科技有限责任公司
AI Technical Summary
Problems solved by technology
The method based on rule matching requires a lot of prior knowledge. Although its accuracy is high, its scalability is poor, and matching templates cannot be reused across fields. The method based on site statistics needs a large volume of logs from a large user base, and small and medium-sized companies or research institutes cannot obtain such data. The method based on event detection must first generate high-quality candidate words. Because information on the Internet changes daily and new words appear constantly, unregistered (out-of-vocabulary) words are a challenge for this method.
Abstract
The invention discloses an automatic hot topic mining system based on Internet corpora. The system consists of two routes: 1) crawling hot words from existing hot word statistics sites and generating a series of hot topics through clustering, entity extraction and keyword mining; and 2) extracting n-grams from massive news documents, mining high-frequency hot words by calculating the mutual information and conditional entropy of each n-gram, and recognizing new topics with a time-series-based event detection method. With this system, current hot events can be mined in real time, and the relevant keywords and named entities of a hot topic can be mined as soon as the topic is generated.
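The second route scores n-grams by mutual information and conditional entropy, but the text available here does not give the formulas. The following is a minimal sketch under a common new-word-discovery assumption: cohesion is measured as pointwise mutual information between the two tokens of a bigram, and boundary freedom as the entropy of the token that follows the bigram. The function name `score_bigrams` and the pre-tokenized input format are hypothetical, not taken from the patent.

```python
import math
from collections import Counter, defaultdict


def score_bigrams(docs):
    """Score each bigram in `docs` (a list of token lists).

    Returns {bigram: (pmi, right_entropy)}. Both measures are an
    assumed formulation; the patent text does not specify them.
    """
    unigrams = Counter()
    bigrams = Counter()
    right_neighbors = defaultdict(Counter)  # bigram -> counts of the next token

    for tokens in docs:
        unigrams.update(tokens)
        for i in range(len(tokens) - 1):
            bg = (tokens[i], tokens[i + 1])
            bigrams[bg] += 1
            if i + 2 < len(tokens):
                right_neighbors[bg][tokens[i + 2]] += 1

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for bg, count in bigrams.items():
        # Pointwise mutual information: how much more often the two
        # tokens co-occur than independence would predict.
        p_xy = count / n_bi
        p_x = unigrams[bg[0]] / n_uni
        p_y = unigrams[bg[1]] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))

        # Entropy of the following token, conditioned on the bigram.
        followers = right_neighbors[bg]
        total = sum(followers.values())
        entropy = 0.0
        if total:
            entropy = -sum(
                (c / total) * math.log2(c / total) for c in followers.values()
            )
        scores[bg] = (pmi, entropy)
    return scores


docs = [["new", "york", "city"], ["new", "york", "times"], ["big", "city"]]
scores = score_bigrams(docs)
```

A candidate with high mutual information and high boundary entropy behaves like a fixed phrase that appears in varied contexts, which is the usual reason for combining the two measures. Any thresholds would need to be tuned on real data.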
Description
Technical field
The invention relates to an automatic hot topic mining system based on Internet corpora.
Background
Existing hot word mining systems use three main methods: rule matching, site statistics and event detection. The method based on rule matching requires a lot of domain knowledge and mines hot words with manually established hot word matching templates. The method based on site statistics mainly uses site traffic data, such as news access logs of portal websites and query logs of search engines, and mines hot words from frequently accessed content. The method based on event detection first mines candidate hot words using named entity recognition, high-frequency string statistics and similar methods, and then uses time series analysis to select the candidates with an obvious hot trend as the final result. Th...
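The selection step of the event-detection method, which keeps only candidates with an obvious hot trend, can be sketched as follows. The source does not name a specific time series technique, so this example assumes a simple z-score burst test on daily counts. The names `is_trending` and `select_hot_words`, the threshold of 3.0, and the sample counts are all illustrative assumptions.

```python
from statistics import mean, pstdev


def is_trending(daily_counts, threshold=3.0, min_history=3):
    """Return True if the latest count is a burst relative to earlier days.

    Assumed rule: z-score of the last value against the preceding
    history; not a method stated in the source.
    """
    if len(daily_counts) < min_history + 1:
        return False  # not enough history to judge a trend
    history, latest = daily_counts[:-1], daily_counts[-1]
    mu = mean(history)
    sigma = pstdev(history)
    # Floor sigma at 1.0 so a flat history does not divide by zero
    # or turn tiny fluctuations into huge z-scores.
    z = (latest - mu) / max(sigma, 1.0)
    return z >= threshold


def select_hot_words(series_by_word, threshold=3.0):
    """Keep the candidate words whose count series shows a burst."""
    return sorted(
        word
        for word, series in series_by_word.items()
        if is_trending(series, threshold)
    )


candidates = {
    "earthquake": [2, 3, 2, 3, 40],   # sudden spike on the last day
    "weather": [10, 11, 9, 10, 11],   # steady, no burst
    "newword": [5],                   # too little history
}
hot = select_hot_words(candidates)
```

Words with too little history are skipped here, which illustrates why newly coined words are hard for this family of methods: they have no baseline to compare against.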
Application Information
Patent Type & Authority: Patents (China)
IPC(8): G06F16/35, G06F16/9535
CPC: G06F16/35, G06F16/9535
Inventors: 窦志成, 文继荣, 江政宝
Owner: 北京一览群智数据科技有限责任公司