
Realization method for analysis model supporting massive long text data classification

An analysis model and data classification technology, applied in text database clustering/classification, unstructured text data retrieval, electronic digital data processing, etc., to achieve the effects of limiting the loss of classification accuracy, improving algorithm efficiency, and meeting performance requirements

Active Publication Date: 2017-05-24
BEIJING SCISTOR TECH +1

AI Technical Summary

Problems solved by technology

[0004] To address these two problems of the statistical classification model, the present invention provides a corresponding solution for current practical classification needs. It proposes using a CHI algorithm to extract the feature words of each text category, and then applying a set intersection operation to the feature-word sets of all categories to obtain the word vector space for subsequent text classification. This effectively reduces the dimension of the word vector space of each article during text classification, reduces the time complexity of the classification computation, improves algorithm efficiency, and meets the performance requirements of massive long text classification in a big data setting.
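As a rough illustration of the feature-word selection described in [0004], the sketch below scores each term against each category with the standard chi-square (CHI) statistic, keeps the top `k` terms per category, and combines the per-category sets into one vocabulary. The function names, the parameter `k`, and the `combine` argument are assumptions made for this example; the source does not give its scoring formula, and this is the textbook statistic, not necessarily the improved CHI variant the invention uses.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/category contingency table.

    n11: docs in the category that contain the term
    n10: docs outside the category that contain the term
    n01: docs in the category without the term
    n00: docs outside the category without the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def top_chi_terms(docs, labels, k):
    """Return {category: set of the k highest-scoring terms}.

    docs is a list of token lists (already word-segmented).
    """
    n_docs = len(docs)
    cat_count = Counter(labels)
    df = Counter()      # document frequency of each term
    df_cat = Counter()  # document frequency of each (term, category) pair
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_cat[(term, label)] += 1

    result = {}
    for cat, n_cat in cat_count.items():
        scores = {}
        for term, n_term in df.items():
            n11 = df_cat[(term, cat)]
            n10 = n_term - n11
            n01 = n_cat - n11
            n00 = n_docs - n_cat - n10
            scores[term] = chi_square(n11, n10, n01, n00)
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[cat] = set(ranked[:k])
    return result


def build_vocabulary(per_category, combine=set.intersection):
    """Combine per-category feature sets into one sorted vocabulary.

    The source text describes a set intersection; the operation is a
    parameter here (an assumption) so other set operations can be tried.
    """
    return sorted(combine(*per_category.values()))
```

Note that the plain statistic is symmetric: a term that occurs only in other categories scores as highly as one that occurs only in the target category, so the per-category sets can overlap heavily.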


Examples

Embodiment Construction

[0014] The present invention will be further described in detail with reference to the accompanying drawings and embodiments.

[0015] The present invention provides an analysis model supporting the classification of massive long text data, together with its implementation method. The analysis model adopts a statistics-based text classification algorithm built on a vector space model (VSM): category feature words are extracted with the CHI algorithm, the text is given a vectorized representation over those feature words with the TFIDF algorithm, and the naive Bayesian method is used to train on the corpus, yielding the analysis model for the classification of massive long text data.
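The training and classification stage named in [0015] could be sketched as a multinomial naive Bayes classifier over non-negative feature weights such as TFIDF values. This is a minimal sketch under stated assumptions: the function names, the model layout, and the smoothing parameter `alpha` are hypothetical and do not come from the source, which does not specify which naive Bayes variant it uses.

```python
import math
from collections import defaultdict


def train_nb(vectors, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model on non-negative feature vectors.

    Returns {label: (log_prior, per-feature log-likelihoods)}.
    alpha is an additive (Laplace) smoothing constant, assumed here.
    """
    n_features = len(vectors[0])
    class_docs = defaultdict(int)
    class_sums = defaultdict(lambda: [0.0] * n_features)
    for vec, label in zip(vectors, labels):
        class_docs[label] += 1
        sums = class_sums[label]
        for i, weight in enumerate(vec):
            sums[i] += weight

    model = {}
    for label, sums in class_sums.items():
        total = sum(sums) + alpha * n_features
        log_prior = math.log(class_docs[label] / len(labels))
        log_like = [math.log((s + alpha) / total) for s in sums]
        model[label] = (log_prior, log_like)
    return model


def classify(model, vec):
    """Return the label with the highest posterior log-score for vec."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(w * ll for w, ll in zip(vec, log_like))
    # Iterate labels in sorted order so ties resolve deterministically.
    return max(sorted(model), key=score)
```

Working in log space avoids numeric underflow on long texts, where the product of many small probabilities would otherwise round to zero.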

[0016] The analysis model is realized through the following steps:

[0017] The first step is to establish a VSM-based statistical classification model to represent the text in a vectorized manner.
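A vectorized representation of this kind could look like the following sketch, which maps a word-segmented text onto a fixed feature vocabulary using TFIDF weights. The function names and the smoothed IDF formula are assumptions for illustration; the source does not state which TFIDF variant it applies.

```python
import math
from collections import Counter


def idf_table(docs, vocabulary):
    """Compute a smoothed inverse document frequency for each vocabulary term.

    Uses idf = ln((1 + N) / (1 + df)) + 1, one common smoothed form
    (an assumption, not taken from the source).
    """
    n_docs = len(docs)
    vocab_set = set(vocabulary)
    df = Counter()
    for tokens in docs:
        for term in set(tokens) & vocab_set:
            df[term] += 1
    return {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocabulary}


def tfidf_vector(tokens, vocabulary, idf):
    """Represent one text as a list of TFIDF weights, one per vocabulary term.

    Terms outside the vocabulary are ignored, which is what keeps the
    vector dimension equal to the number of selected feature words.
    """
    counts = Counter(tokens)
    total = len(tokens) or 1  # guard against empty texts
    return [counts[t] / total * idf[t] for t in vocabulary]
```

For example, with a two-word vocabulary every text becomes a two-element vector regardless of its length, so the cost of later classification depends on the vocabulary size and not on the text length.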

[0018] Because the situation of directly processing natural language is too complica...

Abstract

The invention provides a realization method for an analysis model supporting massive long text data classification, and belongs to the technical field of big data text analysis. The method adopts the standard word segmentation of the HanLP word segmentation tool together with an improved CHI algorithm. On one hand, this effectively reduces the dimension of the word vector space of each article during text classification, lowers the time complexity of the classification computation, improves algorithm efficiency, and meets the performance demands of massive long text classification in a big data setting; on the other hand, it keeps the loss of classification accuracy caused by the reduced vector space dimension to a minimum. The TFIDF algorithm bridges the gap between a text and its vector representation, and a Naive Bayesian classification algorithm is then trained on the vectorized texts, so that accurate classification of long text is realized. The method effectively resolves the conflict between the performance index and the accuracy index of long text classification in a big data environment, and has a wide application prospect.

Description

Technical field

[0001] The invention belongs to the technical field of big data text analysis, and specifically relates to an implementation method for an analysis model for classifying massive long text data, which uses the CHI algorithm to extract the feature words of each text category, the TFIDF algorithm to realize the vectorized representation of the text, and the naive Bayesian method for training and classification.

Background technique

[0002] Today's era is one of rapid development in information technology. With this development, scientific knowledge has experienced rapid and explosive growth in a short period of time, and a large amount of information is produced every day. 500,000 books are published in the world every year, and a new book is published every minute. On average, 13,000 to 14,000 papers containing new knowledge are published every day; more than 300,000 invention-creation patents are registered every year, and an average of ...

Application Information

IPC(8): G06F17/30, G06F17/27
CPC: G06F16/35, G06F40/216, G06F40/289
Inventor: 王宇, 徐晓燕, 周渊, 刘庆良, 郑彩娟, 黄成, 周游, 王海平, 马雪
Owner: BEIJING SCISTOR TECH