Realization method for analysis model supporting massive long text data classification

An analysis model and data classification technology, applied in text database clustering/classification, unstructured text data retrieval, electronic digital data processing, etc., to achieve the effects of limiting the loss of classification accuracy, improving algorithm efficiency, and meeting performance requirements.

Active Publication Date: 2017-05-24

BEIJING SCISTOR TECH +1

Cites: 4; Cited by: 5


Problems solved by technology

[0004] To address these two problems of the statistical classification model, the present invention provides a solution that meets current practical classification needs. It proposes using a CHI algorithm to extract the feature words of each category of text, then applying a set intersection operation over the feature word sets of all categories to obtain the word vector space used for subsequent text classification. This effectively reduces the dimension of each article's word vector space during text classification, lowers the time complexity of the classification computation, improves algorithm efficiency, and meets the performance requirements of massive long text classification under big data conditions.
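The feature extraction described in [0004] can be illustrated with a minimal sketch. It uses the textbook CHI (chi-square) statistic over a 2x2 term/category contingency table, which is an assumption: the patent's own improved CHI variant is not reproduced in this excerpt. The function names, the `k` cutoff, and the toy documents are hypothetical. The `combine` parameter defaults to a set intersection because that is the operation the paragraph names.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Textbook CHI statistic for one (term, category) pair.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category without the term
    d: documents outside the category without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def top_chi_terms(docs, labels, category, k):
    """Return the k terms with the highest CHI score for one category."""
    in_cat = [set(doc) for doc, y in zip(docs, labels) if y == category]
    out_cat = [set(doc) for doc, y in zip(docs, labels) if y != category]
    df_in = Counter(t for doc in in_cat for t in doc)
    df_out = Counter(t for doc in out_cat for t in doc)
    scores = {}
    for term in set(df_in) | set(df_out):
        a, b = df_in[term], df_out[term]
        scores[term] = chi_square(a, b, len(in_cat) - a, len(out_cat) - b)
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def feature_space(docs, labels, k, combine=set.intersection):
    """Combine the per-category feature word sets into one vocabulary."""
    per_category = [set(top_chi_terms(docs, labels, c, k))
                    for c in sorted(set(labels))]
    return sorted(combine(*per_category))


# Hypothetical, already segmented toy corpus.
docs = [["stock", "market", "rises"], ["market", "bank", "stock"],
        ["team", "wins", "match"], ["match", "team", "score"]]
labels = ["finance", "finance", "sports", "sports"]
vocabulary = feature_space(docs, labels, k=4)
```

The resulting `vocabulary` is the reduced word vector space: each document is later represented only by these terms, which is where the dimension reduction comes from.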


Embodiment Construction

[0014] The present invention will be further described in detail with reference to the accompanying drawings and embodiments.

[0015] The present invention provides an analysis model supporting the classification of massive long text data, and its implementation method. The analysis model adopts a statistics-based text classification algorithm built on a vector space model (VSM). It extracts category feature words with the CHI algorithm, uses the TFIDF algorithm to produce a vectorized representation of the text, and trains on the corpus with the naive Bayesian method, thereby realizing the analysis model for the classification of massive long text data.

[0016] The analysis model is realized through the following steps:

[0017] The first step is to establish a VSM-based statistical classification model to represent the text in a vectorized manner.
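As a rough illustration of the vectorized representation, the sketch below maps each segmented document to a TFIDF vector over a fixed feature vocabulary. The smoothed IDF formula, the function name `tfidf_vectors`, and the sample data are assumptions made for the example; the excerpt does not give the patent's exact weighting formula.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocabulary):
    """Represent each document as a TFIDF vector over `vocabulary`.

    docs: list of token lists (already segmented).
    vocabulary: ordered list of feature words; it fixes the vector dimension.
    """
    n_docs = len(docs)
    vocab_set = set(vocabulary)
    # Document frequency: number of documents containing each feature word.
    df = Counter(t for doc in docs for t in set(doc) if t in vocab_set)
    # Smoothed inverse document frequency (an assumed, commonly used variant).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocabulary}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append([counts[t] / total * idf[t] for t in vocabulary])
    return vectors


# Hypothetical usage: three feature words give three-dimensional vectors.
sample_docs = [["stock", "market", "stock"], ["team", "match"]]
sample_vocab = ["market", "stock", "team"]
sample_vectors = tfidf_vectors(sample_docs, sample_vocab)
```

Words outside the vocabulary (here "match") contribute nothing to the vector, so the vector length depends only on the number of selected feature words, not on the length of the article.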

[0018] Because the situation of directly processing natural language is too complica...


Abstract

The invention provides a realization method for an analysis model supporting massive long text data classification, and belongs to the technical field of big data text analysis. The method adopts the standard word segmentation of the HanLP word segmentation tool together with an improved CHI algorithm. On the one hand, this effectively reduces the dimension of each article's word vector space during text classification, lowers the time complexity of the classification computation, improves algorithm efficiency, and meets the performance demands of massive long text classification in a big data setting. On the other hand, it minimizes the loss of classification accuracy caused by the reduced vector space dimension. The TFIDF algorithm bridges the gap between text and vectors, and the Naive Bayesian classification algorithm is then used to train on the vectorized text, so that long texts are classified accurately. The method effectively resolves the conflict between the performance index and the accuracy index of long text classification in a big data environment, and has broad application prospects.
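The final training and classification stage can be sketched as a multinomial naive Bayes classifier that consumes weight vectors such as TFIDF vectors. This is a generic formulation with Laplace smoothing, not the patent's exact procedure; `train_nb`, `classify_nb`, the `alpha` parameter, and the sample vectors are hypothetical.

```python
import math


def train_nb(vectors, labels, alpha=1.0):
    """Fit log priors and smoothed log likelihoods per class."""
    classes = sorted(set(labels))
    dim = len(vectors[0])
    priors, likelihoods = {}, {}
    for c in classes:
        rows = [v for v, y in zip(vectors, labels) if y == c]
        priors[c] = math.log(len(rows) / len(vectors))
        # Total weight of each feature word within the class.
        totals = [sum(row[j] for row in rows) for j in range(dim)]
        norm = sum(totals) + alpha * dim
        likelihoods[c] = [math.log((t + alpha) / norm) for t in totals]
    return priors, likelihoods


def classify_nb(model, vector):
    """Return the class with the highest posterior log score."""
    priors, likelihoods = model

    def score(c):
        return priors[c] + sum(w * lp for w, lp in zip(vector, likelihoods[c]))

    return max(priors, key=score)


# Hypothetical two-dimensional training vectors for two categories.
train_vectors = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.3, 1.7]]
train_labels = ["finance", "finance", "sports", "sports"]
model = train_nb(train_vectors, train_labels)
predicted = classify_nb(model, [1.0, 0.0])
```

Because the score is a sum over the vector's dimensions, classification cost grows with the size of the feature vocabulary, which is why shrinking the word vector space directly lowers classification time.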

Description

Technical field

[0001] The invention belongs to the technical field of big data text analysis, and specifically relates to an implementation method for an analysis model that classifies massive long text data. The model uses the CHI algorithm to extract the feature words of each category of text, the TFIDF algorithm to produce a vectorized representation of the text, and the naive Bayesian method for training and classification.

Background

[0002] Information technology is developing rapidly, and with it scientific knowledge has grown explosively in a short period of time. A large amount of information is produced every day. 500,000 books are published in the world every year, and a new book is published every minute. On average, 13,000 to 14,000 papers containing new knowledge are published every day; more than 300,000 invention-creation patents are registered every year, and an average of ...


Application Information


IPC(8): G06F17/30, G06F17/27

CPC: G06F16/35, G06F40/216, G06F40/289

Inventor 王宇徐晓燕周渊刘庆良郑彩娟黄成周游王海平马雪

Owner BEIJING SCISTOR TECH
