
Multi-strategy combined document automatic classification method

An automatic classification method using multi-strategy technology, applied in fields such as special data processing applications, instruments, and electrical digital data processing. It addresses problems such as a large amount of calculation, improving classification efficiency without losing classification accuracy.

Inactive Publication Date: 2013-05-08
IOL WUHAN INFORMATION TECH CO LTD
Cites: 2; Cited by: 21

AI Technical Summary

Problems solved by technology

Another disadvantage of the KNN method is its large amount of calculation: for each text to be classified, the distance to every known sample must be computed to find its K nearest neighbors.
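
The cost described above can be illustrated with a minimal brute-force KNN sketch. It is not taken from the patent: the sparse term-weight dictionaries, the cosine similarity measure, and the names `cosine` and `knn_classify` are assumptions made for illustration. The function also returns the number of comparisons so the per-document cost is visible.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_classify(doc, samples, k=3):
    """Classify `doc` by majority vote among its k most similar samples.

    `samples` is a list of (vector, label) pairs. Every sample is compared
    against `doc`, so the work grows linearly with the number of samples.
    """
    scored = sorted(
        ((cosine(doc, vec), label) for vec, label in samples),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    comparisons = len(samples)  # one similarity computation per known sample
    return votes.most_common(1)[0][0], comparisons
```

Each call touches every known sample once, which is the behaviour the summary identifies as the bottleneck.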



Examples


Embodiment Construction

[0028] The advantage of the vector space method is its fast classification speed, with a time complexity of O(m), but its classification accuracy is not high, and its precision and recall are low. The KNN nearest neighbor algorithm has good classification accuracy, but each classification requires n calculations, giving a time complexity of O(n), which greatly exceeds that of the vector space method. If the strengths of the two can be combined, classification efficiency can be greatly improved without losing classification accuracy.
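
A minimal sketch of the vector space (class center) approach follows, to show where the O(m) cost comes from. It reads m as the number of classes, which the text does not state explicitly. Computing each class center as the mean of that class's sample vectors, the cosine similarity measure, and the names `class_centers` and `centroid_classify` are all assumptions for illustration, not details confirmed by the patent.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def class_centers(samples):
    """Average the vectors of each class (assumed definition of a class center)."""
    sums, counts = {}, Counter()
    for vec, label in samples:
        counts[label] += 1
        center = sums.setdefault(label, {})
        for term, weight in vec.items():
            center[term] = center.get(term, 0.0) + weight
    return {
        label: {t: w / counts[label] for t, w in center.items()}
        for label, center in sums.items()
    }


def centroid_classify(doc, centers):
    """Return (similarity, label) of the most similar class center.

    Only one comparison per class is needed, independent of how many
    samples were used to build the centers.
    """
    return max(
        ((cosine(doc, center), label) for label, center in centers.items()),
        key=lambda pair: pair[0],
    )
```

The centers can be computed once in advance, so classifying a new document costs one similarity computation per class rather than one per sample.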

[0029] The present invention observes that when the similarity between the document to be classified and the center vector of a predefined class is greater than a certain threshold, the vector space method and the KNN method differ little in classification accuracy and often produce the same classification result. The present invention therefore sets a threshold value, and when the similarity with a certa...


Abstract

The invention discloses a multi-strategy combined document automatic classification method comprising the following steps: extracting keywords from a given document to be classified, obtaining a vector of the document from the keywords, and obtaining a class center vector for each class in a standard document base; obtaining a class threshold value from the class center vector; and comparing the similarity between the document vector and the class center vector with the class threshold value. When the similarity is greater than the class threshold value, the document is classified using the vector space model; when the similarity is not greater than the class threshold value, the document is classified using the k-nearest neighbor (KNN) method. Compared with the prior art, this technical scheme can greatly improve the efficiency of document classification without reducing classification precision.
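
The combined strategy in the abstract can be sketched roughly as follows. This is an illustrative reading, not the patented implementation: the abstract does not say how the class threshold is derived from the class center vector, so `threshold` is simply passed in as a hypothetical parameter, and the cosine similarity, the mean-based class centers, and all function names are likewise assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def class_centers(samples):
    """Mean vector per class (assumed definition of the class center)."""
    sums, counts = {}, Counter()
    for vec, label in samples:
        counts[label] += 1
        center = sums.setdefault(label, {})
        for term, weight in vec.items():
            center[term] = center.get(term, 0.0) + weight
    return {
        label: {t: w / counts[label] for t, w in center.items()}
        for label, center in sums.items()
    }


def knn_label(doc, samples, k):
    """Majority label among the k samples most similar to `doc`."""
    ranked = sorted(samples, key=lambda s: cosine(doc, s[0]), reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]


def classify(doc, samples, centers, threshold, k=3):
    """Use the class-center result when it is confident, else fall back to KNN.

    Returns (label, strategy), where strategy is "vsm" or "knn".
    `threshold` is a hypothetical, caller-supplied value.
    """
    best_label, best_sim = None, -1.0
    for label, center in centers.items():
        sim = cosine(doc, center)
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_sim > threshold:
        return best_label, "vsm"  # cheap path: one comparison per class
    return knn_label(doc, samples, k), "knn"  # expensive path: all samples
```

Under this reading, documents that sit close to a class center take the cheap path, and only the ambiguous ones pay for the full nearest-neighbor search.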

Description

Technical field

[0001] The invention relates to a computer document classification method, in particular to a multi-strategy combined document automatic classification method.

Background

[0002] Automatic Text Categorization, or text classification for short, refers to the process of using a computer to assign an article to one or more predetermined categories. Accurate and efficient text classification is an important part of many data management tasks and of text mining. Before the 1990s, the dominant text classification method had always been the one based on knowledge engineering, that is, manual classification by professionals. Manual classification is very time-consuming and inefficient. Since the 1990s, many statistical methods and machine learning methods have been applied to automatic text classification, and the research on text classification technology has aroused great inter...


Application Information

IPC(8): G06F17/30
Inventor: 江潮
Owner: IOL WUHAN INFORMATION TECH CO LTD