
Chinese text classification method based on Base64 coding

A text classification technique, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the high dimensionality of the data set's feature space, low word segmentation accuracy, and slow speed, and achieves good classification results.

Status: Inactive. Publication Date: 2011-06-01
ZHEJIANG UNIV
3 Cites, 19 Cited by

AI Technical Summary

Problems solved by technology

Classification that relies on Chinese word segmentation is slow and segments inaccurately, which often degrades the final classification performance.
In particular, N-gram features based on Chinese word strings can only be obtained after word segmentation, which is slow and also makes the feature space of the data set too high-dimensional.



Examples


Embodiment

[0043] The flow of the Chinese text classification method based on Base64 is shown in Figure 1, which mainly includes the following steps:

[0044] 1) Encode the Chinese text with Base64, converting it into a string composed of ASCII letters, digits, and a few symbols;

[0045] For webpage text in HTML format, the useful text must be extracted in advance, that is, the format tags must be removed. Because webpage markup follows a fixed tag format, the page can be scanned to extract information such as the header, title, keywords, abstract, and body text, while discarding useless scripts, comments, and form information. After these operations, the text is saved as a file.
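The tag-removal step can be sketched with Python's standard `html.parser` module. This is an illustrative sketch rather than the patent's implementation: the class name `TextExtractor`, the helper `extract_text`, and the choice to skip `script`, `style`, and `form` elements are assumptions, and the sketch collects only title and body text, not meta keywords or abstracts.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping script, style, and form content."""

    SKIP = {"script", "style", "form"}  # assumed set of "useless" elements

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # HTML comments never reach handle_data, so they are dropped implicitly.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text content of an HTML page, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

The returned string can then be written to a file, as the paragraph above describes.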

[0046] Base64 is a simple and effective encoding method that is widely used to convert data for network transmission. It encodes data using 64 ASCII characters ("A-Z", "a-z", "0-9", "+", "/") plus the suffix character "=",...
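As a brief illustration of the encoding step, the sketch below uses Python's standard `base64` module. The source does not state which byte encoding the Chinese text uses before Base64 is applied, so UTF-8 is an assumption here, and `encode_text` is a hypothetical helper name.

```python
import base64


def encode_text(text, charset="utf-8"):
    """Base64-encode text; the UTF-8 default is an assumption, not from the source."""
    return base64.b64encode(text.encode(charset)).decode("ascii")


# Every 3 input bytes become 4 output characters; "=" pads a final short group.
print(encode_text("中文"))   # 6 bytes in UTF-8, so no padding is needed
print(encode_text("中a"))    # 4 bytes in UTF-8, so the output ends in "=="
```

A different byte encoding (for example GBK) would produce different Base64 strings for the same text, so the same charset has to be used for training and classification.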



Abstract

The invention discloses a Chinese text preprocessing method based on Base64 encoding, comprising the following steps: (1) preprocessing the Chinese text with Base64 encoding, converting it to a character string; (2) segmenting the converted string and extracting text feature items using 4-grams; (3) screening the feature items by information gain (IG) to generate a feature space; (4) counting the frequency of each feature item, using that frequency as the item's weight, and representing the Chinese text as a feature vector; and (5) training a support vector machine (SVM) classifier with the LIBLINEAR toolbox to obtain an SVM classification model, then classifying the Chinese text to determine its category. Because Base64 encoding is used to encode the Chinese text and 4-grams are used to extract feature items, the method avoids the time cost and low accuracy of Chinese word segmentation during classification; meanwhile, because IG is used for feature selection and term frequency is used to represent text features, the accuracy and efficiency of Chinese text classification can be effectively improved.
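Steps (1) through (4) can be sketched in plain Python as follows. Several points are assumptions rather than details from the source: the text is UTF-8 encoded before Base64, the 4-grams are overlapping sliding-window character 4-grams, and IG is the common presence/absence formulation. All function names are hypothetical. Step (5), training with LIBLINEAR, is omitted so the sketch stays within the standard library.

```python
import base64
import math
from collections import Counter


def four_grams(encoded, n=4):
    """Overlapping character n-grams of a Base64 string (sliding window assumed)."""
    return [encoded[i:i + n] for i in range(len(encoded) - n + 1)]


def entropy(counts):
    """Shannon entropy in bits of a collection of class counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in document' w.r.t. the class label."""
    present = Counter(l for d, l in zip(docs, labels) if term in d)
    absent = Counter(l for d, l in zip(docs, labels) if term not in d)
    conditional = 0.0
    for part in (present, absent):
        size = sum(part.values())
        if size:
            conditional += size / len(docs) * entropy(part.values())
    return entropy(Counter(labels).values()) - conditional


def build_vectors(texts, labels, k):
    """Return the top-k features by IG and one term-frequency vector per text."""
    docs = [
        Counter(four_grams(base64.b64encode(t.encode("utf-8")).decode("ascii")))
        for t in texts
    ]
    vocab = set().union(*docs)
    # Sort by descending IG; ties are broken alphabetically for determinism.
    ranked = sorted(vocab, key=lambda g: (-information_gain(docs, labels, g), g))
    features = ranked[:k]
    vectors = [[d[g] for g in features] for d in docs]
    return features, vectors


texts = ["体育比赛", "体育新闻", "财经新闻", "财经市场"]
labels = ["sports", "sports", "finance", "finance"]
features, vectors = build_vectors(texts, labels, k=5)
```

The resulting `vectors` could be passed to any linear SVM trainer; the abstract names LIBLINEAR for that step.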

Description

Technical field

[0001] The invention relates to Chinese information processing, and in particular to a method for classifying Chinese texts based on Base64 encoding.

Background art

[0002] With the continuous development of information technology, and especially the spread and improvement of Internet technology, information on the Internet keeps growing. How to organize and manage these resources efficiently while locating useful information quickly and accurately has become an important and urgent task in the information age. Chinese is the most spoken language in the world and one of the official languages designated by the United Nations. With the development of the Internet and the rapid growth of China's economy, Chinese information circulates ever more widely around the world. Therefore, the study of large-scale Chinese texts has great practical significance for China's ...


Application Information

IPC(8): G06F17/30
Inventor 徐从富, 陈雅芳, 张志华
Owner ZHEJIANG UNIV