Method for identifying coding form of Chinese text

A technique for identifying the encoding form of Chinese text. It addresses the long running time of existing methods and their low recognition accuracy on short text, and achieves fast, accurate recognition.

Inactive Publication Date: 2007-08-08
INST OF COMPUTING TECH CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

This voting method takes more time, and th...



Examples


Embodiment Construction

[0044] To make the object, technical solution and advantages of the present invention clearer, the invention is described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.

[0045] The core idea of the present invention is to integrate the known methods for recognizing Chinese character encodings. The text to be recognized passes through up to three stages: serialization, word segmentation, and statistics. The encoding can be recognized at any of these stages, and once a stage determines the encoding form of the text, the subsequent recognition stages are skipped.

[0046] Fig. 1 is a flow chart of the overall technical scheme for identifying the encoding form of Chinese text provided by the present invention. The method comprises the following steps:

[0047] Step 101: Convert the text to be recognized into an integer ID sequence under each of the candidate encoding forms;
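Step 101 can be illustrated with a minimal sketch. It assumes, for illustration only, that the "integer ID" of a character is its Unicode code point and that the candidate encodings are Python's `gb18030`, `big5` and `utf-8` codecs; the patent's own ID scheme and candidate list are not given in this excerpt, and the function name `to_id_sequences` is hypothetical.

```python
# Hypothetical candidate list; the patent text does not enumerate the codecs to try.
CANDIDATE_ENCODINGS = ("gb18030", "big5", "utf-8")


def to_id_sequences(raw: bytes, encodings=CANDIDATE_ENCODINGS) -> dict:
    """Return {encoding name: integer ID sequence} for every encoding
    under which the raw bytes can be converted without error.

    Assumption: a character's integer ID is its Unicode code point.
    """
    sequences = {}
    for name in encodings:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            # The bytes are not valid under this encoding, so no ID
            # sequence exists for it.
            continue
        sequences[name] = [ord(ch) for ch in text]
    return sequences
```

An encoding that is missing from the returned mapping has already been ruled out, which is what allows a later step to stop early when only one candidate remains.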

[0...


Abstract

The invention discloses a method for identifying the encoding form of Chinese text, comprising: A. converting the text to be identified into an integer ID sequence under each of the candidate encoding forms; B. judging whether the text can be converted into an integer ID sequence under only one encoding form; if so, executing step D; otherwise, executing step C; C. segmenting into words the integer ID sequences obtained from the text under the various encoding forms, and judging whether the integer ID sequence under some encoding form contains one or more terms from the dictionary; if so, executing step D; D. determining that the text to be identified is encoded in that encoding form. The invention greatly improves the speed and accuracy of Chinese character encoding recognition and can effectively identify the encoding form of short Chinese text.
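Steps A to D can be sketched as follows. This is an illustrative approximation rather than the patented procedure: it assumes Python codecs as the candidate encodings, works on decoded strings instead of integer ID sequences, and replaces word segmentation with a plain substring search against a caller-supplied dictionary. The name `identify_encoding` is hypothetical.

```python
def identify_encoding(raw: bytes, dictionary,
                      encodings=("gb18030", "big5", "utf-8")):
    """Return the name of the first encoding accepted by steps B or C,
    or None when neither step reaches a decision."""
    # Step A: attempt the conversion under every candidate encoding.
    decoded = {}
    for name in encodings:
        try:
            decoded[name] = raw.decode(name)
        except UnicodeDecodeError:
            pass  # conversion failed, so this encoding is ruled out

    # Step B: if only one encoding converts, it is the answer (step D).
    if len(decoded) == 1:
        return next(iter(decoded))

    # Step C (simplified): accept the first encoding whose decoded text
    # contains at least one dictionary term, then go to step D.
    for name, text in decoded.items():
        if any(term in text for term in dictionary):
            return name

    # The abstract does not say what follows a failed step C; the
    # description mentions a later statistics stage, not sketched here.
    return None
```

For example, `identify_encoding("中文编码".encode("gbk"), {"中文", "编码"})` returns `"gb18030"`, since GBK bytes are not valid UTF-8 and the GB18030 decoding contains dictionary terms.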

Description

Technical field

[0001] The invention relates to the technical field of Chinese character recognition for information retrieval, and in particular to a method for recognizing the encoding form of Chinese text.

Background art

[0002] For historical, geographical and other reasons, Chinese characters are stored and processed in computers under several encoding forms. The three most common are:

[0003] 1. The national standard codes formulated in mainland China, including GB2312, GBK and GB18030;

[0004] 2. Traditional Chinese character codes formulated in Hong Kong, Macao, Taiwan and other regions, including BIG-5, BIG-5E and HKSCS;

[0005] 3. International standards for Chinese character encoding, including ISO 10646, Unicode, etc.

[0006] The transition of Chinese character encoding to ISO 10646 is a long process, during which the various Chinese character encoding forms will coexist. This will inevitably require the operating ...


Application Information

IPC(8): G06F17/27, G06F17/28
Inventor 龚才春
Owner INST OF COMPUTING TECH CHINESE ACAD OF SCI