
Tibetan web page and its code identification method

An identification method from the field of character-encoding recognition. It addresses the problem that the charset and encoding keywords of a web page do not correspond to Tibetan values, so they cannot be used to judge whether the page is Tibetan.

Inactive Publication Date: 2007-10-17
INST OF SOFTWARE - CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

Because most Tibetan encodings are custom encodings, the charset and encoding keywords of a web page have no Tibetan-specific values. Pages in these encodings instead borrow values from other scripts: a Tibetan web page may declare, for example, "charset=gb2312" or "charset=ascii". Such declarations therefore cannot be used to judge whether a web page is Tibetan-encoded.
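To make the problem concrete, here is a minimal Python sketch that reads the declared charset from a page. The helper name `declared_charset` and the sample markup are illustrative assumptions, not part of the patent text. The point is only that the returned label ("gb2312" here) says nothing about whether the bytes are actually Tibetan.

```python
import re

# Matches charset=VALUE in a meta tag or header-like string (illustrative pattern).
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE)


def declared_charset(page: bytes):
    """Return the declared charset label in lowercase, or None if absent."""
    match = CHARSET_RE.search(page)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


# Hypothetical page: the declaration borrows a Chinese charset label,
# so the label alone cannot identify a custom Tibetan encoding.
sample_page = b'<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
label = declared_charset(sample_page)
```

A label such as this one is the same whether the page body holds Chinese text or Tibetan text in a custom encoding, which is why the method scans the byte stream itself.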



Examples


Embodiment 1

[0044] Embodiment 1 uses criterion 4 to identify whether the Tibetan encoding of the webpage is the Founder DOS encoding.

[0045] In the Founder DOS encoding, the code of the syllable node is "C0 32" (hexadecimal, likewise below). Scan the webpage character stream; each time the number of characters between two adjacent occurrences of "C0 32" is between 1 and 7, increment the counter by one. If the counter reaches the preset threshold (for example, 10) before the scan of the current webpage finishes, the webpage is considered a Tibetan webpage that uses the Founder DOS encoding.
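The scan in Embodiment 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the function name `reaches_gap_threshold` is hypothetical, and the gap between two adjacent syllable nodes is measured in bytes, whereas the text speaks of "characters" without defining the unit.

```python
FOUNDER_DOS_NODE = bytes.fromhex("C0 32")  # syllable node code given in the text


def reaches_gap_threshold(data: bytes, node: bytes,
                          min_gap: int = 1, max_gap: int = 7,
                          threshold: int = 10) -> bool:
    """Return True once enough adjacent node pairs are min_gap..max_gap apart.

    Assumption: the gap is counted in bytes between the end of one node
    and the start of the next.
    """
    counter = 0
    prev_end = None  # index just past the previous node, if any
    pos = data.find(node)
    while pos != -1:
        if prev_end is not None and min_gap <= pos - prev_end <= max_gap:
            counter += 1
            if counter >= threshold:
                return True  # stop early, before the scan finishes
        prev_end = pos + len(node)
        pos = data.find(node, prev_end)
    return False


# Hypothetical input: 12 nodes, each followed by a 2-byte syllable body.
sample = (FOUNDER_DOS_NODE + b"ab") * 12
is_founder_dos = reaches_gap_threshold(sample, FOUNDER_DOS_NODE)
```

The early return mirrors the text's condition that the counter reach the threshold before the scan ends. A plain byte search can also match a node that straddles two multi-byte characters, so a production scanner would likely need alignment handling that this sketch omits.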

Embodiment 2

[0046] Embodiment 2 uses criterion 4 to identify whether the Tibetan encoding of the webpage is the Founder Windows encoding.

[0047] The process is the same as in Embodiment 1, except that the code of the syllable node changes from "C0 32" to "AAAC".

Embodiment 3

[0048] Embodiment 3 uses criterion 3 to identify whether the webpage uses the TCRC encoding.

[0049] In the TCRC encoding, the code of the syllable node is "2D" (hexadecimal, likewise below), and the TCRC code sequence of one high-frequency syllable is "7A F4 68", so "2D 7A F4 68 2D" is the characteristic string. Count the number of times it appears in the current webpage; if the count is greater than the threshold (for example, 10), the webpage is considered a Tibetan webpage that uses the TCRC encoding.
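A minimal Python sketch of the counting step in Embodiment 3 follows. The function names are hypothetical, and counting overlapping matches is an assumption: the trailing "2D" of one occurrence can serve as the leading "2D" of the next, and the text does not say whether such occurrences count separately.

```python
TCRC_NODE = bytes.fromhex("2D")            # syllable node code given in the text
TCRC_SYLLABLE = bytes.fromhex("7A F4 68")  # high-frequency syllable given in the text
TCRC_FEATURE = TCRC_NODE + TCRC_SYLLABLE + TCRC_NODE  # "2D 7A F4 68 2D"


def count_feature_string(data: bytes, feature: bytes) -> int:
    """Count occurrences of feature in data, allowing overlaps (assumption)."""
    count = 0
    pos = data.find(feature)
    while pos != -1:
        count += 1
        pos = data.find(feature, pos + 1)  # advance one byte so overlaps are seen
    return count


def looks_like_tcrc(data: bytes, threshold: int = 10) -> bool:
    """True when the characteristic string appears more than threshold times."""
    return count_feature_string(data, TCRC_FEATURE) > threshold


# Hypothetical input: the same syllable repeated 11 times with shared nodes.
sample = TCRC_NODE + (TCRC_SYLLABLE + TCRC_NODE) * 11
is_tcrc = looks_like_tcrc(sample)
```

Unlike the gap criterion of Embodiment 1, the comparison here is strict ("greater than the threshold"), so exactly 10 occurrences would not qualify under the example threshold.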


Abstract

The invention relates to a method for identifying Tibetan web pages and their encodings. The method comprises the following steps: first, for each Tibetan encoding, give the code of a characteristic string, which consists of the syllable node and/or a selected high-frequency syllable; scan the webpage character stream, searching with the code of the characteristic string as the keyword; use a counter to count how often characters matching the code of the characteristic string appear; and determine from the counter result whether the webpage is a Tibetan webpage and which Tibetan encoding it uses. The invention exploits the syllable structure of the Tibetan language and the statistical characteristics of Tibetan words, and applies a separate identification criterion to each encoding. Tibetan and non-Tibetan webpages can thus be distinguished efficiently and correctly, and the Tibetan encoding used by a webpage can also be identified.

Description

[0001] Technical field

[0002] The invention belongs to the technical field of character code recognition, and in particular relates to a method for identifying Tibetan web pages and their encodings.

Background

[0003] With the development of the Internet, there is more and more information online, which brings great convenience to people's lives. Finding the data we need in massive network data is a very real problem, and the emergence of search engines has solved it. In the past two years, search engines have developed rapidly, and many Chinese search engines with their own characteristics have emerged, such as Baidu, Sogou, and Kuxun. In contrast, search products for Tibetan, a minority language, have not yet appeared.

[0004] The functional modules of search engines can generally be divided into foreground and background. The foreground provides an interface for interacting with end users. The background nee...


Application Information

IPC(8): G06F17/30
Inventor 吴健, 芮建武, 刘汇丹
Owner INST OF SOFTWARE - CHINESE ACAD OF SCI