Method for automatic indexing and searching word and word attributes in Chinese text

A technique for automatically indexing words and word attributes, applied in fields such as special data processing applications, instruments, and electrical digital data processing.

Inactive Publication Date: 2005-03-16
LANGUAGE INFORMATION PROCESSING INST OF BEIJING LANGUAGE & CULTURE UNIV

AI Technical Summary

Problems solved by technology

[0010] The second problem is that the searchable objects are too limited. Because manual processing consumes so much labor, a large-scale corpus cannot be processed, so the retrievable material is very limited. For example, if the 1998 People's Daily corpus has been processed, language phenomena related to words and word attributes in the 1998 People's Daily can be retrieved, but those in the 1999 edition cannot. The same holds for novels such as "Dream of Red Mansions" and "Camel Xiangzi" and for Taiwan and Hong Kong corpora: as long as a corpus has not been processed, it cannot be searched. This approach therefore falls far short of users' wide-ranging needs.
[0011] The third problem is that the word attribute system and the word attribute labeling are too rigid. Once a corpus tagging project starts, however, the attribute system cannot be changed. After tagging is complete, which tokens exist and which attributes they carry are also fixed, so the result cannot meet the different needs of different people.
[0012] The fourth problem is that accuracy is not very high. In addition, with existing technology, word attributes are disambiguated automatically and then corrected manually; the errors that remain make attribute retrieval results inaccurate and, more seriously, mean that the completeness of the retrieval results cannot be guaranteed.



Examples

Example 1

[0104] Suppose the retrieval condition is "adverb, equal to 1, adjective", that is, retrieve all instances in which an adverb is immediately followed by an adjective.

[0105] There are 2 conditional items in the retrieval condition: adverb and adjective.

[0106] The left condition item, "adverb", is a word attribute whose feature code is 0000 2000 (hexadecimal). In the word attribute index, the only primary keys whose logical AND with this feature code is nonzero are 0000 2010 and 0000 2008, and they have 4 associated words: "one", "no", "more", and "always". Looking these words up in the word index gives the starting positions of the corresponding word instances: 17, 49, 19, 30, 58, and 68.
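The lookup in [0106] can be sketched as a bit-mask scan over the attribute index followed by a word index lookup. This is a minimal illustration, not the patent's implementation: the feature code and primary keys come from the example, but the grouping of words under each key, the split of positions among words, and the function names are all assumptions made for the sketch.

```python
ADVERB = 0x00002000  # feature code of "adverb" from [0106]

# Hypothetical attribute index: primary key -> associated words.
# Which word sits under which key is assumed, not stated in the source.
attribute_index = {
    0x00002010: ["one", "no"],
    0x00002008: ["more", "always"],
    0x00200008: ["encourage"],  # a non-adverb key that must not match
}

# Hypothetical word index: word -> starting positions.
# The split of the six positions among the four words is assumed.
word_index = {
    "one": [17, 49],
    "no": [19],
    "more": [30, 58],
    "always": [68],
    "encourage": [73],
}


def words_with_attribute(feature_code, attr_index):
    """Collect the words under every primary key that shares a bit with feature_code."""
    matched = []
    for key, words in attr_index.items():
        if (key & feature_code) != 0:
            matched.extend(words)
    return matched


def positions_of(words, w_index):
    """Return the sorted starting positions of all given words."""
    return sorted(p for w in words for p in w_index.get(w, []))


adverbs = words_with_attribute(ADVERB, attribute_index)
adverb_positions = positions_of(adverbs, word_index)
```

Scanning every key is adequate for a sketch; a real index would presumably organize keys so that matching ones can be found without a full pass.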

[0107] The right condition item, "adjective", is a word attribute whose feature codes are 0010 0010 and 0010 0008 (hexadecimal). In the word attribute index, the only primary keys whose logical AND with these feature codes is nonzero are ...

Example 2

[0113] Now let the retrieval condition be "verb ∧ two characters, less than 4, noun ∧ two characters", that is, retrieve all instances in which a two-character noun is the 1st, 2nd, or 3rd word after a two-character verb.
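The distance part of this condition can be sketched as a window scan over the segmentation result. The sketch assumes that distance is counted in words (so "equal to 1" means the next word and "less than 4" means the 1st to 3rd word after); the function name `match_within` and the English placeholder tokens are hypothetical and stand in for segmented Chinese text.

```python
def match_within(tokens, left_words, right_words, max_distance):
    """Return (i, j) index pairs where tokens[i] is a left word, tokens[j] is a
    right word, and 1 <= j - i < max_distance."""
    hits = []
    for i, token in enumerate(tokens):
        if token not in left_words:
            continue
        # Look at the words that follow, up to max_distance - 1 words away.
        for j in range(i + 1, min(i + max_distance, len(tokens))):
            if tokens[j] in right_words:
                hits.append((i, j))
    return hits


# Hypothetical segmented sentence and word sets, for illustration only.
tokens = ["we", "support", "the", "new", "policy", "and", "continue", "reform"]
verbs = {"support", "continue"}
nouns = {"policy", "reform"}

hits = match_within(tokens, verbs, nouns, max_distance=4)
```

Here `hits` pairs "support" with "policy" (3 words later) and "continue" with "reform" (the next word). Tightening `max_distance` to 3 would drop the first pair.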

[0114] The left condition item, "verb ∧ two characters", consists of word attributes connected by an AND operation, and the sum of the feature codes of "verb" and "two characters" is 0020 0008 (hexadecimal). In the word attribute index, the only primary key whose logical AND with this feature code is nonzero is 0020 0008 itself, and it has 5 associated words: "encourage", "continue", "lose", "support", and "face to face". Looking these words up in the word index gives the starting positions of the corresponding word instances: 73, 82, 39, 70, and 77.
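The arithmetic in [0114] can be checked with a short sketch. The individual codes for "verb" and "two characters" are not given in the text; the values below are inferred from the stated sum 0020 0008 and are assumptions, as is the list of primary keys. The sketch also contrasts two possible tests, because the text states the match as a nonzero AND result while an AND-connected condition might be expected to require every bit of the combined code; which one the patent intends is not clear from this excerpt.

```python
VERB = 0x00200000       # assumed, inferred from the sum 0020 0008
TWO_CHARS = 0x00000008  # assumed, inferred from the sum 0020 0008

# With no shared bits, bitwise OR and arithmetic sum give the same code.
combined = VERB | TWO_CHARS


def has_all_bits(key, code):
    """True if every bit set in code is also set in key."""
    return (key & code) == code


def shares_a_bit(key, code):
    """True if key and code have at least one set bit in common."""
    return (key & code) != 0


# Hypothetical primary keys, for illustration only.
keys = [0x00200008, 0x00200010, 0x00002008]

strict_matches = [k for k in keys if has_all_bits(k, combined)]
loose_matches = [k for k in keys if shares_a_bit(k, combined)]
```

On these hypothetical keys the all-bits test selects only 0020 0008, while the nonzero test also selects keys that share just the "verb" bit or just the "two characters" bit.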

[0115] The right condition item "noun ∧ two characters" is a word attribute connected by "AND" operation, and the sum of the feature codes of "noun" and "two characters" is 0040 000...


Abstract

The invention relates to a method for automatically indexing and searching words and their attributes in Chinese text. The method encodes word attributes and automatically converts the user's lexicon into a built-in lexicon. It then automatically segments the user's corpus with the built-in lexicon to produce a word segmentation result. The word index and the word attribute index are then generated automatically. Given the user's search condition, the method automatically searches the user's corpus with the help of the word attribute index, the word index, and the segmentation result, and returns the search result. The invention makes automatic indexing and searching of words and their attributes possible for any previously unseen corpus, based on any lexicon and any word attribute system. It eliminates corpus annotation work and improves efficiency.
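The indexing steps in the abstract can be sketched end to end as follows. The abstract does not say which segmentation algorithm is used; forward maximum matching is swapped in here as a common, simple choice. The lexicon entries, attribute codes, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical built-in lexicon: word -> attribute code (codes are invented).
lexicon = {"我们": 0x01, "支持": 0x02, "改革": 0x04, "改": 0x02}


def segment(text, lexicon):
    """Forward maximum matching. Returns (start_position, word) pairs;
    characters not covered by the lexicon become single-character tokens."""
    max_len = max(len(word) for word in lexicon)
    segments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                segments.append((i, candidate))
                i += size
                break
    return segments


def build_indexes(segments, lexicon):
    """Build a word index (word -> positions) and an attribute index
    (attribute code -> set of words) from the segmentation result."""
    word_index = defaultdict(list)
    attr_index = defaultdict(set)
    for position, word in segments:
        word_index[word].append(position)
        if word in lexicon:
            attr_index[lexicon[word]].add(word)
    return word_index, attr_index


segments = segment("我们支持改革", lexicon)
word_index, attr_index = build_indexes(segments, lexicon)
```

Because both indexes are derived from the segmentation result and the lexicon alone, swapping in a different lexicon or attribute coding only requires rerunning these two steps, with no manual corpus tagging.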

Description

1. Technical field

[0001] The invention relates to text retrieval technology, and in particular to a method for automatically indexing and retrieving words and word attributes in Chinese text.

2. Background technology

[0002] People engaged in Chinese language teaching, language research, dictionary compilation, and language engineering (such as machine translation, text-to-speech, and automatic speech recognition) need to accumulate a large number of language examples related to words and word attributes, such as instances in which "so" occurs a few words after "because", instances of an adjective followed by an adverb, and instances of a verb after the preposition "ba". These examples are used for collecting language examples, compiling statistics on language phenomena, and inducing language rules.

[0003] In the past, the accumulation, statistics and rule induction of language examples mainly relied on manual copying of cards. With the development and popularization of co...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 宋柔, 樊太志, 罗智勇, 荀恩东
Owner: LANGUAGE INFORMATION PROCESSING INST OF BEIJING LANGUAGE & CULTURE UNIV