Corpus tagging method and a corpus tagging device

A corpus tagging technique, applied in fields such as semantic tool creation, natural language data processing, and special data processing applications. It addresses the redundancy of existing labeling processes and offers a simpler algorithm, good reliability, and less repetitive manual processing work.

Active Publication Date: 2019-03-26
XIAMEN KUAISHANGTONG INFORMATION TECH CO LTD

AI Technical Summary

Problems solved by technology

[0006] The second approach uses an unsupervised (clustering) algorithm to cluster the data to be labeled and then assigns a label to each resulting category. This method can label the corpus directly without relying on much prior information, but some manual intervention is required afterwards.
[0007] For this second approach, the most common choice is to pre-label the corpus with k-means as the core algorithm. The drawback is that k-means requires the number of clusters to be given in advance. In practice, a cluster count is first chosen from experience and then adjusted repeatedly according to the clustering results until a suitable value is found, which makes the whole process redundant.
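The trial-and-adjust loop described in [0007] can be illustrated with a toy sketch. This is not taken from the patent: the function name `kmeans_1d`, the one-dimensional data, and the evenly spaced initialisation are all assumptions made for illustration. It only shows that k-means must be re-run for each candidate cluster count and that someone then has to judge which result is acceptable.

```python
def kmeans_1d(points, k, iters=20):
    """Tiny 1-D k-means; returns (centroids, inertia)."""
    pts = sorted(points)
    # Assumption: start from k evenly spaced data points (deterministic).
    centroids = [pts[(i * (len(pts) - 1)) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            groups[nearest].append(p)
        # Keep the old centroid if a group happens to be empty.
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    inertia = sum(min((p - c) ** 2 for c in centroids) for p in pts)
    return centroids, inertia


# Toy data with three obvious groups (hypothetical values).
data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]

# The cluster count is not known up front, so every candidate is tried
# and the results are compared afterwards.
inertia_by_k = {k: kmeans_1d(data, k)[1] for k in range(1, 6)}
for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 2))
```

On this data the within-cluster error drops sharply up to k = 3 and changes little afterwards, but picking that point still depends on inspection, which is the repeated adjustment the paragraph above objects to.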


Examples


Detailed Description of the Embodiments

[0034] To make the technical problems to be solved, the technical solutions, and the beneficial effects of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit it.

[0035] As shown in Figure 1, a corpus tagging method of the present invention comprises the following steps:

[0036] a. vectorizing the corpus to be processed to obtain the text vectors of the corpus;

[0037] b. clustering the corpus with the DBSCAN clustering algorithm according to the text vectors of the corpus, to obtain a long-tail class corpus and a to-be-labeled class corpus;

[0038] c. for the long-tail class corpus, returning to step b; for the to-be-labeled class corpus, setting labels to obtain the labeled corpus; ...
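Steps a to c can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text available here does not say how the corpus is vectorized, how the long-tail class is defined, or what changes between clustering rounds. The sketch assumes bag-of-words vectors, cosine distance, that the long-tail class is the set of points DBSCAN marks as noise, and that the neighbourhood radius `eps` is relaxed on each round. All function and parameter names (`vectorize`, `tag_corpus`, `eps_step`, `max_rounds`) are hypothetical.

```python
import math
from collections import Counter


def vectorize(texts):
    """Step a (assumed form): bag-of-words count vectors."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    return [[Counter(t.lower().split())[w] for w in vocab] for t in texts]


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def dbscan(vectors, eps, min_samples):
    """Plain DBSCAN; returns one label per vector, with -1 meaning noise."""
    n = len(vectors)
    neighbors = [[j for j in range(n)
                  if cosine_distance(vectors[i], vectors[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # not a core point: provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # expand through core points
    return labels


def tag_corpus(texts, eps=0.4, min_samples=2, eps_step=0.35, max_rounds=3):
    """Steps b and c: cluster, set clusters aside, re-cluster the long tail."""
    vectors = vectorize(texts)
    remaining = list(range(len(texts)))  # indices not yet in any cluster
    groups = []                          # each group will receive one label
    for _ in range(max_rounds):
        if not remaining:
            break
        labels = dbscan([vectors[i] for i in remaining], eps, min_samples)
        clusters = {}
        for idx, lab in zip(remaining, labels):
            clusters.setdefault(lab, []).append(idx)
        remaining = clusters.pop(-1, [])  # assumed long tail: DBSCAN noise
        groups.extend(clusters.values())
        eps += eps_step                   # assumption: relax radius each round
    return groups, remaining


texts = ["reset my password", "reset password please",
         "how to reset my password", "refund my order",
         "refund order please", "where is my parcel",
         "track parcel status", "hello"]
groups, leftover = tag_corpus(texts)

# Labels are set per cluster; here they stand in for the manual step.
manual_labels = ["password_reset", "refund", "parcel_tracking"]
tagged = {texts[i]: name
          for group, name in zip(groups, manual_labels) for i in group}
```

In this toy run the first round groups the password and refund messages, the second round (with a larger `eps`) groups the two parcel messages out of the long tail, and `"hello"` stays unassigned. No cluster count is supplied at any point; only the density parameters are.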


Abstract

The invention discloses a corpus tagging method and a corpus tagging device. The corpus to be processed is vectorized to obtain the text vectors of the corpus. According to these text vectors, the corpus is clustered with the DBSCAN clustering algorithm to obtain a long-tail class corpus and a to-be-labeled class corpus. The long-tail class corpus is returned to the clustering step and clustered again; for the to-be-labeled class corpus, labels are set to obtain the labeled corpus. Finally, all labeled corpora are combined to obtain the final tagged corpus. Because the number of clusters does not have to be adjusted repeatedly, the algorithm is simpler, the tagging efficiency is higher, and the reliability is better.

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, and in particular to a corpus tagging method and a device applying the method.

Background

[0002] A corpus is the basic resource of corpus linguistics research and the main resource for empirical approaches to language research. Traditional corpora are mainly used in dictionary compilation, language teaching, traditional language research, and statistical or case-based research in natural language processing. With the development of Internet big data and artificial intelligence technology, corpora have come to be used much more widely.

[0003] A corpus stores language material that has actually occurred in real language use, such as user messages and customer service dialogues obtained directly from web pages. A corpus is the basic resource that carries language knowledge, but it is not the same as language knowledge; a real corpus needs to be processed before ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/36, G06F17/27, G06K9/62
CPC: G06F40/295, G06F18/2321
Inventors: 林志伟, 肖龙源, 蔡振华, 李稀敏, 刘晓葳, 谭玉坤
Owner: XIAMEN KUAISHANGTONG INFORMATION TECH CO LTD