Text multi-granularity similarity comparison method based on semantic aggregation fingerprints

A multi-granularity text similarity technology, applied in semantic tool creation, unstructured text data retrieval, and text database clustering/classification, that addresses problems such as the inability to provide similarity comparisons.

Active Publication Date: 2019-10-11
COMP APPL RES INST CHINA ACAD OF ENG PHYSICS
6 Cites, 13 Cited by

AI Technical Summary

Problems solved by technology

Existing methods focus only on the text itself and ignore the useful information in the text library. In addition, Simhash generates a single fingerprint for each text, so it cannot provide a detailed similarity comparison between two texts.
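To make the Simhash limitation concrete, here is a minimal sketch of a classic Simhash fingerprint. It is not the method of this patent; the whitespace tokenizer, the MD5-based token hash, and the term-frequency weights are assumptions chosen only for illustration. It shows that each text collapses to one integer, so the only comparison available is a single Hamming distance, with no indication of which parts of the texts are similar.

```python
import hashlib
from collections import Counter


def simhash(text: str, bits: int = 64) -> int:
    """Classic Simhash over whitespace tokens, weighted by term frequency."""
    weights = [0] * bits
    for token, freq in Counter(text.lower().split()).items():
        # Assumed token hash: the first 8 bytes of MD5, read as a 64-bit integer.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            # Each token votes +freq or -freq on every bit position.
            weights[i] += freq if (h >> i) & 1 else -freq
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


doc_a = "approximate text detection reduces storage overhead"
doc_b = "approximate text detection reduces search overhead"
# One number per pair of texts: a global distance and nothing finer-grained.
distance = hamming(simhash(doc_a), simhash(doc_b))
```

A small distance suggests the texts are near-duplicates, but the fingerprint alone cannot say which sentences match.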


Examples

Embodiment

[0054] As shown in Figure 1, the text multi-granularity similarity comparison method based on semantic aggregation fingerprints of the present invention comprises the following steps:

[0055] Step 1. Word vector representation training: word vectors are learned by jointly modeling multi-dimensional semantic correlations. Using a public corpus, a mapping between text and words represents the horizontal relationship between words that co-occur in the same context, and a mapping between words and contexts represents the vertical relationship between words that appear in similar contexts. Synonym and antonym information is added to the model, and the word vectors are trained by unsupervised learning, so that the trained vectors perform better on semantic correlation and synonym/antonym recognition tasks. Figure 2 shows the word vector representation model.
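The two relationships named in Step 1 can be illustrated with plain co-occurrence counts. This is only a sketch of the distinction, not the patent's training model: the toy corpus, the window size, and the helper names `cooccurrence` and `cosine` are all assumptions. Words that appear together in a context have a horizontal relationship; words that never appear together but share similar contexts have a vertical one.

```python
import math
from collections import Counter, defaultdict


def cooccurrence(sentences, window=2):
    """Count, for each word, the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def cosine(c1, c2):
    """Cosine similarity between two sparse context-count vectors."""
    dot = sum(v * c2[k] for k, v in c1.items())  # Counter yields 0 for missing keys
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


# Hypothetical two-sentence corpus.
corpus = ["the cat drinks milk", "the dog drinks water"]
counts = cooccurrence(corpus)

# Horizontal: "cat" and "drinks" co-occur in the same context.
horizontal = counts["cat"]["drinks"]
# Vertical: "cat" and "dog" never co-occur, yet their contexts overlap.
vertical = cosine(counts["cat"], counts["dog"])
```

Synonym and antonym information, which the step also adds, is not derivable from counts like these and would need an external lexical resource.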

[0056] In this step, the joint modeling of the comprehensive multidime...


Abstract

The invention discloses a text multi-granularity similarity comparison method based on semantic aggregation fingerprints. The method comprises the following steps: training word vector representations; extracting semantic features; performing multi-feature aggregation; constructing a hierarchical index; and calculating similarity. Word vector representation modeling is carried out in combination with multi-dimensional semantic correlation, so that semantic information among words is fully mined. Features are extracted with sentences as units, and semantic features are represented using multiple weights. Statistical and distribution information of the text library is mined with a statistical learning method, which gives a finer division of the feature space. A compact, highly discriminative text fingerprint is generated on the basis of multi-feature aggregation, which effectively improves the descriptive power and discrimination of the text fingerprint. Text similarity comparison follows a top-down approach that uses the semantic aggregation fingerprint together with local semantic features, and global-to-local multi-granularity similarity comparison of texts can be realized quickly and efficiently by constructing hierarchical indexes. The method has good scalability.
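The top-down, global-to-local idea in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented method: `word_vec` is a hash-based stand-in for trained word vectors, sentence features are an unweighted mean (the abstract uses multiple weights), the aggregation is a plain sum followed by a sign test (the abstract uses statistical learning over the text library), the hierarchical index is omitted, and both thresholds are arbitrary.

```python
import hashlib
import math

DIM = 16  # assumed fingerprint length


def word_vec(word):
    """Stand-in for a trained word vector: deterministic values in [-1, 1)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return [b / 128.0 - 1.0 for b in digest[:DIM]]


def split_sentences(text):
    """Naive sentence splitter on periods."""
    return [s.strip() for s in text.split(".") if s.strip()]


def sentence_feature(sentence):
    """Local feature: mean of the word vectors in one sentence."""
    vecs = [word_vec(w) for w in sentence.lower().split()]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def fingerprint(text):
    """Global feature: sum the sentence features, keep the sign of each dimension."""
    feats = [sentence_feature(s) for s in split_sentences(text)]
    return [1 if sum(col) > 0 else 0 for col in zip(*feats)]


def global_similarity(fp1, fp2):
    """Fraction of fingerprint bits that agree."""
    return sum(a == b for a, b in zip(fp1, fp2)) / len(fp1) if fp1 else 0.0


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def compare(text1, text2, global_threshold=0.75, local_threshold=0.9):
    """Compare fingerprints first; inspect sentences only if the texts look close."""
    score = global_similarity(fingerprint(text1), fingerprint(text2))
    matches = []
    if score >= global_threshold:
        for s1 in split_sentences(text1):
            f1 = sentence_feature(s1)
            for s2 in split_sentences(text2):
                if cosine(f1, sentence_feature(s2)) >= local_threshold:
                    matches.append((s1, s2))
    return score, matches


score, matches = compare("The cat sat. The dog ran.", "The cat sat. The dog ran.")
```

The design point is the ordering: the cheap fingerprint check filters candidate texts, and the more expensive sentence-level comparison runs only on pairs that pass it, which is what makes the comparison multi-granular.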

Description

Technical field

[0001] The invention relates to a text similarity comparison method, in particular to a text multi-granularity similarity comparison method based on semantic aggregation fingerprints, and belongs to the technical field of pattern recognition and information processing.

Background

[0002] Two texts are similar when the content and information they describe are similar or even identical. Two texts are considered similar if one is generated from the other by modifying a small part of its content through insertion, deletion, replacement, and so on. The spread of near-duplicate texts or web pages is generally harmful, and as data proliferates, the problems caused by near-duplicate text become more and more serious. Near-duplicate text detection is therefore an important technique for reducing storage overhead, improving search efficiency and data utilization, and preventing illegal copying and plagiarism.

[0003] Experts and scholars at h...


Application Information

IPC(8): G06K9/62, G06F16/36, G06F16/35
CPC: G06F16/35, G06F16/36, G06F18/22, G06F18/23213
Inventor 梁燕万正景陶以政李龚亮许峰曹政谢杨马丹阳
Owner COMP APPL RES INST CHINA ACAD OF ENG PHYSICS