A text deduplication method and system

A text deduplication technology, applied in the field of information processing, that addresses the missed judgments that occur when texts with low similarity refer to the same event, and achieves a good deduplication effect with high robustness.

Active Publication Date: 2021-04-02
重庆电信系统集成有限公司 +1
5 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

When existing technology handles deduplication tasks that take events as the main subject, two articles with low similarity may still refer to the same event, which leads to missed judgments.

Examples

Embodiment 1

[0073] To address the unique characteristics of web text, the present invention adjusts the strategy for applying the SimHash algorithm, and obtains better results and higher robustness when deduplication takes the events behind news texts as the subject.

[0074] The target text can be news text that a web crawler collects from the Internet. One of the problems the present invention solves is to determine which news texts in the crawled news corpus report the same event, and to categorize and deduplicate the news texts that report the same event.

[0075] As shown in Figure 1, the present invention provides a text deduplication method comprising the following steps:

[0076] a target text data preprocessing step;

[0077] a step of generating a locality-sensitive hash value of the target text body and a locality-sensitive hash value of the target text title;

[0078] a deduplication step.
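The three steps above can be sketched as follows. This is a minimal illustration of standard SimHash, not the adjusted strategy of the invention, which the text here does not spell out. The stop word list, whitespace tokenization, the use of MD5 as the per-token hash, the 64-bit fingerprint width, the distance thresholds, and the rule that a match on either the body or the title counts as a duplicate are all assumptions made for the example.

```python
import hashlib
from collections import Counter

# Hypothetical stop word list; a real system would use a language-specific one.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is"}


def preprocess(text):
    """Step 1: lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]


def simhash(tokens):
    """Step 2: 64-bit SimHash fingerprint, weighting each token by its count."""
    weights = [0] * 64
    for token, count in Counter(tokens).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for i in range(64):
            weights[i] += count if (h >> i) & 1 else -count
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_duplicate(doc_a, doc_b, body_threshold=3, title_threshold=3):
    """Step 3 (assumed rule): duplicate if body OR title fingerprints are close."""
    body_dist = hamming(simhash(preprocess(doc_a["body"])),
                        simhash(preprocess(doc_b["body"])))
    title_dist = hamming(simhash(preprocess(doc_a["title"])),
                         simhash(preprocess(doc_b["title"])))
    return body_dist <= body_threshold or title_dist <= title_threshold


a = {"title": "Flood hits river town", "body": "Heavy rain flooded the town centre."}
b = {"title": "Rescue teams deployed", "body": "Heavy rain flooded the town centre."}
print(is_duplicate(a, b))  # True: identical bodies give a body distance of 0
```

Keeping separate fingerprints for the body and the title lets the deduplication step weigh the two signals independently; how the invention actually combines them is not stated in this excerpt.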

[0079] Further, the target text data preprocessing step includes removing stop wo...

Abstract

The invention provides a text deduplication method. The method comprises the following steps: a step of preprocessing the target text data, a step of generating a locality-sensitive hash value of the target text body and a locality-sensitive hash value of the target text title, and a deduplication step. To address the unique characteristics of web text, the strategy for applying the SimHash algorithm is adjusted, so that better results and higher robustness are achieved when deduplication takes the event behind a news text as the subject.

Description

Technical field

[0001] The invention relates to the field of information processing, in particular to a text deduplication method and system.

Background technique

[0002] Text deduplication is widely used during large-scale data collection, and no big data company can avoid the problem. Current mainstream text deduplication schemes fall roughly into two types:

[0003] 1. Similarity matching based on text feature vectors

[0004] 2. Distance measurement using SimHash computed from word segmentation results

[0005] However, when identifying the same event behind texts, the quotation of a small number of local passages affects the final result, causing misjudgments and missed judgments.

[0006] The existing technology based on similarity matching of text feature vectors uses the LSI or LDA algorithm, or the one-hot method, to represent a text as a vector of a specific dimension, and calculating the similarity b...
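The feature-vector approach from the background can be illustrated with a minimal sketch. The source text is cut off before it names the similarity measure, so the choice of a bag-of-words count vector and cosine similarity here is an assumption, as are the example sentences. The sketch shows how two reports of what a reader would take to be the same event can score low when their wording differs.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between bag-of-words count vectors of two token lists."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical reports of one event, worded differently.
report_a = "earthquake strikes city overnight".split()
report_b = "tremor hits town overnight".split()

# Only "overnight" is shared: dot = 1, each norm = 2, so the score is 0.25.
print(cosine_similarity(report_a, report_b))
```

A score this low would fall under most duplicate thresholds, which is the kind of missed judgment the background describes.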

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/211, G06F40/284
CPC: G06F40/211, G06F40/284
Inventor: 孙世通, 刘德彬, 万杰, 严开, 陈玮
Owner: 重庆电信系统集成有限公司