A text deduplication method and system
A technology for deduplicating target texts, applied in the field of information processing, that addresses the problems of missed judgments and low similarity, and achieves good deduplication results with high robustness.
Examples
Embodiment 1
[0073] To suit the particular characteristics of network texts, the present invention adjusts the strategy for applying the SimHash algorithm, and obtains better results and higher robustness when deduplication treats the event behind a news text as the main subject.
[0074] The target text can be news text that a web crawler grabs from the Internet. One of the problems the present invention solves is to determine, within the crawled news text repository, which news texts report the same event, and then to categorize and deduplicate the texts that report the same event.
[0075] As shown in Figure 1, the present invention provides a text deduplication method comprising the following steps:
[0076] a target text data preprocessing step;
[0077] a step of generating a locality-sensitive hash value for the target text body and a locality-sensitive hash value for the target text title;
[0078] a deduplication step.
[0079] Further, the target text data preprocessing step includes removing stop wo...
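The three steps above (preprocessing, body and title hashing, deduplication) can be illustrated with a minimal Python sketch. It is not the patented implementation: the stop-word list, whitespace tokenization, 64-bit MD5-derived token hashes, the Hamming-distance thresholds, and the rule that both the body and the title must match are all assumptions made for illustration, and the names `preprocess`, `simhash`, `hamming`, and `deduplicate` are hypothetical.

```python
import hashlib

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is"}
HASH_BITS = 64  # assumed fingerprint width


def preprocess(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def simhash(tokens, bits=HASH_BITS):
    """Plain SimHash: every token votes +1 or -1 on each bit position."""
    votes = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its vote total is positive.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(docs, body_max=3, title_max=3):
    """Keep the first document of each near-duplicate group.

    docs is a list of (doc_id, title, body) tuples. The thresholds and
    the "body AND title" rule are assumptions, not taken from the source.
    """
    kept = []  # (doc_id, title_hash, body_hash) of retained documents
    for doc_id, title, body in docs:
        th = simhash(preprocess(title))
        bh = simhash(preprocess(body))
        duplicate = any(
            hamming(bh, kb) <= body_max and hamming(th, kt) <= title_max
            for _, kt, kb in kept
        )
        if not duplicate:
            kept.append((doc_id, th, bh))
    return [doc_id for doc_id, _, _ in kept]
```

Because the fingerprints are compared by Hamming distance rather than equality, two reports that differ only in a few tokens can still be grouped together; how tightly they are grouped depends entirely on the assumed thresholds.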