Text deduplication method, device, and electronic equipment

A text deduplication technology, applied in the computer field, that addresses slow deduplication, inaccurate deduplication results, and memory overflow, and that improves deduplication quality and efficiency while saving memory.

Pending Publication Date: 2022-04-29
北京清格科技有限公司

AI Technical Summary

Problems solved by technology

Duplicate checking based on the Jaccard algorithm is computationally expensive, relatively slow, and memory-intensive. If the text of the web page being checked is too long, it causes memory overflow.
Duplicate checking based on the sim-hash algorithm is better suited to comparing the similarity of long texts; its deduplication results for short texts are inaccurate.



Embodiment Construction

[0017] Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the protection scope of the present disclosure.

[0018] It should be understood that the various steps described in the method implementations of the present disclosure may be executed in different orders and/or in parallel. Additionally, method embodiments may include additional steps and/or omit illustrated steps. The scope of the present disclosure is not limited in this regard. ...


Abstract

An embodiment of the invention discloses a text deduplication method, device, and electronic equipment. In one specific embodiment, the method comprises: obtaining a set of web pages to be deduplicated; for each web page in the set, extracting web page features from its web page data and, using its title and body, checking it for duplicates based on a vector space hash algorithm and a minimum common substring matching algorithm to determine whether a similar web page exists in a candidate web page set, and if so, setting a similarity flag bit for the web page; grouping the web pages in the set by their similarity flag bits; and, based on the web page features, selecting a target web page from each group and deleting the other web pages in the group to obtain a deduplicated web page set. This embodiment improves the deduplication quality for web page text while also improving deduplication efficiency and saving memory.
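The flow in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Page` class, the `deduplicate` function, the use of already-seen group representatives as the candidate set, the use of a group number as the similarity flag, and the pluggable `is_similar` and `score` callables are all hypothetical names and choices that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class Page:
    url: str
    title: str
    body: str
    # Hypothetical representation of the "similar flag bit": a group number.
    similar_flag: Optional[int] = None


def deduplicate(
    pages: Iterable[Page],
    is_similar: Callable[[Page, Page], bool],
    score: Callable[[Page], float],
) -> List[Page]:
    """Group similar pages and keep the highest-scoring page of each group."""
    # Assumption: the candidate set holds one representative per group seen so far.
    candidates: List[Page] = []
    groups: Dict[int, List[Page]] = {}
    for page in pages:
        match = next((c for c in candidates if is_similar(page, c)), None)
        if match is not None:
            # A similar candidate exists: reuse its flag so both land in one group.
            page.similar_flag = match.similar_flag
        else:
            # No similar candidate: start a new group with this page as representative.
            page.similar_flag = len(groups)
            candidates.append(page)
        groups.setdefault(page.similar_flag, []).append(page)
    # Select one target page per group; the rest are dropped.
    return [max(group, key=score) for group in groups.values()]


pages = [
    Page("a", "Notice", "short body"),
    Page("b", "Notice", "a longer body of the same notice"),
    Page("c", "Other", "unrelated"),
]
kept = deduplicate(
    pages,
    is_similar=lambda p, q: p.title == q.title,  # stand-in similarity test
    score=lambda p: len(p.body),                 # stand-in page-feature score
)
print([p.url for p in kept])
```

Passing the similarity test and the feature score as callables keeps the grouping logic independent of whichever hash or set-based comparison is actually used.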

Description

Technical field

[0001] The embodiments of the present disclosure relate to the field of computer technologies, and in particular to a text deduplication method, device, and electronic equipment.

Background

[0002] Internet websites frequently reprint one another's content. For example, local government websites reprint news from central government websites, so search engines retrieve the same crawled content repeatedly, which degrades the user experience.

[0003] Existing text deduplication methods are based on the vector space hash (sim-hash) algorithm or the least common substring matching (Jaccard) algorithm. However, duplicate checking based on the Jaccard algorithm is computationally expensive, relatively slow, and memory-intensive; if the text of the web page being checked is too long, it causes memory overflow. The duplication check based on the sim-hash algorithm is more suita...
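To make the trade-off between the two approaches concrete, here is a minimal sketch of both similarity measures in their commonly used forms. It is not taken from the patent: the function names, the word-level tokenizer, the 2-gram shingle size, the unweighted tokens, and the choice of BLAKE2b as the token hash are all assumptions made for illustration. The Jaccard variant keeps a set whose size grows with the text, while the sim-hash variant reduces any text to a fixed 64-bit fingerprint, which with only a handful of tokens tends to be noisy.

```python
import hashlib
import re
from typing import List

BITS = 64  # fingerprint width; matches the 8-byte token hash below


def tokens(text: str) -> List[str]:
    """Lowercased word tokens; a production system would use a real tokenizer."""
    return re.findall(r"\w+", text.lower())


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity |A & B| / |A | B| over sets of word n-grams."""
    def shingles(text: str) -> set:
        t = tokens(text)
        return {tuple(t[i:i + n]) for i in range(max(len(t) - n + 1, 1))}

    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def simhash(text: str) -> int:
    """64-bit sim-hash: each token votes +1 or -1 on every bit position."""
    votes = [0] * BITS
    for tok in tokens(text):
        digest = hashlib.blake2b(tok.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the positive votes win.
    return sum(1 << i for i, v in enumerate(votes) if v > 0)


def hamming(x: int, y: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(x ^ y).count("1")


doc = "local government websites reprint news from central government websites"
print(jaccard("the cat sat", "the cat ran"))
print(hamming(simhash(doc), simhash(doc)))
```

Two texts are usually treated as near-duplicates when their Jaccard similarity exceeds a threshold or when the Hamming distance between their fingerprints falls below one; the thresholds themselves are application-specific and are not given in the text above.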


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/951, G06F40/194, G06F40/216, G06F40/289, G06F16/903
CPC: G06F16/951, G06F40/194, G06F40/216, G06F40/289, G06F16/90344
Inventor: 张洵, 刘青松, 刘博伟, 彭辉
Owner: 北京清格科技有限公司