Document detection method and system
A document detection method, applied in fields such as special data processing applications, instruments, and electrical digital data processing, addresses the problem that precision and recall cannot meet high requirements, with the effect of improving recall, accuracy, and efficiency.
Examples
Embodiment 1
[0017] This embodiment provides a document detection method. As shown in Figure 1, the method includes the following steps.
[0018] 101. Extract characters from a webpage document to be tested to obtain at least one document feature.
[0019] Specifically, the document feature can stand in for the webpage document to be tested and is compared with the document features of other webpage documents, so as to determine whether the webpage document to be tested is a near-duplicate of another webpage document.
[0020] 102. Perform hash calculation on each of the document features to obtain corresponding feature fingerprints.
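Step 102 can be illustrated with a minimal sketch. The text does not name a hash function or a fingerprint width, so the use of MD5 truncated to 64 bits, and the function names `feature_fingerprint` and `fingerprints`, are assumptions made only for illustration.

```python
import hashlib


def feature_fingerprint(feature: str) -> int:
    """Return a 64-bit fingerprint for one document feature.

    Assumption: MD5 truncated to its first 8 bytes; the source only
    says "hash calculation" and does not specify the function.
    """
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprints(features):
    """Hash every document feature, preserving order."""
    return [feature_fingerprint(f) for f in features]
```

Identical features always yield identical fingerprints, which is what allows a fingerprint to be used as a lookup key in place of the feature text.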
[0021] 103. If no document cluster similar to the webpage document to be tested is found in the fingerprint mapping database according to each feature fingerprint, compare the webpage document to be tested for similarity with webpage documents from within a specified number of days.
[0022] Specifically, the (K, V) pair of ...
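The control flow of step 103 might be sketched as follows. The layout of the fingerprint mapping database is not spelled out in the text available here, so modeling it as a dict from fingerprint to cluster id is an assumption, as are the Jaccard similarity measure, the default values for `days` and `threshold`, and every function name.

```python
from datetime import datetime, timedelta


def find_cluster(fingerprint_map, fps):
    """Return the first cluster id that any fingerprint maps to, or None."""
    for fp in fps:
        cluster_id = fingerprint_map.get(fp)
        if cluster_id is not None:
            return cluster_id
    return None


def jaccard(a, b):
    """Jaccard similarity of two fingerprint collections (assumed measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_recent_documents(fps, recent_docs, now, days, threshold):
    """Ids of documents crawled within `days` whose similarity exceeds `threshold`.

    `recent_docs` maps a document id to (crawl time, fingerprints).
    """
    cutoff = now - timedelta(days=days)
    return [
        doc_id
        for doc_id, (crawled_at, doc_fps) in recent_docs.items()
        if crawled_at >= cutoff and jaccard(fps, doc_fps) > threshold
    ]


def detect(fps, fingerprint_map, recent_docs, now, days=7, threshold=0.5):
    """Try the fingerprint map first; fall back to recent documents."""
    cluster_id = find_cluster(fingerprint_map, fps)
    if cluster_id is not None:
        return ("cluster", cluster_id)
    return ("recent", similar_recent_documents(fps, recent_docs, now, days, threshold))
```

The fallback comparison is restricted to a recent time window, so its cost depends on the number of documents in that window rather than on the size of the whole collection.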
Embodiment 2
[0035] This embodiment provides a document detection method. As shown in Figure 3, the method includes the following steps.
[0036] 301. Extract characters from a webpage document to be tested to obtain at least one document feature.
[0037] Specifically, the document feature can stand in for the webpage document to be tested and is compared with the document features of other webpage documents, so as to determine whether the webpage document to be tested is a near-duplicate of another webpage document. At least one document feature can be obtained through the following method.
[0038] Divide the text of the webpage document to be tested into at least one paragraph according to the paragraph identifier; select the N paragraphs containing the largest number of characters among the at least one paragraph; divide each of the N paragraphs into at least one sentence according to punctuation marks; each paragraph contains a sentence with the la...
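The first three operations of this procedure can be sketched as below. The newline as paragraph identifier, the particular set of sentence-ending punctuation marks, and all function names are assumptions; how a sentence is then chosen from each paragraph is not covered, because that part of the description is cut off here.

```python
import re

# Assumed sentence delimiters (Western and Chinese full stops, question
# marks, exclamation marks, semicolons); the source only says "punctuation marks".
SENTENCE_PUNCT = re.compile(r"[.!?;。！？；]+")


def top_paragraphs(text, n, paragraph_sep="\n"):
    """Split text on the paragraph identifier and keep the n longest paragraphs."""
    paragraphs = [p.strip() for p in text.split(paragraph_sep) if p.strip()]
    return sorted(paragraphs, key=len, reverse=True)[:n]


def split_sentences(paragraph):
    """Split one paragraph into non-empty sentences at punctuation marks."""
    return [s.strip() for s in SENTENCE_PUNCT.split(paragraph) if s.strip()]


def candidate_sentences(text, n):
    """For each of the n longest paragraphs, return its list of sentences."""
    return [split_sentences(p) for p in top_paragraphs(text, n)]
```

Splitting on every period is a simplification: it would also break decimal numbers and abbreviations, which a production tokenizer would need to handle.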
Embodiment 3
[0099] This embodiment provides a document detection system. As shown in Figure 4, it includes: a feature extraction device 41, which is used to extract characters from the webpage document to be tested to obtain at least one document feature, and to perform a hash calculation on each document feature to obtain a corresponding feature fingerprint; and a document comparison device 43, which is used to compare the webpage document to be tested with webpage documents within a specified number of days when no document cluster similar to the webpage document to be tested is found in the fingerprint mapping database according to each feature fingerprint, and, when it is determined that a webpage document within the specified number of days whose similarity value with the webpage document to be tested is greater than the similarity threshold is included in a document cluster of the document mapping database, the document cluster is a target docu...