Chinese web page text deduplication system and method

This technology, applied to Chinese web page text deduplication, addresses the waste of user time and search engine resources and the reduced retrieval efficiency caused by duplicate pages. It avoids wasted storage space, ensures that stored pages are unique, and improves retrieval accuracy and retrieval efficiency.

Status: Inactive
Publication date: 2012-04-04
SHENGLE INFORMATION TECH SHANGHAI
Cites: 0; cited by: 18

AI Technical Summary

Problems solved by technology

Because a search engine downloads the web pages captured by its spider program and saves them in a local storage system of limited capacity, a large number of duplicate web pages wastes search engine resources and occupies storage space that other, more valuable pages could use. In addition, retrieval efficiency drops as the amount of data in the local web page database grows, which wastes users' time and degrades their search experience.
[0005] Web page deduplication, that is, removing duplicate web pages found on the Internet, is an effective way to solve the above problems. However, search engines collect an extremely large number of web pages: more than tens of millions in general, and billions for a large search engine such as Google. The forms in which web pages exist are also complex and diverse. Directly comparing each page newly captured by the spider program with the massive set of pages already held by the search engine, one by one, has very high computational complexity. For example, suppose the search engine holds n documents of average length m, and one similarity calculation costs T, where T is a function of m, that is, T = T(m). The document comparisons have complexity O(n^2), so the combined complexity is O(n^2 × T(m)). Such complexity is clearly unacceptable for a system such as a search engine, which must process massive amounts of data.
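
To give a feel for the O(n^2) term, the following minimal Python sketch counts how many similarity calculations a one-by-one comparison would need, against the number of hash-table lookups needed if each page were instead checked once by a signature. The function names are illustrative and do not come from the patent, and the count assumes every incoming page is compared with all pages stored before it.

```python
def pairwise_comparisons(n: int) -> int:
    """Similarity calculations needed when each new page is compared
    with every page stored before it: 0 + 1 + ... + (n - 1)."""
    return n * (n - 1) // 2


def signature_lookups(n: int) -> int:
    """Hash-table lookups needed when each page is checked once by signature."""
    return n


# Each pairwise comparison additionally costs T(m), so the gap in
# real running time is larger than these raw counts suggest.
for n in (1_000, 1_000_000, 10_000_000):
    print(f"n={n:>10,}  pairwise={pairwise_comparisons(n):>20,}  "
          f"lookups={signature_lookups(n):>12,}")
```

At ten million pages the one-by-one approach needs on the order of 5 × 10^13 similarity calculations, while a signature check needs one lookup per page.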

Examples

Detailed Description of Embodiments

[0041] To give a more specific understanding of the technical content, features, and effects of the present invention, it is described in detail below with reference to the illustrated embodiment:

[0042] As shown in Figure 1, the Chinese web page text deduplication system of the present invention mainly comprises two parts, an index server and a retrieval server, wherein:

[0043] The index server is used to calculate the digital signature of Chinese web pages. It further includes a web page text preprocessing module, a combined characteristic sentence extraction module, and a digital signature calculation module. The web page text preprocessing module normalizes the web page text to be checked, which is sent by the retrieval server; the combined characteristic sentence extraction module extracts the combined characteristic sentence from the text processed by the preprocessing module; the digital signature calculation module is used ...

Abstract

The invention discloses a Chinese web page text deduplication system and method. The deduplication system comprises an index server and a search server. The index server comprises a web page text preprocessing module, a combined characteristic sentence extraction module, and a digital signature calculation module; the search server comprises a web page text capture module and a hash query module. The deduplication method comprises the following steps: normalizing the web page text; extracting a combined characteristic sentence from the text; calculating a digital signature of the combined characteristic sentence; and comparing that signature with the signatures already in a hash table to judge whether the page is a duplicate. With this system and method, a search engine can quickly and accurately identify and remove the large number of Chinese web pages with duplicated content on the Internet. When the search engine captures a new web page, it calculates the page's digital signature, compares it with the signatures of the pages it has already stored, and does not store the page if it is a duplicate. This avoids wasting storage space and also improves the search accuracy of the search engine.
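
The four steps in the abstract can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the text available here does not say how the combined characteristic sentence is chosen or which hash function produces the digital signature. The sketch assumes the k longest sentences are concatenated in their original order and uses SHA-256 as a stand-in signature. All names (`normalize`, `combined_characteristic_sentence`, `SignatureIndex`, the parameter `k`) are hypothetical.

```python
import hashlib
import re
import unicodedata

_TAG = re.compile(r"<[^>]+>")            # crude HTML tag stripper (assumption)
_SPACE = re.compile(r"\s+")
_SENTENCE_END = re.compile(r"[。!?;]+")  # applied after NFKC folding


def normalize(text: str) -> str:
    """Step 1: strip tags, fold full-width forms (NFKC), drop whitespace."""
    text = _TAG.sub("", text)
    text = unicodedata.normalize("NFKC", text)
    return _SPACE.sub("", text)


def combined_characteristic_sentence(text: str, k: int = 3) -> str:
    """Step 2 (assumed rule): join the k longest sentences in original order."""
    sentences = [s for s in _SENTENCE_END.split(text) if s]
    by_length = sorted(range(len(sentences)),
                       key=lambda i: (-len(sentences[i]), i))[:k]
    return "".join(sentences[i] for i in sorted(by_length))


def digital_signature(feature: str) -> str:
    """Step 3 (assumed hash): SHA-256 hex digest of the characteristic sentence."""
    return hashlib.sha256(feature.encode("utf-8")).hexdigest()


class SignatureIndex:
    """Step 4: hash table mapping each stored signature to the first URL seen."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}

    def check_and_store(self, url: str, page_text: str) -> bool:
        """Return True if the page is a duplicate; otherwise store it."""
        feature = combined_characteristic_sentence(normalize(page_text))
        signature = digital_signature(feature)
        if signature in self._seen:
            return True
        self._seen[signature] = url
        return False


index = SignatureIndex()
first = "<p>今天天气很好。我们去公园散步！</p>"
copy = "<div> 今天天气很好。 我们去公园散步！ </div>"   # same text, new markup
other = "明天可能下雨。记得带伞。"
for url, page in (("u1", first), ("u2", copy), ("u3", other)):
    print(url, "duplicate" if index.check_and_store(url, page) else "stored")
```

Because only the signature of a short characteristic sentence is stored and looked up, each new page costs one hash-table query instead of a comparison against every stored page. An exact-match signature like this one only catches pages whose selected sentences are identical after normalization.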

Description

Technical field

[0001] The invention relates to a Chinese web page text deduplication system, and also to a method for removing duplicate Chinese web pages by using that system.

Background

[0002] At present, the amount of information on the Internet is growing explosively, and users must rely on search engines to find the information they want in that mass of information. Full-text search engines such as Google and Baidu are search engines in the true sense. They usually send out "spider" programs at regular intervals to fetch web pages from the Internet according to certain rules and save them in a local storage system. After a user enters query keywords on the search interface, the search engine searches the local web page database for records matching the query conditions and returns the results to the user according to certain ranking rules.

[0003] However, since ...

Application Information

IPC(8): G06F17/30
Inventor 陈运文
Owner SHENGLE INFORMATION TECH SHANGHAI