The invention provides an efficient and accurate web page deduplication system based on cloud computing. It aims to solve the following problems: most of the web pages retrieved by existing search engines are static web pages, and because of widespread reposting and plagiarism, the main content of a large number of web pages is duplicated; for a search engine, these duplicate web pages add to the burden of index storage and also consume more retrieval time. The web page deduplication system is designed and implemented on the Hadoop cloud platform in combination with an open-source framework. By detecting and judging duplicates in real time after a spider program crawls a web page, the system connects more readily with the other modules of a search engine. In the massive web page collection stage, the system preprocesses the web pages in advance and then performs web page similarity detection and discovery, removing duplicate web pages and web pages with high similarity, thereby improving index quality, optimizing retrieval results, and providing users with a good search experience.
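
The preprocessing and similarity-detection steps described above can be illustrated with a minimal single-machine sketch. The abstract does not name a similarity measure, so this example assumes word-level shingling with Jaccard similarity. The function names (`preprocess`, `shingles`, `jaccard`, `deduplicate`), the shingle size, and the 0.8 threshold are all hypothetical choices made for illustration, not the method of the invention.

```python
import re


def preprocess(html: str) -> str:
    """Strip tags, lowercase, and collapse whitespace.

    Assumption: a crude stand-in for real main-content extraction.
    """
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.lower().split())


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of the text (k is an assumed value)."""
    words = text.split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(pages: dict, threshold: float = 0.8) -> list:
    """Keep a page only if it is below the threshold against every kept page.

    pages maps a URL to its raw HTML; the threshold is an assumed value.
    """
    kept = []  # list of (url, shingle_set) for pages retained so far
    for url, html in pages.items():
        sig = shingles(preprocess(html))
        if all(jaccard(sig, other) < threshold for _, other in kept):
            kept.append((url, sig))
    return [url for url, _ in kept]


pages = {
    "a": "<p>The quick brown fox jumps over the lazy dog today</p>",
    "b": "<div>The quick brown fox jumps over the lazy dog today</div>",
    "c": "<p>Completely different content about cloud computing platforms</p>",
}
unique_urls = deduplicate(pages)
```

The pairwise comparison here is quadratic in the number of pages, so it only shows the idea. At the scale the abstract describes, a distributed job with compact fingerprints (for example MinHash or SimHash) would be a more plausible choice, though the abstract does not say which approach is used.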