The invention discloses an online clustering method for Chinese Web documents based on common substrings. With the sharp increase of information on the Internet, search engines play an important role in searching for and locating information.
Web document clustering can automatically group the results returned by search engines by topic, helping users narrow the query scope and quickly locate the information they need. The
online clustering of Web documents must, on the one hand, handle the non-numerical and unstructured nature of Web documents and, on the other hand, keep the clustering time within what users' online searches require. To meet these two requirements, the invention provides a Chinese Web document online clustering method based on common substrings, comprising the following steps: (1) preprocessing the first n query results returned by the search engine, so that non-Chinese characters in the results are deleted or replaced; (2) extracting the common substrings of the Web documents using GSA; (3) defining a weighting formula modeled on TF*IDF for the extracted common substrings and building a document feature vector model; (4) computing the pairwise similarity of the Web documents from this model to obtain a similarity matrix; (5) clustering the Web documents from this matrix with an improved hierarchical clustering algorithm; and (6) generating cluster descriptions and extracting cluster labels. The method shows clear advantages in clustering performance, cluster label generation, and clustering time.
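
The six steps can be illustrated with a minimal Python sketch. It is not the disclosed implementation, and several parts are assumptions made only for illustration. Brute-force substring enumeration stands in for the GSA-based extraction. The weight tf * len(substring) * idf is one possible formula modeled on TF*IDF. Cosine similarity is used as the similarity measure. Plain average-linkage agglomerative clustering with a threshold stands in for the improved hierarchical algorithm. All function names, parameters, and default values are hypothetical.

```python
import math
import re
from collections import defaultdict


def preprocess(text):
    """Step 1: replace every run of non-Chinese characters with one space."""
    return re.sub(r"[^\u4e00-\u9fff]+", " ", text).strip()


def common_substrings(docs, min_len=2, max_len=4, min_docs=2):
    """Step 2: substrings shared by at least min_docs documents.

    Brute-force stand-in (assumption) for the GSA-based extraction.
    Returns {substring: {doc_index: occurrence_count}}.
    """
    occ = defaultdict(lambda: defaultdict(int))
    for i, doc in enumerate(docs):
        for seg in doc.split():
            for n in range(min_len, max_len + 1):
                for s in range(len(seg) - n + 1):
                    occ[seg[s:s + n]][i] += 1
    return {sub: dict(c) for sub, c in occ.items() if len(c) >= min_docs}


def feature_vectors(n_docs, subs):
    """Step 3: sparse vectors with the assumed weight tf * len(sub) * idf."""
    vecs = [{} for _ in range(n_docs)]
    for sub, counts in subs.items():
        idf = math.log(n_docs / len(counts)) + 1.0
        for i, tf in counts.items():
            vecs[i][sub] = tf * len(sub) * idf
    return vecs


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def similarity_matrix(vecs):
    """Step 4: pairwise similarity of all documents."""
    return [[cosine(a, b) for b in vecs] for a in vecs]


def cluster(sim, threshold=0.2):
    """Step 5: average-linkage merging until no pair exceeds threshold."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best, pair = threshold, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                total = sum(sim[i][j] for i in clusters[x] for j in clusters[y])
                link = total / (len(clusters[x]) * len(clusters[y]))
                if link > best:
                    best, pair = link, (x, y)
        if pair is None:
            break
        x, y = pair
        merged = clusters.pop(y)  # y > x, so index x stays valid
        clusters[x].extend(merged)
    return clusters


def label(cluster_ids, vecs):
    """Step 6: use the substring with the largest summed weight as label."""
    totals = defaultdict(float)
    for i in cluster_ids:
        for sub, w in vecs[i].items():
            totals[sub] += w
    return max(totals, key=totals.get) if totals else ""


def cluster_results(snippets, threshold=0.2):
    """Run steps 1 to 6 and return a list of (label, document indices)."""
    docs = [preprocess(t) for t in snippets]
    vecs = feature_vectors(len(docs), common_substrings(docs))
    sim = similarity_matrix(vecs)
    return [(label(c, vecs), sorted(c)) for c in cluster(sim, threshold)]


if __name__ == "__main__":
    results = ["苹果手机 new 价格", "苹果手机评测", "北京天气预报", "北京天气 today"]
    for name, ids in cluster_results(results):
        print(name, ids)
```

On these four made-up snippets the sketch groups the two phone results together and the two weather results together. The brute-force extraction is quadratic in segment length and is only suitable for short snippets; a suffix-array-based extraction would be the natural replacement where clustering time matters.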