Content-based spam webpage detection method and detection apparatus
A spam webpage detection technology, applied in website content management, network data retrieval, other database retrieval, etc. It addresses the problem of spam webpages ranking highly and degrading the relevance and accuracy of search results, and it improves relevance.
Examples
Embodiment 1
[0055] A content-based spam webpage detection method (see Figure 1) includes the following steps:
[0056] 101: selecting several spam webpages as seed spam webpages;
[0057] Assume there are N webpages in total, of which x have been marked as spam and stored in the set X. Randomly select m spam webpages from X to form the sample set M; M denotes the seed spam webpages.
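The seed selection in step 101 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `select_seeds`, the page identifiers, and the optional `seed` argument for reproducibility are all assumptions introduced here.

```python
import random


def select_seeds(marked_spam, m, seed=None):
    """Randomly pick m seed spam pages (set M) from the marked spam pages (set X).

    `marked_spam` is a set of page identifiers; `seed` is a hypothetical
    parameter that only makes the random draw reproducible.
    """
    pool = sorted(marked_spam)  # fixed order so a given seed gives a stable draw
    if m > len(pool):
        raise ValueError("m cannot exceed the number of marked spam pages")
    rng = random.Random(seed)
    return set(rng.sample(pool, m))


# Hypothetical identifiers standing in for the marked spam set X.
X = {"spam-a", "spam-b", "spam-c", "spam-d", "spam-e"}
M = select_seeds(X, m=2, seed=0)
```

Sampling without replacement guarantees that M contains m distinct pages and is always a subset of X.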
[0058] 102: For each webpage, calculate the maximum similarity between its content and the content of the seed spam webpages, and collect these values into a similarity set S;
[0059] First, features are extracted from all webpages by statistical methods. The extracted features are then assembled into vectors using the vector space model (VSM). Finally, vector-space cosine similarity is used to calculate the similarity between the content of each webpage and the content of the seed spam webpages.
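A rough sketch of step 102 is given below. The text does not specify which statistical features are extracted, so raw term counts over whitespace-separated tokens are assumed here purely for illustration; the function names and the sample pages are likewise hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a VSM vector as raw term counts (an assumed feature choice)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page is treated as similar to nothing
    return dot / (norm_a * norm_b)


def max_similarity_to_seeds(pages, seed_ids):
    """Return S: for every page, its highest cosine similarity to any seed page."""
    vectors = {pid: term_vector(text) for pid, text in pages.items()}
    return {
        pid: max(cosine_similarity(vec, vectors[s]) for s in seed_ids)
        for pid, vec in vectors.items()
    }


# Hypothetical pages; "p1" plays the role of the only seed spam page.
pages = {
    "p1": "cheap pills buy now",
    "p2": "buy cheap pills now",
    "p3": "weather forecast for monday",
}
S = max_similarity_to_seeds(pages, seed_ids=["p1"])
```

Because the comparison covers all webpages, a seed page is compared with itself and so scores 1. A page sharing no terms with any seed scores 0.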
[0060] 103: Use the PageRank algorithm to sort all web pages; and set the sorted web pages...
Embodiment 2
[0067] The scheme of Embodiment 1 is described in detail below with specific calculation formulas and examples:
[0068] 201: selecting several spam webpages as seed spam webpages;
[0069] Here, a spam webpage is a webpage containing malicious or worthless content. In the embodiment of the present invention, spam webpages are selected as seeds as follows: suppose there are N webpages in total, of which x have been marked as spam and stored in the set X. Randomly select m spam webpages from X to form the sample set M; M denotes the seed spam webpages.
[0070] 202: Use a statistical method to extract features from the webpages, then use VSM to form feature vectors from the extracted features;
[0071] The innovation of the embodiment of the present invention is based on the traditional PageRank algorithm, adding the calculation of the co...
Embodiment 3
[0102] The feasibility of the schemes in Embodiments 1 and 2 is verified below with a specific example:
[0103] In the embodiments of the present invention, the recall rate is used to evaluate the experimental results. The recall rate is the ratio of the size of the intersection of the detected spam webpage set and the marked spam webpage set to the size of the marked spam webpage set.
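The recall rate defined above can be computed as in the following sketch. The function name `recall` and the sample page identifiers are illustrative assumptions, not taken from the patent's experiments.

```python
def recall(detected, marked):
    """Share of marked spam pages that also appear in the detected set."""
    if not marked:
        raise ValueError("the marked spam set must not be empty")
    return len(detected & marked) / len(marked)


# Hypothetical sets: two of the four marked spam pages were detected.
detected = {"a", "b", "c"}
marked = {"b", "c", "d", "e"}
rate = recall(detected, marked)  # 2 / 4 = 0.5
```

Pages that were detected but not marked (here `"a"`) do not affect recall; they would only matter for a precision measure, which the text does not use.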
[0104] When calculating the experimental results, the size of the set of detected spam webpages is fixed at 20,000 webpages. The similarity threshold s is set to five values (0.91, 0.93, 0.95, 0.97 and 0.99), and the recall rate is observed at each value.
[0105] Comparing the experimental results of this method with the traditional PageRank results, it is found that the number and recall rate of spam web pages detected by this method (Sim-PageRank) are higher than those of the traditional PageRank algorithm. When the similarity threshol...