Third, it can improve the user experience. The quality of a search engine's results directly determines the quality of the user experience. If the search engine matches webpages only against the keywords entered by the user, without considering the user's search intent or interpreting those keywords, the topics it returns cannot meet users' needs.
[0004] Chinese webpage classification is based on text classification. Although text classification technology has been applied maturely in many fields of life, webpage classification is much more complicated than text classification because of the irregularity of webpage data structures. For example, the data sets used in text classification come from text documents or database records, which have highly standardized structures from which feature items are easy to obtain. Most webpages, however, are HTML files, and HTML is a semi-structured format: the topic information of a webpage resides inside HTML tags, while noise data and junk information can also appear anywhere among those tags. Such irregular webpages make it increasingly difficult to extract webpage topic information, which creates great difficulties for webpage classification.
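The extraction difficulty described above can be illustrated with a minimal sketch in Python, using only the standard-library `html.parser`. The set of tags treated as noise here is an assumption for illustration, not an exhaustive rule:

```python
from html.parser import HTMLParser

# Tags whose contents are typically noise rather than topic text.
# This list is an illustrative assumption, not a complete filter.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}

class TopicTextExtractor(HTMLParser):
    """Collect visible text while skipping common noise tags."""

    def __init__(self):
        super().__init__()
        self.depth_in_noise = 0  # nesting depth inside noise tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.depth_in_noise += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.depth_in_noise > 0:
            self.depth_in_noise -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every noise tag.
        if self.depth_in_noise == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_topic_text(html: str) -> str:
    """Return the visible, non-noise text of an HTML document."""
    parser = TopicTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

For example, `extract_topic_text('<nav>Home</nav><p>Real content</p>')` keeps only the paragraph text. A real system would need far more heuristics, since, as noted above, noise can appear anywhere in the tag tree.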
[0005] First, the extracted webpage topic content is not accurate enough. A webpage has no fixed modules or structure, so extracting its topic content is difficult. In addition, a webpage contains not only the topic content but also various advertisements, navigation bars, useless links, and other irrelevant information. Because webpages are unstructured, this spam and noise data can appear in any position of the webpage, which seriously affects the accuracy of webpage classification.
[0006] Second, the amount of webpage data is too large to meet the real-time requirements of the webpage classification system. Network data is updated constantly and its volume grows constantly, so the real-time requirements on the webpage classification system are already very demanding. Only by continuously improving the computation speed of the classification method, or by proposing a new classification method, can the accuracy and precision of the web classification system be improved and an efficient user experience be achieved to meet the growing needs of users.
[0011] First, because the content on the Internet is constantly updated and changing, and webpage structures can be set arbitrarily, webpages are presented in a wide variety of ways. They have no fixed structural template, and webpage content and layout styles are inconsistent, so classifying such webpages with existing methods is very inefficient and cannot meet the needs of the growing number of mobile Internet users. Although text classification technology has been applied maturely, webpage classification is much more complicated than text classification because of the irregularity of webpage data structures. Webpages are HTML files in most cases, and the topic information of a webpage resides inside HTML tags, while noise data and junk information can also appear anywhere among those tags. Such irregular webpages make it increasingly difficult to extract webpage topic information, which creates great difficulties for webpage classification;
[0012] Second, the webpage topic content extracted by the existing technology is not accurate enough. A webpage has no fixed modules or structure, so extracting its topic content is difficult. In addition, a webpage contains not only the topic content but also various advertisements, navigation bars, useless links, and other irrelevant information; because webpages are unstructured, this spam and noise data can appear anywhere on the webpage, which seriously affects the accuracy of webpage classification. Moreover, the amount of webpage data is extremely large, and the existing technology cannot meet the real-time requirements of the webpage classification system: network data is updated constantly, and the real-time requirements on the classification system are already very demanding. Only by improving the accuracy and precision of the web classification system can an efficient user experience be achieved;
[0013] Third, most webpage classification technologies in the prior art use existing corpora as data sets. The webpages extracted from these corpora are largely outdated and cannot reflect current hot topics, and the noise data contained in existing corpora seriously affects the accuracy of the classification model. In addition, the feature extraction methods of the existing technology do not consider the semantic correlation between feature items, which has a negative impact on the performance of the classification model. The existing technology also cannot effectively remove noise data from the data set, so the quality of the data set is poor and the accuracy and precision of the model suffer;
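The kind of feature extraction criticized here, in which each term is scored independently of all others, can be sketched as a conventional bag-of-words TF-IDF weighting (a minimal illustration in pure Python; the tokenized inputs are assumed):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Bag-of-words TF-IDF over tokenized documents.

    Each term is weighted independently, so semantically related
    feature items (e.g. synonyms) share no weight -- the limitation
    noted above.
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # TF-IDF = (term frequency in doc) * log(N / document frequency)
        vec = {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        vectors.append(vec)
    return vectors
</```

Note that a term appearing in every document receives weight zero (log 1 = 0), and two different words with the same meaning are treated as unrelated dimensions, which is exactly why semantics-blind feature extraction hurts classification performance.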
[0014] Fourth, the existing webpage classification algorithms based on the vector space model judge a webpage document's category mainly by computing the similarity between webpage document feature vectors. When the number of webpage documents reaches the order of trillions, the time complexity of computing pairwise document similarity is too high. In addition, the classification or clustering results are based on keyword matching without considering semantic information, so they cannot handle polysemy (one word with multiple meanings) or synonymy (multiple words with one meaning), resulting in a poor user experience. The prior-art webpage topic classification algorithm based on linear algebra uses SVD matrix decomposition; the decomposition process is complex, and many dimensions of the feature vectors produced by SVD contain negative values, which makes the semantic concept space of LSI unsatisfactory. In addition, feature items that strongly characterize certain categories may be discarded after LSI maps them into the concept space, which greatly reduces the classification accuracy for webpages. The prior-art webpage topic classification algorithms based on probabilistic topic models suffer from overfitting when the amount of webpage data grows sharply, and their parameters increase with the amount of webpage data, leading to a significant increase in computational complexity;
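The two prior-art techniques discussed in this paragraph can be sketched with NumPy (a minimal illustration assuming a small dense term-document matrix; real systems operate on huge sparse matrices):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors.

    Each comparison is O(d); comparing all pairs of N documents is
    O(N^2 * d), which is the prohibitive cost noted above.
    """
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def lsi_project(term_doc, k):
    """Project a term-document matrix into a k-dimensional LSI
    concept space via truncated SVD.

    The resulting factors may contain negative entries, which is why
    the LSI concept axes are hard to interpret.
    """
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Document representations in the concept space:
    # columns of S_k @ Vt_k, returned one row per document.
    return (np.diag(s[:k]) @ vt[:k]).T
```

For example, projecting a 3-term-by-3-document matrix with k = 2 yields one 2-dimensional concept-space vector per document, on which the same cosine similarity can then be computed more cheaply.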
[0015] Fifth, most of the data sets used by existing Chinese webpage classification methods come from the Sogou corpus. Although the Sogou corpus extracts the topic information of webpages and labels their categories, the webpages it contains are updated slowly: they cannot reflect current social hot topics, nor handle new words and out-of-vocabulary (unregistered) words on the Internet, so Sogou corpus data cannot be used to deal with current hot topics. The webpage classification methods of the prior art also depend on the quality of the training data. Topics and news hotspots are updated every day; if the data is not representative, or becomes unrepresentative over time as new data is generated, the accuracy of the classification model is seriously affected. A large number of new words and hot words are constantly produced, and if a previously trained classification model is used to classify webpages containing many new words, the classification effect is very poor because the trained model is not sensitive to new words.