Provided is a text categorization method based on the XGBoost classification algorithm. In this method, feature values are calculated from tagged words extracted with Labeled-LDA, and
text categorization is then performed by using the XGBoost
classification algorithm. Compared with methods that perform text categorization with a common classification algorithm and adopt a common vector space model as the feature space, the method reduces memory consumption. The number of words contained in Chinese text reaches several million, so the dimensionality is high; if words are adopted as features, memory consumption is massive, and a single machine may be unable to process the data. By contrast, the number of common Chinese characters is no more than ten thousand, and the number of frequently used Chinese characters is only two to three thousand, so the dimensionality is greatly reduced. Meanwhile, XGBoost supports input in dictionary form rather than array form. In addition, the invention provides a novel
feature selection algorithm, Labeled-LDA, which combines latent semantics with supervision. Using Labeled-LDA for feature selection makes it possible both to mine the semantic information of massive corpora through LDA and to exploit the class information contained in the text. Furthermore, preprocessing is simple, features need not be carefully hand-extracted, and the accuracy and performance of categorization are improved by adding XGBoost, a strong
ensemble learning algorithm that supports distributed operation.
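
The character-level, dictionary-style feature representation described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the helper names (`char_counts`, `build_vocab`, `to_libsvm_line`) and the two sample sentences are hypothetical, raw character counts stand in for the Labeled-LDA feature values, and the sparse rows are written in LIBSVM text form on the assumption that this is how the data would be handed to XGBoost.

```python
from collections import Counter


def char_counts(text):
    """Return a dictionary of per-character counts, ignoring whitespace."""
    return Counter(ch for ch in text if not ch.isspace())


def build_vocab(docs):
    """Assign a 0-based index to each distinct character, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for ch in doc:
            if not ch.isspace() and ch not in vocab:
                vocab[ch] = len(vocab)
    return vocab


def to_libsvm_line(label, counts, vocab):
    """Serialize one sparse row as 'label index:value ...' with ascending indices."""
    pairs = sorted((vocab[ch], n) for ch, n in counts.items() if ch in vocab)
    return " ".join([str(label)] + [f"{i}:{n}" for i, n in pairs])


# Hypothetical two-document corpus with made-up class labels.
docs = ["我爱北京", "北京欢迎你"]
labels = [0, 1]

vocab = build_vocab(docs)
rows = [to_libsvm_line(y, char_counts(d), vocab) for d, y in zip(docs, labels)]
```

Each row stores only the characters that actually occur in a document, so memory use grows with document length rather than with the size of the feature space, and the feature space itself is bounded by the character inventory rather than the word vocabulary.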