The present invention provides a machine learning-based data classification method. The method comprises the following steps: S11, determining, based on learning data, a first feature word group corresponding to each piece of data; S12, classifying the learning data according to the feature words; S13, judging whether the classification of the learning data is correct and, if so, jumping to step S15; if not, adjusting the first feature word group and jumping back to step S12; S15, establishing a data classification model based on the first feature word group. The machine learning-based
data classification device comprises a first feature word group determination module, a first data classification module, a judgment and classification module, a second feature word group determination module and a modeling module. According to the technical scheme of the invention, the content of each file is segmented into words, and the weight of each word is calculated using the TF-IDF algorithm. The similarity between files is then calculated, similar files are clustered, and feature words are extracted from each cluster. Feature words differ from keywords: they are more representative and better suited to serve as sensitive information. The feature words of one cluster can therefore be distinguished from those of other clusters.
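
The loop formed by steps S11 to S15 can be illustrated with a minimal Python sketch. The abstract does not say how the feature word group is chosen or adjusted, so everything concrete here is an assumption made for illustration: the names `initial_groups`, `classify` and `train` are hypothetical, one word group is kept per class label rather than per data item, classification is by largest word overlap, and the adjustment rule (drop misleading words, add words exclusive to the correct class) is only one possible choice.

```python
from collections import Counter


def initial_groups(samples, k):
    """S11 (assumed rule): per label, take the k most frequent words."""
    counts = {}
    for tokens, label in samples:
        counts.setdefault(label, Counter()).update(tokens)
    return {label: {w for w, _ in c.most_common(k)} for label, c in counts.items()}


def classify(tokens, groups):
    """S12 (assumed rule): pick the label whose word group overlaps most."""
    words = set(tokens)
    return max(groups, key=lambda label: len(groups[label] & words))


def train(samples, k=1, max_rounds=20):
    """Run S11-S15; max_rounds is a safety bound that the abstract does not mention."""
    groups = initial_groups(samples, k)  # S11
    vocab = {}  # all words seen per label, used by the adjustment rule
    for tokens, label in samples:
        vocab.setdefault(label, set()).update(tokens)

    for _ in range(max_rounds):
        # S12 + S13: classify the learning data and collect the mistakes.
        wrong = [(t, y) for t, y in samples if classify(t, groups) != y]
        if not wrong:
            break  # classification is correct: go to S15
        # Adjustment (assumed rule), then back to S12.
        for tokens, label in wrong:
            words = set(tokens)
            foreign = set().union(*(v for l, v in vocab.items() if l != label))
            for other in groups:
                if other != label:
                    groups[other] -= words      # drop misleading words
            groups[label] |= words - foreign    # add words exclusive to this label
    return groups  # S15: the word groups act as the classification model


samples = [
    (["bank", "loan"], "finance"),
    (["bank", "credit"], "finance"),
    (["bank", "river"], "nature"),
    (["bank", "water"], "nature"),
]
model = train(samples, k=1)
print(model)
print(classify(["credit", "loan", "bank"], model))
```

In this toy data the ambiguous word "bank" starts out in both groups, so the first pass misclassifies some samples; the adjustment step removes it and the loop stops once every learning sample is classified correctly.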
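
The second half of the abstract (word segmentation, TF-IDF weighting, file similarity, clustering, feature word extraction) can be sketched in the same hedged way. The abstract names only TF-IDF; cosine similarity, the greedy threshold-based clustering, the `threshold` and `top_k` parameters, and all function names below are assumptions chosen to keep the example short. The input is assumed to be already segmented into words.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {w: (c / len(doc)) * math.log(n / df[w]) for w, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors, threshold):
    """Greedy single pass: join the first cluster whose first member is similar enough."""
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def feature_words(vectors, members, top_k):
    """Rank a cluster's words by summed TF-IDF weight and keep the top_k."""
    total = Counter()
    for i in members:
        total.update(vectors[i])
    return [w for w, _ in total.most_common(top_k)]


docs = [
    ["invoice", "payment", "bank", "invoice"],
    ["invoice", "bank", "transfer"],
    ["gene", "protein", "cell"],
    ["protein", "cell", "gene", "gene"],
]
vectors = tfidf_vectors(docs)
for members in cluster(vectors, threshold=0.3):
    print(members, feature_words(vectors, members, top_k=2))
```

Because the TF-IDF weight of a word falls as it appears in more files, words shared across clusters rank low, which is consistent with the claim that one cluster's feature words can be told apart from those of other clusters.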