Improved feature evaluation function based Bayesian spam filtering method

A Bayesian spam filtering method based on an improved feature evaluation function, applied in special data processing applications, electrical digital data processing, instruments, etc. It addresses three problems that weaken mutual-information feature selection and therefore the filtering method as a whole: negative correlation is expressed weakly and cancels against positive correlation, word frequency is ignored, and feature items at different positions in a mail contribute differently to category definition.

Active Publication Date: 2015-06-24
LIAONING UNIVERSITY

AI Technical Summary

Problems solved by technology

[0002] The most common feature selection method in Bayesian spam filtering is the "mutual information" method. This method can effectively express the degree of dependence between words in text classification. However, when it is used in the feature selection stage of spam filtering, the following problems become prominent and degrade the performance of the entire filtering method:

1. Positive and negative correlation: the correlation between a feature item and a text category is either positive or negative. Positive correlation expresses the category strongly and negative correlation expresses it weakly, but in the formula the positive and negative values offset each other, so negative correlation works against the score, which is contrary to the original intention.

2. Ignoring word frequency and favoring low-frequency words: the mutual information feature selection method assumes that the amount of text in each category is roughly equal. In addition, it considers only whether a term appears, not how many times it appears in a document. Feature words with more occurrences (that is, higher word frequency) are usually regarded as more related to the category and more representative of it, so feature items that appear frequently in an email are adversely affected.

3. Feature items in different positions contribute differently to category definition: feature items extracted from the email title and from the email body differ greatly in their ability to contribute to classification. In actual spam filtering, users can often judge whether an email is normal mail or spam from its subject line alone.

However, no improved method currently addresses these problems.
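To make problems 1 and 2 concrete, the following is a minimal sketch of the standard (unimproved) mutual information score, estimated from document counts. The count-based estimate, the function names, and the toy numbers are illustrative assumptions and do not come from the patent.

```python
import math

def pointwise_mi(n_tc, n_t, n_c, n):
    """Standard pointwise MI, log(P(t, c) / (P(t) * P(c))), from document counts.

    n_tc: documents of class c that contain term t
    n_t:  documents that contain term t
    n_c:  documents in class c
    n:    total number of documents
    """
    if n_tc == 0:
        return float("-inf")
    return math.log((n_tc * n) / (n_t * n_c))

def class_averaged_mi(term_counts_by_class, class_sizes):
    """Average the pointwise MI over classes, weighted by the class prior P(c)."""
    n = sum(class_sizes.values())
    n_t = sum(term_counts_by_class.values())
    return sum(
        (class_sizes[c] / n) * pointwise_mi(term_counts_by_class[c], n_t, class_sizes[c], n)
        for c in class_sizes
    )

# Toy corpus (assumed numbers): 1000 mails, half spam and half normal.
sizes = {"spam": 500, "normal": 500}

# A rare term seen in only 2 mails, both spam.
rare = pointwise_mi(n_tc=2, n_t=2, n_c=500, n=1000)

# A common term seen in 400 mails, 300 of them spam.
common = pointwise_mi(n_tc=300, n_t=400, n_c=500, n=1000)

# Averaging over both classes lets the negative (normal-class) value
# cancel the positive (spam-class) value for the common term.
common_avg = class_averaged_mi({"spam": 300, "normal": 100}, sizes)
```

With these numbers the rare term scores higher than the common one (ln 2 versus ln 1.5), which illustrates the bias toward low-frequency words, and the class-averaged score of the common term is negative even though the term is a useful spam indicator, which illustrates the offsetting of positive and negative correlation.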



Examples


Embodiment Construction

[0031] The Bayesian spam filtering method based on an improved feature evaluation function comprises the following steps:

[0032] 1) Preprocess the training mail set: divide each mail into two sub-text sets S1 and S2, and perform word segmentation on each separately to form two feature item sets T1 and T2;

2) In each of the two feature item sets T1 and T2, use the stop-word list to delete prepositions, pronouns, adverbs, auxiliary words, and conjunctions, and delete words whose frequency is lower than a given threshold p; the processed feature item sets are denoted T1' and T2';

[0033] 3) In each of the feature item sets T1' and T2', use the improved feature evaluation function to calculate the mutual information value MI(t_k)':

[0034] 3a) Let the feature vector set be T = {t_k, k = 1, 2, ..., n}, and obtain from the network file text library the category set of the training set, C = {c_j, j = 1, 2, ..., r};

[0035] 3b) Use formula (1) to calculate the correction coefficient λ:

[0036] ...
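Steps 1) and 2) above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the regex tokenizer stands in for the word segmentation step, the `STOP_WORDS` set is a hypothetical stand-in for the stop-word list, and the mail is assumed to carry a `Subject:` header followed by a blank line and the body. The improved evaluation function of step 3) is elided in the source and is not reproduced here.

```python
import re
from collections import Counter

# Hypothetical, tiny stop list; the patent's actual stop-word list is not given.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "or",
              "it", "is", "you", "your", "very", "for"}

def split_mail(raw):
    """Split a raw mail into (subject, body).

    Assumes a 'Subject:' header line and a blank line before the body.
    """
    head, _, body = raw.partition("\n\n")
    subject = ""
    for line in head.splitlines():
        if line.lower().startswith("subject:"):
            subject = line[len("subject:"):].strip()
    return subject, body

def tokenize(text):
    """Stand-in for word segmentation: lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_feature_set(texts, p):
    """Return {term: frequency}, dropping stop words and terms with frequency below p."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return {t: c for t, c in counts.items() if t not in STOP_WORDS and c >= p}

mails = [
    "Subject: Win a free prize\n\nClaim your free prize now, free of charge.",
    "Subject: Meeting agenda\n\nThe agenda for the meeting is attached.",
]
parts = [split_mail(m) for m in mails]

# T1: feature items from the subject lines; T2: feature items from the bodies.
T1 = build_feature_set([subject for subject, _ in parts], p=1)
T2 = build_feature_set([body for _, body in parts], p=2)
```

Keeping the two feature sets separate is what later allows title and body terms to be weighted differently, which is the third problem the method sets out to address.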


Abstract

Disclosed is a Bayesian spam filtering method based on an improved feature evaluation function. The method includes the steps of: 1) preprocessing a training mail set, dividing each mail into a mail header part and a body text part; 2) deleting prepositions, pronouns, adverbs, auxiliary words, conjunctions, and words with a word frequency lower than the given threshold p from each of the two feature sets T1 and T2; 3) calculating a mutual information value MI(t_k)' in each of the feature sets T1 and T2 using the improved feature evaluation function; 4) in the training set, sorting the MI(t_k)' values in descending order and selecting the feature items corresponding to the first n values as the representation of the training set; 5) in the classification phase, performing spam filtering on the samples to be tested using a Bayes classifier. With the method, mails can be classified with high accuracy and spam can be filtered out.
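Steps 4) and 5) can be sketched as follows. The abstract says only "a Bayes classifier", so the multinomial naive Bayes model with Laplace smoothing used here is an assumption, as are the class name, the labels, and the toy scores that stand in for the MI(t_k)' values.

```python
import math
from collections import Counter

def top_n_features(scores, n):
    """Step 4: sort scores in descending order and keep the first n terms."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:n]]

class NaiveBayes:
    """Multinomial naive Bayes with Laplace smoothing over a selected vocabulary."""

    def __init__(self, vocab):
        self.vocab = set(vocab)
        self.class_docs = Counter()   # label -> number of training mails
        self.term_counts = {}         # label -> Counter of in-vocabulary terms

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.class_docs[label] += 1
            counts = self.term_counts.setdefault(label, Counter())
            counts.update(t for t in tokens if t in self.vocab)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_docs.values())
        best_label, best_logp = None, float("-inf")
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            logp = math.log(n_docs / total_docs)           # log prior
            for t in tokens:
                if t in self.vocab:                        # ignore unselected terms
                    logp += math.log((counts[t] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

# Hypothetical scores standing in for the MI(t_k)' values of step 3.
scores = {"free": 0.9, "prize": 0.8, "meeting": 0.7, "agenda": 0.6, "now": 0.1}
vocab = top_n_features(scores, n=4)

train = [
    (["free", "prize", "now"], "spam"),
    (["free", "free", "prize"], "spam"),
    (["meeting", "agenda"], "normal"),
    (["agenda", "meeting", "now"], "normal"),
]
clf = NaiveBayes(vocab).fit([d for d, _ in train], [l for _, l in train])
prediction = clf.predict(["free", "prize", "today"])
```

Restricting the classifier to the selected vocabulary means the quality of the feature evaluation function directly determines which evidence the classifier can see.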

Description

Technical field

[0001] The invention relates to a Bayesian spam filtering method based on an improved feature evaluation function.

Background technique

[0002] The most common feature selection method in Bayesian spam filtering is the "mutual information" method. This method can effectively express the degree of dependence between words in text classification, but when it is used in the feature selection stage of spam filtering, the following problems become prominent and degrade the performance of the entire filtering method: 1. Positive and negative correlation: the correlation between a feature item and a text category is either positive or negative. Both cases indicate that the feature item helps to define the category. Positive correlation expresses the category strongly and negative correlation expresses it weakly, but the meaning expressed from the formula...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F17/30
Inventor: 王青松, 魏如玉, 温翠娟, 张黎
Owner: LIAONING UNIVERSITY