The features that are presented to an evolutionary algorithm are preprocessed to generate combination features that may distinguish among classifications more efficiently than the individual features from which each combination feature is formed. An initial set of features is defined that includes a large number of potential features, including the generated features that are combinations of other features. These features include, for example, all of the words used in a collection of content material that has been previously classified, as well as combination features based on these features, such as all the
noun and verb phrases used. This pool of original features and combination features is provided to an evolutionary algorithm for subsequent evaluation, generation, and determination of the best subset of features to use for classification. In this evaluation and
generation process, each combination feature is processed as an independent feature, regardless of which features were used, or not used, to form it. In this manner, for example, a particular phrase that is generated as a combination of original feature words may be determined to be a better distinguishing feature than any of the original feature words, and a more efficient distinguishing feature than an unrelated selection of the individual feature words, as might be provided by a conventional evolutionary algorithm. The
resultant best-performing subset is subsequently used to characterize new content material for automated classification. If the automated classification includes a learning system, the evolutionary algorithm and the generated combination features are also used to train the learning system.
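
The process described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the source: the toy corpus, the function names (build_feature_pool, has_feature, fitness, evolve), the use of adjacent word pairs as a stand-in for noun and verb phrases (which would require a parser), the lookup-table fitness measure with a size penalty, and the particular selection, crossover, and mutation operators. The one point it is meant to show is that a combination feature enters the pool and is scored as an independent feature, separately from the words that form it.

```python
import random
from collections import Counter, defaultdict

# Toy labeled corpus (hypothetical data, for illustration only).
DOCS = [
    ("hot dog with mustard", "food"),
    ("grilled hot dog stand", "food"),
    ("the dog is hot today", "pets"),
    ("hot weather tires my dog", "pets"),
]

def normalize(text):
    return " ".join(text.lower().split())

def build_feature_pool(docs):
    """Pool = every word, plus adjacent word pairs as combination features."""
    pool = set()
    for text, _label in docs:
        words = normalize(text).split()
        pool.update(words)
        pool.update(" ".join(pair) for pair in zip(words, words[1:]))
    return sorted(pool)  # sorted so that runs are reproducible

def has_feature(text, feature):
    """Whole-word match; a phrase is tested as a unit, not via its words."""
    return f" {feature} " in f" {normalize(text)} "

def fitness(subset, docs, size_penalty=0.01):
    """Training accuracy of a lookup-table classifier, minus a size penalty."""
    groups = defaultdict(Counter)
    for text, label in docs:
        signature = tuple(has_feature(text, f) for f in subset)
        groups[signature][label] += 1
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(docs) - size_penalty * len(subset)

def evolve(pool, docs, pop_size=30, generations=40, seed=0):
    """Simple genetic search over feature subsets encoded as boolean masks."""
    rng = random.Random(seed)
    n = len(pool)

    def decode(mask):
        return [f for f, bit in zip(pool, mask) if bit]

    def score(mask):
        return fitness(decode(mask), docs)

    # Sparse random start: each feature is switched on with probability 0.1.
    population = [[rng.random() < 0.1 for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]   # truncation selection
        children = [population[0]]              # elitism: keep the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)           # one-point crossover
            child = a[:cut] + b[cut:]
            flip = rng.randrange(n)             # single-bit mutation
            child[flip] = not child[flip]
            children.append(child)
        population = children
    best = max(population, key=score)
    return decode(best), score(best)

pool = build_feature_pool(DOCS)
print(fitness(["hot dog"], DOCS))      # the phrase alone
print(fitness(["hot", "dog"], DOCS))   # its constituent words
print(evolve(pool, DOCS))
```

On this toy corpus the words "hot" and "dog" occur in every document, so selecting both gives a fitness of about 0.48, while the single phrase "hot dog" separates the two classes and scores about 0.99. Whether the search itself converges on that phrase depends on the seed and the operator settings, which are arbitrary choices here.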