This invention uses linguistic principles, which together can be called Collocational Cohesion (CC), to evaluate and sort documents automatically into one or more user-defined categories, with a specified level of
precision and recall. Human readers are not required to review all of the documents in a collection, so this invention can save time and money for any manner of large-scale
document processing, including legal discovery, Sarbanes-Oxley compliance, creation and review of archives, and maintenance and monitoring of electronic and other communications. Categories for evaluation are user-defined, not pre-set, so that users can adopt either traditional categories (such as different
business activities) or custom, highly specific categories (such as perceived risks or sensitive matters or topics). While the CC process is not itself a general tool for text searches, the application of the CC process to large collections of documents will result in classifications that allow for more efficient indexing and retrieval of information.

This invention works by means of linguistic principles. Everyday communication (letters, reports, emails, and all other kinds of communication in language) follows the grammatical patterns of a language, but forms of communication also follow other patterns that analysts can specify but that are not obvious to their authors. The CC process puts that additional information to work for its users. Any communication exchange that can be recognized as a particular kind of discourse may be used as a category for classification and assessment. Specific linguistic characteristics that belong to the kind of discourse under study can be asserted and compared with a body of general language, both by inspection and by mathematical tests of significance. These characteristics can then be used to form the roster of words and collocations that specifies the discourse type and defines the category. When such a roster is applied to collections of documents, any document with a sufficient number of connections to the roster will be deemed to be a member of the category. Larger documents can be evaluated for clusters of connections, either to identify portions of the larger document for further review, or to subcategorize portions with different linguistic characteristics. The CC process may be extended to create a roster of rosters belonging to many categories, thereby increasing the specificity of evaluation by multilevel application of this invention.

The CC process works better than other processes used for document management that rely on non-linguistic means to characterize documents.
Simple keyword searches retrieve either too many documents (for general keywords) or the wrong documents (because a few keywords cannot adequately define a category), no matter how complex the logic of the search. Application of
statistical analysis without attention to linguistic principles cannot be as effective as this invention, because the words of a language are not randomly distributed. The assumptions of statistics, whether simple inferential tests or advanced
neural network analysis, are thus not a good fit for language. This invention puts basic principles of language first, and only then applies the speed of computer searches and the power of inferential statistics to the problem of evaluation and
categorization of textual documents.
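
By way of illustration only, the application of a roster to a document, and the evaluation of a larger document for clusters of connections, might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names (`tokenize`, `count_connections`, `is_member`, `find_clusters`), the example roster, the collocation span of three tokens, and the numeric thresholds are all hypothetical, and a "connection" is assumed here to mean either one occurrence of a roster word or one co-occurrence of a roster word pair within the span.

```python
import re

# Hypothetical roster for an illustrative "payment collection" category.
ROSTER_WORDS = {"invoice", "payment", "overdue"}
ROSTER_COLLOCATIONS = {frozenset(("wire", "transfer")), frozenset(("past", "due"))}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_connections(tokens, words, collocations, span=3):
    """Count connections between a token sequence and a roster.

    Assumption: a connection is one roster word occurrence, or one
    occurrence of a roster word pair within `span` tokens of each other.
    """
    hits = sum(1 for token in tokens if token in words)
    for i, token in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + span]:
            if frozenset((token, other)) in collocations:
                hits += 1
    return hits


def is_member(text, words, collocations, threshold=3):
    """Deem a document a category member if it has enough connections."""
    return count_connections(tokenize(text), words, collocations) >= threshold


def find_clusters(tokens, words, collocations, window=10, step=5, threshold=3):
    """Return (start, end) token ranges whose connection count meets the threshold."""
    clusters = []
    last_start = max(len(tokens) - window, 0)
    for start in range(0, last_start + 1, step):
        segment = tokens[start : start + window]
        if count_connections(segment, words, collocations) >= threshold:
            clusters.append((start, start + window))
    return clusters
```

Under these assumptions, `is_member(document_text, ROSTER_WORDS, ROSTER_COLLOCATIONS)` would categorize a single document, and `find_clusters` would mark the portions of a longer document that merit further review. In practice the roster and the thresholds would be derived by comparison with a body of general language and tuned to the specified level of precision and recall, rather than fixed by hand as they are here.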