
Method of identifying documents with similar properties utilizing principal component analysis

Status: Inactive
Publication Date: 2008-11-13
Owner: SPARTA

AI Technical Summary

Benefits of technology

[0009] As discussed in more detail below, a further advantage of PCA is that the training aspect of the algorithm (in which the principal component transformation is calculated, and which can be computationally intensive) can be done separately from the analysis of a text under study, which can be accomplished relatively quickly.
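The split between training and analysis described in paragraph [0009] can be illustrated with a minimal sketch. It assumes NumPy and computes the principal axes with a singular value decomposition; the function names `fit_pca` and `project` and all frequency values are hypothetical and are not taken from the patent.

```python
import numpy as np

def fit_pca(training_matrix, n_components):
    """Training step: learn the mean and principal axes from reference data.

    Rows are reference texts; columns are n-gram frequencies.
    """
    mean = training_matrix.mean(axis=0)
    centered = training_matrix - mean
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def project(frequencies, mean, components):
    """Analysis step: one subtraction and one matrix product per text."""
    return (frequencies - mean) @ components.T

# Hypothetical frequencies for 4 reference texts over 3 n-grams.
reference = np.array([
    [0.50, 0.30, 0.20],
    [0.45, 0.35, 0.20],
    [0.10, 0.20, 0.70],
    [0.15, 0.15, 0.70],
])

# Done once, ahead of time.
mean, components = fit_pca(reference, n_components=2)

# Done for each text under study, reusing the stored transformation.
new_text = np.array([0.48, 0.32, 0.20])
scores = project(new_text, mean, components)
```

Only `mean` and `components` need to be stored after training, so the per-text cost in this sketch does not depend on the size of the training set.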
[0014] In another aspect, the invention provides a system for processing textual data. The system includes a module that determines, for each of a plurality of n-gram groupings, an occurrence frequency distribution corresponding to the n-gram members of that grouping for at least two reference texts, wherein one text exhibits an attribute of interest and the other lacks that attribute. The system can further include an analysis module that receives the frequency distributions and applies a principal component transformation to them so as to generate a plurality of principal component vectors corresponding to the reference texts for each n-gram grouping. The analysis module can determine, for each n-gram grouping, a minimum angle between the principal components of the texts corresponding to that grouping. Further, the analysis module can rank order the n-gram groupings based on their minimum angles, e.g., by assigning a higher rank to a grouping that is associated with a larger minimum angle.
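The ranking step in paragraph [0014] can be sketched as follows. This is one possible reading, not the patent's implementation: it assumes each reference text has already been transformed into several nonzero principal component vectors per grouping (for example, one per passage), and it takes the minimum over all cross-text pairs. The names `angle_between`, `minimum_angle`, and `rank_groupings` and the vectors are hypothetical.

```python
import math

def angle_between(u, v):
    """Angle in radians between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)

def minimum_angle(vectors_with, vectors_without):
    """Smallest angle over all pairs drawn from the two reference texts."""
    return min(angle_between(u, v)
               for u in vectors_with
               for v in vectors_without)

def rank_groupings(groupings):
    """Return grouping names ordered from largest to smallest minimum angle."""
    scored = {name: minimum_angle(with_attr, without_attr)
              for name, (with_attr, without_attr) in groupings.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Hypothetical PC vectors: (text with the attribute, text without it).
groupings = {
    "grouping_a": ([(1.0, 0.0), (0.9, 0.1)], [(0.0, 1.0), (0.1, 0.9)]),
    "grouping_b": ([(1.0, 0.0), (0.8, 0.6)], [(0.6, 0.8), (0.0, 1.0)]),
}
ranking = rank_groupings(groupings)
```

In this toy data the two texts stay well apart under `grouping_a`, while under `grouping_b` one pair of vectors nearly coincides, so `grouping_a` ranks first.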

Problems solved by technology

Depending on the application, however, these methods have several drawbacks.
Further, these methods can be sensitive to misspellings, variants, synonyms, and inflected forms, and they tend to be language specific.
This can create a very high-dimensional analysis space in which to classify the text, one which cannot be easily visualized and whose analysis can be computationally intensive.



Embodiment Construction

[0025] The present invention generally provides methods and systems that transform the n-gram frequency distributions of a text into principal component (PC) space to characterize the text, as discussed in more detail below. In some embodiments, a subset of all possible n-grams is selected that is best suited for characterizing a text under analysis. The selection of such a subset of n-grams is analogous to the selection of a plurality of wavelengths for interrogating a sample, as discussed in the co-pending patent application entitled “Selection of Interrogation Wavelengths in Optical Bio-detection Systems,” which is herein incorporated by reference. Hence, the following discussion first covers methods for selecting such wavelengths; further details can be found in the aforementioned patent application.
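As a rough illustration of an n-gram frequency distribution restricted to a selected subset, the sketch below counts character n-grams and normalizes by the total number of n-grams in the text. Character-level n-grams, the normalization, the function name `ngram_frequencies`, and the chosen bigrams are all assumptions made for illustration; the excerpt does not specify them.

```python
from collections import Counter

def ngram_frequencies(text, n, selected):
    """Relative frequency of each selected character n-gram in text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        # Text shorter than n: no n-grams to count.
        return [0.0 for _ in selected]
    counts = Counter(grams)
    # Counter returns 0 for n-grams that never occur.
    return [counts[g] / total for g in selected]

# Hypothetical subset of bigrams chosen for the analysis.
selected_bigrams = ["th", "he", "an"]
spectrum = ngram_frequencies("the then", 2, selected_bigrams)
```

The resulting list has one entry per selected n-gram, so every text maps to a vector of the same length regardless of how many distinct n-grams it contains.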

[0026] As discussed in more detail below, in many embodiments, a metric is defined based on the transformation of spectral data into the principal component spac...


Abstract

The present invention generally provides methods and systems for characterizing texts, for example, for identifying textual documents by language, topic, author, or other attributes. In some embodiments, a method of the invention can include creating an n-gram frequency spectrum for a document under analysis, preferably selecting a subset of the n-gram frequency spectrum, transforming the n-gram frequency spectrum into principal component space, and identifying one or more attributes of the document according to its similarity to (or distinction from) reference documents in the principal component space.
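The final step of the abstract, identifying an attribute from similarity to reference documents in principal component space, could look like the following sketch. It assumes the document and the references have already been projected into the same PC space and uses Euclidean distance as the similarity measure; the abstract does not name a measure, so that choice, the function name `nearest_reference`, and the coordinates are hypothetical.

```python
import math

def nearest_reference(doc_point, references):
    """Return the label of the reference closest to doc_point in PC space.

    references: list of (label, point) pairs with points of equal dimension.
    """
    label, _ = min(references,
                   key=lambda item: math.dist(doc_point, item[1]))
    return label

# Hypothetical PC-space coordinates for three reference documents.
references = [
    ("english", (2.0, 0.5)),
    ("french", (-1.5, 1.0)),
    ("german", (0.0, -2.0)),
]
guess = nearest_reference((1.6, 0.2), references)
```

Distance to the nearest reference can also serve as a rough measure of distinction: a document far from every reference may not share any of the reference attributes.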

Description

RELATED APPLICATIONS

[0001] This application claims priority to a provisional application entitled “Selection of Interrogation Wavelengths in Optical Bio-detection Systems,” having a Ser. No. 60/916,480 and filed on May 7, 2007. This provisional application is herein incorporated by reference.

[0002] The present application is also related to a commonly-owned patent application entitled “Selection of Interrogation Wavelengths in Optical Bio-Detection Systems” by Pierre C. Trepagnier, Matthew B. Campbell and Philip D. Henshaw filed concurrently herewith (Attorney Docket No. 101335-36). This concurrently filed application is also incorporated herein by reference in its entirety.

BACKGROUND

[0003] The present invention relates generally to methods and systems for determining characteristics of a text, such as the language or languages in which it is written, its subject matter, or its author.

[0004] Traditionally, many document categorization methods have relied on high-level identifiers such a...


Application Information

IPC(8): G06F17/27
CPC: G06F17/30707, G06F16/353
Inventors: HENSHAW, PHILIP D.; TREPAGNIER, PIERRE C.
Owner: SPARTA