2750 results about "Text corpus" patented technology

In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.

Method and system for optimally searching a document database using a representative semantic space

A term-by-document matrix is compiled from a corpus of documents representative of a particular subject matter; the matrix records the frequency of occurrence of each term per document. A weighted term dictionary is created using a global weighting algorithm and then applied to the term-by-document matrix, forming a weighted term-by-document matrix. A term vector matrix and a singular value concept matrix are computed by singular value decomposition of the weighted term-by-document matrix. The k largest singular concept values are kept and all others are set to zero, thereby reducing the term vector matrix and the singular value concept matrix to k concept dimensions. The reduced term vector matrix, reduced singular value concept matrix and weighted term dictionary can be used to project pseudo-document vectors, representing documents not appearing in the original document corpus, into a representative semantic space. The similarities of those documents can be ascertained from the position of their respective pseudo-document vectors in the representative semantic space.
Owner:KLDISCOVERY ONTRACK LLC
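
Illustrative note (not taken from the patent text): the steps in the abstract above resemble standard latent semantic indexing with "folding-in" of new documents, which can be written compactly as below. The symbols A (term-by-document matrix), G (diagonal matrix of global term weights), d (raw term-frequency vector of a new document) and the use of cosine similarity are assumptions made for this sketch; the abstract only says that similarity is ascertained from the position of the vectors.

```latex
% Weighted term-by-document matrix (m terms, n documents)
\[
A_w = G A, \qquad G = \operatorname{diag}(g_1, \dots, g_m)
\]
% Singular value decomposition, truncated to the k largest singular values
\[
A_w = U \Sigma V^{\top} \approx U_k \Sigma_k V_k^{\top}
\]
% Pseudo-document vector for a document d that was not in the corpus
\[
\hat{d} = \Sigma_k^{-1} U_k^{\top} G d
\]
% One common way to compare two projected documents (assumed here)
\[
\operatorname{sim}(\hat{d}_1, \hat{d}_2) =
  \frac{\hat{d}_1 \cdot \hat{d}_2}{\lVert \hat{d}_1 \rVert \, \lVert \hat{d}_2 \rVert}
\]
```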

Ranking search results by reranking the results based on local inter-connectivity

A search engine for searching a corpus improves the relevancy of the results by refining a standard relevancy score based on the interconnectivity of the initially returned set of documents. The search engine obtains an initial set of relevant documents by matching a user's search terms to an index of a corpus. A re-ranking component in the search engine then refines the initially returned document rankings so that documents that are frequently cited in the initial set of relevant documents are preferred over documents that are less frequently cited within the initial set.
Owner:GOOGLE LLC
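
Illustrative sketch (not taken from the patent): one simple way to re-rank an initial result set by how often each document is cited by the other documents in that same set. The function name `rerank_by_local_links`, the `weight` parameter and the multiplicative boost are hypothetical choices for this example; the abstract does not specify how the scores are combined.

```python
def rerank_by_local_links(initial, links, weight=0.5):
    """Re-rank first-pass results using citations inside the result set.

    initial: list of (doc_id, relevancy_score) pairs from the first-pass search.
    links:   dict mapping doc_id -> set of doc_ids that it cites.
    weight:  hypothetical strength of the local-citation boost.
    """
    in_set = {doc for doc, _ in initial}

    # Count only citations that come from other documents in the initial set.
    local_cites = {doc: 0 for doc in in_set}
    for src in in_set:
        for dst in links.get(src, ()):
            if dst in in_set and dst != src:
                local_cites[dst] += 1

    # Assumed combination rule: scale the original score by the local count.
    rescored = [
        (doc, score * (1.0 + weight * local_cites[doc]))
        for doc, score in initial
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


initial = [("a", 1.0), ("b", 0.9), ("c", 0.8)]
links = {"a": {"c"}, "b": {"c", "x"}, "c": set()}
reranked = rerank_by_local_links(initial, links)
```

In this toy input, document "c" is cited by both "a" and "b", so it moves ahead of them even though its first-pass score was lowest; the citation to "x" is ignored because "x" is outside the initial set.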

Process and system for retrieval of documents using context-relevant semantic profiles

A process and system for database storage and retrieval are described along with methods for obtaining semantic profiles from a training text corpus, i.e., text of known relevance, a method for using the training to guide context-relevant document retrieval, and a method for limiting the range of documents that need to be searched after a query. A neural network is used to extract semantic profiles from the text corpus. A new set of documents, such as world wide web pages obtained from the Internet, is then submitted for processing to the same neural network, which computes a semantic profile representation for these pages using the semantic relations learned from profiling the training documents. These semantic profiles are then organized into clusters in order to minimize the time required to answer a query. When a user queries the database, i.e., the set of documents, his or her query is similarly transformed into a semantic profile and compared with the semantic profiles of each cluster of documents. The query profile is then compared with each of the documents in the closest-matching cluster. Documents with the closest weighted match to the query are returned as search results.
Owner:DTI OF WASHINGTON

Process for the document management and computer-assisted translation of documents utilizing document corpora constructed by intelligent agents

A method of document management utilizing document corpora including gathering a source corpus of documents in electronic form, modeling the source corpus in terms of document and domain structure information to identify corpus enhancement parameters, using a metalanguage to electronically tag the source corpus, programming the corpus enhancement parameters into an intelligent agent, and using the intelligent agent to search external repositories to find similar terms and structures, and return them to the source corpora, whereby the source corpus is enhanced to form a unicorpus.
Owner:KENT STATE UNIV

Context vector generation and retrieval

A system and method for generating context vectors for use in storage and retrieval of documents and other information items. Context vectors represent conceptual relationships among information items by quantitative means. A neural network operates on a training corpus of records to develop relationship-based context vectors based on word proximity and co-importance using a technique of “windowed co-occurrence”. Relationships among context vectors are deterministic, so that a context vector set has one logical solution, although it may have a plurality of physical solutions. No human knowledge, thesaurus, synonym list, knowledge base, or conceptual hierarchy, is required. Summary vectors of records may be clustered to reduce searching time, by forming a tree of clustered nodes. Once the context vectors are determined, records may be retrieved using a query interface that allows a user to specify content terms, Boolean terms, and / or document feedback. The present invention further facilitates visualization of textual information by translating context vectors into visual and graphical representations. Thus, a user can explore visual representations of meaning, and can apply human visual pattern recognition skills to document searches.
Owner:FAIR ISAAC & CO INC
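
Illustrative sketch (not taken from the patent): the abstract describes a neural network trained with "windowed co-occurrence". The example below swaps in a much simpler count-based stand-in that only shows the windowing idea: each word gets a sparse vector of its neighbours within a fixed window, with closer neighbours weighted more. The function names, the 1/distance weighting and the cosine comparison are assumptions for this example.

```python
import math
from collections import defaultdict


def cooccurrence_vectors(records, window=2):
    """Build sparse co-occurrence vectors from tokenized records.

    records: list of token lists.
    window:  how many positions to the left and right count as context.
    """
    vectors = defaultdict(lambda: defaultdict(float))
    for tokens in records:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    # Assumed weighting: nearer neighbours contribute more.
                    vectors[word][tokens[j]] += 1.0 / abs(i - j)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


records = [["cat", "sat", "mat"], ["dog", "sat", "mat"]]
vectors = cooccurrence_vectors(records, window=2)
similarity = cosine(vectors["cat"], vectors["dog"])
```

Here "cat" and "dog" never appear together, yet they end up with matching vectors because they share the same neighbours, which is the kind of relationship a context vector is meant to capture.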

Methods and Systems of Automatic Ontology Population

Methods and systems for creating a knowledge graph that relates terms in a corpus of literature in the form of an assertion and provides a probability of the veracity of the assertion are disclosed herein. Various aspects of the invention are directed to and / or involve knowledge graphs and structured digital abstracts (SDAs) offering a machine readable representation of statements in a corpus of literature. Various methods and systems of the invention can automatically extract, structure, and visualize the statements. Such graphs and abstracts can be useful for a variety of applications including, but not necessarily limited to, semantic-based search tools for search of electronic medical records, specific content verticals (e.g. newswire, finance, history) and general internet searches.
Owner:COUNSYL INC

Rule-based learning of word pronunciations from training corpora

A text-to-pronunciation system (11) includes a large training set of word pronunciations (19) and an extractor for extracting language-specific information from the training set to produce pronunciations for words not in its training set. A learner (13) forms pronunciation guesses for words in the training set and finds a transformation rule that improves the guesses. A rule applier (15) applies the transformation rule found to the guesses. The learner (13) repeats the finding of another rule and the rule applier (15) applies the new rule, so as to find the rules that improve the guesses the most.
Owner:TEXAS INSTR INC

Search processing with automatic categorization of queries

Search results are processed using search requests, including analyzing received queries in order to provide a more sophisticated understanding of the information being sought. A concept network is generated from a set of queries by parsing the queries into units and defining various relationships between the units. From these concept networks, queries can be automatically categorized into categories, or more generally, can be associated with one or more nodes of a taxonomy. The categorization can be used to alter the search results or the presentation of the results to the user. As an example of alterations of search results or presentation, the presentation might include a list of “suggestions” for related search query terms. As other examples, the corpus searched might vary depending on the category, or the ordering or selection of the results to present to the user might vary depending on the category. Categorization might be done using a learned set of query-node pairs where a pair maps a particular query to a particular node in the taxonomy. The learned set might be initialized from a manual indication of which queries go with which nodes and enhanced as more searches are performed. One method of enhancement involves tracking post-query click activity to identify how a category estimate of a query might have varied from an actual category for the query as evidenced by the category of the post-query click activity, e.g., the particular hits of the search results that the user selected following the query. Another method involves determining relationships between units in the form of clusters and using clustering to modify the query-node pairs.
Owner:R2 SOLUTIONS

Data processing system for autonomously building speech identification and tagging data

A method, system, and computer program product for autonomously transcribing and building tagging data of a conversation. A corpus processing agent monitors a conversation and utilizes a speech recognition agent to identify the spoken languages, speakers, and emotional patterns of speakers of the conversation. While monitoring the conversation, the corpus processing agent determines emotional patterns by monitoring voice modulation of the speakers and evaluating the context of the conversation. When the conversation is complete, the corpus processing agent determines synonyms and paraphrases of spoken words and phrases of the conversation taking into consideration any localized dialect of the speakers. Additionally, metadata of the conversation is created and stored in a link database, for comparison with other processed conversations. A corpus, a transcription of the conversation containing metadata links, is then created. The corpus processing agent also determines the frequency of spoken keywords and phrases and compiles a popularity index.
Owner:NUANCE COMM INC

Disambiguating user intent in conversational interaction system for large corpus information retrieval

A method of disambiguating user intent in conversational interactions for information retrieval is disclosed. The method includes providing access to a set of content items with metadata describing the content items and providing access to structural knowledge showing semantic relationships and links among the content items. The method further includes providing a user preference signature, receiving a first input from the user that is intended by the user to identify at least one desired content item, and determining an ambiguity index of the first input. If the ambiguity index is high, the method determines a query input based on the first input and at least one of the structural knowledge, the user preference signature, a location of the user, and the time of the first input and selects a content item based on comparing the query input and the metadata associated with the content item.
Owner:VEVEO INC

System and method for providing question and answers with deferred type evaluation

A system, method and computer program product for conducting questions and answers with deferred type evaluation based on any corpus of data. The method includes processing a query, including waiting until a “Type” (i.e. a descriptor) is determined and a candidate answer is provided; the Type is not required to be part of a predetermined ontology but is only a lexical / grammatical item, the lexical answer type (LAT). Then, a search is conducted for evidence that the candidate answer has the required LAT (e.g., as determined by a matching function that can leverage a parser, a semantic interpreter and / or a simple pattern matcher). In another embodiment, the method may attempt to match the LAT to a known Ontological Type and then look up a candidate answer in an appropriate knowledge base, database, or the like determined by that type. Then, all the evidence from all the different ways of determining that the candidate answer has the expected LAT is combined and one or more answers are delivered to a user.
Owner:IBM CORP

Part-of-speech tagging using latent analogy

Methods and apparatuses to assign part-of-speech tags to words are described. An input sequence of words is received. A global fabric of a corpus having training sequences of words may be analyzed in a vector space. A global semantic information associated with the input sequence of words may be extracted based on the analyzing. A part-of-speech tag may be assigned to a word of the input sequence based on POS tags from pertinent words in relevant training sequences identified using the global semantic information. The input sequence may be mapped into a vector space. A neighborhood associated with the input sequence may be formed in the vector space wherein the neighborhood represents one or more training sequences that are globally relevant to the input sequence.
Owner:APPLE INC

Method and system for analyzing text

An apparatus for providing a control input signal for an industrial process or technical system having one or more controllable elements includes elements for generating a semantic space for a text corpus, and elements for generating a norm from one or more reference words or texts, the or each reference word or text being associated with a defined respective value on a scale, and the norm being calculated as a reference point or set of reference points in the semantic space for the or each reference word or text with its associated respective scale value. The apparatus also includes elements for reading at least one target word included in the text corpus, elements for predicting a value of a variable associated with the target word based on the semantic space and the norm, and elements for providing the predicted value in a control input signal to the industrial process or technical system. A method for predicting a value of a variable associated with a target word is also disclosed together with an associated system and computer readable medium.
Owner:STROSSLE INT

Systems and methods for collecting user annotations

US20050216457A1 (Active); Enhance and personalize search; Enhance and personalize and browsing operation; Data processing applications; Web data indexing; Paper document; Document preparation
Computer systems and methods allow users to annotate content items found in a corpus such as the World Wide Web. Annotations, which can include any descriptive and / or evaluative metadata related to a document, are collected from a user and stored in association with that user. Users are able to annotate and view their annotations for any document they encounter while interacting with the corpus, including hits returned in a search of the corpus. Users are also able to search their annotations or to limit searches to documents they have annotated. Metadata from annotations can also be aggregated across users and aggregated metadata applied in generating search results.
Owner:R2 SOLUTIONS

Apparatus and method for building domain-specific language models

Disclosed is a method and apparatus for building a domain-specific language model for use in language processing applications, e.g., speech recognition. A reference language model is generated based on a relatively small seed corpus containing linguistic units relevant to the domain. An external corpus containing a large number of linguistic units is accessed. Using the reference language model, linguistic units which have a sufficient degree of relevance to the domain are extracted from the external corpus. The reference language model is then updated based on the seed corpus and the extracted linguistic units. The process may be repeated iteratively until the language model is of satisfactory quality. The language building technique may be further enhanced by combining it with mixture modeling or class-based modeling.
Owner:NUANCE COMM INC
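
Illustrative sketch (not taken from the patent): the iterative loop in the abstract above (train on a seed corpus, pull relevant units from an external corpus, retrain, repeat) can be mimicked with a toy unigram model. The add-one smoothing, the average log-probability relevance score, the `threshold` value and all function names are assumptions for this example; sentences are assumed to be non-empty token lists.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return (word counts, total token count) for a list of token lists."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return counts, sum(counts.values())


def avg_logprob(sentence, counts, total, vocab_size):
    """Average add-one-smoothed log-probability per word (assumed relevance score)."""
    return sum(
        math.log((counts[word] + 1) / (total + vocab_size)) for word in sentence
    ) / len(sentence)


def grow_domain_model(seed, external, threshold, rounds=3):
    """Iteratively extend the seed corpus with relevant external sentences."""
    vocab_size = len({word for sentence in seed + external for word in sentence})
    selected = []
    remaining = list(external)
    counts, total = train_unigram(seed)
    for _ in range(rounds):
        keep = [
            s for s in remaining
            if avg_logprob(s, counts, total, vocab_size) >= threshold
        ]
        if not keep:
            break  # nothing new is relevant enough, so stop early
        selected.extend(keep)
        remaining = [s for s in remaining if s not in keep]
        # Update the model from the seed corpus plus everything extracted so far.
        counts, total = train_unigram(seed + selected)
    return counts, selected


seed = [["heart", "rate"], ["blood", "pressure"]]
external = [
    ["heart", "pressure"],
    ["stock", "market"],
    ["blood", "rate", "monitor"],
]
counts, selected = grow_domain_model(seed, external, threshold=-2.0)
```

With this toy data the two medical-sounding sentences are extracted and the finance sentence is left behind; a real system would use a stronger language model and a tuned relevance criterion.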

System and method for hybrid speech synthesis

A speech synthesis system receives symbolic input describing an utterance to be synthesized. In one embodiment, different portions of the utterance are constructed from different sources, one of which is a speech corpus recorded from a human speaker whose voice is to be modeled. The other sources may include other human speech corpora or speech produced using Rule-Based Speech Synthesis (RBSS). At least some portions of the utterance may be constructed by modifying prototype speech units to produce adapted speech units that are contextually appropriate for the utterance. The system concatenates the adapted speech units with the other speech units to produce a speech waveform. In another embodiment, a speech unit of a speech corpus recorded from a human speaker lacks transitions at one or both of its edges. A transition is synthesized using RBSS and concatenated with the speech unit in producing a speech waveform for the utterance.
Owner:NOVASPEECH

Query translation through dictionary adaptation

US8775154B2 (Active); Digital data information retrieval; Digital data processing details; Cross-language information retrieval; Cross lingual
Cross-lingual information retrieval is disclosed, comprising: translating a received query from a source natural language into a target natural language; performing a first information retrieval operation on a corpus of documents in the target natural language using the translated query to retrieve a set of pseudo-feedback documents in the target natural language; re-translating the received query from the source natural language into the target natural language using a translation model derived from the set of pseudo-feedback documents in the target natural language; and performing a second information retrieval operation on the corpus of documents in the target natural language using the re-translated query to retrieve an updated set of documents in the target natural language.
Owner:CONDUENT BUSINESS SERVICES LLC

Large Scale Distributed Syntactic, Semantic and Lexical Language Models

A composite language model may include a composite word predictor. The composite word predictor may include a first language model and a second language model that are combined according to a directed Markov random field. The composite word predictor can predict a next word based upon a first set of contexts and a second set of contexts. The first language model may include a first word predictor that is dependent upon the first set of contexts. The second language model may include a second word predictor that is dependent upon the second set of contexts. Composite model parameters can be determined by multiple iterations of a convergent N-best list approximate Expectation-Maximization algorithm and a follow-up Expectation-Maximization algorithm applied in sequence, wherein the convergent N-best list approximate Expectation-Maximization algorithm and the follow-up Expectation-Maximization algorithm extract the first set of contexts and the second set of contexts from a training corpus.
Owner:WRIGHT STATE UNIVERSITY

System and method for suggestion mining

A system and method for extraction of suggestions for improvement from a corpus of documents, such as customer reviews, are disclosed. A structured terminology provided for a topic includes a set of semantic classes, each including a set of terms. A thesaurus of terms relating to suggestions of improvement is provided. Text elements of text strings in the documents which are instances of terms in the structured terminology are labeled with the corresponding semantic class and text elements which are instances of terms in the thesaurus are also labeled. A set of patterns is applied to the labeled text strings to identify suggestions of improvement expressions. The patterns define syntactic relations between text elements, some of which are required to be instances of one of the terms in a particular semantic class or thesaurus. A set of suggestions for improvements is output based on the identified suggestions of improvement expressions.
Owner:XEROX CORP

Information data retrieval, where the data is organized in terms, documents and document corpora

The invention relates to improved solutions for information retrieval, wherein the information is represented by digitized text data. This data is further presumed to be organized in terms (431-438), documents and document corpora, where each document contains at least one term (431-438) and each document corpus contains at least one document. Based on a concept vector (420-424), which conceptually classifies the contents of each document, a term-to-concept vector is generated for each term (431-438) in the document corpus. The term-to-concept vector describes a relationship between the term (431) and each of the concept vectors (420-424). On the basis of the term-to-concept vectors for the document corpus, a term-term matrix is generated which describes a term-to-term relationship between all the terms (431-438) in the document corpus. The term-term matrix may then be processed and used for retrieving information from the document corpus, such as the fact that a first term (431) is related to a second term (436).
Owner:ELUCIDON GROUP
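
Illustrative sketch (not taken from the patent): one possible reading of the pipeline in the abstract above, going from per-document concept vectors to term-to-concept vectors and then to a term-term matrix. Summing the concept vectors of the documents that contain a term, and using cosine similarity for the term-term entries, are both assumptions for this example, as are the function names.

```python
import math


def term_to_concept(docs, doc_concepts):
    """Build a term-to-concept vector for every term.

    docs:         list of token lists, one per document.
    doc_concepts: list of concept vectors (lists of floats), one per document.
    Assumption: a term's vector is the sum of the concept vectors of the
    documents in which the term occurs.
    """
    dims = len(doc_concepts[0])
    t2c = {}
    for tokens, concept in zip(docs, doc_concepts):
        for term in set(tokens):
            vec = t2c.setdefault(term, [0.0] * dims)
            for i, value in enumerate(concept):
                vec[i] += value
    return t2c


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def term_term_matrix(t2c):
    """Return (sorted term list, square matrix of term-to-term similarities)."""
    terms = sorted(t2c)
    matrix = [[cosine(t2c[a], t2c[b]) for b in terms] for a in terms]
    return terms, matrix


docs = [["engine", "piston"], ["engine", "valve"], ["poem", "rhyme"]]
doc_concepts = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # e.g. [mechanics, literature]
t2c = term_to_concept(docs, doc_concepts)
terms, matrix = term_term_matrix(t2c)
```

In this toy corpus "piston" and "valve" never share a document, but both inherit the same concept profile, so the term-term matrix marks them as related while "piston" and "poem" are not.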

System and method for dynamically evaluating latent concepts in unstructured documents

A system and method for dynamically evaluating latent concepts in unstructured documents is disclosed. A multiplicity of concepts are extracted from a set of unstructured documents into a lexicon. The lexicon uniquely identifies each concept and a frequency of occurrence. A frequency of occurrence representation is created for the document set. The frequency representation provides an ordered corpus of the frequencies of occurrence of each concept. A subset of concepts is selected from the frequency of occurrence representation filtered against a pre-defined threshold. A group of weighted clusters of concepts selected from the concept subset is generated. A matrix of best fit approximations is determined for each document weighted against each group of weighted clusters of concepts.
Owner:NUIX NORTH AMERICA
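
Illustrative sketch (not taken from the patent): the first steps of the abstract above (a lexicon of concepts with frequencies, an ordered frequency representation, and a subset filtered against a threshold) shown with plain counting. The function names and the use of both a lower and an upper bound are assumptions for this example; clustering and best-fit matching are not shown.

```python
from collections import Counter


def build_lexicon(documents):
    """Map each concept to its frequency of occurrence across all documents.

    documents: list of concept lists (concept extraction itself is assumed
    to have happened already).
    """
    lexicon = Counter()
    for concepts in documents:
        lexicon.update(concepts)
    return lexicon


def select_concepts(lexicon, min_count, max_count):
    """Order concepts by descending frequency and keep those within the bounds."""
    ordered = sorted(lexicon.items(), key=lambda item: (-item[1], item[0]))
    return [concept for concept, count in ordered if min_count <= count <= max_count]


documents = [
    ["merger", "the", "audit"],
    ["merger", "the", "audit"],
    ["merger", "the"],
    ["the", "contract"],
]
lexicon = build_lexicon(documents)
subset = select_concepts(lexicon, min_count=2, max_count=3)
```

The bounds drop both the very common "the" and the one-off "contract", leaving the mid-frequency concepts that are more likely to be useful for clustering.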