Discovering terms using statistical corpus analysis

Status: Inactive
Publication Date: 2016-04-28
IBM CORP
13 Cites, 180 Cited by

AI Technical Summary

Benefits of technology

The present invention provides a method, computer program product, and system for identifying relevant terms in a corpus of text based on their initial context and adding them to a set of category related terms. The system then identifies new terms based on the set of initial contextual characteristics and the set of additions. The technical effect is improved efficiency and accuracy in identifying relevant terms in a text.

Problems solved by technology

Many challenges in NLP involve natural language understanding (that is, enabling computers to derive meaning from human or natural language input).
However, since domain ontologies represent concepts in very specific and often eclectic ways, they are often incompatible.
In the context of NLP, term extraction becomes difficult when the text being processed belongs to a different domain (for example, medical technology) than the domain from which the NLP software was built (for example, financial news).

Examples

II. Example Embodiment

[0045] FIG. 2 shows flowchart 250 depicting a method according to the present invention. FIG. 3 shows program 300 for performing at least some of the method steps of flowchart 250. This method and associated software will now be discussed, over the course of the following paragraphs, with extensive reference to FIG. 2 (for the method step blocks) and FIG. 3 (for the software blocks).

[0046] The present embodiment refers extensively to a high precision domain lexicon (HPDL). The HPDL (also referred to as a “set of category related terms”) is a collection of terms (words or sets of words) that belong to a specific domain, category, or genre (“domain”). In term extraction, and more generally in natural language processing, the HPDL can serve as an underlying “knowledge base” for a given domain so as to extract more contextually relevant terms from a piece of text (or corpus). In many embodiments of the present invention, the HPDL is used to: (i) extract contextually ...

Abstract

Software that extracts contextually relevant terms from a text sample (or corpus) by performing the following steps: (i) identifying a first term from a corpus, based, at least in part, on a set of initial contextual characteristic(s), where each initial contextual characteristic of the set of initial contextual characteristic(s) relates to the contextual use of at least one category related term of a set of category related term(s) in the corpus; (ii) adding the first term to the set of category related term(s), thereby creating a revised set of category related term(s) and a set of first term contextual characteristic(s), where each first term contextual characteristic of the set of first term contextual characteristic(s) relates to the contextual use of the first term in the corpus; and (iii) identifying a second term from the corpus, based, at least in part, on the set of first term contextual characteristic(s).
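
The three steps in the abstract can be illustrated with a small Python sketch. It is not the patented implementation: it assumes, purely for illustration, that a "contextual characteristic" is the pair of words immediately to the left and right of a term, that terms are single tokens, and that the best candidate is the unknown token that most often fills a known context. The function names (`contexts_of`, `best_candidate`, `bootstrap`) and the toy corpus are hypothetical.

```python
from collections import Counter


def contexts_of(tokens, terms):
    """Return the (left, right) neighbour pairs observed around any of `terms`.

    Assumption: a neighbour pair stands in for a "contextual characteristic".
    """
    found = set()
    for i in range(1, len(tokens) - 1):
        if tokens[i] in terms:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found


def best_candidate(tokens, contexts, known):
    """Return the token outside `known` that most often fills one of `contexts`."""
    counts = Counter()
    for i in range(1, len(tokens) - 1):
        if (tokens[i - 1], tokens[i + 1]) in contexts and tokens[i] not in known:
            counts[tokens[i]] += 1
    return counts.most_common(1)[0][0] if counts else None


def bootstrap(tokens, seed_terms):
    """Run one pass of steps (i) to (iii) over a tokenized corpus."""
    terms = set(seed_terms)

    # (i) Identify a first term from the contexts of the category related terms.
    initial_contexts = contexts_of(tokens, terms)
    first = best_candidate(tokens, initial_contexts, terms)
    if first is None:
        return terms, None, None

    # (ii) Add the first term to the set and collect its own contexts.
    terms.add(first)
    first_contexts = contexts_of(tokens, {first})

    # (iii) Identify a second term from the first term's contexts.
    second = best_candidate(tokens, first_contexts, terms)
    return terms, first, second


# Toy corpus (hypothetical), whitespace-tokenized.
corpus = (
    "the patient received aspirin daily . "
    "the patient received ibuprofen daily . "
    "doctors prescribe ibuprofen for pain . "
    "doctors prescribe naproxen for pain ."
)
tokens = corpus.split()
revised_terms, first_term, second_term = bootstrap(tokens, {"aspirin"})
print(first_term, second_term)
```

In this toy corpus the second term never appears in any context of the seed term, so it only becomes reachable after the first term has been added to the set and its contexts collected. That is the point of step (ii): each accepted term can widen the set of contexts used to find the next one.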

Description

BACKGROUND OF THE INVENTION

[0001] The present invention relates generally to the field of natural language processing, and more particularly to “term extraction.”

[0002] Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human-computer interaction. Many challenges in NLP involve natural language understanding (that is, enabling computers to derive meaning from human or natural language input).

[0003] Information Extraction (IE) is a known element of NLP. IE is the task of automatically extracting structured information from unstructured (and/or semi-structured) machine-readable documents. Term Extraction is a sub-task of IE. The goal of Term Extraction is to automatically extract relevant terms from a given text (or “corpus”). Term Extraction is used in many NLP tasks and applications, such as question answering, i...

Application Information

IPC(8): G06F17/30
CPC: G06F17/30684, G06F17/30719, G06F17/30705, G06F40/30, G06F16/3344, G06F16/345, G06F16/35
Inventors: AJMERA, JITENDRA; PARIKH, ANKUR
Owner: IBM CORP