E-Science environment-oriented multi-domain Web text feature extracting system and method

A multi-domain feature extraction technology, applied to Web text feature extraction. It addresses three problems of dictionary-based approaches: they restrict the application scope of Chinese information extraction systems, they make experiments inconvenient to reproduce, and they cannot meet the portability requirements of multi-domain information extraction. The stated effects are better portability and practical value, and improved utilization efficiency.

Inactive Publication Date: 2011-05-25
UNIV OF SCI & TECH BEIJING
3 cites, 4 cited by

AI Technical Summary

Problems solved by technology

[0003] Existing domain-based information extraction methods mostly rely on domain dictionaries to discover text features. This makes experiments hard to reproduce and the methods hard to port and deploy across multiple domains, which seriously restricts the application range of Chinese information extraction systems.
During analysis, these methods mostly depend on domain dictionaries or tagged word sets. Although this can effectively improve extraction accuracy for the features of a specific domain, it cannot meet the practical portability requirements of multi-domain information extraction.



Examples

Embodiment Construction

[0040] The present invention will be described in further detail below in conjunction with the accompanying drawings.

[0041] The design of the multi-domain Web text feature extraction system for the e-Science environment mainly considers the following three aspects. First, it removes the dependence on domain dictionaries. In most Chinese information extraction systems, the domain dictionary is used to segment text and to preprocess data for feature discovery. However, a dictionary is limited in size and in how quickly it can be updated, which seriously restricts the system's ability to discover new events and the latest vocabulary in a field and hinders porting and wider adoption of the system. Introducing dictionary-free word segmentation will effectively improve the knowledge learning ability of the Chinese information extraction system, and is more suitable for f...

Abstract

The invention relates to an e-Science environment-oriented multi-domain Web text feature extraction system and method. The method comprises the following steps: 1. counting the frequency of each character in the target text; 2. taking the character as the basic processing unit, extracting one by one the character strings that run from a character used as the start point to a character with a frequency of 1 used as the end point; and 3. counting the frequency of each extracted character string, sorting the feature strings in descending order of frequency, and outputting them. By introducing dictionary-free word segmentation into feature discovery for domain text, the invention overcomes the dependence of traditional methods on a domain dictionary and, to some extent, improves the portability and practicability of the system and method for multi-domain scientific data.
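
The three steps in the abstract can be sketched in a few lines of Python. This is only an illustrative reading of the abstract, not the patented implementation: the function name extract_features, the parameters min_len and min_count, and the choice to treat a frequency-1 character as a boundary that is excluded from the extracted strings are all assumptions made for this example.

```python
from collections import Counter


def extract_features(text, min_len=2, min_count=2):
    """Dictionary-free feature string extraction (illustrative sketch).

    min_len and min_count are hypothetical filters added for this
    example; the abstract does not specify them.
    """
    # Step 1: count how often each character occurs in the target text.
    char_freq = Counter(text)

    # Step 2: from every start character, grow a string one character
    # at a time until a character with frequency 1 is reached.
    # Assumption: the frequency-1 character acts as a boundary and is
    # not included in the extracted strings.
    candidates = Counter()
    n = len(text)
    for start in range(n):
        if char_freq[text[start]] == 1:
            continue  # a boundary character cannot start a string
        end = start
        while end < n and char_freq[text[end]] > 1:
            end += 1
            if end - start >= min_len:
                candidates[text[start:end]] += 1

    # Step 3: count each string and output in descending frequency
    # (ties broken by longer string first, then alphabetically).
    ranked = [(s, c) for s, c in candidates.items() if c >= min_count]
    ranked.sort(key=lambda item: (-item[1], -len(item[0]), item[0]))
    return ranked


if __name__ == "__main__":
    # "X", "Y" and "Z" occur once, so they act as boundaries.
    for string, count in extract_features("abcXabcYabZ"):
        print(string, count)
```

The sketch needs no word list, which is the point the abstract makes about portability across domains. It does no cleanup of whitespace or punctuation, and its inner loop is quadratic in the worst case, so a practical system would need preprocessing and a bound on string length.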

Description

Technical field

[0001] The invention relates to feature extraction from Web text, and in particular to a multi-domain Web text feature extraction system and method for the e-Science environment.

Background technique

[0002] Khaled Khelif (2007) proposed an ontology-based information extraction method intended to help biologists acquire professional knowledge more effectively. The method relies on semantic annotation of scientific and technical documents, automatically generates a domain ontology, and provides a corresponding information retrieval interface. Tara McIntosh (2007) proposed a full-text information extraction system for the biomedical field to address the shortcomings of traditional analysis methods based on literature summarization. Ziya Ozkan Gokturk and Nihan Kesim Cicekli et al. (2007) used web crawler technology to extract and classify web page metadata with preset regular expressions. In the experiment, taking the European Cup and the UEFA Champions League as ex...

Application Information

IPC(8): G06F17/30
Inventor: 胡长军, 赵冲冲, 翁彧, 赵立永
Owner: UNIV OF SCI & TECH BEIJING