Industry text entity extraction method based on distributed platform

A distributed-platform extraction method applied in the field of text entity extraction. It addresses the growth in feature extraction time and training time as training samples increase, and achieves stronger generalization ability, faster extraction, and more accurate text entity extraction.

Active Publication Date: 2018-04-13
江苏华通晟云科技有限公司

AI Technical Summary

Problems solved by technology

[0005] Chinese patent document CN2017100036859 discloses an online TCM text named entity recognition method based on deep learning. That method enriches the text training sample set with crawlers and extracts text features with a neural network, which improves the accuracy of the extracted entities to a certain extent. However, as the number of training samples grows, the corresponding entity extraction model also grows, and both the training time and the feature extraction time gradually increase.



Examples


Embodiment

[0029] As shown in Figure 1, the industry text entity extraction method based on the distributed platform includes the following steps:

[0030] (1) Text collection: text data from various industries is obtained through the Akka communication module in the open-source Spark platform, and the text data collected by the monitoring equipment that is to undergo extraction is transmitted to the distributed Spark platform.

[0031] (2) Build a Spark platform cluster, using one server as the management node and four servers as service nodes. The management node mainly records the dependencies between data streams and is responsible for task scheduling and generating new RDDs. The service nodes mainly implement the analysis algorithms and the data storage function.

[0032] (3) Train on the existing text data set using a deep learning neural network to obtain the relational feature model, and then use the relational feature model to extract the rel...
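The division of roles in step (2) can be illustrated with a small plain-Python sketch. It is not Spark's scheduler and does not use the Spark API; the class names `ManagementNode` and `ServiceNode`, the round-robin placement, and the dataset names are all hypothetical, chosen only to show one node recording dependencies and scheduling tasks while four others run the analysis and store the results.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceNode:
    """Hypothetical service node: runs the analysis function and stores results."""
    name: str
    stored: dict = field(default_factory=dict)  # (dataset, partition id) -> results

    def run(self, partition_key, records, func):
        self.stored[partition_key] = [func(r) for r in records]
        return self.stored[partition_key]


@dataclass
class ManagementNode:
    """Hypothetical management node: records dependencies and schedules tasks."""
    workers: list
    lineage: dict = field(default_factory=dict)  # child dataset -> parent dataset

    def transform(self, parent_name, child_name, partitions, func):
        # Record the dependency between the two datasets before scheduling.
        self.lineage[child_name] = parent_name
        result = []
        for pid, records in enumerate(partitions):
            # Round-robin placement is an assumption, not taken from the source.
            worker = self.workers[pid % len(self.workers)]
            result.append(worker.run((child_name, pid), records, func))
        return result


# One management node and four service nodes, as in step (2).
manager = ManagementNode([ServiceNode(f"service-{i}") for i in range(4)])
partitions = [["a b"], ["c"], ["d e f"], ["g"], ["h i"]]
token_counts = manager.transform(
    "raw_text", "token_counts", partitions, lambda s: len(s.split())
)
```

With five partitions and four service nodes, the first node receives two partitions and the others one each; the lineage table keeps the parent of every derived dataset, which is the kind of dependency record the management node is said to maintain.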


Abstract

The invention discloses an industry text entity extraction method based on a distributed platform. The method comprises the following steps: training on a text data set with a deep learning neural network to obtain a relation feature model, and generating multiple resilient distributed datasets (RDDs) of relation features from the extracted relation features; extracting class features from the data sets in the RDDs with a class feature model obtained through an improved non-linear SVM classification algorithm; finding the corresponding context entity model according to the extracted class features, and extracting the entity data in the text of the corresponding class feature with the trained entity model; judging whether the amount of text for the corresponding context exceeds a set threshold; if it does, retraining the context entity model and extracting the entity data in the text of the corresponding class feature with the retrained entity model; otherwise, saving the text entity features and the text data. The method can process text feature entities in different contexts and effectively improves both the efficiency and the accuracy of entity extraction.
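The control flow described in the abstract can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `classify_context`, `ContextEntityModel`, the vocabulary-lookup "models", the `labelled_entities` set, and the threshold value are hypothetical stand-ins for the improved non-linear SVM classifier and the trained context entity models, which the source does not specify in detail.

```python
from collections import defaultdict


class ContextEntityModel:
    """Toy stand-in for a trained context entity model (a vocabulary lookup)."""

    def __init__(self, vocabulary):
        self.vocabulary = set(vocabulary)
        self.retrain_count = 0

    def extract(self, text):
        # Return the known entities in the order they appear in the text.
        return [tok for tok in text.split() if tok in self.vocabulary]

    def retrain(self, texts, labelled_entities):
        # Hypothetical retraining: absorb labelled entities seen in the new texts.
        for text in texts:
            self.vocabulary.update(
                tok for tok in text.split() if tok in labelled_entities
            )
        self.retrain_count += 1


def classify_context(text):
    """Hypothetical stand-in for the class feature model (an SVM in the source)."""
    return "medical" if "patient" in text.split() else "finance"


def process(texts, models, labelled_entities, threshold):
    pending = defaultdict(list)  # texts accumulated per context since last retrain
    results = []
    for text in texts:
        context = classify_context(text)   # extract the class feature
        model = models[context]            # find the matching context entity model
        entities = model.extract(text)     # extract with the trained model
        pending[context].append(text)
        if len(pending[context]) > threshold:
            # Text amount exceeds the threshold: retrain, then extract again.
            model.retrain(pending[context], labelled_entities)
            pending[context].clear()
            entities = model.extract(text)
        # The sketch records the result in both branches for simplicity.
        results.append((context, entities))
    return results


models = {
    "medical": ContextEntityModel({"aspirin"}),
    "finance": ContextEntityModel({"bond"}),
}
results = process(
    ["patient took aspirin", "patient took ibuprofen", "bond yields rose"],
    models,
    labelled_entities={"aspirin", "ibuprofen", "bond"},
    threshold=1,
)
```

Keeping one model per context is what lets a retrain triggered by medical texts leave the finance model untouched, which mirrors the per-context retraining decision in the abstract.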

Description

Technical field

[0001] The invention relates to a method for extracting text entities, in particular to a method for extracting industry text entities based on a distributed platform.

Background

[0002] Traditional text extraction methods include pattern-matching relation extraction, dictionary-driven relation extraction, and machine-learning-based relation extraction. Most of these methods first segment the text into words and then take the words with relatively high frequency as valid entities. They suit scenarios in which the entities in the text are relatively uniform, but they cannot effectively distinguish entities across different contexts, and they wrongly split or merge entities that should not be split or merged.

[0003] At the same time, it is difficult for traditional detection methods to extract words that have not appeared in previous texts through...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30; G06F17/27
CPC: G06F16/35; G06F40/289
Inventors: 武克杰, 周书勇
Owner: 江苏华通晟云科技有限公司