
Theme information-based text segmentation method

A text segmentation method based on topic information, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the inconvenience of reading and researching unsegmented text and makes retrieval more convenient.

Active Publication Date: 2019-08-09
XI AN JIAOTONG UNIV
13 Cites, 4 Cited by

AI Technical Summary

Problems solved by technology

Long texts that have not been carefully divided are inconvenient for people to read, and inconvenient for researchers in natural language processing and information retrieval to work with.



Examples


Embodiment Construction

[0036] The present invention is further described below with reference to the accompanying drawings and embodiments.

[0037] As shown in Figure 1, the text segmentation method based on topic information can be divided into the following five steps:

[0038] Step 1: preprocess the input text and the training set to obtain a series of sentences composed of words. This step has two sub-steps.

[0039] 101. Split the input text at ending punctuation marks, where an ending punctuation mark is any symbol that can be used at the end of a Chinese sentence. This yields a series of individual sentences, each on its own line. The training set has the format "sentence-topic tag", where both the sentence and the topic tag are Chinese text; the same splitting operation is applied to its sentence part.
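Sub-step 101 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the punctuation set `END_PUNCT`, the function names, and the tab separator between sentence and topic tag are all assumptions, since the source does not enumerate the ending symbols or specify the delimiter.

```python
import re

# Assumed set of Chinese sentence-ending punctuation. The source only says
# "any symbol that can be used at the end of a Chinese sentence".
END_PUNCT = "。！？；…"

def split_sentences(text: str) -> list[str]:
    """Split text into sentences, keeping the ending punctuation attached."""
    pattern = f"[^{END_PUNCT}]+[{END_PUNCT}]*"
    return [s.strip() for s in re.findall(pattern, text) if s.strip()]

def split_training_line(line: str, sep: str = "\t") -> tuple[str, str]:
    """Parse one 'sentence<sep>topic tag' line. The separator is hypothetical."""
    sentence, topic = line.rstrip("\n").rsplit(sep, 1)
    return sentence, topic
```

Writing each returned sentence on its own line gives the one-sentence-per-line layout described above.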

[0040] 102. Perform word segmentation on each sentence and remove numbers, stop words, punctuation marks, and non-Chinese special symbols. Get a seque...
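The filtering part of sub-step 102 might look like the sketch below. It assumes the sentence has already been split into word tokens by some Chinese word segmenter (for example, jieba's `lcut`), which the source does not name. The stop-word list, the function name, and the choice to treat "numbers" as Arabic-digit tokens are illustrative assumptions.

```python
import re

# Hypothetical stop-word list for illustration only; in practice a full
# list would be loaded from a file.
STOP_WORDS = {"的", "了", "是", "在", "和"}

# Matches tokens made up entirely of characters in the basic CJK
# Unified Ideographs block.
CHINESE_ONLY = re.compile(r"^[\u4e00-\u9fff]+$")

def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Drop digits, punctuation, non-Chinese symbols, and stop words."""
    return [t for t in tokens if CHINESE_ONLY.match(t) and t not in stop_words]
```

Keeping only all-Chinese tokens removes Arabic numerals, punctuation, and Latin or other special symbols in a single test; Chinese numerals such as 三 would pass through and need a separate rule if they should also be dropped.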


Abstract

The invention discloses a text segmentation method based on topic information. The method comprises the following operations: preprocessing an input text and a training set to obtain a series of sentences composed of words; performing feature extraction to obtain feature vectors; clustering the input text according to the semantic information contained in the sentences to obtain a series of sentence clusters, and assigning a numeric label to each cluster in sequence, which yields a series of sentences carrying numeric labels; and assigning each sentence one of the topic tags that exist in the training set, so that every sentence in the text receives an existing topic tag. The numeric labeling result and the topic labeling result are then used for correction to obtain text fragments with topic tags. Because topic tags are assigned to the segmented text, the topic described by each sentence can be seen clearly, the position in the text that describes a given topic can be located by topic, and retrieval is more convenient.
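The abstract does not say how the two labeling results are combined. As one plausible reading, the sketch below lets each cluster take the majority topic tag among its sentences and then merges adjacent sentences that share a tag into one fragment. The majority-vote rule, the function name, and the input layout (three parallel lists) are assumptions for illustration, not the patented procedure.

```python
from collections import Counter
from itertools import groupby

def correct_and_merge(sentences, cluster_ids, topic_tags):
    """Majority-vote a topic per cluster, then merge adjacent same-topic sentences."""
    # Count topic tags inside each numeric cluster.
    votes = {}
    for cid, tag in zip(cluster_ids, topic_tags):
        votes.setdefault(cid, Counter())[tag] += 1
    majority = {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}

    # Overwrite each sentence's tag with its cluster's majority tag.
    corrected = [majority[cid] for cid in cluster_ids]

    # Concatenate runs of consecutive sentences that share a corrected tag.
    fragments = []
    for tag, group in groupby(zip(corrected, sentences), key=lambda pair: pair[0]):
        fragments.append((tag, "".join(s for _, s in group)))
    return fragments
```

Each returned pair is a topic tag together with the contiguous text fragment it describes, which is the shape of output the abstract refers to as a text fragment with a topic tag.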

Description

Technical field

[0001] The invention belongs to the technical field of natural language processing, and in particular relates to a text segmentation method based on topic information.

Background

[0002] Text is usually composed of a series of semantically related fragments. As the web grows rapidly, the amount of text on it is also increasing dramatically, and longer texts account for a considerable proportion of it. Most of these texts have not been carefully divided; they are simply stacks of semantically related fragments. This is inconvenient for people to read, and inconvenient for researchers in natural language processing and information retrieval to work with.

[0003] To solve these problems, text is usually segmented. For the reader, segmentation yields fragments that each relate to a single topic, which makes the text more concise and clear to read, and...


Application Information

IPC(8): G06F17/27, G06K9/62
CPC: G06F40/289, G06F40/30, G06F18/23213, G06F18/24
Inventor: 魏笔凡, 李鸿轩, 刘均, 郑庆华, 吴蓓, 张铎, 吴科炜, 郭朝彤
Owner: XI AN JIAOTONG UNIV