
A Text Segmentation Method Based on Topic Information

A text segmentation method based on topic information, applied in the field of natural language processing. It addresses the problem that long, unsegmented texts are inconvenient both to read and to study.

Active Publication Date: 2020-10-27
XI AN JIAOTONG UNIV

AI Technical Summary

Problems solved by technology

Long Internet texts are often not finely divided into sections. This makes them inconvenient for people to read and inconvenient for researchers in natural language processing and information retrieval to work with.




Detailed Description of the Embodiments

[0036] The present invention will be further explained below with reference to the drawings and examples.

[0037] As shown in Figure 1, the text segmentation method based on topic information consists of the following five steps:

[0038] Step 1. Preprocess the input text and the training set to obtain a series of sentences composed of words. This involves two sub-steps.

[0039] 101. Split the input text at ending punctuation marks, that is, at any symbol that can end a Chinese sentence. This yields a series of separate sentences, each on its own line. The training set has the format "sentence-topic label", where both the sentence and the topic label are Chinese text; apply the same operation to the sentence part.
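Sub-step 101 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the set of ending marks in `ENDING_MARKS` and the function name `split_sentences` are assumptions, since the text only says "all symbols that can end a Chinese sentence".

```python
import re
from typing import List

# Assumed set of Chinese sentence-ending marks; the source does not list them.
ENDING_MARKS = "。！？；…"

# A sentence is a run of non-ending characters followed by any ending marks.
_SENTENCE = re.compile(rf"[^{ENDING_MARKS}]+[{ENDING_MARKS}]*")


def split_sentences(text: str) -> List[str]:
    """Split text after each ending mark and return one sentence per item."""
    sentences = []
    for match in _SENTENCE.finditer(text):
        sentence = match.group().strip()
        if sentence:  # skip pieces that were only whitespace
            sentences.append(sentence)
    return sentences


if __name__ == "__main__":
    # Print one sentence per line, as the step describes.
    for line in split_sentences("今天天气很好。我们去公园吧！你去吗？"):
        print(line)
```

Keeping the ending mark attached to its sentence is a design choice made here for readability; the marks are removed later, in sub-step 102.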

[0040] 102. Segment each individual sentence into words, and remove numbers, stop words, punctuation marks, and non-Chinese special symbols. Obtain a series of sentences composed of wo...
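The filtering part of sub-step 102 might look like the sketch below. The word segmenter is passed in as a parameter because the excerpt does not name one; `clean_sentence`, `tokenize`, and the stop-word set are hypothetical names introduced for illustration. The filter keeps only tokens made entirely of CJK ideographs, which drops Arabic digits, punctuation, and non-Chinese symbols in one check. Numbers written with Chinese characters would pass this check and would need a separate rule.

```python
import re
from typing import Callable, Iterable, List, Set

# Tokens consisting only of CJK Unified Ideographs count as Chinese words.
CHINESE_WORD = re.compile(r"[\u4e00-\u9fff]+")


def clean_sentence(
    sentence: str,
    tokenize: Callable[[str], Iterable[str]],  # assumed: any Chinese word segmenter
    stop_words: Set[str],                      # assumed: a user-supplied stop-word list
) -> List[str]:
    """Tokenize a sentence and drop digits, punctuation, symbols, and stop words."""
    return [
        token
        for token in tokenize(sentence)
        if CHINESE_WORD.fullmatch(token) and token not in stop_words
    ]


if __name__ == "__main__":
    # For the demo the sentence is already segmented with spaces, so str.split
    # stands in for a real segmenter.
    words = clean_sentence("我 喜欢 2020 年 的 NLP 课程 。", str.split, {"的"})
    print(words)
```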


Abstract

The invention discloses a text segmentation method based on topic information. The method proceeds as follows. The input text and the training set are preprocessed to obtain sentences, each composed of a series of words. Feature extraction is then performed to obtain a feature vector for each sentence. The sentences of the input text are clustered according to their latent semantic information, producing a series of sentence clusters; each cluster is assigned a numeric label in order, which yields a series of individual sentences with numeric labels. Each sentence is also assigned one of the topic labels that already exist in the training set, so that every sentence in the text carries a topic label. The numeric-label results and the topic-label results are then used to correct each other, producing text fragments with topic labels. Because the segmented text carries topic labels, the topic of each passage is clearly visible, and the position in the text that discusses a given topic can be located easily, which makes retrieval more convenient.
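The abstract does not say how the two labelings are reconciled. One plausible reading, sketched here purely as an assumption, is a majority vote: every sentence in a cluster takes the topic label most common in that cluster, and consecutive sentences with the same corrected label are merged into one fragment. The function name `correct_and_merge` and the voting rule are illustrative, not taken from the patent.

```python
from collections import Counter
from itertools import groupby
from typing import Dict, List, Sequence, Tuple


def correct_and_merge(
    sentences: Sequence[str],
    cluster_ids: Sequence[int],    # numeric label of each sentence's cluster
    topic_labels: Sequence[str],   # topic label predicted for each sentence
) -> List[Tuple[str, List[str]]]:
    """Assumed rule: majority topic per cluster, then merge adjacent equal labels."""
    # Count topic votes inside each cluster.
    votes: Dict[int, Counter] = {}
    for cluster_id, topic in zip(cluster_ids, topic_labels):
        votes.setdefault(cluster_id, Counter())[topic] += 1
    majority = {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}

    # Overwrite each sentence's topic with its cluster's majority topic.
    corrected = [majority[cid] for cid in cluster_ids]

    # Merge runs of consecutive sentences that share a topic into fragments.
    fragments = []
    for topic, run in groupby(zip(corrected, sentences), key=lambda pair: pair[0]):
        fragments.append((topic, [sentence for _, sentence in run]))
    return fragments


if __name__ == "__main__":
    demo = correct_and_merge(
        ["s1", "s2", "s3", "s4", "s5"],
        [0, 0, 0, 1, 1],
        ["A", "B", "A", "C", "C"],
    )
    print(demo)
```

In the demo, the stray label "B" on the second sentence is overruled by its cluster, so the first three sentences form one fragment.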

Description

Technical field

[0001] The invention belongs to the technical field of natural language processing, and specifically relates to a text segmentation method based on topic information.

Background

[0002] A text usually consists of a series of semantically related segments. With the rapid growth of the Internet, the number of online texts is also increasing rapidly, and longer texts account for a considerable proportion of them. Most of these texts are not finely divided; they are simply a series of semantically related segments stacked together. This is convenient neither for people to read nor for researchers in natural language processing and information retrieval to work with.

[0003] To solve the above problems, the text is usually segmented. For the reader, segmentation yields fragments that each relate to a single topic, which makes the text more concise and clear to read, and makes it possible to browse the text content related to the specific topi...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/289, G06F40/30, G06K9/62
CPC: G06F40/289, G06F40/30, G06F18/23213, G06F18/24
Inventor: 魏笔凡, 李鸿轩, 刘均, 郑庆华, 吴蓓, 张铎, 吴科炜, 郭朝彤
Owner: XI AN JIAOTONG UNIV