Statistical method and statistical system of text similarity

A text similarity statistics technology, applied in computing, special data processing applications, instruments, etc., that addresses the difficulty of accurately reflecting the degree of similarity.

Active Publication Date: 2013-06-26

南方电网互联网服务有限公司

4 Cites, 14 Cited by

Problems solved by technology

[0004] Traditional text similarity statistical methods have difficulty accurately reflecting the similarity between texts whose word and sentence order has been deliberately disrupted. To solve this problem, it is necessary to provide a statistical method of text similarity that can accurately reflect the similarity between such texts.

Examples

Embodiment 1

[0020] Figure 1 is a flowchart of a statistical method for text similarity in an embodiment, including the following steps:

[0021] S110. Acquire text T1 and text T2 for which similarity needs to be determined.

[0022] S120. Divide text T1 and text T2 into natural paragraphs, compare every natural paragraph of text T1 with every natural paragraph of text T2, and record the number of identical natural paragraphs as k3.

[0023] In this embodiment, the number of natural paragraphs in text T1 is denoted k1, and the number of natural paragraphs in text T2 is denoted k2. For i from 1 to k1 and j from 1 to k2, paragraph i of text T1 is compared with paragraph j of text T2, and the number of identical natural paragraphs is recorded as k3.

[0024] S130. Delete the identical natural paragraphs from text T1 and text T2. Text T1 after deletion becomes text T3, and text T2 after deletion becomes text T4.

[0025]...
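
Steps S110 to S130 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, natural paragraphs are assumed to be non-empty lines, and repeated paragraphs are assumed to be matched at most once each (the excerpt does not say how repeats are counted).

```python
from collections import Counter


def split_paragraphs(text):
    """Split text into natural paragraphs (assumption: one per non-empty line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def match_and_remove(segments_1, segments_2):
    """Count identical segments (k3) and return the unmatched remainders.

    Assumption: each segment is matched at most once, so repeated
    segments are counted as a multiset intersection.
    """
    common = Counter(segments_1) & Counter(segments_2)
    k3 = sum(common.values())

    def remainder(segments):
        budget = Counter(common)  # copy so each text consumes its own matches
        rest = []
        for seg in segments:
            if budget[seg] > 0:
                budget[seg] -= 1  # drop one matched occurrence
            else:
                rest.append(seg)
        return rest

    return k3, remainder(segments_1), remainder(segments_2)


# S110: two example texts whose paragraph order differs.
t1 = "alpha paragraph\nbeta paragraph\ngamma paragraph"
t2 = "gamma paragraph\ndelta paragraph\nalpha paragraph"

# S120: split and count identical paragraphs, ignoring position.
p1, p2 = split_paragraphs(t1), split_paragraphs(t2)
k1, k2 = len(p1), len(p2)

# S130: remove the identical paragraphs to obtain T3 and T4.
k3, t3, t4 = match_and_remove(p1, p2)
print(k1, k2, k3, t3, t4)
```

Because the comparison ignores position, the two shared paragraphs are found even though they appear in a different order in each text.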

Embodiment 2

[0053] S210. Acquire text T1 and text T2 for which similarity needs to be determined.

[0054] S220. Divide text T1 and text T2 into natural paragraphs, compare every natural paragraph of text T1 with every natural paragraph of text T2, and record the number of identical natural paragraphs as k3.

[0055] In this embodiment, the number of natural paragraphs in text T1 is denoted k1, and the number of natural paragraphs in text T2 is denoted k2. For i from 1 to k1 and j from 1 to k2, paragraph i of text T1 is compared with paragraph j of text T2, and the number of identical natural paragraphs is recorded as k3.

[0056] S230. Delete the identical natural paragraphs from text T1 and text T2. Text T1 after deletion becomes text T3, and text T2 after deletion becomes text T4.

[0057] S240. Divide text T3 and text T4 into words, and compare all the words in text T3 with all the words in the text...
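
The word-level comparison in S240 might look like the sketch below. It assumes whitespace tokenization (Chinese text would need a dedicated word segmenter) and assumes the second operand of the truncated step is text T4; `split_words`, `count_identical`, and the sample remainders are hypothetical.

```python
from collections import Counter


def split_words(text):
    """Tokenize on whitespace (assumption; Chinese text needs a word segmenter)."""
    return text.split()


def count_identical(items_a, items_b):
    """Number of items shared by both lists, each occurrence matched at most once."""
    return sum((Counter(items_a) & Counter(items_b)).values())


# Hypothetical remainders T3 and T4 left after step S230.
t3 = "the quick brown fox jumps"
t4 = "jumps the lazy fox"

words_3, words_4 = split_words(t3), split_words(t4)
shared = count_identical(words_3, words_4)
print(shared, len(words_3), len(words_4))
```

Running the second pass only on the remainders means text already matched at the paragraph scale is not counted again at the word scale.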

Embodiment 3

[0069] S310. Acquire text T1 and text T2 for which similarity needs to be determined.

[0070] S320. Divide text T1 and text T2 into sentences, compare every sentence of text T1 with every sentence of text T2, and record the number of identical sentences as k3.

[0071] In this embodiment, the number of sentences in text T1 is denoted k1, and the number of sentences in text T2 is denoted k2. For i from 1 to k1 and j from 1 to k2, the i-th sentence of text T1 is compared with the j-th sentence of text T2, and the number of identical sentences is recorded as k3.

[0072] S330. Delete the identical sentences from text T1 and text T2. Text T1 after deletion becomes text T3, and text T2 after deletion becomes text T4.

[0073] S340. Divide text T3 and text T4 into words, compare all the words in text T3 with all the words in text T4, and record the number of identical words as k6....
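
A compact sketch of the sentence-then-word variant (S310 to S340) follows. The sentence terminator set, the `\w+` word pattern, and the set-based matching (which assumes no sentence or word needs to be counted more than once) are all simplifying assumptions, and the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Split into sentences, keeping end punctuation (assumed set: . ! ? 。 ！ ？)."""
    chunks = re.findall(r"[^.!?。！？]+[.!?。！？]*", text)
    return [c.strip() for c in chunks if c.strip()]


def split_words(sentences):
    """Collect word tokens from a list of sentences (assumption: word-character runs)."""
    return [w for s in sentences for w in re.findall(r"\w+", s)]


def sentence_then_word(text_1, text_2):
    """Return (k1, k2, k3, k6) for the two-scale comparison."""
    s1, s2 = split_sentences(text_1), split_sentences(text_2)
    shared = set(s1) & set(s2)               # S320: order is ignored on purpose
    k3 = len(shared)                         # assumes no repeated sentences
    t3 = [s for s in s1 if s not in shared]  # S330: remainder of T1
    t4 = [s for s in s2 if s not in shared]  # S330: remainder of T2
    k6 = len(set(split_words(t3)) & set(split_words(t4)))  # S340
    return len(s1), len(s2), k3, k6


t1 = "The cat sat. Dogs bark loudly. Birds sing at dawn."
t2 = "Birds sing at dawn. Dogs howl loudly. The cat sat."
print(sentence_then_word(t1, t2))
```

In this example the two texts share two sentences in a different order, and the one remaining sentence in each text still shares the words "Dogs" and "loudly" at the finer scale.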

Abstract

The invention discloses a statistical method of text similarity. The method comprises the following steps: obtaining a first text and a second text whose similarity is to be determined; dividing the first text and the second text into a plurality of text segments according to a first dividing scale, and calculating the proportion of the number of identical text segments in the first text and the second text to the total number of text segments of the first text under the first dividing scale; deleting the identical text segments from the first text and the second text to obtain a first remaining text and a second remaining text, respectively; dividing the first remaining text and the second remaining text into a plurality of text segments according to a second dividing scale, and calculating the proportion of the number of identical text segments in the first remaining text and the second remaining text to the total number of text segments of the first remaining text under the second dividing scale; and calculating the comprehensive text similarity of the first text and the second text. The method can accurately reflect the degree of similarity between texts whose word and sentence order has been deliberately disrupted, and can detect similar texts whose word order, sentence order, and paragraph order have been rearranged on purpose.
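
The two proportions named in the abstract can be written out as follows. The symbols k1, k3, and k6 follow the embodiments; k4, the number of text segments in the first remaining text under the second dividing scale, is a hypothetical symbol introduced here for illustration. How the two proportions are combined into the comprehensive similarity is not shown in this excerpt and is left open.

```latex
% p_1: proportion under the first dividing scale
% p_2: proportion under the second dividing scale
% k_4 is a hypothetical symbol (segments in the first remaining text)
p_1 = \frac{k_3}{k_1}, \qquad p_2 = \frac{k_6}{k_4}

% Illustrative numbers only: k_1 = 3, k_3 = 2, k_4 = 3, k_6 = 2
p_1 = \frac{2}{3} \approx 0.67, \qquad p_2 = \frac{2}{3} \approx 0.67
```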

Description

Technical field

[0001] The invention relates to text processing, in particular to a statistical method for text similarity and a statistical system for text similarity.

Background

[0002] In the prior art, the similarity between two texts is generally judged by segmenting the two texts and then identifying, in order, the repeated strings of words and phrases in the two texts.

[0003] However, if the order of words and sentences in a text is deliberately disrupted, then even if the texts are essentially similar (for example, plagiarized), the similarity obtained with the existing statistical method is low and does not reflect the actual degree of similarity of the texts.

Contents of the invention

[0004] Based on this, in order to solve the problem that the traditional text similarity statistical method is difficult to accurately reflect the similarity between texts whose order of words and sentences has been artificially disrupted, it is necessary to prov...
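
The limitation described in [0003] can be demonstrated with a small sketch. Python's `difflib.SequenceMatcher` is used here only as a stand-in for an order-sensitive comparison; it is not the specific prior-art method, and the function names and sample sentences are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def ordered_similarity(items_a, items_b):
    """Order-sensitive ratio of matching runs (stand-in for in-order comparison)."""
    return SequenceMatcher(None, items_a, items_b).ratio()


def unordered_similarity(items_a, items_b):
    """Share of items in items_a that also occur in items_b, ignoring order."""
    if not items_a:
        return 0.0
    common = Counter(items_a) & Counter(items_b)
    return sum(common.values()) / len(items_a)


original = ["First sentence.", "Second sentence.", "Third sentence."]
shuffled = list(reversed(original))  # same sentences, order disrupted

ordered = ordered_similarity(original, shuffled)
unordered = unordered_similarity(original, shuffled)
print(ordered, unordered)
```

The order-sensitive score falls below 1.0 for the reordered text even though every sentence is reused, while the order-insensitive share stays at 1.0.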

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F17/27

Inventor: 朱定局

Owner: 南方电网互联网服务有限公司
