Similar document query method based on stop words

A stop-word-based query method, applied in special-purpose data processing applications, instruments, electronic digital data processing, and related fields. It addresses the lack of any specific handling of the particularities of Chinese and other East Asian languages, giving good comparison results and improved comparison efficiency.

Inactive Publication Date: 2013-02-20
RUN TECH CO LTD BEIJING
2 Cites, 12 Cited by

AI Technical Summary

Problems solved by technology

However, most existing methods originate in Western-language text processing and do not specifically address the particularities of Chinese and other East Asian languages.


Embodiment Construction

[0028] The technical solutions in this embodiment are described clearly and completely below with reference to the accompanying drawings of the embodiment of the present invention. It should be understood that the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present invention, without creative effort, fall within the protection scope of the present invention.

[0029] As shown in Figure 1, the method is divided into the following steps:

[0030] 1. Normalize the format of the Chinese electronic document. Normalization means removing from the document's characters (a character being the general term for all kinds of letters and symbols) every non-text character and all pure formatting information, leaving only text and punctuation.

[0031] Example:

[0032] Before normalization: China, #¥%...&*My mothe...
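
The normalization step in [0030] can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: it assumes "text" means characters in the basic CJK ideograph range and that "punctuation" is a small hand-picked set of marks. The function name `normalize`, the `keep_punct` flag, and the sample string are all hypothetical. The flag is there because the abstract speaks of removing all non-Chinese-character information, while [0030] appears to keep punctuation.

```python
import re

# Assumption: "text" is the basic CJK Unified Ideographs block only.
_CJK = r"\u4e00-\u9fff"
# Assumption: a small, hand-picked set of Chinese and ASCII punctuation.
_PUNCT = "，。！？；：、,.!?;:"


def normalize(text, keep_punct=True):
    """Strip every character that is not Chinese text (or punctuation)."""
    allowed = _CJK + (_PUNCT if keep_punct else "")
    # Delete anything outside the allowed character class, including
    # symbols, whitespace, and control characters.
    return re.sub(f"[^{allowed}]", "", text)


sample = "中国，#¥%&*我的母亲\t\n"            # hypothetical input
print(normalize(sample))                     # 中国，我的母亲
print(normalize(sample, keep_punct=False))   # 中国我的母亲
```

A production version would likely need a wider definition of text (extension blocks, full-width digits) than this sketch assumes.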


Abstract

The invention relates to a similar document query method based on stop words. The method comprises the steps of: (1) normalizing the two documents to be queried by removing all non-Chinese-character information from them; (2) segmenting the two documents into words according to a word segmentation dictionary, converting each document into a word stream; (3) extracting the stop words from the word stream according to writing habits; (4) combining each stop word with the ordinary content words that follow it into a segment information fingerprint; (5) collecting the segment information fingerprints of each document into an intermediate fingerprint identifier for that document, and placing the identifiers in an information fingerprint database for comparison; (6) calculating the similarity of the fingerprint identifiers in the fingerprint database to obtain the similarity value of the two documents; and (7) treating two documents whose similarity exceeds a set threshold as similar documents, and outputting all or some of the similar documents in a set manner. Because it uses Chinese stop words together with several of the words that follow them, the method suits the Chinese language context and gives good comparison results.
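
Steps (2) through (7) of the abstract can be sketched as below. This is a hedged illustration under several assumptions the source does not confirm: segmentation uses forward maximum matching, a fingerprint is a stop word plus the next `follow` non-stop words, and similarity is the Jaccard index of the two fingerprint sets. The function names, the tiny dictionary, the stop-word list, and the threshold are all hypothetical.

```python
def segment(text, dictionary, max_len=4):
    """Step 2 (assumed: forward maximum matching against the dictionary)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def fingerprints(tokens, stop_words, follow=1):
    """Steps 3-5: pair each stop word with the non-stop words after it."""
    prints = set()
    for i, tok in enumerate(tokens):
        if tok in stop_words:
            tail = [t for t in tokens[i + 1:] if t not in stop_words][:follow]
            prints.add((tok, *tail))
    return prints


def similarity(a, b):
    """Step 6 (assumed measure: Jaccard index of the fingerprint sets)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical data for illustration only.
DICTIONARY = {"我", "的", "母亲", "是", "老师", "医生"}
STOP_WORDS = {"的", "是"}
THRESHOLD = 0.3  # step 7: hypothetical cut-off

fp1 = fingerprints(segment("我的母亲是老师", DICTIONARY), STOP_WORDS)
fp2 = fingerprints(segment("我的母亲是医生", DICTIONARY), STOP_WORDS)
score = similarity(fp1, fp2)
is_similar = score > THRESHOLD
```

With this toy data the two documents share one of three distinct fingerprints, so `score` is 1/3 and the pair passes the hypothetical threshold. Increasing `follow` makes fingerprints more specific and matches rarer.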

Description

Technical field

[0001] The invention relates to a method for querying the similarity between electronic documents, in particular to a method for comparing document similarity using Chinese stop words, and belongs to the technical field of computer language processing and information retrieval.

Background

[0002] With the spread of Internet technology, electronic documents are used more and more. While they help people work more efficiently and save paper and other natural resources, they also bring additional problems. For example, electronic documents are easier to copy and distribute, and this property of digitized documents makes plagiarism technically easier. Combined with the growing volume of electronic data, judging this kind of plagiarism manually is becoming increasingly difficult. Therefore, it is very necessary to use modern information te...


Application Information

IPC(8): G06F17/30
Inventor 林述民
Owner RUN TECH CO LTD BEIJING