News document duplicate removal method and device and storage medium

A news document deduplication technique, applied in the field of natural language processing, that reduces the large influence of low-frequency noise words and removes the time-consuming, labor-intensive manual labeling of training samples.

Pending Publication Date: 2020-02-04
NAVINFO

AI Technical Summary

Problems solved by technology

Solves the time-consuming and labor-intensive problem of manually labeling training samples in supervised learning, and the problem that unsupervised learning is strongly affected by low-frequency noise words.



Embodiment Construction

[0027] To make the purpose, technical solution, and advantages of the present application clearer, the technical solution is described below clearly and completely with reference to specific embodiments of the present application and the corresponding drawings. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments that persons of ordinary skill in the art obtain from the embodiments in this application without creative effort fall within the scope of protection of this application.

[0028] Figure 1 is a schematic flowchart of an embodiment of the news document deduplication method provided by this application. The flow includes:

[0029] Step 105: perform word segmentation on each road news document in the news document set to obtain the terms of each road news document;

[0030] Optionally, the news documents store the road new...
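
As an illustration of Step 105, the following is a minimal sketch of segmenting every document in a set into terms. The regular-expression tokenizer is only a stand-in chosen for this example; the source does not say which word segmenter is used, and Chinese news text would need a dedicated segmenter. The names `segment`, `segment_all`, and the sample documents are hypothetical.

```python
import re


def segment(text):
    """Split one document into lowercase terms.

    Assumption: a simple alphanumeric tokenizer stands in for the
    word segmenter, which the source does not specify.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def segment_all(documents):
    """Map each document id to the list of terms in that document."""
    return {doc_id: segment(text) for doc_id, text in documents.items()}


# Hypothetical road news documents, used only for illustration.
docs = {
    "d1": "Road closed on G4 expressway after accident",
    "d2": "Accident closes G4 expressway",
}
terms = segment_all(docs)
```

Each document id maps to its own term list, which is the input that later steps would weight and compare.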


Abstract

The invention discloses a news document deduplication method and device, and a storage medium. The method comprises: performing word segmentation on each document; calculating the weight of each term in the document; obtaining a document vector from the terms; calculating the similarity between documents from their document vectors; grouping documents whose similarity is greater than a preset value into a cluster, and determining a cluster center according to the similarity among the documents in the cluster; and marking duplicate documents according to the cluster center. The method has the following advantages. Training samples do not need to be labeled manually, which removes the time and labor that manual labeling costs. Similarity is calculated from the weights of the terms in the document, and the weights of named entities and event-action terms are increased, which reduces the influence of low-frequency noise words. Documents whose similarity is greater than a preset value are grouped into a cluster, and each document appears in only one cluster, so the duplicate marking is unique. The marked duplicates are then used for deduplication, which prevents duplicate documents from being processed multiple times.
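
The steps listed in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the source does not give the weighting formula, the similarity measure, or the clustering procedure. The sketch assumes term-frequency weights multiplied by a hypothetical `boost` factor for terms in a hypothetical `boosted_terms` set (standing in for named entities and event-action terms), cosine similarity, a greedy single-pass clustering, and a cluster center chosen as the member with the highest summed similarity to the other members.

```python
import math
from collections import Counter


def weighted_vector(terms, boosted_terms, boost=2.0):
    """Term-frequency vector; terms in boosted_terms get weight * boost (assumption)."""
    counts = Counter(terms)
    return {t: c * (boost if t in boosted_terms else 1.0) for t, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster(vectors, threshold):
    """Greedy single pass (assumption): a document joins the first cluster
    whose first member is more similar than threshold, otherwise it starts
    a new cluster. Each document therefore lands in exactly one cluster."""
    clusters = []
    for doc_id in vectors:
        for members in clusters:
            if cosine(vectors[doc_id], vectors[members[0]]) > threshold:
                members.append(doc_id)
                break
        else:
            clusters.append([doc_id])
    return clusters


def cluster_center(members, vectors):
    """Pick the member with the highest summed similarity to the others."""
    def total(d):
        return sum(cosine(vectors[d], vectors[o]) for o in members if o != d)
    return max(members, key=total)


def mark_duplicates(clusters, vectors):
    """Map each document to None if it is a cluster center, else to its center."""
    marks = {}
    for members in clusters:
        center = cluster_center(members, vectors)
        for d in members:
            marks[d] = None if d == center else center
    return marks


# Hypothetical segmented documents and boosted terms, for illustration only.
doc_terms = {
    "d1": ["g4", "expressway", "closed", "accident"],
    "d2": ["accident", "closes", "g4", "expressway"],
    "d3": ["new", "bridge", "opens", "downtown"],
}
boosted = {"g4", "expressway", "accident", "bridge"}
vectors = {d: weighted_vector(t, boosted) for d, t in doc_terms.items()}
clusters = cluster(vectors, threshold=0.8)
marks = mark_duplicates(clusters, vectors)
```

In this sketch a downstream consumer would process only the documents marked `None` and skip the rest, which matches the stated goal of not processing duplicates more than once. The threshold of 0.8 is an arbitrary illustrative value for the preset similarity.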

Description

Technical field

[0001] The present application relates to the technical field of natural language processing, and in particular to a method, a device, and a storage medium for deduplicating news documents.

Background technology

[0002] With the development of the Internet, the amount of online news information has increased dramatically. A large amount of duplicate news information is processed many times, which reduces the efficiency of information processing. How to deduplicate news information has therefore become an urgent problem.

[0003] The existing technology uses supervised learning and unsupervised learning to extract features from news information. Supervised learning extracts keywords from the events expressed in the text and uses them to represent those events, then quantifies these keywords and calculates the similarity between different documents as the basis for clustering. Taking news related to the road field as an example, the specific place names ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/166, G06F40/284
Inventor: 冯博琳, 王秋森, 刘斌生, 吴中恒
Owner: NAVINFO