Method, device and storage medium for extracting news web page content

A technique for extracting news webpage content, applied in the field of news webpage content extraction, that addresses problems such as differing news content.

Active Publication Date: 2019-01-25
数据地平线(广州)科技有限公司

AI Technical Summary

Problems solved by technology

[0007] The invention provides a method, device and storage medium for extracting news webpage conte...



Examples


Embodiment Construction

[0041] The following will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without creative efforts fall within the protection scope of the present invention.

[0042] It should be noted that if there is a directional indication (such as up, down, left, right, front, back...) in the embodiment of the present invention, the directional indication is only used to explain the position in a certain posture (as shown in the accompanying drawing). If the specific posture changes, the directional indication will also change accordingly.

[0043] In addition, if there are descriptions involving "first", "second...


Abstract

The invention discloses a method, a device and a storage medium for extracting news webpage content, and relates to the technical field of news webpage content extraction. The method comprises the following steps: obtaining the HTML code of the webpage, linearly reconstructing the HTML of the webpage, removing HTML noise tags, filtering and partitioning the data set, absorbing pseudo-noise paragraphs, and generating text paragraphs. Linear reconstruction flattens the mutually nested, tree-structured div tags into a linear structure in which each div tag can be located conveniently, thereby eliminating the influence of nested tags on subsequent steps. Removing HTML noise tags reduces the impact of noise text on paragraph clustering. Filtering and partitioning the data set further reduces the effect of noise on text paragraphs. Absorbing pseudo-noise paragraphs increases the recall rate of text paragraphs. The method overcomes the drawback of writing site-specific crawlers for specific websites and enhances the generality of news page content extraction. Compared with the existing technology, the method extracts news content accurately and efficiently.
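The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the excerpt does not give the clustering procedure, thresholds, or tag lists, so `NOISE_TAGS`, `BLOCK_TAGS`, `min_len`, and `max_gap` are assumptions chosen for illustration, and a simple length filter stands in for the data set filtering and partitioning step.

```python
from html.parser import HTMLParser

# Assumed tag lists; the excerpt does not specify which tags count as noise.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside", "form"}
BLOCK_TAGS = {"div", "p"}


class LinearBlockParser(HTMLParser):
    """Flattens nested div/p tags into a flat, ordered list of text blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # linear sequence of text blocks
        self._buf = []          # text collected since the last block boundary
        self._noise_depth = 0   # > 0 while inside a noise tag

    def _flush(self):
        # Collapse whitespace and emit the buffered text as one block.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()       # every block tag is a boundary, at any depth

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self._noise_depth = max(0, self._noise_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._noise_depth == 0:   # drop text inside noise tags
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_paragraphs(html, min_len=20, max_gap=1):
    """Return candidate text paragraphs (min_len and max_gap are assumed values)."""
    parser = LinearBlockParser()
    parser.feed(html)
    parser.close()
    blocks = parser.blocks

    # Filtering: short blocks are treated as noise candidates.
    keep = [len(block) >= min_len for block in blocks]

    # Absorption: a short run of rejected blocks lying between two kept
    # blocks is treated as pseudo-noise and kept after all.
    kept_idx = [i for i, flag in enumerate(keep) if flag]
    for left, right in zip(kept_idx, kept_idx[1:]):
        if 0 < right - left - 1 <= max_gap:
            for i in range(left + 1, right):
                keep[i] = True

    return [block for block, flag in zip(blocks, keep) if flag]
```

Calling `extract_paragraphs(page_html)` on a fetched page returns the surviving blocks in document order. A short sentence sandwiched between two long paragraphs is retained, which is the recall benefit the abstract attributes to absorbing pseudo-noise paragraphs, while short blocks at the edges of the page (menus, copyright lines) are dropped.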

Description

Technical field

[0001] The invention relates to the technical field of news web page content extraction, in particular to a method, device and storage medium for extracting news web page content.

Background technique

[0002] In the field of news, the extraction of news webpage content is the core step, and the accuracy of news text, release time and title extraction is directly related to the quality of news search and user experience. In addition, in the financial field, the accurate extraction of news webpages is also the key to quantitative transactions. The news content is analyzed and processed based on natural language processing technology, and the processing results are used for economic behavior analysis. Therefore, how to extract news web page content has become a key issue of the research of the present invention.

[0003] At present, there are various methods for extracting news webpage content, which are mainly divided into the following two categories: templ...


Application Information

IPC(8): G06F16/958, G06F16/953
Inventor 陈贺
Owner 数据地平线(广州)科技有限公司