Method for extracting core content of webpage based on text-tag density

A core-content extraction method applied in the Internet and communication fields. It addresses the poor versatility of existing approaches and improves extraction accuracy and efficiency while remaining simple.

Active Publication Date: 2016-10-26
BEIJING FORESTRY UNIVERSITY
4 Cites, 17 Cited by

AI Technical Summary

Problems solved by technology

Template-based extraction depends heavily on the specific structure of a web page. Once that structure changes, the template must be reset and relearned, so the approach lacks versatility.

Examples

Embodiment Construction

[0040] The present invention will be described in detail below with reference to the drawings and embodiments.

[0041] The invention takes the source code of a webpage as input and outputs the core text of the webpage, including the title, keywords, description, and core content. Its focus is on acquiring the core content of the webpage.

[0042] As shown in Figure 1, the processing of the present invention comprises four stages: webpage source code preprocessing, webpage core content range estimation, core content boundary determination, and deletion of remaining tags.

[0043] The present invention is specifically realized through the following technical solutions:

[0044] 1. Webpage source code preprocessing stage

[0045] The preprocessing stage extracts the core elements of the webpage, such as the title, keywords, and description, from the original webpage text, and deletes the tags in the webpage text that are likely to interfere with extr...
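
The preprocessing stage described above could be sketched as follows. This is a minimal illustration, not the patented implementation: the excerpt is truncated before it names the interfering tags, so the choice of comments and `script`, `style`, and `noscript` elements is an assumption, and `preprocess` and the regular expressions are hypothetical names introduced here.

```python
import re

# Assumption: comments, scripts, and styles are the "interfering" parts;
# the source text is cut off before it lists them.
_NOISE = re.compile(
    r"<!--.*?-->|<(script|style|noscript)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
_TITLE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)
_META = re.compile(
    r"<meta\b[^>]*\bname\s*=\s*[\"']?(keywords|description)[\"']?[^>]*>",
    re.IGNORECASE,
)
_CONTENT = re.compile(r"\bcontent\s*=\s*([\"'])(.*?)\1", re.IGNORECASE | re.DOTALL)


def preprocess(html):
    """Return (core_elements, pending_text) for a page's source code."""
    elements = {"title": "", "keywords": "", "description": ""}

    # Core elements: the <title> text and the keywords/description meta tags.
    title = _TITLE.search(html)
    if title:
        elements["title"] = title.group(1).strip()
    for meta in _META.finditer(html):
        content = _CONTENT.search(meta.group(0))
        if content:
            elements[meta.group(1).lower()] = content.group(2).strip()

    # Pending text: the source with the assumed noise elements removed.
    pending = _NOISE.sub("", html)
    return elements, pending
```

Calling `preprocess(page_source)` returns the extracted elements as a dictionary together with the pending text that the later stages would work on. Regular expressions are used here only to keep the sketch free of DOM parsing, in line with the method's stated goal.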

Abstract

This invention relates to a method for extracting the core content of a webpage based on text-tag density. The method comprises four steps: preprocessing the webpage source code, estimating the range of the core content of the webpage, determining the boundary of the core content, and deleting residual tags. The preprocessing step extracts core elements such as the title, summary, and description from the original webpage text and deletes the tags unrelated to the core content of the webpage, yielding a pending text. The range estimation step determines the approximate range of the core content of the webpage. The boundary determination step separately determines the precise start and stop positions of the core content in the webpage text. The residual tag deletion step takes out the core content part and deletes the residual tags, yielding core content that is convenient to analyze and process. The method does not need to analyze the DOM (Document Object Model) structure of the webpage document, places no restriction on the theme or content of the webpage, and has linear complexity. It is applicable to extracting the core content of various kinds of webpages, denoising webpages, and similar applications.
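
The excerpt does not define text-tag density or the boundary rule, so the following is only an illustrative sketch of the last three steps under stated assumptions. Density is taken as text characters per tag on each line, the rough range is seeded at the densest line, and the boundary is extended while neighboring lines stay at or above a cutoff. The names `line_density`, `extract_core`, and `threshold`, and the default value of 10.0, are hypothetical and not taken from the patent.

```python
import re

TAG = re.compile(r"<[^>]+>")


def line_density(line):
    """Text characters per tag on one line (an assumed definition)."""
    tag_count = len(TAG.findall(line))
    text = TAG.sub("", line).strip()
    return len(text) / (tag_count + 1)


def extract_core(pending, threshold=10.0):
    """Return the core text of a preprocessed page, one line per block."""
    lines = pending.splitlines()
    density = [line_density(line) for line in lines]
    if not density or max(density) == 0:
        return ""

    # Range estimation (assumed rule): start from the densest line.
    peak = max(range(len(density)), key=density.__getitem__)

    # Boundary determination (assumed rule): extend the start and stop
    # positions separately while adjacent lines remain dense enough.
    start = peak
    while start > 0 and density[start - 1] >= threshold:
        start -= 1
    stop = peak
    while stop < len(density) - 1 and density[stop + 1] >= threshold:
        stop += 1

    # Residual tag deletion: strip the tags left inside the core range.
    core = [TAG.sub("", line).strip() for line in lines[start:stop + 1]]
    return "\n".join(text for text in core if text)
```

Each pass visits every line at most once, so the sketch stays linear in the length of the input, which matches the complexity the abstract claims. No DOM tree is built at any point.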

Description

Technical field

[0001] The invention relates to the technical field of the Internet in the field of communication, and in particular to a method with linear complexity for extracting the core content of webpage text based on text-tag density.

Background technique

[0002] With the rapid development of the Internet, the World Wide Web (WWW) has become the largest Internet database in the world. How to effectively extract information from the World Wide Web has therefore become a new research direction, which involves collecting, processing, and extracting information from web pages at high speed.

[0003] In reality, however, a web page carries a lot of irrelevant information in addition to the text content related to its topic, including logos, advertisements, images, navigation, sidebars, and more. Although this information can help users browse the page, it is useless in most cases for many Internet applica...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
CPC: G06F16/95, G06F16/9577
Inventor: 蒋东辰, 闫艺鑫
Owner: BEIJING FORESTRY UNIVERSITY