Webpage text extraction method based on maximum text density

A web page text extraction technology applied in the field of information processing. It addresses the limited applicability and lack of generality of existing methods, as well as the time-consuming and labor-intensive acquisition of information pattern recognition knowledge, and it improves the accuracy rate.

Status: Inactive; Publication Date: 2014-04-09

TONGJI UNIV

1 Cites, 24 Cited by


Problems solved by technology

Methods based on DOM structure and web page segmentation mainly analyze HTML tags. Web pages are now increasingly complex and non-standardized, so interpreting page content through HTML semantics alone is no longer applicable.

Template-based methods can only target one type of information source in a specific format, and acquiring the information pattern recognition knowledge needed to construct a template is time-consuming and laborious. Internet web pages are becoming more and more diverse and customizable, so template-based methods are not universal.


Embodiment Construction

[0040] As shown in Figure 1, the specific steps of the web page text extraction method based on the maximum text density are as follows:

[0041] 1. Web page preprocessing

[0042] (1) Character encoding problem

[0043] Common encodings include GBK (covering both Simplified and Traditional Chinese), GB2312 (Simplified Chinese), BIG-5 (Traditional Chinese), UTF-8, UTF-16, and UNICODE. In an HTML document, the encoding is declared as follows:

[0044]

[0045]

[0046]

[0047] The charset attribute defines how the web page is encoded. To prevent garbled characters, the preprocessing stage converts the acquired web page file from its original encoding to UTF-8. If no encoding information can be obtained from the web page, conversion to UTF-8 is still attempted.
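
The encoding step in [0047] can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function name `to_utf8`, the regular expression, and the fallback list are all assumptions, and the `<meta>` tag shown in the comment is a common form of the declaration rather than the text elided from [0044] to [0046].

```python
import re

# Assumed pattern: finds charset=... inside a <meta> tag, e.g.
#   <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
#   <meta charset="utf-8">
_CHARSET_RE = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE
)

def to_utf8(raw: bytes, fallbacks=("utf-8", "gbk", "big5")) -> bytes:
    """Return the page re-encoded as UTF-8 (hypothetical helper).

    The declared charset is tried first; if none is declared, or decoding
    with it fails, the assumed fallback encodings are tried in order.
    """
    candidates = []
    match = _CHARSET_RE.search(raw[:4096])  # declarations sit near the top
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.extend(fallbacks)

    for encoding in candidates:
        try:
            return raw.decode(encoding).encode("utf-8")
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or wrong guess: try the next one

    # Last resort: keep what can be decoded and mark the rest.
    return raw.decode("utf-8", errors="replace").encode("utf-8")
```

Trying the declared charset before any fallback mirrors the order described above; the fallback order itself is a guess and would need tuning for a real crawler.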

[0048] (2) Web page standardization

[0049] Now the HTML code format on som...


Abstract

The invention relates to a web page text extraction method based on the maximum text density. The method includes the following steps: (1) preprocessing the web page, which covers handling the character encoding and standardizing the page; (2) parsing the web page into a DOM tree and extracting the tag text blocks in the page according to specific tags; (3) calculating the maximum text density; and (4) extracting the text: after all tag text blocks have been processed, they are sorted by their calculated text density and the tag with the maximum text density is selected. That tag, together with the content of its nested sub-tags, serves as the text block, and the text is obtained after the tags are removed. The method has low algorithmic complexity, is general, and works well on web pages with complex structures.
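
Steps (2) to (4) of the abstract can be sketched as below. This excerpt does not give the patent's text density formula or its list of "specific tags", so both are assumptions here: density is taken as text characters per tag inside a block, and `BLOCK_TAGS` is a hypothetical tag set. The sketch also assumes the input is already well formed, as the standardization step would ensure, and it uses Python's standard `html.parser` event stream instead of building a full DOM tree.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "td", "article", "section"}  # assumed "specific tags"
SKIP_TAGS = {"script", "style"}                   # never counted as text

class BlockCollector(HTMLParser):
    """Collects the text and tag count of every candidate block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_blocks = []  # stack of candidate blocks still open
        self.blocks = []       # candidate blocks already closed
        self.skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        for block in self.open_blocks:
            block["tags"] += 1  # nested tags count against every ancestor
        if tag in BLOCK_TAGS:
            self.open_blocks.append({"tag": tag, "tags": 1, "text": []})

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS and self.open_blocks:
            self.blocks.append(self.open_blocks.pop())

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            for block in self.open_blocks:
                block["text"].append(text)  # sub-tag text belongs to ancestors

def density(block) -> float:
    """Assumed definition: text characters per tag in the block."""
    return sum(len(t) for t in block["text"]) / block["tags"]

def extract_main_text(html: str) -> str:
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    if not collector.blocks:
        return ""
    # Sort by density and take the densest block, as in step (4).
    best = sorted(collector.blocks, key=density, reverse=True)[0]
    return "\n".join(best["text"])
```

A block made mostly of links, such as a navigation bar, has many tags for little text and scores low under this assumed density, while a block of long paragraphs scores high. A different density definition would change which block wins.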

Description

Technical field

[0001] The present invention relates to Internet-based information processing, specifically network information extraction and application.

Background technique

[0002] With the development of the times, the World Wide Web has become an important source of information. Users usually view web pages directly in a browser. In addition, many Internet-based information processing tasks (such as information search, data mining, and machine translation) also operate on the information content of web pages. However, the text of a web page is often surrounded by "web page noise" such as advertisement links, navigation bars, and copyright information. Extracting web page text accurately and efficiently has therefore become an important topic in network information extraction and application, with high application value and practical significance.

[0003] At p...


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G06F17/30

CPC: G06F16/986

Inventor: 蒋昌俊, 陈闳中, 闫春钢, 丁志军, 王鹏伟, 何源, 夏琳娟

Owner: TONGJI UNIV
