A Method of Webpage Text Extraction

A webpage text extraction technology, applied in the computer field, that addresses problems such as low performance and spam content in extracted text, and achieves improved work efficiency, high analysis efficiency, and accurate results.

Status: Inactive | Publication Date: 2017-01-18

北京中搜云商网络技术有限公司

4 Cites, 0 Cited by

Problems solved by technology

[0006] The disadvantage of manually written templates is that writing them takes a lot of human resources, and as the target website changes, the cost of maintaining the templates is also very high.

The disadvantage of the automatic template method is that the algorithm is complex, and the target website must still be monitored periodically so the template can be updated when the site changes.

Whether the template is generated manually or automatically, the underlying assumption is that the website's data is generated from a template. This largely holds for some large websites, although different entry points may use different templates. Many small and medium-sized websites, however, are not well templated, so template extraction can recover only most of the information and is more likely to include spam.

Vision-based page segmentation algorithms are not suitable for news search engine applications because of their complex rules and low performance.

Embodiment Construction

[0032] The specific embodiments of the present invention will be further described in detail below in conjunction with the accompanying drawings.

[0033] A web page contains information such as the article title, source, release time, body text, and author. It may also include a large number of advertisements, spam, and so on. In news web pages, the "longest string" mostly appears in the body text. The method uses this feature to find a paragraph in the body-text area and obtain its corresponding label (tag) features, and then uses those label features to search forward, backward, and bidirectionally for similar label nodes. This process is referred to as "label clustering".
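The "label clustering" idea can be illustrated with a minimal sketch. It is not the patented procedure: the flat `Node` record, the `same_features` test (identical tag name and class attribute), and the function name `cluster_around_seed` are all assumptions made for illustration, since the source does not say how similarity between label nodes is measured.

```python
from typing import List, Tuple

# Hypothetical flat record for a sibling node: (tag, css_class, text).
Node = Tuple[str, str, str]

def same_features(a: Node, b: Node) -> bool:
    # Assumption: "similar" means identical tag name and class attribute.
    return a[0] == b[0] and a[1] == b[1]

def cluster_around_seed(nodes: List[Node], seed: int) -> Tuple[int, int]:
    """Return inclusive (start, end) indices of the run of nodes similar to the seed."""
    start = end = seed
    # Search backward from the seed while the label features still match.
    while start > 0 and same_features(nodes[start - 1], nodes[seed]):
        start -= 1
    # Search forward from the seed while the label features still match.
    while end < len(nodes) - 1 and same_features(nodes[end + 1], nodes[seed]):
        end += 1
    return start, end

nodes = [
    ("div", "ad", "Buy now"),
    ("p", "", "First paragraph."),
    ("p", "", "The longest paragraph, used as the seed."),
    ("p", "", "Last paragraph."),
    ("div", "footer", "Copyright"),
]
start, end = cluster_around_seed(nodes, seed=2)
body_text = "\n".join(text for _, _, text in nodes[start:end + 1])
```

Here the seed paragraph pulls in its neighbouring `p` nodes, while the advertisement and footer nodes stop the search in each direction.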

[0034] A method for extracting the body text of a webpage, which extracts news webpage body content by searching for the "longest string" to find landmark nodes. The method comprises the following steps: I, deleting the negligible labels in the webpage and the content within those labels; II, looking ...
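Step I might look roughly like the following sketch. The source does not list which labels count as negligible, so the choice of `script`, `style`, `noscript`, and HTML comments is an assumption, as is the function name `strip_negligible`. Regular expressions are used only to keep the example short; they are not a robust way to process arbitrary HTML.

```python
import re

# Assumption: the source does not say which labels are "negligible";
# these tags and HTML comments are chosen purely for illustration.
NEGLIGIBLE_TAGS = ("script", "style", "noscript")

def strip_negligible(html: str) -> str:
    """Remove negligible elements together with the content inside them."""
    # Drop HTML comments first.
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    for tag in NEGLIGIBLE_TAGS:
        # Non-greedy match from the opening tag to its closing tag.
        pattern = rf"<{tag}\b[^>]*>.*?</{tag}\s*>"
        html = re.sub(pattern, "", html, flags=re.DOTALL | re.IGNORECASE)
    return html

page = "<p>a</p><script>var x=1;</script><!-- c --><style>p{}</style><p>b</p>"
cleaned = strip_negligible(page)
```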

Abstract

The invention provides a webpage text extraction method. The method comprises the following steps: I, preprocessing the webpage; II, searching for the longest string in the webpage; III, building a DOM tree and using it to find the node corresponding to the longest string; IV, determining a start node and an end node according to the labels of the node corresponding to the longest string; V, checking and filtering the start node and the end node; and VI, outputting the text in the filtered start node and the filtered end node. The method overcomes the shortcomings of template-based and block-segmentation techniques in news content extraction, searches for seed paragraphs based on the longest string, and improves the efficiency and accuracy of webpage text extraction.
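Steps II and III can be sketched as follows, using only Python's standard `html.parser` module. This is an illustration under stated assumptions rather than the patent's implementation: "longest string" is taken to mean the longest contiguous text run between tags, the node is identified by its path of open tags instead of a full DOM tree, and the names `TextRunCollector` and `find_longest_string` are hypothetical.

```python
from html.parser import HTMLParser

class TextRunCollector(HTMLParser):
    """Collect each text run together with the path of open tags enclosing it."""

    # Void elements never get a closing tag, so they are not pushed on the stack.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.runs = []   # list of (tag_path, text)

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.runs.append(("/".join(self.stack), text))

def find_longest_string(html):
    """Return (tag_path, text) for the longest text run, or None if there is no text."""
    collector = TextRunCollector()
    collector.feed(html)
    collector.close()
    return max(collector.runs, key=lambda run: len(run[1]), default=None)

sample = (
    "<html><body><div class='nav'><a>Home</a></div>"
    "<div id='c'><p>This is the long article paragraph.</p><p>Short.</p></div>"
    "</body></html>"
)
seed = find_longest_string(sample)
```

The returned tag path plays the role of the label features of the seed paragraph, from which the start and end nodes of step IV would then be derived.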

Description

Technical field

[0001] The invention relates to a method in the field of computers, in particular to a method for extracting news webpage body text by searching for the "longest string" to find landmark nodes.

Background art

[0002] In the field of news (or information) search, news body-text extraction is an essential step, and the quality of the extraction determines the quality of news search and the user experience.

[0003] At present, there are various news text extraction methods, which can be divided into two categories according to whether templates are used: template-based (or wrapper-based) extraction and non-template-based extraction.

[0004] Template-based extraction: first define the template, and then write a program to parse and execute the template to obtain data. According to how the template is generated, it can be divided into manual template extraction and automatic template extraction. Manual template extraction. For the extracted ta...

Application Information

Patent Type & Authority: Patents (China)

IPC(8): G06F17/30

CPC: G06F16/80, G06F40/14

Inventor: 涂波

Owner: 北京中搜云商网络技术有限公司
