
Method and device for extracting webpage text

A method for extracting webpage text, applied in the field of mobile communication. It addresses problems of front-end extraction such as environment dependence (front-end extraction cannot be executed when the browser is not open) and the inability to extract text from non-standard webpages, and it achieves independence from the browser environment and page structure together with good scalability.

Active Publication Date: 2018-11-16
ALIBABA (CHINA) CO LTD
4 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0006] However, the front-end extraction method has the following disadvantages:
1. Environment dependence: it relies on the browser. When the browser is not open, front-end extraction cannot be executed.
2. Page dependence: webpage authors must write reasonably standardized webpage code. For non-standard or special webpages, the front-end extraction method cannot extract the webpage text.



Detailed Description of the Embodiments

[0072] The technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without creative efforts fall within the protection scope of the present invention.

[0073] Various embodiments of the present invention relate to a method and device for extracting webpage text. Figure 1 shows a flow chart of the method for extracting webpage text in the present invention. As shown in Figure 1, the method for extracting webpage text includes:

[0074] S101: Read the webpage data, determine the interference data included in the webpage data, and replace the interference data with null characters.

[0075] In this ste...
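
Step S101 can be illustrated with a minimal Python sketch. The source text available here does not say which parts of a page count as interference data, so the patterns below (script blocks, style blocks, comments, and remaining tags) are assumptions, and the function name strip_interference is hypothetical. The sketch also assumes that line breaks inside a removed span should be kept, so that line numbers recorded afterwards still correspond to lines of the original page.

```python
import re

# Assumed interference patterns (not taken from the patent text):
# script blocks, style blocks, HTML comments, and any remaining tags.
_INTERFERENCE = re.compile(
    r"<script\b.*?</script\s*>|<style\b.*?</style\s*>|<!--.*?-->|<[^>]+>",
    re.IGNORECASE | re.DOTALL,
)


def strip_interference(html: str) -> str:
    """Replace each interference span with nothing, keeping its line breaks."""
    # Keeping the newlines is an assumption made so line numbers stay stable.
    return _INTERFERENCE.sub(lambda m: "\n" * m.group(0).count("\n"), html)


if __name__ == "__main__":
    page = "<p>Hello</p>\n<script>\nvar x=1;\n</script>\n<p>World</p>"
    print(strip_interference(page).split("\n"))
```

A regular expression is used only to keep the sketch short; a real page with nested or malformed markup would likely need an HTML parser.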


Abstract

The invention discloses a webpage text extraction method and device. The method comprises the following steps: reading webpage data, determining interference data contained in the webpage data and replacing the interference data with null characters; recording the line number of each line of the webpage and the quantity of characters in the corresponding line; determining the webpage text by utilizing the line number of each line and the quantity of the characters in the corresponding line; and extracting the webpage text. Compared with the prior art, the webpage text extraction method and device have the advantages of being independent of the browser environment and page structure and being good in expansibility.
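
The line-number and character-count steps of the abstract can be sketched in Python as follows. The abstract does not state how the text is determined from these two values, so the selection rule here is an assumption: lines with at least min_chars characters are grouped into runs, a run may bridge at most max_gap shorter lines, and the run with the most characters is taken as the webpage text. The function names and both thresholds are hypothetical, and the input is assumed to have had its interference data removed already.

```python
def line_lengths(cleaned: str) -> list[tuple[int, int]]:
    """Record (line number, character count) for every line, 1-indexed."""
    return [(i, len(line.strip()))
            for i, line in enumerate(cleaned.split("\n"), start=1)]


def find_text_block(lengths, min_chars=20, max_gap=2):
    """Return (first_line, last_line) of the densest run, or None.

    Assumed heuristic: only lines with at least min_chars characters
    start or extend a run; a run ends when more than max_gap shorter
    lines separate two long lines.
    """
    best, best_total = None, 0
    start = last = None
    total = 0
    for line_no, count in lengths:
        if count < min_chars:
            continue  # short lines never start or extend a run
        if start is None or line_no - last - 1 > max_gap:
            start, total = line_no, 0  # gap too wide: begin a new run
        last = line_no
        total += count
        if total > best_total:
            best, best_total = (start, last), total
    return best


def extract_text(cleaned: str) -> str:
    """Extract the non-empty lines of the selected run."""
    block = find_text_block(line_lengths(cleaned))
    if block is None:
        return ""
    first, last = block
    lines = cleaned.split("\n")[first - 1:last]
    return "\n".join(line.strip() for line in lines if line.strip())
```

Because the rule looks only at line numbers and character counts, it needs neither a browser nor knowledge of the page structure, which matches the advantages the abstract claims. The thresholds would need tuning on real pages.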

Description

Technical Field

[0001] The invention relates to the technical field of mobile communication, in particular to a method and device for extracting webpage text.

Background

[0002] With the popularization of Internet technology, web pages have become the most widely used source of information. To make the most of this breadth, people continue to develop information technologies that can use web pages effectively. However, because of the complexity of the information a webpage carries, a webpage is not as neat and clean as traditional text. It contains a lot of noise content, such as scripts added to enhance user interaction, navigation links added to facilitate browsing, and advertising links added for commercial reasons.

[0003] In order to fully share and utilize the text information carried by the webpage, it is necessary to extract the text information from the webpage. In ...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
Inventor: 王磊
Owner: ALIBABA (CHINA) CO LTD