Web page cleaning method based on web page content

A web page cleaning technique based on web page content. It addresses two problems of existing methods, their lack of generality and their dependence on the unpredictable HTML structure of web pages, and achieves strong generality.

Status: Inactive
Publication Date: 2007-02-28
上海态格信息技术有限公司

AI Technical Summary

Problems solved by technology

[0007] Although methods (2) and (3) can achieve good results for specific websites or specific types of web pages, they lack generality. As web pages increasingly emphasize personalization and user interaction, their HTML structure becomes less predictable, which exposes the limitations of existing web page cleaning methods.


Examples

Embodiment Construction

[0022] In this embodiment, assume there are two web pages, A and B, and that A is to be cleaned. The HTML of A is:

[0023] Title of A

[0024] advertising

[0025] Content A

[0026] Link to B

The HTML of B is:

[0027] Title of B

[0028] advertising

[0029] Content B

[0030] The following takes pages A and B as examples to explain the cleaning steps in detail:

[0031] 1. Use the web page download component to download the web page to be cleaned from the Internet through the computer's network adapter. In this step, the non-text content of the page, such as script code and HTML tags, is removed.

[0032] In this embodiment, we obtain from page A the title "title of A", the content elements "advertisement" and "content A", and the URL link to B, "link to B".
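Step 1 can be illustrated with a minimal sketch, assuming Python's standard `html.parser` module stands in for the web page download component's parsing stage. The names `PageDecomposer`, `decompose`, and `PAGE_A` are hypothetical, and the markup of page A is invented for illustration because the original HTML is not preserved here. As a simplification, the anchor text of a link is kept as a text element in addition to its URL being recorded.

```python
from html.parser import HTMLParser


class PageDecomposer(HTMLParser):
    """Collects text elements and hyperlink URLs from one HTML page."""

    SKIPPED = {"script", "style"}  # non-text content to discard

    def __init__(self):
        super().__init__()
        self.text_elements = []
        self.urls = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.text_elements.append(text)


def decompose(html):
    """Return (text_elements, urls) for one page, with tags and scripts removed."""
    parser = PageDecomposer()
    parser.feed(html)
    parser.close()
    return parser.text_elements, parser.urls


# Hypothetical markup for page A; not taken from the source.
PAGE_A = """<html><head><title>Title of A</title>
<script>var x = 1;</script></head>
<body><div>advertising</div><p>Content A</p>
<a href="b.html">Link to B</a></body></html>"""

texts, urls = decompose(PAGE_A)
print(texts)  # ['Title of A', 'advertising', 'Content A', 'Link to B']
print(urls)   # ['b.html']
```

The URL list returned here is what step 2 would use to fetch the linked pages; relative URLs such as `b.html` would still need to be resolved against the address of page A.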

[0033] 2. Use the web page download component and the URL list obtained in step 1 to download from the Internet a web page that has a first-level hyperlink relationship with the web page to be cleaned through a com...

Abstract

The invention relates to a web page cleaning method based on web page content. It avoids making assumptions about a page's HTML tags and instead works on the text of the page, that is, on elements stripped of their tags. The method comprises: downloading the web page to be cleaned from the Internet; decomposing it into a URL list (the hyperlinks in its HTML) and a list of text elements; finding web pages whose structure is similar to that of the page to be cleaned; and, for the page to be cleaned and each such page, deleting from the page to be cleaned any text element that appears in both, which yields the cleaned text content. The advantage of the invention is that it is independent of page structure and can therefore handle varied, customized web pages.
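The comparison step in the abstract can be sketched as follows. This is a minimal illustration, assuming that text elements are compared by exact string equality and that the related pages have already been decomposed into lists of text elements; the function name `clean_page` and the sample lists are hypothetical, modeled on pages A and B from the embodiment.

```python
def clean_page(target_elements, related_pages):
    """Drop every text element of the target that also occurs in a related page.

    target_elements: list of text elements from the page to be cleaned.
    related_pages: list of text-element lists, one per related page.
    Exact string matching is an assumption of this sketch.
    """
    shared = set()
    for elements in related_pages:
        shared.update(elements)
    # Preserve the original order of the surviving elements.
    return [element for element in target_elements if element not in shared]


page_a = ["Title of A", "advertising", "Content A"]
page_b = ["Title of B", "advertising", "Content B"]

cleaned = clean_page(page_a, [page_b])
print(cleaned)  # ['Title of A', 'Content A']
```

Because "advertising" appears in both pages, it is treated as shared boilerplate and removed from A, while the title and content that are unique to A survive. No knowledge of the tags that wrapped each element is needed.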

Description

Technical field

[0001] The invention belongs to the field of intelligent information processing and relates to a web page cleaning method based on web page content.

Background

[0002] Web information is increasingly used as an information source in intelligent information processing systems. Despite the continuous emergence of new technologies such as RSS, HTML text is still the main source of information. HTML files are standard ASCII files that look like ordinary text files with many special strings, called tags, added. Structurally, an HTML file is composed of elements. There are many types of elements, and they organize the content of the file and direct its output format. Most elements are "containers", that is, they have a start tag that opens the element and an end tag that closes it. The part between the start link tag and the end...

Application Information

IPC(8): G06F17/30
Inventors: 邱致中, 沈超
Owner: 上海态格信息技术有限公司