Method for extracting regular noise from single record web pages
A web page regular-noise extraction technique, applied in the field of network information retrieval, that addresses problems such as missed extraction of small noise, low frequency of noise branches, and low efficiency.
Embodiment Construction
[0083] The present invention will be described below with reference to the accompanying drawings and specific embodiments.
[0084] According to an embodiment of the present invention, a method for extracting regular noise from a single-record web page is provided. Based on the DOM tree structure information, visual information, and text information of the web page, a multi-template model is used to extract the noise before the text, within the text, and after the text of the single-record web page, respectively. In the extraction process, n web pages (n>=2) are first automatically classified according to their DOM tree structure. Then m web pages (m>=2) of the same category (that is, with similar web page structure) are matched and merged to form the site section style tree, SBSTree. On this basis, visual and text rules are used to find the approximate positions of the text title and the text body in the site section style tree (the merged DOM tree), and then judge wh...
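The classification and merging steps described in [0084] can be illustrated with a minimal Python sketch. It is not the patented procedure: the tag-path representation, the Jaccard similarity measure, the 0.8 threshold, and the names `tag_paths`, `similarity`, `classify`, and `merge` are all assumptions made for illustration, and the merged result is a flat count of tag paths standing in for the SBSTree rather than the tree itself. The visual and text rules, and the step truncated above, are not covered.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths of one page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the innermost matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths (e.g. 'html/body/div') found in a page."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def similarity(a, b):
    """Jaccard similarity of two path sets (an assumed structural measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def classify(pages, threshold=0.8):
    """Greedy grouping: a page joins the first cluster whose first
    member is structurally similar enough, else starts a new cluster."""
    path_sets = [tag_paths(p) for p in pages]
    clusters = []  # each cluster is a list of page indices
    for i, paths in enumerate(path_sets):
        for cluster in clusters:
            if similarity(paths, path_sets[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def merge(pages):
    """Merge same-category pages: tag path -> number of pages containing it."""
    counts = Counter()
    for page in pages:
        counts.update(tag_paths(page))
    return counts
```

In this sketch, a path whose count equals the number of merged pages belongs to the shared template of the category, which is where candidates for regular noise would be looked for; a real style tree would also keep node order, attributes, and content.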