Method and apparatus for grabbing content of target page
A target page and content capture technology, applied in the field of web pages, can solve problems such as inconsistent absolute paths, inability to execute JS scripts, and no parsing function for capture scripts, so as to improve capture efficiency and simplify the configuration of capture rules
- Summary
- Abstract
- Description
- Claims
- Application Information
AI Technical Summary
Problems solved by technology
Method used
Image
Examples
Embodiment 1
[0061] refer to figure 1 , which shows a schematic flow chart of a method for grabbing target page content according to the present invention, which may specifically include:
[0062] Step 110, obtaining the webpage document of the target page;
[0063] In the embodiment of the present invention, before executing the content grabbing process of the target page, that is, before enabling the grabbing script, the relevant information for the target page will be configured, such as the link address of the target page, and the relative path information for the target page. The relative path information is used to find the location of the content of the target page in the web document of the target page.
[0064] Of course, in the embodiment of the present invention, the present invention can provide a configuration interface, which includes a configuration column for the link address of the target page, a configuration column for the relative path, etc. After the user confirms, th...
Embodiment 2
[0104] refer to figure 2 , which shows a schematic flow chart of a method for grabbing target page content according to the present invention, which may specifically include:
[0105] Step 210, obtaining the webpage document of the target page;
[0106] Step 220, according to the relative path information for the target page, search the document object model node under the relative path information in the web document; wherein, the relative path information is constructed based on the attribute related information of the document object model node ;
[0107] Step 230 , extracting the content of the target page from the document object model node according to the preset regular matching expressions and / or matching expressions before and after the content of the target page.
[0108] In the embodiment of the present invention, when extracting the content of the target page from the found document object model nodes, in order to more accurately extract the content required by ...
Embodiment 3
[0118] refer to image 3 , which shows a schematic flow chart of a method for grabbing target page content according to the present invention, which may specifically include:
[0119] Step 310, according to the link address of the list page, obtain the webpage document of the list page;
[0120] In this step, the list page mentioned is the target page mentioned in the first embodiment is the list page.
[0121] Before step 310 in the embodiment of the present invention, the link address of the list page, relative path information for the list area in the list page, extraction rules for DOM nodes, etc. can be configured first. Such as Figure 3A , the user can enter the Figure 3A On the list configuration page, configure the website name "Site", the column name of the website "Zone", and the URL of the list page of this column http: / / ng.d.cn / wushuangjianji / news / list_walkthrough_1.html. The relative path information ul[class=znewsList] of the URL link with the resource page...
PUM
Abstract
Description
Claims
Application Information
- R&D Engineer
- R&D Manager
- IP Professional
- Industry Leading Data Capabilities
- Powerful AI technology
- Patent DNA Extraction
Browse by: Latest US Patents, China's latest patents, Technical Efficacy Thesaurus, Application Domain, Technology Topic, Popular Technical Reports.
© 2024 PatSnap. All rights reserved.Legal|Privacy policy|Modern Slavery Act Transparency Statement|Sitemap|About US| Contact US: help@patsnap.com