Website purification method and device
A website purification method and device, applied in fields such as network data retrieval, network data indexing, and special data processing applications. It addresses two problems: the storage space of the URL scheduling module is wasted, and the practical efficiency of crawlers is low. The effect is to save resources and improve capability.
Examples
Embodiment 1
[0066] Embodiment 1: the command word goodsid. This command word suits sites whose overall URL form varies considerably: the rules must be summarized, the essential parts identified, and the final form of the URL spliced together from them.
[0067] For example, some B2C websites do not standardize their links, so several link forms coexist on the same site:
[0068] http://www.eggcoo.com/page_product_527393_0.html
[0069] http://www.eggcoo.com/product.shtml?method=detailView&id=527393&cv=0
[0070] Both links come from the Golden Egg marketplace (eggcoo.com). They take two different forms but in fact point to the same product.
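The idea of finding the essential part and splicing out one final form can be sketched in Python as follows. This is only an illustrative sketch, not the rule syntax of the embodiment: the two patterns, the function name purify_goodsid, and the choice of the page_product form as the canonical target are all assumptions made for the example.

```python
import re
from urllib.parse import urlsplit

# Hypothetical patterns: each captures the product id (the "main part") in group 1.
GOODSID_PATTERNS = [
    re.compile(r"^/page_product_(\d+)_\d+\.html$"),
    re.compile(r"^/product\.shtml\?(?:.*&)?id=(\d+)(?:&.*)?$"),
]

# Assumed canonical form; the source does not state which form is chosen.
CANONICAL = "http://www.eggcoo.com/page_product_{goodsid}_0.html"


def purify_goodsid(url: str) -> str:
    """Return one canonical URL for every known link form of the same product."""
    parts = urlsplit(url)
    if parts.netloc != "www.eggcoo.com":
        return url  # rule does not apply to other hosts
    # Match against path plus query, since the id may live in either.
    target = parts.path + ("?" + parts.query if parts.query else "")
    for pattern in GOODSID_PATTERNS:
        match = pattern.match(target)
        if match:
            return CANONICAL.format(goodsid=match.group(1))
    return url  # unknown form: leave unchanged


a = purify_goodsid("http://www.eggcoo.com/page_product_527393_0.html")
b = purify_goodsid("http://www.eggcoo.com/product.shtml?method=detailView&id=527393&cv=0")
print(a == b)
```

With both link forms reduced to the same string, a crawler's URL scheduler can store and fetch the product page once.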
[0071] As another example, some large, long-established B2C sites are redesigned from time to time, which produces the same situation:
[0072] http://www.amazon.cn/gp/product/B0019DBU60?ver=gp&uid=476-6816060-6082564&pageletid=taiwan (from the list page)
[0073] http://www.amazon.cn/mn/detailApp/ref=sr_1_1?_encodin...
Embodiment 2
[0089] Embodiment 2: the command word truncate. This command word applies when a URL is followed by additional information. Many websites now append extra parameters to the URL to mark the traffic source or to gather statistics. This form is common and easy to handle, for example:
[0090] http://www.vancl.com/Product_0006984/BaiHeHuaLianYiQun%20HongSeYinHua.html?Source=eqf&SourceSunInfo=96845|yqftid_12783880711284196186
[0091] To purify this kind of URL, use the command word truncate: enclose everything that must be preserved in a group (a pair of parentheses) and return only the grouped result, as in the following rule:
[0092] {"www.vancl.com", "^(/Product_[0-9]+/[\w]+\.html).*.*$", "truncate", null}
[0093] When the above link is encountered, applying this rule returns:
[0094] http://www.vancl.com/Product_0006984/BaiHeHuaLianYiQun%20HongSeYinHua.html
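A minimal Python sketch of how a truncate rule of this shape might be applied is given below. The function name apply_truncate and the tuple layout are assumptions for illustration. One further assumption is labeled in the code: the character class is widened from [\w]+ to [\w%]+, because \w alone does not match the % in the percent-encoded sample path.

```python
import re
from urllib.parse import urlsplit

# Mirrors the {host, pattern, command, argument} layout of the rule above.
# Assumption: [\w]+ is widened to [\w%]+ so the "%20" in the sample path matches.
RULE = ("www.vancl.com", r"^(/Product_[0-9]+/[\w%]+\.html).*$", "truncate", None)


def apply_truncate(url: str, rule=RULE) -> str:
    """Keep only the part of the URL captured by group 1 of the rule's pattern."""
    host, pattern, command, _argument = rule
    parts = urlsplit(url)
    if command != "truncate" or parts.netloc != host:
        return url  # rule does not apply
    # The pattern is matched against everything after the host.
    target = parts.path + ("?" + parts.query if parts.query else "")
    match = re.match(pattern, target)
    if match is None:
        return url  # no match: leave the URL unchanged
    return f"{parts.scheme}://{parts.netloc}{match.group(1)}"


raw = ("http://www.vancl.com/Product_0006984/"
       "BaiHeHuaLianYiQun%20HongSeYinHua.html"
       "?Source=eqf&SourceSunInfo=96845|yqftid_12783880711284196186")
print(apply_truncate(raw))
```

Returning only the captured group discards the tracking parameters while leaving the product path untouched.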
Embodiment 3
[0095] Embodiment 3: the command word is a grouping command, which applies to websites whose URLs are not case-sensitive. Such a site serves the same page regardless of the case of the URL, but to a crawler the uppercase and lowercase URLs are different links. In this case, the grouping command can be used to convert a given group uniformly from uppercase to lowercase, or from lowercase to uppercase.
[0096] For example, the URLs of Dangdang.com:
[0097] http://product.dangdang.com/product.aspx?product_id=22799821
[0098] http://product.dangdang.com/Product.aspx?product_id=22799821
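The case conversion of a single group can be sketched in Python as below. This is a hedged illustration only: the pattern, the function name lowercase_group, and the choice to convert to lowercase rather than uppercase are assumptions, since the rule for this embodiment is not shown here.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern: group 1 is the page name whose case the site ignores,
# group 2 is the query string, which is kept exactly as written.
PATTERN = re.compile(r"^/([A-Za-z]+\.aspx)(\?product_id=[0-9]+)$")


def lowercase_group(url: str, host: str = "product.dangdang.com") -> str:
    """Lowercase only group 1 so case variants collapse to one link."""
    parts = urlsplit(url)
    if parts.netloc != host:
        return url  # rule does not apply to other hosts
    target = parts.path + ("?" + parts.query if parts.query else "")
    match = PATTERN.match(target)
    if match is None:
        return url  # unknown form: leave unchanged
    page, query = match.group(1).lower(), match.group(2)
    return f"{parts.scheme}://{parts.netloc}/{page}{query}"


a = lowercase_group("http://product.dangdang.com/product.aspx?product_id=22799821")
b = lowercase_group("http://product.dangdang.com/Product.aspx?product_id=22799821")
print(a == b)
```

Restricting the conversion to one group, rather than lowercasing the whole URL, avoids altering parts such as query values that may be case-sensitive.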