Method for carrying out data duplicate record cleaning on URL (Uniform Resource Locator)
A duplicate-record cleaning technology applied in the field of big data. It addresses the problems of large network data volume, complex capture methods, and time-consuming processing, with the effects of improving storage speed, simplifying the duplicate-data cleaning method, and increasing collection and capture speed.
[0030] A method for cleaning duplicate data records for URLs, including:
[0031] Step 1: crawl webpage content from the Internet with a web crawler and extract the required attribute content;
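The attribute extraction in Step 1 can be sketched with the standard-library HTML parser. This is a minimal illustration, not the patent's implementation; the choice of "title" and hyperlinks as the required attributes is an assumption for the example.

```python
from html.parser import HTMLParser


class AttributeExtractor(HTMLParser):
    """Collects the page title and all hyperlink targets from crawled HTML."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_attributes(page_html):
    """Return the attribute content (title, outgoing links) of one crawled page."""
    parser = AttributeExtractor()
    parser.feed(page_html)
    return {"title": "".join(parser.title_parts).strip(), "links": parser.links}
```

In a real crawler the `page_html` argument would come from an HTTP fetch of each URL taken off the URL queue described in Step 2.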
[0032] Step 2: supply the crawler, through the URL queue, with the URLs of the websites whose data is to be captured. First, preprocess these URLs; then detect duplicate records through field matching and record matching. At the database level, the duplicate-record detection algorithm clusters the duplicate records across the entire data set, then merges or deletes the duplicates detected within each duplicate-record cluster according to rules, keeping only the correct record;
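The preprocessing and field-matching steps above can be sketched as follows. This is an assumed illustration: URL normalization stands in for "preprocessing", and grouping records by normalized key fields stands in for field/record matching; the rule for resolving each cluster (keep the first record) is likewise an assumption, since the source does not specify the merge rules.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def normalize_url(url):
    """Preprocess a URL so superficially different duplicates compare equal:
    lowercase scheme and host, sort query parameters, drop the fragment."""
    parts = urlsplit(url.strip())
    query = urlencode(sorted(parse_qsl(parts.query)))
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))


def cluster_duplicates(records, key_fields):
    """Field matching: records whose normalized key fields agree fall into
    the same duplicate-record cluster."""
    clusters = {}
    for rec in records:
        key = tuple(
            normalize_url(rec[f]) if f == "url" else rec[f].strip().lower()
            for f in key_fields
        )
        clusters.setdefault(key, []).append(rec)
    return clusters


def clean(records, key_fields=("url",)):
    """Keep one record per duplicate cluster (here, the first one seen)."""
    return [group[0] for group in cluster_duplicates(records, key_fields).values()]
```

For example, `HTTP://Example.com/a?b=2&a=1` and `http://example.com/a?a=1&b=2` normalize to the same key, so only one of the two records survives cleaning.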
[0033] Step 3: process the content captured by the crawler through the data processing module;
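The source does not detail what the data processing module does, so the sketch below assumes typical post-capture cleanup: unescaping HTML entities, collapsing whitespace, and dropping records that lack a mandatory field. The mandatory `"title"` field is an assumption for illustration.

```python
import html
import re


def process_record(rec):
    """Hypothetical data processing module: normalize each captured field and
    discard records missing the (assumed) mandatory 'title' field."""
    cleaned = {}
    for field, value in rec.items():
        text = html.unescape(value)          # decode entities such as &amp;
        cleaned[field] = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return cleaned if cleaned.get("title") else None
```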
[0034] Step 4: store the URL information of the websites whose data is to be captured, the data the crawler extracts from the webpages, and the data processed by th...
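A minimal storage sketch for Step 4, assuming a relational store (SQLite here) with one row per URL; the table name, columns, and the choice of the URL as primary key are assumptions, not details from the source. Making the URL the primary key means re-storing a page replaces its row rather than creating a duplicate.

```python
import sqlite3


def init_store(conn):
    """Create a table holding the three kinds of data named in Step 4."""
    conn.execute("""CREATE TABLE IF NOT EXISTS pages (
        url TEXT PRIMARY KEY,   -- URL information of the target website
        raw_data TEXT,          -- data the crawler extracted from the webpage
        processed_data TEXT     -- output of the data processing module
    )""")


def store(conn, url, raw_data, processed_data):
    # INSERT OR REPLACE keeps the table free of duplicate URL rows.
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
                 (url, raw_data, processed_data))
```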