
Method for carrying out data duplicate record cleaning on URL (Uniform Resource Locator)

A data de-duplication and duplicate-record cleaning technology, applied in the field of big data. It addresses the problems that network data is voluminous, that existing capture methods are complex, and that capture is time-consuming, with the effects of improving storage speed, simplifying the duplicate-data cleaning method, and improving the collection and capture speed of network information.

Status: Inactive · Publication Date: 2018-03-06
ANHUI KECHUANG INTELLIGENT INTPROP SERVICE CO LTD
Cites: 3 · Cited by: 0

AI Technical Summary

Problems solved by technology

[0010] The technical problem to be solved by the present invention is that network data is large in quantity and messy in content, and that existing big-data collection techniques capture network information in a complicated and time-consuming way. The purpose of the invention is to provide a method for cleaning duplicate data records for URLs, so as to improve the speed of crawling and storing network information.


Examples


Embodiment

[0030] A method for cleaning duplicate data records for URLs, including:

[0031] Step 1, crawling webpage content from the Internet through a web crawler, and extracting required attribute content;

[0032] Step 2, providing the crawler, through a URL queue, with the URLs whose data is to be captured. The URLs are first preprocessed; duplicate-record detection is then performed through field matching and record matching. At the database level, the duplicate-record detection algorithm clusters the duplicate records in the entire data set, then merges or deletes the duplicates found within the same duplicate-record cluster according to rules, keeping only the correct record (a minimal sketch of this step follows step 4);

[0033] Step 3, processing the content captured by the crawler through a data processing module;

[0034] Step 4, storing, through a data storage module, the URL information of the websites whose data is to be captured, the data the crawler extracts from webpages, and the data processed by the data processing module.
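
The patent text does not pin down the field-matching rule, the clustering method, or how the "correct" record is chosen, so the following is only a minimal Python sketch of step 2 under stated assumptions: records are flat dicts, field similarity is approximated with difflib.SequenceMatcher, the 0.8 threshold and the greedy single-pass clustering are illustrative choices, and "keep the first record seen" stands in for the unspecified merge rule.

```python
# Minimal sketch of step 2: URL preprocessing, then duplicate-record
# detection by field matching and record matching, clustering over the
# data set, and keeping one record per duplicate cluster.
# Field names, the similarity measure, and the 0.8 threshold are
# assumptions for illustration, not taken from the patent.
from difflib import SequenceMatcher
from urllib.parse import urlsplit, urlunsplit

def preprocess_url(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/") or "/", parts.query, ""))

def records_match(a: dict, b: dict, threshold: float = 0.8) -> bool:
    """Record matching: every field shared by both records must be similar."""
    shared = set(a) & set(b)
    return bool(shared) and all(
        SequenceMatcher(None, str(a[f]), str(b[f])).ratio() >= threshold
        for f in shared)

def cluster_duplicates(records: list[dict]) -> list[list[dict]]:
    """Greedy clustering: each record joins the first cluster it matches."""
    clusters: list[list[dict]] = []
    for rec in records:
        for cluster in clusters:
            if records_match(rec, cluster[0]):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

def clean(records: list[dict]) -> list[dict]:
    """Keep one representative per duplicate cluster (here: the first seen)."""
    for rec in records:
        rec["url"] = preprocess_url(rec["url"])
    return [cluster[0] for cluster in cluster_duplicates(records)]

if __name__ == "__main__":
    rows = [
        {"url": "http://Example.com/a/", "title": "Big data overview"},
        {"url": "http://example.com/a",  "title": "Big Data overview"},
        {"url": "http://example.com/b",  "title": "URL queues"},
    ]
    print(clean(rows))  # the two /a records collapse into one; /b survives
```

The greedy pass is O(n²) in the worst case; a production version would first block records on a cheap key (for example the normalized host name) and run pairwise field matching only within each block.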


Abstract

The invention discloses a method for cleaning duplicate data records for a URL (Uniform Resource Locator). The method comprises the following steps: step 1, crawling webpage content from the Internet through web crawlers and extracting the required attribute content; step 2, providing the crawlers, through a URL queue, with the URLs of the data websites to be crawled; step 3, processing the content crawled by the crawlers through a data processing module; step 4, storing, through a data storage module, the URL information of the data websites to be crawled, the data the crawlers extract from webpages, and the data processed by the data processing module. According to the method disclosed by the invention, data information is acquired from a website through web crawlers or through the website's published API (Application Program Interface); unstructured data can be extracted from webpages, saved as uniform local data files, and stored in a structured manner; the collection of pictures, audio, and video is supported; and attachments can be automatically associated with their texts, so that the collection and crawling speed of network information is improved, and the storage speed after information is crawled is improved as well.
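
To make the four steps concrete, here is a compact, self-contained sketch of the pipeline using only the Python standard library. The choice of the page title as the single extracted attribute, the in-memory `seen` set standing in for the cleaned URL queue, and the list standing in for the data storage module are all assumptions for illustration, not details from the patent:

```python
# Illustrative sketch of the four-step pipeline: crawl (step 1), URL queue
# with duplicate skipping (step 2), processing (step 3), storage (step 4).
# Link extraction and persistent storage are omitted for brevity.
import json
from collections import deque
from html.parser import HTMLParser
from urllib.request import urlopen

class TitleExtractor(HTMLParser):
    """Step 1 helper: pull one required attribute (the page title)."""
    def __init__(self):
        super().__init__()
        self.in_title, self.title = False, ""
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

def crawl(seed_urls, limit=10):
    queue, seen, store = deque(seed_urls), set(), []
    while queue and len(store) < limit:
        url = queue.popleft()
        if url in seen:                 # step 2: skip duplicate URL records
            continue
        seen.add(url)
        html = urlopen(url).read().decode("utf-8", "replace")  # step 1: crawl
        parser = TitleExtractor()
        parser.feed(html)
        record = {"url": url, "title": parser.title.strip()}   # step 3: process
        store.append(record)            # step 4: store structured records
    return store

if __name__ == "__main__":
    print(json.dumps(crawl(["http://example.com/"]), indent=2))
```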

Description

Technical field

[0001] The invention relates to the field of big data, and in particular to a method for cleaning duplicate data records for URLs.

Background technique

[0002] Similar terms have appeared over the history of data's development, including ultra-large-scale data and massive data. "Ultra-large-scale" generally refers to data at the GB level (1GB=1024MB), "massive" generally refers to data at the TB level (1TB=1024GB), and today's "big data" refers to data at the PB (1PB=1024TB), EB (1EB=1024PB), or even ZB (1ZB=1024EB) level and above. In 2013, Gartner predicted that the data stored worldwide would reach 1.2ZB. If these data were recorded on CD-R discs and the discs stacked up, the pile would be five times as tall as the distance from the Earth to the Moon. Behind the different scales lie different technical problems and challenging research questions.

[0003] Big data refers to a collection of data that cannot be captured, managed and processed by conventional software tools with...
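
The CD-R comparison checks out with rough arithmetic. A sanity-check sketch, assuming a 700 MB capacity and 1.2 mm thickness per disc, decimal byte units, and a mean Earth-Moon distance of about 384,400 km (none of these figures appear in the patent text):

```python
# Back-of-the-envelope check of the "five times the Earth-Moon distance"
# claim for 1.2 ZB stored on CD-Rs. Disc capacity, disc thickness, and the
# Earth-Moon distance are assumed values, not taken from the patent.
total_bytes = 1.2e21            # 1.2 ZB, taking decimal zettabytes
disc_bytes = 700e6              # ~700 MB per CD-R
disc_thickness_m = 1.2e-3       # ~1.2 mm per disc
earth_moon_m = 384_400e3        # mean Earth-Moon distance

discs = total_bytes / disc_bytes            # ~1.7e12 discs
stack_m = discs * disc_thickness_m          # ~2.1e9 m of stacked discs
print(stack_m / earth_moon_m)               # ~5.4 Earth-Moon distances
```

The result is roughly five Earth-Moon distances, consistent with the figure quoted above.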


Application Information

Patent Type & Authority Applications(China)
IPC IPC(8): G06F17/30
Inventor 石文威
Owner ANHUI KECHUANG INTELLIGENT INTPROP SERVICE CO LTD