
Method for carrying out data duplicate record cleaning on URL (Uniform Resource Locator)

A data de-duplication and duplicate-record cleaning technology, applied in the field of big data. It addresses the problems that network data is voluminous, that existing capture methods are complex, and that capture is time-consuming, with the effects of improving storage speed, simplifying the duplicate-data cleaning method, and improving the collection and capture speed of network information.

Status: Inactive · Publication Date: 2018-03-06
ANHUI KECHUANG INTELLIGENT INTPROP SERVICE CO LTD
Cites: 3 · Cited by: 0

AI Technical Summary

Problems solved by technology

[0010] The technical problem to be solved by the present invention is that network data is large in quantity and messy in content, and that existing big-data collection techniques capture network information in a complicated and time-consuming way. The purpose of the invention is to provide a method for cleaning duplicate data records for URLs, so as to improve the speed of crawling and storing network information.


Examples


Embodiment

[0030] A method for cleaning duplicate data records for URLs, including:

[0031] Step 1, crawling webpage content from the Internet through a web crawler, and extracting required attribute content;

[0032] Step 2, providing the crawler, through a URL queue, with the URLs whose data is to be captured. The URLs are first preprocessed; duplicate-record detection is then performed through field matching and record matching. At the database level, the duplicate-record detection algorithm clusters the duplicate records in the entire data set, then merges or deletes the duplicates found within the same duplicate-record cluster according to rules, keeping only the correct record (a minimal sketch of this step follows step 4);

[0033] Step 3, processing the content captured by the crawler through a data processing module;

[0034] Step 4, storing, through a data storage module, the URL information of the websites whose data is to be captured, the data the crawler extracts from webpages, and the data processed by the data processing module.
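
The patent text does not pin down the field-matching rule, the clustering method, or how the "correct" record is chosen, so the following is only a minimal Python sketch of step 2 under stated assumptions: records are flat dicts, field similarity is approximated with difflib.SequenceMatcher, the 0.8 threshold and the greedy single-pass clustering are illustrative choices, and "keep the first record seen" stands in for the unspecified merge rule.

```python
# Minimal sketch of step 2: URL preprocessing, then duplicate-record
# detection by field matching and record matching, clustering over the
# data set, and keeping one record per duplicate cluster.
# Field names, the similarity measure, and the 0.8 threshold are
# assumptions for illustration, not taken from the patent.
from difflib import SequenceMatcher
from urllib.parse import urlsplit, urlunsplit

def preprocess_url(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/") or "/", parts.query, ""))

def records_match(a: dict, b: dict, threshold: float = 0.8) -> bool:
    """Record matching: every field shared by both records must be similar."""
    shared = set(a) & set(b)
    return bool(shared) and all(
        SequenceMatcher(None, str(a[f]), str(b[f])).ratio() >= threshold
        for f in shared)

def cluster_duplicates(records: list[dict]) -> list[list[dict]]:
    """Greedy clustering: each record joins the first cluster it matches."""
    clusters: list[list[dict]] = []
    for rec in records:
        for cluster in clusters:
            if records_match(rec, cluster[0]):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

def clean(records: list[dict]) -> list[dict]:
    """Keep one representative per duplicate cluster (here: the first seen)."""
    for rec in records:
        rec["url"] = preprocess_url(rec["url"])
    return [cluster[0] for cluster in cluster_duplicates(records)]

if __name__ == "__main__":
    rows = [
        {"url": "http://Example.com/a/", "title": "Big data overview"},
        {"url": "http://example.com/a",  "title": "Big Data overview"},
        {"url": "http://example.com/b",  "title": "URL queues"},
    ]
    print(clean(rows))  # the two /a records collapse into one; /b survives
```

The greedy pass is O(n²) in the worst case; a production version would first block records on a cheap key (for example the normalized host name) and run pairwise field matching only within each block.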


Abstract

The invention discloses a method for cleaning duplicate data records for a URL (Uniform Resource Locator). The method comprises the following steps: step 1, crawling webpage content from the Internet through web crawlers and extracting the required attribute content; step 2, providing the crawlers, through a URL queue, with the URLs of the data websites to be crawled; step 3, processing the content crawled by the crawlers through a data processing module; step 4, storing, through a data storage module, the URL information of the data websites to be crawled, the data the crawlers extract from webpages, and the data processed by the data processing module. According to the method disclosed by the invention, data information is acquired from a website through web crawlers or through the website's published API (Application Program Interface); unstructured data can be extracted from webpages, saved as uniform local data files, and stored in a structured manner; the collection of pictures, audio, and video is supported; and attachments can be automatically associated with their texts, so that the collection and crawling speed of network information is improved, and the storage speed after information is crawled is improved as well.
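
To make the four steps concrete, here is a compact, self-contained sketch of the pipeline using only the Python standard library. The choice of the page title as the single extracted attribute, the in-memory `seen` set standing in for the cleaned URL queue, and the list standing in for the data storage module are all assumptions for illustration, not details from the patent:

```python
# Illustrative sketch of the four-step pipeline: crawl (step 1), URL queue
# with duplicate skipping (step 2), processing (step 3), storage (step 4).
# Link extraction and persistent storage are omitted for brevity.
import json
from collections import deque
from html.parser import HTMLParser
from urllib.request import urlopen

class TitleExtractor(HTMLParser):
    """Step 1 helper: pull one required attribute (the page title)."""
    def __init__(self):
        super().__init__()
        self.in_title, self.title = False, ""
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

def crawl(seed_urls, limit=10):
    queue, seen, store = deque(seed_urls), set(), []
    while queue and len(store) < limit:
        url = queue.popleft()
        if url in seen:                 # step 2: skip duplicate URL records
            continue
        seen.add(url)
        html = urlopen(url).read().decode("utf-8", "replace")  # step 1: crawl
        parser = TitleExtractor()
        parser.feed(html)
        record = {"url": url, "title": parser.title.strip()}   # step 3: process
        store.append(record)            # step 4: store structured records
    return store

if __name__ == "__main__":
    print(json.dumps(crawl(["http://example.com/"]), indent=2))
```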

Description

Technical field

[0001] The invention relates to the field of big data, and in particular to a method for cleaning duplicate data records for URLs.

Background technique

[0002] Similar terms have appeared over the history of data's development, including ultra-large-scale data and massive data. "Ultra-large-scale" generally refers to data at the GB level (1GB=1024MB), "massive" generally refers to data at the TB level (1TB=1024GB), and today's "big data" refers to data at the PB (1PB=1024TB), EB (1EB=1024PB), or even ZB (1ZB=1024EB) level and above. In 2013, Gartner predicted that the data stored worldwide would reach 1.2ZB. If these data were recorded on CD-R discs and the discs stacked up, the pile would be five times as tall as the distance from the Earth to the Moon. Behind the different scales lie different technical problems and challenging research questions.

[0003] Big data refers to a collection of data that cannot be captured, managed and processed by conventional software tools with...
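
The CD-R comparison checks out with rough arithmetic. A sanity-check sketch, assuming a 700 MB capacity and 1.2 mm thickness per disc, decimal byte units, and a mean Earth-Moon distance of about 384,400 km (none of these figures appear in the patent text):

```python
# Back-of-the-envelope check of the "five times the Earth-Moon distance"
# claim for 1.2 ZB stored on CD-Rs. Disc capacity, disc thickness, and the
# Earth-Moon distance are assumed values, not taken from the patent.
total_bytes = 1.2e21            # 1.2 ZB, taking decimal zettabytes
disc_bytes = 700e6              # ~700 MB per CD-R
disc_thickness_m = 1.2e-3       # ~1.2 mm per disc
earth_moon_m = 384_400e3        # mean Earth-Moon distance

discs = total_bytes / disc_bytes            # ~1.7e12 discs
stack_m = discs * disc_thickness_m          # ~2.1e9 m of stacked discs
print(stack_m / earth_moon_m)               # ~5.4 Earth-Moon distances
```

The result is roughly five Earth-Moon distances, consistent with the figure quoted above.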


Application Information

Patent Type & Authority Applications(China)
IPC IPC(8): G06F17/30
Inventor 石文威
Owner ANHUI KECHUANG INTELLIGENT INTPROP SERVICE CO LTD