
Two-stage online sampling method based on the MapReduce model

A two-stage sampling technique based on the MapReduce model, applied in the field of online data sampling. It addresses the loss of unbiasedness in the estimation algorithm and the resulting loss of accuracy in the estimates: it ensures the randomness of the samples, eliminates the bias introduced by data skew, and ensures that query estimates are unbiased and effective.

Active Publication Date: 2020-06-02
SICHUAN XW BANK CO LTD
8 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

At any point during query processing, data blocks with small aggregate values are more likely to appear in the observed sample set. The samples therefore cannot be treated as independent and identically distributed random variables, which compromises the unbiasedness of the estimation algorithm and the accuracy of the estimates.




Embodiment Construction

[0027] As shown in Figure 1, the two-stage online sampling method of the present invention, based on the MapReduce model, comprises:

[0028] A. First-stage sampling: when the MapReduce model receives input data from the upstream data node and initializes, a cluster sampler is set up on the map side before online processing begins. Each data block is treated as one cluster, and sampling is performed with the data block as the sampling unit. The cluster sampler maintains a data block random queue for each data table; a data block random queue contains data blocks corresponding to multiple data tables, and each data block random queue corresponds to one mapper. The order of all data blocks in a data block random queue is randomized, and the map side designates a mapper each time scheduling occurs. When requesting input data from the upstream data node, the mapper iteratively selects from its corresponding data block random queue and returns the data block at the...
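The first-stage sampler described in [0028] can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the class name `ClusterSampler`, the method `next_block`, and the mapper and block identifiers are all hypothetical. Because the source text is truncated at the point where it says which block is returned, taking the block at the head of the queue is an assumption.

```python
import random
from collections import deque


class ClusterSampler:
    """Hypothetical sketch: one randomized queue of data blocks per mapper."""

    def __init__(self, blocks_by_mapper, seed=None):
        rng = random.Random(seed)
        self.queues = {}
        for mapper_id, blocks in blocks_by_mapper.items():
            shuffled = list(blocks)
            rng.shuffle(shuffled)  # randomize block order once, up front
            self.queues[mapper_id] = deque(shuffled)

    def next_block(self, mapper_id):
        """Return the next block for this mapper, or None when exhausted.

        Assumption: the block at the head of the queue is returned.
        """
        queue = self.queues[mapper_id]
        return queue.popleft() if queue else None


sampler = ClusterSampler({"mapper-0": ["b1", "b2", "b3"]}, seed=0)
drained = []
while (block := sampler.next_block("mapper-0")) is not None:
    drained.append(block)
```

Each block is handed out exactly once, in an order fixed by the shuffle, so the data block (not the individual record) is the unit of sampling.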



Abstract

The invention relates to a two-stage online sampling method based on the MapReduce model. The method comprises the following steps: step 1, first-stage sampling: a cluster sampler is set up on the map side before the MapReduce model begins online processing, and sampling is performed with the data block as the sampling unit; step 2, in the query stage of the MapReduce model, an estimate of the query result is obtained, and the width of the confidence interval is calculated for a given confidence level; step 3, second-stage sampling: before the reduce side starts processing, an acceptance-rejection sampler corrects the probability with which each data block is drawn by the reduce side; step 4, the discarded map-side output is aggregated in a recycle bin on the reduce side and added to the snapshot result obtained from the accepted data blocks, yielding the actual result of the aggregation query. The method ensures the randomness of the sample without increasing network transmission cost, provides effective statistical estimation, eliminates the bias that data skew introduces into the statistical estimate, and ensures that query estimates are unbiased and effective.
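The second-stage sampling and the recycle bin can be illustrated with a generic acceptance-rejection step. The abstract does not give the acceptance rule, so this sketch assumes the textbook form: a block that arrives with probability p is accepted with probability p_min / p, which equalizes the probability of being kept across blocks. The function names, the `arrival_prob` mapping, and the sample values are hypothetical.

```python
import random


def accept_reject(blocks, arrival_prob, rng):
    """Split map-side outputs into accepted blocks and a recycle bin.

    blocks: list of (block_id, partial_sum) pairs.
    arrival_prob: assumed probability that each block reaches the reducer early.
    Assumption: accept with probability p_min / p (generic acceptance-rejection).
    """
    p_min = min(arrival_prob[block_id] for block_id, _ in blocks)
    accepted, recycle_bin = [], []
    for block_id, partial_sum in blocks:
        if rng.random() < p_min / arrival_prob[block_id]:
            accepted.append((block_id, partial_sum))
        else:
            # Rejected output is kept, not dropped, so the exact answer
            # can still be assembled at the end.
            recycle_bin.append((block_id, partial_sum))
    return accepted, recycle_bin


def exact_result(accepted, recycle_bin):
    """Accepted snapshot plus recycled output gives the full aggregate."""
    return sum(s for _, s in accepted) + sum(s for _, s in recycle_bin)


blocks = [("a", 1.0), ("b", 5.0), ("c", 10.0)]
arrival_prob = {"a": 0.6, "b": 0.3, "c": 0.1}  # small blocks arrive more often
accepted, recycle_bin = accept_reject(blocks, arrival_prob, random.Random(1))
total = exact_result(accepted, recycle_bin)
```

Only the accepted blocks would feed the running estimate; the recycle bin exists so that no map-side work is wasted when the final, exact result is computed.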

Description

Technical field

[0001] The invention relates to an online data sampling method, specifically a two-stage online sampling method based on the MapReduce model.

Background

[0002] With the development of information digitization, the volume of global data has grown explosively, and data mining and data analysis based on big data have become a focus of attention in many fields. Online aggregation (OLA) provides a way to quickly return approximate results based on sample data, meeting the requirements of real-time processing and fast user interaction. Compared with offline batch processing, online aggregation can return an estimated result, together with its confidence interval at a given confidence level, within a short time during query processing. It continues to return approximate results as processing proceeds, and the quality of the estimates continues to improve as more...
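As a rough illustration of how online aggregation reports an estimate that tightens over time, the sketch below keeps a running mean and a confidence-interval half-width. The estimator is an assumption, since the source does not specify one: it uses Welford's running variance and a normal approximation, which is only valid when the samples are independent and identically distributed. The class name `RunningEstimate` and the sample values are hypothetical.

```python
import math
from statistics import NormalDist


class RunningEstimate:
    """Hypothetical running mean with a normal-approximation confidence interval."""

    def __init__(self, confidence=0.95):
        # Two-sided critical value for the requested confidence level.
        self.z = NormalDist().inv_cdf(0.5 + confidence / 2)
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford)

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def snapshot(self):
        """Return (estimate, half_width); half_width is inf with < 2 samples."""
        if self.n < 2:
            return self.mean, math.inf
        variance = self.m2 / (self.n - 1)
        return self.mean, self.z * math.sqrt(variance / self.n)


est = RunningEstimate(confidence=0.95)
for value in [1.0, 2.0, 3.0, 4.0]:
    est.update(value)
early_mean, early_width = est.snapshot()

for value in [1.0, 2.0, 3.0, 4.0] * 3:
    est.update(value)
late_mean, late_width = est.snapshot()
```

The half-width shrinks as more samples arrive, which is the behavior the background section describes. If early samples are biased toward particular blocks, the i.i.d. assumption fails and the interval is no longer trustworthy.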


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/2458
CPC: G06F16/2462, G06F16/2471
Inventor: 谭皓予
Owner: SICHUAN XW BANK CO LTD