Network data automatic cleaning method and system based on webpage label distribution characteristics
The invention belongs to the field of data cleaning and uses the distribution characteristics of webpage labels (HTML tags) to clean network data automatically. It addresses problems such as slow feedback of results, high cost, and inconsistent updates of official account templates, and it improves both the efficiency and the accuracy of cleaning.
Embodiment
[0031] Referring to Figures 1-3, the present invention provides a technical solution: a method for automatically cleaning network data based on the distribution characteristics of webpage labels, characterized in that it comprises the following steps:
[0032] Step 1: Use the offline crawler system to crawl network news data:
[0033] That is, the crawler collection system collects articles and network news data according to the list page principle, yielding offline news data;
[0034] Step 2: Parse the tree nodes of the crawled offline news data, and extract node information such as tag names, attributes, text, and links;
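The node extraction in Step 2 could be sketched as follows. This is a minimal illustration using Python's standard `html.parser` module, not the patent's implementation; the names `NodeExtractor`, `extract_nodes`, and `VOID_TAGS`, and the choice to record a `depth` field, are assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (assumed subset, for illustration only).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class NodeExtractor(HTMLParser):
    """Collects one record per element: tag name, attributes, text, link, depth."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements
        self.nodes = []   # all elements in document order

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        node = {
            "tag": tag,
            "attrs": attr_map,
            "text": "",
            "link": attr_map.get("href"),
            "depth": len(self.stack),
        }
        self.nodes.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name (tolerates sloppy HTML).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            node = self.stack[-1]
            # Text is attributed to the innermost open element.
            node["text"] = (node["text"] + " " + text).strip()


def extract_nodes(html):
    parser = NodeExtractor()
    parser.feed(html)
    parser.close()
    return parser.nodes
```

Each record keeps only the direct text of its element, so the text of a nested link is stored on the link node rather than on its parent paragraph.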
[0035] Step 3: Following the idea of n-gram2vec, predict the information of other node blocks from the current node, and obtain word embeddings for the labels through training:
[0036] Based on the idea of n-gram2vec, data model training is carried out, and the original text with HTML tags is processed by feature ...
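To make "predict other node blocks from the current node" concrete, the sketch below builds tag n-grams from a sequence of tag names and generates skip-gram style (center, context) pairs from them. The source text is truncated and does not specify how its n-gram2vec training data is built, so this is a generic skip-gram pairing shown under that assumption; the function names `tag_ngrams` and `skipgram_pairs` and the `n` and `window` parameters are hypothetical.

```python
def tag_ngrams(tags, n=2):
    """Join each run of n consecutive tag names into one token, e.g. 'div_h1'."""
    return ["_".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def skipgram_pairs(tokens, window=2):
    """Pair every token with its neighbours within `window` positions.

    Each pair (center, context) is one training example: the model is asked
    to predict the context token given the center token.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Example: tag sequence of a small article block, in document order.
tags = ["div", "h1", "p", "a"]
tokens = tag_ngrams(tags, n=2)
pairs = skipgram_pairs(tokens, window=1)
```

Pairs like these would then be fed to an embedding trainer, so that tags appearing in similar structural contexts end up with similar vectors.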