Method for automatically extracting BBS (bulletin board system) data
An automatic forum-data extraction technique, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc. It addresses problems such as poor adaptability to changes in web page structure and the inability to automatically extract data from large-scale websites.
Specific embodiments
[0060] Further, in a specific implementation, step c comprises the following sub-steps:
[0061] c1. Establishing a four-dimensional feature vector for each visible character string of the item;
[0062] c2. Dividing the data set according to the feature vectors;
[0063] c3. Assigning meaning to the visible character strings and forming an extraction template.
[0064] The four-dimensional feature vector of step c1 consists of the components F1, F2, F3 and F4, specifically:
[0065] F1: whether the string is a number;
[0066] F2: the length of the string;
[0067] F3: whether the string is in a time format. To judge this, the time expression formats used by the website are collected manually, regular expressions are generated from them, and a matching string is converted into a timestamp according to the matched format;
[0068] F4: whether the string is hyperlink text.
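The four features above could be computed roughly as in the following sketch. It is an illustration under assumptions, not the patent's implementation: the names `Features`, `TIME_PATTERNS`, `parse_time` and `feature_vector` are hypothetical, and the two time formats stand in for the formats that, according to the text, are collected manually for each website.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Assumed example formats; per the text, real ones are collected manually per site.
TIME_PATTERNS = [
    (re.compile(r"^\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{2}$"), "%Y-%m-%d %H:%M"),
    (re.compile(r"^\d{4}-\d{1,2}-\d{1,2}$"), "%Y-%m-%d"),
]


class Features(NamedTuple):
    is_number: bool  # F1: whether the string is a number
    length: int      # F2: length of the string
    is_time: bool    # F3: whether the string matches a known time format
    is_link: bool    # F4: whether the string is hyperlink text


def parse_time(text: str) -> Optional[float]:
    """Return a timestamp if text matches a known time format, else None."""
    for pattern, fmt in TIME_PATTERNS:
        if pattern.match(text):
            try:
                return datetime.strptime(text, fmt).timestamp()
            except ValueError:
                continue  # shape matched, but the date itself is invalid
    return None


def feature_vector(text: str, is_link: bool) -> Features:
    """Build the (F1, F2, F3, F4) vector for one visible character string."""
    s = text.strip()
    return Features(
        is_number=s.isdigit(),
        length=len(s),
        is_time=parse_time(s) is not None,
        is_link=is_link,
    )
```

Here `is_link` is passed in by the caller because whether a string sits inside a hyperlink depends on the surrounding markup, which this sketch does not parse.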
[0069] The feature vectors are put into the path dictionary, the entropy of all strings on all paths is calculated, and the strings with entropy less than 0.4 a...
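A per-path entropy check of this kind might look like the sketch below. Several points are assumptions, since the text does not specify them: the path dictionary is modeled as a mapping from a path string to the list of values observed on that path, the entropy is Shannon entropy with a base-2 logarithm, and the function names are hypothetical. The text is cut off before it says what is done with strings below the 0.4 threshold, so the sketch only identifies the corresponding paths.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def shannon_entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (base 2, an assumption) of the observed values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(
        (c / total) * math.log2(total / c) for c in counts.values()
    )


def low_entropy_paths(
    path_dict: Dict[str, List[Hashable]], threshold: float = 0.4
) -> List[str]:
    """Return the paths whose values have entropy below the threshold."""
    return [
        path
        for path, values in path_dict.items()
        if shannon_entropy(values) < threshold
    ]


# Hypothetical path dictionary: one path repeats the same label on every
# post, the other carries a different user name each time.
example = {
    "div/span[1]": ["Posted by", "Posted by", "Posted by"],
    "div/span[2]": ["alice", "bob", "carol"],
}
flagged = low_entropy_paths(example)
```

In this example only the path with the constant label falls below the threshold; the values could equally be the feature vectors from step c1, since any hashable value works.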