Webpage data structured analytic method and device

A structured parsing method for web page data, applied in the fields of network data retrieval, electronic digital data processing, and other database retrieval, which addresses the problem of heavy reliance on manual work.

Active Publication Date: 2015-06-10
SHANDONG LANGCHAO YUNTOU INFORMATION TECH CO LTD

AI Technical Summary

Problems solved by technology

Existing parsing approaches are carried out mainly by professional engineers in the field, who must spend considerable effort discovering the relevant patterns or rules, so the degree of manual dependence is high.


Examples

Embodiment approach

[0041] Embodiment 1 of the present invention provides a method and device for structured parsing of web page data. As shown in Figure 1, in one embodiment the method includes the following steps:

[0042] Step S110: Collect a plurality of template web pages of the same type in a given field, extract the body text from the collected pages, parse the pages into structured data according to preset rules, and use the extracted text together with the corresponding parsed data as the training corpus.

[0043] Step S111: Collect multiple template web pages of various types in the field, and obtain from them the structured item names and the various aliases those items take on different web pages.

[0044] Step S112: Train a parsing model on the training corpus.

[0045] A parsing model θ(N, M, A, B, p, q) is constructed, and the model is described as follows:

[0046] N: the number of states. Let the state set be S = {s_1, s_2, ..., s_N}, corresponding to the tags (Tag) of the items to be extracted ...
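As an illustration of how the parameters of such a model could be estimated from a tagged training corpus, here is a minimal sketch that uses smoothed relative-frequency counting. It rests on several assumptions that the visible excerpt does not confirm: A is taken to be the state transition matrix, B the emission matrix, and p the initial state distribution, as in a conventional hidden Markov model. The parameter q is not defined in the visible text and is left out. The function name `train_hmm` and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def train_hmm(tagged_sequences, smoothing=1.0):
    """Estimate HMM parameters by add-k smoothed counting.

    tagged_sequences: list of sequences, each a list of (token, tag) pairs.
    Returns (states, vocab, A, B, p), where A[s][t] is the probability of
    moving from state s to state t, B[s][o] is the probability of emitting
    token o in state s, and p[s] is the probability of starting in state s.
    The interpretation of A, B and p is an assumption (standard HMM usage).
    """
    states = sorted({tag for seq in tagged_sequences for _, tag in seq})
    vocab = sorted({tok for seq in tagged_sequences for tok, _ in seq})

    init = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for seq in tagged_sequences:
        if not seq:
            continue
        init[seq[0][1]] += 1                      # first tag of the sequence
        for tok, tag in seq:
            emit[tag][tok] += 1                   # token observed under tag
        for (_, prev), (_, cur) in zip(seq, seq[1:]):
            trans[prev][cur] += 1                 # tag-to-tag transition

    def normalize(counts, keys):
        # Add-k smoothing keeps every probability strictly positive.
        total = sum(counts[k] for k in keys) + smoothing * len(keys)
        return {k: (counts[k] + smoothing) / total for k in keys}

    p = normalize(init, states)
    A = {s: normalize(trans[s], states) for s in states}
    B = {s: normalize(emit[s], vocab) for s in states}
    return states, vocab, A, B, p


# Hypothetical toy corpus: "O" marks tokens outside any item.
corpus = [[("Name", "O"), ("Alice", "NAME"), ("Age", "O"), ("30", "AGE")]]
states, vocab, A, B, p = train_hmm(corpus)
```

Smoothing is used here so that transitions and tokens never seen in training still receive a small non-zero probability, which keeps later log-probability computations well defined.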

Embodiment 2

[0064] The web page data parsing method provided by Embodiment 2 of the present invention includes the following steps:

[0065] Step S210: For a website in a certain field, collect a certain number of web pages that share the same template. Use ContentExtractor-master to extract the body text of this batch of web pages; use htmlunit to write parsing rules for the pages and obtain the content of the structured items. The valid structured data and the corresponding body text are saved as the training corpus.

[0066] For example, the body text can look like the following table:

[0067]

[0068]

[0069] The corresponding structured parsed text is shown in the following table:

[0070]
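One way the body text and the rule-parsed structured items might be paired into a tagged training corpus is sketched below. This is an assumption about the alignment step, not the patent's stated procedure: each structured value is located in the body text and its tokens are tagged with the field name, while all other tokens receive a generic tag. The function name `label_tokens`, the `"O"` tag, whitespace tokenization, and the sample data are all hypothetical; Chinese text would need a word segmenter instead of `split()`.

```python
def label_tokens(body_text, structured_items, other_tag="O"):
    """Tag each token of body_text with the field whose value contains it.

    body_text: extracted body text of one page.
    structured_items: dict mapping field name -> value parsed by the rules.
    Returns a list of (token, tag) pairs usable as one training sequence.
    """
    tokens = body_text.split()          # assumption: whitespace tokenization
    tags = [other_tag] * len(tokens)
    for field, value in structured_items.items():
        value_tokens = value.split()
        n = len(value_tokens)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            window_free = all(t == other_tag for t in tags[i:i + n])
            if tokens[i:i + n] == value_tokens and window_free:
                tags[i:i + n] = [field] * n
                break                   # tag only the first occurrence
    return list(zip(tokens, tags))


# Hypothetical page content and the items a hand-written rule extracted.
body = "Company Acme Ltd Address 12 Main Street"
items = {"company": "Acme Ltd", "address": "12 Main Street"}
sequence = label_tokens(body, items)
```

Each resulting sequence has the `(token, tag)` shape that a supervised sequence model can be trained on.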

[0071] Step S211: Obtain all names under which the parsing model's hidden state "field name" (that is, the name of a structured item to be parsed out) may appear on different web pages.

[0072] For a website in a certain domain, web page collection is performed to obtain w...
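The visible text breaks off before explaining how the candidate names are found, although the abstract mentions IDFs. The following is therefore only a rough sketch of one plausible use of inverse document frequency: tokens that occur on nearly every collected page have a low IDF and are treated as candidate item names, while page-specific values have a high IDF. The threshold `max_idf`, both function names, and the sample pages are hypothetical.

```python
import math
from collections import Counter


def inverse_document_frequency(pages):
    """Compute idf(token) = ln(number of pages / pages containing token).

    pages: list of token lists, one list per collected web page.
    """
    n_pages = len(pages)
    doc_freq = Counter()
    for tokens in pages:
        doc_freq.update(set(tokens))    # count each token once per page
    return {tok: math.log(n_pages / df) for tok, df in doc_freq.items()}


def candidate_item_names(pages, max_idf=0.5):
    """Return tokens whose IDF is at most max_idf (assumed heuristic)."""
    idf = inverse_document_frequency(pages)
    return sorted(tok for tok, score in idf.items() if score <= max_idf)


# Hypothetical tokenized pages from one template.
pages = [
    ["Address", "Beijing"],
    ["Address", "Jinan"],
    ["Address", "Shanghai"],
]
names = candidate_item_names(pages)
```

In this toy input, "Address" occurs on every page and so has an IDF of zero, while each city name occurs on one page only.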


Abstract

The invention provides a method for structured parsing of web page data. The method comprises: writing a rule-based extraction program for template web pages and using it to obtain a training corpus; acquiring the possible names of the items to be structured by means of IDFs; training a hidden Markov model on the training corpus to determine its parameters; and decoding a web page to be parsed with the hidden Markov model through a relevant algorithm to obtain the final structured data. The invention further provides a device for structured parsing of web page data, comprising a collection module, an acquisition module, a training module and a decoding module. Because the method and device rely on the model's intelligent analysis and self-learning capabilities, domain experts need not be closely involved, manual dependence is low, and the accuracy, performance and efficiency of parsing are greatly improved.
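The abstract does not name the decoding algorithm. The sketch below assumes the Viterbi algorithm, which is the conventional way to find the most probable hidden state sequence of a hidden Markov model; it should be read as an illustration under that assumption, not as the patent's stated procedure. The function name `viterbi`, the `unseen` fallback probability, and the two-state toy model are hypothetical.

```python
import math


def viterbi(observations, states, A, B, p, unseen=1e-6):
    """Return the most probable state sequence for the observations.

    A[r][s]: transition probability r -> s (must be > 0).
    B[s][o]: emission probability of token o in state s; tokens missing
             from B[s] fall back to the small probability `unseen`.
    p[s]:    initial state probability (must be > 0).
    """
    if not observations:
        return []

    def log_emit(s, o):
        return math.log(B[s].get(o, unseen))

    # score[t][s]: best log-probability of any path ending in s at step t.
    score = [{s: math.log(p[s]) + log_emit(s, observations[0]) for s in states}]
    back = []                                   # back[t-1][s]: best predecessor
    for o in observations[1:]:
        prev = score[-1]
        cur, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(A[r][s]))
            cur[s] = prev[best] + math.log(A[best][s]) + log_emit(s, o)
            ptr[s] = best
        score.append(cur)
        back.append(ptr)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Hypothetical two-state model: item labels versus item values.
states = ["LABEL", "VALUE"]
p = {"LABEL": 0.9, "VALUE": 0.1}
A = {"LABEL": {"LABEL": 0.1, "VALUE": 0.9},
     "VALUE": {"LABEL": 0.5, "VALUE": 0.5}}
B = {"LABEL": {"Address": 0.9, "Jinan": 0.1},
     "VALUE": {"Address": 0.1, "Jinan": 0.9}}
tags = viterbi(["Address", "Jinan"], states, A, B, p)
```

Working in log-probabilities avoids numeric underflow on long pages, at the cost of requiring every probability to be strictly positive.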

Description

Technical field

[0001] The invention relates to the technical field of computer applications, in particular to a method and device for structured parsing of web page data.

Background

[0002] With the advent of the big data era, companies around the world have embraced big data with enthusiasm, and big data analysis and processing have emerged in response. Big data processing includes data collection, data storage and integration, data preprocessing, data mining and analysis, and data presentation and application. When enterprises in traditional industries develop big data capabilities, the first problem they face is how to connect internal data with external data, that is, how to obtain Internet data based on the enterprise's internal data. However, the data collected from the Internet are generally unstructured or semi-structured text, pictures, audio, video and so on. How to parse and structure these data will be an essential work to integrate with the data in the...


Application Information

IPC(8): G06F17/30
CPC: G06F16/2246, G06F16/958
Inventor: 范莹, 于治楼, 梁华勇
Owner: SHANDONG LANGCHAO YUNTOU INFORMATION TECH CO LTD