
Internet information object positioning method based on webpage structure semantic meaning

A method for locating Internet information objects, applied in the field of Internet information retrieval. It addresses three problems: semantic technology is applied piecemeal rather than systematically, the scope of semantic concepts is hard to delimit, and both limit in-depth research and wide application of semantic technology. Its effect is to improve precise-search performance.

Active Publication Date: 2012-09-12
FUDAN UNIV
4 Cites, 22 Cited by

AI Technical Summary

Problems solved by technology

At present, although semantic technology has produced much research and many results in information retrieval, search engines, product price comparison, data mining, and similar fields, in most cases it is applied piecemeal rather than systematically. The definition of semantics does not clearly delimit the scope of semantic concepts, and the completeness of semantic structures lacks a theoretical basis, which limits in-depth research and wide application of semantic technology.




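Detailed Description of the Embodiments

[0044] A practical example based on the method of the present invention is an Internet drug monitoring system (as shown in Figure 4).

[0045] I. Calculation example

[0046] This is a calculation example of the structural semantic entropy used in the Internet drug monitoring system; its DOM subtree is shown in Figure 3. As Figure 3 shows, the leaf nodes (text nodes) have already been semantically matched and assigned different semantic roles. Node 4 is a typical region where detailed information aggregates, whereas node 3 is a list of interfering information. From the semantic roles given in the figure, the structural semantic entropy of node 4 can be calculated:

[0047] [formula not reproduced in the source text]

[0048] The structural semantic entropy of list node 3 is:

[0049] [formula not reproduced in the source text]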

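[0050] II. Description of the core algorithms

[0051] 1. Algorithm 1: Semantic matching

[0052] Input: DOM tree D, attribute set L

[0053] Output: the completed list of match information M

[0054] Steps:

[0055] 1) Traverse every node N in the DOM tree D and match it against every attribute P defined in the attribute set L;

[0056] 2) If there is a match, add the match information Ip (the attribute name) to the match list M, then look for the attribute value inside node N. If it is found, add the match information Iv (the attribute value) to M as well; otherwise, look for the attribute value in the next text node after node N.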
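The steps of Algorithm 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the `Node` class, the `attributes` dictionary (attribute name mapped to a list of label strings), the substring-based matching rule, and the restriction to text nodes are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical simplified DOM node; tag "#text" marks a text (leaf) node."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def text_nodes(root):
    """Yield text nodes in document order (depth-first, left to right)."""
    if root.tag == "#text":
        yield root
    for child in root.children:
        yield from text_nodes(child)


def semantic_match(dom, attributes):
    """Return the match list M as (kind, attribute, text) tuples.

    kind is "Ip" for a matched attribute name and "Iv" for its value.
    Assumption: a node matches an attribute if it contains one of its labels.
    """
    matches = []
    leaves = list(text_nodes(dom))
    for i, node in enumerate(leaves):
        for attr, labels in attributes.items():
            label = next((l for l in labels if l in node.text), None)
            if label is None:
                continue
            matches.append(("Ip", attr, label))
            # First look for the value inside the same node, after the label.
            value = node.text.split(label, 1)[1].strip(" :：\t")
            # Otherwise fall back to the next text node, as in step 2.
            if not value and i + 1 < len(leaves):
                value = leaves[i + 1].text.strip()
            if value:
                matches.append(("Iv", attr, value))
    return matches


# Hypothetical page fragment: one inline value, one value in the next node.
dom = Node("div", children=[
    Node("#text", "Name: Aspirin"),
    Node("span", children=[Node("#text", "Price")]),
    Node("#text", "12.50"),
])
attributes = {"name": ["Name"], "price": ["Price"]}
matches = semantic_match(dom, attributes)
```

In this sketch the value of `name` is found inside the matching node, while the value of `price` comes from the following text node, which mirrors the two branches of step 2.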

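[0057] 2. Algorithm 2: Computing the structural semantic entropy of a node

[0058] Input: node N

[0059] Output: the structural semantic entropy H of node N

[0060] Steps:

[0061] 1) Compute the probability of occurrence of each semantic role contained in node N:

[0062] p(x_i) = (number of occurrences of semantic role x_i in N) / (total number of text nodes under N);

[0063] 2) Compute the structural semantic entropy H of node N with the following formula: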

[0064] $$H(N) = \sum_{i=1}^{n} p(x_i)\, I(x_i) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i)$$

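[0065] where p(x_i) is obtained in step 1, n is the number of semantic roles contained under node N, and I(x_i) = log_b(1 / p(x_i)) is the information content: the smaller p(x_i) is, the more information is carried by the event that an element is labelled with the i-th semantic role. In information theory, b is usually taken to be 2.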
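Algorithm 2 can be sketched in a few lines. The sketch assumes that every text node under N carries exactly one semantic role label, so the probabilities sum to 1, and it sums p(x_i) I(x_i) over the roles, which is the standard entropy form consistent with the definitions above. The function name and the example role labels are hypothetical.

```python
import math
from collections import Counter


def structural_semantic_entropy(roles, base=2):
    """Entropy of the semantic roles found under a node.

    roles: one role label per text node under node N (assumption).
    base:  logarithm base b, 2 by default as in the text.
    """
    total = len(roles)
    if total == 0:
        return 0.0  # no text nodes: treat the node as carrying no information
    entropy = 0.0
    for count in Counter(roles).values():
        p = count / total                      # step 1: p(x_i)
        entropy += p * math.log(1 / p, base)   # step 2: p(x_i) * I(x_i)
    return entropy


# Hypothetical detail block: four text nodes, four different roles.
detail_entropy = structural_semantic_entropy(["name", "price", "maker", "spec"])

# Hypothetical link list: four text nodes that all share one role.
list_entropy = structural_semantic_entropy(["name", "name", "name", "name"])
```

With four equally likely roles the entropy works out to 2 bits, while a list whose items all share one role has entropy 0. That contrast is what lets the entropy separate a detail region from a repetitive list.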

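[0066] 3. Algorithm 3: Selecting attribute aggregation region nodes and filtering out interference

[0067] Input: the node list L, sorted by computed structural semantic entropy, and the structural semantic entropy threshold HT

[0068] Output: the filtered node list L

[0069] Steps:

[0070] 1) Given the threshold HT, if the structural semantic entropy of node N is greater than the threshold, the node may be an attribute aggregation region; otherwise it is judged not to be one. HT can be adjusted as circumstances require;

[0071] 2) Mark all nodes in L as attribute aggregation region nodes;

[0072] 3) Traverse the nodes N_i in the node list L; if the structural semantic entropy of N_i is less than HT, change N_i to a non-attribute-aggregation-region node;

[0073] 4) Traverse the node...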
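Steps 1 to 3 of Algorithm 3 can be sketched as below. Step 4 is truncated in the source, so it is deliberately left out. The `ScoredNode` class, the node names, and the entropy values are hypothetical, and a node whose entropy equals HT is kept here because step 3 only demotes nodes strictly below the threshold; the source does not settle that boundary case.

```python
from dataclasses import dataclass


@dataclass
class ScoredNode:
    """Hypothetical record: a DOM node name with its precomputed entropy."""
    name: str
    entropy: float
    is_aggregation_region: bool = False


def filter_aggregation_regions(nodes, threshold):
    """Apply steps 2 and 3 using the threshold HT from step 1."""
    # Step 2: provisionally mark every node as an aggregation region.
    for node in nodes:
        node.is_aggregation_region = True
    # Step 3: demote nodes whose entropy is below the threshold.
    for node in nodes:
        if node.entropy < threshold:
            node.is_aggregation_region = False
    # Step 4 is truncated in the source and is not implemented here.
    return [node for node in nodes if node.is_aggregation_region]


# Hypothetical input, sorted by entropy in descending order.
candidates = [ScoredNode("detail_block", 2.0), ScoredNode("link_list", 0.0)]
kept = filter_aggregation_regions(candidates, threshold=1.0)
```

Keeping the flag on each node, instead of only returning the survivors, leaves room for a later pass to inspect the demoted nodes as well.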


Abstract

The invention belongs to the field of Internet technology and relates in particular to a method for locating Internet information objects based on the structural semantics of webpages. The method first converts the HTML code of each retrieved relevant webpage into a DOM tree. It then semantically matches every text node against a semantic dictionary of the Internet information object, assigns semantic roles, and calculates a structural semantic entropy value for every internal (non-leaf) node of the DOM tree as a measure of semantic richness. Finally, it combines the entropy values with the hierarchy of the webpage, which reflects how strongly semantic information aggregates at a node, to determine which page region holds the specified information object among a large number of webpages, and then extracts the required data. One application embodiment of the invention is Internet medicine information search and analysis.
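The first step, converting HTML into a DOM tree, might look like the sketch below. It uses Python's standard `html.parser.HTMLParser`; the `Node` class, the `DomBuilder` class, the `parse_html` helper, and the short list of void tags are assumptions made for illustration, and a production system would more likely rely on a full HTML5 parser.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


@dataclass
class Node:
    """Hypothetical simplified DOM node; tag "#text" marks a text node."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


class DomBuilder(HTMLParser):
    """Build a tree of Node objects while the parser streams tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.stack[-1].children.append(Node("#text", data.strip()))


def parse_html(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# Hypothetical page fragment with a detail paragraph and a link list.
tree = parse_html(
    "<div><p>Name: Aspirin</p><ul><li>Home</li><li>About</li></ul></div>"
)
```

The resulting tree keeps text as leaf nodes under their enclosing elements, which is the shape the later matching and entropy steps operate on.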

Description

Technical field

[0001] The invention belongs to the technical field of the Internet, and in particular relates to a method for locating Internet information objects.

Technical background

[0002] Precise search technology for specific application domains underlies application systems such as public opinion monitoring, product price comparison, and advertising monitoring, and semantic technology is a prerequisite for precise search. At present, although semantic technology has produced much research and many results in information retrieval, search engines, product price comparison, data mining, and similar fields, in most cases it is applied piecemeal rather than systematically. The definition of semantics does not clearly delimit the scope of semantic concepts, and the completeness of semantic structures lacks a theoretical basis, which limits in-depth research and wide application of semantic technology.

[0003] The Internet informati...


Application Information

IPC(8): G06F17/30, G06F17/27
Inventor: 李银胜, 廖逸, 吴晓彦, 顾轶灵, 沈元一
Owner: FUDAN UNIV