Method and apparatus for extracting page theme

This page theme extraction technology, applied in the computer field, addresses problems such as page theme deviation, page keywords that do not accurately reflect the page theme, and search rankings that do not accurately meet user needs. Its effect is to reduce such deviation and better meet user needs.

Active Publication Date: 2012-10-17
BEIJING BAIDU NETCOM SCI & TECH CO LTD

AI Technical Summary

Problems solved by technology

However, the title of a page may contain multiple paragraphs, some of which are irrelevant to the page theme, which causes the page theme to deviate.
As a result, page search ranking may not accurately meet user needs, and the page keywords that are determined may not accurately reflect the page theme.

Examples

Embodiment 1

[0071] Figure 1 is a flow chart of the method for extracting page themes provided by Embodiment 1 of the present invention. As shown in Figure 1, the method may include the following steps:

[0072] Step 101: Obtain candidate paragraphs expressing the theme of the page in the page.

[0073] In this step, the candidate paragraphs that express the theme of the page are those paragraphs that may reflect the page theme. They may include, but are not limited to, at least one of the following:

[0074] The page title paragraph with the label title, the page title row with the label realtitle, the navigation paragraph with the label mypos, and the front link with the label preanchor.

[0075] For example, for the page http://www.22zw.cn/XH/91H53969KX/, the following four paragraphs are obtained:

[0076] The title paragraph of the page with the label title reads: The latest chapter of Fights Break the Sky.

[0077]...
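Step 101 can be illustrated with a minimal sketch. It assumes an upstream page parser has already produced (label, text) pairs; that parser, the `Paragraph` class, and the function name are hypothetical and do not come from the patent. Only the four label names are taken from the text.

```python
from dataclasses import dataclass

# Labels named in the text as carrying candidate theme paragraphs.
CANDIDATE_LABELS = ("title", "realtitle", "mypos", "preanchor")


@dataclass
class Paragraph:
    label: str  # page region the text came from
    text: str   # paragraph text with surrounding whitespace removed


def get_candidate_paragraphs(labeled_blocks):
    """Keep non-empty blocks whose label is one of the candidate labels.

    `labeled_blocks` is an assumed list of (label, text) pairs.
    """
    return [
        Paragraph(label, text.strip())
        for label, text in labeled_blocks
        if label in CANDIDATE_LABELS and text.strip()
    ]


# Illustrative input: one usable title, one unrelated block, one empty block.
blocks = [
    ("title", "The latest chapter of Fights Break the Sky"),
    ("footer", "Copyright notice"),
    ("mypos", "   "),
]
candidates = get_candidate_paragraphs(blocks)
```

Blocks with other labels, or with only whitespace, are dropped, so `candidates` here holds just the title paragraph.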

Embodiment 2

[0104] Figure 2 is a flow chart of the method for calculating the confidence of each paragraph provided by Embodiment 2 of the present invention. As shown in Figure 2, the method may include the following steps:

[0105] Step 201: Perform word segmentation processing on each paragraph.

[0106] Preferably, the words obtained from word segmentation can also be filtered against a preset stop word list. The stop word list contains words that appear frequently in web pages, including but not limited to adverbs, function words, modal particles, particles, and pronouns. These words usually carry little expressive power.
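As a rough illustration of Step 201 and the optional stop word filtering, the sketch below uses whitespace splitting as a stand-in for a real word segmenter (Chinese text would need a dedicated segmentation tool). The stop word set and function names are assumptions made for this example only.

```python
# Hypothetical stop word list; in practice it is preset by the implementer.
STOP_WORDS = {"the", "of", "a", "an", "is", "very"}


def segment(paragraph):
    """Stand-in segmenter: lowercase the text and split on whitespace."""
    return paragraph.lower().split()


def segment_and_filter(paragraph, stop_words=STOP_WORDS):
    """Segment a paragraph, then drop words found in the stop word list."""
    return [word for word in segment(paragraph) if word not in stop_words]


words = segment_and_filter("The latest chapter of Fights Break the Sky")
```

The filtered list keeps only the words that are more likely to carry meaning for the later confidence calculation.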

[0107] Step 202: Calculate the confidence of each word obtained after word segmentation according to the formula $D_{ij} = \alpha \cdot S_{ij} + \beta \cdot P_{ij}$.

[0108] Here, $D_{ij}$ is the confidence of the j-th word obtained after word segmentation of the i-th paragraph, and $S_{ij}$ is the frequency of occurrence of the j-th word in all paragraphs obtained after t...
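The weighted sum in Step 202 can be sketched as follows. Because the definitions of S_ij and P_ij are truncated in this excerpt, the sketch treats both as precomputed inputs rather than guessing how they are derived; the weights of 0.5 are arbitrary placeholders, not values from the patent.

```python
def word_confidence(s_ij, p_ij, alpha, beta):
    """Compute D_ij = alpha * S_ij + beta * P_ij for a single word."""
    return alpha * s_ij + beta * p_ij


def paragraph_word_confidences(s_row, p_row, alpha=0.5, beta=0.5):
    """Apply the formula to every word j of one paragraph i.

    `s_row` and `p_row` hold the S_ij and P_ij values for the paragraph's
    words, in the same order. How they are obtained is outside this sketch.
    """
    if len(s_row) != len(p_row):
        raise ValueError("s_row and p_row must have the same length")
    return [word_confidence(s, p, alpha, beta) for s, p in zip(s_row, p_row)]


# Two words with assumed S and P values.
d_row = paragraph_word_confidences([4.0, 2.0], [2.0, 0.0])
```

With these placeholder inputs the first word scores 0.5 * 4.0 + 0.5 * 2.0 = 3.0 and the second 0.5 * 2.0 + 0.5 * 0.0 = 1.0.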

Embodiment 3

[0118] Figure 3 is a flow chart of the method for extracting page keywords provided by Embodiment 3 of the present invention. As shown in Figure 3, the method may include the following steps:

[0119] Step 301: Perform word segmentation processing on the maintitle determined in Embodiment 1.

[0120] If only one maintitle is determined for the page, the process of Embodiment 3 is executed for that maintitle only; if multiple maintitles are determined, the process is executed for each maintitle separately.

[0121] Step 302: Perform part-of-speech tagging on each word obtained after word segmentation.

[0122] Step 303: Filter each word obtained after word segmentation based on a preset stop word list.

[0123] This step filters out, from the words obtained after word segmentation, the words contained in the stop word list. The stop word list includes words that appear frequ...
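Steps 301 to 303 can be sketched as a small per-maintitle pipeline. The whitespace segmenter, the toy part-of-speech lexicon, the tag names, and the stop word set below are all assumptions made for illustration; a real system would use a proper segmenter and tagger. The steps that follow Step 303 are truncated in this excerpt and are not modeled.

```python
# Hypothetical resources, for illustration only.
STOP_WORDS = {"the", "of"}
POS_LEXICON = {"latest": "ADJ", "chapter": "NOUN", "the": "DET", "of": "ADP"}


def segment(text):
    """Step 301 stand-in: lowercase and split on whitespace."""
    return text.lower().split()


def tag(words):
    """Step 302 stand-in: look up each word's tag, 'UNK' if unknown."""
    return [(word, POS_LEXICON.get(word, "UNK")) for word in words]


def process_maintitle(maintitle):
    """Run Steps 301 to 303 on one maintitle."""
    tagged = tag(segment(maintitle))
    # Step 303: drop words that appear in the stop word list.
    return [(word, pos) for word, pos in tagged if word not in STOP_WORDS]


def process_maintitles(maintitles):
    """Run the process once per maintitle, whether there is one or several."""
    return [process_maintitle(m) for m in maintitles]


result = process_maintitles(["The latest chapter of Fights Break the Sky"])
```

Each maintitle yields its own list of (word, tag) pairs, which matches the requirement that the process is executed separately for each maintitle.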

Abstract

The invention provides a method and an apparatus for extracting a page theme. The method comprises: A. acquiring candidate paragraphs that express the page theme; B. if a candidate paragraph that can be further divided into paragraphs exists, dividing that candidate paragraph; otherwise proceeding to step C; C. calculating the confidence of each paragraph obtained after step B; and D. taking each paragraph whose confidence meets a preset confidence requirement as a paragraph of the page theme. With this method and apparatus, the page theme can be determined more accurately, and the deviation between the extracted page theme and the actual page theme can be reduced.
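Steps A to D can be pictured as one short pipeline. In the sketch below, splitting on the characters `_`, `|`, and `-` is an assumed way of deciding whether a candidate can be divided further, and a simple numeric threshold is an assumed form of the preset confidence requirement; neither is specified in this excerpt. The confidence function is passed in by the caller, and the word-count function used in the example is a toy placeholder.

```python
import re


def split_if_possible(paragraph, delimiters=r"[_|\-]"):
    """Step B: split on assumed delimiter characters, if any are present."""
    parts = [p.strip() for p in re.split(delimiters, paragraph) if p.strip()]
    return parts if len(parts) > 1 else [paragraph]


def extract_theme(candidates, confidence_fn, threshold):
    """Run steps B to D on the candidate paragraphs obtained in step A."""
    paragraphs = []
    for candidate in candidates:
        paragraphs.extend(split_if_possible(candidate))  # Step B
    scored = [(p, confidence_fn(p)) for p in paragraphs]  # Step C
    return [p for p, conf in scored if conf >= threshold]  # Step D


# Toy confidence: number of words. A placeholder, not the patent's measure.
themes = extract_theme(
    ["Fights Break the Sky latest chapter_Some Site"],
    confidence_fn=lambda p: len(p.split()),
    threshold=3,
)
```

Here the candidate is divided into two paragraphs, and only the first one meets the toy threshold, so it alone is kept as the page theme.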

Description

Technical field

[0001] The invention relates to the field of computer technology, in particular to a method and device for extracting page themes.

Background

[0002] Whether in ranking for page search, determining page keywords, or other applications, obtaining the page theme is involved. For example, in page search ranking, the higher the correlation between the page theme and the query, the higher the ranking; in keyword determination, page keywords are usually extracted from the page theme; and so on.

[0003] Currently, it is common to simply use the entire title paragraph (title) of the page as the page theme. However, the title of a page may contain multiple paragraphs, some of which are irrelevant to the page theme, which causes the page theme to deviate. In applications, page search ranking may then fail to accurately meet user needs, and the determined page keywords may ...

Application Information

IPC(8): G06F17/30
Inventor: 刘海浪
Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD