Text structure analysis-based Web document abstract generation method

This technology concerns text structure analysis and document summarization and is applied in the fields of web page text extraction, natural language processing, and Chinese automatic summarization. It addresses the problems that automatic summarization cannot yet be truly achieved, that the research history of automatic summarization technology is short, and that web page titles are named less rigorously than those of traditional texts. The stated effects are high summary coverage, fast and accurate information lookup, and fluent summaries.

Status: Inactive
Publication Date: 2014-06-11
EAST CHINA NORMAL UNIVERSITY
3 Cites, 17 Cited by

AI Technical Summary

Problems solved by technology

Because natural language processing technology has not yet seen a major breakthrough, comprehension-based methods cannot truly achieve automatic summarization.
[0006] The research history of automatic summarization technology for Web documents is even shorter. Compared with traditional texts, web pages have a loose text structure, their titles are named less rigorously, sentences may lack a terminating punctuation mark, and pages contain a large amount of content unrelated to the main text, all of which make summary generation more difficult.



Examples


Embodiment Construction

[0049] The invention discloses a search-engine-oriented method for generating web document abstracts, which automatically analyzes a web page and generates a text summary reflecting the theme of the page.

[0050] The invention comprises two parts: web page body text extraction that combines visual features and text features, and automatic text summarization based on subtopic division obtained through text structure analysis.

[0051] The invention takes a URL as input and generates a text summary through two stages: web page body text extraction and automatic summarization.

[0052] The specific algorithms of the two stages are described further below, using the summarization of a news web page as an example:

[0053] Figure 1 describes the overall process from the URL to be summarized to the generated summary, including web page preprocessing and automatic summarization.
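The first stage, body text extraction, can be illustrated with a minimal Python sketch. It is not the patented algorithm: the text above does not give the actual features or thresholds, and visual features (which would need a rendering engine) are left out entirely. The sketch uses only two text features, block length and link density, and the names `BlockCollector`, `extract_body`, `min_chars`, and `max_link_density` are hypothetical. It takes an HTML string rather than a URL so that it runs without network access.

```python
from html.parser import HTMLParser

# Assumed (hypothetical) tag sets; the source does not specify them.
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects (text, link_chars) pairs, one per block-level run of text."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._text = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Normalize whitespace and store the block if it has any text.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style content
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def extract_body(html, min_chars=40, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [
        text
        for text, link_chars in parser.blocks
        if len(text) >= min_chars and link_chars / len(text) <= max_link_density
    ]
```

Navigation bars and link lists tend to be short or almost entirely anchor text, so both filters drop them, while ordinary paragraphs pass. The thresholds are illustrative defaults and would need tuning on real pages.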

[0054] Specifically, in the embodiment, the present invention is in the web page ...


Abstract

The invention discloses a Web document abstract generation method based on text structure analysis. The method takes a URL (uniform resource locator) as input, extracts the web page body text by combining visual features and text features, partitions the body into several semantic paragraphs, and summarizes each semantic paragraph, so that the generated abstract has a higher coverage rate. Given that web page structure is complex, the body text is hard to identify, and Chinese automatic summarization is still at an exploratory stage, the method makes it possible to generate a better-quality text abstract from a web page.
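The idea of partitioning the body into semantic paragraphs and summarizing each one can be sketched in Python as follows. This is an illustrative stand-in, not the method claimed in the patent: it splits wherever the bag-of-words cosine similarity between adjacent sentences falls below a threshold, then picks from each segment the sentence whose words are most frequent within that segment. The function names, the `threshold` value, and the regex tokenizer are assumptions; Chinese text would need a proper word segmenter, and stop-word removal is omitted.

```python
import math
import re
from collections import Counter


def tokens(sentence):
    # Assumption: simple regex tokenization, adequate only for
    # space-delimited languages.
    return re.findall(r"\w+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(sentences, threshold=0.1):
    """Start a new segment when a sentence is dissimilar to the previous one."""
    segments = []
    for sentence in sentences:
        bag = Counter(tokens(sentence))
        if segments and cosine(Counter(tokens(segments[-1][-1])), bag) >= threshold:
            segments[-1].append(sentence)
        else:
            segments.append([sentence])
    return segments


def pick_sentence(seg):
    """Pick the sentence with the highest average in-segment word frequency."""
    freq = Counter(w for s in seg for w in tokens(s))

    def score(s):
        ws = tokens(s)
        return sum(freq[w] for w in ws) / len(ws) if ws else 0.0

    return max(seg, key=score)


def summarize(sentences, threshold=0.1):
    """One representative sentence per segment, in document order."""
    return [pick_sentence(seg) for seg in segment(sentences, threshold)]
```

Because every segment contributes one sentence, each subtopic is represented in the output, which is the coverage property the abstract describes. A summarizer that ranks sentences over the whole document can instead draw all of its sentences from a single dominant topic.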

Description

Technical field

[0001] The invention relates to the technical fields of web page text extraction, natural language processing, and Chinese automatic summarization, and in particular to a method for generating web document summaries based on text structure analysis.

Background

[0002] At present, the Internet has become the main source from which people obtain information. With the rapid development of User Generated Content (UGC) in recent years in particular, the amount of information on the Internet is growing explosively. Although search engines can return results that match a user's query, users still need to pick the web page that best fits their needs from the result list. The large amount of search engine optimization and reposting on the Internet makes it very difficult for users to find information quickly and accurately.

[0003] The automatic summarization system uses computers to quickly process web documents,...


Application Information

IPC(8): G06F17/30, G06F17/27
CPC: G06F16/80, G06F40/30
Inventors: 沈怡涛, 顾君忠, 林晨
Owner: EAST CHINA NORMAL UNIVERSITY