The invention discloses an automatic method for extracting data from deep web pages
and belongs to the field of computer data mining. The method first obtains two
deep web pages from the same website and designates them the first page and the
second page. The HTML (hypertext markup language) documents of both pages are
converted into XHTML (extensible hypertext markup language) documents, noise is
removed from both pages, and repeated patterns in the two pages are eliminated to
generate a webpage data extraction wrapper. To extract data from a page, noise is
first removed from that page, the denoised page is then marked using the webpage
data extraction wrapper, and finally the data are extracted from the marked page.
With this method, the efficiency of the repeated-pattern
elimination algorithm and the efficiency of the matching algorithm are improved,
and extraction complexity is reduced. The matching algorithm and the extraction
algorithm, both designed around the characteristics of that elimination algorithm,
are simple and fast, and data extraction accuracy is improved.
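
The overall flow described above (two sample pages, noise removal, wrapper generation, then marking and extraction) can be illustrated with a minimal Python sketch. It is not the patented algorithm: the abstract does not specify how repeated patterns are eliminated or how matching is done, so the sketch swaps in a simple tag-path comparison as a stand-in. It assumes the input is already well-formed XHTML, treats only script and style elements as noise, and all names (NOISE_TAGS, PathTextParser, build_wrapper, extract) are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: "noise" here means only script and style elements.
NOISE_TAGS = {"script", "style"}


class PathTextParser(HTMLParser):
    """Collect (tag_path, text) pairs, skipping text inside noise elements."""

    def __init__(self):
        super().__init__()
        self.stack = []  # tags that are currently open
        self.items = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Assumes well-formed XHTML, so the closing tag matches the stack top.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and not NOISE_TAGS.intersection(self.stack):
            self.items.append(("/".join(self.stack), text))


def clean_items(xhtml):
    """Noise removal step: return the page's text items with their tag paths."""
    parser = PathTextParser()
    parser.feed(xhtml)
    parser.close()
    return parser.items


def texts_by_path(items):
    """Group texts by tag path; repeated records share a path and collapse."""
    grouped = {}
    for path, text in items:
        grouped.setdefault(path, set()).add(text)
    return grouped


def build_wrapper(first_page, second_page):
    """Stand-in wrapper: the tag paths whose text differs between the pages."""
    a = texts_by_path(clean_items(first_page))
    b = texts_by_path(clean_items(second_page))
    return {p for p in a.keys() | b.keys() if a.get(p, set()) != b.get(p, set())}


def extract(wrapper, page):
    """Mark text items whose path is in the wrapper and return their text."""
    return [text for path, text in clean_items(page) if path in wrapper]


if __name__ == "__main__":
    head = "<html><body><h1>Books</h1><script>var x = 1;</script><ul>"
    tail = "</ul></body></html>"
    row = "<li><b>Title:</b><span>{}</span></li>"
    first = head + row.format("A") + row.format("B") + tail
    second = head + "".join(row.format(t) for t in "CDE") + tail
    target = head + row.format("X") + row.format("Y") + tail
    print(extract(build_wrapper(first, second), target))  # prints ['X', 'Y']
```

Grouping text by tag path means that any number of repeated records reduces to a single entry per path, which is one simple way to approximate the effect of collapsing repeated patterns before the two pages are compared. Text that is identical on both sample pages (here the heading and the "Title:" labels) is treated as template and left out of the wrapper.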