Field-oriented method and system for collecting invisible web resources

A field-oriented collection method, applied in the field of information retrieval, that addresses problems such as ignoring differences between websites in the same field, the difficulty of automatically determining the value range or accepted data type of each input item, and the difficulty of automatically matching queries to input items.

Active Publication Date: 2013-05-22
INST OF COMPUTING TECH CHINESE ACAD OF SCI
1 Cites, 12 Cited by

AI Technical Summary

Problems solved by technology

One existing method adopts a domain-oriented approach. Although it can achieve high resource coverage, it still has several defects for a search engine: (1) it does not consider differences in website design within the same field, so it is difficult to establish a semantic mapping between simple query interfaces and the intermediary form; in particular, it cannot effectively capture dark web resources hidden behind forms with a single input item; (2) maintaining the intermediary forms and preparing input data for them is heavy work, which gives the method poor scalability.
Another existing method is better suited to simple query interfaces but has difficulty collecting resources behind complex query interfaces. The reason is that a complex query interface contains multiple input items: it is difficult to automatically determine the value range or accepted data type of each input item, and difficult to establish an automatic match between queries and input items.

Examples

Embodiment Construction

[0076] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below through specific embodiments in conjunction with the accompanying drawings. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0077] Figure 1 is a schematic flow chart of a domain-oriented dark web resource collection method according to an embodiment of the present invention. The method integrates with existing web crawlers and collects dark web resources related to a given field through the following steps:

[0078] (1) Identify pages related to the specified field from the captured pages, and obtain a valid form set (see steps 100, 200 and 300 in Figure 1);

[0079] (2) For each form in the valid form set, determine its form type and construct an effective query according to different form types, ...
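
The form-type decision in step (2) can be illustrated with a minimal Python sketch. It is not the patented procedure: it assumes that a form's type can be read off by counting its fillable input items, and the names FormCollector, classify_forms and NON_QUERY_TYPES are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these input types are not filled in by a user,
# so they are not counted as input items.
NON_QUERY_TYPES = {"hidden", "submit", "button", "reset", "image"}


class FormCollector(HTMLParser):
    """Collects, for each <form>, the names of its fillable input items."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one list of item names per form
        self._current = None  # items of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = []
        elif self._current is not None:
            if tag == "input":
                # A missing type attribute defaults to a text box.
                if (attrs.get("type") or "text").lower() not in NON_QUERY_TYPES:
                    self._current.append(attrs.get("name") or "")
            elif tag in ("select", "textarea"):
                self._current.append(attrs.get("name") or "")

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def classify_forms(html):
    """Return (form_type, item_names) for every form with at least one item."""
    parser = FormCollector()
    parser.feed(html)
    parser.close()
    result = []
    for items in parser.forms:
        if not items:
            continue  # nothing to fill in, so not a query interface
        kind = "single-input" if len(items) == 1 else "multi-input"
        result.append((kind, items))
    return result
```

A form with one text box is reported as single-input, while a form with a text box and a drop-down list is reported as multi-input; hidden fields and buttons are ignored under the stated assumption.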

Abstract

The invention provides a field-oriented method for collecting invisible web resources. The method includes the steps of identifying pages related to a designated field from the captured pages to obtain a valid form set; determining the form type of every form in the valid form set; constructing valid queries according to the form type; and outputting the results returned for the valid queries as the collected invisible web resources, wherein the form type is either a single-input-item form or a multi-input-item form. The method automatically identifies and classifies invisible-resource query interfaces and constructs valid queries for both simple and complex query interfaces, thereby collecting the invisible resources. It can be integrated seamlessly into an existing search engine and can collect the invisible resources behind both simple and complex query interfaces.
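
As a rough illustration of constructing queries according to form type, the Python sketch below builds GET query URLs. It is a sketch under stated assumptions, not the construction described in the patent: it assumes the form is submitted with GET, that a single-input form is filled with one domain keyword per query, and that candidate values for a multi-input form come from a caller-supplied mapping. The function build_query_urls and all of its parameters are hypothetical.

```python
from urllib.parse import urlencode, urljoin


def build_query_urls(page_url, action, form_type, items, keywords, field_values=None):
    """Build GET query URLs for one form (hypothetical helper).

    single-input: one URL per domain keyword.
    multi-input: one URL that fills each known item with its candidate value.
    Simplification: assumes the action URL has no query string of its own.
    """
    base = urljoin(page_url, action)  # resolve a relative action URL
    if form_type == "single-input":
        return [base + "?" + urlencode({items[0]: kw}) for kw in keywords]
    field_values = field_values or {}
    # Only items for which a candidate value is known are filled in.
    params = {name: field_values[name] for name in items if name in field_values}
    return [base + "?" + urlencode(params)] if params else []
```

The pages fetched from these URLs would then be treated as the collected resources. How candidate values are actually chosen for each input item is the hard part that the source identifies, and this sketch leaves it to the caller.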

Description

Technical field

[0001] The invention relates to information retrieval, in particular to a method for collecting dark web resources.

Background

[0002] With the rapid development of Internet technology, many different types of databases have appeared on the Internet. The information stored in them is huge in quantity and high in quality, forming a huge online information resource library. The information in these background databases is stored in a standardized, unified way, with a good data structure and high data quality, but most of the databases are hidden behind a query interface, that is, a form: users can obtain the background database information only after entering a series of keywords through the query interface and submitting the query. However, current web crawlers do not have the ability to fill in the query interface automatically, so this information cannot be obtained directly by a web crawler through page hyperlink relationships, so t...

Application Information

IPC(8): G06F17/30
Inventor 熊锦华, 林海伦, 程学旗, 张永超, 廖华明
Owner INST OF COMPUTING TECH CHINESE ACAD OF SCI