63 results about "Web scraping" patented technology

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Webpage screening method and device thereof

The invention discloses a webpage screening method and device. The method comprises: crawling preset seed webpages; extracting the uniform resource locator (URL) information contained in the seed webpages; calculating a webpage quality score for each URL; dividing the URLs into candidate sets according to preset network address information; and, from each candidate set, selecting no more URLs than the preset pressure quota allows, such that every selected URL has a quality score not lower than that of any URL left in the same candidate set. The preset pressure quota thereby bounds the crawl load placed on each network address. The webpages corresponding to the selected URLs are taken as the target webpages to crawl. The method lowers the risk of crawl failures and of being banned by a site, thereby improving the crawl success rate.
Owner:人民数据管理(北京)有限公司
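
The screening step in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes the "network address" is the URL host, that each URL already carries a numeric score, and that the quota is a plain per-host count. The names `screen_urls` and `quota_per_host` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def screen_urls(scored_urls, quota_per_host):
    """Pick at most `quota_per_host` highest-scoring URLs for each host.

    scored_urls: iterable of (url, score) pairs.
    Returns a dict mapping host -> list of selected URLs, best first.
    """
    # Group URLs into candidate sets keyed by host (assumed to be the
    # "network address" the abstract refers to).
    candidates = defaultdict(list)
    for url, score in scored_urls:
        host = urlsplit(url).netloc
        candidates[host].append((score, url))

    selected = {}
    for host, items in candidates.items():
        # Sort by descending score; ties are broken by URL for determinism.
        items.sort(key=lambda pair: (-pair[0], pair[1]))
        # Keep no more than the quota, so no host is hit too often.
        selected[host] = [url for _, url in items[:quota_per_host]]
    return selected
```

Because each host keeps only its top-scoring URLs up to the quota, no selected URL scores lower than one that was left behind for the same host.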

Web page crawling method and spider

The invention discloses a web page crawling method and a spider. The method comprises the following steps: injecting seed URLs into a Web database; generating a URL list based on the Web database; feeding the URLs in the URL list to a web page crawler; crawling each web page, by the web page crawler, according to the fed URL and in conformance with the visit mode of that web page; and, based on the crawled web pages, updating the URL states in the Web database and injecting newly found URLs. The visit mode comprises a request parameter set, a response parameter set, and the correspondence between the two. The request parameter set comprises the request parameters, as well as the mapping relationship between the request parameter set and the response parameter set. The response parameter set comprises the response parameters, as well as information on the position from which each response parameter is extracted in the HTTP response message.
Owner:FUJITSU LTD
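
The inject, generate, fetch, and update cycle in the abstract above might look like the sketch below. It is a simplified illustration under stated assumptions: the Web database is modelled as an in-memory dict, links are found with a naive regular expression, and the visit mode (request and response parameters) is left out entirely. The `fetch` callable and the state names are hypothetical.

```python
import re

# Naive link extractor; a real crawler would use an HTML parser.
LINK_RE = re.compile(r'href="([^"]+)"')


def crawl(seed_urls, fetch, max_rounds=3):
    """Run a small inject -> generate -> fetch -> update loop.

    fetch: callable taking a URL and returning the page HTML as a string
           (hypothetical; a real crawler would issue HTTP requests here).
    Returns the URL database: a dict mapping url -> "unfetched" | "fetched".
    """
    # Step 1: inject the seed URLs into the URL database.
    db = {url: "unfetched" for url in seed_urls}

    for _ in range(max_rounds):
        # Step 2: generate the list of URLs that still need fetching.
        fetch_list = [u for u, state in db.items() if state == "unfetched"]
        if not fetch_list:
            break
        for url in fetch_list:
            # Step 3: fetch the page for this URL.
            html = fetch(url)
            # Step 4: update the URL state and inject newly found URLs.
            db[url] = "fetched"
            for link in LINK_RE.findall(html):
                db.setdefault(link, "unfetched")
    return db
```

Passing `fetch` in as a parameter keeps the loop testable without network access; `max_rounds` simply caps how deep the sketch follows newly found links.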

Method for automatically finding network content quotation

Active · CN1770159A · Speed up the auto-discovery process · Low hardware requirement · Special data processing applications · Information retrieval · Natural language understanding
The invention relates to a method for automatically finding quotations of network content. The method introduces a pre-search step to accelerate the automatic discovery process, using the indexing service provided by search engine websites so that web pages need not be crawled and no content index needs to be built. The invention has low hardware requirements and helps protect the intellectual property of network content.
Owner:新方正控股发展有限责任公司 +2

Method and device for reading webpage resources, and electronic equipment

The embodiment of the invention discloses a method and a device for reading webpage resources, and electronic equipment. The method is applied to WebView controls on versions 4.0 to 4.3 of the Android operating system. The method comprises the following steps: if the loading state of the webpage resources to be fetched is "loading complete", obtaining the URL (Uniform Resource Locator) information of those resources, where the webpage resources to be fetched correspond to a received webpage fetching request; obtaining the resource cache file path mapped to the package name of the application that constructs the current webpage; extracting a binary data file under the resource cache file path and traversing it to find an information field matching the URL information; searching the information preceding the matched field to locate preset marker information; obtaining the webpage resource file corresponding to the URL information from the information preceding the marker information and a filename calculation strategy; and reading that webpage resource file under the resource cache file path. The method and the device can improve the efficiency of web resource utilization.
Owner:KINGSOFT

Filtering expression and rendering engine based method for automatically monitoring update of dynamic webpage

The invention discloses a method, based on filtering expressions and a rendering engine, for automatically monitoring updates to a dynamic webpage. A user designates the part of the webpage they are interested in as a point of interest through a visual interface, and an application or client automatically generates a filtering expression corresponding to that point of interest. A server renders the dynamic webpage with the rendering engine to obtain the same page the user sees, and extracts the user's point of interest. When the point of interest is updated, the server pushes the updated content to the user promptly. By helping the user designate a point of interest and using the rendering engine on the server to check for webpage updates automatically, the method provides a customizable dynamic webpage monitoring program. It addresses the lack of customization in conventional information subscription modes such as RSS (Really Simple Syndication), overcomes the inability of conventional webpage crawling to analyze dynamic webpages, and improves the efficiency with which users obtain webpage updates.
Owner:SOUTHEAST UNIV
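
The server-side check in the abstract above can be illustrated with a small change detector. This sketch makes several assumptions that the abstract does not state: the page has already been rendered to an HTML string by some rendering engine, the filtering expression is modelled as a regular expression with one capture group, and changes are detected by comparing SHA-256 digests. The class name `PointOfInterestMonitor` is hypothetical.

```python
import hashlib
import re


class PointOfInterestMonitor:
    """Track one region of a rendered page and report when it changes."""

    def __init__(self, filter_pattern):
        # The filter expression is modelled as a regex with one capture group.
        self._pattern = re.compile(filter_pattern, re.DOTALL)
        self._last_digest = None

    def check(self, rendered_html):
        """Return the region's content if it is new or changed, else None."""
        match = self._pattern.search(rendered_html)
        if match is None:
            # The selected region is absent from this rendering.
            return None
        content = match.group(1).strip()
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest == self._last_digest:
            return None
        # Remember the new state so the same content is not reported twice.
        self._last_digest = digest
        return content
```

A server could call `check` on each fresh rendering and push any non-`None` result to the subscribed user; changes elsewhere on the page are ignored because only the captured region is hashed.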

Web page grabbing method and web page grabbing system based on big data

The invention provides a web page grabbing method and a web page grabbing system based on big data. The web page grabbing method comprises the following steps: receiving a web page request from a user; classifying the big data according to the keyword classification of the web page; and transmitting the classified web page corresponding to the web page request to the user. The method and system make grabbing web pages highly convenient.
Owner:SHENZHEN BOXINNUODA ECONOMIC RELATIONS & TRADE CONSULTANTS CO LTD

Dynamic webpage crawling method and device

The invention discloses a dynamic webpage crawling method. The method comprises the following steps: setting up at least two queues; obtaining the URLs of the webpages to be crawled together with their priorities, storing them in the at least two queues, and scheduling according to the priorities of the stored URLs; receiving elements dequeued in order from the at least two queues to obtain the URLs of the elements to be analyzed; and obtaining webpage content by analyzing the URLs of the queue elements. The method has the following beneficial effects: crawl-analysis procedures and the URLs of a link library can be scheduled simultaneously according to priority, so that higher-priority webpages are crawled first; scheduling across at least two queues improves dequeuing and enqueuing efficiency; and the time complexity is O(log N), so webpage crawling efficiency can be greatly improved.
Owner:DATAGRAND TECH INC
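
One common way to get logarithmic enqueue and dequeue cost is a binary heap, so the priority scheduling in the abstract above can be sketched with Python's `heapq`. Whether the patent uses a heap is an assumption; the abstract only states the complexity. The names `PriorityUrlQueue` and `pop_next` are hypothetical, and larger numbers are assumed to mean higher priority.

```python
import heapq
import itertools


class PriorityUrlQueue:
    """Heap-backed queue: push and pop are both O(log N)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def push(self, url, priority):
        # heapq is a min-heap, so negate priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def peek_priority(self):
        # Priority of the URL that pop() would return next.
        return -self._heap[0][0]

    def __len__(self):
        return len(self._heap)


def pop_next(queues):
    """Pop from whichever non-empty queue holds the highest-priority URL.

    Raises ValueError if every queue is empty.
    """
    best = max((q for q in queues if len(q)), key=lambda q: q.peek_priority())
    return best.pop()
```

For example, with one queue holding URLs at priorities 1 and 9 and another holding a URL at priority 5, repeated calls to `pop_next` return them in the order 9, 5, 1.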

Cascade crawling method and device for multi-level pages based on web crawlers

The invention relates to a cascade crawling method for multi-level pages based on web crawlers. The method comprises: crawling an upper-level page, storing the crawled data in an upper-level page data analysis table, and assigning a primary key value to each object in that table whose lower-level page still needs to be crawled, where each object receives a different primary key value; crawling a lower-level page and storing the crawled data in a lower-level page data analysis table; and setting a foreign key for the lower-level page data analysis table by obtaining, from the upper-level page data analysis table, the primary key value of the object corresponding to the lower-level page and using it as the foreign key value. This enables associated queries across upper-level and lower-level webpages once the crawled data has been persisted. The method provides a data acquisition mode that can restore the logical relationships between webpages, ensures that webpage capture is complete, stores data in the hierarchy order of the original webpages, and makes associated multi-level page data easy to obtain.
Owner:厦门商集网络科技有限责任公司
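
The key linkage between the two tables in the abstract above can be sketched with SQLite. This is an illustration under assumptions, not the patented design: the table names `upper` and `lower`, their columns, and the `fetch_detail` callable are hypothetical, and each lower-level page is fetched immediately after its parent row is stored.

```python
import sqlite3


def store_cascade(conn, listing, fetch_detail):
    """Store upper-level rows, then lower-level rows linked by foreign key.

    listing: list of (title, detail_url) pairs parsed from the upper page.
    fetch_detail: callable returning the text of a lower-level page
                  (hypothetical stand-in for a real page fetch and parse).
    """
    conn.execute(
        "CREATE TABLE upper (id INTEGER PRIMARY KEY, title TEXT, detail_url TEXT)"
    )
    conn.execute(
        "CREATE TABLE lower (id INTEGER PRIMARY KEY, upper_id INTEGER "
        "REFERENCES upper(id), body TEXT)"
    )
    for title, detail_url in listing:
        # Each upper-level object gets its own distinct key value.
        cur = conn.execute(
            "INSERT INTO upper (title, detail_url) VALUES (?, ?)",
            (title, detail_url),
        )
        upper_id = cur.lastrowid
        # The lower-level row reuses that key value as its foreign key.
        conn.execute(
            "INSERT INTO lower (upper_id, body) VALUES (?, ?)",
            (upper_id, fetch_detail(detail_url)),
        )
    conn.commit()


def joined_rows(conn):
    """Associated query across the two levels after the data is stored."""
    return conn.execute(
        "SELECT u.title, l.body FROM upper u JOIN lower l ON l.upper_id = u.id "
        "ORDER BY u.id"
    ).fetchall()
```

With an in-memory database (`sqlite3.connect(":memory:")`), calling `store_cascade` and then `joined_rows` returns each upper-level title paired with the body of its lower-level page, in the order the listing was stored.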