1245 results about "Web crawler" patented technology

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).

Graphical search engine visual index

A visual index method provides graphical output from search engine results or other URL lists. Search engine results or a list of URLs are passed to a web crawler that retrieves the web page and other media information present at the associated URL. The web crawler then passes this information to a page renderer which also receives image scale and format information regarding the web pages present at the URLs. The graphical information as well as other media information is then rendered into a reduced graphical form so that the page may be summarily reviewed by the user. Media, visual, or other information may also be downwardly scaled as appropriate or rendered in its original as appropriate (such as with audio data streams). A variety of convenient formats allows the user to quickly and readily scan the presentation at the URL web pages or other data present. Image maps associated with the reduced images may also provide hyperlink access to the linked web page and / or multimedia allowing the links present on the web page in its original to be accessed through the reduced image provided by the web page renderer.
Owner:HYPER SEARCH LLC

Recommending search terms using collaborative filtering and web spidering

In a pay-for-placement search system, the system makes search term recommendations to advertisers managing their accounts in one or more of two ways. A first technique involves looking for good search terms directly on an advertiser's web site. A second technique involves comparing an advertiser to other, similar advertisers and recommending the search terms the other advertisers have chosen. The first technique is called spidering and the second technique is called collaborative filtering. In the preferred embodiment, the output of the spidering step is used as input to the collaborative filtering step. The final output of search terms from both steps is then interleaved in a natural way.
Owner:R2 SOLUTIONS
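
The final step of the abstract above, interleaving the terms produced by spidering and by collaborative filtering, could be sketched roughly as follows. The abstract does not say how the lists are merged, so a plain round-robin merge with de-duplication is assumed here, and `interleave_terms` is a hypothetical name used only for illustration.

```python
from itertools import zip_longest


def interleave_terms(spidered, filtered):
    """Round-robin merge of two ranked term lists, keeping first occurrences only.

    Assumption: both inputs are already ordered best-first.
    """
    merged, seen = [], set()
    for pair in zip_longest(spidered, filtered):
        for term in pair:
            # zip_longest pads the shorter list with None; skip the padding.
            if term is not None and term not in seen:
                seen.add(term)
                merged.append(term)
    return merged
```

A weighted merge based on per-term scores would be an equally plausible reading; the round-robin form is simply the shortest one to show.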

Multimedia conceptual search system and associated search method

The current disclosure uses the disciplines of Ontology and Epistemology to implement a context / content-based “multimedia conceptual search and planning”, in which the formation of conceptualization is supported by embedding multimedia sensation and perception into a hybrid database. The disclosed system comprises: 1) A hybrid database model to host concept setup. 2) A graphic user interface that lets the user freely issue search requests in text and graphic mode. 3) A parsing engine conducting the best match between user query and dictionaries, analyzing queried images, detecting and presenting shape and chroma, and extracting features / texture of an object. 4) A translation engine built for the search engine and inference engine in text and graphic mode. 5) A search engine using partitioned, parallel, hashed indexes from web crawler results, conducting search in formal / natural language in text and graphic mode. 6) A logic inference engine working in text and graphic mode. 7) A learning / feedback interface.
Owner:INTELLIGENTEK CORP

Systems and methods for generating and maintaining internet user profile data

Systems and methods are provided for automatically generating and maintaining user profile cookie sets. The user profile cookie sets may be used by a web crawler when gathering data such as advertisement data associated with one or more websites. The cookie sets may be generated by choosing a user profile with a set of user traits, selecting a set of websites related to the user traits, and browsing the selected set of websites using a web crawler while allowing the website to place cookies in storage of the web crawler. The cookie sets may be maintained by selecting a website to browse, selecting a user profile associated with the selected website, loading a previously generated cookie set for the selected user profile into the storage of a web crawler, and loading the webpage while allowing the website to place, update, or replace cookies in the storage of the web crawler.
Owner:PATHMATICS INC

Photo Automatic Linking System and method for accessing, linking, and visualizing "key-face" and/or multiple similar facial images along with associated electronic data via a facial image recognition search engine

Active | US20070172155A1 | Quick search | Enhanced and improved organization, classification, and fast sorts and retrieval | Digital data information retrieval | Character and pattern recognition | Health professionals | Web crawler
The present invention provides a system and method for input of images containing faces for accessing, linking, and / or visualizing multiple similar facial images and associated electronic data for innovative new on-line commercialization, medical and training uses. The system uses various image capturing devices and communication devices to capture images and enter them into a facial image recognition search engine. Embedded facial image recognition techniques within the image recognition search engine extract facial images and encode the extracted facial images in a computer readable format. The processed facial images are then entered for comparison into at least one database populated with facial images and associated information. Once the newly captured facial images are matched with similar “best-fit match” facial images in the facial image recognition search engine's database, the “best-fit” matching images and each image's associated information are returned to the user. Additionally, the newly captured facial image can be automatically linked to the “best-fit” matching facial images, along with comparisons calculated, and / or visualized.
Key new use innovations of the system include but are not limited to: input of user-selected facial images to find multiple similar celebrity look-a-likes, with automatic linking that returns the look-a-like celebrities' similar images, associated electronic information, and convenient opportunities to purchase fashion, jewelry, products and services to better mimic the celebrity look-a-likes; health monitoring and diagnostic use by conveniently organizing and superimposing periodically captured patient images for health professionals to view progress of patients; entirely new classes of semi-transparent superimposed training in which users train their faces to mimic other similar faces, such as celebrity look-a-like cosmetic applications and / or facial expressions; intuitive automatic linking of similar facial images for enhanced and improved organization, classification, and fast retrieval; and an improved method of facial-image-based indexing and retrieval of information from the web-crawler or spider searched Web, USENET, and other resources to provide new types of intuitive, easy-to-use searching, and / or combined use with current key-word searching for optimized searching.
Owner:VR REHAB INC +2

Systems and methods for client-based web crawling

The present invention provides systems and methods for obtaining information from a networked system utilizing a distributed web crawler. The distributed nature of clients of a server is leveraged to provide fast and accurate web crawling data. Information gathered by a server's web crawler is compared to data retrieved by clients of the server to update the crawler's data. In one instance of the present invention, data comparison is achieved by utilizing information disseminated via a search engine results page. In another instance of the present invention, data validation is accomplished by client dictionaries, emanating from a server, that summarize web crawler data. The present invention also facilitates data analysis by providing a means to resist spoofing of a web crawler to increase data accuracy.
Owner:MICROSOFT TECH LICENSING LLC

System and method for automatically gathering dynamic content and resources on the world wide web by stimulating user interaction and managing session information

An apparatus and method for a web crawler to automatically simulate user interaction with a dynamic website in order to gather and extract information from the site. This interactive web crawler will be able to create a search query string for any one of a number of desired search topics and systematically crawl dynamic personalized content on a website and retrieve the information desired by the user / client.
Owner:GOOGLE LLC

Distributed metadata searching system and method

A system and method of distributed metadata searching is disclosed. The present invention permits an extension of the searching and retrieval functions of existing Internet web search engines by utilizing computational resources embodied in user computer systems and search browsers. By distributing the searching and scanning functions to the user level, the present invention reduces the computational and communications burden on Internet web search engines and crawlers, resulting in lower computational resource utilization by Internet search engine providers. Given the exponential growth rate currently being experienced in the Internet community, the present invention provides one of the few methods by which complete searches of this vast distributed database may be performed. The present invention permits embodiments incorporating a Search Manager (1001) further comprising a Service Results Manager (1013), User Profile Database (1012), Service Manager (1013), and Service Database (1014); a Light Weight Application SCANNER (1002); and a Search Engine (1008). These components may be augmented in some preferred embodiments via the use of a Search Browser (1003), Internet Communications (1004), Web Site(s) (1005), Web Crawler(s) (1006), and a Repository Database (1007).
Owner:IBM CORP

Collaborative team crawling: Large scale information gathering over the internet

A distributed collection of web crawlers gathers information over a large portion of cyberspace. These crawlers share the overall crawling through a cyberspace partition scheme. They also collaborate with each other through load balancing to maximally utilize the computing resources of each of the crawlers. The invention takes advantage of the hierarchical nature of the cyberspace namespace and uses the syntactic components of the URL structure as the main vehicle for dividing and assigning crawling workload to individual crawlers. The partition scheme is completely distributed: each crawler makes the partitioning decision based on its own crawling status and a globally replicated partition tree data structure.
Owner:IBM CORP

Anchor tag indexing in a web crawler system

Active | US7308643B1 | Facilitates indexing information | Effective and efficient text-based indexing system | Web data indexing | Digital computer details | Document Identifier | Document preparation
Provided is a method and system for indexing documents in a collection of linked documents. A link log, including one or more pairings of source documents and target documents is accessed. A sorted anchor map, containing one or more target document to source document pairings, is generated. The pairings in the sorted anchor map are ordered based on target document identifiers.
Owner:GOOGLE LLC
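
The inversion described in the abstract above, from a link log of source-to-target pairings to an anchor map ordered by target document identifier, might look something like the sketch below. The record layout `(source_id, target_id, anchor_text)` and the function name are assumptions made for illustration; the abstract only states that source and target documents are paired.

```python
from collections import defaultdict


def build_sorted_anchor_map(link_log):
    """Invert link records into target -> sources, ordered by target id.

    Assumption: each record is (source_id, target_id, anchor_text).
    """
    by_target = defaultdict(list)
    for source_id, target_id, anchor_text in link_log:
        by_target[target_id].append((source_id, anchor_text))
    # Sort the targets, and the sources within each target, so the
    # output order is deterministic.
    return [(target, sorted(by_target[target])) for target in sorted(by_target)]
```

Grouping by target means all anchor text pointing at one document sits together, which is the property an indexer would want when reading the map sequentially.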

Search engine with multiple crawlers sharing cookies

A web-crawler system includes a plurality of network crawlers configured to fetch documents from hosts on a network and a cookie database shared by the plurality of network crawlers. The cookie database stores cookies and associated information for use by the plurality of network crawlers. Each network crawler is configured to retrieve one or more cookies from the cookie database so as to enable access to documents on at least one of the hosts on the network. In some embodiments, each of the network crawlers may be configured to detect any of a plurality of predefined cookie errors associated with fetching a document. In some embodiments, each of the network crawlers may also be configured to detect when a cookie in the cookie database has expired and to obtain a replacement cookie.
Owner:GOOGLE LLC

Dynamic-content web crawling through traffic monitoring

A dynamic-content web crawler is disclosed. These New Crawlers (NCs) are located at points between the server and user, and monitor content from said points, for example by proxying the web traffic or sniffing the traffic as it goes by. Web page content is recursively parsed into sub-components. Sub-components are fingerprinted with a cyclic redundancy check code or other lossy compression in order to be able to detect recurrence of the sub-component in subsequent pages. Those sub-components which persist in the web traffic, as measured by how frequently the NCs (6) observe them, are defined as having substantive content of interest to data-mining applications. Where a substantive content sub-component is added to or removed from a web page, this change is significant and is sent to a duplication filter (11) so that if multiple NCs (6) detect a change in a web page only one announcement of the changed URL will be broadcast to data-mining applications (8). The NC (6) identifies substantive content sub-components which repeatedly are part of a page pointed to by a URL. Provision is also made for limiting monitoring to pages having a flag authorizing discovery of the page by a monitor.
Owner:RESOURCE CONSORTIUM LTD LLC
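
As a rough illustration of the fingerprinting idea in the abstract above, the sketch below hashes page sub-components with CRC-32 and reports those that recur across observed page views. It assumes the recursive parsing has already produced a list of text fragments per view; the function names and the `min_count` threshold are hypothetical.

```python
import zlib
from collections import Counter


def fingerprint(fragment):
    """CRC-32 of a sub-component's text (lossy: collisions are possible)."""
    return zlib.crc32(fragment.encode("utf-8"))


def persistent_fragments(page_views, min_count=2):
    """Return fingerprints that occur in at least `min_count` page views.

    Assumption: each page view is a list of already-parsed text fragments.
    """
    counts = Counter()
    for fragments in page_views:
        # Count each fingerprint once per view, however often it repeats there.
        counts.update({fingerprint(f) for f in fragments})
    return {fp for fp, seen in counts.items() if seen >= min_count}
```

For example, with the views `[["header", "story A"], ["header", "story B"]]` only the fingerprint of `"header"` is reported as persistent.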

Directed web crawler with machine learning

A web crawler identifies and characterizes an entered expression of a topic of general interest (such as cryptography) and generates an affinity set comprising a set of words related to that expression. Using a common search engine, seed documents are found. The seed documents, along with the affinity set and other search data, provide training to a classifier, whose output lets the web crawler search the web based on multiple criteria, including a content-based rating provided by the trained classifier. The web crawler can thus perform its search in a topic-focused, rather than "link"-focused, manner. The relevant content found is ranked and the results are displayed or saved for a specialty search.
Owner:MCNAMEE J PAUL +4

Method and system for obtaining script related information for website crawling

A web crawler system has an automatic website crawler and a virtual browser that provides script related information to the website crawler. The virtual browser transforms an HTML document included in a web page of the website into an XML document, and builds a document object model containing document objects in a tree structure based on the XML document. The virtual browser extracts from the DOM scripts that are potentially executable, and executes the extracted scripts using a browser object model provided for the virtual browser containing objects and methods and properties that are used for script execution so as to capture script related information generated by execution of the scripts.
Owner:IBM CORP

System and method for distributed web crawling

The present invention provides for the efficient downloading of data set addresses from among a plurality of host computers, using a plurality of web crawlers. Each web crawler identifies URL's in data sets downloaded by that web crawler, and identifies the host computer identifier within each such URL. The host computer identifier for each URL is mapped to the web crawler identifier of one of the web crawlers. If the URL is mapped to the web crawler identifier of a different web crawler, the URL is sent to that web crawler for processing, and otherwise the URL is processed by the web crawler that identified the URL. Each web crawler sends URL's to the other web crawlers for processing, and each web crawler receives URL's from the other web crawlers for processing. In a preferred embodiment, each web crawler processes only the URL's assigned to it, which are the URL's whose host identifier is mapped to the web crawler identifier for that web crawler. Each web crawler filters the URL's assigned to it by comparing them against a database of URL's already known by the web crawler and removing the already known URL's. If a URL is not already known to the web crawler, the data set corresponding to the URL is scheduled for downloading.
Owner:HEWLETT PACKARD DEV CO LP
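
The host-based assignment in the abstract above can be sketched as follows. This is a minimal illustration, assuming a SHA-256 hash of the host name taken modulo the number of crawlers as the mapping; the abstract does not specify the mapping function, and `crawler_for_url` and `route` are hypothetical names.

```python
import hashlib
from urllib.parse import urlsplit


def crawler_for_url(url, num_crawlers):
    """Map a URL's host to a crawler id (assumed mapping: hash modulo count)."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


def route(urls, my_id, num_crawlers, known):
    """Split discovered URLs into those to schedule locally and those to forward.

    `known` is this crawler's set of already-seen URLs; it is updated in place.
    """
    to_schedule, to_forward = [], {}
    for url in urls:
        owner = crawler_for_url(url, num_crawlers)
        if owner != my_id:
            # Another crawler is responsible for this host.
            to_forward.setdefault(owner, []).append(url)
        elif url not in known:
            known.add(url)
            to_schedule.append(url)
    return to_schedule, to_forward
```

Because the mapping depends only on the host, every URL from one host lands on the same crawler, which keeps per-host state in one place.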

System and method for online duplicate detection and elimination in a web crawler

As part of the normal crawling process, a crawler parses a page and computes a de-tagged hash, called a fingerprint, of the page content. A lookup structure consisting of the host hash (hash of the host portion of the URL) and the fingerprint of the page is maintained. Before the crawler writes a page to a store, this lookup structure is consulted. If the lookup structure already contains the tuple (i.e., host hash and fingerprint), then the page is not written to the store. Thus, a lot of duplicates are eliminated at the crawler itself, saving CPU and disk cycles which would otherwise be needed during current duplicate elimination processes.
Owner:IBM CORP
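
The lookup described in the abstract above, keyed on a host hash plus a de-tagged content fingerprint, might be sketched like this. SHA-1 and the regular-expression tag stripping are stand-ins chosen for brevity, not details from the patent, and `DuplicateFilter` is a hypothetical name.

```python
import hashlib
import re
from urllib.parse import urlsplit

TAG_RE = re.compile(r"<[^>]+>")  # crude tag stripper, enough for a sketch


def _digest(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class DuplicateFilter:
    """In-memory set of (host hash, page fingerprint) tuples."""

    def __init__(self):
        self._seen = set()

    def should_store(self, url, html):
        """Return True the first time a host serves this de-tagged content."""
        host_hash = _digest(urlsplit(url).hostname or "")
        # Remove tags and collapse whitespace before fingerprinting.
        text = " ".join(TAG_RE.sub(" ", html).split())
        key = (host_hash, _digest(text))
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

Including the host hash in the key means identical text on two different hosts is still stored once per host, matching the tuple described in the abstract.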

Method and system for automatic product searching, and use thereof

A client application monitors web pages visited by a consumer and determines whether the visited web page is product oriented. If so, it contacts a product server to automatically retrieve and display corresponding product purchasing information, if available in a product-centric database. However, if the web page is not found in the database, it and its product information are added thereto. The database is created by a product information gathering web crawler and a second web product price crawler that uses the harvested product information to find prices corresponding to the product on unvisited web pages.
Owner:ZICHERMAN AMIR SHLOMO

Duplicate document detection in a web crawler system

Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included and excluded from the new set of documents based on a query independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
Owner:GOOGLE LLC

Method for extracting network data and web crawler system

A web crawler system for extracting webpage data comprises a first component and a second component. The first component provides data extraction tasks to the second component and receives the execution results of those tasks from it. The second component communicates with the webpage server to obtain webpage data, operates on the DOM model to extract data, describes the extracted data, and sends the extracted data and its description to the first component.
Owner:李沫南

Security for WAP servers

A method and system for improving the security and control of internet / network web application processes, such as web applications. The invention enables validation of requests from web clients before the request reaches a web application server. Incoming web client requests are compared to an application model that may include an allowed navigation path within an underlying web application. Requests inconsistent with the application model are blocked before reaching the application server. The invention may also verify that application state data sent to application servers has not been inappropriately modified. Furthermore, the invention enables application models to be automatically generated by employing, for example, a web crawler to probe target applications. Once a preliminary application model is generated it can be operated in a training mode. An administrator may tune the application model by adding a request that was incorrectly marked as non-compliant to the application model.
Owner:F5 NETWORKS INC

Web information extraction system

The invention discloses a Web information extraction system comprising a retrieval analysis module, a rule generation module and a data extraction storage module, wherein the retrieval analysis module comprises a web crawler unit and an HTML parser; the rule generation module comprises a single-slot extraction rule generation unit and a multi-slot extraction rule generation unit; and the data extraction storage module extracts data from web pages downloaded by the retrieval analysis module and stores the data in a structured form according to the extraction rules generated by the rule generation module. The system has the following advantages: when single-slot extraction rules are generated, the interface operation is simple and easy to understand; for generating multi-slot extraction rules, the system provides a graphical interface to help the user with labeling, saving the user time and effort; for pre-generated extraction rules and task sequences, the system provides two ways to carry out the extraction and storage of batch tasks; and the system can finish the extraction and storage tasks at preset periods and times according to the parameters configured by the user.
Owner:DALIAN MARITIME UNIVERSITY

Systems and methods for inferring uniform resource locator (URL) normalization rules

Different URLs that actually reference the same web page or other web resource are detected and that information is used to only download one instance of a web page or web resource from a web site. All web pages or web resources downloaded from a web server are compared to identify which are substantially identical. Once identical web pages or web resources with different URLs are found, the different URLs are then analyzed to identify what portions of the URL are essential for identifying a particular web page or web resource, and what portions are irrelevant. Once this has been done for each set of substantially identical web pages or web resources (also referred to as an “equivalence class” herein), these per-equivalence-class rules are generalized to trans-equivalence-class rules. There are two rule-learning steps: step (1), where it is learned for each equivalence class what portions of the URLs in that class are relevant for selecting the page and what portions are not; and step (2), where the per-equivalence-class rules constructed during step (1) are generalized to rules that cover many equivalence classes. Once a rule is determined, it is applied to the class of web pages or web resources to identify errors. If there are no errors, the rule is activated and is then used by the web crawler for future crawling to avoid the download of duplicative web pages or web resources.
Owner:MICROSOFT TECH LICENSING LLC
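
A much-simplified sketch of rule-learning step (1) from the abstract above follows. It looks only at query-string parameters and treats any parameter whose value differs among URLs that returned the same page as irrelevant; that restriction, and the names `irrelevant_params` and `normalize`, are assumptions for illustration rather than the patented method.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def irrelevant_params(equivalent_urls):
    """Query parameters whose values vary within one equivalence class."""
    values = {}
    for url in equivalent_urls:
        for key, value in parse_qsl(urlsplit(url).query):
            values.setdefault(key, set()).add(value)
    # A parameter that changes while the page stays the same is assumed
    # not to select the page.
    return {key for key, seen in values.items() if len(seen) > 1}


def normalize(url, drop):
    """Rewrite a URL with the parameters in `drop` removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in drop]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Given `http://s.test/p?id=7&sid=aaa` and `http://s.test/p?id=7&sid=bbb` as one equivalence class, `sid` is flagged and both URLs normalize to `http://s.test/p?id=7`. Generalizing such rules across classes, step (2) in the abstract, is not attempted here.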

Similar web page duplicate-removing system based on parallel programming mode

The invention provides a similar web page duplicate-removing system based on a parallel programming mode, comprising a web page content pre-processing module, a web page eigenvector extracting module, a web page feature fingerprint calculation module, a web page fingerprint on-line duplicate-removing module, a web page fingerprint distributed batch duplicate-removing module and a computing platform based on specific distribution. The system can carry out unified conversion of text content encoding, standardization of document structure, removal of web page noise content, thematic content analysis and identification of web pages, lexical segmentation of continuous text content, and the like on the web pages obtained by web crawlers, thereby forming eigenvectors which can represent the web pages. Relevant algorithms can then be applied to these vectors to obtain web page fingerprints which represent web page characteristics. The system accurately and quickly detects complete or approximate repetition of web page contents caused by site mirroring, reposting of web documents, and the like across the massive amounts of data on the Internet and completes the corresponding duplicate removal, thereby enhancing the storage efficiency of search engines and bringing a better user experience for the search engines.
Owner:HUAZHONG UNIV OF SCI & TECH

Techniques for crawling dynamic web content

An automated form filler and script executor is integrated with a web browser engine, which is communicatively coupled to a web crawler, thereby enabling the crawler to identify dynamic web content based on submission of forms completed by the form filler. The crawler is capable of identifying web pages containing forms that require submission, and JavaScript code that requires execution, respectively, for requesting dynamic web content from a server. The crawler passes a representation of such web pages to the browser engine. The form filler systematically completes the form based on various combinations of search parameter values provided by the web page for requesting dynamic content. Request messages are constructed by the browser engine and passed to the crawler for submission to the server. The dynamic content, received by the crawler from the server in response to the request, can be indexed according to conventional search engine indexing techniques.
Owner:R2 SOLUTIONS

Data store for knowledge-based data mining system

In a data mining system, data is gathered into a data store using, e.g., a Web crawler. The data is classified into entities and stored into underlying vertical and horizontal tables respectively representing miner outputs and entities that can be the subjects of indexing. Data miners use rules to process the entities and append respective keys to the entities representing characteristics of the entities as derived from rules embodied in the miners, with the keys being associated with the entities in the tables. With these keys, characteristics of entities as defined by disparate expert authors of the data miners are identified for use in responding to complex data requests from customers.
Owner:IBM CORP

Uniform resource locator scoring for targeted web crawling

A web crawler system as described herein utilizes a targeted approach to increase the likelihood of downloading web pages of a desired type or category. The system employs a plurality of URL scoring metrics that generate individual scores for outlinked URLs contained in a downloaded web page. For each outlinked URL, the individual scores are combined using an appropriate algorithm or formula to generate an overall score that represents a downloading priority for the outlinked URL. The web crawler application can then download subsequent web pages in an order that is influenced by the downloading priorities.
Owner:MICROSOFT TECH LICENSING LLC
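
The combination of individual URL scores into a downloading priority, as described in the abstract above, could be sketched as a weighted sum feeding a priority queue. The weighted sum, the two example metrics (`has_keyword`, `shallowness`) and the weights are all assumptions made for illustration; the abstract only speaks of "an appropriate algorithm or formula".

```python
import heapq


def has_keyword(url):
    """Hypothetical metric: 1.0 if the URL mentions the target category."""
    return 1.0 if "news" in url else 0.0


def shallowness(url):
    """Hypothetical metric: URLs with fewer slashes score higher."""
    return 1.0 / (1 + url.count("/"))


def overall_score(url, metrics, weights):
    """Combine the individual metric scores into one priority value."""
    return sum(weight * metric(url) for metric, weight in zip(metrics, weights))


def order_by_priority(urls, metrics, weights):
    """Return outlinked URLs ordered from highest to lowest overall score."""
    # heapq is a min-heap, so scores are negated; the index breaks ties
    # in favour of discovery order.
    heap = [(-overall_score(u, metrics, weights), i, u) for i, u in enumerate(urls)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

Calling `order_by_priority(urls, [has_keyword, shallowness], [0.7, 0.3])` would place a URL such as `http://x.test/news` ahead of a deeper URL with no keyword match.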

Method and System of Information Engine with Make-Share-Search of consumer and professional Information and Content for Multi-media and Mobile Global Internet

The Make, Share and Search Integrated System improves the user experience of creating and consuming information content, with instant access to newly created and dynamic information. The new system also reduces or eliminates the need for web crawlers by capturing search parameters at the time of creation itself, and refines those parameters whenever any user utilizes the content, to enable instant access to new information. In the next generation of the web, content creation by consumers is going to far exceed any professional content, and the new Make, Share and Search Integrated System enables this in a manner superior to any present-day technology. Further, the system is designed for a multi-media and mobile environment with global reach, and provides users with creation, sharing, prioritization, search, utilization and consumption in a single integrated technology, replacing the present-day “Search engine” paradigm with a new paradigm called “Information Engine”.
Owner:MATHUR ANUP KUMAR