Distributed acquisition system for web bilingual parallel corpus resources
A collection system and parallel corpus technology, applied in transmission systems, special data processing applications, instruments, etc. It addresses the problems of few channels for obtaining corpora, low crawling efficiency, and small crawling scale, with the effects of improving crawling efficiency, saving computing resources, and resolving possession conflicts.
Examples
Specific Embodiment 1
[0020] Specific Embodiment 1: The distributed collection system for web bilingual parallel corpus resources described in this embodiment comprises:
[0021] A link repository module, which stores the hyperlinks contained in the crawling task;
[0022] A screening filter module 1, which takes the link stream from the link repository module as input and judges whether each link satisfies the crawling condition; if the crawling condition is met, it then checks whether the link belongs to a non-bilingual site and decides according to the rules whether to crawl it;
[0023] A webpage crawler module 2, which obtains the download list from the screening filter module 1 and then downloads from the Internet the webpages corresponding to the URL links in the download list;
[0024] An original webpage library module, which stores the original webpages downloaded by the webpage crawler module 2;
[0025] The bilingual dete...
Specific Embodiment 2
[0040] Embodiment 2: This embodiment further describes the link repository module of Embodiment 1: it stores and maintains a large-scale library of crawled links, which includes each webpage's URL address, crawl status, and crawl time.
[0041] This embodiment stores this meta-information in the crawl task list in order to decide whether to perform a crawl or an incremental update on a link.
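The link record and the crawl-or-update decision described in this embodiment can be illustrated with a minimal Python sketch. The names `LinkRecord`, `CrawlStatus`, and `decide_action`, the two status values, and the 30-day refresh interval are all assumptions made for illustration; the source only states that the URL, crawl status, and crawl time are stored and used to choose between a crawl and an incremental update.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class CrawlStatus(Enum):
    """Hypothetical status values; the source does not enumerate them."""
    PENDING = "pending"   # link has never been fetched
    CRAWLED = "crawled"   # link has been fetched at least once


@dataclass
class LinkRecord:
    """One entry of the link repository: URL, crawl status, crawl time."""
    url: str
    status: CrawlStatus = CrawlStatus.PENDING
    crawled_at: Optional[datetime] = None


def decide_action(record: LinkRecord, now: datetime,
                  refresh_after: timedelta = timedelta(days=30)) -> str:
    """Return "crawl", "incremental_update", or "skip" for a link.

    Assumed rule: never-crawled links get a full crawl, links whose last
    crawl is older than `refresh_after` get an incremental update, and
    recently crawled links are skipped.
    """
    if record.status is CrawlStatus.PENDING or record.crawled_at is None:
        return "crawl"
    if now - record.crawled_at >= refresh_after:
        return "incremental_update"
    return "skip"
```

In a real deployment the refresh rule would likely depend on the site and on how often its pages change; the fixed interval here only shows how the stored crawl time can drive the decision.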
Specific Embodiment 3
[0042] Embodiment 3: This embodiment further describes the screening filter module 1 of Embodiment 1: the screening filter module 1 sequentially reads link items from the link repository module and screens out a list of links to be crawled. The filtering strategy consists of custom filtering rules and blacklist rules; the filtering rules include general regular expressions and the non-bilingual sites provided by the blacklist. After reading a record from the link repository module, the module applies the rules to judge whether to add it to the crawling list, which serves as the input of the webpage crawler module 2. Another function is to update the link repository module regularly, eliminating redundant and worthless links according to the filtering rules to improve the quality of the link repository.
[0043] In this embodiment, websites that have been identified as non-bilingual are dynamically added to the blacklist during the translation corpus collection process, and are directly ignored in the next ...
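The filtering strategy of this embodiment (regular-expression rules plus a blacklist of non-bilingual sites that grows during collection) can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `ScreeningFilter`, its method names, the choice to blacklist by host name, and the example patterns are hypothetical and do not come from the source.

```python
import re
from typing import Iterable, List
from urllib.parse import urlparse


class ScreeningFilter:
    """Hypothetical screening filter: regex reject rules plus a host blacklist."""

    def __init__(self, reject_patterns: Iterable[str],
                 blacklist: Iterable[str] = ()):
        # Custom filtering rules, expressed as regular expressions (assumed
        # here to describe links that should NOT be crawled).
        self.reject_patterns = [re.compile(p) for p in reject_patterns]
        # Blacklist of non-bilingual sites, keyed by lower-cased host name.
        self.blacklist = {host.lower() for host in blacklist}

    @staticmethod
    def _host(url: str) -> str:
        return (urlparse(url).hostname or "").lower()

    def accepts(self, url: str) -> bool:
        """True if the link passes both the blacklist and the regex rules."""
        host = self._host(url)
        if not host or host in self.blacklist:
            return False
        return not any(p.search(url) for p in self.reject_patterns)

    def mark_non_bilingual(self, url: str) -> None:
        """Dynamically blacklist the site of a page judged non-bilingual."""
        host = self._host(url)
        if host:
            self.blacklist.add(host)

    def build_download_list(self, links: Iterable[str]) -> List[str]:
        """Screen repository links into the crawler's download list."""
        return [url for url in links if self.accepts(url)]
```

Applying `build_download_list` to the whole link repository at regular intervals would also serve as the pruning step that removes links the rules now reject, since a site added through `mark_non_bilingual` is ignored on every later pass.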