A distributed system for acquiring bilingual parallel corpus resources from the web relates to the technical field of corpus acquisition and addresses the problems that conventional systems crawl at a small scale, offer few ways of acquiring corpora, and crawl inefficiently. The
system comprises a link pool module, a filter module, a web crawler module, a raw web page library module, a bilingual detection module, a blacklist module, a bilingual web page library module and a link extractor module. The invention overcomes the technical shortcomings of conventional approaches, takes
the Internet as its corpus acquisition target, and can effectively resolve resource contention in a distributed system, provide a general design framework for bilingual parallel corpus acquisition systems, continuously and dynamically add non-bilingual sites to a blacklist, effectively capture parallel corpora from the Internet, and greatly improve the efficiency of bilingual corpus crawling.
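
One way the listed modules might fit together is sketched below. It is a single-process illustration only, not the patented design. The abstract does not specify the data flow between modules, the bilingual detection method, or the blacklisting criterion, so all of those are assumptions here: the detector simply checks for both Latin letters and CJK ideographs, a site is blacklisted after a hypothetical `miss_limit` of non-bilingual pages with no bilingual page found, and `FAKE_WEB` is an in-memory stand-in for real HTTP fetching. Distribution across machines is omitted.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://a.example/": 'Welcome 欢迎 <a href="/p1.html">p1</a> <a href="http://b.example/">b</a>',
    "http://a.example/p1.html": "Parallel text 平行文本",
    "http://b.example/": 'English only <a href="/x.html">x</a> <a href="/y.html">y</a>',
    "http://b.example/x.html": "Still English only",
    "http://b.example/y.html": "More English",
}


def site_of(url):
    """Return the host part of a URL; blacklisting is done per site."""
    return urlparse(url).netloc


def is_bilingual(text):
    """Toy bilingual detector (assumption): Latin letters and CJK ideographs both present."""
    has_latin = re.search(r"[A-Za-z]", text) is not None
    has_cjk = re.search(r"[\u4e00-\u9fff]", text) is not None
    return has_latin and has_cjk


def extract_links(base_url, html):
    """Link extractor: resolve every href against the page URL."""
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]


def crawl(seeds, fetch, miss_limit=2, max_pages=100):
    link_pool = deque(seeds)   # link pool module
    seen = set(seeds)
    blacklist = set()          # blacklist module
    raw_pages = {}             # raw web page library module
    bilingual_pages = {}       # bilingual web page library module
    misses = {}                # non-bilingual page count per site
    hits = set()               # sites with at least one bilingual page

    while link_pool and len(raw_pages) < max_pages:
        url = link_pool.popleft()
        site = site_of(url)
        if site in blacklist:  # filter module: skip blacklisted sites
            continue
        html = fetch(url)      # web crawler module
        if html is None:
            continue
        raw_pages[url] = html
        if is_bilingual(html):  # bilingual detection module
            bilingual_pages[url] = html
            hits.add(site)
        else:
            misses[site] = misses.get(site, 0) + 1
            if site not in hits and misses[site] >= miss_limit:
                blacklist.add(site)  # dynamic blacklisting (assumed rule)
        for link in extract_links(url, html):  # link extractor module
            if link not in seen:
                seen.add(link)
                link_pool.append(link)
    return bilingual_pages, blacklist, raw_pages


bilingual_pages, blacklist, raw_pages = crawl(["http://a.example/"], FAKE_WEB.get)
print(sorted(bilingual_pages))
print(sorted(blacklist))
```

In this toy run, pages on `b.example` stop being fetched once the site is blacklisted, which is the effect the abstract attributes to the blacklist module. In a distributed deployment the link pool and blacklist would presumably be shared state across crawler nodes, but the abstract gives no detail on that.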