A method for extracting bilingual parallel segments from comparable corpora based on the Dirichlet process

A bilingual parallel segment extraction technology, applied in semantic analysis, natural language data processing, and special data processing applications, that addresses the poor extraction performance and low quality of extracted parallel segments.

Active Publication Date: 2019-02-01
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0003] The present invention provides a method for extracting bilingual parallel segments from comparable corpora based on the Dirichlet process, in order to address the poor extraction performance and low quality of extracted parallel segments.


Examples


Embodiment 1

[0053] Embodiment 1: As shown in Figure 1, a method for extracting bilingual parallel segments from comparable corpora based on the Dirichlet process comprises the following steps:

[0054] Step 1. Obtain the topic model of the bilingual comparable corpus pair using the bilingual LDA topic model;

[0055] Step 1.1. Use the word segmentation tool for each language to segment the bilingual comparable corpus into words and remove stop words;

[0056] Step 1.2. For the preprocessed bilingual comparable corpus, obtain the topic model of the bilingual comparable corpus pair using the bilingual LDA topic model;

[0057] Step 2. Randomly segment the bilingual comparable corpus using a Poisson distribution, then set a topic similarity threshold and use it to perform an initial screening of the candidate parallel segment sets of the comparable corpora;

[0058] Step 3. Obtain the matching probability between parallel segments using the Dirichlet process, and then the final parallel segmen...
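
Step 2 above can be illustrated with a minimal Python sketch. It is not the patented implementation: the helper names (`random_segments`, `segment_topics`, `candidate_pairs`), the per-word topic tables, and the use of cosine similarity as the topic similarity measure are all assumptions made for illustration. The sketch cuts a token sequence into segments whose lengths are drawn from a Poisson distribution, averages hypothetical per-word topic vectors into a segment-level topic distribution, and keeps only the cross-lingual segment pairs whose similarity reaches a threshold.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def random_segments(tokens, mean_len, rng):
    """Cut tokens into consecutive segments with Poisson-distributed lengths."""
    segments, start = [], 0
    while start < len(tokens):
        # A drawn length of 0 is bumped to 1 so every segment is non-empty.
        length = max(1, sample_poisson(mean_len, rng))
        segments.append(tokens[start:start + length])
        start += length
    return segments


def segment_topics(segment, word_topics, n_topics):
    """Sum the (hypothetical) per-word topic vectors and renormalize."""
    totals = [0.0] * n_topics
    for word in segment:
        for k, p in enumerate(word_topics.get(word, [])):
            totals[k] += p
    norm = sum(totals)
    return [t / norm for t in totals] if norm > 0 else totals


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def candidate_pairs(src_segs, tgt_segs, src_topics, tgt_topics, n_topics, threshold):
    """Keep (i, j, similarity) for segment pairs at or above the threshold."""
    pairs = []
    for i, src in enumerate(src_segs):
        src_vec = segment_topics(src, src_topics, n_topics)
        for j, tgt in enumerate(tgt_segs):
            sim = cosine(src_vec, segment_topics(tgt, tgt_topics, n_topics))
            if sim >= threshold:
                pairs.append((i, j, sim))
    return pairs
```

In practice the per-word topic tables would come from the bilingual LDA model of Step 1, with both languages sharing one topic space so that their segment vectors are directly comparable; the surviving pairs would then be passed on to Step 3.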


Abstract

The invention relates to a method for extracting bilingual parallel segments from comparable corpora based on the Dirichlet process, belonging to the technical fields of machine learning translation and natural language processing. First, the topic distribution of the bilingual comparable corpus pair is obtained through a bilingual topic model. The bilingual comparable corpus is then segmented randomly using a Poisson distribution, a topic threshold is set, and the candidate parallel segment sets of the comparable corpora are initially screened by that threshold. Finally, the matching probability of each parallel segment is obtained through the Dirichlet process, and the final, accurate parallel segment pairs are obtained by Gibbs sampling. On the same comparable corpus, the extraction method based on the Dirichlet process obtains better parallel segment pairs.
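
The abstract does not spell out how the Dirichlet process is parameterized, so the following is only a generic sketch of one standard constructive view of it, the Chinese restaurant process, and not the patent's matching model. The function names and the concentration parameter `alpha` are assumptions. An item joins an existing group with probability proportional to that group's size, or opens a new group with probability proportional to `alpha`; Gibbs samplers for Dirichlet process models typically combine these prior weights with a data likelihood when resampling each assignment.

```python
import random


def crp_probabilities(counts, alpha):
    """Prior weights for joining each existing group, plus a new group (last)."""
    total = sum(counts) + alpha
    return [c / total for c in counts] + [alpha / total]


def sample_crp_partition(n_items, alpha, rng):
    """Assign items to groups one at a time under the CRP prior."""
    assignments, counts = [], []
    for _ in range(n_items):
        probs = crp_probabilities(counts, alpha)
        group = rng.choices(range(len(probs)), weights=probs)[0]
        if group == len(counts):
            counts.append(1)      # the item opens a new group
        else:
            counts[group] += 1    # the item joins an existing group
        assignments.append(group)
    return assignments, counts
```

A larger `alpha` makes new groups more likely, so the number of groups grows with the data instead of being fixed in advance.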

Description

Technical field

[0001] The invention relates to a method for extracting bilingual parallel segments from comparable corpora based on the Dirichlet process, and belongs to the technical fields of machine learning translation and natural language processing.

Background

[0002] Parallel corpora are essential for many machine translation applications. No matter how rich the corpus resources are, more data will always be needed as languages diversify. However, obtaining high-quality bilingual parallel resources in the natural language field is often difficult, because existing parallel corpora are hard to acquire, small in quantity, narrow in domain, and poorly up to date. Many researchers therefore try to find new methods, or extend existing ones, to extract parallel corpora. Research has found that the comparable corpus is a bilingual resource that can be widely used, because the comparable corpus contains a lar...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27
CPC: G06F40/289, G06F40/30
Inventor: 严馨, 蒋亚芳, 余正涛, 徐广义, 周枫, 郭剑毅
Owner: KUNMING UNIV OF SCI & TECH