A training data sampling method for establishing a word translation model

A data sampling and translation model technology, applied in natural language translation, electronic digital data processing, special data processing applications, etc. It addresses problems such as insufficient word accuracy and interpretations that are not clearly delineated and are easily confused, and it achieves efficient training, lower training cost, and a smaller amount of training data.

Pending Publication Date: 2019-02-26
陈虎
9 Cites, 4 Cited by

AI Technical Summary

Problems solved by technology

Word accuracy is not high enough.
A large amount of labeled data is required for training.
On the one hand, some words have many interpretations and need a long drop-down text box to display them, which makes browsing harder on smart devices with small screens; on the other hand, the interpretations are not clearly delineated and are easily confused.

Examples

Embodiment Construction

[0044] Figure 1 shows a flow chart of the adaptive sampling of source-language training data according to the present invention. For any word to be translated, the goal of sampling is to collect a set number (200) of example sentences for each interpretation of the word, drawn from free web material or other massive raw translation data. Assuming an average of 5 interpretations per word, the sampling target for each word is approximately 1000 example sentences. An example sentence here means a unique context that contains the word. The context can be one sentence or several sentences.
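
The arithmetic and the notion of a "context" in [0044] can be illustrated with a minimal Python sketch. The function names, the whole-token regular expression match, and the `window` parameter are assumptions made for illustration; the source only states the figures 200, 5, and roughly 1000.

```python
import re

SENTENCES_PER_INTERPRETATION = 200  # figure stated in [0044]


def sampling_target(num_interpretations, per_interpretation=SENTENCES_PER_INTERPRETATION):
    """Total number of example sentences to collect for one word."""
    return num_interpretations * per_interpretation


def find_contexts(word, sentences, window=0):
    """Return the unique contexts that contain `word` as a whole token.

    A context is the matching sentence plus `window` neighbouring
    sentences on each side (window=0 means a single sentence).
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    seen, contexts = set(), []
    for i, sentence in enumerate(sentences):
        if pattern.search(sentence):
            lo = max(0, i - window)
            hi = min(len(sentences), i + window + 1)
            context = " ".join(sentences[lo:hi])
            if context not in seen:  # keep each context only once
                seen.add(context)
                contexts.append(context)
    return contexts
```

With 5 interpretations, `sampling_target(5)` gives the 1000-sentence target mentioned above.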

[0045] Human annotators then label each example sentence according to the interpretation the word takes in its context. The label is the serial number of the interpretation of the word. For example, if a word has 5 different interpretations, then the labels are label 1, label 2, label 3, ...
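
One way to hold such labeled sentences is a small pool keyed by the interpretation's serial number. The sketch below is an assumption about a convenient data structure, not the patent's implementation; the class name `LabelPool` and its methods are hypothetical.

```python
from collections import defaultdict


class LabelPool:
    """Labeled example sentences for one word, grouped by label.

    Label k (1-based) is the serial number of interpretations[k - 1].
    """

    def __init__(self, interpretations):
        self.interpretations = list(interpretations)
        self.by_label = defaultdict(list)

    def add(self, sentence, label):
        # Reject labels that do not correspond to a known interpretation.
        if not 1 <= label <= len(self.interpretations):
            raise ValueError(f"label {label} is out of range")
        self.by_label[label].append(sentence)

    def counts(self):
        """Number of labeled sentences per label, including empty labels."""
        return {
            k: len(self.by_label.get(k, []))
            for k in range(1, len(self.interpretations) + 1)
        }
```

The per-label counts make it easy to see which interpretations are still short of their sampling target.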

Abstract

A training data sampling method for establishing a word translation model includes: first, in a first round of sampling, randomly sampling from the raw data a first proportion of the target number of example sentences for the word; labeling the example sentences obtained in the first round of sampling and storing the labeled sentences in a label data pool; performing word embedding preprocessing on the labeled data in the label data pool and obtaining the data center point of each category, where each category corresponds to one interpretation of the word; heuristically clustering the raw data using the centers of the different categories; and post-processing the example sentences acquired in the first round of sampling and feeding the processing result back to the next round of sampling, so that the total number of samples reaches the target number of samples.
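
The loop described in the abstract can be sketched in Python under several loudly stated assumptions. The `embed` function is a toy hashed bag-of-words stand-in for the word embedding step, `label_fn` stands in for the human annotator, and the feedback rule (draw the next batch from the cluster of the least represented label) is one plausible heuristic that the abstract does not specify. All names and parameters are hypothetical.

```python
import math
import random
from collections import defaultdict


def embed(sentence, dim=16):
    """Toy stand-in for word embedding: hashed bag-of-words vector."""
    vec = [0.0] * dim
    for token in sentence.lower().split():
        vec[sum(ord(c) for c in token) % dim] += 1.0
    return vec


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def sample_rounds(raw, label_fn, target, first_ratio=0.2, batch_size=2, seed=0):
    """Collect up to `target` labeled sentences from `raw` in rounds."""
    rng = random.Random(seed)
    remaining = list(raw)
    rng.shuffle(remaining)

    # Round 1: random sample of a first proportion of the target.
    n_first = min(len(remaining), target, max(1, int(target * first_ratio)))
    pool = defaultdict(list)  # label -> labeled sentences
    for s in remaining[:n_first]:
        pool[label_fn(s)].append(s)
    remaining = remaining[n_first:]
    total = n_first

    while total < target and remaining:
        # One center point per label, computed from the label pool.
        centers = {lab: centroid([embed(s) for s in sents])
                   for lab, sents in pool.items()}
        # Assign every unlabeled sentence to its nearest center.
        clusters = defaultdict(list)
        for idx, s in enumerate(remaining):
            v = embed(s)
            nearest = max(centers, key=lambda lab: cosine(v, centers[lab]))
            clusters[nearest].append(idx)
        # Assumed feedback rule: favour the least represented label.
        needy = min(clusters, key=lambda lab: len(pool[lab]))
        picked = clusters[needy][:min(batch_size, target - total)]
        for idx in picked:
            s = remaining[idx]
            pool[label_fn(s)].append(s)  # the annotator gives the true label
        picked_set = set(picked)
        remaining = [s for i, s in enumerate(remaining) if i not in picked_set]
        total += len(picked)
    return dict(pool)
```

In a real system the cluster assignment is only a guess at the label, which is why each picked sentence is still passed to the annotator before it enters the pool.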

Description

Technical field

[0001] The invention relates to the field of artificial intelligence, in particular to a training data sampling method for establishing a word translation model.

Background technique

[0002] At present, when a translation system performs translation, the results are mostly provided by machine translation, which is the process of using a computer to convert one natural language (the source language) into another natural language (the target language). To improve the accuracy of machine translation, many translation systems currently provide a multi-candidate mechanism for translation results: when users are not satisfied with the current translation result, they can choose among candidate results at the word or phrase level.

[0003] However, existing machine translation is constrained by the limited resources in the corpus and by the limited descriptive ability of the translation model itself. Therefore, there is still a certain gap between th...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/28
CPC: G06F40/55
Inventor: 陈虎, 尹文鹏
Owner: 陈虎