A typo proofreading method and device for automatically generating training data

Automatic training-data generation technology, applied in the computer field, addresses the time-consuming, laborious, and inefficient manual collection of error corpora.

Active Publication Date: 2021-05-14
京华信息科技股份有限公司
9 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

Manual collection of wrong corpus with typos is time-consuming, laborious and inefficient

Method used

The given corpus is segmented into first phrases, confusing word sets are generated from those phrases, and a selected phrase is replaced with a similar phrase from the matching word set to generate an error corpus automatically. The given corpus and the error corpus together form the training data for the typo proofreading model.

Detailed Description of Embodiments

[0027] The following will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without creative efforts fall within the protection scope of the present invention.

[0028] As shown in Figure 1, an embodiment of the present invention provides a typo proofreading method that automatically generates training data, including:

[0029] Step S101: Obtain a given corpus and perform word segmentation processing on the given corpus to obtain several first phrases.

[0030] Step S102: Generate several confusing word sets according to each of the first phrases; wherein, each of the confusing word sets includes a core phrase and ...
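Steps S101 and S102 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text above is truncated before it explains how similar phrases are found, so the `SIMILAR` table is a hypothetical hand-written lookup, and `segment` is a whitespace stand-in for a real word segmenter (Chinese text would need a dedicated one).

```python
from typing import Dict, List

# Hypothetical table of similar phrases. The source does not say how
# similar phrases are derived; this lookup is an assumption for illustration.
SIMILAR: Dict[str, List[str]] = {
    "their": ["there", "they're"],
    "affect": ["effect"],
}


def segment(corpus: str) -> List[str]:
    """Stand-in for step S101: split the corpus into phrases on whitespace."""
    return corpus.split()


def build_confusion_sets(phrases: List[str]) -> Dict[str, List[str]]:
    """Sketch of step S102: map each core phrase to its similar phrases.

    Phrases with no known similar phrases get no confusing word set.
    """
    sets: Dict[str, List[str]] = {}
    for phrase in phrases:
        if phrase in SIMILAR and phrase not in sets:
            sets[phrase] = list(SIMILAR[phrase])
    return sets


phrases = segment("their plan will affect us")
confusion_sets = build_confusion_sets(phrases)
```

Keying each set by its core phrase makes the later lookup ("find the set whose core phrase equals the phrase to be replaced") a single dictionary access.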

Abstract

The invention discloses a typo proofreading method and device that automatically generate training data. The method includes: performing word segmentation on a given corpus to obtain a number of first phrases; generating a number of confusing word sets according to the first phrases; selecting a first phrase to be replaced from the first phrases, and taking the confusing word set whose core phrase is the same as the first phrase to be replaced as the selected word set; replacing the first phrase to be replaced in the given corpus with a similar phrase from the selected word set to generate an error corpus; using the given corpus and the error corpus as a training data set and training a typo proofreading model on it; and proofreading the text to be proofread with the typo proofreading model. The invention thereby addresses the long time and low efficiency of manually collecting error corpora in the prior art.
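The error-corpus generation described in the abstract can be sketched as below. This is an illustrative sketch only: the function names, the whitespace tokenization, the random choice of which phrase to replace, and the `(error, correct)` pair layout are all assumptions, since the abstract does not specify them. Model training is out of scope here.

```python
import random
from typing import Dict, List, Optional, Tuple


def make_error_tokens(
    tokens: List[str],
    confusion_sets: Dict[str, List[str]],
    rng: random.Random,
) -> Optional[List[str]]:
    """Replace one phrase with a similar phrase from its confusing word set.

    Returns None when no phrase in the sentence has a confusing word set.
    """
    # Positions whose phrase is the core phrase of some confusing word set.
    candidates = [i for i, t in enumerate(tokens) if confusion_sets.get(t)]
    if not candidates:
        return None
    i = rng.choice(candidates)  # assumed selection rule: pick one at random
    corrupted = list(tokens)
    corrupted[i] = rng.choice(confusion_sets[tokens[i]])
    return corrupted


def build_training_pairs(
    sentences: List[str],
    confusion_sets: Dict[str, List[str]],
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Pair each generated error sentence with its original sentence."""
    rng = random.Random(seed)
    pairs: List[Tuple[str, str]] = []
    for sentence in sentences:
        corrupted = make_error_tokens(sentence.split(), confusion_sets, rng)
        if corrupted is not None:
            pairs.append((" ".join(corrupted), sentence))
    return pairs


pairs = build_training_pairs(
    ["this will affect us", "no match here"],
    {"affect": ["effect"]},
)
```

Sentences without any replaceable phrase are skipped rather than emitted unchanged, so every pair in the result contains exactly one injected typo.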

Description

Technical field

[0001] The invention relates to the field of computer technology, in particular to a typo proofreading method and device that automatically generate training data.

Background

[0002] Typo proofreading is one of the tasks of text proofreading. With the development of science and technology, building error-correction models automatically through machine learning has gradually become popular. Training such a model requires a large amount of training data. Existing approaches obtain this data by manually collecting users' error corpora and then labeling them to produce training samples. Manually collecting error corpora containing typos is time-consuming, laborious, and inefficient.

Summary of the invention

[0003] Embodiments of the present invention provide a typo proofreading method and device that automatically generate training data, which can automatically generate error corpus with typos, use the generated error corpu...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/232, G06F40/289, G06F40/247
CPC: G06F40/232, G06F40/247, G06F40/289
Inventor: 蓝建敏, 池沐霖
Owner: 京华信息科技股份有限公司