Method and device for constructing word segmentation training data

A method for constructing word segmentation training data, applied in electrical digital data processing, special data processing applications, instruments, etc. It addresses the data sparsity and limited data sources of word segmentation training data, enriching the data sources and overcoming the data sparsity problem.

Active Publication Date: 2018-01-30
BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD

AI Technical Summary

Problems solved by technology

However, web pages containing anchor text cannot be found on the Internet for all corpus data, so the data sources available to this scheme are very limited.
Consequently, if word segmentation training data is obtained entirely in this way, it will suffer from an obvious data sparsity problem.


Examples


Embodiment Construction

[0022] The present invention will be further described in detail below with reference to the drawings and embodiments. It can be understood that the specific embodiments described here are only used to explain the present invention, but not to limit the present invention. In addition, it should be noted that, for ease of description, the drawings only show a part but not all of the content related to the present invention.

[0023] Figures 1 and 2 show the first embodiment of the invention.

[0024] Figure 1 is a flowchart of a method for constructing word segmentation training data provided by the first embodiment of the present invention. Referring to Figure 1, the method for constructing the word segmentation training data includes:

[0025] S110: Obtain the query sentence the user entered in a query session, and the page title of the web page link the user clicked among the query results for that query sentence.
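As a rough illustration of step S110, the sketch below pairs each query in a session with the title of the last link clicked for it. The source does not describe the log format, so the `SessionEvent` record, its `kind` and `text` fields, and the `query_title_pairs` helper are all hypothetical names chosen for this example. Keeping the last click follows the abstract's wording ("finally clicked by the user").

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass
class SessionEvent:
    """Hypothetical log record; the real log schema is not given in the source."""
    kind: str  # "query" or "click"
    text: str  # the query sentence, or the page title of the clicked link


def query_title_pairs(events: Iterable[SessionEvent]) -> Iterator[Tuple[str, str]]:
    """Yield (query sentence, title of the last clicked page) for each query."""
    query: Optional[str] = None
    title: Optional[str] = None
    for event in events:
        if event.kind == "query":
            # A new query closes the previous one; emit it only if it had a click.
            if query is not None and title is not None:
                yield query, title
            query, title = event.text, None
        elif event.kind == "click" and query is not None:
            # Later clicks overwrite earlier ones, so the final click wins.
            title = event.text
    if query is not None and title is not None:
        yield query, title


session = [
    SessionEvent("query", "q1"),
    SessionEvent("click", "t1"),
    SessionEvent("click", "t2"),
    SessionEvent("query", "q2"),  # no click follows, so no pair is produced
    SessionEvent("query", "q3"),
    SessionEvent("click", "t3"),
]
pairs = list(query_title_pairs(session))
```

Queries with no click are skipped in this sketch, since there is no title to compare them against.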

[0026] Since there may be different understandings of the corp...


Abstract

The embodiments of the invention disclose a method and device for constructing word segmentation training data. The method comprises the following steps: acquiring a user's query sentence in a query session of the user and the page title of the web page finally clicked by the user; comparing the query sentence with the page title to obtain the common character strings shared by the query sentence and the page title; and performing word segmentation on the query sentence and the page title according to the obtained common character strings. The method and device enrich the data sources of word segmentation training data and alleviate its data sparsity problem.
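The comparison and segmentation steps can be sketched as follows. This is a minimal illustration, not the patented procedure: the source does not say how the common ("public") character strings are found, so the sketch assumes Python's `difflib.SequenceMatcher`, which returns order-preserving matching blocks rather than every possible common substring. The function names and the `min_len` threshold are assumptions made for this example.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def common_blocks(query: str, title: str, min_len: int = 2):
    """Matching blocks shared by query and title, at least min_len characters long.

    min_len is an assumed threshold that drops single-character coincidences.
    """
    matcher = SequenceMatcher(None, query, title, autojunk=False)
    return [m for m in matcher.get_matching_blocks() if m.size >= min_len]


def split_at(text: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Cut text at the start and end of every (start, size) span."""
    cuts = {0, len(text)}
    for start, size in spans:
        cuts.update((start, start + size))
    points = sorted(cuts)
    return [text[i:j] for i, j in zip(points, points[1:])]


def segment_pair(query: str, title: str, min_len: int = 2):
    """Segment both strings using the boundaries of their common blocks."""
    blocks = common_blocks(query, title, min_len)
    query_segments = split_at(query, [(m.a, m.size) for m in blocks])
    title_segments = split_at(title, [(m.b, m.size) for m in blocks])
    return query_segments, title_segments


query_segments, title_segments = segment_pair("北京天气预报", "北京今日天气预报查询")
# query_segments -> ['北京', '天气预报']
# title_segments -> ['北京', '今日', '天气预报', '查询']
```

Only the boundaries of the shared strings are evidence of word boundaries here. The leftover pieces between them (such as "今日" and "查询" above) are simply what remains after cutting and may still contain several words.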

Description

Technical field

[0001] The embodiments of the present invention relate to the technical field of natural language processing, and in particular to a method and device for constructing word segmentation training data.

Background art

[0002] Most word segmentation techniques rely on a background corpus, so the annotation quality of the corpus data determines the quality of the final word segmentation result. At present, most corpus data is annotated manually. Manual annotation demands a high level of expertise from the annotators and is time-consuming and laborious, which makes word segmentation of corpus data inefficient.

[0003] One way to improve the efficiency of word segmentation of corpus data is to use anchor text on web pages as a reference for segmenting the corpus data. For example, the text "John Wayne is a 19th century British philosopher and mathematici...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
CPC: G06F16/3344, G06F16/9535
Inventors: 石磊, 张开旭
Owner: BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD