Multi-level long text vector retrieval method and device and electronic equipment

A multi-level long-text vector retrieval technology, applied to unstructured text data retrieval, text database indexing, neural network learning methods, etc. It addresses the difficulty of resolving word ambiguity and the cumbersome, time-consuming manual feature engineering of existing retrieval methods, and it improves recall efficiency.

Active Publication Date: 2021-06-18
Beijing Academy of Artificial Intelligence (北京智源人工智能研究院)

AI Technical Summary

Problems solved by technology

[0004] Retrieval methods based on sparse matrices have the following disadvantages: (1) they require manual feature engineering, which is cumbersome, time-consuming and error-prone, and the feature-engineering code is written for one specific problem, so it has to be rewritten for every new problem or new data set; (2) they struggle to resolve word ambiguity in the open domain.
However, training with binary positive/negative classification labels has the following problem: when a long document containing the answer is divided into multiple text segments for model training, some segments contain the answer, while others do not contain the answer but are still semantically related to the search request.
For example, take the request "What are the material requirements for the application of the 2021 Beijing Natural Science Foundation of China?" The corresponding document mentions the "2021 Beijing Natural Science Foundation Project" only at the beginning, but because the long text is cut into text fragments, that first fragment is used as a negative example during training, which degrades the model.


Examples

Embodiment 1

[0044] As shown in Figure 1, an embodiment of the present invention provides a multi-level long text vector retrieval method, including:

[0045] S101: segmenting long text in the open domain into text segments;

[0046] S102: encoding the text segments and the search request into dense vectors, respectively, using the trained encoder;

[0047] S103: using the dense vectors of the text segments and of the search request, querying by vector retrieval to obtain a target text segment similar to the search request;

[0048] Wherein, the encoder is trained using a training data set including multi-level text segments.
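
Steps S101 to S103 can be illustrated with a minimal Python sketch. It is not the patented implementation: the fixed-size word window, the embedding size `DIM`, and especially `toy_encode` (a hashed bag-of-words stand-in for the trained encoder, which this excerpt does not specify) are assumptions made only so the example runs on its own.

```python
import hashlib

import numpy as np

DIM = 256  # hypothetical embedding size, not taken from the patent


def segment(text: str, max_words: int = 8) -> list[str]:
    """S101: split a long text into fixed-size word windows (assumed strategy)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_encode(text: str) -> np.ndarray:
    """S102 stand-in: hash each word into a bucket, then L2-normalise.

    A real system would call the trained encoder here instead.
    """
    vec = np.zeros(DIM)
    for raw in text.lower().split():
        word = raw.strip(".,?!")
        if not word:
            continue
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def retrieve(query: str, segments: list[str], top_k: int = 1) -> list[tuple[str, float]]:
    """S103: rank segments by inner product with the query vector."""
    matrix = np.stack([toy_encode(s) for s in segments])
    scores = matrix @ toy_encode(query)
    order = np.argsort(-scores)[:top_k]
    return [(segments[i], float(scores[i])) for i in order]


long_text = (
    "The foundation accepts applications every spring. "
    "Applicants must submit a research proposal and a budget plan. "
    "Results are announced in the autumn."
)
segments = segment(long_text)
best = retrieve("research proposal and budget plan", segments, top_k=1)
print(best[0][0])
```

With vectors normalised to unit length, the inner product equals cosine similarity; at scale the brute-force matrix product would typically be replaced by an approximate nearest-neighbour index.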

[0049] In practice, because long documents usually have to be divided into multiple text fragments for model training, the relevance between a search request and a text fragment is multi-level rather than limited to two labels, relevant and irrelevant. For example, the following four text fragments: a. document fragments that contain an...
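
One way to see why graded labels carry more signal than binary ones is a pairwise ranking loss over relevance levels. The sketch below is an assumption for illustration only: this excerpt does not state the training objective, and the integer levels 0 to 3 are hypothetical placeholders rather than the patent's fragment categories.

```python
import numpy as np


def multilevel_pairwise_loss(scores, levels, margin=0.1):
    """Average hinge loss over all fragment pairs whose relevance levels differ.

    scores: model similarity between the request and each fragment.
    levels: integer relevance level per fragment (higher = more relevant).
    """
    scores = np.asarray(scores, dtype=float)
    levels = np.asarray(levels)
    losses = []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if levels[i] > levels[j]:
                # Fragment i should outscore fragment j by at least `margin`.
                losses.append(max(0.0, margin - (scores[i] - scores[j])))
    return float(np.mean(losses)) if losses else 0.0


levels = [3, 2, 1, 0]  # hypothetical graded labels for four fragments
well_ordered = multilevel_pairwise_loss([0.9, 0.6, 0.3, 0.0], levels)
binary_like = multilevel_pairwise_loss([0.9, 0.0, 0.0, 0.0], levels)
print(well_ordered, binary_like)
```

A model that only separates the top fragment from everything else (`binary_like`) is still penalised, because it fails to order the partially relevant fragments among themselves.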

Embodiment 2

[0085] As shown in Figure 2, another aspect of the present invention provides a functional module architecture that corresponds fully to the method flow above. That is, an embodiment of the present invention also provides a multi-level long text vector retrieval device, including:

[0086] A text segmentation module 201, configured to segment long text in the open domain into text segments;

[0087] A vector encoding module 202, configured to encode the text segments and the search request into dense vectors, respectively, using a trained encoder, where the encoder is trained using a training data set including multi-level text segments;

[0088] A vector retrieval module 203, configured to use the dense vectors of the text segments and of the search request to obtain, based on vector retrieval, a target text segment similar to the search request.
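
The three modules could be wired together roughly as follows. This is a hedged sketch: the class names mirror modules 201 to 203, but `RetrievalDevice`, the `run`, `index` and `search` methods, and `letter_count_encoder` (a trivial stand-in for the trained encoder) are hypothetical names introduced here, not part of the patent text.

```python
from typing import Callable

import numpy as np

Encoder = Callable[[str], np.ndarray]


def letter_count_encoder(text: str) -> np.ndarray:
    """Stand-in encoder (assumption): normalised letter-frequency vector."""
    vec = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


class TextSegmentationModule:  # corresponds to module 201
    def __init__(self, max_words: int = 50):
        self.max_words = max_words

    def run(self, long_text: str) -> list[str]:
        words = long_text.split()
        return [" ".join(words[i:i + self.max_words])
                for i in range(0, len(words), self.max_words)]


class VectorEncodingModule:  # corresponds to module 202
    def __init__(self, encoder: Encoder):
        self.encoder = encoder

    def run(self, texts: list[str]) -> np.ndarray:
        return np.stack([self.encoder(t) for t in texts])


class VectorRetrievalModule:  # corresponds to module 203
    def run(self, query_vec, segment_vecs, segments, top_k=1):
        scores = segment_vecs @ query_vec
        order = np.argsort(-scores)[:top_k]
        return [segments[i] for i in order]


class RetrievalDevice:
    """Holds the indexed segments so that each search only encodes the query."""

    def __init__(self, encoder: Encoder, max_words: int = 50):
        self.segmenter = TextSegmentationModule(max_words)
        self.encoding = VectorEncodingModule(encoder)
        self.retrieval = VectorRetrievalModule()
        self.segments: list[str] = []
        self.segment_vecs = None

    def index(self, long_text: str) -> None:
        self.segments = self.segmenter.run(long_text)
        self.segment_vecs = self.encoding.run(self.segments)

    def search(self, query: str, top_k: int = 1) -> list[str]:
        query_vec = self.encoding.run([query])[0]
        return self.retrieval.run(query_vec, self.segment_vecs, self.segments, top_k)


device = RetrievalDevice(letter_count_encoder, max_words=4)
device.index("alpha beta gamma delta zzz zzz zzz zzz")
results = device.search("zzz")
print(results)
```

Passing the encoder in as a callable keeps the segmentation and retrieval modules independent of whichever trained model is eventually used.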

Abstract

The invention discloses a multi-level long text vector retrieval method, device and electronic equipment. The method comprises the following steps: segmenting a long text in the open domain into text fragments; encoding the text fragments and the search request into dense vectors, respectively, using a trained encoder; and, using the dense vectors of the text fragments and of the search request, querying by vector retrieval to obtain a target text fragment similar to the search request, wherein the encoder is obtained by training on a training data set comprising multi-level text fragments. Because the multi-level correlation between the text fragments in the training data set and the search request is taken into account, the resulting model can more easily select a suitable fragment from several related fragments, and recall efficiency is markedly improved.

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, in particular to a multi-level long text vector retrieval method, device and electronic equipment.

Background

[0002] Open-domain question answering is an important task in natural language processing. It can be described simply as follows: given a factual question, the system must retrieve the document containing the answer from a large-scale, multi-domain document library, and then extract or generate the answer from it. In open-domain question answering, document retrieval is often the most important part, and its accuracy sets the upper limit on the overall performance of the system.

[0003] At present, the commonly used methods for document retrieval in open-domain question answering are based on sparse matrix retrieval or dense vector retrieval. Among them, retrieval methods based on sparse matrices ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/31, G06F16/33, G06F16/332, G06N3/04, G06N3/08
CPC: G06F16/316, G06F16/3344, G06F16/3329, G06N3/08, G06N3/045
Inventors: 钱泓锦, 刘占亮, 窦志成, 文继荣, 曹岗
Owner: Beijing Academy of Artificial Intelligence (北京智源人工智能研究院)