A method and electronic device for long text retrieval in an open domain question answering task

An open-domain long-text retrieval technology, applied in unstructured text data retrieval, text database querying, and semantic analysis. It addresses the poor generalization, error-prone results, and cumbersome processes of earlier methods, and achieves strong reusability, improved accuracy, and better handling of word ambiguity.

Active Publication Date: 2020-12-15
北京智源人工智能研究院

AI Technical Summary

Problems solved by technology

[0004] This method has the following unavoidable disadvantages in practical applications:
(1) It requires complex manual feature engineering, which is cumbersome, time-consuming, and error-prone. In addition, the feature engineering code targets one specific problem; when a new problem or a new data set has to be handled, the relevant code must be rewritten.
(2) It has difficulty resolving word ambiguity in the open domain. For example, if the context of the word "apple" is ignored, the system can hardly tell whether it denotes a fruit or a technology company.
(3) It lacks a deep understanding of semantics. For example, given an abbreviated name and the full name of the same entity (both rendered in this translation as "Ministry of Industry and Information Technology"), the system cannot automatically find the correlation between them, and they must be normalized manually.
(4) The room for optimization is limited. Because of the technical limits of manual feature engineering, once the retrieval quality reaches a certain level it is difficult to improve further.
(5) It generalizes poorly. Because the indexes built in the system are strongly tied to particular domains, results are often poor for search requests outside those text domains.
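
Disadvantages (2) and (3) can be illustrated with a minimal Python sketch of exact-term matching. The function name `term_overlap`, the sample strings, and the abbreviation "MIIT" are illustrative assumptions and do not come from the patent text. Real sparse retrievers such as BM25 weight terms rather than simply counting them, but they depend on literal token overlap in the same way.

```python
def term_overlap(query: str, document: str) -> int:
    """Count the distinct lowercase tokens shared by a query and a document."""
    query_terms = set(query.lower().split())
    document_terms = set(document.lower().split())
    return len(query_terms & document_terms)


if __name__ == "__main__":
    # Ambiguity: the token "apple" matches both senses equally well.
    print(term_overlap("apple", "apple pie recipe"))
    print(term_overlap("apple", "apple releases new phone"))

    # Missing semantics: an abbreviation shares no token with the full name.
    # "MIIT" is a hypothetical abbreviation used only for this illustration.
    print(term_overlap("MIIT", "Ministry of Industry and Information Technology"))
```

Because the score depends only on shared tokens, the two "apple" documents are indistinguishable and the abbreviation never matches the full name unless someone normalizes it by hand.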

Examples

Embodiment 1

[0050] As shown in Figure 1, the embodiment of the present invention provides a method for long text retrieval in an open domain question answering task, including:

[0051] S101. Use a pre-trained encoder to encode open domain documents and search requests into document dense vectors and request dense vectors, respectively, wherein the encoder is trained on sample data consisting of historical search requests, positive samples, and negative samples;

[0052] S102. Calculate the similarity score between the search request and the open domain document according to the document dense vector and the request dense vector, and select the open domain document whose similarity score meets the requirements as a candidate document;

[0053] S103. Select a target document corresponding to the search request from the candidate documents.
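
Steps S101 to S103 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `toy_encode` is a hypothetical hashing stand-in for the pre-trained encoder, the dot product stands in for the similarity score, "meets the requirements" is interpreted as a top-k cutoff, and `select_target` simply returns the best-scoring candidate because this excerpt does not say how the target document is chosen.

```python
import hashlib
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def toy_encode(text: str, dim: int = 64) -> Vector:
    """Hypothetical stand-in for the pre-trained encoder (S101).

    Hashes each token into one of `dim` buckets and L2-normalizes the result.
    A real system would use a trained neural encoder instead.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product, used here as the similarity score (an assumption)."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_candidates(
    request: str,
    documents: Sequence[str],
    encode: Callable[[str], Vector] = toy_encode,
    top_k: int = 2,
) -> List[Tuple[int, float]]:
    """S102: score every document against the request and keep the top_k.

    Returns (document index, score) pairs, best first; ties go to the
    lower document index.
    """
    request_vec = encode(request)
    scored = [(i, dot(request_vec, encode(doc))) for i, doc in enumerate(documents)]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


def select_target(candidates: Sequence[Tuple[int, float]]) -> int:
    """S103 placeholder: return the index of the best-scoring candidate."""
    if not candidates:
        raise ValueError("no candidate documents")
    return candidates[0][0]


if __name__ == "__main__":
    documents = [
        "dense vector retrieval for open domain question answering",
        "recipes for apple pie",
        "stock price of a technology company",
    ]
    candidates = retrieve_candidates("open domain question answering", documents)
    print(candidates, select_target(candidates))
```

In practice the document vectors would be computed once and stored, so that only the request has to be encoded at query time; the sketch re-encodes documents on each call only to stay short.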

[0054] The above method can be described as:

[0055] Given a collection of historical search requests , the document collection where the answer co...

Embodiment 2

[0100] As shown in Figure 3, another aspect of the present invention provides a functional module architecture that corresponds fully to the method flow described above. That is, the embodiment of the present invention also provides a device for long text retrieval in an open domain question answering task, including:

[0101] An encoding module 201, configured to encode the open domain documents and search requests into document dense vectors and request dense vectors, respectively, using a pre-trained encoder, wherein the encoder is trained on sample data consisting of historical search requests, positive samples, and negative samples;

[0102] A candidate document selection module 202, configured to calculate the similarity score between the search request and the open domain document according to the document dense vector and the request dense vector, and select the open domain document whose similarity score meets the requirements as a candidate document;

[0103] ...

Abstract

The invention discloses a method and electronic device for long text retrieval in an open domain question answering task. The method comprises the following steps: encoding an open domain document and a search request into a document dense vector and a request dense vector, respectively, using a pre-trained encoder, wherein the encoder is trained on sample data consisting of historical search requests, positive samples, and negative samples; calculating a similarity score between the search request and the open domain document from the document dense vector and the request dense vector, and selecting the open domain documents whose similarity scores meet the requirement as candidate documents; and selecting a target document corresponding to the search request from the candidate documents. The invention offers strong reusability; words that refer to the same thing receive similar semantic representations; the ambiguity that a word with multiple meanings introduces into search is effectively reduced; the model trains well; the method generalizes relatively well to cross-domain documents; and search quality, search performance, usability, and maintainability are greatly improved, with greater potential for further improvement.
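
The abstract states that the encoder is trained on historical search requests with positive and negative samples, but this excerpt does not give the training objective. The following Python sketch shows one common choice, a softmax-based contrastive loss over similarity scores; the loss itself, the function name `contrastive_loss`, and the use of the dot product are assumptions made for illustration only.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity between two dense vectors."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(
    request_vec: Sequence[float],
    positive_vec: Sequence[float],
    negative_vecs: Sequence[Sequence[float]],
) -> float:
    """Negative log-probability of the positive sample under a softmax
    over the similarity scores of the positive and all negative samples.

    The loss is small when the request is closer to the positive sample
    than to the negatives, and large otherwise.
    """
    positive_score = dot(request_vec, positive_vec)
    scores = [positive_score] + [dot(request_vec, n) for n in negative_vecs]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_score = max(scores)
    log_sum = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_sum - positive_score


if __name__ == "__main__":
    request = [1.0, 0.0]
    print(contrastive_loss(request, [1.0, 0.0], [[0.0, 1.0]]))  # positive is close
    print(contrastive_loss(request, [0.0, 1.0], [[1.0, 0.0]]))  # positive is far
```

Minimizing a loss of this kind pushes request vectors toward the vectors of their positive documents and away from the negatives, which is what makes the similarity score in the retrieval step meaningful.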

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, in particular to a method and electronic device for long text retrieval in an open domain question answering task.

Background

[0002] Open domain question answering is an important task in the field of natural language processing. The process can be described simply as follows: for a given factual question, first retrieve the document that contains the answer from a large-scale multi-domain document library, and then extract or generate the answer from that document. The accuracy of document retrieval often sets the upper limit on the quality of the whole process. Document retrieval is therefore the most important part of open domain question answering tasks.

[0003] Currently, common methods in the document retrieval stage are based on sparse matrices, such as TF-IDF or BM25. Specifically, suc...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/33, G06F40/30
CPC: G06F16/33, G06F40/30
Inventor: 钱泓锦, 刘占亮, 刘家俊, 窦志成
Owner: 北京智源人工智能研究院