However, there are some defects in the current search engines. There are three main problems: one is that the search engines return too much retrieval information, some of which contain some
noise data, and users cannot effectively locate the required information; the other is that the search engines do not understand users. The real
search intent; the third is that the search engine simply considers the matching of keywords, and does not consider the grammatical and
semantic relationship of the
search terms, so it is difficult to improve the accuracy of query retrieval
However, the classification of Chinese questions in the prior art cannot effectively improve the
hit rate and speed of retrieval, and cannot narrow the scope of retrieval of questions and reduce the retrieval time; the classification of questions in the prior art cannot optimize the retrieval items, and cannot recommend similarity to users. Question entry, the
recall rate of the
question answering system is low; the question classification affects the accuracy of the
question answer, and the quality of the question classification
algorithm determines the accuracy of the answer. The prior art question classification single
algorithm is monotonous and inefficient. It is not conducive to improving the
hit rate of answers. The existing Chinese question classification system cannot meet the requirements of intelligent question and answer for online consultation, and cannot be applied to the rigorous intelligent medical field;
[0008] Second, there is still a big gap between the classification of Chinese questions and the classification of English questions, especially in the field of medical question classification. The main reason is that Chinese questions have their own characteristics. Compared with English questions, Chinese The grammatical structure of questions is complex and the
semantic information is diverse; the second is the lack of corresponding corpus and
knowledge base; the third is that the research and application of Chinese question classification is relatively late, and most of the existing Chinese question classification adopts rule-based classification method, achieved some results on some standard data sets, by improving the Bayesian model to classify Chinese questions, extracting the main body of questions and combining word segmentation and part-of-speech feature values to classify questions, but its accuracy is affected by the analysis of
syntactic structure Accuracy
impactAffected by the calculation method of semantic correlation, in general, the problems encountered in the classification of Chinese questions include: the questions themselves are short and contain a small number of words, which makes the problem of dimensionality disaster and data sparse in the training of question classification, The efficiency and accuracy of Chinese question classification cannot meet the requirements of medical online consultation;
[0009] Third, as a key technology in online
medical consultation, intelligent
question answering directly affects the quality and user experience of this emerging medical service. One of the core problems of intelligent
question answering is to efficiently classify questions, but the characteristic of medical questions
Sentence keywords are less, composed of diseases or symptoms + interrogative words + verbs, the efficiency of the existing method of constructing the
feature vector of medical inquiry is low, and the error of the full-text
index method is relatively large. In the Chinese environment, the classification of medical questions is more difficult. It is obvious that the construction of network question feature vectors is slow, and it is easy to cause problems such as excessive dimensionality and sparse data when building question feature vectors. The efficiency of question classification is very low, which will cause synonyms to generate different distributed vectors. And limited by the corpus, it cannot identify new words on
the Internet very well, and the accuracy of word association and the classification efficiency of medical questions are low;
[0010] Fourth, the semantic correlation algorithm has obvious shortcomings. It does not consider the difference in
semantics. Some words have
polysemy. The semantic correlation algorithm is just a simple concept mapping, which is easy to introduce
noise data. In addition, semantic correlation The degree algorithm needs to consider all the data of the search engine encyclopedia page, and the preprocessing stage consumes more time and resources, indicating that the text vector includes all search engine encyclopedia concepts, and the dimension of the vector reaches 900,000 dimensions, and the amount of calculation is too large;
[0011] Fifth, Chinese questions contain rich
semantic information. Its structure is complex, the forms of questions are diverse, and there are
polysemy and synonymous dependencies between words. Most of the Chinese questions are relatively short and contain only few keywords. There are many problems in question classification
The existing text representation method is the
vector space model. This representation method results in sparse vectors and too large dimensions, and cannot describe the
semantic relationship between words well, resulting in large errors in the calculation of similarity and affecting the accuracy of the test. However, due to the lack of training corpus, the similarity is inaccurate, and the words in some dictionaries are not rich enough to eliminate the error of synonyms and solve the problem of unregistered words. The problem of word vector construction, without considering the frequency of words, grammar,
semantics and context, the obtained feature word vectors cannot meet the requirements