A Query Method of Indefinite Length Words and Sentences Based on Inverted Index

An inverted-index query method, applied in the field of data science, that improves query efficiency and word segmentation efficiency and saves query and retrieval time.

Active Publication Date: 2022-02-08
HARBIN INST OF TECH

AI Technical Summary

Problems solved by technology

The purpose of the present invention is to overcome the deficiencies of manually retrieving words and sentences from evaluation documents by providing an inverted-index-based method for querying words and sentences in evaluation documents, so that information and data-mining value can be retrieved from text data quickly and accurately.



Examples


Specific embodiment

[0075] The audit and evaluation data of colleges and universities are mainly text reports in Word and PDF formats, which include quantitative numerical evaluations and qualitative textual evaluations of teaching quality. The textual evaluation is the main part of each report. When looking for problems that are common across institutions or specific to one institution, it is necessary to search the evaluation data for keywords, especially words and sentences of indefinite length.

[0076] Step 1: Preprocess the evaluation reports to be processed: convert them uniformly to plain-text format and store them in the same directory, as shown in Table 1.

[0077] Table 1. Example of pending data

[0078]
Serial number | File name | File type | File format | File size
1 | Jilin Police College | Evaluation class documentation | Word file (.doc) | 64KB
2 | Zhejiang University of Foreign Languages | conference ...


Abstract

A query method for variable-length words and sentences in evaluation documents based on an inverted index. It involves an indexing method from the field of data science and a word segmentation method from the field of NLP, and solves the problem of querying variable-length words and sentences in evaluation documents. The steps of the present invention are: 1. Preprocess the documents to be queried and segment them into words with the jieba word segmentation method to obtain a word dictionary and word frequency information; 2. Establish an adaptive inverted table based on the inverted index principle of the complete reconstruction strategy; 3. Given the variable-length words and sentences to be searched, use the adaptive inverted table to look up the position information of each word in them, identify the positions of the variable-length words and sentences, and index the paragraphs in which they occur, completing the query of variable-length words and sentences in evaluation documents. The basic idea of the present invention is to segment the text data into words, establish an inverted index, and then search quickly for words and sentences of indefinite length, thereby realizing the query function for evaluation documents. The method has a wide range of application scenarios and therefore high socio-economic value.
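The three steps above can be illustrated with a minimal positional inverted index. This is only a sketch under stated assumptions, not the patented method: the excerpt does not detail the "adaptive inverted table" or the complete reconstruction strategy, so the names `tokenize`, `build_index`, and `find_phrase` are hypothetical, and a simple regular-expression tokenizer stands in for jieba segmentation (for Chinese text one would presumably swap in `jieba.lcut`).

```python
import re
from collections import defaultdict


def tokenize(text):
    # Stand-in for jieba segmentation (assumption): lowercase word tokens.
    return re.findall(r"\w+", text.lower())


def build_index(paragraphs):
    """Map each token to {paragraph_id: [positions of the token]}."""
    index = defaultdict(lambda: defaultdict(list))
    for pid, para in enumerate(paragraphs):
        for pos, tok in enumerate(tokenize(para)):
            index[tok][pid].append(pos)
    return index


def find_phrase(index, phrase):
    """Return sorted ids of paragraphs that contain the phrase's tokens
    at consecutive positions. The phrase may have any number of tokens."""
    tokens = tokenize(phrase)
    if not tokens or any(t not in index for t in tokens):
        return []
    # Only paragraphs that contain every token can match.
    candidates = set(index[tokens[0]])
    for t in tokens[1:]:
        candidates &= set(index[t])
    hits = []
    for pid in sorted(candidates):
        later = [set(index[t][pid]) for t in tokens[1:]]
        # A match starts where each following token sits one position further on.
        for start in index[tokens[0]][pid]:
            if all(start + k + 1 in s for k, s in enumerate(later)):
                hits.append(pid)
                break
    return hits


# Hypothetical sample paragraphs, invented for illustration only.
reports = [
    "Teaching quality needs improvement in laboratory courses.",
    "The laboratory is well equipped.",
    "Improvement in laboratory courses was noted; teaching quality is high.",
]
report_index = build_index(reports)
print(find_phrase(report_index, "laboratory courses"))  # [0, 2]
print(find_phrase(report_index, "quality teaching"))    # []
```

Storing token positions per paragraph is what lets a query of arbitrary length be answered from the index alone: the lookup intersects the paragraphs that contain every token and then checks for consecutive positions, without rescanning the text.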

Description

Technical field

[0001] The invention relates to a data indexing method in the field of data science and a word segmentation method in the field of natural language processing, and in particular to a query method for indefinite-length words and sentences in evaluation documents based on an inverted index.

Background

[0002] With the explosive growth in the amount of data in the information age, it has become clear that great value is hidden in massive data, which attracts more and more researchers to study it. For structured data, traditional or modern data mining methods can obtain good results, but for unstructured data, such as massive evaluation text reports, modern data mining methods and methods from areas such as natural language processing are needed to extract the value of the information. Evaluation documents are characterized by the coexistence of numerical evaluation and text evaluation, and there is no clear evalua...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/31, G06F40/289, G06F40/242
CPC: G06F40/242, G06F40/289
Inventor: 沈毅, 赵虹博, 杨朔, 王宏志, 张淼
Owner: HARBIN INST OF TECH