
Lucene full-text retrieval based Chinese word segmentation method

A Chinese word segmentation and full-text retrieval technique applied in the field of power systems. It addresses fuzzy and redundant information that is difficult to analyze quantitatively and accurately, and aims to improve efficiency, produce clear word segmentation results, and raise the level of power grid marketing services.

Inactive Publication Date: 2016-01-27
JIANGSU ELECTRIC POWER INFORMATION TECH +1

AI Technical Summary

Problems solved by technology

Most of the data in power system marketing consists of written descriptions, so the information is fuzzy and redundant, and it is difficult to analyze quantitatively and accurately with traditional data analysis methods.




Embodiment Construction

[0017] A Chinese word segmentation method based on Lucene full-text search. Figure 1 is a flow chart of the method, which includes the following steps:

[0018] 1. Store the dictionary in the database with one word per row. In addition to the main dictionary of commonly used words and the quantifier dictionary of commonly used quantifiers that ship with the program, users can add extended dictionaries and stop word dictionaries as needed.

[0019] 2. Cache the dictionary from the database on the server in the form of a tree. The cached dictionaries are divided into three types: the main dictionary, the stop word dictionary and the quantifier dictionary. Extended dictionaries added by users are stored in the main dictionary.

[0020] 3. Input the text information to be segmented.

[0021] 4. The input text is matched character by character against the three dictionary trees of quantifiers, ...
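Steps 1 and 2 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the `dictionary` table, its `word` and `kind` columns, the sample words, and the helper names `build_tree` and `contains` are all assumptions made for illustration, and an in-memory SQLite database stands in for whatever database the system actually uses.

```python
import sqlite3

END = ""  # sentinel key marking the end of a word; never equal to a real character


def build_tree(words):
    """Build a nested-dict prefix tree from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def contains(tree, word):
    """Return True if `word` is stored as a complete word in `tree`."""
    node = tree
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return END in node


# Hypothetical schema: one word per row, tagged with its dictionary kind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dictionary (word TEXT NOT NULL, kind TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO dictionary VALUES (?, ?)",
    [
        ("电力", "main"),
        ("电力系统", "main"),
        ("营销服务", "extended"),  # user-added extended word
        ("的", "stop"),
        ("个", "quantifier"),
    ],
)

# Cache three trees; extended words are folded into the main tree (step 2).
words_by_tree = {"main": [], "stop": [], "quantifier": []}
for word, kind in conn.execute("SELECT word, kind FROM dictionary"):
    target = "main" if kind == "extended" else kind
    words_by_tree[target].append(word)

cache = {name: build_tree(words) for name, words in words_by_tree.items()}
```

With this layout, `contains(cache["main"], "营销服务")` is true because the extended word was merged into the main tree, while a bare prefix such as "电" is not reported as a word.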



Abstract

The present invention discloses a Lucene full-text retrieval based Chinese word segmentation method. The method comprises: storing a dictionary in a database in the form of one word for each row; caching the dictionary in the database into a server in the form of a tree; inputting text information that needs to be segmented; matching a text with a caching dictionary tree word by word, and outputting a successfully matched longest word; and outputting a word segmentation result. According to the method provided by the present invention, a user can extract useful information from massive fuzzy data for detailed study and summarization, and it is convenient for the user to perform semantic analysis and data analysis, so that a problem in a marketing service can be found in time, thereby improving a power grid marketing service level.
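The matching step described in the abstract, walking the text through the cached dictionary tree and emitting the longest successful match, can be sketched as forward maximum matching over a prefix tree. The following Python is a hedged illustration only: the function names, the sample dictionary, and the fallback of emitting an unmatched character as a single-character token are assumptions, not details taken from the patent.

```python
END = ""  # sentinel key marking the end of a word in the tree


def build_tree(words):
    """Build a nested-dict prefix tree from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def longest_match(tree, text, start):
    """Length of the longest dictionary word beginning at text[start], or 0."""
    node = tree
    best = 0
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break
        if END in node:
            best = i + 1 - start  # remember the longest complete word so far
    return best


def segment(tree, text):
    """Split text left to right, always taking the longest dictionary match."""
    tokens = []
    pos = 0
    while pos < len(text):
        length = longest_match(tree, text, pos)
        if length == 0:
            length = 1  # assumption: unmatched characters become single tokens
        tokens.append(text[pos:pos + length])
        pos += length
    return tokens


tree = build_tree(["电力", "电力系统", "营销", "营销服务", "服务"])
print(segment(tree, "电力系统营销服务水平"))
# prints ['电力系统', '营销服务', '水', '平']
```

Because the tree is walked once per start position, "电力系统" wins over its prefix "电力" without trying every substring separately.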

Description

Technical field

[0001] The invention belongs to the field of electric power systems and relates to a data analysis method for electric power systems, in particular to a Chinese word segmentation method based on Lucene full-text retrieval.

Background art

[0002] In current power systems, especially in the marketing field, the volume of data is large and covers a wide range of areas, which makes it worth in-depth analysis and mining. However, because most of the data consists of written descriptions, the information is fuzzy and redundant, and it is difficult to analyze quantitatively and accurately with traditional data analysis methods.

Summary of the invention

[0003] In view of the problems in the prior art, the object of the present invention is to provide a Chinese word segmentation method based on Lucene full-text search. The method performs word segmentation on the massive Chinese text information collected in the system, not only ca...


Application Information

Patent type & authority: Applications (China)
IPC: IPC(8): G06F17/27
Inventors: 王成现, 王全强, 郝翠萍
Owner: JIANGSU ELECTRIC POWER INFORMATION TECH