Extraction method and retrieval method of customs data product words

A method for extracting product words from customs data, applied in the field of communication. It addresses three problems: product lexicons that cannot guarantee real-time performance, incorrect segmentation of words and sentences, and difficult keyword extraction. Its stated effects are a higher extraction rate, higher accuracy, and reduced extraction difficulty.

Pending Publication Date: 2020-11-20
深圳市小满科技有限公司

AI Technical Summary

Problems solved by technology

Because customs product descriptions contain a large amount of unstructured text, their keywords are difficult to extract, and no suitable algorithm extracts them effectively.
[0003] A traditional product thesaurus is usually built through manual entry and web collection, which is costly to maintain and does not guarantee real-time performance.
At present, the extraction of product words still faces the following problems: 1. Handwritten or manually entered words are prone to spelling errors; 2. Words and sentences are segmented incorrectly; 3. Product descriptions often cover the performance, quality, and similar properties of the product, and such descriptive sentences contain acronyms, numbers, stop words, symbols, etc., which easily lead to poor word segmentation results.

Examples

Detailed Description of the Embodiments

[0044] The technical solutions of the embodiments of the present invention are described clearly and completely below. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art on the basis of these embodiments, without creative work, fall within the protection scope of the present invention.

[0045] The present invention provides a method for extracting customs data product words, comprising the following steps:

[0046] S1. First, the format of the customs description text is unified; then special symbols other than hyphens are removed, and the spacing between words is standardized. The special symbols include, but are not limited to, one or more of: em dashes (—), en dashes (–), single quotation marks ('), double quotation marks ("), and ellipses (...). That is, the customs descripti...
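
Step S1 can be illustrated with a minimal Python sketch. The source text is truncated before it explains what "unifying the format" involves, so the sketch assumes it means Unicode normalization plus lowercasing. The function name clean_description and the exact symbol set are illustrative choices, not taken from the patent.

```python
import re
import unicodedata

# Assumed symbol set, built from the examples listed in S1.
# Hyphens are deliberately absent so that they survive cleaning.
SPECIAL_SYMBOLS = "—–‘’'\"“”"


def clean_description(text: str) -> str:
    """Unify format, drop special symbols except hyphens, normalize spacing."""
    # Assumption: "unified format" = NFKC normalization + lowercasing.
    # NFKC also rewrites the single-character ellipsis as three dots.
    text = unicodedata.normalize("NFKC", text).lower()
    # Remove ellipses (now always three ASCII dots).
    text = text.replace("...", " ")
    # Replace every listed special symbol with a space.
    text = re.sub("[" + re.escape(SPECIAL_SYMBOLS) + "]", " ", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

Replacing symbols with a space rather than deleting them outright keeps neighbouring words from being glued together; the final whitespace pass then restores single-space word intervals.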

Abstract

The invention provides an extraction method and a retrieval method for customs data product words. The extraction method comprises the following steps: first, cleaning the redundant parts of a customs description text and converting the text into a form that is easier to process; then, heuristically finding segmentation words in the customs description text and separating product words from description parts; replacing quantifier and date patterns in the text with space characters, or deleting them; and deleting the description part of the text by means of a grammatical rule, or extracting product word groups from the data using mutual information and left and right information entropy, to obtain product word groups of at most 5 words, which are added to a lexicon. The retrieval method comprises first segmenting the text to be retrieved into words and then searching the constructed lexicon through a bitmap or hash map structure. The method combines the grammatical structure, mutual information, character information, and specific structural information of customs data, so that the advantages of each kind of information are exploited and product words can be extracted and retrieved accurately.
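
The mutual-information and left/right-entropy scoring mentioned in the abstract can be sketched as follows. This is an illustration under assumptions, not the patented procedure: texts are split on whitespace, the mutual information of a word group is taken as the pointwise value at its weakest binary split, and the function names are hypothetical. The source gives no selection thresholds, so none are included.

```python
import math
from collections import Counter, defaultdict

MAX_WORDS = 5  # the abstract limits product word groups to at most 5 words


def entropy(counter):
    """Shannon entropy (natural log) of a frequency table; 0.0 if empty."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())


def score_candidates(docs, max_words=MAX_WORDS):
    """Return {word group: (mutual information, left entropy, right entropy)}."""
    counts = Counter()            # word group -> frequency
    totals = Counter()            # group length -> number of groups seen
    left = defaultdict(Counter)   # word group -> words seen just before it
    right = defaultdict(Counter)  # word group -> words seen just after it
    for doc in docs:
        words = doc.split()       # assumption: whitespace segmentation
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                counts[gram] += 1
                totals[n] += 1
                if i > 0:
                    left[gram][words[i - 1]] += 1
                if i + n < len(words):
                    right[gram][words[i + n]] += 1

    def prob(gram):
        return counts[gram] / totals[len(gram)]

    scores = {}
    for gram in counts:
        if len(gram) < 2:
            continue
        # A group is only as cohesive as its weakest internal split.
        mi = min(
            math.log(prob(gram) / (prob(gram[:k]) * prob(gram[k:])))
            for k in range(1, len(gram))
        )
        scores[gram] = (mi, entropy(left[gram]), entropy(right[gram]))
    return scores
```

High mutual information suggests the words belong together, while high entropy on both sides suggests the group appears in varied contexts and is therefore a free-standing unit. How the three numbers are combined and thresholded would have to be tuned on real customs data.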

Description

Technical field

[0001] The invention belongs to the technical field of communication, and in particular relates to a method for extracting and retrieving customs data product words.

Background

[0002] Customs data and courier data generally must contain descriptions of the goods being shipped. These descriptions refer to specific products and carry a large amount of product word information, such as product names, product attributes, manufacturer information, product functions, and advertising words. Cleaning and using these product words is therefore worthwhile. With a complete product thesaurus, existing text data can be retrieved quickly, which increases the utilization rate and retrieval efficiency of the text data. However, since this kind of product word contains a large number of unstructured words and sentences, it is difficult to extract its keywords, and it is difficult to have a suitable algorithm to effectively extract it. ...
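
Fast retrieval against a finished product lexicon, which the abstract says uses a bitmap or hash map structure, might look like the sketch below. A Python set stands in for the hash map, whitespace splitting stands in for word segmentation, and the greedy longest-match strategy is an assumption of this sketch. The function name find_product_words is hypothetical.

```python
MAX_WORDS = 5  # lexicon entries contain at most 5 words, per the abstract


def find_product_words(text, lexicon, max_words=MAX_WORDS):
    """Return lexicon entries found in text, preferring the longest match."""
    words = text.lower().split()  # assumption: whitespace segmentation
    hits = []
    i = 0
    while i < len(words):
        # Try the longest window first so "led lamp" beats "led".
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in lexicon:  # hash lookup, O(1) on average
                hits.append(candidate)
                i += n
                break
        else:
            i += 1  # no entry starts at this word; move on
    return hits
```

Because entries are capped at 5 words, each position needs at most 5 hash probes, so the scan stays linear in the length of the text regardless of lexicon size.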

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/33, G06F40/289, G06F40/129
CPC: G06F16/3344, G06F40/289, G06F40/129
Inventors: 车进, 曹彬
Owner: 深圳市小满科技有限公司