A Topic Classification Method for Lao Language Texts

A topic classification and text technology, applied in the fields of natural language processing and machine learning, that addresses the loss of information and misinterpretation of text, avoids the zero-probability problem, improves accuracy, and reduces the time needed to decide a classification.

Active Publication Date: 2022-04-12
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

Naive Bayes has a shortcoming, however: it assumes that all feature attributes are conditionally independent. This is equivalent to putting the text's feature information into a bag of words without considering word order, which often discards a lot of information and leads to misinterpretation of the text.


Examples


Embodiment 1

[0020] Embodiment 1: As shown in Figures 1-2, a Lao language text topic classification method comprises the following steps:

Step 1: Use web crawler technology to crawl Lao text. Texts in five categories are crawled: economy, politics, education, tourism, and general. They are stored in five corresponding folders, each named after its category to facilitate subsequent retrieval and processing. The crawled articles are then processed to remove noise words that are irrelevant to classification, so as to build a corpus.

Further, the noise words can be set to include emoticons, numbers, spaces, and stop words. Emoticons, numbers, and spaces are removed with regular expressions, and stop words are removed with a stop word table (words that appear in the table are removed). When removing the irrelevant noise words, the following regular expression encoding is used: u"^[\u0000-\u10ff...
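The noise-removal part of Step 1 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the regular expression in the text above is truncated, so the patterns `DIGITS`, `EMOTICONS`, and `WHITESPACE` and the helper `clean_tokens` are hypothetical stand-ins. The sketch also assumes the text has already been segmented into word tokens, which this excerpt does not describe.

```python
import re

# Hypothetical patterns: the source's own regular expression is truncated,
# so these are illustrative stand-ins, not the patent's expressions.
DIGITS = re.compile(r"[0-9\u0ED0-\u0ED9]+")  # ASCII digits and Lao digits
EMOTICONS = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]+")  # common emoji ranges
WHITESPACE = re.compile(r"\s+")


def clean_tokens(tokens, stop_words):
    """Strip emoticons, digits, and whitespace from each token,
    then drop empty tokens and tokens listed in the stop word table."""
    cleaned = []
    for tok in tokens:
        tok = EMOTICONS.sub("", tok)
        tok = DIGITS.sub("", tok)
        tok = WHITESPACE.sub("", tok)
        if tok and tok not in stop_words:
            cleaned.append(tok)
    return cleaned
```

Calling `clean_tokens` once per crawled article, with the stop word table loaded into a `set`, would yield the token lists that make up the corpus. Keeping the table as a set makes each membership check constant-time.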


Abstract

The invention discloses a Lao language text topic classification method, which belongs to the technical fields of natural language processing and machine learning. The invention combines the N-gram language feature extraction model with the naive Bayes mathematical model to recognize the topic of Lao articles. Because of its conditional independence assumption, naive Bayes treats a text as a bag of words and ignores the order information between words. By using unigram and bigram feature models at the same time, the invention removes this limitation to a certain extent and improves the recognition rate of the text.
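The combination described in the abstract can be illustrated with a small sketch: a multinomial naive Bayes classifier whose features are the unigrams of a token sequence plus its adjacent-pair bigrams, so that some local word order is retained. The names `ngram_features` and `NaiveBayes` are hypothetical. The add-alpha (Laplace) smoothing used here is an assumption, chosen as one common way to avoid zero probabilities; this excerpt does not say which smoothing the patent uses.

```python
import math
from collections import Counter, defaultdict


def ngram_features(tokens):
    """Return the unigrams of `tokens` plus adjacent-pair bigrams (as tuples)."""
    return list(tokens) + list(zip(tokens, tokens[1:]))


class NaiveBayes:
    """Multinomial naive Bayes over unigram and bigram features."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # add-alpha smoothing; an assumption, see lead-in
        self.class_docs = Counter()                 # documents seen per class
        self.feature_counts = defaultdict(Counter)  # feature counts per class
        self.vocab = set()                          # all features seen in training

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.class_docs[label] += 1
            for feat in ngram_features(tokens):
                self.feature_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for feat in ngram_features(tokens):
                score += math.log((counts[feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

For example, after `NaiveBayes().fit(docs, labels)` on tokenized articles labeled with the five crawled categories, `predict(tokens)` returns the category with the highest log posterior. Working in log space avoids numeric underflow on long articles.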

Description

Technical field

[0001] The invention relates to a Lao language text topic classification method, which belongs to the technical fields of natural language processing and machine learning.

Background

[0002] With the popularization of the network, the information available online is increasing exponentially. When users retrieve information with a search engine, the engine often returns thousands of relevant pages. How can users quickly and effectively locate the desired information without viewing these pages one by one? This is where topic recognition plays an important role. A pre-trained classifier can identify the topic of the content the user wants from the limited information the user enters, so that the system can respond effectively. The naive Bayes classification model is a method with a long history and a solid theoretical foundation. It is a direct and efficient method for dealing with many problems at the same time, and many ...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F16/9535
Inventor: 周兰江, 王兴金, 张建安, 周枫
Owner: KUNMING UNIV OF SCI & TECH