Spark framework-based parallelization method of text clustering model PW-LDA
A text clustering and model technology, applied in text database clustering/classification, unstructured text data retrieval, etc., can solve problems such as poor performance, hard disk I/O time-consuming, etc., to achieve large-scale data, speed up The program runs and the effect of high algorithm complexity
- Summary
- Abstract
- Description
- Claims
- Application Information
AI Technical Summary
Problems solved by technology
Method used
Image
Examples
Embodiment
[0024] Figure 1 to Figure 3 For the parallelization method of a kind of text clustering model PW-LDA based on Spark framework of the present invention, mainly comprise the following steps:
[0025] S1: Load the corpus data of scientific literature and initialize it as a distributed data type object of Spark.
[0026] S2: Segment the text in the imported corpus through the Map method, and preprocess the stop words to obtain training samples.
[0027] S3: Use Spark's Word2Vec interface to perform word vector training on the training samples.
[0028] S4: According to the result of Word2Vec, use the Partition algorithm to extract the target segment from the text of the training sample and realize the parallelism of the algorithm through the Map method.
[0029] S5: Use Spark's GraphX-based LDA interface to train the target segment extracted by the Partition algorithm to obtain a topic-word matrix.
[0030] S6: Calculate the topic vector according to the topic-word matrix obta...
PUM
Abstract
Description
Claims
Application Information
- R&D Engineer
- R&D Manager
- IP Professional
- Industry Leading Data Capabilities
- Powerful AI technology
- Patent DNA Extraction
Browse by: Latest US Patents, China's latest patents, Technical Efficacy Thesaurus, Application Domain, Technology Topic, Popular Technical Reports.
© 2024 PatSnap. All rights reserved.Legal|Privacy policy|Modern Slavery Act Transparency Statement|Sitemap|About US| Contact US: help@patsnap.com