
Unsupervised clustering method and system for metagenome contigs

An unsupervised clustering method and system for metagenomic contigs, in the field of metagenomics. It addresses three problems of existing methods: a strain-number parameter that strongly affects algorithm performance, the limited benefit of combining existing clustering results, and the large number of strains in complex samples.

Pending Publication Date: 2021-03-09
ZHEJIANG NORMAL UNIVERSITY

AI Technical Summary

Problems solved by technology

DAS Tool and MetaWRAP apply several clustering methods to the same data set at once. Some of those methods, such as CONCOCT and MyCC, perform poorly on complex data sets, and the ability of most methods to reconstruct genes from complex data sets needs improvement. Integrating their clustering results therefore increases the number and quality of reconstructed genes, but only to a limited extent.
[0004] The main problems of existing unsupervised clustering methods are:
1. The ability to reconstruct genes in complex environments needs to be improved.
2. The number of strains selected during clustering differs greatly from the real number. The number of strains is a key parameter of an unsupervised clustering method and strongly affects the algorithm's performance.
3. It is difficult to distinguish different strains of the same species in a sample, for two reasons. First, the reads are assembled with an assembly tool, and the high similarity between strains of the same species makes chimeras likely during assembly. Second, a complex environment contains a large number of strains, which are hard to separate effectively.


Examples


Embodiment 1

[0077] This embodiment provides a method for unsupervised clustering of metagenomic contigs, including steps:

[0078] S1. Assemble multiple reads into contigs by overlapping fragments;

[0079] S2. Use feature vectorization to calculate the tetranucleotide frequency and abundance of each contig;

[0080] S3. Set the initial number of species K_1 for the K-means algorithm, and randomly select K_1 contigs as the initial cluster centers;

[0081] S4. Using the tetranucleotide frequency and abundance of the contigs, calculate the probability of each contig belonging to each of the K_1 cluster centers, and assign each contig to the cluster center with the highest probability;

[0082] S5. Update the cluster center using the sample mean of each cluster;

[0083] S6. Repeat steps S4-S5 until the cluster center no...
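Steps S3 to S6 can be sketched as the loop below. This is a minimal illustration, not the patented implementation: the source does not specify the probability model, so `assignment_probabilities` uses a softmax over negative squared Euclidean distance as a stand-in, and the stopping rule assumes that the truncated step S6 ends when the cluster centers stop changing. The function names, `max_iter`, and `seed` are hypothetical.

```python
import math
import random


def assignment_probabilities(vector, centers):
    """Stand-in probability model (an assumption, not from the source):
    softmax over the negative squared distance to each center."""
    scores = [-sum((v - c) ** 2 for v, c in zip(vector, center))
              for center in centers]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def cluster_contigs(features, k1, max_iter=100, seed=0):
    """Cluster contig feature vectors into k1 groups."""
    rng = random.Random(seed)
    # S3: pick K_1 contigs at random as the initial cluster centers.
    centers = [list(features[i])
               for i in rng.sample(range(len(features)), k1)]
    labels = [0] * len(features)
    for _ in range(max_iter):
        # S4: assign each contig to its most probable cluster center.
        for i, vec in enumerate(features):
            probs = assignment_probabilities(vec, centers)
            labels[i] = max(range(k1), key=probs.__getitem__)
        # S5: recompute each center as the mean of its members.
        new_centers = []
        for j in range(k1):
            members = [features[i] for i in range(len(features))
                       if labels[i] == j]
            if members:
                new_centers.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        # S6 (assumed): stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```

With this stand-in model, choosing the most probable center is the same as choosing the nearest one, so the loop reduces to ordinary K-means; a trained probability model would change only `assignment_probabilities`.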

Embodiment 2

[0126] This embodiment provides an unsupervised clustering method for metagenomic contigs. It differs from Embodiment 1 as follows:

[0127] In this embodiment, the method of the present invention is tested on three data sets of different complexity and compared, on the same data sets, with the state-of-the-art clustering tools CONCOCT, MaxBin 2.0, MetaBAT, MyCC, and COCACOLA.

[0128] The data are the simulated benchmark data sets from the CAMI paper, which were generated to provide a unified evaluation standard for clustering algorithms. They comprise a low-complexity data set (40 genomes and 20 circular elements), a medium-complexity data set (132 genomes and 100 circular elements), and a high-complexity data set (596 genomes and 478 circular elements). These data sets are derived from approximately 700 newly sequenced microbial isolates and 600 circular element genomes that differ from the strain, species, genus, or sequence represented by the public genomes...

Embodiment 3

[0136] This embodiment provides an unsupervised clustering system for metagenomic contigs, including:

[0137] The assembly module is used to assemble multiple reads into contigs by overlapping fragments;

[0138] The first calculation module is used to calculate, by feature vectorization, the tetranucleotide frequency and co-abundance of each contig;

[0139] The selection module is used to set the initial number of species K_1 for the K-means algorithm, and to randomly select K_1 contigs as the initial cluster centers;

[0140] The second calculation module is used to calculate, from the tetranucleotide frequency and abundance of the contigs, the probability of each contig belonging to each of the K_1 cluster centers, and to assign each contig to the cluster center with the highest probability;

[0141] An update module for...
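The feature vectorization performed by the first calculation module could look roughly like the sketch below, which counts every 4-base oligonucleotide (tetranucleotide) in a contig, normalizes the counts to frequencies, and appends abundance values. It is an illustration under assumptions: the source does not say how abundance is measured, so `abundances` is taken to be a list of per-sample coverage values supplied by the caller, and the function names are hypothetical.

```python
from itertools import product

# All 256 possible tetranucleotides in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(TETRAMERS)}


def tetranucleotide_frequencies(sequence):
    """Return the normalized frequency of each tetranucleotide."""
    counts = [0] * len(TETRAMERS)
    seq = sequence.upper()
    for start in range(len(seq) - 3):
        idx = INDEX.get(seq[start:start + 4])
        if idx is not None:  # skip windows containing N or other symbols
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def contig_feature_vector(sequence, abundances):
    """Concatenate tetranucleotide frequencies with abundance values.

    `abundances` is assumed to hold one coverage value per sample.
    """
    return tetranucleotide_frequencies(sequence) + [float(a) for a in abundances]
```

For example, `contig_feature_vector("ACGTACGT", [12.5, 3.0])` yields a 258-element vector: 256 frequencies followed by the two abundance values.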


Abstract

The invention discloses an unsupervised clustering method for metagenome contigs. Before clustering, the reads of all samples are pooled to construct a gene library, the reads are assembled into contigs with an assembly tool, and each contig is converted to a feature vector based on its tetranucleotide frequency and co-abundance. Clustering is then carried out according to a pre-trained probability model and a recursion strategy. CheckM is introduced to check the quality of the clustering result and to reduce the complexity of the sample, and the clustering result serves as the termination condition of the algorithm. The result of marker gene analysis is used to initialize the number of strains and the cluster center sequences.

Description

Technical field

[0001] The invention relates to the technical field of biological information analysis, in particular to an unsupervised clustering method and system for metagenomic contigs.

Background

[0002] Before the emergence of metagenomic technology, microorganisms were studied mainly through pure culture of a single microbial species. In the natural environment, however, most microorganisms are difficult or impossible to grow in pure culture on a culture medium. Metagenomic technology emerged with the development of second-generation sequencing. It obtains the genetic material of all microorganisms in a sample directly from the natural environment, without the pure culture on a medium that traditional methods require. This provides new research ideas for scientists to study the community structure of microorganisms, the interaction between microorganisms, and the relationship between microorganisms, the env...


Application Information

IPC(8): G16B40/30, G16B30/20, G06K9/62
CPC: G16B40/30, G16B30/20, G06F18/23213
Inventor: 李小波, 姜忠俊
Owner: ZHEJIANG NORMAL UNIVERSITY