Multi-species feature selection and unknown gene identification methods

A multi-species feature selection technology, applied in the field of life sciences, that addresses the incomplete understanding of non-coding RNA expression regulation and the lack of a unified standard for identifying non-coding RNAs.

Active Publication Date: 2017-02-22
TSINGHUA UNIV

AI Technical Summary

Problems solved by technology

Non-coding RNAs perform complex and delicate regulatory functions in organisms, and research on their functions has attracted widespread interest from biologists. However, unlike for protein-coding genes, we lack a comprehensive understanding of the expression regulation of non-coding RNAs, and the biophysical characteristics of different non-coding RNAs also differ. As a result, the identification of non-coding RNAs within and between species still lacks a unified standard, and high-throughput identification of all types of non-coding RNAs remains an extremely difficult and challenging problem for which corresponding computational methods need to be developed.

Examples

Embodiment 1

[0076] For each species (diagonal elements), non-coding RNAs are well distinguished from defined protein-coding, untranslated, and negative control regions. To verify the robustness of the common features selected by RNAfeature, the accuracy of its cross-species prediction was tested.

[0077] Figure 3 shows the accuracy of cross-species prediction of non-coding RNAs: the accuracy (ACC) with which the models of four species (H for human, M for mouse, F for Drosophila, W for nematode) distinguish four types of genomic elements. The accuracy of within-species predictions is shown on the diagonal, and the accuracy of cross-species predictions is shown in the off-diagonal elements. In the figure, red represents the defined protein-coding regions (CDS), purple represents the 5' and 3' untranslated regions (UTRs), green represents canonical non-coding RNAs (canonical ncRNAs), and blue represents the negative control regions (negative control).
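The train-on-one-species, test-on-every-species evaluation described above can be sketched as follows. This is only an illustration under stated assumptions: the nearest-centroid classifier, the function names, and the toy feature vectors are hypothetical stand-ins, not the RNAfeature model or its data.

```python
# Sketch of a cross-species accuracy matrix (assumed stand-in classifier).

def fit_centroids(X, y):
    """Average the feature vectors of each class (nearest-centroid model)."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, x):
    """Return the class whose centroid is closest to x (squared Euclidean)."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(centroids[label], x)),
    )

def accuracy(centroids, X, y):
    """Fraction of samples whose predicted class matches the true class."""
    hits = sum(predict(centroids, x) == label for x, label in zip(X, y))
    return hits / len(y)

def cross_species_accuracy(datasets):
    """datasets: {species: (X, y)}. Returns acc[train_species][test_species]."""
    models = {sp: fit_centroids(X, y) for sp, (X, y) in datasets.items()}
    return {
        train: {test: accuracy(models[train], *datasets[test]) for test in datasets}
        for train in datasets
    }

# Hypothetical toy data: two features per genomic element, two species.
datasets = {
    "H": ([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]],
          ["ncRNA", "ncRNA", "CDS", "CDS"]),
    "F": ([[0.7, 0.3], [0.6, 0.2], [0.3, 0.7], [0.2, 0.6]],
          ["ncRNA", "ncRNA", "CDS", "CDS"]),
}
acc = cross_species_accuracy(datasets)
```

Here `acc["H"]["F"]` is the off-diagonal entry for a human-trained model applied to Drosophila data. In this sketch the diagonal entries are computed on the training data itself; a real within-species estimate would use held-out samples.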

...

Embodiment 2

[0082] RNAfeature uses only canonical non-coding RNAs (including rRNA, tRNA, snRNA, miRNA, Y RNA, etc.) as the positive sample set, so it is necessary to test the ability of the common features to predict new non-coding RNAs. For this reason, the present invention designs a cross-validation across categories: first, a certain type of non-coding RNA (such as rRNA, tRNA, snRNA, miRNA, or Y RNA) is removed from the RNAfeature training set; a model is then trained on the common features; and that model is used to predict the removed type of non-coding RNA.
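The cross-category validation above amounts to a leave-one-type-out loop, sketched below. The 1-nearest-neighbour classifier, the function names, and the toy samples are assumptions made for illustration; the source does not say which classifier RNAfeature uses.

```python
# Sketch of leave-one-ncRNA-type-out validation (assumed stand-in classifier).

def nearest_label(train, x):
    """1-nearest-neighbour: label of the training vector closest to x."""
    closest = min(train, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)))
    return closest[1]

def leave_one_type_out(samples):
    """samples: list of (features, label, subtype); subtype is None for
    elements that are not ncRNAs. For each ncRNA subtype, train without it
    and return the fraction of its members still predicted as 'ncRNA'."""
    subtypes = sorted({s for _, _, s in samples if s is not None})
    recovered = {}
    for held_out in subtypes:
        train = [(x, y) for x, y, s in samples if s != held_out]
        test = [x for x, _, s in samples if s == held_out]
        hits = sum(nearest_label(train, x) == "ncRNA" for x in test)
        recovered[held_out] = hits / len(test)
    return recovered

# Hypothetical toy samples: (feature vector, class, ncRNA subtype).
samples = [
    ([0.90, 0.80], "ncRNA", "tRNA"),
    ([0.85, 0.90], "ncRNA", "tRNA"),
    ([0.80, 0.85], "ncRNA", "miRNA"),
    ([0.95, 0.75], "ncRNA", "miRNA"),
    ([0.10, 0.20], "CDS", None),
    ([0.20, 0.10], "CDS", None),
]
recovered = leave_one_type_out(samples)
```

A high value for a held-out subtype suggests that the features generalize to ncRNA types the model never saw during training, which is the property the embodiment sets out to test.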

[0083] As shown in Figure 4, using the human data as an example, the boxplots show the probability distribution (y-axis) with which a specific type of genomic element (title of each panel) is predicted as each class (x-axis). Different colors represent different classes: red for defined protein-coding regions (CDS), purple for 5' and 3' untranslated regions (UTRs), green for canonical ncRNAs, and bl...

Embodiment 3

[0086] Cross-species cross-validation was also performed for the remaining species. As shown in Figure 5, in the data of 4 species (human, mouse, Drosophila and C. ... accuracy of ... coding RNAs (canonical ncRNAs) and negative control regions. NA indicates that this type of non-coding RNA is not present in this species. Darker colors represent higher accuracy.

[0087] The results showed that the accuracy of the cross-species cross-validation was high for all four species, and much higher than that of random guessing. In comparison, the accuracy of the human and mouse models is higher than that of the Drosophila and nematode models, which may be because humans and mice have a greater number of annotated non-coding RNAs, so their models are trained more fully.

Abstract

The invention discloses multi-species feature selection and unknown gene identification methods, belonging to the field of life sciences. The multi-species feature selection method comprises the steps of: computing feature values for small fragment regions covering a whole genome; labeling the regions; performing feature selection within each species; and performing feature selection between species. By integrating the commonality of genes among different species, an efficient and accurate computational method is constructed, which is used for accurate identification and description of unknown genes.
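The four steps in the abstract can be sketched as a small pipeline. Everything specific here is an assumption for illustration: the bin size, the mean-difference separation score, the top-k cutoff, and the use of a set intersection for the between-species step are not stated in the source.

```python
# Sketch of the abstract's steps: bin the genome, label bins, select features
# within each species, then keep features shared across species.
from statistics import mean, pstdev

def tile_genome(length, bin_size, step):
    """(start, end) coordinates of small fragment regions covering a sequence."""
    return [(s, min(s + bin_size, length)) for s in range(0, length, step)]

def separation_score(pos, neg):
    """Assumed score: absolute difference of class means over overall spread."""
    spread = pstdev(pos + neg)
    return abs(mean(pos) - mean(neg)) / spread if spread > 0 else 0.0

def select_within_species(rows, labels, top_k):
    """rows: one {feature: value} dict per bin; labels: True for ncRNA bins.
    Returns the top_k features that best separate the two groups of bins."""
    scores = {}
    for name in rows[0]:
        pos = [r[name] for r, y in zip(rows, labels) if y]
        neg = [r[name] for r, y in zip(rows, labels) if not y]
        scores[name] = separation_score(pos, neg)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:top_k])

def select_across_species(per_species):
    """Keep only the features selected in every species."""
    return set.intersection(*per_species.values())

bins = tile_genome(length=10, bin_size=4, step=4)

# Hypothetical labeled bins for two species (feature names are made up).
labels = [True, True, False, False]
human = [{"gc": 0.60, "cons": 0.90, "noise": 0.5},
         {"gc": 0.62, "cons": 0.80, "noise": 0.1},
         {"gc": 0.40, "cons": 0.20, "noise": 0.5},
         {"gc": 0.38, "cons": 0.10, "noise": 0.1}]
fly = [{"gc": 0.5, "cons": 0.90, "noise": 0.2},
       {"gc": 0.3, "cons": 0.85, "noise": 0.8},
       {"gc": 0.5, "cons": 0.10, "noise": 0.3},
       {"gc": 0.3, "cons": 0.20, "noise": 0.9}]

per_species = {
    "human": select_within_species(human, labels, top_k=2),
    "fly": select_within_species(fly, labels, top_k=2),
}
common = select_across_species(per_species)
```

In this toy data only the conservation-like feature `cons` separates ncRNA bins in both species, so it is the only feature that survives the between-species step.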

Description

Technical field

[0001] The invention relates to the field of life sciences, in particular to methods for multi-species feature selection and identification of unknown genes.

Background technique

[0002] A number of tools for predicting the protein-coding probability of transcripts have been published, including CONC, CPC, PhyloCSF, RNAcode, PLEK, CNCI, CNCTDiscriminator, CPAT, HMMER, and lncRNA-ID (1-10), but the vast majority of these tools use only the sequence information of the transcripts. This sequence information includes but is not limited to: open reading frame (ORF) features, such as ORF length and coverage (1,2,4,7,9); nucleotide frequency features, such as k-mer sequence patterns and codon usage (1,2,5,7-9); conservation score features, such as nucleotide sequence alignment or protein sequence alignment (1-4); evolution-related features, such as substitution rate and phylogenetic score (7,10) and in ...
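Two of the sequence features named in the background, ORF length and coverage and k-mer frequencies, can be computed as in the sketch below. The function names are hypothetical, and the ORF definition is a simplifying assumption (forward strand only, ATG to the first in-frame stop codon, stop codon included); the cited tools each define these features in their own way.

```python
# Sketch of simple sequence features: longest ORF, ORF coverage, k-mer frequencies.
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Length in nucleotides of the longest forward-strand ORF (ATG..stop)."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open an ORF at the first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None  # close the ORF and look for the next start
    return best

def orf_coverage(seq):
    """Fraction of the transcript covered by its longest ORF."""
    return longest_orf(seq) / len(seq) if seq else 0.0

def kmer_frequencies(seq, k):
    """Relative frequency of each overlapping k-mer in the sequence."""
    seq = seq.upper()
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: count / total for kmer, count in counts.items()}

transcript = "CCATGAAATAGGG"  # hypothetical toy transcript
orf_len = longest_orf(transcript)
coverage = orf_coverage(transcript)
dimer_freqs = kmer_frequencies("AAAA", 2)
```

For the toy transcript, the only ORF is `ATGAAATAG` in the third reading frame, so the longest ORF is 9 nucleotides of the 13-nucleotide sequence.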


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F19/10
CPC: G16B99/00
Inventor: 鲁志, 胡龙
Owner TSINGHUA UNIV