Method for screening disease-related proteins based on complex network

Complex-network and disease-related protein technology, applied to the screening of disease-related proteins based on complex networks, addresses problems such as model underfitting, overfitting, and insufficient learning, with the effects of reducing workload and improving accuracy and precision.

Active Publication Date: 2020-09-08
天士力国际基因网络药物创新中心有限公司

AI Technical Summary

Problems solved by technology

However, a large amount of unlabeled data is likely to cause underfitting or overfitting during machine learning, so that the model either fails to learn the information in the entire sample space or has insufficient generalization ability.



Examples


Embodiment 1

[0038] Example 1: Coronary heart disease

[0039] This embodiment demonstrates technical effect 1 of the invention: extracting the structural features of proteins in the protein interaction network with the node2vec algorithm, rather than with traditional topological properties, improves the accuracy of protein identification in the network. In step S5 of this example, the combination of the node2vec algorithm and the PU-learning algorithm yielded 958 coronary heart disease-related proteins, with an accuracy rate of 92.35% and a recall rate of 56.89%, whereas the combination of the topological-property algorithm and the PU-learning algorithm yielded 8348 coronary heart disease-related proteins, with a precision rate of 32.93% and a recall rate of 10.66%.
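For reference, the precision and recall rates quoted above follow the usual set-based definitions. The sketch below is a minimal illustration, assuming the predicted proteins and a reference set of known disease-related proteins are available as sets of identifiers. The function name `precision_recall` and the toy identifiers are hypothetical, and this excerpt does not state exactly how the patent's figures were evaluated.

```python
def precision_recall(predicted, known_positive):
    """Return (precision, recall) of a predicted set against a reference set."""
    predicted = set(predicted)
    known_positive = set(known_positive)
    true_positive = len(predicted & known_positive)
    # Precision: share of predicted proteins that are in the reference set.
    precision = true_positive / len(predicted) if predicted else 0.0
    # Recall: share of reference proteins that were predicted.
    recall = true_positive / len(known_positive) if known_positive else 0.0
    return precision, recall


# Toy identifiers for illustration only, not data from the patent.
predicted = {"P1", "P2", "P3", "P4"}
reference = {"P1", "P2", "P3", "P5", "P6", "P7"}
precision, recall = precision_recall(predicted, reference)
print(f"precision={precision:.2%} recall={recall:.2%}")
```

A method that returns many candidates, most of them unsupported by the reference set, scores low on precision even if its absolute number of hits is large, which is the pattern the comparison above describes.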

[0040] As shown in Figure 1, the embodiment of the present invention provides a method for screening disease-related proteins based on a comp...

Embodiment 2

[0046] Example 2: Ischemic cardiomyopathy

[0047] This embodiment supports the claimed range of the node2vec embedding dimension d: a dimension of 128-256 is selected to obtain more stable protein interaction network analysis results. When the dimension is 64, the reliable negative sample set (RN) contains fewer samples. When the dimension is 512, the sensitivity of the model on the validation set drops sharply.

[0048] S1: Obtain 270 seed genes related to ischemic cardiomyopathy (set P) from the GWAS Catalog, IPA, DisGeNET and other databases;

[0049] S2: Based on protein-protein interaction databases (BIOGRID, HPRD, INTACT, STRING), construct a protein interaction network centered on the ischemic cardiomyopathy seed genes; the network consists of 9329 proteins and 30274 protein-protein interactions. 263 of the 270 seed genes in S1 were identified in the protein interactio...

Embodiment 3

[0054] Example 3: Atrial fibrillation

[0055] This embodiment supports the claimed parameter ranges of node2vec: the preferred range of p is [2, 5], and the preferred range of q is [0.1, 3].
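To make the roles of p and q concrete, the following is a minimal sketch of the second-order biased random walk that node2vec is built on, written with the Python standard library only. It assumes an unweighted, undirected graph stored as a dictionary of neighbor sets. The function name `biased_walk`, the toy graph, and the chosen values p = 2 and q = 0.5 (picked from inside the preferred ranges above) are illustrative assumptions, not the patent's implementation.

```python
import random


def biased_walk(adj, start, length, p, q, rng):
    """Generate one node2vec-style second-order random walk of `length` nodes."""
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(adj[current])
        if not neighbors:
            break  # dead end: no edge to follow
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)  # step back to the previous node
            elif candidate in adj[previous]:
                weights.append(1.0)      # stay within one hop of the previous node
            else:
                weights.append(1.0 / q)  # move farther away from the previous node
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Toy undirected graph for illustration only.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}
walk = biased_walk(adj, start="A", length=6, p=2.0, q=0.5, rng=random.Random(0))
print(walk)
```

With p greater than 1 the walk is less likely to step straight back, and with q below 1 it is biased toward nodes farther from where it came from. In the full algorithm many such walks are fed to a skip-gram model to produce d-dimensional node embeddings.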

[0056] S1: Obtain 141 seed genes related to atrial fibrillation (set P) from the GWAS Catalog, Malacards, DisGeNET and other databases;

[0057] S2: Based on protein-protein interaction databases (BIOGRID, HPRD, INTACT, STRING), construct a protein interaction network centered on the atrial fibrillation seed genes; the network consists of 5745 proteins and 13606 protein-protein interactions. 131 of the 141 seed genes in S1 were identified in the protein interaction database, and 10 were not.
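The network construction in S2 can be sketched as follows. This excerpt does not say how "with the seed gene as the core" is defined, so the sketch assumes the simplest reading: keep every interaction in which at least one partner is a seed. The function name `build_seed_network` and the toy interaction list are hypothetical.

```python
def build_seed_network(interactions, seeds):
    """Keep interactions touching a seed; return (adjacency, found seeds, missing seeds)."""
    seeds = set(seeds)
    adjacency = {}
    for a, b in interactions:
        if a == b:
            continue  # ignore self-interactions
        # Assumption: an edge belongs to the network if either end is a seed.
        if a in seeds or b in seeds:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    found = seeds & set(adjacency)
    missing = seeds - found  # seeds with no interaction in the database
    return adjacency, found, missing


# Toy interaction pairs for illustration only.
interactions = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]
seeds = {"A", "C", "Z"}
adjacency, found, missing = build_seed_network(interactions, seeds)
n_edges = sum(len(partners) for partners in adjacency.values()) // 2
print(len(adjacency), "proteins,", n_edges, "interactions")
print("seeds found:", sorted(found), "not found:", sorted(missing))
```

The `found` and `missing` sets correspond to the bookkeeping reported above, where some seed genes are not identified in the interaction databases.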

[0058] S3: Using the node2vec algorithm, extract the feature data of the 9329 proteins in the S2 protein interaction network; during implementation, set the random walk parameters wa...


Abstract

The invention discloses a method for screening disease-related proteins based on a complex network. The method comprises the following steps: 1) acquiring seed genes related to a target disease; 2) constructing, based on a protein-protein interaction database, a protein interaction network with the seed genes as the core; 3) extracting feature data of the proteins in the protein interaction network; 4) using the feature data of the proteins as training data and training a PU classifier with a machine learning algorithm; and 5) predicting the proteins related to the target disease in the protein interaction network with the PU classifier. The method can identify disease-related proteins quickly and efficiently, which helps biomedical experts carry out experimental verification and related researchers carry out their work.
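Steps 4) and 5) can be illustrated with a generic two-step PU (positive-unlabeled) learning sketch. This excerpt does not specify the classifier, so the code below swaps in a deliberately simple nearest-centroid rule: unlabeled proteins farthest from the centroid of the seed set P are taken as reliable negatives (RN), and the remaining unlabeled proteins are then assigned to whichever centroid is closer. The function names, the `rn_fraction` parameter, and the toy two-dimensional features are all hypothetical; real inputs would be the node embeddings from step 3).

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def pu_predict(features, positives, rn_fraction=0.3):
    """Two-step PU sketch: pick reliable negatives, then classify the rest."""
    unlabeled = [name for name in features if name not in positives]
    if not unlabeled:
        return set()
    pos_center = centroid([features[name] for name in positives])
    # Step 1: the unlabeled points farthest from the positives form RN.
    ranked = sorted(
        unlabeled,
        key=lambda name: math.dist(features[name], pos_center),
        reverse=True,
    )
    n_rn = max(1, int(len(ranked) * rn_fraction))
    reliable_negatives = set(ranked[:n_rn])
    neg_center = centroid([features[name] for name in reliable_negatives])
    # Step 2: label the remaining unlabeled points by the nearer centroid.
    return {
        name
        for name in unlabeled
        if name not in reliable_negatives
        and math.dist(features[name], pos_center)
        < math.dist(features[name], neg_center)
    }


# Toy 2-D features for illustration only; s1 and s2 play the role of set P.
features = {
    "s1": [0.0, 0.0], "s2": [0.2, 0.1],
    "u1": [0.1, 0.2], "u2": [5.0, 5.0],
    "u3": [4.8, 5.1], "u4": [0.3, 0.0],
}
predicted = pu_predict(features, positives={"s1", "s2"}, rn_fraction=0.5)
print(sorted(predicted))
```

In practice the second step would normally use a stronger classifier trained on P and RN; the centroid rule is used here only to keep the sketch self-contained.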

Description

Technical field

[0001] The invention relates to the technical field of protein screening, in particular to a method for screening disease-related proteins based on complex networks.

Background art

[0002] The identification of disease-related proteins plays an important role in the molecular typing, diagnosis, and treatment of diseases. Accurate and efficient identification of disease-related proteins helps to discover disease-causing genes and identify drug targets, which is of far-reaching significance for disease diagnosis and drug design. As an important research tool for exploring disease susceptibility genes, GWAS can quickly discover the more significant disease susceptibility loci. However, GWAS makes poor use of its data and masks a large number of potentially significant disease-related proteins. At the same time, the single-point association analysis of traditional GWAS treats each gene in the body independently, ignoring the interaction between ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G16B40/20, G16B25/10, G06N20/00
CPC: G06N20/00, G16B25/10, G16B40/20
Inventor: 李旭, 任静, 王学敏, 张文, 闫凯境
Owner: 天士力国际基因网络药物创新中心有限公司