A method for screening disease-associated proteins based on complex networks

A complex-network technique for screening disease-related proteins. It addresses model underfitting, overfitting, and insufficient learning, and achieves improved accuracy and a reduced workload.

Active Publication Date: 2021-08-24
天士力国际基因网络药物创新中心有限公司

AI Technical Summary

Problems solved by technology

However, a large amount of unlabeled data is likely to cause underfitting or overfitting during machine learning, so that the model either fails to learn the information in the entire sample space or generalizes poorly.

Examples

Embodiment 1

[0038] Example 1: Coronary heart disease

[0039] This embodiment demonstrates technical effect 1 of the invention: using the node2vec algorithm to extract structural features of proteins in the protein interaction network, rather than traditional topological properties, improves the accuracy of protein recognition in the network. In step S5 of this example, the combination of the node2vec algorithm and the PU-learning algorithm identified 958 coronary heart disease-related proteins, with an accuracy rate of 92.35% and a recall rate of 56.89%. By comparison, the combination of the topological property algorithm and the PU-learning algorithm identified 8348 coronary heart disease-related proteins, with a precision rate of 32.93% and a recall rate of 10.66%.
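The percentages above are set-overlap metrics. As a minimal sketch of how precision and recall are conventionally computed for a predicted protein set, the following uses hypothetical toy identifiers; the excerpt does not state how the reference set was defined, so `reference` here is an assumption for illustration only.

```python
def precision_recall(predicted, known_positive):
    """Return (precision, recall) of a predicted set against a reference set."""
    predicted = set(predicted)
    known_positive = set(known_positive)
    true_positive = len(predicted & known_positive)
    # Precision: share of predictions that are in the reference set.
    precision = true_positive / len(predicted) if predicted else 0.0
    # Recall: share of the reference set that was predicted.
    recall = true_positive / len(known_positive) if known_positive else 0.0
    return precision, recall


# Hypothetical toy identifiers, not data from the patent.
predicted = {"P1", "P2", "P3", "P4"}
reference = {"P1", "P2", "P5", "P6", "P7", "P8"}
precision, recall = precision_recall(predicted, reference)
print(f"precision={precision:.2%}, recall={recall:.2%}")
```

A method that returns many candidates can still score poorly on both metrics if few of them overlap the reference set, which is the pattern reported for the topological-property combination.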

[0040] As shown in Figure 1, the embodiment of the present invention provides a method for screening disease-related proteins based on a comp...

Embodiment 2

[0046] Example 2: Ischemic cardiomyopathy

[0047] This embodiment supports the claimed range of the node2vec embedding dimension d: a dimension of 128 to 256 is selected to obtain more stable protein interaction network analysis results. When the dimension is 64, the reliable negative sample set (RN) contains fewer samples. When the dimension is 512, the sensitivity of the model on the validation set drops sharply.

[0048] S1: Obtain 270 seed genes related to ischemic cardiomyopathy based on GWAS catalog, IPA, DisGeNET and other databases (set P);

[0049] S2: Based on protein-protein interaction databases (BIOGRID, HPRD, INTACT, STRING), construct a protein interaction network with the ischemic cardiomyopathy seed genes as the core. The network consists of 9329 proteins and 30274 protein-protein interactions; 263 of the 270 seed genes in S1 were identified in the protein interactio...
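Step S2 can be illustrated with a small sketch. It assumes the interaction databases have already been merged into a list of protein pairs, and it assumes that "with the seed genes as the core" means keeping proteins within a fixed number of hops of a seed. The function name `build_seed_network` and the `max_hops` parameter are hypothetical and do not come from the patent.

```python
from collections import defaultdict, deque


def build_seed_network(interactions, seeds, max_hops=1):
    """Build an undirected adjacency map limited to proteins within
    max_hops of any seed that appears in the interaction data."""
    full = defaultdict(set)
    for a, b in interactions:
        if a != b:  # drop self-interactions
            full[a].add(b)
            full[b].add(a)

    # Seeds present in the merged interaction data, and those that are not.
    found = {s for s in seeds if s in full}
    missing = set(seeds) - found

    # Breadth-first search outward from all found seeds at once.
    depth = {s: 0 for s in found}
    queue = deque(found)
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue
        for neighbor in full[node]:
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)

    kept = set(depth)
    network = {n: full[n] & kept for n in kept}
    return network, found, missing


# Hypothetical toy data; duplicate pairs mimic overlap between databases.
interactions = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("A", "B")]
network, found, missing = build_seed_network(interactions, {"A", "Z"})
n_edges = sum(len(nbrs) for nbrs in network.values()) // 2
print(len(network), "proteins,", n_edges, "interactions; missing seeds:", missing)
```

Reporting `found` and `missing` mirrors the way the embodiments state how many seed genes were identified in the interaction databases and how many were not.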

Embodiment 3

[0054] Example 3: Atrial fibrillation

[0055] This embodiment supports the claimed parameter ranges of node2vec: the preferred range of p is [2, 5], and the preferred range of q is [0.1, 3].
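For context, p and q control how node2vec biases its second-order random walk: after stepping from a previous node to the current node, returning to the previous node is weighted 1/p, moving to a neighbor shared with the previous node is weighted 1, and moving farther away is weighted 1/q. The sketch below shows this standard rule on an unweighted toy graph; the graph and function names are hypothetical, and this is not the patent's implementation.

```python
import random


def transition_weights(graph, prev, curr, p, q):
    """Unnormalized node2vec weights for leaving curr, having come from prev."""
    weights = {}
    for nxt in graph[curr]:
        if nxt == prev:
            weights[nxt] = 1.0 / p   # return to the previous node
        elif nxt in graph[prev]:
            weights[nxt] = 1.0       # stay close: also a neighbor of prev
        else:
            weights[nxt] = 1.0 / q   # move outward, away from prev
    return weights


def biased_walk(graph, start, length, p, q, rng):
    """Generate one biased random walk of up to `length` nodes."""
    walk = [start]
    while len(walk) < length:
        curr = walk[-1]
        neighbors = sorted(graph[curr])
        if not neighbors:
            break
        if len(walk) == 1:
            # First step has no previous node, so choose uniformly.
            walk.append(rng.choice(neighbors))
        else:
            w = transition_weights(graph, walk[-2], curr, p, q)
            walk.append(rng.choices(neighbors, weights=[w[n] for n in neighbors])[0])
    return walk


# Hypothetical toy graph as an adjacency map.
graph = {"A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B"}, "D": {"B"}}
weights = transition_weights(graph, prev="A", curr="B", p=2, q=0.5)
walk = biased_walk(graph, "A", 5, p=2, q=0.5, rng=random.Random(0))
print(weights, walk)
```

With p = 2 the walk is discouraged from immediately backtracking, and with q below 1 it is encouraged to move outward; both values here fall inside the preferred ranges stated above.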

[0056] S1: Based on GWAS catalog, Malacards, DisGeNET and other databases, 141 atrial fibrillation-related seed genes (set P) were obtained;

[0057] S2: Based on protein-protein interaction databases (BIOGRID, HPRD, INTACT, STRING), construct a protein interaction network with the atrial fibrillation seed genes as the core. The network consists of 5745 proteins and 13606 protein-protein interactions; 131 of the 141 seed genes in S1 were identified in the protein interaction databases, and 10 were not identified.

[0058] S3: Based on the node2vec algorithm, extract the characteristic data of 9329 proteins in the S2 protein interaction network; during the implementation process, set the random walk parameters wa...

Abstract

The invention discloses a method for screening disease-related proteins based on a complex network. The method comprises: 1) obtaining the seed genes related to the target disease; 2) constructing a protein interaction network with the seed genes as the core; 3) extracting the feature data of the proteins in the protein interaction network; 4) using the feature data of the proteins as training data and training a PU classifier with a machine learning algorithm; 5) using the PU classifier to predict the proteins in the protein interaction network that are associated with the target disease. The method can quickly and efficiently identify disease-related proteins, which helps biomedical experts carry out experimental verification and helps other researchers in related work.
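The PU (positive-unlabeled) classification steps can be sketched as a generic two-step scheme: first pick reliable negatives (RN) from the unlabeled proteins, then train a classifier on positives versus RN and apply it to the rest. The sketch below is a deliberate simplification under stated assumptions: it uses distance to the positive centroid to choose RN and a nearest-centroid rule as the classifier. The excerpt does not specify the patent's actual PU algorithm, and `pu_screen` and `rn_fraction` are hypothetical names.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def pu_screen(features, positives, rn_fraction=0.3):
    """Two-step PU sketch: select reliable negatives, then classify the rest."""
    pos_center = centroid([features[n] for n in positives])
    unlabeled = [n for n in features if n not in positives]
    if not unlabeled:
        return [], []

    # Step 1: unlabeled proteins farthest from the positives become RN.
    ranked = sorted(
        unlabeled,
        key=lambda n: math.dist(features[n], pos_center),
        reverse=True,
    )
    n_rn = max(1, int(len(ranked) * rn_fraction))
    reliable_neg = ranked[:n_rn]
    neg_center = centroid([features[n] for n in reliable_neg])

    # Step 2: nearest-centroid rule trained on positives vs RN,
    # applied to the remaining unlabeled proteins.
    candidates = [
        n for n in ranked[n_rn:]
        if math.dist(features[n], pos_center) < math.dist(features[n], neg_center)
    ]
    return reliable_neg, sorted(candidates)


# Hypothetical 2-D feature vectors standing in for node embeddings.
features = {
    "s1": [0.0, 0.0], "s2": [0.0, 1.0],   # seed proteins (set P)
    "u1": [0.2, 0.4], "u2": [5.0, 5.0],
    "u3": [6.0, 5.0], "u4": [1.0, 1.0],
}
reliable_neg, candidates = pu_screen(features, {"s1", "s2"}, rn_fraction=0.5)
print("RN:", reliable_neg, "predicted disease-related:", candidates)
```

In practice the feature vectors would be the per-protein embeddings from the feature extraction step, and the step 2 classifier could be any supervised model; the nearest-centroid rule is used here only to keep the sketch dependency-free.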

Description

Technical field

[0001] The invention relates to the technical field of protein screening, in particular to a method for screening disease-related proteins based on complex networks.

Background

[0002] The identification of disease-related proteins plays an important role in the molecular typing, diagnosis, and treatment of diseases. Accurate and efficient identification of disease-related proteins helps to discover disease-causing genes and identify drug targets, which is of far-reaching significance for disease diagnosis and drug design. As an important research tool for exploring disease susceptibility genes, genome-wide association studies (GWAS) can quickly discover significant disease susceptibility loci. However, GWAS makes limited use of its data, which masks a large number of potentially significant disease-related proteins. At the same time, the single-point association analysis of traditional GWAS treats each gene in the body independently, ignoring the interaction between ...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G16B40/20, G16B25/10, G06N20/00
CPC: G06N20/00, G16B25/10, G16B40/20
Inventor: 李旭任静王学敏张文闫凯境王文佳
Owner: 天士力国际基因网络药物创新中心有限公司