A similarity analysis method, implementation system and medium based on negative sequence patterns of biological sequences

A similarity analysis technique for biological sequences, in the application field of high-efficiency negative sequence rules. It addresses the lack of a similarity measurement method for negative sequence patterns and reduces memory and time consumption.

Active Publication Date: 2021-04-27
山东元竞信息科技有限公司

AI Technical Summary

Problems solved by technology

[0004] Existing similarity analysis methods mainly target positive sequence patterns (PSP). For the negative sequence patterns (NSP) mined previously, a unified similarity measurement method is still lacking.
Sequence alignment also has some disadvantages, which has prompted researchers to look for other methods of comparing DNA sequence similarity.



Examples


Embodiment 1

[0083] A similarity analysis method based on negative sequence patterns of biological sequences, as shown in Figure 1, including the following steps:

[0084] (1) Data preprocessing

[0085] Each sequence or genome to be processed is preprocessed before frequent pattern mining. The letters in the DNA sequence are represented by numbers. Because DNA sequences are very long, the numerically encoded sequence is divided into several blocks, each containing the same number of bases, and the resulting blocks are used as the data set for frequent pattern mining;

[0086] In the present invention, each sequence is first divided into several blocks, each composed of the same number of consecutive bases. The blocks are independent of each other, and the block size can vary in practice. Note that if the last block is smaller than the specified block size, it is discarded. To mak...
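The preprocessing step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the numeric codes in `BASE_TO_CODE` and the function names `encode` and `split_into_blocks` are assumptions, since the source does not state which number stands for which base.

```python
# Hypothetical numeric codes; the source does not specify the mapping.
BASE_TO_CODE = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode(seq):
    """Represent each base letter of a DNA string by a number."""
    return [BASE_TO_CODE[base] for base in seq.upper()]


def split_into_blocks(codes, block_size):
    """Split an encoded sequence into consecutive blocks of equal size.

    A trailing block shorter than block_size is discarded, as the text states.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    n_full = len(codes) // block_size
    return [codes[i * block_size:(i + 1) * block_size] for i in range(n_full)]


# Ten bases with block size 4 give two full blocks; the last two bases are dropped.
blocks = split_into_blocks(encode("ACGTACGTAC"), 4)
```

The list of blocks then plays the role of the data set handed to frequent pattern mining.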

Embodiment 2

[0099] This embodiment follows the similarity analysis method based on negative sequence patterns of biological sequences described in Embodiment 1, with the following difference:

[0100] In step (2), the f-NSP algorithm is used to mine the data set D, with the following steps:

[0101] A. Use the GSP algorithm to obtain all frequent positive sequences, and store the bitmap corresponding to each frequent positive sequence in a hash table, including:

[0102] a. Scan the data set to obtain all sequence patterns of length 1 and put them into the original seed set P_1;

[0103] b. Take the sequence patterns of length 1 from the original seed set P_1 and join them to generate the candidate sequence set C_2 of length 2; prune the candidate sequence set C_2 using the Apriori property, then scan the candidate sequence set C_2 to determine the support of the remaining sequences, and save the sequence patterns whose support is hig...
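Steps a and b can be sketched roughly as below. This is a simplified illustration under several assumptions, not the f-NSP or GSP implementation from the patent: support is taken to be the number of blocks containing a pattern as a (not necessarily contiguous) subsequence, a pattern is kept when its support is at least `min_sup`, and the bitmap is a Python integer whose bit i is set when block i contains the pattern. The names `contains`, `bitmap` and `frequent_up_to_length_2` are hypothetical.

```python
from itertools import product


def contains(block, pattern):
    """True if pattern occurs in block as an ordered subsequence."""
    it = iter(block)
    return all(item in it for item in pattern)


def bitmap(dataset, pattern):
    """Integer bitmap: bit i is set when block i contains the pattern."""
    bits = 0
    for i, block in enumerate(dataset):
        if contains(block, pattern):
            bits |= 1 << i
    return bits


def frequent_up_to_length_2(dataset, min_sup):
    """Return a hash table {pattern: bitmap} of frequent patterns of length 1 and 2."""
    table = {}
    # Step a: seed set P_1 of frequent length-1 patterns.
    p1 = []
    for item in sorted({x for block in dataset for x in block}):
        bm = bitmap(dataset, (item,))
        if bin(bm).count("1") >= min_sup:
            table[(item,)] = bm
            p1.append(item)
    # Step b: join P_1 with itself to form C_2. Because only frequent items
    # are joined, every candidate already satisfies the Apriori property.
    for a, b in product(p1, repeat=2):
        bm = bitmap(dataset, (a, b))
        if bin(bm).count("1") >= min_sup:
            table[(a, b)] = bm
    return table


D = [[1, 2, 3], [1, 3, 2], [2, 1, 3]]
table = frequent_up_to_length_2(D, min_sup=2)
```

Storing the bitmap alongside each pattern means the support of a pattern can be read back later by counting set bits instead of rescanning the data set.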

Embodiment 3

[0116] This embodiment follows the similarity analysis method based on negative sequence patterns of biological sequences described in Embodiment 1, with the difference that in step (3), graphically representing the maximum frequent positive and negative sequence patterns includes: constructing a purine-pyrimidine diagram in the complex plane. In the purine-pyrimidine diagram, the first and second quadrants are purines, including A, G and ...; the third and fourth quadrants are pyrimidines, including T, C and ...

[0117] (b+di)→A(I)

[0118] (d+bi)→G(Ⅱ)

[0119] (b-di)→T(Ⅲ)

[0120] (d-bi)→C(Ⅳ)

[0121]

[0122]

[0123]

[0124]

[0125] The unit vectors of the four nucleotides A, G, T, C and of their corresponding negative sequences are shown in formula (I) to formula (VIII):

[0126] In formula (I) to formula (VIII), b and d are non-zero real numbers; A and T are conjugate, as are G and C, i.e., A, T, C, G represent actual base pairs, ...
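Formulas (I) to (IV) can be illustrated with Python's built-in complex numbers. This sketch covers only the four positive bases, because the formulas for the negative counterparts are not recoverable from the text. The values b = 1/2 and d = √3/2 are an assumption chosen so that each vector has modulus 1; the source only requires b and d to be non-zero real numbers. The function name `base_vectors` is hypothetical.

```python
import math


def base_vectors(b, d):
    """Map each base to its complex number per formulas (I) to (IV)."""
    return {
        "A": complex(b, d),    # (b + di) -> A
        "G": complex(d, b),    # (d + bi) -> G
        "T": complex(b, -d),   # (b - di) -> T
        "C": complex(d, -b),   # (d - bi) -> C
    }


# Assumed illustrative values giving unit modulus: b^2 + d^2 = 1.
b, d = 0.5, math.sqrt(3) / 2
vec = base_vectors(b, d)

# A and T are complex conjugates, and so are G and C.
a_t_conjugate = vec["A"].conjugate() == vec["T"]
g_c_conjugate = vec["G"].conjugate() == vec["C"]
```

With this mapping a pattern such as "AGT" becomes a list of complex numbers, which is one way a sequence pattern could be turned into a numeric sequence for later comparison.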



Abstract

The present invention relates to a similarity analysis method based on negative sequence patterns of biological sequences, an implementation system and a medium, including: (1) data preprocessing: the letters in the DNA sequence are represented by numbers and the sequence is divided into several blocks, which are used as the data set for frequent pattern mining; (2) frequent pattern mining: the f-NSP algorithm is used to mine the data set; (3) the maximum frequent positive and negative sequence patterns are represented graphically and transformed into numeric sequences; (4) DNA sequence similarity analysis: the similarity of different DNA sequences is obtained, and the corresponding DNA sequence with the smallest similarity is selected as the DNA sequence to be studied. The invention can effectively represent and analyze negative sequences, and can obtain different analysis results by selecting different combinations of maximum frequent patterns, thereby greatly reducing computer memory and time consumption.

Description

Technical field

[0001] The invention relates to a similarity analysis method, an implementation system and a medium based on negative sequence patterns of biological sequences, and belongs to the application field of decision-making high-utility negative sequence rules.

Background

[0002] In recent years, a large amount of biological sequence data has been obtained. With the advancement of DNA and protein sequencing technology, it has become very important to interpret the information contained in biological sequence data, especially the genetic and regulatory information in DNA sequences and the relationship between protein sequence structure and function. Demand for data analysis tools is increasing, and sequence similarity analysis is widely used. Whenever a new DNA sequence is obtained, one hopes to show through similarity analysis that it is similar to some known sequences. If it has homology with known sequences, it will greatly save the effort of re-determining the function of the new...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/16, G16B30/10, G16B45/00, G16B50/00
CPC: G16B30/10, G16B45/00, G16B50/00, G06F17/16, G16B40/30
Inventor: 董祥军, 芦月
Owner: 山东元竞信息科技有限公司