
Suffix array based fuzzy tandem repeat recognition method

A tandem repeat and suffix array technology, applied in electrical digital data processing, special data processing applications, instruments, etc. It addresses the repeat period size limit and the large memory footprint of existing methods.

Status: Inactive. Publication Date: 2014-11-19
CENT SOUTH UNIV
2 Cites, 14 Cited by

AI Technical Summary

Problems solved by technology

Among existing methods, TRF is a tandem repeat recognition program developed by Benson in 1999. Its shortcoming is a limit on the size of the repeat period: in TRF version 3.21, the maximum allowable repeat period is 2000 bp. The REPuter program uses a suffix tree-based algorithm to identify repeated sequences. When the sequence is long, the number of suffixes constructed is large and occupies a large amount of memory: each character of the input sequence requires an average of 12.5 bytes of storage, which limits the program's use on large sequence data sets.
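As a rough illustration of the memory figure quoted above, the following sketch multiplies a sequence length by 12.5 bytes per character. The function name is hypothetical, the 2465 bp length is the frataxin fragment used in Embodiment 1, and the 100 Mbp length is an assumed example, not a value from the patent.

```python
# Rough memory estimate for a suffix-tree index, using the average of
# 12.5 bytes per input character quoted in the text for REPuter.
BYTES_PER_CHAR_SUFFIX_TREE = 12.5


def suffix_tree_bytes(sequence_length: int) -> float:
    """Estimated index size in bytes for a sequence of the given length."""
    return sequence_length * BYTES_PER_CHAR_SUFFIX_TREE


if __name__ == "__main__":
    examples = [
        ("2465 bp fragment (Embodiment 1)", 2465),
        ("hypothetical 100 Mbp sequence", 100_000_000),  # assumed length
    ]
    for label, n in examples:
        print(f"{label}: {suffix_tree_bytes(n) / 1e6:.2f} MB")
```

The estimate grows linearly with sequence length, so the cost only becomes a practical constraint for genome-scale inputs.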



Examples


Embodiment 1

[0113] In the experiment, the default copy number is 2, and the match, mismatch, and gap scores for sequence alignment are set to +2, -2, and -2, respectively; only sequences with a matching degree greater than 50% are listed in the results.
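The scoring scheme in [0113] can be illustrated with a minimal dynamic programming sketch. It assumes a standard global (Needleman-Wunsch) alignment and defines the matching degree as matched columns divided by alignment length; the excerpt does not state either choice, so both are assumptions, and the function names are hypothetical.

```python
MATCH, MISMATCH, GAP = 2, -2, -2  # scores stated in paragraph [0113]


def global_align(a: str, b: str):
    """Needleman-Wunsch global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + GAP,
                              score[i][j - 1] + GAP)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + GAP:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def match_fraction(aligned_a: str, aligned_b: str) -> float:
    """Assumed definition of matching degree: matched columns / alignment length."""
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return matches / max(len(aligned_a), 1)


if __name__ == "__main__":
    s, x, y = global_align("GAAGAA", "GAAGTA")
    print(s, x, y, match_fraction(x, y) > 0.5)
```

Under this definition, two repeat copies that differ by one substitution in six bases would pass the 50% threshold.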

[0114] The sequence used in the experiment:

[0115] The promoter and exon part of the human frataxin gene (Friedreich's ataxia, U43748), the base number of this part of the gene is 2465bp.

[0116] Friedreich's ataxia is caused by an abnormal copy number of the trinucleotide repeat sequence (GAA) in the human frataxin gene. The experimental parameters are: min_p=2, min_ex=2*min_p, min_score=20, L_Align=100, where min_p is the minimum copy number, min_ex is the minimum number of bases for left and right extension, and min_score is the minimum score for sequence alignment. Only repeat sequences longer than 30 bp were selected for the results.
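To make the notion of an exact trinucleotide tandem repeat with a minimum copy number concrete, here is a small sketch that scans a sequence directly for a fixed period. It is a plain linear scan for illustration only, not the suffix array procedure of the invention; the function name and the toy sequence are made up.

```python
def exact_tandem_repeats(seq: str, period: int, min_copies: int = 2):
    """Return (start, end, unit) for maximal exact repeats of the given period.

    A run of positions k with seq[k] == seq[k + period] means that the region
    covering the run plus one more period repeats exactly with that period.
    """
    hits = []
    n = len(seq)
    i = 0
    while i + period < n:
        if seq[i] != seq[i + period]:
            i += 1
            continue
        j = i
        while j + period < n and seq[j] == seq[j + period]:
            j += 1
        length = (j - i) + period  # periodic region is seq[i:i + length]
        if length >= min_copies * period:
            hits.append((i, i + length, seq[i:i + period]))
        i = j + 1
    return hits


if __name__ == "__main__":
    toy = "TTGAAGAAGAAGAACC"  # made-up sequence containing (GAA)4
    print(exact_tandem_repeats(toy, period=3, min_copies=2))
    # prints [(2, 14, 'GAA')]
```

Note that a homopolymer run such as AAAAAA would also be reported at period 3, so a real tool would normally reduce each hit to its smallest period.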

[0117] Comparing the experimental results with the results of Benson's algorithm [Benso...

Embodiment 2

[0119] The sequence used in the experiment:

[0120] human T cell beta receptor sequence

[0121] Parameter settings: min_p=2, min_ex=2*min_p, min_score=50, L_Align=200. The length of the fuzzy sequence is > 100

[0122] Table 2 Tandem repeats of human T cell β receptor sequences

[0123]

[0124] The length of the tandem repeat sequence in Table 2 is greater than 100bp. bp stands for base pair; 1 bp = 1 base pair.

[0125] Table 3 Fuzzy tandem repeat search of human T cell β receptor sequences (manual alignment)

[0126]

[0127]

[0128] R_match is a rough match calculated by hand.

Embodiment 3

[0130] The sequence used in the experiment:

[0131] Sequence of the first chromosome of yeast

[0132] min_p=2, min_ex=2*min_p, min_score=50, L_Align=200. The length of the fuzzy sequence is >100.

[0133] Table 4 Fuzzy tandem repeat search based on the first chromosome sequence of yeast

[0134]

[0135] The tandem repeats in the table are all longer than 100 bp.

[0136] Table 5 Fuzzy tandem repeat search based on the first chromosome segment of yeast

[0137]

[0138]

[0139] R_match is a rough match calculated by hand.

[0140] Embodiments 1, 2, and 3 all present tandem repeat sequences obtained with the method of the present invention. Embodiment 1 is compared with the original method, and the appearance of new fuzzy tandem repeat sequences reflects the superiority of this method in finding fuzzy tandem repeats of higher complexity; the actual data used in Embodiment 2 is the human T cell beta receptor sequence, and the a...



Abstract

The invention discloses a suffix array based fuzzy tandem repeat recognition method. The method includes: inputting the acquired DNA (deoxyribonucleic acid) base sequences into a computer as character strings; processing the genomic sequences with a lexicographic sorting algorithm to generate the corresponding suffix arrays; obtaining the longest common prefix sequences from the suffix arrays; obtaining the maximal tandem repeats with an exact tandem repeat recognition algorithm; obtaining the optimal offset with an improved FFT (fast Fourier transform); aligning the sequences with a dynamic programming algorithm; and obtaining the fuzzy tandem repeats with the fuzzy tandem repeat recognition method. With this method, the repetitive sequences in genomic sequences can be recognized rapidly and analyzed accurately, and the fuzzy tandem repeats of the sequences can be found.
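The first steps named in the abstract (suffix array by lexicographic sorting, longest common prefixes, exact tandem repeats) can be sketched as follows. This is an illustrative sketch under assumptions, not the patented procedure: the suffix array is built by naive sorting, the LCP array uses Kasai's algorithm, and repeats are read off only from suffixes that are adjacent in the suffix array, so some repeats may be missed. All function names are hypothetical.

```python
def build_suffix_array(s: str) -> list:
    """Suffix array by plain lexicographic sorting (simple, not the fastest)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def build_lcp(s: str, sa: list) -> list:
    """Kasai's algorithm: lcp[k] is the longest common prefix length of the
    suffixes starting at sa[k - 1] and sa[k]; lcp[0] is 0."""
    n = len(s)
    rank = [0] * n
    for k, i in enumerate(sa):
        rank[i] = k
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def tandem_candidates(s: str, min_copies: int = 2) -> list:
    """Return sorted (start, period, copies) triples for exact tandem repeats.

    If two suffixes start p positions apart and share a prefix of length
    L >= p, the region of length L + p from the earlier start has period p.
    Only suffixes adjacent in the suffix array are compared (an assumption
    made to keep the sketch short). Expects min_copies >= 2.
    """
    sa = build_suffix_array(s)
    lcp = build_lcp(s, sa)
    found = set()
    for k in range(1, len(s)):
        i, j = sorted((sa[k - 1], sa[k]))
        period = j - i
        if lcp[k] >= period * (min_copies - 1):
            found.add((i, period, (lcp[k] + period) // period))
    return sorted(found)


if __name__ == "__main__":
    print(tandem_candidates("GAAGAAGAA"))
```

For the toy input `GAAGAAGAA`, the result includes `(0, 3, 3)`, that is, three copies of a period-3 unit starting at position 0. The later steps in the abstract (FFT-based offset search and fuzzy matching) are not covered by this sketch.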

Description

Technical field

[0001] The invention relates to the field of DNA repeat sequence recognition, in particular to a fuzzy tandem repeat sequence recognition method based on a suffix array.

Background technique

[0002] The completion of the Human Genome Project (HGP) and the sequencing of other species have led to an unprecedented increase in biological sequence data. The massive amount of data makes the computer an indispensable tool in biological research. Obtaining genome sequence data through sequencing is only the first step. The more important work is to understand and use these data, obtain the knowledge and laws hidden behind them, and deepen the understanding of life phenomena.

[0003] Identification methods similar to the method of this application include the TRF (Tandem Repeats Finder) algorithm and the REPuter program. Among them, TRF is a tandem repeat recognition program developed by Benson in 1999. Its sho...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F19/22
Inventor: 刘正春, 陈熹, 张春明, 赵雪丰, 朱自强
Owner: CENT SOUTH UNIV