65 results about "Suffix array" patented technology

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used, among others, in full text indices, data compression algorithms and within the field of bibliometrics. Suffix arrays were introduced by Manber & Myers (1990) as a simple, space efficient alternative to suffix trees. They had independently been discovered by Gaston Gonnet in 1987 under the name PAT array (Gonnet, Baeza-Yates & Snider 1992).
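As a minimal illustration of the definition above, the following sketch builds a suffix array by directly sorting all suffixes. The function name `suffix_array` is chosen here for illustration; this naive approach can take quadratic time or worse on repetitive input and is not one of the specialised construction algorithms described in the patents listed below.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of `text`, in lexicographic order."""
    # Naive construction: sort the start positions by the suffix that begins there.
    return sorted(range(len(text)), key=lambda i: text[i:])


if __name__ == "__main__":
    # Suffixes of "banana" in sorted order: a, ana, anana, banana, na, nana
    print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```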

Method and Computer Program Product for Finding the Longest Common Subsequences Between Files with Applications to Differential Compression

A differential compression method and computer program product combines hash value techniques and suffix array techniques. The invention finds the best matches for every offset of the version file, with respect to a certain granularity and above a certain length threshold. The invention has two variations depending on block size choice. If the block size is kept fixed, the compression performance of the invention is similar to that of the greedy algorithm, without the expensive space and time requirements. If the block size is varied linearly with the reference file size, the invention can run in linear-time and constant-space. It has been shown empirically that the invention performs better than certain known differential compression algorithms in terms of compression and speed.
Owner:IBM CORP

Pattern search method, pattern search apparatus and computer program therefor, and storage medium thereof

A fast search is performed of a large text database, while suppressing an increase in the data size of the data structure used for the process. A pattern search method for searching a target character string for a desired pattern includes: a range search step and a character string extraction step. At the range search step, intermediate patterns are obtained by adding characters in order, one by one, from the last character of the pattern to the first, and a range is determined for a suffix array, which corresponds to the target character string, wherein the first character of each of the intermediate patterns is present. Then, at the character string extraction step, elements of the character string are designated that correspond to elements included in the range of the suffix array, and character string segments are extracted consisting of the same number of elements as the elements of the pattern and having the elements of the character string as their first characters.
Owner:IBM CORP
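The range search step in the entry above (growing the pattern one character at a time from its last character to its first, and narrowing a range of the suffix array each time) can be sketched as follows. This is an illustrative reading of the abstract, not the patented implementation: the helper names `build_index`, `backward_range` and `extract_matches` are hypothetical, and the index is built naively for clarity.

```python
from bisect import bisect_left, bisect_right


def build_index(text):
    """Build a naive suffix array, its inverse (rank), and the first character of each suffix."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    rank = [0] * len(text)
    for r, p in enumerate(sa):
        rank[p] = r
    firsts = [text[p] for p in sa]
    return sa, rank, firsts


def backward_range(text, index, pattern):
    """Return the half-open suffix array range [lo, hi) of suffixes starting with `pattern`."""
    sa, rank, firsts = index
    n = len(text)
    # Range for the empty intermediate pattern; -1 also admits the empty remainder.
    lo, hi = -1, n
    for c in reversed(pattern):
        # Bucket of suffixes whose first character is c.
        b_lo, b_hi = bisect_left(firsts, c), bisect_right(firsts, c)
        # Inside the bucket, suffixes are ordered by the rank of the suffix that follows c.
        nxt = [rank[p + 1] if p + 1 < n else -1 for p in sa[b_lo:b_hi]]
        lo, hi = b_lo + bisect_left(nxt, lo), b_lo + bisect_left(nxt, hi)
        if lo == hi:  # no suffix starts with the current intermediate pattern
            break
    return max(lo, 0), hi


def extract_matches(text, index, pattern):
    """Character string extraction: read len(pattern) characters at each position in the range."""
    lo, hi = backward_range(text, index, pattern)
    return [(p, text[p:p + len(pattern)]) for p in sorted(index[0][lo:hi])]


if __name__ == "__main__":
    idx = build_index("banana")
    print(extract_matches("banana", idx, "ana"))  # [(1, 'ana'), (3, 'ana')]
```

Rebuilding the `nxt` list for every character keeps the sketch short but costs time proportional to the bucket size; a practical index would precompute this information.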

Software reusing method based on code clone automatic detection and timely prompting

The invention discloses a software reuse method based on automatic code clone detection and timely prompting. The method includes the following steps: first, a Java lexical analyzer is generated by a lexer generator; second, if a background monitor detects that the code in the current editing area has been modified, that code and the previously developed code are input into the Java lexical analyzer, token values corresponding to the source code files are generated according to a conversion rule, the tokens are concatenated into a token sequence, and the token sequence is stored in a one-dimensional array; third, a suffix array and a rank array are constructed using the prefix-doubling algorithm or the DC3 algorithm, and the longest common prefix array is generated; fourth, meaningless code fragments are filtered out of the longest common prefix array, and if code clones remain, the user is prompted to reuse or refactor them. With this method, the user can continue developing while the background monitor checks the source code.
Owner:SOUTHEAST UNIV
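The third step of the entry above (suffix array, rank array, then longest common prefix array over a token sequence) can be sketched as below. The suffix array is built naively here rather than with prefix doubling or DC3, the LCP array is computed with Kasai's algorithm (a standard technique the abstract does not name), and the token list is an invented example.

```python
def suffix_array(seq):
    """Naive suffix array over any sequence of comparable items (characters or tokens)."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def lcp_array(seq, sa):
    """Kasai's algorithm: lcp[r] is the common prefix length of suffixes sa[r-1] and sa[r]."""
    n = len(seq)
    rank = [0] * n
    for r, p in enumerate(sa):
        rank[p] = r
    lcp = [0] * n  # lcp[0] is 0 by convention
    h = 0
    for i in range(n):
        r = rank[i]
        if r > 0:
            j = sa[r - 1]
            # Extend the match carried over from the previous position.
            while i + h < n and j + h < n and seq[i + h] == seq[j + h]:
                h += 1
            lcp[r] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


if __name__ == "__main__":
    # Hypothetical token sequence for two identical statements: x = y + 1; x = y + 1;
    tokens = ["id", "=", "id", "+", "num", ";"] * 2
    sa = suffix_array(tokens)
    print(max(lcp_array(tokens, sa)))  # 6: the longest repeated token run (a clone candidate)
```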

Pulse sorting method based on isomorphic sequence

The invention provides a pulse sorting method based on an isomorphic sequence, and aims to solve the problem of detecting pulse groups that are submerged in a large number of pulse streams. The method comprises the following steps: S1, acquiring a pulse repetition time interval sequence through the first-order backward difference of the arrival time sequence of the pulse streams, and quantizing the value of each element of that sequence; S2, sieving the repeated substrings of the pulse repetition time interval sequence by using a suffix array and the longest common prefix; S3, deleting the shorter substrings among substrings having inclusion relations, and merging and joining substrings having overlapping relations; S4, constructing a pulse stream arrival time difference matrix for the remaining pulse streams; S5, extracting the positive real number sequence of each row of the difference matrix to construct a set of arrays, and sorting the arrays to obtain a plurality of subsets; S6, searching for the sum of the subsets and a maximum common subsequence, and determining the position of a target pulse; S7, performing harmonic verification and pulse loss verification.
Owner:BEIJING INSTITUTE OF TECHNOLOGY

Suffix array construction method

The invention discloses a suffix array construction method within a linear time. The method comprises the following steps of: 1) scanning a character string S from right to left, comparing two adjacent characters S[i] and S[i+1] which are scanned at the present to obtain the type of each character and the type of the suffix, and recording the types by using an array t; 2) scanning the array t from left to right, finding out all positions where an LMS character appears, obtaining initial pointers of all LMS sub strings, and recording the pointers of the LMS sub strings by using P1; 3) sequencing all the LMS sub strings in the S via the pointer array P1 of the LMS sub strings and arrays B and SA; 4) renaming each LMS sub string in the character string S according to a sequenced result obtained in the step 3 to form a new shortened string S1; 5) if each character in the S1 is unique, directly sequencing the suffixes of the S1 to calculate the suffix array SA1 of the S1, otherwise, recursively calling an SA-IS algorithm by using the S1 and the SA1 which serve as input parameters; 6) concluding and calculating the suffix array SA of the S according to the suffix array SA1 of the S1; and 7) returning.
Owner:农革
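Steps 1 and 2 of the entry above (classifying each suffix as L-type or S-type in a right-to-left scan, then locating the LMS positions in a left-to-right scan) can be sketched as follows. The sketch assumes the input ends with a unique sentinel character smaller than every other character, written `$` here; the function name `classify` is hypothetical, and the remaining SA-IS steps (induced sorting, renaming, recursion) are not shown.

```python
def classify(s):
    """Return (types, lms_positions) for a string `s` that ends with a unique smallest sentinel."""
    n = len(s)
    t = ["S"] * n  # the sentinel suffix is S-type by definition
    # Right-to-left scan: compare s[i] with s[i + 1]; ties inherit the type of i + 1.
    for i in range(n - 2, -1, -1):
        if s[i] > s[i + 1] or (s[i] == s[i + 1] and t[i + 1] == "L"):
            t[i] = "L"
    # Left-to-right scan: an LMS position is an S-type position preceded by an L-type one.
    lms = [i for i in range(1, n) if t[i] == "S" and t[i - 1] == "L"]
    return "".join(t), lms


if __name__ == "__main__":
    print(classify("banana$"))  # ('LSLSLLS', [1, 3, 6])
```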

String suffix array construction method on basis of radix sorting

Inactive | CN102073740A | Run fast | Small space consumption | Special data processing applications | Array data structure | Radix sort
The utility model discloses a string suffix array construction method on the basis of radix sorting, which comprises the following steps: (1) scanning a string S from right to left, comparing the two adjacent characters S[i] and S[i+1] currently being scanned to obtain the type of each character and each suffix, and recording the types in an array t; (2) scanning the array t from left to right, finding the positions at which all d-characters appear, obtaining the initial pointers of all d-substrings, and recording the pointer of each d-substring in a d-substring pointer array P1; (3) radix sorting all d-weighted substrings in S by means of the d-substring pointer array P1, an array B and an array SA; (4) renaming each d-weighted substring in the string S according to the sorting result of step (3) to form a new, shortened string S1; (5) if each character of S1 is unique, sorting the suffixes of S1 directly to calculate the suffix array SA1 of S1, otherwise recursively calling the SA-IS algorithm with S1 and SA1 as input parameters; (6) inducing the suffix array SA of S from the suffix array SA1 of S1 obtained in step (5); and (7) returning.
Owner:农革

Suffix array based fuzzy tandem repeat recognition method

The invention discloses a suffix array based fuzzy tandem repeat recognition method. The method includes: inputting acquired DNA (deoxyribonucleic acid) base sequences into a computer in the form of character strings; processing the genomic sequences with a lexicographic sorting algorithm to generate the corresponding suffix arrays; acquiring the longest common prefix sequences on the basis of the suffix arrays; acquiring the maximal tandem repeats with an exact tandem repeat recognition algorithm; acquiring the optimal offset with an improved FFT (fast Fourier transform); comparing the sequences with a dynamic programming algorithm; and acquiring the fuzzy tandem repeats with the fuzzy tandem repeat recognition method. With this method, the repetitive sequences in genomic sequences can be rapidly recognized and accurately analyzed, and the fuzzy tandem repeats of the sequences can be found.
Owner:CENT SOUTH UNIV

Method for rapidly screening positions from position database

The invention discloses a method for rapidly screening positions from a position database. The method relates to the position database, a position index generator, a position search server and a Web client. The position index generator is interconnected with the position database by an open database. The position index generator generates an index file according to the position data in the position database in an index structure based on a suffix array and distributes the generated index file to the position search server. The position search server is connected with the Web client through a socket. The Web client sends an inquiry request to the position search server through the socket. The position search server searches the index file and returns search results to the Web client for display through the socket. According to the method, all positions in the position database can be searched rapidly and effectively, and the search speed is greatly improved at the premise of guaranteeing the search accuracy.
Owner:QIAN JIN NETWORK INFORMATION TECH SHANGHAI LTD

Fuzzy search method and fuzzy search device

The invention discloses a fuzzy search method and a fuzzy search device and belongs to the technical field of fuzzy search. The fuzzy search method includes: constructing suffix arrays for the contact persons in a contact list in advance, wherein each suffix array includes at least one suffix array item acquired from the characters of the contact person; sorting the suffix array items of all the constructed suffix arrays according to preset rules; and, when a keyword for searching for a contact person is received, performing a binary search over all the sorted suffix array items according to the keyword to find the suffix array matching the keyword, and taking the contact person corresponding to that suffix array as a search result. The fuzzy search device comprises a construction module, a sorting module and a search module. The method and device shorten the time needed to search for a contact person, improve search efficiency, and improve user experience.
Owner:BEIJING FEINNO COMM TECH
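One possible reading of the contact search in the entry above is sketched below: every suffix of every contact name becomes a sorted item, and a binary search finds the items that start with the keyword. The names `build_items` and `search`, the lower-casing rule, and the sample contacts are assumptions made for illustration and do not come from the patent.

```python
from bisect import bisect_left


def build_items(contacts):
    """Return sorted (suffix, contact) pairs, one pair per suffix of each contact name."""
    return sorted((name[i:].lower(), name) for name in contacts for i in range(len(name)))


def search(items, keyword):
    """Return the contacts whose name contains `keyword`, ignoring case."""
    key = keyword.lower()
    # (key,) sorts before every (key, name) pair, so this lands on the first candidate.
    j = bisect_left(items, (key,))
    found = set()
    while j < len(items) and items[j][0].startswith(key):
        found.add(items[j][1])
        j += 1
    return sorted(found)


if __name__ == "__main__":
    items = build_items(["Alice Wong", "Bob Li", "Malika"])
    print(search(items, "li"))   # ['Alice Wong', 'Bob Li', 'Malika']
    print(search(items, "ong"))  # ['Alice Wong']
```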

Identifying repeat subsequences by left and right contexts

A system and method of identifying repeat subsequences having at least a value of x for threshold of different left contexts and a value of y for a threshold of different right contexts for an input sequence are disclosed. The method may include generating a lexicographically sorted suffix array for the input sequence and a longest common prefix array. The suffix array is traversed in lexicographic order comparing the longest common prefix values between consecutive suffixes. Suffixes with the same longest common prefix are representative of occurrence of the same repeat, a higher longest common prefix indicates a new occurrence of a longer repeat, and a lower longest common prefix indicates the last occurrence of a repeat.
Owner:XEROX CORP

Correctness verification method and system of suffix array and longest common prefix

Active | CN107015952A | Implement correctness verification | Reduce time and space overhead | Natural language data processing | Special data processing applications | Array data structure | Validation methods
The invention relates to a correctness verification method and system of a suffix array and a longest common prefix. The method includes the steps that T is scanned once from right to left, the size of a character T[i] and the size of a subsequent character T[i+1] are compared according to the definition of suffix types, and the types of the character T[i] and the suffix suf(T, i) of T are calculated and recorded in t[i]; elements in SA1 and LCPA1 are initialized as -1; SA is scanned once from left to right, and all LMS suffixes and LCP values thereof in SA are found according to an array t and recorded in SA1 and LCPA1 in sequence respectively; the adjacent LMS suffixes and the LCP values thereof in SA1 are subjected to correctness verification according to the character string T, the array t, SA1 and LCPA1; L-type suffixes and LCP values thereof are inductively sorted according to the character string T, the array t, B, C, SA1 and LCPA1; S-type suffixes and LCP values thereof are inductively sorted according to the character string T, the array t, B, C, SA1 and LCPA1; SA, SA1, LCPA and LCPA1 are scanned once in sequence, whether SA and SA1 are identical and LCPA and LCPA1 are identical or not is determined through comparison, and if the two groups are identical through comparison, SA and LCPA of T are correct.
Owner:SYSU CMU SHUNDE INT JOINT RES INST +1
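For comparison with the induced-sorting verification described in the entry above, the following brute-force checker tests a candidate suffix array and LCP array directly against their definitions. It is not the patented method and is far slower on large inputs; the function name `verify_sa_lcp` and the convention that `lcp[0]` is 0 are assumptions made for this sketch.

```python
def verify_sa_lcp(text, sa, lcp):
    """Return True if `sa` is the suffix array of `text` and `lcp` is its LCP array."""
    n = len(text)
    # The suffix array must be a permutation of 0..n-1, and the arrays must have length n.
    if len(sa) != n or len(lcp) != n or sorted(sa) != list(range(n)):
        return False
    if n and lcp[0] != 0:
        return False
    for r in range(1, n):
        a, b = sa[r - 1], sa[r]
        h = 0
        while a + h < n and b + h < n and text[a + h] == text[b + h]:
            h += 1
        if lcp[r] != h:
            return False
        # Order check: after h shared characters, suffix a must end or continue with a smaller one.
        if a + h < n and (b + h == n or text[a + h] > text[b + h]):
            return False
    return True


if __name__ == "__main__":
    print(verify_sa_lcp("banana", [5, 3, 1, 0, 4, 2], [0, 1, 3, 0, 0, 2]))  # True
    print(verify_sa_lcp("banana", [3, 5, 1, 0, 4, 2], [0, 1, 3, 0, 0, 2]))  # False
```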

Extracting method and device of complex named entity

The invention is applicable to the field of information extraction and provides an extracting method and a device of a complex named entity. The method comprises the steps of filtering text data of a text, connecting substrings separated by punctuations in the filtered text data into a long string through specified connectors, recording a home position of a Chinese character or an English character of the long string, storing the recorded home position of the Chinese character or the English character into an established suffix data set to determine an ordered sequence of suffixes in the suffix data set, determining a longest common prefix of the adjacent suffixes according to the ordered sequence of the suffixes, taking the determined longest common prefix of the adjacent suffixes as a repeated string of the text, and extracting the complex named entity of the text according to at least two out of the frequency of the repeated string of the text, mutual information of the repeated string and the independence of the repeated string. With the adoption of the method and the device, the more accurate complex named entity can be obtained, and the extracting accuracy of the complex named entity is improved.
Owner:SHENZHEN SHI JI GUANG SU INFORMATION TECH
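The repeated-string step in the entry above (taking the longest common prefix of adjacent suffixes in sorted order as a repeated string) can be sketched as follows. Text filtering, connector handling, and the frequency, mutual information and independence scoring are omitted; the function name `repeated_strings` and the `min_len` parameter are illustrative assumptions.

```python
def repeated_strings(text, min_len=2):
    """Collect the longest common prefixes of adjacent sorted suffixes as repeated strings."""
    n = len(text)
    order = sorted(range(n), key=lambda i: text[i:])  # naive suffix ordering
    repeats = set()
    for a, b in zip(order, order[1:]):
        h = 0
        while a + h < n and b + h < n and text[a + h] == text[b + h]:
            h += 1
        if h >= min_len:
            repeats.add(text[a:a + h])
    return repeats


if __name__ == "__main__":
    print(sorted(repeated_strings("banana")))  # ['ana', 'na']
```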

Rapid character string matching method based on suffix array

The invention provides a rapid character string matching method based on a suffix array. The method includes two stages, the first stage includes that the appearing position of a pattern string in a text string is limited in a possible interval of the suffix array taking a first character of the pattern string as a beginning character by binary search, and the second stage includes that search conditions are further limited on the interval, suffixes with the length smaller than that of the pattern string and with last characters different from those of the pattern string are excluded, the comparison frequency of the characters is decreased, character string matching range is narrowed, and the appearing position of the pattern string in the text string is rapidly acquired.
Owner:南京搜文信息技术有限公司
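As background for the entry above, the standard way to locate a pattern with a suffix array is a pair of binary searches over the sorted suffixes, sketched below. This is the textbook baseline rather than the two-stage refinement the abstract describes, and the function name `find_all` is hypothetical.

```python
def find_all(text, sa, pattern):
    """Return every start position of `pattern` in `text`, given the suffix array `sa`."""
    m = len(pattern)

    def prefix(r):
        # First m characters of the suffix with rank r.
        return text[sa[r]:sa[r] + m]

    # Lower bound: first rank whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first rank whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    print(find_all(text, sa, "ana"))  # [1, 3]
```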

Classification method, device and system based on platelet differentially expressed gene marker

The invention belongs to the field of computer technology, and provides a classification method, device and system based on a platelet differentially expressed gene marker. The method comprises the following steps: a sequencing read sequence of a target sample platelet transcriptome is acquired; a comparison result between the sequencing read sequence and a human genome is acquired according to a suffix array search algorithm and a sequence splitting / searching / extending strategy; a gene expression estimation value is determined according to a maximum likelihood method; the gene expression difference between a positive sample set and a negative sample set is acquired through a linear statistical method; a hyperplane expression is constructed according to the positive sample set and the negative sample set; and the entity gene expression estimation values are classified according to the hyperplane expression, the entity gene expression estimation values and the support vector machine principle. With this classification method, device and system, the differentially expressed gene marker can be quickly and accurately identified, and the classification precision for the corresponding individuals of a group is improved.
Owner:张渠

Method for multi-thread parallel construction of suffix array and system

The invention discloses a method for multi-thread parallel construction of a suffix array and a system. The method comprises the following steps: scanning a character string X, calculating the types of characters and suffixes by using an L / S type recognition method, and recording in an array t; scanning the array t, finding out the positions where LMS characters appear, obtaining first character pointers of LMS substrings and recording the pointers of the LMS substrings with an array P1; performing parallel induction and sorting on all LMS substrings in the X by the P1, a B and an SA and saving in an SA1; according to the sorting result, performing multi-thread parallel renaming on the LMS substrings in the X to form an X1; checking whether each character in the X1 is unique, if yes, directly sorting suffixes of the X1 to calculate a suffix array of the X1 and saving in the SA1; and finally calculating a suffix array of the X and saving in the SA according to the suffix array of the X1 saved in the SA1. The invention has high running speed, can be matched with the memory of a multi-core computer, and is suitable for constructing a suffix array of a large-scale string.
Owner:FOSHAN SHUNDE SUN YAT SEN UNIV RES INST +2

String matching in hardware using the fm-index

String matching is a ubiquitous problem that arises in a wide range of applications in computer science, e.g., packet routing, intrusion detection, web querying, and genome analysis. Due to its importance, dozens of algorithms and several data structures have been developed over the years. A recent breakthrough in this field is the FM-index, a data structure that synergistically combines the Burrows-Wheeler transform and the suffix array. In software, the FM-index allows searching (exact and approximate) in times comparable to the fastest known indices for large texts (suffix trees and suffix arrays), but has the additional advantage to be much more space-efficient than those indices. This disclosure discusses an FPGA-based hardware implementation of the FM-index for exact and approximate pattern matching.
Owner:RGT UNIV OF CALIFORNIA
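The exact-match backward search that an FM-index supports, as referenced in the entry above, can be sketched in software as follows. The sketch assumes the terminator `$` does not occur in the text, stores a full occurrence table for simplicity (practical implementations sample it to save space), and uses the hypothetical names `build_fm_index` and `count_occurrences`; it says nothing about the FPGA design itself.

```python
def build_fm_index(text):
    """Build the BWT-derived tables for `text`; assumes '$' does not occur in `text`."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each sorted suffix
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text that are strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return first, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count exact occurrences of `pattern` by backward search over suffix array intervals."""
    first, occ, n = index
    lo, hi = 0, n  # interval covering all suffixes
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = build_fm_index("banana")
    print(count_occurrences(index, "ana"))  # 2
```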

Inexact Search Acceleration

A system and method are disclosed for inexact search acceleration using reference data. A representative system includes one or more memory circuits storing a plurality of queries and a FM-index of the reference data; and one or more FPGAs configured to select a query; select a substring of the selected query; read a section of the FM-index and calculate a plurality of suffix array intervals for the substring with a corresponding plurality of prepended characters in a first or next position; read a first or next character in the first or next position of the query and select a suffix array interval for the read first character; determine whether the suffix array interval is valid and whether a beginning of the query has been reached; returning a first search result when the suffix array interval is valid and the beginning of the query has been reached; and returning a second search result that no match of the query with the reference data was found when the suffix array interval is not valid.
Owner:MICRON TECH INC

Pair character string retrieval system

A data structure of index information for retrieving pair character strings on a computer at high speed is provided. A method of retrieving a pair character strings appearing in close proximity of each other in a document using the index information at high speed is also provided. Bits of a suffix array of reference document data are rearranged, thereby creating index information LSA localizable, or usable as an index for a subregion of the document. Through use of this, a process of dichotomizing a region, where the entire document is designated as an initial region, is repeated and positions of index information for a query character string in the reference document data are gradually detailed. The distance between the pair is evaluated and candidates are narrowed down. Finally, positions where the pair character strings occur in close proximity of each other are identified.
Owner:HITACHI LTD

Gene sequence alignment method and system

The invention discloses a gene sequence alignment method and system. The method comprises the following steps: storing a reference genome sequence and a query genome sequence in a distributed storage system; under a Spark heterogeneous distributed computing platform framework, segmenting a reference genome sequence according to row offset, and preprocessing to obtain a plurality of preprocessed reference data sets; establishing an index for each preprocessing reference data set by adopting a suffix array algorithm, and combining all the preprocessing reference data sets after the index is established to obtain a reference sequence index file; carrying out CUDA fine-grained sequence comparison on each fragment in the query genome sequence and a reference sequence index file by adopting a seed extension algorithm, and determining position information of each fragment in the reference sequence index file; and combining the position information of all the fragments in the reference sequence index file to obtain a gene sequence comparison result. According to the invention, the calculation speed and precision of a large-scale sequence alignment algorithm are improved.
Owner:HUAZHONG AGRI UNIV

A suffix array indexing method and apparatus for real-time data stream

Active | CN109299152A | Solve the problem of low retrieval efficiency | Retrieval does not affect | Digital data information retrieval | Special data processing applications | Real-time data | Array data structure
The invention discloses a suffix array indexing method for a real-time data stream. The method comprises the following steps: a server receives the real-time data stream, extracts the source data, and preprocesses the source data into documents; each document is parsed and distributed by domain, and each domain receives its source data and starts an independent thread to index and store the data; a domain consists of a plurality of segments, and after receiving the source data, the domain object writes the source data directly into a segment and sets the segment source data update signal before returning a response; if all domains of the document return a response, the response information is returned to the client; a suffix array construction tool listens for the segment source data update signal in the background, automatically constructs the suffix array for the segment source data, and generates the segment suffix array; the segment source data, the segment suffix array and the segment information are linked into a full suffix array index, at which point the source data has been indexed successfully. The invention can index heterogeneous data in real time without word segmentation, and generates the index asynchronously to shorten response time. The invention is suitable for the data indexing field.
Owner:SUN YAT SEN UNIV

Correctness verification method and system of suffix array

The invention relates to a correctness verification method and system of a suffix array. The method includes the following steps: T is scanned once from right to left, the currently scanned character T[i] is compared with the subsequent character T[i+1] according to the definition of suffix types, and the types of the character T[i] and the suffix suf(T, i) of T are calculated and recorded in t[i]; T is scanned once from left to right, the positions of all LMS characters are found, and the first-character pointers of all LMS substrings are obtained accordingly and recorded in an array P1; according to the arrays P1, B and SA, the LMS substrings of T are sorted by induced sorting, and the result is saved in an array SA1; SA is scanned from left to right, and if SA[i] is of LMS type, SA[i] is saved in SA1; whether the characters in T1 are unique is judged, and if so, SA1 is calculated directly according to the names in T1, and an array C is updated through SA1; the suffix array SA of T is inductively calculated according to T1 and SA1, the correctness of SA is verified through the array C during the calculation, and if SA is correct, the array C is updated through SA.
Owner:SYSU CMU SHUNDE INT JOINT RES INST +1

A massive small file query method and system using suffix array index

The invention discloses a massive small file query method using a suffix array index. Because the small files are merged and stored on a distributed file system, the invention improves space utilization. At the same time, a suffix array index is established for each small file to record its storage information and the attribute information of the small file itself, and an effective small file update method is provided, so that small files can be queried in various ways. This avoids the traditional single, inefficient query mode for massive small files and ensures the immediacy, accuracy and efficiency of queries. The invention solves the problems of simply merging small files in the prior art, such as a single query mode, low reading efficiency, difficulty in updating small files, and poor query immediacy.
Owner:SUN YAT SEN UNIV

Method for quickly looking for feature character strings in text sequential data

Inactive | CN105653567A | Solve problems where an exact match is required | Solve matching problems | Special data processing applications | Array data structure | Original data
The invention discloses a method for quickly looking for feature character strings in text sequential data. The method comprises the following steps: (1) acquiring a text sequence, namely a character string, from the information; (2) generating a suffix array; and (3) searching the suffix array by binary search. In the third step, the search is conducted on each line of the suffix matrix; if a field occurs a designated number of times in a concentrated way in the binary search results, the similarity of the two fields is calculated, and the field with the closest similarity is the candidate field. The sequential nature of the original data is effectively utilized, which overcomes the complicated and slow data analysis caused by the LSH algorithm's restriction to unordered data. In addition, deletion and selection can be conducted directly after the fuzzy check, candidates can be filtered directly via the similarity calculation, and the requirement of similarity search algorithms that a subsequence be fully matched is overcome.
Owner:CHANGSHU RES INSTITUE OF NANJING UNIV OF SCI & TECH

Parallel data difference method

The invention provides a parallel data difference method. The method comprises the following steps: 1, the files are preprocessed, wherein a source file and a target file are initialized, a suffix array of the source file is generated, and a patch file is created and initialized; 2, the target file is segmented according to the number of threads, and one thread is started for each segment of the target file to process it independently; 3, each thread initializes its segment of the target file, creates a patch file, compares the source file and the target file through the suffix array to generate differential data, and writes the differential data into its patch file; 4, the main process merges the results, wherein the per-thread patch files containing the differential data are written together into the patch file. By adopting multi-thread parallel processing, the patch generation speed is improved.
Owner:INST OF INFORMATION ENG CAS
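The four steps in the entry above can be sketched roughly as below: build a suffix array of the source once, split the target into one segment per thread, let each thread emit copy/add operations by searching the shared suffix array, and concatenate the per-segment results. The operation format, the `min_len` threshold and all function names are assumptions made for illustration. In CPython, threads mainly illustrate the structure, since the global interpreter lock limits the speed-up for pure-Python code.

```python
from concurrent.futures import ThreadPoolExecutor


def build_sa(data):
    """Naive suffix array of a bytes object."""
    return sorted(range(len(data)), key=lambda i: data[i:])


def longest_match(source, sa, target, pos):
    """Return (source_offset, length) of the longest prefix of target[pos:] found in source."""
    tail = target[pos:]
    lo, hi = 0, len(sa)
    while lo < hi:  # binary search for the insertion point of `tail` among the suffixes
        mid = (lo + hi) // 2
        if source[sa[mid]:] < tail:
            lo = mid + 1
        else:
            hi = mid
    best = (0, 0)
    for r in (lo - 1, lo):  # the best match is a neighbour of the insertion point
        if 0 <= r < len(sa):
            k = 0
            while sa[r] + k < len(source) and pos + k < len(target) \
                    and source[sa[r] + k] == target[pos + k]:
                k += 1
            if k > best[1]:
                best = (sa[r], k)
    return best


def diff_segment(source, sa, segment, min_len=4):
    """Encode one target segment as ('copy', offset, length) and ('add', bytes) operations."""
    ops, literal, pos = [], bytearray(), 0
    while pos < len(segment):
        off, length = longest_match(source, sa, segment, pos)
        if length >= min_len:
            if literal:
                ops.append(("add", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", off, length))
            pos += length
        else:
            literal.append(segment[pos])
            pos += 1
    if literal:
        ops.append(("add", bytes(literal)))
    return ops


def parallel_diff(source, target, threads=4):
    """Split the target by thread count, diff each segment, and merge the results in order."""
    sa = build_sa(source)
    size = max(1, -(-len(target) // threads))  # ceiling division
    segments = [target[i:i + size] for i in range(0, len(target), size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        parts = list(pool.map(lambda seg: diff_segment(source, sa, seg), segments))
    return [op for part in parts for op in part]


def apply_patch(source, ops):
    """Rebuild the target from the source and the operation list."""
    out = bytearray()
    for op in ops:
        out += source[op[1]:op[1] + op[2]] if op[0] == "copy" else op[1]
    return bytes(out)
```

A round trip such as `apply_patch(source, parallel_diff(source, target)) == target` is a convenient sanity check for a sketch like this.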

Method and system for constructing suffix arrays (SAs) in parallel in constant working space

The invention discloses a method and system for constructing suffix arrays (SAs) in parallel in a constant working space. The method comprises the steps of: obtaining the first-character pointers of all LMS (LeftMost S-type) substrings in a character string X and recording them in an array P1; carrying out parallel induced sorting of all the LMS substrings in the constant working space by utilizing P1 and an SA; obtaining a character string X1; distinguishing the different construction input parameters of the SA according to the uniqueness of the characters in X1; and finally carrying out parallel inductive calculation of the SAs of the character string X1 in the constant working space through the correspondence between X1 and SA1, and storing the result in the SA. The method has the beneficial effects that the computer memory requirement is reduced, the running speed is higher, the time and space complexity is optimized, and the method is suitable for constructing the SAs of large-scale character strings.
Owner:FOSHAN SHUNDE SUN YAT SEN UNIV RES INST +2

Audio frequency comparison method

The invention discloses a rapid audio frequency comparison method. The audio frequency comparison part of the rapid audio frequency comparison method comprises the following steps of: reading an audio frequency p and an audio frequency q and marking off a feature segment set Cp of the audio frequency p and a feature segment set Cq of the audio frequency q; rapidly calculating an energy feature value sequence Wp of the feature segment set Cp of the audio frequency p and an energy feature value sequence Wq of the feature segment set Cq of the audio frequency q by utilizing a CUDA (Compute Unified Device Architecture) function on a GPU (Graphics Processing Unit); forming an energy matrix by a feature value of each feature segment according to the sequence of the feature segments; finding out a common feature segment set Seg of two feature value sequences by utilizing a suffix array deformation algorithm; rapidly scanning the common feature segment set Seg, finding out a communication region and using a set Vres returning to the communication region as an audio frequency comparison result; and marking the comparison result on an oscillogram.
Owner:NANJING TREDO INFORMATION TECH

Short message search method and system based on suffix arrays

The invention relates to a short message search method based on suffix arrays. The method comprises the steps that S1, a suffix array is constructed for each short message in a short message list, and then all suffix array items in all the suffix arrays obtained through construction are ordered; S2, when a keyword for searching for a short message is received, all characters in the received keyword are sequentially used as indexes for binary search according to a character receiving order; S3, the i(th) character in the keyword is used as an index to perform binary search in all the suffix array items which are ordered, and the suffix array corresponding to the suffix array items with the first character being the index is used as an i(th) search result; S4, it is assumed that i=i+1, the i(th) character in the keyword is used as an index to perform binary search in the suffix array items contained in an (i-1)th search result, and then the suffix array corresponding to the suffix array items with the first character being the index is used as the i(th) search result; and S5, the step S4 is executed repeatedly till i is greater than n, and at the moment the short message corresponding to the i(th) search result is used as a short message search result to be output, wherein n is the number of characters contained in the keyword.
Owner:SYSU CMU SHUNDE INT JOINT RES INST +1