46 results about "Approximate string matching" patented technology

In computer science, approximate string matching (often colloquially referred to as fuzzy string searching) is the technique of finding strings that match a pattern approximately (rather than exactly). The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately.
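
A minimal sketch of the two sub-problems, assuming Levenshtein (edit) distance as the notion of "approximately" (the definition above does not fix a particular measure). The function names and the threshold parameter `k` are illustrative, not taken from any of the patents listed below.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def substring_matches(pattern: str, text: str, k: int) -> list[int]:
    """Sub-problem 1: 1-based end positions in `text` where some substring
    lies within k edits of `pattern`."""
    # The cell for the empty pattern is always 0, so a match may start anywhere.
    col = list(range(len(pattern) + 1))
    ends = []
    for pos, ct in enumerate(text, start=1):
        new = [0]
        for i, cp in enumerate(pattern, start=1):
            cost = 0 if cp == ct else 1
            new.append(min(col[i] + 1, new[i - 1] + 1, col[i - 1] + cost))
        col = new
        if col[-1] <= k:
            ends.append(pos)
    return ends


def dictionary_matches(pattern: str, words: list[str], k: int) -> list[str]:
    """Sub-problem 2: dictionary strings within k edits of `pattern`."""
    return [w for w in words if edit_distance(pattern, w) <= k]
```

For example, `dictionary_matches("colour", ["color", "colors", "cooler"], 1)` keeps only `"color"`, and `substring_matches("abc", "xxabcxx", 0)` reports the single end position 5.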

Methods and systems for implementing approximate string matching within a database

A computer-based method for character string matching of a candidate character string with a plurality of character string records stored in a database is described. The method includes performing a clustering operation on at least a portion of the plurality of character string records, the clustering operation generating a plurality of clusters, each cluster comprising a plurality of character strings from the plurality of character string records, where the character strings in each cluster are determined to be similar to one another based on at least one characteristic of the character strings. The method also includes generating a set of reference character strings that are selected from the plurality of character strings in each cluster, generating an n-gram representation for one of the reference character strings in the set of reference character strings, and generating an n-gram representation for the candidate character string.
Owner:MASTERCARD INT INC
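
The n-gram step can be illustrated roughly as follows. This is only a sketch: the abstract does not say how the two n-gram representations are compared, so the Jaccard overlap, the padding scheme, the default `n = 3`, and all function names here are assumptions, and the clustering stage is omitted.

```python
def ngrams(s: str, n: int = 3) -> set[str]:
    """Character n-grams of a lower-cased string, padded so that the
    first and last characters also appear in n grams."""
    pad = " " * (n - 1)
    padded = f"{pad}{s.lower()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two n-gram sets (assumed comparison measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_reference(candidate: str, references: list[str], n: int = 3) -> str:
    """Pick the reference string whose n-gram set overlaps most with the
    candidate's n-gram set."""
    cand = ngrams(candidate, n)
    return max(references, key=lambda r: jaccard(cand, ngrams(r, n)))
```

With hypothetical reference strings `["acme coffee shop", "apex hardware", "zenith books"]`, the candidate `"ACME COFFE SHP 0042"` is matched to `"acme coffee shop"`.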

Image similarity detection using approximate pattern matching

Active; US8175387B1; Efficiently detect similarity; Simple and non-resource intensive; Character and pattern recognition; Pattern matching; Byte
Two images are compared to determine how similar they are. First, a process normalizes each image, then horizontal and vertical byte sequences are derived from each image. A similarity formula is used to obtain a similarity value that represents the similarity between the two images. An approximate pattern matching algorithm is used to determine the error distance between the horizontal byte sequences for the images and to determine the error distance between the vertical byte sequences for the images. The error distances and the length of the byte sequences are used to determine the similarity value. Padding is used to make the aspect ratios the same.
Owner:TREND MICRO INC
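
A rough sketch of the comparison described above, under several assumptions: images are already normalized and padded to the same size, each image is a grid of 8-bit grayscale values, the horizontal and vertical byte sequences are row-major and column-major flattenings, the error distance is edit distance, and the similarity value is one minus the total error divided by the total sequence length. The abstract does not give the actual similarity formula, so this one is purely illustrative.

```python
def edit_distance(a, b) -> int:
    """Levenshtein distance between two sequences (here: bytes objects)."""
    prev = list(range(len(b) + 1))
    for i, xa in enumerate(a, start=1):
        curr = [i]
        for j, xb in enumerate(b, start=1):
            cost = 0 if xa == xb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def byte_sequences(image: list[list[int]]) -> tuple[bytes, bytes]:
    """Flatten a pixel grid row by row and column by column."""
    horizontal = bytes(p for row in image for p in row)
    vertical = bytes(p for col in zip(*image) for p in col)
    return horizontal, vertical


def similarity(img_a: list[list[int]], img_b: list[list[int]]) -> float:
    """Assumed formula: 1 - (horizontal error + vertical error) / total length."""
    ha, va = byte_sequences(img_a)
    hb, vb = byte_sequences(img_b)
    errors = edit_distance(ha, hb) + edit_distance(va, vb)
    total = max(len(ha), len(hb)) + max(len(va), len(vb))
    return 1.0 - errors / total if total else 1.0
```

Two identical images score 1.0; changing one pixel of a 2 x 2 image gives one error in each direction over a total length of 8, so the score drops to 0.75.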

Electronic text document plagiarism recognition method based on similar string matching distance

Inactive; CN101441620A; Strong structural recognition ability; Plagiarism meets; Special data processing applications; Theoretical computer science; Document preparation
The invention relates to a method for identifying plagiarism in electronic text documents, mainly through the approximate string matching distance of subparagraphs. To decide whether a document A plagiarizes a document B, the method proceeds as follows. First, for each paragraph of document A, the approximate string matching distance and the approximate matching segment in document B are calculated. Second, the retroversion number and the forward jumping number are calculated from the approximate matching segments: the retroversion number is the number of times the head of the next approximate matching segment lies before the tail of the previous approximate matching segment, or the total number of segments passed over; the forward jumping number is the number of times the next approximate matching segment lies behind the previous one with a gap of at least one segment, or the total number of segments skipped. Finally, the approximate string matching distance, the retroversion number and the forward jumping number are summed, and the sum is taken as the plagiarism distance from document A to document B. If this distance is less than a certain threshold, document A is suspected of plagiarizing document B.
Owner:WENZHOU UNIVERSITY

Method for calculating similarity of Geographic Information System (GIS) vector data image watermarks

The invention discloses a method for calculating the similarity of Geographic Information System (GIS) vector data image watermarks, belonging to the field of geographic information copyright protection. The method comprises the following steps: correcting the position of an extracted watermark W' by means of the original watermark W so that the disordered pixels of the copyright image return to their correct positions; and then calculating the similarity of the corrected watermark W' and the original watermark W with a dynamic programming algorithm for approximate string matching. The method can accurately restore the pixels of the extracted image watermark to their correct positions, visually show where the data has been tampered with, and objectively measure the similarity of the original and extracted watermarks. It thereby improves the quality of watermark authentication to a certain extent, reduces the miss rate of watermark authentication, and rounds out the theory and methods of geographic information copyright protection. The method can be applied to copyright protection and secure transmission of GIS vector data.
Owner:NANJING NORMAL UNIVERSITY

Method for accelerating character string matching by trans-border protection mechanism

The invention provides a method that uses a boundary violation protection mechanism to accelerate string matching. The end position of the text is obtained from the length of the text to be matched; suppose the last character of the text is at position loc. A one-character separator is placed at position loc + 1; the separator is any character that does not appear in the pattern. A copy of the pattern is appended to the text starting at position loc + 2. Normal string matching is then carried out without checking whether an index crosses the boundary. Only before a match position of the pattern is output is the index checked against the boundary: if it does not cross the boundary, the match position is output; if it does, matching ends. The method is independent of the concrete string matching implementation and is a general improvement applicable to various existing string matching problems. Outputting a match position is the least frequent of all actions in the string matching process, so the method minimizes the total number of index boundary checks.
Owner:HARBIN ENG UNIV
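
The guard idea can be sketched with a naive matcher, assuming 0-based indices and a NUL character as the default separator (both choices, and the function name, are illustrative). Python checks bounds implicitly on every index, so the sketch shows the control flow rather than the speed-up: the scanning loop never compares the index against the text length, and the single boundary test happens only when a match is about to be reported.

```python
def find_all_with_guard(text: str, pattern: str, sentinel: str = "\0") -> list[int]:
    """Return all 0-based match positions of `pattern` in `text`."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    if len(sentinel) != 1 or sentinel in pattern:
        raise ValueError("sentinel must be one character absent from the pattern")

    n, m = len(text), len(pattern)
    # Text, then the separator, then a copy of the pattern as the guard.
    guarded = text + sentinel + pattern

    matches = []
    start = 0
    while True:  # no "start <= n - m" test here
        j = 0
        while j < m and guarded[start + j] == pattern[j]:
            j += 1
        if j == m:
            # Boundary check only at output time: the guard copy is the
            # first match that extends past the end of the real text.
            if start + m > n:
                break
            matches.append(start)
        start += 1
    return matches
```

Termination is guaranteed because the scan always reaches the guard copy at index `n + 1`, and no earlier match can overlap the separator since the separator does not occur in the pattern.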

Systems and methods for building an electronic dictionary of multi-word names and for performing fuzzy searches in the dictionary

The present invention automatically builds a contracted dictionary from a given list of multi-word proper names and performs fuzzy searches in the contracted dictionary. The contracted dictionary of proper names includes two linked trie-based dictionaries: a first dictionary is used to store single-word names, each word name having an ID number; and a second dictionary is used to store multi-word names encoded with ID numbers. Information related to the multi-word names is also stored as a gloss at the terminal node of the multi-word entry of the trie-based dictionary. An approximate lookup for a multi-word name is conducted first for each word of the multi-word name using an approximate matching technique such as phonetic proximity or a simple edit distance. Accordingly, N suggestions are determined for each word of the multi-word name under consideration. Then, multi-word candidates are assembled in ID notation. Finally, an approximate search for each assembled candidate is performed based on an edit distance or n-gram approximate string matching. Edit distances and n-grams are used to measure how similar two strings are. The result is a set of multi-word suggestions in ID notation. This ID notation is decoded back to the original form using the first trie-based dictionary.
Owner:IBM CORP
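
A simplified sketch of the two-level lookup, with several stand-ins that are assumptions rather than the patented design: plain Python dicts replace the tries, `difflib.get_close_matches` (a ratio-based matcher from the standard library) replaces the phonetic or edit-distance word lookup, and the final search over assembled ID sequences is exact rather than approximate. The function names are hypothetical.

```python
from difflib import get_close_matches
from itertools import product


def build_dictionaries(names: list[str]):
    """First dictionary: word -> ID. Second: tuple of word IDs -> full name."""
    word_ids: dict[str, int] = {}
    multi: dict[tuple[int, ...], str] = {}
    for name in names:
        ids = tuple(word_ids.setdefault(w, len(word_ids))
                    for w in name.lower().split())
        multi[ids] = name
    return word_ids, multi


def fuzzy_lookup(query: str, word_ids, multi, n: int = 3, cutoff: float = 0.6):
    """Suggest stored multi-word names for a possibly misspelled query."""
    per_word = []
    for w in query.lower().split():
        # Up to n suggestions for each word of the query.
        suggestions = get_close_matches(w, list(word_ids), n=n, cutoff=cutoff)
        if not suggestions:
            return []
        per_word.append([word_ids[s] for s in suggestions])

    results = []
    # Assemble candidates in ID notation and look them up.
    for candidate in product(*per_word):
        name = multi.get(candidate)
        if name is not None and name not in results:
            results.append(name)
    return results
```

For a dictionary built from the hypothetical names "John Smith", "Jon Smythe" and "Mary Jones", the query "Jhon Smith" yields suggestions that include "John Smith" but not "Mary Jones".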

Kinship analysis method based on household registration information data

The invention provides a kinship analysis method based on household registration information data. The method comprises the following steps: S1, carrying out encoding of basic relationships in the kinship through letters and numeric characters, so as to obtain a character code set for the basic relationships; S2, determining connection symbols, positive relationships and reverse relationships, wherein the connection symbols are symbols connecting character codes corresponding to the basic relationships, the known kinship is defined as one of the positive relationships, and a relationship opposite to each positive relationship is defined as one of the reverse relationships; S3, obtaining a character string of the kinship to be analyzed according to data of the kinship to be analyzed and through the character codes, the connection symbols, and the reverse relationships; S4, carrying out simplification of the character string according to simplification rules, so as to obtain a new character string, of which the length is smaller than the length of the original character string; and S5, carrying out character string matching of the simplified new character string according to matching rules, so as to obtain analysis results of the kinship to be analyzed.
Owner:ENC DATA SERVICE CO LTD

System and method for variant string matching

A method, computer program product, and system for variant string matching. A computer implemented method for variant string matching may comprise comparing with a computing device two unidentical strings in a training variant string pair. The two unidentical strings may represent the same item from training data, which may be stored in a memory. The two unidentical strings may be compared to determine if they include an identical substring pair, and a first unidentical substring pair. The computer implemented method may also determine if the first unidentical substring pair includes a first unidentical substring and a second unidentical substring. The computer implemented method may further determine if the first unidentical substring pair is in the training data. The first unidentical substring pair may be entered into the training data as a first variant string pair if it is not in the training data.
Owner:SRA INTERNATIONAL
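
One way to picture the comparison step is to split a training pair into the parts that agree and the parts that differ, then record the differing parts as variant pairs. The sketch below uses `difflib.SequenceMatcher` from the Python standard library for the alignment; that choice, the function names, and the use of a set for the training data are assumptions, since the abstract does not specify how substrings are identified.

```python
from difflib import SequenceMatcher


def split_pair(a: str, b: str):
    """Split two strings into identical substrings and unidentical
    substring pairs, in left-to-right order."""
    identical, unidentical = [], []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            identical.append(a[i1:i2])
        else:
            unidentical.append((a[i1:i2], b[j1:j2]))
    return identical, unidentical


def learn_variants(pair: tuple[str, str], training: set) -> set:
    """Add each unidentical substring pair to the training data
    unless it is already present."""
    _, unidentical = split_pair(*pair)
    for variant in unidentical:
        if variant not in training:
            training.add(variant)
    return training
```

For the hypothetical pair ("Acme Corp.", "Acme Inc."), the identical substrings are "Acme " and ".", and the variant pair ("Corp", "Inc") is added to the training data.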

Font information fusion-based medicine-taking bill recognition result error correction method

The invention relates to a font information fusion-based method for correcting errors in medicine-taking bill recognition results and belongs to the field of character recognition. The method comprises the following steps: constructing a standard medicine word bank and storing each piece of medicine information in the word bank as a node of a BK-tree storage structure; setting a search distance threshold n and reducing the search scale through a threshold search rule to obtain a result candidate set; matching the character string to be corrected, obtained from character recognition, against the character strings in the result candidate set by similarity, using an edit distance formula improved from the traditional one in which the insertion and deletion costs are unchanged and the character replacement cost is reduced; during character replacement, considering three kinds of font information, namely five-stroke (Wubi) codes, four-corner codes and strokes, to improve the precision of approximate string matching; and taking the candidate string with the highest similarity as the error correction result. The method corrects medicine-taking bill recognition results and thereby improves the accuracy of medicine-taking bill recognition.
Owner:CHONGQING UNIV OF POSTS & TELECOMM
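
The BK-tree threshold search can be sketched as follows. This is a minimal illustration using the standard Levenshtein distance on Latin-alphabet example words; the reduced, font-aware replacement cost described in the abstract is not implemented, and the class and method names are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance (a metric, as a BK-tree requires)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


class BKTree:
    """Each node is (word, children), where children maps a distance
    to the child node whose word lies at that distance from this word."""

    def __init__(self, distance=edit_distance):
        self.distance = distance
        self.root = None

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query: str, n: int) -> list[tuple[int, str]]:
        """All stored words within distance n of query, closest first."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= n:
                results.append((d, word))
            # Triangle inequality: only edges in [d - n, d + n] can hold matches.
            for edge, child in children.items():
                if d - n <= edge <= d + n:
                    stack.append(child)
        return sorted(results)
```

After adding the example words "aspirin", "ibuprofen", "amoxicillin" and "ampicillin", `search("asprin", 1)` returns `[(1, "aspirin")]`; the first element of the result would serve as the correction.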