
402 results about "Edit distance" patented technology

In computational linguistics and computer science, edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.
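
As a minimal sketch of the definition above, the following Python function computes the Levenshtein distance, the most common edit distance, which counts single-character insertions, deletions and substitutions. The function name `levenshtein` and the unit cost per operation are illustrative assumptions, not something this listing prescribes.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

For example, `levenshtein("kitten", "sitting")` evaluates to 3, and two identical DNA strings such as `"ACGT"` and `"ACGT"` have distance 0.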

Method and Apparatus for Automatic Detection of Spelling Errors in One or More Documents

Methods and apparatus are provided for automatically detecting spelling errors in one or more documents, such as documents being processed for the creation of a lexicon. According to one aspect of the invention, a spelling error is detected in one or more documents by determining if at least one given word in the one or more documents satisfies a predefined misspelling criterion, wherein the predefined misspelling criterion comprises the at least one given word having a frequency below a predefined low threshold and the at least one given word being within a predefined edit distance of one or more other words in the one or more documents having a frequency above a predefined high threshold; and identifying a given word as a potentially misspelled word if the given word satisfies the predefined misspelling criterion.
Owner:IBM CORP
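
The frequency-plus-distance test described in the entry above can be sketched roughly as follows. This is only an illustration of the stated idea, not the patented implementation: the function name `find_suspects`, the threshold values and the use of plain Levenshtein distance are all assumptions.

```python
from collections import Counter


def levenshtein(a, b):
    """Unit-cost edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def find_suspects(words, low=2, high=3, max_dist=2):
    """Map each rare word to the frequent words it is close to.

    A word is flagged when its count is below `low` and it lies within
    `max_dist` edits of a word whose count is above `high`.
    All three thresholds are hypothetical defaults.
    """
    freq = Counter(words)
    frequent = sorted(w for w, c in freq.items() if c > high)
    suspects = {}
    for word, count in freq.items():
        if count >= low:
            continue  # not rare enough to be suspicious
        near = [w for w in frequent if levenshtein(word, w) <= max_dist]
        if near:
            suspects[word] = near
    return suspects
```

With the word list `["receive"] * 5 + ["banana"] * 4 + ["recieve", "zebra"]`, only `"recieve"` is flagged, with `"receive"` as its likely intended form; `"zebra"` is rare but not close to any frequent word.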

End-to-end identification method for scene text with random shape

The invention discloses an end-to-end identification method for a scene text with a random shape. The method comprises the steps of extracting text features through a feature pyramid network so that a region proposal network can generate candidate text boxes; adjusting the position of each candidate text box through a fast region classification and regression branch to obtain a more accurate text bounding box position; inputting the position information of the bounding box into a segmentation branch and obtaining a predicted character sequence through a pixel voting algorithm; and finally processing the predicted character sequence through a weighted edit distance algorithm to find the best-matching word for the predicted character sequence in a given dictionary, thereby obtaining the final text identification result. According to the method of the invention, scene texts of arbitrary shape, including horizontal text, multidirectional text and curved text, can be detected and identified simultaneously. Furthermore, the method can be trained fully end to end. Compared with the prior art, the identification method according to the invention achieves better accuracy and versatility and has high application value.
Owner:HUAZHONG UNIV OF SCI & TECH

Spell-check for a keyboard system with automatic correction

An adaptation of standard edit distance spell-check algorithms leverages probability-based regional auto-correction algorithms and data structures for ambiguous keypads and other predictive text input systems to provide enhanced typing correction and spell-check features. Strategies for optimization and for ordering results of different types are also provided.
Owner:TEGIC COMM

Method and apparatus for detecting near duplicate videos using perceptual video signatures

Methods and apparatus for detection and identification of duplicate or near-duplicate videos using a perceptual video signature are disclosed. The disclosed apparatus and methods (i) extract perceptual video features, (ii) identify unique and distinguishing perceptual features to generate a perceptual video signature, (iii) compute a perceptual video similarity measure based on the video edit distance, and (iv) search and detect duplicate and near-duplicate videos. A complete framework to detect unauthorized copying of videos on the Internet using the disclosed perceptual video signature is disclosed.
Owner:GOOGLE LLC

Spell-check for a keyboard system with automatic correction

User input is received, specifying a continuous traced path across a keyboard presented on a touch sensitive display. An input sequence is resolved, including traced keys and auxiliary keys proximate to the traced keys by prescribed criteria. For each of one or more candidate entries of a prescribed vocabulary, a set-edit-distance metric is computed between said input sequence and the candidate entry. Various rules specify when penalties are imposed, or not, in computing the set-edit-distance metric. Candidate entries are ranked and displayed according to the computed metric.
Owner:CERENCE OPERATING CO

Identifying alternative spellings of search strings by analyzing self-corrective searching behaviors of users

A computer-implemented process identifies useful alternative spellings of search strings submitted to a search engine. The process takes into consideration spelling changes made by users, as detected by programmatically analyzing search string submissions of a population of search engine users. In one embodiment, an assessment of whether a second search string represents a useful alternative spelling of a first search string takes into consideration (1) an edit distance between the first and second search strings, and (2) a likelihood that a user who submits the first search string will thereafter submit the second search string, as determined by monitoring and analyzing actions of users.
Owner:AMAZON TECH INC

Document Scanning and Data Derivation Architecture.

Status: Inactive. Publication: US20070033118A1. Tags: Reduce and eliminate manual typing; Eliminate or reduce common typographical errors; Complete banking machines; Finance; Feature vector; Imaging analysis
Proprietary suite of underlying document image analysis capabilities, including a novel forms enhancement, segmentation and modeling component, forms recognition and optical character recognition. Future versions of the system will include form reasoning to detect and classify fields on forms with varying layout. The product provides acquisition, modeling, recognition and processing components, and has the ability to verify recognized data on the image with a line-by-line comparison. The key enabling technologies center around the recognition and processing of the scanned forms. The system learns the positions of lines and the location of text on the pre-printed form, and associates various regions of the form with specific required fields in the electronic version. Once the form is recognized, the preprinted material is removed and individual regions are passed to an optical character recognition component. The current proprietary OCR engine is trained with a variety of Roman text fonts and has a back-end dictionary that can be customized to account for the fact that the system knows which field it is recognizing. The engine performs segmentation to obtain isolated characters and computes a structure-based feature vector. The characters are normalized and classified using a cluster-centric classifier, which responds well to variations in the symbol's contour. An efficient dictionary lookup scheme provides exact and edit distance lookup using a TRIE structure. An edit distance is computed and a collection of near misses can be output in a lattice to enhance the final recognition result. The current classification rate can exceed 99% with context. The ultimate goal of this system is to enable the processing of all tax forms including forms with handwritten material.
Owner:TAXSCAN TECH

Similarity detection for error reports

Status: Active. Publication: US20110066908A1. Tags: Reducing and eliminating duplication; Accurate degree; Error prevention; Transmission systems; Frame difference; Error reporting
Techniques for determining similarity between error reports received by an error reporting service. An error report may be compared to other previously-received error reports to determine similarity and facilitate diagnosing and resolving an error that generated the error report. In some implementations, the similarity may be determined by comparing frames included in a callstack of an error report to frames included in callstacks in other error reports to determine an edit distance between the callstacks, which may be based on the number and type of frame differences between callstacks. Each type of change may be weighted differently when determining the edit distance. Additionally or alternatively, the comparison may be performed by comparing a type of error, process names, and / or exception codes for the errors contained in the error reports. The similarity may be expressed as a probability that two error reports were generated as a result of a same error.
Owner:MICROSOFT TECH LICENSING LLC
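
To illustrate the idea of weighting each type of change differently when comparing callstacks, here is a hedged sketch of a weighted edit distance over lists of frame names. The function name `callstack_distance` and the three cost values are hypothetical; the entry above does not state which weights are used or how the distance is turned into a probability.

```python
def callstack_distance(a, b, ins_cost=1.0, del_cost=1.0, sub_cost=1.5):
    """Weighted edit distance between two callstacks (lists of frame names).

    The cost values are illustrative assumptions only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # dp[i][j] = cheapest way to turn the first i frames of a into the first j of b
    dp = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = dp[i - 1][0] + del_cost
    for j in range(1, cols):
        dp[0][j] = dp[0][j - 1] + ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            match = 0.0 if a[i - 1] == b[j - 1] else sub_cost
            dp[i][j] = min(dp[i - 1][j] + del_cost,   # drop a frame from a
                           dp[i][j - 1] + ins_cost,   # add a frame from b
                           dp[i - 1][j - 1] + match)  # keep or replace a frame
    return dp[-1][-1]
```

With these default weights, two stacks that differ in a single frame, such as `["main", "parse", "read"]` and `["main", "lex", "read"]`, are at distance 1.5, while a stack missing one frame is at distance 1.0.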

Method of performing approximate substring indexing

Approximate substring indexing is accomplished by decomposing each string in a database into overlapping “positional q-grams”, sequences of a predetermined length q, and containing information regarding the “position” of each q-gram within the string (i.e., 1st q-gram, 4th q-gram, etc.). An index is then formed of the tuples of the positional q-gram data (such as, for example, a B-tree index or a hash index). Each query applied to the database is similarly parsed into a plurality of positional q-grams (of the same length), and a candidate set of matches is found. Position-directed filtering is used to remove the candidates which have the q-grams in the wrong order and / or too far apart to form a “verified” output of matching candidates. If errors are permitted (defined in terms of an edit distance between each candidate and the query), an edit distance calculation can then be performed to produce the final set of matching strings.
Owner:AMERICAN TELEPHONE & TELEGRAPH CO
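
The pipeline in the entry above (positional q-grams, an index over them, position-directed filtering, then edit distance verification) can be sketched as follows. This is a simplified illustration under stated assumptions: a plain dictionary stands in for the B-tree or hash index, strings are not padded, all function names are hypothetical, and the pruning threshold is the standard q-gram counting bound (each edit can destroy at most q of the query's q-grams) rather than anything taken from the patent text.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Unit-cost edit distance used for the final verification step."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def positional_qgrams(s, q):
    """All (position, q-gram) pairs of s, without padding."""
    return [(i, s[i:i + q]) for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Map each q-gram to the (string id, position) pairs where it occurs."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for pos, gram in positional_qgrams(s, q):
            index[gram].append((sid, pos))
    return index


def search(query, strings, index, q, k):
    """Return the strings within edit distance k of query."""
    # Position filter: a shared q-gram only counts if it is at most k positions away.
    hits = defaultdict(set)
    for qpos, gram in positional_qgrams(query, q):
        for sid, pos in index.get(gram, ()):
            if abs(pos - qpos) <= k:
                hits[sid].add(qpos)
    # Count filter: at least this many query q-grams must survive k edits.
    needed = len(query) - q + 1 - k * q
    results = []
    for sid, s in enumerate(strings):
        if abs(len(s) - len(query)) > k:
            continue
        if len(hits.get(sid, ())) < needed:
            continue
        if levenshtein(query, s) <= k:  # verify the surviving candidates
            results.append(s)
    return results
```

The filters only prune candidates that cannot match; the edit distance check at the end is what decides the final answer, so a candidate such as `"levenshtien"` can pass the filters for the query `"levenshtein"` with `k=1` and still be rejected on verification.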

System and method for measuring confusion among words in an adaptive speech recognition system

A system and method are proposed for measuring confusability or similarity between given entry pairs, including text string pairs and acoustic model pairs, in systems such as speech recognition and synthesis systems. A string edit distance (Levenshtein distance) can be applied to measure the distance between any pair of text strings. It can also be used to calculate a confusion measurement between acoustic model pairs of different words, and a model-driven method can be used to calculate an HMM model confusion matrix. This model-based approach can be calculated efficiently with low memory and low computational resources. Thus it can improve the speech recognition performance and models trained from a text corpus.
Owner:NOKIA CORP

Music retrieval system based on audio fingerprint features

The invention belongs to the technical field of information retrieval, and particularly relates to a music retrieval system based on audio fingerprint features. The system is composed of a preprocessing module, a feature extraction module, an inverted index module and a fine matching module. The preprocessing module mainly carries out audio signal conversion, resampling and filtering. The feature extraction module represents audio files: audio fingerprint features are obtained by selecting the most stable points in the frequency spectrum as feature points through two rounds of screening based on dynamic thresholds, and each feature is represented by a pair of points. The inverted index module uses the features as keywords, builds inverted indexes from the features of the song library, and returns index results according to the number of matching keywords. The fine matching module takes the sequential relationship of the audio features into account and adopts an improved edit distance as the similarity of two feature sequences, thereby refining the index results. The music retrieval system based on audio fingerprint features is suitable for retrieval over a large number of songs, and is particularly effective for queries made from recorded clips.
Owner:FUDAN UNIV

Detecting duplicate records in databases

The invention concerns a detection of duplicate tuples in a database. Previous domain independent detection of duplicated tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches result in large numbers of false positives if they are used to identify domain-specific abbreviations and conventions. In accordance with the invention a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high quality, scalable duplicate detection process.
Owner:MICROSOFT TECH LICENSING LLC

Method for text error correction after voice recognition based on domain identification

The invention belongs to the field of voice recognition text processing and discloses a method for text error correction after voice recognition based on domain identification. It aims to solve the problems that prior-art processing methods need substantial manual intervention, have low error correction efficiency and cannot correct proper names. The method comprises the following steps: (a) error detection and analysis are performed on the text obtained after voice recognition, and the domain to which the text sentences belong is preliminarily determined; (b) sentences to undergo error correction are segmented according to predefined syntax rules and divided into redundant portions and core portions; (c) a search engine is used to perform fuzzy string matching and determine candidate domain-specific lexicon sets for the core portions of the sentences; (d) similarity scores are calculated according to edit distances, and error correction is performed on the redundant portions and the core portions; (e) the corrected redundant portions and core portions are merged, and the error correction results are output.
Owner:SICHUAN CHANGHONG ELECTRIC CO LTD

Character string updated degree evaluation program

There is provided a character string updated degree evaluation program that enables quantitative grasping of the amount of intellectual work done through editing and updating of character strings. A text subjected to comparison is divided into common part character strings, each having a length greater than or equal to a threshold value, and non-common part character strings. The number of edited points relative to the original text and a context edit distance are calculated based on the rate of the common part character strings and their occurrence pattern. The number of edited points is acquired from the number of elements contained in the common part character string set, and the context edit distance is acquired from the change in the order of occurrence of the common part character strings. Calculation of a new creation percentage and analysis by N-gram are performed on the non-common part character strings. The new creation percentage is acquired from the total length of the elements contained in the non-common part character string set, and a new creation novelty degree is acquired from a non-partial matching rate between the non-common part character string set and an element contained in the non-common part character string set. The calculations for the common part character string set and for the non-common part character string set are combined, thereby calculating a text updated degree.
Owner:NAT UNIV CORP NAGAOKA UNIV TECH

Dynamic match lattice spotting for indexing speech content

A system for indexing and searching speech content includes two distinct stages: a speech indexing stage (100) and a speech retrieval stage (200). A phone lattice (103) is generated by passing speech content (101) through a speech recogniser (102). The resulting phone lattice is then processed to produce a set of observed sequences Q=(Θ,i) where Θ are the set of observed phone sequences for each node i in the phone lattice. During the retrieval stage (200), a user first inputs a target word (205) into the system, which is then reduced to a target phone sequence P=(p1, p2, . . . , pN) (207). The system then compares target sequence P with the set of observed sequences Q (208), suitably by scoring each observed sequence against the target sequence using a Minimum Edit Distance (MED) calculation to produce a set of matching sequences R (209).
Owner:QUEENSLAND UNIVERSITY OF TECH

Method and device for calculating similarity of Chinese character strings on the basis of edit distance

An embodiment of the invention provides a method for calculating the similarity of Chinese character strings on the basis of edit distance. The method includes: calculating the similarity of Chinese characters in character strings to be compared; calculating the similarity of the Chinese character strings to be compared. According to the method, the Chinese characters in the character strings are converted into four-corner codes by the four-corner code method; the similarity of the Chinese characters is accordingly calculated on the basis of edit distance; on this basis, the weight of edit distance is replaced with the similarity of the Chinese characters to calculate the similarity of the character strings. The Chinese characters are converted into numeric strings for comparison, thus matching of the Chinese characters is more precise; the weight of the edit distance is replaced with the similarity of the Chinese characters to calculate the similarity of the character strings, thus the edit distance algorithm is applied to matching of the Chinese character strings under the Chinese language environment and matching results are more accurate. In addition, another embodiment of the invention provides a device for calculating the similarity of Chinese character strings on the basis of edit distance.
Owner:SHENZHEN AUDAQUE DATA TECH

Method for matching Chinese similarity

The invention provides a method for matching Chinese similarity. An edit distance formula and a keyboard fingering rule are used to obtain the editing similarity of the pinyin corresponding to the Chinese text, which reflects how easily two pinyin strings are confused during typing. The pronunciation rules for the initials and finals of Chinese characters are used to obtain the initial similarity and the final similarity of character strings, and common fuzzy tones in dialects or everyday pronunciation are combined to calculate the pronunciation similarity among character strings. Because the written form of a Chinese character is one of the most important characteristics of Chinese, character form coding, namely Five-stroke Method coding, is used to calculate the character form similarity among character strings. Information is collected and calculated at the same time to update the data. The above similarities are combined to obtain the overall similarity of Chinese words. Factors such as Chinese spelling habits, user input habits, keyboard layout, Mandarin pronunciation rules, dialects, common mispronunciations and Chinese character forms are fully considered and combined with statistical regularities, so that the similarity among Chinese words is evaluated comprehensively.
Owner:TSINGHUA UNIV

Clustering of near-duplicate documents

Documents likely to be near-duplicates are clustered based on document vectors that represent word-occurrence patterns in a relatively low-dimensional space. Edit distance between documents is defined based on comparing their document vectors. In one process, initial clusters are formed by applying a first edit-distance constraint relative to a root document of each cluster. The initial clusters can be merged subject to a second edit-distance constraint that limits the maximum edit distance between any two documents in the cluster. The second edit-distance constraint can be defined such that whether it is satisfied can be determined by comparing cluster structures rather than individual documents.
Owner:HEWLETT-PACKARD ENTERPRISE DEV LP

Methods and systems for mining websites

Mining of websites, which in one embodiment includes: obtaining web usage data of user sessions of a website, wherein the website has a hierarchical structure with granular levels and has a mapping from each webpage of the website into the hierarchical structure; mapping the user sessions to the hierarchical structure of the website, resulting in hierarchical user sessions; applying an edit distance metric to determine similarity among the hierarchical user sessions; and clustering similar hierarchical user sessions into groups.
Owner:NBCUNIVERSAL

Method for computing the minimum edit distance with fine granularity suitably quickly

This invention relates to a method for computing the minimum edit distance, measured as the number of insertions plus the number of deletions, between two sequences of data, which runs in an amount of time that is nearly proportional to the size of the input data under many circumstances. Utilizing the A* (or A-star) search, the invention searches for the answer using a novel counting heuristic that gives a lower bound on the minimum edit distance for any given subproblem. In addition, regions over which the heuristic matches the maximum value of the answer are optimized by eliminating the search over redundant paths. The invention can also be used to produce the edit script. The invention can be modified for other types of comparison and pattern recognition.
Owner:ANACAPA NORTH
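
For reference, the distance measured in the entry above (insertions plus deletions only, with no substitution) can be computed with a plain dynamic-programming baseline, since it equals the combined length of the two sequences minus twice the length of their longest common subsequence. The sketch below is that quadratic-time baseline, not the A* search with a counting heuristic that the entry describes; the function name `indel_distance` is an assumption.

```python
def indel_distance(a, b):
    """Minimum number of insertions plus deletions turning a into b.

    Uses the identity: distance = len(a) + len(b) - 2 * LCS(a, b).
    Works on strings or any other sequences of comparable items.
    """
    # prev[j] = length of the longest common subsequence of the processed
    # prefix of a and b[:j]
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return len(a) + len(b) - 2 * prev[-1]
```

Because a substitution has to be expressed as one deletion plus one insertion, `indel_distance("abc", "abd")` is 2 rather than 1.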

Method and device for identifying human face through double models

The invention discloses a method for identifying a human face through double models and mainly solves the problem that traditional identification methods depend heavily on the texture of the face image. The method of the invention comprises the following steps: dividing a face image sample set into a test image set and a training image set, and learning from the training images to obtain an eigenface subspace and an active appearance model; projecting the test and training images onto the eigenface subspace to obtain texture models, and calculating the distance between the texture models of the test and training images; automatically locating feature points in the test and training images according to the active appearance model, constructing shape models, and taking an image edit distance as the distance between the shape models of the test and training images; and determining the identity of the test image through a weighted fusion of the two distances. Compared with identification methods based on texture alone or on structural information alone, the method of the invention achieves a higher identification rate on face images with changes in expression, illumination and size, particularly on face images acquired under changing illumination, and can be used for authentication under the influence of multiple factors.
Owner:XIDIAN UNIV

Spelling variation dictionary generation system

A system for effectively collecting, without omissions, spelling variations centering on particular technical terms occurring in documents. In advance, the system sorts technical terms considered to be potential spelling variations from among a large-scale collection of terms. By measuring the edit distance adjusted for the cost of the terms that are potential spelling variations, the system can collect terms considered spelling variations from among the potential spelling variation terms with a high degree of accuracy.
Owner:HITACHI LTD

Computer-assisted computing method of semantic distance between short texts

A computer-assisted method for computing the semantic distance between short texts belongs to the technical field of Chinese text processing and is characterized in that the semantic distance between two short texts is defined as the sum of the syntactic structure distance and the unit semantic distance. Webpage markup removal, variant short text processing and word segmentation are performed on the texts to obtain a series of word strings. Semantic alignment is performed on corresponding word strings in the two short texts according to a word similarity matrix, and the syntactic structure distance is obtained from the number of word adjustments made in the process. Using the five-level structure of words in the <extended synonym thesaurus>, and introducing Chinese keywords and near-synonym concepts, 5 kinds of operations including insertion, deletion, replacement and the like are performed on the words on the basis of semantic alignment, with the word as the unit, and the weighted sum of the operations is used to express the unit semantic distance between the word strings. The relative accuracy of the resulting semantic distance between texts is higher than that of the classical edit distance algorithm.
Owner:BEIJING UNIV OF TECH

Method and device for obtaining synonym

The invention relates to a method and a device for obtaining synonyms. The method comprises: obtaining a text set and performing word segmentation on the text set to generate a first term set; filtering stop words out of the first term set using a stop word list to generate a second term set; performing edit distance processing on any two words in the second term set to generate a first synonym set; establishing a vector space model on the words in the first term set; according to the model, obtaining the space vectors of each pair of synonyms, calculating the cosine similarity of each pair, and applying a cosine threshold filtering strategy to each pair to generate a second synonym set; performing part-of-speech tagging on the words in the second synonym set to generate a third synonym set; and processing the words in the third synonym set with a unary model to obtain synonyms. Thus, the method and the device make the retrieved synonyms more accurate, free of ambiguous words and stop words, so that webpages related to the synonyms can be retrieved intelligently.
Owner:ADVANCED NEW TECH CO LTD