35 results about "Bigram" patented technology

A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2. The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, speech recognition, and so on. Gappy bigrams or skipping bigrams are word pairs which allow gaps (perhaps avoiding connecting words, or allowing some simulation of dependencies, as in a dependency grammar).
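
The definitions above can be illustrated with a minimal sketch. The function names `bigrams` and `skip_bigrams` and the `max_gap` parameter are illustrative assumptions, not taken from any of the listed patents.

```python
from collections import Counter


def bigrams(tokens):
    """Return every pair of adjacent elements in a sequence of tokens."""
    return list(zip(tokens, tokens[1:]))


def skip_bigrams(tokens, max_gap):
    """Return ordered pairs with at most `max_gap` tokens skipped between them.

    With max_gap=0 this reduces to ordinary bigrams.
    """
    pairs = []
    for i, left in enumerate(tokens):
        # j runs over the partners of token i that are close enough.
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


# Frequency distribution of word bigrams and of letter bigrams.
words = "the cat sat on the cat mat".split()
word_counts = Counter(bigrams(words))
letter_counts = Counter(bigrams("banana"))
```

Here `word_counts` records that the pair ("the", "cat") occurs twice, and `letter_counts` records two occurrences each of "an" and "na". The same functions work on letters, syllables, or words because they only assume a sequence of tokens.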

System and method for productive generation of compound words in statistical machine translation

A method and a system for making merging decisions for a translation are disclosed which are suited to use where the target language is a productive compounding one. The method includes outputting decisions on merging of pairs of words in a translated text string with a merging system. The merging system can include a set of stored heuristics and / or a merging model. In the case of heuristics, these can include a heuristic by which two consecutive words in the string are considered for merging if the first word of the two consecutive words is recognized as a compound modifier and their observed frequency f1 as a closed compound word is larger than an observed frequency f2 of the two consecutive words as a bigram. In the case of a merging model, it can be one that is trained on features associated with pairs of consecutive tokens of text strings in a training set and predetermined merging decisions for the pairs. A translation in the target language is output, based on the merging decisions for the translated text string.
Owner:XEROX CORP
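
The frequency heuristic described in this abstract might be sketched as follows. This is only an illustration under stated assumptions: the function names, the lookup tables, and the toy counts are hypothetical, and the closed compound is formed by plain concatenation, which ignores any linking elements a real compounding language may require.

```python
def should_merge(first, second, compound_modifiers, compound_freq, bigram_freq):
    """Apply the heuristic: merge if `first` is a known compound modifier and
    the closed compound is observed more often (f1) than the bigram (f2)."""
    if first not in compound_modifiers:
        return False
    f1 = compound_freq.get(first + second, 0)   # frequency as a closed compound
    f2 = bigram_freq.get((first, second), 0)    # frequency as two separate words
    return f1 > f2


def merge_compounds(tokens, compound_modifiers, compound_freq, bigram_freq):
    """Scan left to right and merge consecutive pairs chosen by the heuristic."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and should_merge(
            tokens[i], tokens[i + 1], compound_modifiers, compound_freq, bigram_freq
        ):
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical toy statistics, for illustration only.
modifiers = {"regen"}
compound_freq = {"regenschirm": 40}
bigram_freq = {("regen", "schirm"): 3}
result = merge_compounds(["ein", "regen", "schirm"], modifiers, compound_freq, bigram_freq)
```

With these toy counts, `result` is `["ein", "regenschirm"]`. The trained merging model mentioned in the abstract is not covered by this sketch.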

Automatic microblog text abstracting method based on unsupervised key bigram extraction

The invention discloses an automatic microblog text abstracting method based on unsupervised key bigram extraction. The method comprises the steps of: preprocessing the microblog posts; normalizing the bigrams; extracting key bigrams using a hybrid of TF-IDF (term frequency-inverse document frequency), TextRank and LDA (latent Dirichlet allocation); ranking sentences based on intersection similarity and a mutual information strategy; extracting abstract sentences based on a similarity threshold; and generating the abstract by combining the abstract sentences. The method uses the bigram as the minimum vocabulary unit. Because a bigram carries richer textual information than a single word, sentence extraction based on key bigrams is more robust to noise and more accurate than extraction based on keywords. When the abstract sentences are extracted, the similarity threshold is introduced to control redundancy, so that the abstract achieves a higher recall rate. The abstract generated by the method is accurate, concise and comprehensive; it improves the efficiency and quality with which a user acquires knowledge and saves the user's time.
Owner:INST OF AUTOMATION CHINESE ACAD OF SCI
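
The redundancy control step in this abstract could look roughly like the sketch below. It is not the patented method: the key bigrams are assumed to be given, sentences are ranked simply by how many key bigrams they contain, and Jaccard similarity over bigram sets stands in for the unspecified intersection similarity. All function and parameter names are hypothetical.

```python
def word_bigrams(sentence):
    """Return the set of adjacent word pairs in a whitespace-tokenized sentence."""
    tokens = sentence.lower().split()
    return set(zip(tokens, tokens[1:]))


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_abstract(sentences, key_bigrams, threshold, max_sentences):
    """Greedily pick high-scoring sentences, skipping any candidate that is
    too similar (above `threshold`) to a sentence already chosen."""
    ranked = sorted(
        sentences,
        key=lambda s: len(word_bigrams(s) & key_bigrams),
        reverse=True,
    )
    chosen = []
    for candidate in ranked:
        if len(chosen) == max_sentences:
            break
        grams = word_bigrams(candidate)
        if all(jaccard(grams, word_bigrams(c)) <= threshold for c in chosen):
            chosen.append(candidate)
    return chosen


posts = [
    "heavy rain floods city streets",
    "heavy rain floods city streets today",
    "rescue teams reach flooded homes",
]
keys = {("heavy", "rain"), ("rain", "floods"), ("rescue", "teams")}
abstract = select_abstract(posts, keys, threshold=0.5, max_sentences=2)
```

In this toy run the second post is dropped because it shares four of five bigrams with the first, so `abstract` contains the first and third posts.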

Apparatus for secure computation of string comparators

We present an apparatus which can be used so that one party learns the value of a string distance metric applied to a pair of strings, each of which is held by a different party, in such a way that none of the parties can learn anything else significant about the strings. This apparatus can be applied to the problem of linking records from different databases, where privacy and confidentiality concerns prohibit the sharing of records. The apparatus can compute two different string similarity metrics: the bigram-based Dice coefficient and the Jaro-Winkler string comparator. The apparatus can implement a three-party protocol for the secure computation of the bigram-based Dice coefficient and a two-party protocol for the Jaro-Winkler string comparator, both of which are secure against collusion and cheating. The apparatus also implements a three-party Jaro-Winkler string comparator computation which is secure in the case of semi-honest participants.
Owner:TELECOMM RES LAB
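
For reference, the bigram-based Dice coefficient named in this abstract can be computed in the clear as shown below. This sketch covers only the plaintext metric, not the secure multi-party protocol. It uses the set-based variant (repeated bigrams count once), and the handling of strings shorter than two characters is a convention chosen here.

```python
def bigram_set(s):
    """Return the set of adjacent character pairs in a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_coefficient(a, b):
    """Dice coefficient: 2 * |X & Y| / (|X| + |Y|) over the bigram sets X and Y."""
    x, y = bigram_set(a), bigram_set(b)
    if not x and not y:
        # Neither string has a bigram; fall back to exact comparison.
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


score = dice_coefficient("smith", "smyth")
```

The strings "smith" and "smyth" share the bigrams "sm" and "th" out of four each, so `score` is 0.5. In a record-linkage setting, a threshold on this score would decide whether two name fields are treated as a match.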

A Text Positive and Negative Sentiment Classification Method

Inactive; CN107423371B; Improve the efficiency of scientific computing; Maximize class discrimination; Semantic analysis; Special data processing applications; Lexical item; Classification methods
The invention discloses a method for classifying text as expressing positive or negative sentiment. The method comprises the following steps: all texts in a text set are preprocessed to form a noise-free set of positive and negative texts; unigram and bigram segmentation are performed on the positive and negative texts; after stop words are removed, a multidimensional feature vector space without duplicate features is formed; inverse document frequency weighting is applied to a variant term frequency for every dimension of the feature vector space; finally, the resulting term-document matrix, together with the labeled positive and negative sentiment category tags, is used as input to train supervised classifiers (a support vector machine and logistic regression), yielding a linear text classifier that can predict the sentiment of new, unseen text. The method exploits the inherent discriminative power of sentiment words in a labeled corpus and proposes a new calculation method that maximizes the category discrimination of those words, thereby improving the precision of automated text sentiment classification.
Owner:HUBEI NORMAL UNIV
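
The feature construction in this abstract can be approximated with the sketch below. Several choices are assumptions: stop words are removed before bigrams are formed, the stop-word list is a toy one, and a standard `tf * (log(N / df) + 1)` weighting stands in for the patent's unspecified variant term frequency. Classifier training is omitted.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "was"}  # hypothetical toy stop-word list


def features(text):
    """Count unigram and bigram features of a text after stop-word removal."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    pairs = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return Counter(tokens + pairs)


def tfidf_matrix(texts):
    """Build a document-by-term matrix weighted by tf * (log(N / df) + 1)."""
    counts = [features(t) for t in texts]
    vocab = sorted(set().union(*counts))          # duplicate-free feature space
    n_docs = len(texts)
    doc_freq = Counter(term for c in counts for term in c)
    idf = {term: math.log(n_docs / doc_freq[term]) + 1.0 for term in vocab}
    rows = [[c[term] * idf[term] for term in vocab] for c in counts]
    return vocab, rows


vocab, rows = tfidf_matrix(["the movie was really good", "the movie was not good"])
```

The bigram feature `not_good` appears only in the second text, so it receives a higher weight there than the shared unigram `good`. That is the kind of discrimination between classes that bigram features are meant to add. The rows could then be passed, with their labels, to any supervised linear classifier.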

Multi-criterion Chinese word segmentation method based on local self-attention mechanism and segmentation tree

The invention discloses a multi-criterion Chinese word segmentation method based on a local attention mechanism and a segmentation tree. For a text sequence of a corpus, the method comprises the following steps: inputting the text sequence; obtaining unigram features and bigram features of each character through word2vec; combining the unigram features and the bigram features with a predefined position vector to serve as an embedding layer; passing the embedding layer to a self-attention network and obtaining its output; labeling each character through CRF layer decoding and obtaining a plurality of labeling results; combining the labeling results into a segmentation tree to form a plurality of segmentation sequences; and inputting the segmentation sequences into a scoring system and selecting the segmentation sequence with the highest score as output. The method improves the accuracy of multi-criterion word segmentation.
Owner:HANGZHOU DIANZI UNIV
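
The per-character unigram and bigram inputs mentioned in this abstract might be prepared as in the sketch below. It covers only the symbolic feature extraction that would precede an embedding lookup. Pairing each character with the one that follows it, and the `</s>` padding token, are assumptions made here rather than details given in the abstract.

```python
def char_features(text, pad="</s>"):
    """For each character, return its position, unigram, and bigram.

    The bigram is the character plus its right neighbour; the last
    character is paired with the padding token `pad`.
    """
    feats = []
    for i, ch in enumerate(text):
        right = text[i + 1] if i + 1 < len(text) else pad
        feats.append({"position": i, "unigram": ch, "bigram": ch + right})
    return feats


feats = char_features("北京大学")
```

Each entry of `feats` could then be mapped to embedding vectors, for example by looking up the unigram and bigram in a pretrained word2vec table and adding a position vector, before the self-attention and CRF stages.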