
Wikipedia-based Chinese and English cross-language entity matching method

A Wikipedia-based matching method, applied in special data processing applications, instruments, electrical digital data processing, etc., that addresses problems such as non-compliance with the LOD open principle and the loss of knowledge integrity and reliability.

Active Publication Date: 2017-04-19
ZHEJIANG UNIV
Cites: 2. Cited by: 20.

AI Technical Summary

Problems solved by technology

However, our knowledge base is not linked to authoritative knowledge bases in other languages, such as DBpedia. This is a great loss to the integrity and reliability of knowledge, and it also does not conform to the open principle of LOD (Linked Open Data).



Examples


Embodiment

[0141] An example is provided below to describe the implementation steps of the present invention in detail:

[0142] (1) The data sets used in the example come from Chinese Wikipedia and English Wikipedia. Chinese Wikipedia contains 1,020,863 pages and English Wikipedia contains 6,144,107 pages. The information structure of these pages is analyzed; titles, abstracts, directories, categories, link-in links, link-out links, full text, and other information are extracted and stored in a Lucene index. Every field except the title may be null.

[0143] (2) Randomly select 3,000 pages with existing cross-language links from the Chinese Wikipedia in (1), and use the outgoing links and existing cross-language links to extract the English cross-language page candidate set for each of these 3,000 Chinese pages.

[0144] (3) Use the existing cross-language links to construct training data and train the parameters in the latent Dirichlet distributi...
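Steps (1) and (2) above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patent's implementation: the `WikiPage` record, the `candidate_set` function, and the in-memory dictionaries are hypothetical stand-ins for the Lucene index, and the candidate heuristic (English pages that link to the known English counterparts of the Chinese page's outgoing links) is one plausible reading of "use the outgoing links and existing cross-language links".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class WikiPage:
    """One indexed page. Only the title is required, as in step (1)."""
    title: str
    abstract: Optional[str] = None
    directory: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
    in_links: Set[str] = field(default_factory=set)
    out_links: Set[str] = field(default_factory=set)
    full_text: Optional[str] = None


def candidate_set(
    zh_page: WikiPage,
    known_links: Dict[str, str],
    en_in_links: Dict[str, Set[str]],
) -> Set[str]:
    """Collect English candidate titles for one Chinese page.

    Assumed heuristic: map each outgoing link of the Chinese page to its
    English counterpart through an existing cross-language link, then take
    every English page that links to that counterpart as a candidate.
    """
    candidates: Set[str] = set()
    for zh_target in zh_page.out_links:
        en_target = known_links.get(zh_target)
        if en_target is None:
            continue  # no existing cross-language link for this target
        candidates |= en_in_links.get(en_target, set())
    return candidates


# Toy data, invented for illustration only.
zh = WikiPage(title="杭州", out_links={"浙江省", "西湖"})
known = {"浙江省": "Zhejiang", "西湖": "West Lake"}
en_in = {"Zhejiang": {"Hangzhou", "Ningbo"}, "West Lake": {"Hangzhou"}}
print(sorted(candidate_set(zh, known, en_in)))
```

The candidate set is deliberately generous here; narrowing it down is left to the feature computation and sorting model described in the abstract.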



Abstract

The invention discloses a Wikipedia-based Chinese and English cross-language entity matching method. First, Chinese and English Wikipedia page data are obtained from Wikipedia dump files and preprocessed: the titles, abstracts, directories, full texts, link-out links, link-in links, and category information of the pages are extracted; the titles of the Chinese wiki pages are translated into English; word segmentation is performed on the abstracts, directories, and full texts; and the page cross-language links and category cross-language links that already exist in Chinese and English Wikipedia are extracted. Second, for each Chinese wiki page, an English cross-language page candidate set is obtained according to its link-out link information. Third, features are calculated between the Chinese page and each page in its English candidate set. Finally, a sorting model is built, the English candidate pages of the current Chinese page are sorted by similarity, and the candidate page with the highest similarity is taken as the cross-language link of the current Chinese page.
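The final two steps of the abstract (feature calculation and similarity sorting) can be illustrated with a small sketch. Everything concrete in it is an assumption made for illustration: the abstract does not list the features or specify the sorting model, so two simple Jaccard features and a fixed weighted sum stand in for them here. The names `jaccard`, `rank_candidates`, `w_title`, and `w_links` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_candidates(
    zh_title_en: str,
    zh_links_en: Set[str],
    candidates: Dict[str, Set[str]],
    w_title: float = 0.5,
    w_links: float = 0.5,
) -> List[Tuple[str, float]]:
    """Score each English candidate and sort by descending similarity.

    zh_title_en: the Chinese page title after translation into English.
    zh_links_en: the Chinese page's outgoing links mapped to English titles.
    candidates:  English candidate title -> its outgoing link titles.
    The weighted sum is a stand-in for the trained sorting model.
    """
    zh_tokens = set(zh_title_en.lower().split())
    scored: List[Tuple[str, float]] = []
    for en_title, en_links in candidates.items():
        title_sim = jaccard(zh_tokens, set(en_title.lower().split()))
        link_sim = jaccard(zh_links_en, en_links)
        scored.append((en_title, w_title * title_sim + w_links * link_sim))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


# Toy data, invented for illustration only.
ranked = rank_candidates(
    "Hangzhou",
    {"Zhejiang", "West Lake"},
    {
        "Hangzhou": {"Zhejiang", "West Lake", "China"},
        "Ningbo": {"Zhejiang", "China"},
    },
)
best_match = ranked[0][0]  # top-ranked candidate becomes the predicted link
print(best_match)
```

In a fuller system the fixed weights would be replaced by parameters learned from the existing cross-language links, which is the role the abstract assigns to the sorting model.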

Description

Technical field

[0001] The present invention relates to methods such as topic models, deep learning, and text similarity calculation, and in particular to a Chinese-English cross-language entity matching method based on Wikipedia.

Background technique

[0002] With the development of machine learning, deep learning, and other technologies, the construction of knowledge bases has also improved, and many knowledge bases already exist. For example, DBpedia is a notable example of a Semantic Web application. It extracts structured data from Wikipedia entries to enhance Wikipedia's search function and links other data sets to Wikipedia. Freebase is a large-scale collaborative knowledge base that integrates many resources on the Internet. Entries in Freebase are similar to those in DBpedia, both taking the form of structured data. By accessing its data, it can be found that all the contents are formatted, stored, and displayed in triple format. The schema is fixed, and entries of the...


Application Information

IPC(8): G06F17/30
CPC: G06F16/3337, G06F16/9558
Inventor: 鲁伟明, 戴豪, 庄越挺
Owner: ZHEJIANG UNIV