Wikipedia-based Chinese and English cross-language entity matching method

A Wikipedia-based matching method, applied in special data processing applications, instruments, electrical digital data processing, etc., that addresses problems such as non-compliance with the LOD open principle and loss of knowledge integrity and reliability.

Active Publication Date: 2017-04-19

ZHEJIANG UNIV

Cites: 2; Cited by: 20

AI Technical Summary

Problems solved by technology

However, our knowledge base is not associated with authoritative knowledge bases in other languages, such as DBpedia. This is a great loss to the integrity and reliability of knowledge, and it also does not conform to the open principle of LOD (Linked Open Data).

Examples

Embodiment

[0141] An example is provided below to describe the implementation steps of the present invention in detail:

[0142] (1) The data sets used in the example come from Chinese Wikipedia and English Wikipedia. Chinese Wikipedia has 1,020,863 pages, and English Wikipedia has 6,144,107 pages. The information structure of these pages is analyzed; titles, abstracts, directories, categories, incoming links, outgoing links, full text, and other information are extracted and stored in a Lucene index. Except for the title, all fields can be null.

[0143] (2) Randomly select 3,000 pages with existing cross-language links from the Chinese Wikipedia in (1), and use the outgoing links and existing cross-language links to extract the English cross-language page candidate set for each of these 3,000 Chinese pages.

[0144] (3) Use the existing cross-language links to construct training data and train the parameters in the latent Dirichlet distributi...
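Steps (1) and (2) can be illustrated with a minimal Python sketch. The excerpt does not spell out the exact candidate rule, so the one used here is an assumption: an English page becomes a candidate if it links out to the English counterpart of at least one page that the Chinese page links out to. The `WikiPage` record, the `candidate_set` function, and the sample pages are all hypothetical, and plain dictionaries stand in for the Lucene index.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class WikiPage:
    """One indexed page; only the title is required, as in step (1)."""
    title: str
    abstract: Optional[str] = None
    out_links: Set[str] = field(default_factory=set)
    in_links: Set[str] = field(default_factory=set)
    categories: Set[str] = field(default_factory=set)


def candidate_set(
    zh_page: WikiPage,
    known_links: Dict[str, str],
    en_pages: Dict[str, WikiPage],
) -> Set[str]:
    """Return English candidate titles for one Chinese page.

    Assumed rule (not confirmed by the source): keep every English page
    whose out-links overlap the English counterparts of the Chinese
    page's out-links.
    """
    # Map Chinese out-links to English titles via existing cross-language links.
    anchors = {known_links[t] for t in zh_page.out_links if t in known_links}
    return {
        title
        for title, page in en_pages.items()
        if page.out_links & anchors
    }


# Hypothetical toy data, for illustration only.
zh_page = WikiPage(title="杭州", out_links={"浙江省", "西湖", "中国"})
known_links = {"浙江省": "Zhejiang", "中国": "China"}
en_pages = {
    "Hangzhou": WikiPage("Hangzhou", out_links={"Zhejiang", "West Lake"}),
    "Ningbo": WikiPage("Ningbo", out_links={"Zhejiang", "China"}),
    "Paris": WikiPage("Paris", out_links={"France"}),
}

print(sorted(candidate_set(zh_page, known_links, en_pages)))
# prints ['Hangzhou', 'Ningbo']
```

The candidate set is deliberately loose: it only narrows roughly six million English pages down to a small pool, and a later ranking step is expected to pick the actual match.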

Abstract

The invention discloses a Wikipedia-based Chinese and English cross-language entity matching method. First, Chinese and English Wikipedia page data are obtained through wikidump and preprocessed: the titles, abstracts, directories, full texts, outgoing links, incoming links, and category information of the pages are extracted; the titles of the Chinese wiki pages are translated into English; word segmentation is performed on the abstracts, directories, and full texts; and the page cross-language links and category cross-language links that already exist in Chinese and English Wikipedia are extracted. Second, for each Chinese wiki page, a candidate set of English cross-language pages is obtained according to its outgoing link information. Third, features are calculated between the Chinese page and each page in its English candidate set. Finally, a ranking model is built, the English candidate pages of the current Chinese page are ranked by similarity, and the candidate with the highest similarity is taken as the cross-language link of the current Chinese page.
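The final two steps of the abstract (feature calculation and similarity ranking) can be sketched as follows. This excerpt does not list the actual features or the form of the ranking model, so everything here is a stand-in: three simple features (exact match of the translated title, Jaccard overlap of out-links, Jaccard overlap of categories) combined by a fixed linear weighting. The names `jaccard`, `features`, `rank`, and `WEIGHTS`, and the dictionary keys, are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def features(zh: Dict, en: Dict) -> List[float]:
    """Illustrative features between a Chinese page and one English candidate.

    The Chinese page is assumed to carry its title, out-links, and
    categories already mapped into English.
    """
    return [
        1.0 if zh["title_en"].lower() == en["title"].lower() else 0.0,
        jaccard(zh["out_links_en"], en["out_links"]),
        jaccard(zh["categories_en"], en["categories"]),
    ]


# Hypothetical hand-set weights; a trained ranking model would replace these.
WEIGHTS = [0.5, 0.3, 0.2]


def rank(zh: Dict, candidates: List[Dict]) -> List[Tuple[str, float]]:
    """Return (title, score) pairs sorted from most to least similar."""
    scored = [
        (en["title"], sum(w * f for w, f in zip(WEIGHTS, features(zh, en))))
        for en in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data, for illustration only.
zh = {
    "title_en": "Hangzhou",
    "out_links_en": {"Zhejiang", "China"},
    "categories_en": {"Cities in Zhejiang"},
}
candidates = [
    {"title": "Ningbo", "out_links": {"Zhejiang", "China"},
     "categories": {"Cities in Zhejiang"}},
    {"title": "Hangzhou", "out_links": {"Zhejiang", "West Lake"},
     "categories": {"Cities in Zhejiang"}},
]

ranking = rank(zh, candidates)
best_title = ranking[0][0]  # top-ranked candidate becomes the cross-language link
print(best_title)
```

In this toy case the link overlap alone would favor the wrong page, and the title feature corrects it, which is why the method combines several features rather than relying on one.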

Description

Technical field

[0001] The present invention relates to methods such as topic models, deep learning, and text similarity calculation, and in particular to a Wikipedia-based Chinese-English cross-language entity matching method.

Background

[0002] With the development of machine learning, deep learning, and other technologies, the construction of knowledge bases has also improved, and many knowledge bases already exist. For example, DBpedia is a special example of a Semantic Web application. It extracts structured data from Wikipedia entries to enhance Wikipedia's search function and links other data sets to Wikipedia. Freebase is a large-scale collaborative knowledge base that integrates many resources on the Internet. Like DBpedia, entries in Freebase take the form of structured data. Accessing its data shows that all content is formatted, stored, and displayed as triples. The schema is fixed, and entries of the...

Application Information

IPC(8): G06F17/30

CPC: G06F16/3337, G06F16/9558

Inventor: 鲁伟明, 戴豪, 庄越挺

Owner: ZHEJIANG UNIV
