
Method for discovering sensitive data in text big data

The technology concerns sensitive data in big data and is applied in electrical digital data processing, special data processing applications, instruments, etc. It addresses the difficulty of discovering sensitive information and achieves comprehensive and accurate analysis, easy implementation, and improved efficiency.

Status: Inactive
Publication Date: 2018-07-13
NO 30 INST OF CHINA ELECTRONIC TECH GRP CORP
3 Cites, 13 Cited by

AI Technical Summary

Problems solved by technology

[0010] (1) Traditional methods have difficulty finding sensitive data, or detecting sensitive information that has been deliberately altered by human interference. The method of the present invention can effectively discover potentially sensitive private information in large heterogeneous text data collections;



Examples


Detailed Description of the Embodiments

[0029] As shown in Figure 1, the basic idea of the method proposed by the present invention for discovering potentially sensitive data in text big data is:

[0030] (1) First, establish a sensitive word information base that contains the standardized description of each predefined sensitive word together with its artificially disguised or deformed variants, including character splitting, internet slang, typographical manipulation, pinyin transliteration, etc. (a standardized description and its corresponding variant descriptions belong to the same piece of sensitive information). At the same time, determine the weight coefficient of each sensitive word in the lexicon according to the context in which the word appears and its semantics;

[0031] (2) Then build a sensitive word retrieval tree covering all sensitive words in the sensitive word information base;

[0032] (3) Preprocess the text, including removing punctuation marks, auxiliary words, stop words, etc.;

...
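
Steps (1) to (3) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the lexicon entries, the stop-word list, and the function names (`preprocess`, `build_trie`, `count_sensitive_words`) are all hypothetical, and a plain character trie with longest-match scanning stands in for the retrieval tree, whose exact structure is not given in this excerpt. Each variant form is stored in the trie with the same preprocessing applied to the text, so a disguised spelling maps back to its standardized word.

```python
import re
from collections import Counter

# Hypothetical lexicon: standardized word -> (weight, disguised variants).
# The weight is stored with the word, as step (1) describes, but is not used here.
LEXICON = {
    "password": (0.9, ["pass word", "p@ssword", "mima"]),  # split, symbol, pinyin
    "secret": (0.5, ["s3cret"]),
}
STOP_WORDS = {"the", "a", "is", "of"}  # illustrative stop-word list


class Node:
    """One trie node; `word` holds the standardized word at a terminal node."""

    def __init__(self):
        self.children = {}
        self.word = None


def preprocess(text):
    """Lowercase, drop punctuation and stop words, return a character stream."""
    tokens = re.findall(r"\w+", text.lower())
    return "".join(t for t in tokens if t not in STOP_WORDS)


def build_trie(lexicon):
    """Insert every standardized word and variant, normalized like the text."""
    root = Node()
    for canonical, (_weight, variants) in lexicon.items():
        for form in [canonical, *variants]:
            node = root
            for ch in preprocess(form):
                node = node.children.setdefault(ch, Node())
            node.word = canonical
    return root


def count_sensitive_words(text, root):
    """Scan the character stream and count the longest match at each position."""
    stream = preprocess(text)
    counts = Counter()
    i = 0
    while i < len(stream):
        node, j, hit, hit_end = root, i, None, i
        while j < len(stream) and stream[j] in node.children:
            node = node.children[stream[j]]
            j += 1
            if node.word is not None:
                hit, hit_end = node.word, j
        if hit is not None:
            counts[hit] += 1
            i = hit_end  # continue after the matched word
        else:
            i += 1
    return counts


ROOT = build_trie(LEXICON)
```

Because whitespace is removed before matching, this sketch can also match across word boundaries; a production system would need to decide whether that behavior is wanted.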


Abstract

The invention discloses a method for discovering sensitive data in text big data. The method comprises the following steps: a sensitive word information base is established; a sensitive word retrieval tree is built for all sensitive words in the sensitive word information base; the preprocessed text is fed through the retrieval tree as a character stream to search for sensitive words, and the occurrence frequency of each sensitive word is counted; the sensitivity of the text to be retrieved is calculated from the occurrence frequency of the sensitive words, the document class, and the weight rank of the sensitive words; the sensitivity of the text to be retrieved is compared with a preset threshold, and a text whose sensitivity value exceeds the threshold is determined to be a sensitive text. The method provides a way to quickly discover sensitive information in both standardized and non-standardized descriptions across massive heterogeneous texts; by combining fast, exact search with fuzzy retrieval of sensitive information, it achieves rapid discovery of potentially sensitive information.
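
The scoring and threshold steps can be sketched as follows. The abstract names the inputs (occurrence frequency, document class, and word weight) but this excerpt does not give the formula, so the rule below, a document-class factor multiplied by the weighted sum of frequencies, is purely an assumption for illustration. The names `text_sensitivity`, `is_sensitive`, and `class_factor` are hypothetical.

```python
def text_sensitivity(counts, weights, class_factor=1.0):
    """Assumed rule (not from the patent): class_factor * sum(freq * weight)."""
    weighted = sum(freq * weights.get(word, 0.0) for word, freq in counts.items())
    return class_factor * weighted


def is_sensitive(counts, weights, class_factor, threshold):
    """A text is flagged when its sensitivity exceeds the preset threshold."""
    return text_sensitivity(counts, weights, class_factor) > threshold


# Illustrative inputs: per-word frequencies and weight coefficients.
example_counts = {"password": 3, "secret": 1}
example_weights = {"password": 0.9, "secret": 0.5}
flagged = is_sensitive(example_counts, example_weights, class_factor=1.5, threshold=4.0)
```

With these illustrative numbers the weighted sum is about 3.2, so the text is flagged only when the document-class factor raises the score above the threshold of 4.0.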

Description

Technical field

[0001] The invention relates to a method for discovering sensitive data in text big data.

Background technique

[0002] Sensitive private data must be discovered accurately and quickly in massive, heterogeneous, and polysemous data in order to meet the needs of data sharing and exchange, data release, and secure data use in a big data environment, and to provide a basis for data access control. At present, the methods used to find sensitive information in text mainly include fast keyword matching algorithms, ontology-based semantic retrieval, and data mining based methods. Among them, sensitive information discovery based on pattern matching is currently the main technical means. Sensitive data discovery technology is widely used in information filtering, data exchange and sharing, secure mail, system audit, news bulletins, etc.

[0003] (1) Method 1: content keyword matching method

[0004] This method mainly focuses on keyword comparison and matching, con...


Application Information

IPC(8): G06F17/30
CPC: G06F16/313, G06F16/334
Inventors: 杨永刚, 张锋军, 李庆华, 牛作元
Owner: NO 30 INST OF CHINA ELECTRONIC TECH GRP CORP