A method and device for extracting full-text data

A full-text data extraction technology, applied in electrical digital data processing, digital data information retrieval, special data processing applications, etc. It addresses the heavy labor consumption and low efficiency of existing approaches, with the effect of shortening data extraction time, improving extraction efficiency, and reducing search matching time.

Active Publication Date: 2019-12-06
RUN TECH CO LTD BEIJING
Cites: 4; Cited by: 0

AI Technical Summary

Problems solved by technology

[0003] There are two traditional full-text extraction methods. The first is template-based extraction, which suits information extraction from a specific website but cannot cope with the data generated by constantly changing mobile APPs and by different websites. The second extracts full-text content with regular expressions. This method suits offline full-text extraction on small amounts of data, but its efficiency is relatively low once it faces the large volume of data submitted by APPs.
Therefore, with large amounts of data, both methods consume a great deal of manpower, are inefficient, and can no longer meet requirements.


Examples

Embodiment 1

[0022] Figure 1 is a schematic flowchart of a method for extracting full-text data provided in Embodiment 1 of the present invention. The method can be executed by a device for extracting full-text data, and the device can be implemented in hardware and/or software. The method includes the following steps:

[0023] S110. Parse the network packet data into session data.

[0024] The method provided in this embodiment is applicable to data extraction for various communication protocols; the following uses HyperText Transfer Protocol (HTTP) data as an example to describe it in detail. First, the network packet data obtained from the data source is parsed into session data in text format. For HTTP protocol data, the HTTP protocol stack is used to parse it into HTTP POST session data. The parsed session data includes an HTTP header part and an HTTP entity part. To parse and restore HTTP POST session data according to the HTTP protocol stack, it is ...
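As an illustration of the header/entity split described in [0024], the following is a minimal sketch, not the patent's implementation. It assumes the session has already been restored to text; the function name `split_http_session` and the sample request are hypothetical.

```python
def split_http_session(raw: str):
    """Split restored HTTP request text into (request_line, headers, entity).

    Assumption: `raw` is one complete, already reassembled HTTP POST request.
    """
    # The header part and the entity part are separated by an empty line.
    head, _, entity = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    request_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:  # ignore malformed header lines without a colon
            headers[name.strip().lower()] = value.strip()
    return request_line, headers, entity


# Hypothetical sample session
raw = (
    "POST /submit HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"imei": "123456789012345"}'
)
request_line, headers, entity = split_http_session(raw)
```

Only the entity part is passed on to the later format check; the headers are kept here because they may help identify the format, which is an assumption rather than something the excerpt states.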

Embodiment 2

[0034] Figure 2 is a schematic flowchart of a method for extracting full-text data provided in Embodiment 2 of the present invention. As shown in Figure 2, the method includes:

[0035] S210. Parse the network packet data into session data.

[0036] S220. Determine whether the entity part of the session data conforms to a preset data format.

[0037] If yes, perform operations S230 and S250 in sequence; otherwise, return to operation S240 and then perform S220 again.

[0038] S230. Label the session data with its data format.

[0039] S240. Parse subsequent network packet data into session data.

[0040] S250. Perform multi-mode (multi-pattern) matching on the session data conforming to the preset data format, and judge whether a preset feature string is hit.

[0041] When a preset feature string is hit, operations S260, S270, S280, and S290 are executed in sequence; otherwise, operations S240 and S220 are executed in sequence.

[0042] S260. Obtain the hit position...
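The control flow of S210 to S260 can be sketched as follows. This is a hedged illustration only: the excerpt does not list the preset data formats or feature strings, so JSON and URL-encoded forms and the names in `FEATURE_STRINGS` are assumptions, and plain `str.find` stands in for the multi-mode matcher.

```python
import json
from urllib.parse import parse_qsl

# Hypothetical preset feature strings; the excerpt does not enumerate them.
FEATURE_STRINGS = ["imei", "latitude", "longitude"]


def label_format(entity: str):
    """S220/S230: return a format label, or None if no preset format fits."""
    text = entity.strip()
    if not text:
        return None
    try:
        if isinstance(json.loads(text), (dict, list)):
            return "json"
    except ValueError:
        pass
    try:
        pairs = parse_qsl(text, strict_parsing=True)
    except ValueError:
        return None
    return "form" if pairs else None


def find_hits(text, features):
    """S250/S260: map each feature string found in text to its first position."""
    hits = {}
    for feature in features:
        pos = text.find(feature)
        if pos != -1:
            hits[feature] = pos
    return hits


def process(entities):
    """Yield (label, hits, entity) for sessions that pass both checks."""
    for entity in entities:            # S210 / S240: next parsed session
        label = label_format(entity)   # S220, S230
        if label is None:
            continue                   # back to S240
        hits = find_hits(entity, FEATURE_STRINGS)  # S250
        if not hits:
            continue                   # back to S240
        yield label, hits, entity      # S260: hit positions are known


entities = [
    '{"imei": "123"}',
    "plain text",
    "a=1&b=2",
    "latitude=39.9&longitude=116.4",
]
results = list(process(entities))
```

Sessions that fail either check are simply skipped, which mirrors the "return to S240 and S220" branches in [0037] and [0041].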

Embodiment 3

[0066] Figure 3 shows a device for extracting full-text data provided in Embodiment 3 of the present invention. As shown in Figure 3, the device includes:

[0067] Parsing module 31, configured to parse network packet data into session data;

[0068] Annotation module 32, configured to judge whether the entity part of the session data conforms to a preset data format and, if so, to label the session data with its data format;

[0069] Multi-mode matching module 33, configured to perform multi-mode matching on the session data conforming to the preset data format, judge whether a preset feature string is hit, and obtain the hit position of the preset feature string when it is hit;

[0070] Data extraction module 34, configured to determine the corresponding extraction function of the session data according to the data format annotation of the session data and the hit position of the preset feature string, and perform the extraction function on th...
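One way to picture the data extraction module is a dispatch table keyed by the format label, with each extraction function starting from the hit position instead of re-parsing the whole entity. The sketch below is an assumption about how such a module could look; the labels "json" and "form" and all function names are hypothetical and do not come from the patent text.

```python
import json
from urllib.parse import unquote_plus

_decoder = json.JSONDecoder()


def extract_json_value(entity, feature, pos):
    """Decode the JSON value that follows the key found at `pos`."""
    colon = entity.find(":", pos + len(feature))
    if colon == -1:
        return None
    start = colon + 1
    while start < len(entity) and entity[start].isspace():
        start += 1
    try:
        value, _ = _decoder.raw_decode(entity, start)
    except ValueError:
        return None
    return value


def extract_form_value(entity, feature, pos):
    """Read the URL-encoded value that follows `feature=` at `pos`."""
    start = pos + len(feature)
    if not entity.startswith("=", start):
        return None
    end = entity.find("&", start)
    if end == -1:
        end = len(entity)
    return unquote_plus(entity[start + 1:end])


# Format label -> extraction function (hypothetical labels)
EXTRACTORS = {"json": extract_json_value, "form": extract_form_value}


def extract(label, hits, entity):
    """Apply the extraction function chosen by the format label to every hit."""
    fn = EXTRACTORS[label]
    return {feature: fn(entity, feature, pos) for feature, pos in hits.items()}


json_result = extract(
    "json", {"imei": 2, "latitude": 17}, '{"imei": "123", "latitude": 39.9}'
)
form_result = extract(
    "form", {"latitude": 0, "longitude": 14}, "latitude=39.9&longitude=116.4"
)
```

Because each function jumps straight to the hit position, the cost per field depends on the length of the value rather than on the size of the whole entity.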

Abstract

The invention discloses a full-text data extraction method and device. The method comprises the following steps: analyzing network packet data into session data; judging whether the entity part of the session data accords with a preset data format or not, and carrying out data format labelling on the session data if the entity part of the session data accords with the preset data format; carrying out multimode matching on the session data which accords with the preset data format, judging whether a preset characteristic string is hit or not and obtaining the hitting position of the preset characteristic string when the preset characteristic string is hit; and determining an extraction function corresponding to the session data according to the data format label of the session data and the hitting position of the preset characteristic string, and carrying out data extraction on the session data according to the extraction function. According to the full-text data extraction method and device, the technical effect of improving the full-text data extraction efficiency of mass data is realized.
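The multimode matching step finds many feature strings in one pass over the text and reports where each one was hit. The excerpt does not name the algorithm; a common choice for this kind of matching is the Aho-Corasick automaton, sketched below as an assumption. The patterns and sample text are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for Aho-Corasick matching."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link per state
    out = [[]]    # patterns that end at each state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first pass to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, automaton):
    """Return (pattern, start position) for every hit, in a single scan."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((pattern, i - len(pattern) + 1))
    return hits


automaton = build_automaton(["imei", "lat", "latitude"])
hits = search("imei=1&latitude=2", automaton)
```

The text is scanned once regardless of how many feature strings are configured, which is the property that makes this family of algorithms attractive for large volumes of session data.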

Description

Technical field

[0001] Embodiments of the present invention relate to the field of mobile and big data processing technologies, and in particular to a method and device for extracting full-text data.

Background

[0002] With the rapid development of the Internet, data has penetrated every industry and business function and has gradually become an important factor of production, bringing with it massive amounts of data that can be analyzed and processed. In medium-sized and larger cities such as Beijing and Shanghai, the various types of data generated by network activity each day already exceed the PB level. For example, mobile applications (APPs) generate several terabytes of submitted data every day. These data contain information such as latitude and longitude, mobile phone serial number, user identity card number, and the unique identification code of the mobile phone, and this information is very useful in the security supervision industry, so mass...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/9535
CPC: G06F16/9535
Inventor: 冯建业
Owner: RUN TECH CO LTD BEIJING