INT8 offline quantization and integer inference method based on the Transformer model

A technology concerning quantization coefficients and integer inference, applied to biological neural network models, electrical digital data processing, digital data processing components, etc. It addresses issues such as running models on edge devices, achieving the effects of improving computing speed, reducing computing power and storage requirements, and reducing precision loss.

Active Publication Date: 2021-06-22
SOUTH CHINA UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0004] As a new general-purpose model in the field of natural language processing, the Transformer model surpasses traditional neural networks such as LSTM in all respects. The price is a multiplication of model complexity and network parameters, which sharply increases computing power and energy consumption and makes the model difficult to run on edge devices.
Directly applying the existing INT8 offline quantization methods for convolutional neural networks to the Transformer model results in a loss of accuracy.



Examples


Embodiment 1

[0065] As shown in Figure 1, the Transformer-model-based INT8 offline quantization and integer inference method of this embodiment comprises the following steps:

[0066] S1. Convert the L2 norm of the normalization layer in the original Transformer floating-point model to an L1 norm; then train the Transformer floating-point model to obtain the trained floating-point model and its parameters.

[0067] The normalization layer is calculated according to the following formula:

[0068] $y = \alpha \cdot \dfrac{x - \mu}{\frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \mu \rvert} + \beta$

[0069] where x is the input data, μ is the mean of the row containing the input data, α and β are trainable parameters of the floating-point model, and n is the size of the row. Replacing the L2-norm standard deviation with the mean absolute deviation removes the squaring and square-root operations from the datapath.
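A minimal NumPy sketch of this L1-norm normalization may help make the computation concrete (the eps guard against division by zero and the broadcasting shapes are assumptions, not part of the patent text):

```python
import numpy as np

def l1_layer_norm(x, alpha, beta, eps=1e-6):
    """L1-norm layer normalization over the last axis (row of size n)."""
    mu = x.mean(axis=-1, keepdims=True)              # row mean
    # Mean absolute deviation replaces the L2 standard deviation,
    # avoiding the squaring and square root of standard LayerNorm.
    s = np.abs(x - mu).mean(axis=-1, keepdims=True)
    return alpha * (x - mu) / (s + eps) + beta
```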

[0070] S2. Perform forward inference on a small amount of data to obtain the quantization coefficients of the input data for each layer's matrix operations in the floating-point model, and extract them as general floating-point data. ...
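The calibration in S2 can be sketched as follows; the (name, forward_fn) layer interface and the symmetric max-based statistic are illustrative assumptions, since the patent text shown here does not spell out the statistic used:

```python
import numpy as np

def calibrate_input_scales(layers, calib_batches):
    """Run forward inference on a small calibration set and derive a
    symmetric INT8 quantization coefficient per layer input."""
    max_abs = {name: 0.0 for name, _ in layers}
    for batch in calib_batches:
        x = batch
        for name, forward in layers:
            max_abs[name] = max(max_abs[name], float(np.abs(x).max()))
            x = forward(x)   # propagate to the next layer's input
    # Map the observed range onto the symmetric INT8 range [-127, 127].
    return {name: m / 127.0 for name, m in max_abs.items() if m > 0.0}
```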

Embodiment 2

[0114] In the INT8 offline quantization and integer inference method based on the Transformer model of this embodiment, the integer inference method of the self-attention layer in step S43 is as follows: as shown in Figure 2, the INT8 query vector q, key vector k, and value vector v obtained by input quantization are used, together with the quantized weight data, in the linear-layer and attention calculations, and the requantization between matrix operations is completed through shift operations. The resulting integer output is residually connected with the query vector and fed into the L1-norm normalization layer for output.
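Because the quantization coefficients are constrained to powers of two (the 2^(-n) form named in the abstract), the requantization between matrix operations reduces to a shift. A sketch, assuming a round-to-nearest arithmetic right shift with shift ≥ 1:

```python
import numpy as np

def requantize_by_shift(acc_int32, shift):
    """Rescale an INT32 accumulator back to INT8 when the combined
    quantization coefficient is 2**-shift: the floating-point multiply
    collapses to a rounded right shift."""
    rounded = (acc_int32 + (1 << (shift - 1))) >> shift  # round to nearest
    return np.clip(rounded, -128, 127).astype(np.int8)
```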

[0115] The integer inference method of the feed-forward neural network calculation layer is as follows: as shown in Figure 3, the quantized input data and the quantized weight data of the first linear layer are calculated directly in the linear layer, the INT8 result is obtained by shifting, and the ReLU function is calculat...
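A self-contained sketch of that integer feed-forward flow (biases are omitted and the folded shift amounts shift1/shift2 are assumptions):

```python
import numpy as np

def int_ffn(x_int8, w1_int8, shift1, w2_int8, shift2):
    """INT8 FFN: linear -> shift to INT8 -> ReLU -> linear -> shift."""
    def shift_to_int8(acc, s):
        y = (acc + (1 << (s - 1))) >> s          # rounded right shift
        return np.clip(y, -128, 127).astype(np.int8)

    acc1 = x_int8.astype(np.int32) @ w1_int8.astype(np.int32)
    h = np.maximum(shift_to_int8(acc1, shift1), 0)   # ReLU stays integer
    acc2 = h.astype(np.int32) @ w2_int8.astype(np.int32)
    return shift_to_int8(acc2, shift2)
```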



Abstract

The invention provides an INT8 offline quantization and integer inference method based on a Transformer model, comprising the following steps: converting the L2 norm of the normalization layer in the original Transformer floating-point model into an L1 norm; training the model; performing forward inference on a small amount of data to obtain the quantization coefficient of the input data of each layer's matrix operations and extracting it as general floating-point data; obtaining the weight quantization coefficient of each linear layer in the floating-point model, extracting it as general floating-point data, and determining the optimal weight quantization coefficient of each layer by a mean-square-error calculation; converting the quantization coefficients involved in quantization operations during inference into 2^(-n) floating-point form and adjusting them through a joint coefficient adjustment method; and obtaining an INT8 integer inference model from the adjusted quantization coefficients combined with the L1-norm normalization layer. The invention reduces the hardware resources required for model calculation and the errors caused by model quantization, lowers hardware resource consumption, and increases the inference speed of the model.
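The 2^(-n) constraint mentioned in the abstract can be illustrated by rounding a floating-point scale to its nearest power of two (the nearest-in-log rounding rule is an assumption; the patent's joint coefficient adjustment method is not shown):

```python
import math

def to_power_of_two(scale):
    """Round a quantization coefficient to the nearest 2**-n so that
    rescaling during integer inference becomes a right shift by n."""
    n = round(-math.log2(scale))
    return n, 2.0 ** -n

# Example: to_power_of_two(0.011) -> (7, 0.0078125)
```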

Description

Technical field

[0001] The present invention relates to the technical field of natural language processing, and more specifically to an INT8 (8-bit integer) offline quantization method and integer inference method for a Transformer-based natural language processing neural network model.

Background technique

[0002] With the emergence of deep learning algorithms, artificial intelligence has entered its third boom, and the growth in the number of parameters and the computational complexity of deep learning algorithms places higher performance demands on hardware. Designing dedicated hardware accelerators for deep learning is an effective solution to this need. Reducing the latency and storage of deep neural network calculations is an important research direction for implementing neural network algorithms and designing neural network accelerators.

[0003] Model quantization is an ideal technical approach to solve th...


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06N3/063; G06N3/04; G06F7/483
CPC: G06N3/063; G06F7/483; G06N3/047; G06N3/048; G06N3/045
Inventors: 姜小波, 邓晗珂, 何昆, 方忠洪
Owner: SOUTH CHINA UNIV OF TECH