Q8BERT: Quantized 8Bit BERT 리뷰

intel에서 나온 NeurIPS 2019에 발표된 Q8BERT 논문이다. arxiv 링크는 https://arxiv.org/pdf/1910.06188.pdf이다. BERT를 fine tuning phase때 quantization aware training을 적용하여 4배 압축하고, intel CPU의 8bit 연산을 사용해 연산을 가속했다.

1. Introduction

quantization-aware training을 fine-tuning process에 적용
GEMM 연산과 FC, Embedding Layer 연산을 quantize함
8bit quantized inference 해도 99% accuracy를 유지함

2. Method

linear quantization 사용
Intel Xeon Cascade Lake의 VNNI를 사용할 경우 FP32 연산에 비해 3.7배 빨라짐 (Bhandare et al., 2019)

2.1. Quantization Scheme

symmetric linear quantization 사용함
\[\text{Quantize}(x \vert S^x, M) = \text{Clamp}(x \times S^x, -M, M)\] \[\text{Clamp}(x, a, b) = \min(\max(x, a), b)\]
\(S^x\)는 \(x\)에 대한 quantization scaling factor이고, \(M\)은 highest quantized value이다.
8bit이므로 \(M\)은 127
scaling factor 계산법
- weights: \(S^W = \frac M {\max(\vert W \vert)}\)
- activations: \(S^x = \frac M {\text{EMA}(\max(\vert x \vert))}\)
  - EMA: Exponential Moving Average

2.2. Quantized-Aware Training

Quantization-aware training은 inference stage에서 quantize할 수 있게 training하는 방법 중 하나
post training에 반대되는 방식
fake quantization을 도입
- quantization error를 보여주고 quantization error gap을 줄이기 위해
fake quantization은 FP32값들을 rounding하는 효과가 있다. (Jacob et al.)
- outlier들을 줄여서 quantization이 잘 되도록 함
rounding하는 operation이 미분가능하지 않으니 Straight-Through Estimator (STE)를 사용함 (Bengio et al., 2013)

3. Implementation

training phase
- FC는 fake quantized input과 fake quantized weight 사이에 GEMM 연산을 하고 quantize안한 bias를 더한다.
inference phase
- embedding layer는 int8 embedding vector를 반환하고, quantized FC는 int8 input, weight를 연산하고 int32 bias를 더한다. 그 후 activation을 연산한다.
pytorch transformers 이용했다고 한다. (현재 huggingface/transformers)
Higher precision이 중요한 Softmax, Layer Normalization, GELU는 FP32로 사용함

4. Evaluation

Dynamic Quantization과 비교했을 때 Quantization Aware Training이 훨씬 Accuracy Reduction이 적은 것을 볼 수 있다.

읽어보장

6. Conclusions and Future Work

BERT 압축을 위해 다른 model compression methods 적용
quantized BERT 모델에 다른 compresssion 적용해보기

__

얼마나 빠른지는 적혀있지 않다.
Softmax, LayerNorm, GELU도 int8, int32로 진행한 결과 있었으면 좋았을텐데
distillation과 같이 적용가능할까?
quantization이 잘되는 이유가 fp32가 redundant해서 그런걸까 아니면 BERT가 reduandant해서 그런걸까

April 14, 2020

Tags: paper