📃 ZeRO: Memory Optimization Towards Training A Trillion Parameter Models (Review)

May 1, 2020

This paper drew a lot of attention for outperforming Megatron-LM as a framework for training very large models. The arXiv link is https://arxiv.org/abs/1910.02054, and a PyTorch implementation is available at GitHub - microsoft/DeepSpeed.
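
The core idea, reproduced here as a back-of-the-envelope sketch rather than anything from DeepSpeed itself, is that mixed-precision Adam keeps roughly 16 bytes of state per parameter, and ZeRO partitions the optimizer states, gradients, and parameters across the data-parallel ranks instead of replicating them on every GPU:

```python
# Rough per-GPU memory accounting behind ZeRO-DP (illustrative only).
# Mixed-precision Adam: 2 bytes fp16 params + 2 bytes fp16 grads
# + 12 bytes fp32 optimizer states (master params, momentum, variance).
def per_gpu_gb(num_params, num_gpus, shard_states=False,
               shard_grads=False, shard_params=False):
    params, grads, states = 2.0, 2.0, 12.0   # bytes per parameter
    if shard_states:  states /= num_gpus     # stage P_os
    if shard_grads:   grads  /= num_gpus     # stage P_os+g
    if shard_params:  params /= num_gpus     # stage P_os+g+p
    return num_params * (params + grads + states) / 1e9

psi, n = 7.5e9, 64  # the 7.5B-parameter, 64-GPU example from the paper
print(per_gpu_gb(psi, n))                    # baseline data parallel: 120.0 GB
print(per_gpu_gb(psi, n, True))              # ~31.4 GB
print(per_gpu_gb(psi, n, True, True))        # ~16.6 GB
print(per_gpu_gb(psi, n, True, True, True))  # ~1.9 GB
```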

Tags: paper
Read More

📃 TinyBERT: Distilling BERT For Natural Language Understanding (Review)

May 1, 2020

TinyBERT is a paper currently under review, from Huawei's Noah's Ark Lab. The code is at GitHub - huawei-noah/Pretrained-Language-Model/TinyBERT, and the arXiv link is https://arxiv.org/abs/1909.10351.

Tags: paper
Read More

📃 Layer Normalization (Review)

May 1, 2020

Layer Normalization์€ BERT์— ์“ฐ์ด๋Š” ๊ฒƒ ๋•Œ๋ฌธ์— ์ฐพ์•„๋ณด๊ฒŒ ๋œ ๋…ผ๋ฌธ์ด๋‹ค. arxiv ๋งํฌ๋Š” https://arxiv.org/abs/1607.06450์ด๋‹ค. training์‹œ๊ฐ„์„ ์ค„์ด๋Š” ๊ฒƒ์ด ํฐ ๊ธฐ์—ฌ์ธ๋ฐ, ์ด๋ฆ„์—์„œ ์•Œ ์ˆ˜ ์žˆ๋“ฏ์ด neuron์˜ activity๋ฅผ normalizeํ•˜๋Š” ๊ฒƒ์ด๋‹ค. Batch Normalization๋„ ๋น„์Šทํ•œ ์—ญํ• ์„...

Tags: paper
Read More

📃 Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model (Review)

April 27, 2020

TensorFlow ์ƒ์—์„œ FP32๋ฅผ INT8๋กœ quantization์„ ํ•ด๋ณด๋Š” ๋…ผ๋ฌธ์ด๋‹ค. 1.5๋ฐฐ์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์–ป์œผ๋ฉด์„œ 0.5 BLEU score accuracy๋งŒ ๋–จ์–ด์กŒ๋‹ค๊ณ  ํ•œ๋‹ค. ๋˜ํ•œ intel cpu์— ์ตœ์ ํ™”๋ฅผ ์ง„ํ–‰ํ–ˆ๋‹ค. arxiv ๋งํฌ๋Š” https://arxiv.org/abs/1906.00532์ด๊ณ , intel์—์„œ ๋‚˜์˜จ ๋…ผ๋ฌธ์ด๋‹ค.

Tags: paper
Read More

📃 Patient Knowledge Distillation for BERT Model Compression (Review)

April 16, 2020

A model compression paper from Microsoft, accepted at EMNLP 2019, built on PKD (Patient Knowledge Distillation). The arXiv link is https://arxiv.org/abs/1908.09355, and the code is at GitHub - intersun/PKD-for-BERT-Model-Compression.
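
The "patient" part of PKD is that the student matches not only the teacher's softened logits but also the normalized [CLS] hidden states of selected intermediate teacher layers (PKD-Skip takes every k-th layer, PKD-Last the last ones). A hedged PyTorch-style sketch of that combined loss; the weights and names here are mine, not the paper's code:

```python
import torch
import torch.nn.functional as F

def pkd_loss(s_logits, t_logits, s_cls, t_cls, labels,
             T=4.0, alpha=0.5, beta=10.0):  # illustrative hyperparameters
    ce = F.cross_entropy(s_logits, labels)  # hard-label cross entropy
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * T * T   # match softened logits
    # "Patient" term: MSE between normalized intermediate [CLS] vectors,
    # one pair per supervised student layer.
    patience = sum(F.mse_loss(F.normalize(s, dim=-1), F.normalize(t, dim=-1))
                   for s, t in zip(s_cls, t_cls))
    return (1 - alpha) * ce + alpha * kd + beta * patience

s_cls = [torch.randn(8, 768) for _ in range(4)]   # dummy student [CLS] states
t_cls = [torch.randn(8, 768) for _ in range(4)]   # dummy teacher [CLS] states
print(pkd_loss(torch.randn(8, 2), torch.randn(8, 2),
               s_cls, t_cls, torch.randint(0, 2, (8,))))
```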

Tags: paper
Read More

📃 Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding (Review)

April 16, 2020

์ด ๋…ผ๋ฌธ์ด ๋‚˜์˜ค๊ธฐ ์–ผ๋งˆ ์ „์— ๋งˆ์ดํฌ๋กœ ์†Œํ”„ํŠธ์—์„œ ๋‚˜์˜จ MT-DNN (Liu et al., 2019)์— Knowledge Distillation์„ ์ ์šฉํ•œ ๋…ผ๋ฌธ์ด๋‹ค. arvix๋งํฌ๋Š” https://arxiv.org/abs/1904.09482์ด๊ณ  ์ฝ”๋“œ๋Š” GitHub - namisan/mt-dnn์—์„œ ํ™•์ธ ๊ฐ€๋Šฅํ•˜๋‹ค. ํŠน์ดํ•˜๊ฒŒ ๋‹ค๋ฅธ...

Tags: paper
Read More

📃 Q8BERT: Quantized 8Bit BERT (Review)

April 14, 2020

intel์—์„œ ๋‚˜์˜จ NeurIPS 2019์— ๋ฐœํ‘œ๋œ Q8BERT ๋…ผ๋ฌธ์ด๋‹ค. arxiv ๋งํฌ๋Š” https://arxiv.org/pdf/1910.06188.pdf์ด๋‹ค. BERT๋ฅผ fine tuning phase๋•Œ quantization aware training์„ ์ ์šฉํ•˜์—ฌ 4๋ฐฐ ์••์ถ•ํ•˜๊ณ , intel CPU์˜ 8bit ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•ด ์—ฐ์‚ฐ์„ ๊ฐ€์†ํ–ˆ๋‹ค.

Tags: paper
Read More

📃 FastBERT: a Self-distilling BERT with Adaptive Inference Time (Review)

April 14, 2020

This paper, too, starts from the observation that BERT is too large a model to serve, and tries applying self-distillation during fine-tuning. The work was funded by the 2019 Tencent Rhino-Bird Elite Training Program. The arXiv link is https://arxiv.org/abs/2004.02178.
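
The "adaptive inference time" in the title comes from attaching a small student classifier to every layer: at inference, each layer's classifier makes a prediction, and if it is confident enough (the paper gauges confidence with the entropy of the predicted distribution against a tunable "speed" threshold), the remaining layers are skipped. A toy sketch of that control flow with hypothetical stand-in modules; real FastBERT exits per sample, not per batch:

```python
import torch
import torch.nn.functional as F

def entropy(p):
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

def adaptive_forward(layers, heads, x, speed=0.5):
    # layers/heads are hypothetical stand-ins for the backbone blocks and
    # the per-layer student classifiers; `speed` is the exit threshold.
    for layer, head in zip(layers, heads):
        x = layer(x)
        probs = F.softmax(head(x[:, 0]), dim=-1)   # classify on [CLS]
        if entropy(probs).mean() < speed:          # confident: exit early
            return probs
    return probs                                   # reached the last layer

layers = [torch.nn.Linear(16, 16) for _ in range(4)]  # toy backbone blocks
heads = [torch.nn.Linear(16, 2) for _ in range(4)]
print(adaptive_forward(layers, heads, torch.randn(8, 5, 16)))
```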

Tags: paper
Read More

📃 DynaBERT: Dynamic BERT with Adaptive Width and Depth (Review)

April 13, 2020

์ด ๋…ผ๋ฌธ์—์„œ๋Š” BERT, RoBERTa๊ฐ€ ๋งค์šฐ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด์ง€๋งŒ, memory, computing power๊ฐ€ ๋„ˆ๋ฌด ๋งŽ์ด ํ•„์š”ํ•˜๋ฏ€๋กœ ๊ทธ๋ฅผ ์••์ถ•ํ•ด๋ณด๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ์•„์ง WIP์ธ ๋…ผ๋ฌธ์ด๊ณ , https://arxiv.org/abs/2004.04037๊ฐ€ ๋งํฌ์ด๋‹ค. ํ™”์›จ์ด์—์„œ ๋‚˜์˜จ ๋…ผ๋ฌธ์ด๋‹ค.

Tags: paper
Read More