📃 Are Sixteen Heads Really Better than One? Review

Multi-head attention is expressive and can capture a lot of information, but not every head is actually needed. The paper on this is Are Sixteen Heads Really Better Than One? (Michel et al., 2019); arxiv link: https://arxiv.org/abs/1905.10650.

Abstract

  • Even if a model is trained with multi-head attention, many heads can be removed at test time while preserving roughly the same performance.
  • In particular, some layers show no performance drop even when reduced to a single head.

1. Introduction

  • Proposes a greedy, iterative attention-head pruning method.
  • Achieves up to a 17.5% increase in inference speed.
  • MT was especially sensitive to pruning, which the paper examines in detail.

2. Background: Attention, Multi-headed Attention, and Masking

  • Mostly skipped (standard background).
  • Masking multi-head attention is formulated with mask variables.
  • Ablating a head amounts to setting that head's output to 0 (a sketch follows below).
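
A minimal sketch of the head masking described above (not the authors' code): each head's output is multiplied by a mask variable ξ_h, so setting ξ_h = 0 ablates that head. The function name, weight layout, and shapes are illustrative assumptions.

```python
import torch

def masked_multi_head_attention(q, k, v, w_q, w_k, w_v, w_o, n_heads, head_mask):
    """q, k, v: (batch, seq, d_model); w_*: (d_model, d_model); head_mask: (n_heads,)."""
    batch, seq, d_model = q.shape
    d_head = d_model // n_heads

    def split(x, w):
        # project, then reshape to (batch, n_heads, seq, d_head)
        return (x @ w).view(batch, -1, n_heads, d_head).transpose(1, 2)

    qh, kh, vh = split(q, w_q), split(k, w_k), split(v, w_v)
    attn = torch.softmax(qh @ kh.transpose(-2, -1) / d_head ** 0.5, dim=-1)
    ctx = attn @ vh                                    # (batch, n_heads, seq, d_head)

    # mask variable xi_h scales each head's output; xi_h = 0 ablates that head
    ctx = ctx * head_mask.view(1, n_heads, 1, 1)

    ctx = ctx.transpose(1, 2).reshape(batch, seq, d_model)
    return ctx @ w_o

# toy usage: ablate head 3 out of 8
d_model, n_heads = 512, 8
w = [torch.randn(d_model, d_model) / d_model ** 0.5 for _ in range(4)]
x = torch.randn(2, 10, d_model)
mask = torch.ones(n_heads); mask[3] = 0.0
out = masked_multi_head_attention(x, x, x, *w, n_heads, mask)
```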

3. Are All Attention Heads Important?

  • Tested on WMT.

3.2. Ablating One Head

  • Experiment that ablates exactly one head at a time and measures the change in performance (a sketch follows after this list).
  • at test time, most heads are redundant given the rest of the model.
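
A minimal sketch of that one-head ablation sweep. `evaluate(model)` and `model.set_head_mask(layer, head, value)` are hypothetical stand-ins for whatever evaluation loop and masking hook are available; neither name comes from the paper.

```python
def ablate_one_head(model, evaluate, n_layers, n_heads):
    """Zero out one head at a time and record the change in the metric."""
    baseline = evaluate(model)
    deltas = {}
    for layer in range(n_layers):
        for head in range(n_heads):
            model.set_head_mask(layer, head, 0.0)    # ablate exactly this head
            deltas[(layer, head)] = evaluate(model) - baseline
            model.set_head_mask(layer, head, 1.0)    # restore it
    return deltas  # the paper's finding: most deltas stay close to zero
```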

3.3. Ablating All Heads but One

  • So is more than one head ever actually needed?
  • For most layers, even though they were trained with 12 or 16 heads, a single head is enough at test time.
  • But NMT is much more sensitive.
    • On WMT, reducing the last layer of the encoder-decoder attention to a single head costs more than 13.5 BLEU points.
    • Does "last layer" here mean the decoder's last layer..?

3.4. Are Important Heads the Same Across Datasets?

  • Are the heads that matter for one task also important on other tasks/datasets?
  • To some extent, yes; the same heads tend to be important.

4. Iterative Pruning of Attention Heads

  • Prune heads iteratively, a moderate fraction at a time.

4.1. Head Importance Score for Pruning

  • Head importance is estimated from the gradient of the loss with respect to the head mask variables.
  • This is essentially the Taylor expansion method of Molchanov et al., 2017.
  • Following Molchanov et al., 2017, the importance scores are normalized with the l2 norm per layer (sketch below).
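
A minimal sketch of that importance estimator, assuming the mask tensor `xi` is multiplied into each head's output (as in the Section 2 sketch) and that `loss_fn(batch)` runs a forward pass that uses it. Importance is the absolute gradient of the loss with respect to each mask, averaged over data and then l2-normalized within each layer.

```python
import torch

def head_importance(xi, batches, loss_fn):
    """xi: (n_layers, n_heads) mask tensor with requires_grad=True."""
    importance = torch.zeros_like(xi)
    for batch in batches:
        loss = loss_fn(batch)                     # forward pass that uses xi internally
        (grad,) = torch.autograd.grad(loss, xi)
        importance += grad.abs()                  # |dL/dxi_h|, accumulated over data
    importance /= max(len(batches), 1)
    # l2-normalize within each layer, following Molchanov et al. (2017)
    importance = importance / (importance.norm(p=2, dim=-1, keepdim=True) + 1e-9)
    return importance
```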

4.2. Effect of Pruning on BLEU/Accuracy

  • Roughly 20%~40% of heads can be pruned without a significant drop in BLEU/accuracy (a sketch of the prune-and-evaluate sweep follows this list).
  • More details in the Appendix.
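
A minimal sketch of the prune-then-measure sweep behind these numbers, reusing `head_importance` from the 4.1 sketch. The per-step fraction and the hypothetical `evaluate()` metric callable are assumptions, not the paper's exact schedule.

```python
import torch

def pruning_sweep(xi, batches, loss_fn, evaluate, step_frac=0.1):
    """Repeatedly score the remaining heads, zero out the lowest-scoring
    fraction, and record the metric at each pruning level."""
    results = []
    step = max(1, int(step_frac * xi.numel()))
    while int(xi.sum().item()) > 0:
        scores = head_importance(xi, batches, loss_fn).flatten()
        scores[xi.flatten() == 0] = float("inf")   # ignore already-pruned heads
        to_prune = scores.argsort()[:step]
        with torch.no_grad():
            xi.view(-1)[to_prune] = 0.0            # ablate the least important heads
        results.append((int(xi.sum().item()), evaluate()))
    return results  # (remaining heads, metric) at each pruning level
```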

4.3. Effect of Pruning on Efficiency

  • How much faster does it actually get? Benchmarked on two machines with 1080 Ti GPUs.
  • Personally, I suspect the bigger win from pruning here is the reduced memory footprint.
    • The computation still has to run, and what shrinks is mostly the size of the tensors involved, so the speedup does not seem that dramatic.

5. When Are More Heads Important? The Case of Machine Translation

  • Conclusion:
    • In other words, encoder-decoder attention is much more dependent on multi-headedness than self-attention.

  • So self-attention really is the more redundant part?

6. Dynamics of Head Importance during Training

  • This section asks how head importance behaves during training, rather than only pruning a fully trained model.
  • At the end of each epoch, performance is measured at each pruning level.
  • In early epochs performance drops off very quickly under pruning, but as training progresses only the important heads stay important and the rest stop mattering.
  • Skipped the details.
  • Worth revisiting later, though.
May 18, 2020
Tags: paper