📕 CS224n Lecture 14 Transformers and Self-Attention For Generative Models

Lecture 14 is given by invited speakers, reportedly two people from Google AI. I am taking this course to study NLP and am already a bit stretched with everything else, so I skipped much of the material that strays from NLP, such as image and audio processing.

Previous works

Learning representations of variable-length data is very important in NLP. There are several options for this. RNNs are a natural choice for learning variable-length representations, and well-proven models such as LSTM and GRU exist, but they require sequential computation, which makes them hard to parallelize, and they struggle to model both long- and short-range dependencies. CNNs are easy to parallelize, but learning long-range dependencies becomes very hard with them: it only works if you stack a large number of layers. Attention had performed well between the encoder and decoder in NMT, which led to the idea of using it for the representations themselves, and this reportedly worked well too. The model discussed most is self-attention.

Self Attention

Self-attention is said to give a constant path length between any two positions, whether the dependency is short or long. It is also a model built on gating/multiplicative interactions (such as matmul). This naturally raises the question "Can this model replace sequential computation?", and the answer to that question is the Transformer.
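To make the "constant path length" point concrete for myself, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is not the lecture's code: the function names and the toy embeddings are made up, and I assume identity query/key/value projections, whereas a real layer would learn those matrices.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    x is a list of token vectors. Assumption: no learned projections,
    so queries, keys, and values are all the raw inputs.
    """
    dim = len(x[0])
    outputs, all_weights = [], []
    for query in x:
        # One dot product per (query, key) pair: every position reaches
        # every other position in a single step, near or far.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in x]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        out = [sum(w * value[d] for w, value in zip(weights, x))
               for d in range(dim)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy embeddings, made up
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and the score between the first and last token is computed the same way as between neighbours, which is how I read the path-length claim.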

Additional references worth looking at are listed below.

  • Classification & regression with self-attention: Parikh et al. (2016), Lin et al. (2016)
  • Self-attention with RNNs: Long et al. (2016), Shao, Gows et al. (2017)
  • Recurrent attention: Sukhbaatar et al. (2015)



The speakers begin with an overall review of parts already explained in earlier lectures, such as residual connections and the self-attention layer. Part of the attention in the lower-right layer of the figure is not visible because that layer is a masked multi-head attention layer. (See the Attention Is All You Need paper.)
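A rough sketch of what the masking does, as I understand it: scores for future positions are set to negative infinity before the softmax, so those positions receive zero weight. The function name and the uniform toy scores are my own assumptions for illustration.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask to a square matrix of attention scores.

    scores[i][j] is the raw score of query position i against key
    position j. Position i may only attend to positions j <= i.
    """
    masked_weights = []
    for i, row in enumerate(scores):
        # Future positions get -inf, so exp(-inf) == 0.0 in the softmax.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        peak = max(masked)  # position 0 is always visible, so peak is finite
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        masked_weights.append([e / total for e in exps])
    return masked_weights

scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # uniform toy scores, made up
weights = causal_attention_weights(scores)
for row in weights:
    print(row)
```

The upper-right triangle of `weights` is all zeros, which matches the part of the attention that does not show up for the decoder layer.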

Attention is Cheap

The computational complexity of self-attention is \(O(\text{length}^2 \cdot \text{dim})\), whereas that of an RNN is \(O(\text{length} \cdot \text{dim}^2)\). Self-attention is therefore much cheaper when length is smaller than dim. In the LSTM case shown in the lecture, self-attention has 4 times lower complexity even when length and dim are equal.
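To get a feel for the two formulas, here is a small leading-order cost comparison. The sizes are made up, and `gate_matrices=4` is my own guess at where the factor of 4 for the LSTM could come from (its four gate transformations); the notes above do not state the reason.

```python
def self_attention_cost(length, dim):
    """Leading-order cost of one self-attention layer: length^2 * dim."""
    return length ** 2 * dim

def recurrent_cost(length, dim, gate_matrices=1):
    """Leading-order cost of one recurrent layer: length * dim^2 per gate matrix."""
    return gate_matrices * length * dim ** 2

# Hypothetical sizes: a short sentence, length == dim, and a long document.
for length, dim in [(50, 1000), (1000, 1000), (5000, 1000)]:
    attn = self_attention_cost(length, dim)
    rnn = recurrent_cost(length, dim)
    lstm = recurrent_cost(length, dim, gate_matrices=4)  # assumption: 4 gates
    print(f"length={length} dim={dim} "
          f"attention/rnn={attn / rnn:.2f} attention/lstm={attn / lstm:.2f}")
```

The ratio is simply length / dim, so attention wins for sentence-sized inputs and loses once the sequence gets much longer than the hidden size.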

Convolution vs Attention vs Multihead Attention

Attention has a drawback, though. Take the sentence "I kicked the ball": a convolution applies different filter weights to each word and so picks out the values it needs. Attention, in contrast, averages over the words, which makes it hard to pull out specific pieces of information. Multi-head attention was introduced to address this, making it possible to pick out just the information that is needed.
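Here is a hedged sketch of the multi-head idea on the same sentence. Everything in it is an assumption for illustration: the 4-dimensional embeddings are invented, and instead of learned per-head projections I simply slice each vector into equal parts, one slice per head.

```python
import math

def attend(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, weights)."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        outputs.append([sum(w * v[d] for w, v in zip(row, values))
                        for d in range(len(values[0]))])
    return outputs, weights

def multi_head_self_attention(x, num_heads):
    """Split each vector into num_heads slices and attend within each slice."""
    head_dim = len(x[0]) // num_heads
    head_outputs, head_weights = [], []
    for h in range(num_heads):
        part = [vec[h * head_dim:(h + 1) * head_dim] for vec in x]
        out, w = attend(part, part, part)
        head_outputs.append(out)
        head_weights.append(w)
    # Concatenate the heads back together, position by position.
    merged = [[value for head in head_outputs for value in head[i]]
              for i in range(len(x))]
    return merged, head_weights

# Made-up 4-dimensional embeddings for "I kicked the ball".
sentence = [[2.0, 0.0, 0.0, 1.0],
            [0.0, 2.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 0.0],
            [1.0, 1.0, 1.0, 1.0]]
merged, head_weights = multi_head_self_attention(sentence, num_heads=2)
print(head_weights[0][0])  # how "I" attends in head 0
print(head_weights[1][0])  # how "I" attends in head 1
```

The two printed rows differ: each head forms its own weighted average, so one head can focus on one relation while another head looks elsewhere, instead of a single average blurring everything together.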

Multihead Attention


Because it performs so well and has set many SOTA results, many recent models are based on the Transformer. For frameworks, look up tensor2tensor [1] and Sockeye [2].

Importance of Residual Connections

Residual connections are said to carry positional information up to higher layers together with the other information.
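A toy illustration of why that might be, under strong assumptions of my own: the sublayer here is an extreme stand-in that replaces every position with the mean over all positions (like attention with perfectly uniform weights), and the positional signal is invented. It is not what the lecture showed.

```python
def uniform_mixing(x):
    """Stand-in sublayer: replaces every position with the mean over positions."""
    n = len(x)
    dim = len(x[0])
    mean = [sum(vec[d] for vec in x) / n for d in range(dim)]
    return [list(mean) for _ in x]

def with_residual(sublayer, x):
    """Add the sublayer input back onto its output: x + sublayer(x)."""
    y = sublayer(x)
    return [[a + b for a, b in zip(xv, yv)] for xv, yv in zip(x, y)]

# The same token embedding at every position plus a made-up positional signal.
token = [1.0, 1.0]
positions = [[0.0, 0.1 * i] for i in range(3)]
x = [[t + p for t, p in zip(token, pos)] for pos in positions]

plain = uniform_mixing(x)
residual = with_residual(uniform_mixing, x)
print(plain)     # every row is identical: position is lost
print(residual)  # rows still differ: position survives via the skip path
```

Without the skip path the three positions become indistinguishable after mixing, while with it the positional component is passed upward unchanged alongside the mixed information.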

With Residuals


It also seems worth looking up keywords such as transfer learning, which can be applied wherever attention is used. (I don't understand this well yet.) The lecture mostly moved through a series of introductions, and very quickly. There was a lot of interesting material, but I don't think I caught it properly, so it would be worth rewatching just this lecture once the CS224n study group is over.

  1. GitHub tensorflow/tensor2tensor: tensor2tensor repository

  2. GitHub awslabs/sockeye: sockeye repository

June 9, 2019
Tags: cs224n