📕 CS224n Lecture 4 Backpropagation

A post summarizing the fourth lecture of CS224n!! This lecture covers material I had already seen plenty of times in other courses, so I figured there wouldn't be much new and went in without any particular expectations.

Matrix gradients for our simple neural net and some tips

I'm skipping the partial-derivative derivations! They come up all over the place, and personally I don't feel the need to write them up again.

That said, the lecture gave a few tips, listed below.

  • Tip 1: Carefully define your variables and keep track of their dimensionality!
  • Tip 2: Chain rule!
  • Tip 3: For the top softmax part of a model: first consider the derivative wrt f_c when c = y (the correct class), then consider the derivative wrt f_c when c ≠ y (all the incorrect classes)
  • Tip 4: Work out element-wise partial derivatives if you're getting confused by matrix calculus!
  • Tip 5: Use Shape Convention. Note: the error message that arrives at a hidden layer has the same dimensionality as that hidden layer (see the shape-checking sketch after this list)
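
To make Tip 1 and Tip 5 concrete, here is a minimal NumPy sketch (my own illustration, not code from the lecture) of a one-hidden-layer net: every gradient is asserted to have the same shape as the thing it is a gradient for, and the error signal delta arriving at the hidden layer has the hidden layer's shape.

```python
import numpy as np

# Toy dimensions: input x in R^5, hidden h in R^3, scalar score s.
x = np.random.randn(5)
W = np.random.randn(3, 5)   # hidden-layer weights
b = np.random.randn(3)      # hidden-layer bias
u = np.random.randn(3)      # output weights

# Forward pass: h = sigmoid(Wx + b), s = u . h
z = W @ x + b
h = 1.0 / (1.0 + np.exp(-z))
s = u @ h

# Backward pass (starting from ds/ds = 1).
grad_u = h                      # ds/du, same shape as u
delta = u * h * (1 - h)         # error signal at the hidden layer
grad_W = np.outer(delta, x)     # ds/dW, same shape as W
grad_b = delta                  # ds/db, same shape as b
grad_x = W.T @ delta            # ds/dx, same shape as x

# Shape Convention: each gradient matches its parameter,
# and delta matches the hidden layer.
assert grad_u.shape == u.shape
assert delta.shape == h.shape
assert grad_W.shape == W.shape
assert grad_b.shape == b.shape
assert grad_x.shape == x.shape
```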

Anyway, I skipped ahead to the part where word gradients are computed in a window model. For a model that uses a window, the gradient with respect to the input x comes out as the whole window; since the window is just a concatenation of word vectors, we split it back apart and handle each word separately (a sketch of the splitting follows below).
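A minimal sketch of that splitting, assuming a 5-word window and toy dimensions (my own illustration): the gradient with respect to the concatenated window x is chopped back into one piece per word vector.

```python
import numpy as np

d = 4        # word vector dimension (toy value)
window = 5   # window size

# x is the concatenation of 5 word vectors, so x is in R^(5d);
# grad_x here stands in for dJ/dx produced by backprop.
grad_x = np.random.randn(window * d)

# Split the window gradient back into one gradient per word vector.
word_grads = np.split(grad_x, window)
assert all(g.shape == (d,) for g in word_grads)

# Each piece then updates the embedding of the corresponding word.
```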

Updating word gradients in window model

There is a point to be careful about when taking these gradients and updating the word vectors. If you think about it, the classic ML approach is that data points sit fixed in an n-dimensional space and you learn a decision boundary over them. Training word vectors, however, moves the word vectors themselves. When training on a particular batch, words that do not appear in the batch stay put, while words in the batch get moved around.
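A tiny sketch of why that happens (my own illustration): the gradient of the loss with respect to the embedding matrix is zero on every row whose word is absent from the batch, so an SGD step moves only the in-batch rows.

```python
import numpy as np

vocab_size, d = 6, 3
E = np.random.randn(vocab_size, d)   # embedding matrix, one row per word
E_before = E.copy()

batch_word_ids = [1, 4]              # only these words appear in the batch
grad_E = np.zeros_like(E)
for i in batch_word_ids:
    grad_E[i] = np.random.randn(d)   # stand-in for the backprop'd gradient

E -= 0.1 * grad_E                    # one SGD step

moved = ~np.all(np.isclose(E, E_before), axis=1)
print(moved)  # True only for rows 1 and 4: absent words did not move
```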

A reasonably good fix for this is to use pre-trained word vectors. In most cases, nearly all of them, this is said to be a good answer. And if you have a large dataset, fine-tuning the pre-trained vectors is said to help as well. (With a small dataset, though, training them can apparently do more harm than good.)
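For reference, a minimal PyTorch sketch of the two options (my own illustration, not from the lecture): load pre-trained vectors and either freeze them or let them fine-tune.

```python
import torch
import torch.nn as nn

pretrained = torch.randn(10000, 300)  # stand-in for e.g. GloVe vectors

# Small dataset: keep the pre-trained vectors fixed.
emb_frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Large dataset: allow the vectors to fine-tune during training.
emb_finetune = nn.Embedding.from_pretrained(pretrained, freeze=False)
```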

Computation graphs and backpropagation

Next comes the part that explains backprop with computation graphs, which I'm skipping.

Stuff you should know

It then goes over various things that are good to know, giving the list below.

  • Regularization: a technique for preventing overfitting
  • Vectorization: the pythonic way of doing things can be quite... slow for ML (see the timing sketch after this list)
  • non-linearity: the lecture covered activation functions; sigmoid and tanh are now used only in special situations, and the advice is to just try ReLU first.
  • parameter initialization: the question is how to initialize the weights at the start; don't use 0 (backprop has to flow), and something like Xavier reportedly works well (see the initialization sketch after this list)
  • optimization: lots of optimizers came up, like SGD, Adagrad, RMSProp, Adam, and SparseAdam; SGD apparently works well in ordinary situations.
  • Learning Rate: it's good to pick an appropriate lr; there are also neat methods like cyclic learning rates, so let's choose well.
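
To show the vectorization point from the list above, a quick timing sketch in the spirit of the lecture's demo (toy sizes, my own illustration): applying a weight matrix to many vectors one at a time in a Python loop versus as a single matrix multiplication.

```python
import numpy as np
import time

W = np.random.randn(300, 300)
X = np.random.randn(300, 10000)  # 10,000 column vectors

# Loop version: one matrix-vector product per column.
t0 = time.time()
out_loop = np.stack([W @ X[:, i] for i in range(X.shape[1])], axis=1)
t_loop = time.time() - t0

# Vectorized version: a single matrix-matrix product.
t0 = time.time()
out_mat = W @ X
t_mat = time.time() - t0

assert np.allclose(out_loop, out_mat)
print(f"loop: {t_loop:.3f}s, vectorized: {t_mat:.3f}s")
```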
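And for the initialization point, a minimal sketch of Xavier (Glorot) initialization, assuming the common uniform form with Var(W) = 2/(n_in + n_out); the comment notes why all-zero initialization fails.

```python
import numpy as np

def xavier_init(n_in, n_out):
    # Xavier/Glorot uniform: Var(W_ij) = 2 / (n_in + n_out).
    bound = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-bound, bound, size=(n_out, n_in))

W = xavier_init(300, 100)

# All-zero init would give every hidden unit the exact same gradient,
# so backprop could never break the symmetry -- hence "don't use 0".
```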
Written on April 13, 2019
Tags: cs224n machine learning nlp