📃 Review: Distilling the Knowledge in a Neural Network

๊ตฌ๊ธ€์—์„œ Geoffrey Hinton, Oriol Vinyals, Jeff Dean์ด ์ž‘์„ฑํ•œ Distillation ๊ฐœ๋…์„ ์ œ์•ˆํ•œ ๋…ผ๋ฌธ์ด๋‹ค. arvix ๋งํฌ๋Š” https://arxiv.org/abs/1503.02531์ด๊ณ , NIPS 2014 ์›Œํฌ์ƒต์— ๋‚˜์˜จ ๋…ผ๋ฌธ์ด๋‹ค.

Abstract

  • ๋ชจ๋ธ์„ ensembleํ•˜๋Š” ๊ฒƒ์ด ์ข‹์€ ์„ฑ๋Šฅ์„ ์–ป๋Š” ๊ฐ„๋‹จํ•œ ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜์ง€๋งŒ ๋„ˆ๋ฌด ์—ฐ์‚ฐ์ด ๋น„์‹ธ๊ณ  ๋ฐฐํฌํ•˜๊ธฐ ํž˜๋“ค๋‹ค.
  • ๊ทธ๋ž˜์„œ ํ•ด๋‹น ์ •๋ณด๋ฅผ ์••์ถ•ํ•˜์—ฌ ๊ฐ„๋‹จํ•œ ๋‰ด๋Ÿด ๋„ท์— ์˜ฎ๊ฒจ์ฃผ๋Š” ๊ฒƒ์ด ํšจ๊ณผ์ ์ธ ๋ฐฉ๋ฒ•์ด ๋  ์ˆ˜ ์žˆ๋‹ค.

1 Introduction

  • ํฐ ๋ชจ๋ธ (๋…ผ๋ฌธ์—์„œ๋Š” cumbersome model์ด๋ผ ๋งํ•œ๋‹ค)์˜ knowledge๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ transferํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ํฐ ๋ชจ๋ธ์—์„œ ๋‚˜์˜จ class probabilities๋ฅผ ๋ฐ”๋กœ small model์˜ target (soft target) ์ด์šฉํ•˜๋Š” ๊ฒƒ์ด๋‹ค.
  • ์ด๊ฒƒ์ด ์™œ ํšจ๊ณผ์ ์ธ์ง€๋Š” ์•„๋ž˜ ์„ค๋ช…์„ ๋ณด์ž

    MNIST์šฉ์œผ๋กœ ํ•™์Šต๋œ ํฐ ๋ชจ๋ธ์€ ๊ต‰์žฅํžˆ ๋†’์€ ์ •ํ™•๋„๋กœ ์ˆซ์ž๋“ค์„ ๋งž์ถœํ…Œ์ง€๋งŒ, ์–ด๋Š์ •๋„ ๋‹ค๋ฅธ ํด๋ž˜์Šค์—๋„ prob์„ ์ค€๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด 2๋ฅผ ๋งž์ถœ ๋•Œ ๋‹ฎ์€ ์ˆซ์ž์ธ 3๊ณผ 7๋„ ๋‚ฎ์€ ํ™•๋ฅ ์ด์ง€๋งŒ ๊ฐ’์„ ๋ถ€์—ฌํ•  ๊ฒƒ์ด๋‹ค. ์ด ์ •๋ณด๋“ค์€ ๊ต‰์žฅํžˆ ์ค‘์š”ํ•œ ์ •๋ณด์ธ๋ฐ, data์˜ structure์— ๋Œ€ํ•œ ์ •๋ณด๊ฐ€ ๋“ค์–ด์žˆ๋Š” ๊ฐ’์ด๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

    • ํ•˜์ง€๋งŒ ๊ทธ ๊ฐ’๋„ ๊ต‰์žฅํžˆ ๋‚ฎ์€ ๊ฐ’์ด๋ผ, temperature ๊ฐœ๋…์„ ๋„์ž…ํ–ˆ๋‹ค. (๋‹ค๋ฅธ ๊ฐ’์ด 0์— ๊ฐ€๊นŒ์šฐ๋ฉด hard target์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ๊ณผ ๋‹ค๋ฅผ ๊ฒƒ์ด ์—†๋‹ค.)

2 Distillation

  • ๋ณดํ†ต์˜ softmax ์‹๊ณผ๋Š” ๋‹ค๋ฅด๊ฒŒ temperature ๊ฐœ๋…์„ ๋„์ž…ํ•œ๋‹ค. logit ์— ๋Œ€ํ•ด prob ๋Š” ์•„๋ž˜ ์‹์ด ๋œ๋‹ค.

    • T๋Š” Temperature์ด๊ณ , T=1์ด๋ผ๋ฉด ๋ณดํ†ต์˜ softmax ์‹์ด๋‹ค. T๊ฐ€ ์ปค์ง€๋ฉด ํ›จ์”ฌ softํ•œ probability distribution์ด ๋‚˜์˜จ๋‹ค.
  • loss๋Š” ๋‘๊ฐ€์ง€๋ฅผ ์ฃผ๊ฒŒ ๋˜๋Š”๋ฐ,
    • ๋†’์€ T์— ๋Œ€ํ•ด์„œ distilled model๊ณผ cumbersome model์˜ output ์‚ฌ์ด์˜ cross entropy loss์™€
    • T=1๋กœ ๋‘๊ณ  hard label๊ณผ distilled model์˜ output ์‚ฌ์ด์˜ cross entropy loss๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.
  • ํ•˜์ง€๋งŒ ์ฒซ๋ฒˆ์งธ loss๊ฐ€ gradient ๊ณ„์‚ฐ ์‹œ ์œผ๋กœ scaling๋˜๋ฏ€๋กœ, ํ•ด๋‹น loss์— weight๋ฅผ ์ฃผ๋Š” ๊ฒƒ์ด ์ข‹๋‹ค. ๊ฐ๊ฐ์— ๋ฅผ ๊ณฑํ•ด์„œ ์ ์šฉํ•ด์ฃผ์ž. (๊ฒฐ๊ตญ hard target์€ ์•ˆ๊ณฑํ•œ๋‹ค๋Š” ๋ง ์•„๋‹Œ๊ฐ€..?)
    • softmax - cross entropy ์‹ ๋ฏธ๋ถ„ํ•ด๋ณด๋‹ˆ๊นŒ ์œผ๋กœ scaling๋œ๋‹ค.
    • ์ด๊ฒŒ hyper parameter๋ฅผ ๋ณ€๊ฒฝํ•˜๋”๋ผ๋„ ๊ฒฐ๊ตญ sfot target, hard target์˜ relative contribution์ด ์•ˆ๋ฐ”๋€Œ๋„๋ก ํ•ด์ค€๋‹ค.
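
The two-term loss described above can be sketched in NumPy. This is only a minimal illustration under assumptions of my own: the function names, the `hard_weight` parameter, the default values, and the example logits are all hypothetical and do not come from the paper.

```python
import numpy as np


def softmax_t(logits, T=1.0):
    """Temperature softmax: q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Mean cross-entropy over the batch between two distributions."""
    return float(-(target_probs * np.log(pred_probs + eps)).sum(axis=-1).mean())


def distillation_loss(student_logits, teacher_logits, labels, T=4.0, hard_weight=0.5):
    """Soft-target loss at temperature T (times T^2) plus a hard-label loss at T=1.

    T and hard_weight are illustrative defaults, not values from the paper.
    """
    student_logits = np.asarray(student_logits, dtype=float)
    n_classes = student_logits.shape[-1]
    # Soft term: teacher and student are both softened with the same T.
    soft = cross_entropy(softmax_t(teacher_logits, T), softmax_t(student_logits, T))
    # Hard term: one-hot labels against the student's ordinary (T=1) softmax.
    hard = cross_entropy(np.eye(n_classes)[labels], softmax_t(student_logits, 1.0))
    # Only the soft term is multiplied by T^2, to offset the 1/T^2 gradient scaling.
    return (T ** 2) * soft + hard_weight * hard


teacher = np.array([[1.0, 9.0, 3.0]])  # made-up logits for a single example
student = np.array([[0.5, 4.0, 1.0]])
labels = np.array([1])

print(softmax_t(teacher, T=1.0))  # nearly one-hot
print(softmax_t(teacher, T=5.0))  # noticeably softer
print(distillation_loss(student, teacher, labels))
```

Printing the teacher's distribution at two temperatures shows the softening effect directly: at a higher `T` the non-target classes receive visibly more probability mass, which is the signal the small model learns from.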

2.1 Matching logits is a special case of distillation

  • First, for the derivative of softmax cross-entropy, see https://ratsgo.github.io/deep%20learning/2017/10/02/softmax/.

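  • Cross-entropy gradient ($z_i$ is the distilled model's logit and $q_i$ its softened output; $v_i$ is the cumbersome model's logit and $p_i$ its soft target)

    $$\frac{\partial C}{\partial z_i} = \frac{1}{T}(q_i - p_i) = \frac{1}{T}\left(\frac{e^{z_i/T}}{\sum_j e^{z_j/T}} - \frac{e^{v_i/T}}{\sum_j e^{v_j/T}}\right)$$

    If the temperature is high enough here,

    $$\frac{\partial C}{\partial z_i} \approx \frac{1}{T}\left(\frac{1 + z_i/T}{N + \sum_j z_j/T} - \frac{1 + v_i/T}{N + \sum_j v_j/T}\right)$$

    because $z_i/T$ in the softmax gets close to 0, where the exponential has slope 1, so the approximation $e^x \approx 1 + x$ applies. ($N$ is the number of classes.)

    If the logits are also zero-mean ($\sum_j z_j = \sum_j v_j = 0$), this simplifies to

    $$\frac{\partial C}{\partial z_i} \approx \frac{1}{N T^2}(z_i - v_i)$$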

  • So at high temperature, distillation is equivalent to minimizing $\frac{1}{2}(z_i - v_i)^2$, that is, matching the distilled model's logits $z_i$ to the cumbersome model's logits $v_i$.
    • The gradient is multiplied by $T^2$ anyway, so the $1/T^2$ factor cancels, and
    • the loss has to be whatever integrates from a gradient proportional to $z_i - v_i$, so training amounts to minimizing $\frac{1}{2}(z_i - v_i)^2$.
    • The takeaway is that large-magnitude negative logits can carry useful information.
  • At lower temperatures, training pays much less attention to the negative logits.
    • Is that because at low temperature it tries to match the softmax values themselves?
    • This can actually be an advantage, since those logit values can be very noisy.
  • If the distilled model is too small to capture all of the parent model's knowledge, try a lower temperature (so that the large negative logits can be ignored).
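
The high-temperature claim can be checked numerically. The sketch below compares the exact soft-target gradient, $(q_i - p_i)/T$, with the logit-matching approximation $(z_i - v_i)/(N T^2)$ on random zero-mean logits. The function names, the random logits, and the chosen temperatures are all assumptions for illustration, not anything taken from the paper.

```python
import numpy as np


def softmax_t(logits, T):
    """Temperature softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()


def soft_target_grad(z, v, T):
    """Exact gradient of the soft cross-entropy w.r.t. z: (q - p) / T."""
    return (softmax_t(z, T) - softmax_t(v, T)) / T


def logit_matching_grad(z, v, T):
    """High-temperature approximation: (z - v) / (N * T^2)."""
    return (z - v) / (len(z) * T ** 2)


def max_rel_error(z, v, T):
    """Largest deviation between the two gradients, relative to the approximation."""
    exact = soft_target_grad(z, v, T)
    approx = logit_matching_grad(z, v, T)
    return np.abs(exact - approx).max() / np.abs(approx).max()


rng = np.random.default_rng(0)
z = rng.normal(size=10)  # hypothetical distilled-model logits
v = rng.normal(size=10)  # hypothetical cumbersome-model logits
z -= z.mean()  # the approximation assumes zero-mean logits
v -= v.mean()

for T in (1.0, 10.0, 100.0):
    print(f"T={T:>5}: max relative error = {max_rel_error(z, v, T):.4f}")
```

The error should shrink as `T` grows, which is the sense in which matching logits is a special case of distillation; at `T = 1` the two gradients differ substantially, consistent with the note that low temperatures behave differently.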

---

  • 3 Preliminary experiments on MNIST
  • 4 Experiments on speech recognition
  • 5 Training ensembles of specialists on very big datasets

์œ„ ์žฅ๋“ค์€ ์ฝ์–ด๋งŒ ๋ณด์ž

6 Soft Targets as Regularizers

  • Soft targets can also serve as one way to prevent overfitting.

---

๊ทธ ๋’ค๋„ ์ฝ์–ด๋งŒ ๋ณด์ž

Written on April 16, 2020
Tags: paper