📕 CS330 Lecture 2 Multi-Task & Meta-Learning Basics

2๊ฐ•์ด๊ณ  Multi-Task & Meta-Learning Basics์ด๋‹ค.


Multitask learning

Model

  • MultiTask Learning objective: \(\min_\theta \sum^T_{i=1}\mathscr{L}_i(\theta, \mathscr{D}_i)\) (Loss: \(\mathscr{L}\), Dataset: \(\mathscr{D}\))
  • The easiest way to do multitask learning: train a separate expert model per task and pick one by task type. -> No shared parameters.
  • Another approach: feed the task index into the classifier as an input feature.
    • My take: this sounds like appending a one-hot vector; would that actually train well? Using separate classifiers seems better.
  • Other approaches (sketched in code after this list)
    • Multi-head classification -> the multi-task learning I usually picture; think MT-DNN.
    • Multiply the input vector by a task embedding before classifying (multiplicative gating)
      • Multiplicative conditioning generalizes independent networks and independent heads all at once.
      • My take: it looks hard to find places where this applies, though. It seems to require similar kinds of classification, the same number of labels, and labels that correlate across tasks.
  • How to choose a conditioning method
    • Problem Dependent
    • Largely guided by intuition or knowledge of the problem
    • currently more of an art than a science
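
A minimal PyTorch sketch (my own, not from the lecture) contrasting two of the options above: a shared trunk with per-task heads, and multiplicative gating with a learned task embedding. All dimensions and class names (`IN_DIM`, `GatedMTL`, ...) are hypothetical.

```python
import torch
import torch.nn as nn

NUM_TASKS, IN_DIM, HID, OUT_DIM = 3, 16, 64, 10  # hypothetical sizes

class MultiHeadMTL(nn.Module):
    """Shared trunk + one output head per task (hard parameter sharing)."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(IN_DIM, HID), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(HID, OUT_DIM) for _ in range(NUM_TASKS))

    def forward(self, x, task_id):
        return self.heads[task_id](self.trunk(x))

class GatedMTL(nn.Module):
    """Multiplicative conditioning: a learned task embedding gates the features."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(IN_DIM, HID), nn.ReLU())
        self.task_emb = nn.Embedding(NUM_TASKS, HID)  # one gate vector per task
        self.out = nn.Linear(HID, OUT_DIM)

    def forward(self, x, task_id):
        gate = self.task_emb(torch.tensor(task_id))
        return self.out(self.trunk(x) * gate)  # elementwise task-specific gating

x = torch.randn(4, IN_DIM)
print(MultiHeadMTL()(x, task_id=1).shape)  # -> torch.Size([4, 10])
print(GatedMTL()(x, task_id=1).shape)      # -> torch.Size([4, 10])
```

One intuition for the "generalizes" claim: a one-hot task embedding as the gate would zero out all but a task-specific slice of the features, recovering effectively disjoint per-task subnetworks.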

Objective

  • The vanilla MTL objective works well enough, but a weighted sum is also common: \(\min_\theta \sum^T_{i=1}w_i\mathscr{L}_i(\theta, \mathscr{D}_i)\)
    • -> I use this version more often myself
  • There are several ways to set the weights, roughly:
    • various heuristics (Chen et al. GradNorm. ICML 2018)
    • use task uncertainty (see Kendall et al. CVPR 2018) https://arxiv.org/abs/1705.07115
      • I skimmed it; this looks good for the general case.
    • optimize for the worst-case task loss for fairness and robustness
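
As a concrete instance of the task-uncertainty idea from Kendall et al., here is a commonly used simplified form in PyTorch: each task gets a learnable \(s_i = \log\sigma_i^2\) and the combined loss is \(\sum_i e^{-s_i}\mathscr{L}_i + s_i\). The class and setup are my own sketch, not code from the paper.

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Learnable task weights via homoscedastic uncertainty (after Kendall et al.).

    Simplified form: total = sum_i exp(-s_i) * L_i + s_i, with s_i learned.
    High-uncertainty tasks get down-weighted; the +s_i term keeps s_i bounded.
    """
    def __init__(self, num_tasks):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))  # s_i, init sigma=1

    def forward(self, task_losses):
        losses = torch.stack(list(task_losses))
        return (torch.exp(-self.log_vars) * losses + self.log_vars).sum()

# Usage: compute per-task losses as usual, then optimize the model parameters
# and the log-variances jointly with the same optimizer.
weighter = UncertaintyWeighting(num_tasks=2)
l1, l2 = torch.tensor(0.8), torch.tensor(2.5)  # placeholder per-task losses
print(weighter([l1, l2]))
```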

Optimization

  • Nothing new here; it helps to check that tasks are being sampled uniformly.
  • For regression problems, check that the labels are on the same scale across tasks (toy check below).
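
A toy illustration of both checks, assuming PyTorch; the task names, dataset sizes, and label scales are made up:

```python
import random
import torch

# Two hypothetical tasks with very different dataset sizes and label scales.
data = {
    "task_a": torch.randn(100_000),        # labels on a ~1 scale
    "task_b": torch.randn(500) * 1000.0,   # labels on a ~1000 scale
}

# Sample the *task* uniformly (rather than pooling examples across tasks),
# so the small task is seen as often as the large one.
task = random.choice(list(data))
batch = data[task][torch.randint(len(data[task]), (32,))]

# Check label scales: mismatched scales let one task dominate a summed MSE.
for name, y in data.items():
    print(f"{name}: mean={y.mean().item():.2f}, std={y.std().item():.2f}")
```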

Common Challenges in MTL

  • Negative transfer: when independent networks work best
    • Maybe an optimization problem
      • caused by cross-task interference
      • tasks may learn at different rates
    • Maybe limited representational capacity
      • multi-task networks often need to be much larger than their single-task counterparts
    • If negative transfer occurs, share fewer parameters.
  • Overfitting
    • Share more parameters (both remedies are sketched below)
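
One hypothetical way to act on this share-more/share-less dial is to expose the number of shared vs. task-specific layers as a hyperparameter; a PyTorch sketch (the function name and arguments are my own):

```python
import torch
import torch.nn as nn

def make_mtl_net(num_tasks, n_shared, n_private, dim=64):
    """First n_shared blocks are shared across tasks, the rest are per-task.
    More sharing -> stronger regularization (fights overfitting);
    less sharing -> less interference (fights negative transfer)."""
    block = lambda: nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
    shared = nn.Sequential(*[block() for _ in range(n_shared)])
    private = nn.ModuleList(
        nn.Sequential(*[block() for _ in range(n_private)]) for _ in range(num_tasks)
    )
    return shared, private

shared, private = make_mtl_net(num_tasks=3, n_shared=2, n_private=1)
x = torch.randn(8, 64)
print(private[0](shared(x)).shape)  # forward pass for task 0 -> torch.Size([8, 64])
```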

Case study

  • Recommending What Video to Watch Next: A Multitask Ranking System
    • YouTube's recommendation system
    • conflicting objectives
      • videos that users will rate highly
      • videos that users will share
      • videos that users will watch
      • Which of these should be recommended?
    • Implicit bias caused by feedback: the model's recommendations influence user behavior, so the feedback itself can become biased.
    • For details, read the paper.

MTL vs Transfer Learning

์Šฌ๋ผ์ด๋“œ์—๋Š” Transfer Learning๊ณผ์˜ ๋น„๊ต๊ฐ€ ์กด์žฌ. ๋น„๋””์˜ค์—๋Š” meta learning๊ณผ์˜ ๋น„๊ต๊ฐ€ ์กด์žฌ

  • MTL: Solve multiple tasks at once
  • Transfer Learning: Solve a target task after solving a source task, by transferring knowledge learned from the source task.
    • Key assumption: cannot access the source task dataset during transfer
  • Transfer learning is a valid solution to MTL (not vice versa)
April 6, 2021
Tags: cs330