CS330 Lecture 2 Multi-Task & Meta-Learning Basics
This is Lecture 2: Multi-Task & Meta-Learning Basics.
- Course site: http://cs330.stanford.edu/
- Lecture videos: https://www.youtube.com/playlist?list=PLoROMvodv4rMC6zfYmnD7UG3LVvwaITY5
- Lecture 2 PDF
Multitask learning
Model
- MultiTask Learning objective: \(\min_\theta \sum^T_{i=1}\mathscr{L}_i(\theta, \mathscr{D}_i)\) (Loss: \(\mathscr{L}\), Dataset: \(\mathscr{D}\))
- The simplest way to do multi-task learning: train a separate expert model per task and pick the right one by task type. -> No shared parameters.
- Another option: feed the task index into the classifier as an extra input feature.
- Opinion: this amounts to appending a one-hot task vector, so will it actually train well? Using separate classifiers looks better to me.
- Other approaches (sketched in code after this list):
- Multi-head classification -> the multi-task learning setup I usually picture; think MT-DNN.
- Multiplying the input vector by a task embedding and classifying from the result (multiplicative gating).
- Multiplicative conditioning generalizes independent networks and independent heads at once.
- Opinion: it seems hard to find places to apply this, though. The tasks apparently need similar kinds of classification, the same number of labels, and correlated labels across tasks.
- How to choose a conditioning method
- Problem Dependent
- Largely guided by intuition or knowledge of the problem
- currently more of an art than a science
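Below is a minimal PyTorch sketch of the three conditioning options above (task-index concatenation, multi-head, and multiplicative gating). Module names and dimensions are illustrative, not from the lecture:

```python
import torch
import torch.nn as nn

NUM_TASKS, IN_DIM, HID, OUT_DIM = 3, 16, 64, 10  # illustrative sizes

class ConcatConditioned(nn.Module):
    """Append a one-hot task index to the input features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IN_DIM + NUM_TASKS, HID), nn.ReLU(), nn.Linear(HID, OUT_DIM))

    def forward(self, x, task_id: int):
        one_hot = torch.zeros(x.size(0), NUM_TASKS)
        one_hot[:, task_id] = 1.0
        return self.net(torch.cat([x, one_hot], dim=-1))

class MultiHead(nn.Module):
    """Shared trunk with one output head per task (MT-DNN-style)."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(IN_DIM, HID), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(HID, OUT_DIM) for _ in range(NUM_TASKS))

    def forward(self, x, task_id: int):
        return self.heads[task_id](self.trunk(x))

class MultiplicativeGating(nn.Module):
    """Gate the hidden features elementwise with a learned task embedding."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(IN_DIM, HID), nn.ReLU())
        self.task_emb = nn.Embedding(NUM_TASKS, HID)
        self.out = nn.Linear(HID, OUT_DIM)

    def forward(self, x, task_id: int):
        gate = self.task_emb(torch.tensor(task_id))  # shape (HID,)
        return self.out(self.trunk(x) * gate)        # broadcasts over the batch
```

All three minimize the same summed objective; multiplicative gating is the most general of the three, since an all-ones gate recovers a fully shared network while near-binary gates carve out per-task subnetworks.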
Objective
- The vanilla MTL objective works fine, but a weighted sum is also used often: \(\min_\theta \sum^T_{i=1}w_i\mathscr{L}_i(\theta, \mathscr{D}_i)\)
- -> I also use the weighted form more often.
- There are several ways to choose the weights, for example (see the sketch after this list):
- various heuristics (Chen et al. GradNorm. ICML 2018)
- use task uncertainty (see Kendall et al. CVPR 2018) https://arxiv.org/abs/1705.07115
- I skimmed it briefly; this looks good for the general case.
- optimize for the worst-case task loss for fairness and robustness
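A compact sketch of these weighting schemes, assuming `task_losses` is a list of the per-task losses \(\mathscr{L}_i\). The exp(-s) parameterization is a common simplified form of the Kendall et al. objective, not the paper's exact per-task-type loss:

```python
import torch

def weighted_mtl_loss(task_losses, weights):
    """Weighted MTL objective: sum_i w_i * L_i."""
    return sum(w * L for w, L in zip(weights, task_losses))

class UncertaintyWeighting(torch.nn.Module):
    """Task-uncertainty weighting in the spirit of Kendall et al. 2018:
    a learnable log-variance s_i per task, objective sum_i exp(-s_i)*L_i + s_i,
    so the model learns to down-weight noisier tasks."""
    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = torch.nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        losses = torch.stack(list(task_losses))
        return (torch.exp(-self.log_vars) * losses + self.log_vars).sum()

def worst_case_loss(task_losses):
    """Min-max objective: improve the worst-off task (fairness/robustness)."""
    return torch.stack(list(task_losses)).max()
```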
Optimization
- Nothing new here; mainly check that tasks are sampled uniformly in each batch (sketch below).
- For regression problems, check that the labels across tasks are on the same scale.
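A sketch of one training step under these recommendations; `model` is assumed to take `(x, task_id)` like the conditioning sketches above, and `task_loaders[i]` is assumed to be an iterator over task i's batches:

```python
import random
import torch
import torch.nn.functional as F

def mtl_train_step(model, task_loaders, optimizer, tasks_per_step=2):
    """One update: sample tasks uniformly, take an equal-size minibatch
    from each, and backprop the summed loss."""
    optimizer.zero_grad()
    total = 0.0
    for task_id in random.sample(range(len(task_loaders)), tasks_per_step):
        x, y = next(task_loaders[task_id])      # equal-size minibatch per task
        total = total + F.cross_entropy(model(x, task_id), y)
    total.backward()
    optimizer.step()
    return float(total)
```

For regression tasks, normalizing the targets beforehand keeps the summed losses on a comparable scale.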
Common Challenges in MTL
- Negative transfer: when independent networks work the best
- Maybe an optimization problem
- caused by cross-task interference.
- Tasks may learn at different rates.
- Maybe limited representational capacity
- MT networks often need to be much larger than single-task models.
- If negative transfer occurs, share fewer parameters.
- Overfitting
- Share more parameters (the opposite remedy; see the soft-sharing sketch below).
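One way to move continuously between "share fewer" and "share more" parameters is soft parameter sharing: keep per-task networks but penalize how far their weights drift apart. A minimal sketch, assuming identical architectures per task (the penalty form and coefficient are illustrative):

```python
import torch

def soft_sharing_penalty(task_models, coef=1e-2):
    """L2 penalty pulling corresponding parameters of per-task networks
    together. coef ~ 0 behaves like independent networks (less sharing,
    counters negative transfer); a large coef behaves like one shared
    network (more sharing, counters overfitting)."""
    penalty = torch.tensor(0.0)
    per_model = [list(m.parameters()) for m in task_models]  # same architecture assumed
    for tied in zip(*per_model):
        mean = torch.stack(tied).mean(dim=0)
        penalty = penalty + sum((p - mean).pow(2).sum() for p in tied)
    return coef * penalty
```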
Case study
- Recommending What Video to Watch Next: A Multitask Ranking System
- The recommendation system of YouTube
- conflicting objectives
- videos that users will rate highly
- videos that users will share
- videos that users will watch
- Which of these objectives should drive the recommendation?
- implicit bias caused by feedback: the model's recommendations influence user behavior, so the feedback it later trains on can itself be biased.
- Read the paper for the details.
MTL vs Transfer Learning
The slides compare MTL with transfer learning; the lecture video compares it with meta-learning.
- MTL: Solve multiple tasks at once
- Transfer Learning: solve the target task after solving a source task, by transferring knowledge learned from the source task.
- Key assumption: cannot access the source task dataset during transfer
- Transfer learning is a valid solution to MTL (not vice versa)
April 6, 2021
Tags:
cs330