📕 CS224n Lecture 17 The Natural Language Decathlon: Multitask Learning as Question Answering

This lecture is given by a guest speaker, Richard Socher, Chief Scientist at Salesforce.

The lecture as a whole is about multi-task learning, but let's first look at the limits of single-task learning. There has been enormous recent progress in datasets, tasks, models, and metrics, but in the past a new model had to start from an almost random state, or with only part of it pre-trained. Over time, as with word2vec, GloVe, CoVe, ELMo, and BERT, more and more of the model has been pretrained, and this has given better results when building a new model.

So why not use a model that is pretrained in its entirety?

And could many tasks be tied together under a single NLP framework?

To start, NLP tasks can be split into three broad categories:

  • sequence tagging : NER, aspect specific sentiment
  • text classification : dialogue state tracking, sentiment classification
  • seq2seq : MT, summarization, QA

The conclusion seems to include a little promotion for decaNLP, which Salesforce is developing, but in any case it is a system built with this kind of multitask learning as its goal. decaNLP is said to have no task-specific modules or parameters, yet it can adjust itself to perform several different tasks. The stated motivation was to be able to handle unseen tasks.
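As the lecture title says, the way to do without task-specific modules is to cast every task as question answering. Below is a minimal sketch of that framing, assuming each example is a (question, context, answer) triple of plain text. The class name `QAExample`, the helper `to_model_input`, and all example sentences are made up for illustration and are not taken from decaNLP.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class QAExample:
    """One example for any task, expressed as (question, context, answer)."""
    question: str  # names the task in natural language
    context: str   # the input text the task operates on
    answer: str    # the target output, always plain text


# Hypothetical examples, written for illustration only.
examples = [
    # text classification (sentiment)
    QAExample("Is this review positive or negative?",
              "The plot was thin but the acting was superb.",
              "positive"),
    # seq2seq (machine translation)
    QAExample("What is the translation from English to German?",
              "Thank you very much.",
              "Vielen Dank."),
    # seq2seq (summarization)
    QAExample("What is the summary?",
              "The meeting ran long, but the team agreed on a release date.",
              "The team agreed on a release date."),
]


def to_model_input(ex: QAExample) -> Tuple[str, str]:
    """Every task shares the same input signature: (question, context)."""
    return ex.question, ex.context
```

Because the task is named in the question rather than in the architecture, a single model can in principle be handed a new question at test time, which is where the hope of handling unseen tasks comes from.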

Then he explains the multitask QA model, which looks really fun. Fixed GloVe + character n-gram embeddings go through a linear layer, then through a shared BiLSTM with skip connections, and then there is an attention summation step that I did not properly understand. Why is it done that way..? Anyway, once the two sides have been mixed through attention, the result goes through another BiLSTM for dimensionality reduction and then through Transformer layers. I did not properly understand what happens after the Transformer layers either.
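One way to read the attention step is as attention running in both directions between the context and the question, so that each side gets a summary of the other. The sketch below is plain dot-product attention in pure Python, not the exact layer from the paper; the function names and the toy vectors are assumptions made for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: Matrix, keys: Matrix) -> Matrix:
    """For each query vector, return an attention-weighted sum of the key vectors."""
    dim = len(keys[0])
    out = []
    for q in queries:
        # Dot-product score between this query and every key.
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the keys, one output dimension at a time.
        out.append([sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)])
    return out


# Toy encodings (made-up numbers): 3 context tokens, 2 question tokens, dimension 2.
context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question = [[1.0, 0.0], [0.0, 2.0]]

# One summary of the question per context token, and vice versa.
context_with_question = attend(context, question)
question_with_context = attend(question, context)
```

Each context token ends up with a vector describing the parts of the question most relevant to it, which is one plausible reason for the step: the later layers can then work on context representations that already know what is being asked.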

Each task is paired with its own dataset and metric.

Next he explains training strategies for multitask learning. The first is fully joint.

The first strategy we consider is fully joint. In this strategy, batches are sampled round-robin from all tasks in a fixed order from the start of training to the end. This strategy performed well on tasks that required fewer iterations to converge during single-task training (see Table 3), but the model struggles to reach single-task performance for several other tasks. In fact, we found a correlation between the performance gap between single and multitasking settings of any given task and number of iterations required for convergence for that task in the single-task setting.
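The round-robin sampling in the quoted passage can be sketched as follows. This is a minimal illustration, assuming each task's batches are just a list; the function name `fully_joint`, the task names, and the batch labels are hypothetical.

```python
from itertools import cycle, islice
from typing import Dict, Iterator, List, Tuple


def fully_joint(batches_by_task: Dict[str, List[str]]) -> Iterator[Tuple[str, str]]:
    """Yield (task, batch) pairs, visiting the tasks round-robin in a fixed order."""
    # Each task restarts its own batches when it runs out of them.
    iterators = {task: cycle(batches) for task, batches in batches_by_task.items()}
    # Dict insertion order serves as the fixed task order.
    for task in cycle(batches_by_task):
        yield task, next(iterators[task])


# Hypothetical task names and batch labels.
data = {"qa": ["qa-0", "qa-1"], "mt": ["mt-0", "mt-1"], "sum": ["sum-0"]}
schedule = list(islice(fully_joint(data), 6))
```

Every task gets the same number of updates regardless of how many iterations it needs on its own, which fits the quoted observation that slow-converging tasks fall behind their single-task performance.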

I could not follow the lecture's explanation well, so I looked up the paper. The passage quoted above is how the paper describes the fully joint strategy: batches are sampled round-robin (RR) over the tasks in a fixed order for the whole of training. Tasks that need many iterations to converge reportedly do not do well under it. The paper discusses this in terms of curriculum learning [1] and tries anti-curriculum learning instead, which splits training into two phases. The first phase trains jointly on only a subset of the tasks, usually the harder ones. The second phase trains on all tasks with the fully joint strategy.
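The two-phase schedule can be sketched as a generator that decides which task to train on at each step. This is only an illustration under stated assumptions: the function name `anti_curriculum`, the step counts, and the choice of which task counts as hard are all hypothetical.

```python
from itertools import cycle
from typing import Iterator, List


def anti_curriculum(hard_tasks: List[str], all_tasks: List[str],
                    phase_one_steps: int, total_steps: int) -> Iterator[str]:
    """Yield the task to train on at each step.

    Phase one cycles round-robin over the harder subset only;
    phase two cycles round-robin over every task (fully joint).
    """
    phase_one = cycle(hard_tasks)
    phase_two = cycle(all_tasks)
    for step in range(total_steps):
        yield next(phase_one) if step < phase_one_steps else next(phase_two)


# Hypothetical split: treating "qa" as the hard task is an assumption here.
order = list(anti_curriculum(["qa"], ["qa", "mt", "sum"],
                             phase_one_steps=2, total_steps=5))
```

With these toy settings the first two steps go to the hard task alone, and the remaining steps rotate over all three tasks.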

Red is the first phase and blue is the rest. The reddish side is the hard tasks and the opposite side is the easy ones.

Anyway, he talks at length about the results and then wraps up. I think I need to read more of the Related Work before the earlier parts will really make sense.

I should read through the list above..

  1. Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009. I must read this later.. I should at least understand it..

July 7, 2019
Tags: cs224n