📃 GPT2 Review

Following GPT, I also gave the GPT-2 paper (Language Models are Unsupervised Multitask Learners) a quick read. As before, I skip any part that is tedious to summarize.

Abstract

NLP tasks are usually solved by supervised learning on task-specific datasets, but GPT-2 tries to solve them with a language model alone, without that supervision. The largest model, GPT-2, is a Transformer with 1.5B parameters, and even though it still underfits WebText, it reportedly achieves SOTA on 7 of the 8 language modeling datasets it was tested on.

1. Introduction

The models that have recently performed best on language tasks combine pre-training with supervised fine-tuning, and the paper frames this line of work as transferring learned information in increasingly flexible ways. First, word vectors were learned (as in Word2Vec) and fed as input to task-specific architectures; then contextual representations from recurrent networks were used; and now task-specific architectures are no longer needed at all, since stacks of self-attention blocks are enough. The problem is that these approaches still require supervised training. For the case where supervised training is not possible, i.e. when supervised data is absent or very scarce, there is a separate line of work showing that language models can be made to perform specific tasks.

So GPT-2 combines these lines of work: train a language model at enormous scale and have it perform down-stream tasks without any parameter tuning or architecture modification.

2. Approach

The core is language modeling. Since language has a natural sequential ordering, it is common to factorize the joint probability over symbols as a product of conditional probabilities.
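
For reference, the factorization the paper writes down, with a sequence x made of symbols s_1, ..., s_n, looks like this in LaTeX:

    p(x) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})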

(๊ทผ๋ฐ ์ด๊ฑฐ ์ด ์•„๋‹ˆ๋ผ ์•„๋‹๊นŒโ€ฆ?) ์—ฌ๊ธฐ์„œ conditional probability๊ฐ€ ๋‚˜์™”์œผ๋‹ˆ๊นŒ ์ด ๊ฒƒ๋“ค์„ ์ž˜ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋Š” self-attention arhictecture๋กœ ์ž˜ ๊ณ„์‚ฐํ•œ๋‹ค.

๊ทผ๋ฐ general system์€ ๋งŽ์€ ํƒœ์Šคํฌ๋“ค์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•˜๋Š”๋ฐ, ์œ„ ํ˜•ํƒœ๋Š” ๋ฐ–์— ์ˆ˜ํ–‰์„ ๋ชปํ•œ๋‹ค. ๊ทธ๋ž˜์„œ ์™€ ๊ฐ™์€ ํ˜•ํƒœ๋กœ ๋ชจ๋ธ๋ง์„ ํ•œ๋‹ค๊ณ  ํ•œ๋‹ค. task conditioning์€ ๋ณดํ†ต architectrure level์—์„œ ๊ตฌํ˜„ํ•˜๋Š” ๊ฒƒ์€ task specific encoders and decoders(Kaiser et al., 2017)์™€ ๊ฐ™์€ ๊ฒƒ์„ ์‚ดํŽด๋ณด๋ฉด ๋  ๊ฒƒ ๊ฐ™๋‹ค. ๊ทธ์™€ ๋ฐ˜๋Œ€๋กœ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ ˆ๋ฒจ์—์„œ ๊ตฌํ˜„ํ•˜๋Š” ๊ฒƒ์€ the inner and outer loop optimization framework of MAML (Finn et al., 2017)๊ฐ™์€ ๊ฒƒ์„ ์‚ดํŽด๋ณด๋ฉด ๋  ๊ฒƒ ๊ฐ™๋‹ค.

์œ„์˜ LM์œผ๋กœ ์—„์ฒญ ํฐ ๋ฐ์ดํ„ฐ๋ฅผ ๋„ฃ๊ณ  ์—„์ฒญ๋‚˜๊ฒŒ ๋งŽ์ด ํ•™์Šต์‹œ์ผœ์„œ GPT-2๋ฅผ ๋งŒ๋“ค์–ด๋‚ด๋Š”๋ฐ, ์ด๋Ÿฐ ํ•™์Šต๋ฐฉ์‹์ด ๊ธฐ๊ณ„๋ฒˆ์—ญ, Reading Comprehension๊ณผ ๊ฐ™์€ ๊ฒƒ์—๋„ ์ ์šฉ๋  ์ˆ˜ ์žˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด (translate to french, english text, french text)์™€ ๊ฐ™์€ ์ˆœ์„œ๋กœ ๊ตฌ์„ฑ๋˜๋ฉด ๋ฒˆ์—ญ ํƒœ์Šคํฌ๊ฐ€ ๋˜๋Š” ๊ฒƒ์ด๊ณ , (answer the question, document, question, answer)์™€ ๊ฐ™์€ ์ˆœ์„œ๋กœ ๊ตฌ์„ฑ๋˜๋ฉด reading comprehension์ด ๋˜๋Š” ๊ฒƒ์ด๋‹ค.

If the LM is trained this way on a massive corpus, a QA task, for example, can be handled by the LM alone, without any reward signal that is only usable for QA.

2.1. Training Dataset

The data was basically gathered by web crawling. To keep document quality up, though, the crawl had to be restricted somehow, and having humans curate everything would take far too much effort and time, so they scraped every outbound link from Reddit posts that received at least 3 karma. The idea is that this works as a heuristic for whether other users actually found a link interesting, educational, or just funny. To pull the text out of the HTML they used the Dragnet and Newspaper content extractors.
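
As a rough illustration of the extraction step, here is a minimal sketch using the newspaper3k package (the Python incarnation of the Newspaper extractor mentioned above); the URL is a placeholder and this is not the authors' actual pipeline:

    from newspaper import Article

    # Placeholder URL standing in for an outbound link scraped from Reddit.
    url = "https://example.com/some-article"

    article = Article(url)
    article.download()   # fetch the raw HTML
    article.parse()      # strip markup/boilerplate, keep the main body text

    print(article.title)
    print(article.text[:500])   # first 500 characters of the extracted text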

Wikipedia was excluded entirely, because it is such a common data source for other datasets that the overlap could complicate the analysis on the test evaluation tasks.

2.3. Model

It is Transformer-based and built on the same foundation as the GPT model, but Layer Normalization is moved to the input of each sub-block (the sub-blocks presumably being the self-attention and feed-forward parts inside each Transformer block), and an additional layer norm is added after the final self-attention block.
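
To see what "layer norm moved to the input of each sub-block" means in code, here is a minimal pre-LN block sketch in PyTorch; this is my own illustration rather than the released GPT-2 code, and the 768/12 sizes are just the smallest GPT-2 configuration:

    import torch
    import torch.nn as nn

    class PreLNBlock(nn.Module):
        """GPT-2 style block: LayerNorm is applied at the input of each sub-block."""
        def __init__(self, d_model=768, n_head=12):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
            self.ln2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )

        def forward(self, x):
            # Causal mask: position i may only attend to positions <= i.
            t = x.size(1)
            mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
            h = self.ln1(x)                            # pre-LN before the attention sub-block
            a, _ = self.attn(h, h, h, attn_mask=mask)
            x = x + a                                  # residual connection
            x = x + self.mlp(self.ln2(x))              # pre-LN before the feed-forward sub-block
            return x

GPT-2 additionally applies one more LayerNorm after the final block, which is not shown in this sketch.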


The key point seems to be: "scale the data and compute up enough, and a language model alone can work well across a wide range of tasks and datasets."

๋” ์ฝ์–ด๋ณด๊ณ  ์‹ถ์€ ๋ฆฌ์ŠคํŠธ

  • Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
  • Kaiser, L., Gomez, A. N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., and Uszkoreit, J. One model to learn them all. arXiv preprint arXiv:1706.05137, 2017.
Written on October 27, 2019
Tags: nlp paper