📕 CS224n Lecture 13 Modeling contexts of use: Contextual Representations and Pretraining

Notes on Lecture 13! Since around Lecture 11 the lectures have mostly been survey-style introductions, so these notes are largely just a collection of good links.

Reflections on word representation

So far we have trained word embeddings from scratch, but the message here is: use pretrained models instead. The reason is that they can be trained on far more words and far more data, and in practice they also appear to perform better.

๊ทผ๋ฐ unknown words์— ๋Œ€ํ•ด์„œ๋Š” ์–ด๋–ป๊ฒŒ ๋Œ€์‘ํ•  ๊ฒƒ์ธ๊ฐ€? UNK์œผ๋กœ ๋งคํ•‘ํ•ด์„œ ์–ด์ฉŒ๊ตฌ์ €์ฉŒ๊ตฌ๋ฅผ ํ•˜์ง€๋งŒ ๊ฒฐ๋ก ์€ char-level model์„ ์‚ฌ์šฉํ•˜์ž! ๋˜๋Š” ํ…Œ์ŠคํŠธ๋•Œ <UNK>๊ฐ€ unsupervised word embedding์— ์กด์žฌํ•œ๋‹ค๋ฉด ๊ทธ๊ฑธ ๊ณ„์† ์“ฐ๊ณ , ๊ทธ๋ƒฅ ์•„์˜ˆ ๋ชจ๋ฅด๋Š” ๊ฒƒ์€ random vector๋กœ ๋งŒ๋“ ๋‹ค์Œ์— vocab์— ์ถ”๊ฐ€ํ•˜๋Š” ๋ฐฉ๋ฒ•๋„ ๊ณ ๋ คํ•ด๋ณด๋ผ๊ณ  ํ•œ๋‹ค. 1

Either way, training word embeddings from scratch has two big problems: a word gets exactly the same representation regardless of the context it appears in, and a single vector has to cover every sense and usage of that word.

What could solve this? In an NLM, the LSTM layers are trained to predict the next word. Doesn't that mean they already produce context-specific word representations?

LSTM Layer in NLM
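A minimal PyTorch sketch of that idea (illustrative shapes and names, not the lecture's code): the hidden state the LSTM produces at each position is itself a context-dependent representation of that token.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim = 10_000, 100, 256

embed = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hid_dim, num_layers=2, batch_first=True)
to_vocab = nn.Linear(hid_dim, vocab_size)       # the NLM head: predict the next word

tokens = torch.randint(0, vocab_size, (1, 7))   # one sentence of 7 token ids
hidden, _ = lstm(embed(tokens))                 # (1, 7, hid_dim)

next_word_logits = to_vocab(hidden)   # what the LM is actually trained on
contextual_reps = hidden              # hidden[:, t] depends on the words before
                                      # position t: a context-specific embedding
```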

Pre-ELMo and ELMo

TagLM [2] is a paper that predates ELMo. The main idea: we want word representations computed in context, but without departing much from the existing training setup. So it adopts a semi-supervised approach. This is the "Pre-ELMo".

TagLM

There was also a model called CoVe, but the lecture just skipped over it.

ELMo is the model from the paper titled "Deep Contextualized Word Representations". It is the breakout version of contextual word token vectors: the word token vectors are learned from long context (whereas other models learn from a fixed-window context..?).

It uses a bidirectional LM, but for performance reasons not an absurdly large one: it is implemented with two biLSTM layers. A character CNN is used for the initial word representations, and residual connections are used as well. The details require reading the paper.
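One piece that is easy to show without the paper's details is how downstream tasks consume ELMo: a task-learned, softmax-weighted sum of the layer representations, scaled by a scalar. A rough sketch with made-up tensors (not the actual ELMo implementation):

```python
import torch
import torch.nn as nn

class ElmoStyleMixer(nn.Module):
    """Task-specific weighted sum over the L layer representations."""
    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))  # per-layer mixing weights
        self.gamma = nn.Parameter(torch.ones(()))       # overall scale

    def forward(self, layer_reps):            # list of (batch, seq, dim) tensors
        w = torch.softmax(self.s, dim=0)
        stacked = torch.stack(layer_reps)     # (L, batch, seq, dim)
        return self.gamma * (w[:, None, None, None] * stacked).sum(dim=0)

# e.g. char-CNN token layer + two biLSTM layers = 3 layer representations
reps = [torch.randn(1, 7, 512) for _ in range(3)]
mixed = ElmoStyleMixer(num_layers=3)(reps)    # (1, 7, 512) task-specific vectors
```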

ULMfit and onward

ULMfit: Universal Language Model Fine-tuning [3]

The key question is how NLM knowledge can be shared and reused; the lecture uses text classification as the running example. ULMfit also seems to be known for being reasonably sized: it could reportedly be trained on a single GPU. Keywords like transfer learning are worth looking up alongside it.

After ULMfit, ever larger models keep appearing. OpenAI's GPT-2, which used 2048 TPUs, reportedly performs quite well.

๊ทผ๋ฐ ํฐ ๋ชจ๋ธ๋“ค์€ ์ „๋ถ€ Transformer๋‹ค.

Transformer architecture

Motivation

We want to parallelize RNN computation, but we still need the long-range dependencies. Recurrent models were already using attention for exactly that, and since attention captures those dependencies, why not drop the recurrent model and use attention alone?

Overview

Read the "Attention Is All You Need" paper. The resources recommended to read alongside it are roughly the ones below.

The paper covers dot-product attention, scaled dot-product attention, and multi-head attention; I didn't fully understand them from one read, so I'll revisit them when I write up the paper properly.
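Until that write-up, here is a minimal NumPy sketch of scaled dot-product attention, the building block the other two variants extend (shapes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- a single attention head, no masking."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # attention-weighted sum of values

Q = np.random.randn(5, 64)   # 5 query positions, d_k = 64
K = np.random.randn(7, 64)   # 7 key/value positions
V = np.random.randn(7, 64)
out = scaled_dot_product_attention(Q, K, V)         # (5, 64)
```

Plain dot-product attention is the same thing without the 1/sqrt(d_k) scaling, and multi-head attention runs several of these in parallel on learned linear projections of Q, K, V and concatenates the results.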

Keywords/papers to look up:

  • byte-pair encoding (see the toy sketch after this list)
  • checkpoint averaging
  • adam optimizer
  • dropout
  • label smoothing
  • auto-regressive decoding with beam search and length penalties
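For byte-pair encoding, a toy sketch of the core merge loop: start from characters and repeatedly merge the most frequent adjacent symbol pair (a simplified illustration, not a production tokenizer):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = Counter(tuple(w) for w in words)   # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for word, freq in corpus.items():       # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(bpe_merges(["low", "lower", "lowest", "newer", "wider"], 5))
# early merges build subwords like 'lo' -> 'low' and 'er'
```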

BERT

๋…ผ๋ฌธ์€ ์—ฌ๊ธฐ๋ฅผ ๋ณด๋ฉด ๋œ๋‹ค.

BERT์˜ ํ•ต์‹ฌ ์•„์ด๋””์–ด๋Š” ์–ธ์–ด๋Š” ์–‘๋ฐฉํ–ฅ์œผ๋กœ ์ดํ•ดํ•ด์•ผ ํ•˜๋Š”๋ฐ, ์™œ ํ•œ์ชฝ๋งŒ ๋ณผ๊นŒ?๋ผ๋Š” ๊ฒƒ์ด๋‹ค. ๋”ฐ๋ผ์„œ Bidrectional context๋ฅผ ๊ตฌ์„ฑํ–ˆ๋‹ค. ํ•™์Šต์€ k%์˜ ๋‹จ์–ด๋ฅผ ๊ฐ€๋ฆฌ๊ณ  ๊ทธ ๋‹จ์–ด๋“ค์— ๋Œ€ํ•œ prediction์„ ํ†ตํ•ด ํ•˜๊ฒŒ ๋˜์—ˆ๋‹ค. ํ•ญ์ƒ 15%๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค๊ณ  ํ•˜๋Š”๋ฐ, k๊ฐ€ ๋†’์œผ๋ฉด context๊ฐ€ ์ถฉ๋ถ„ํ•˜์ง€ ์•Š๊ณ , k๊ฐ€ ๋„ˆ๋ฌด ์ ์œผ๋ฉด ํ•™์Šตํ•˜๊ธฐ์—๋Š” ๋„ˆ๋ฌด cost๊ฐ€ ๋†’๋‹ค.

There is also Next Sentence Prediction, a task for learning relationships between sentences: given sentences A and B, predict whether B is IsNextSentence or NotNextSentence.
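Training pairs for this can be built by pairing a sentence with either its true successor or a random sentence, roughly 50/50. A toy sketch (my own naming), assuming `sentences` is a list of consecutive sentences from a document:

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sentence A, sentence B, label) examples for next sentence prediction."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:               # positive: B really follows A
            pairs.append((sentences[i], sentences[i + 1], "IsNextSentence"))
        else:                                # negative: B is some other sentence
            j = rng.randrange(len(sentences))
            while j == i + 1:                # avoid accidentally picking the true next one
                j = rng.randrange(len(sentences))
            pairs.append((sentences[i], sentences[j], "NotNextSentence"))
    return pairs
```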

BERT uses the Transformer encoder with self-attention, so there is no locality bias, and long-distance context is taken into account just as well.

  1. A Comparative Study of Word Embeddings for Reading Comprehension

  2. Semi-supervised sequence tagging with bidirectional language models (TagLM)

  3. Universal Language Model Fine-tuning for Text Classification

Written on June 9, 2019
Tags: cs224n machine learning nlp