📕 CS224n Lecture 1 Introduction and Word Vectors

Last week I started a CS224n study group! Unlike when I took CS231n, I figured that if I did it with other people I might actually finish the course, so I rounded up some people around me. This post is a summary of my notes for Lecture 1.

Introduction

A brief overview of what the course will cover.

  • NLP with Deep Learning
  • The big picture of understanding human language (the key insights..?)
  • Actually implementing things in PyTorch
    • word meaning
    • dependency parsing
    • machine translation
    • question answering, and so on…

Starting from this offering, the course uses PyTorch.

Word Vectors

Human Language and word meaning

Language itself is quite an uncertain thing. It is a means of conveying information, and also a means of socializing (a way for people to network with each other). Analyzing language with computers is a fairly recent endeavor compared to fields like computer vision. One thing to keep in mind: compared to other means of exchanging information, passing information through language is remarkably slow.

So what is the meaning that we convey through language and words? It is "the something that a word, phrase, etc. is trying to express." The common way to represent meaning is therefore the pairing signifier (symbol) <=> signified (idea or thing).

๊ทผ๋ฐ ๊ทธ๋Ÿผ meaning์„ ์ปดํ“จํ„ฐ๋กœ๋Š” ์–ด๋–ป๊ฒŒ ์‚ฌ์šฉํ•˜๊ณ , ์ดํ•ดํ•ด๋ณผ ์ˆ˜ ์žˆ์„๊นŒ. ๊ฐ„๋‹จํ•˜๊ฒŒ wordnet๋ฅผ ์‚ฌ์šฉํ•ด๋ณผ ์ˆ˜ ์žˆ๊ฒ ๋‹ค. ์œ ์˜์–ด ๋“ฑ์„ ์ˆ˜๋งŽ์ด ์ •๋ฆฌํ•ด๋†“์€ ์‚ฌ์ „๊ณผ ์œ ์‚ฌํ•œ ๋ฆฌ์ŠคํŠธ์ด๋‹ค. (nltk์•ˆ์— ํฌํ•จ๋˜์–ด ์žˆ๋‹ค)

  • ๊ทธ๋Ÿผ ์ด wordnet์„ ์‚ฌ์šฉํ•œ๋‹ค๊ณ  ์ข‹์„๊นŒ? ๋ฌธ์ œ์ ์€ ์—†์„๊นŒ?
    • ๋‰˜์•™์Šค๊ฐ€ ์—†์–ด์ง„๋‹ค.
    • ์ƒˆ๋กœ์šด ๋‹จ์–ด๋“ค์ด ์—†๋‹ค.
    • ์ฃผ๊ด€์ ์ด๋‹ค.
    • ์ธ๊ฐ„์˜ ๋…ธ๋™์ด ๋‹ค์†Œ ๋งŽ.....์ด ๋“ค์–ด๊ฐ„๋‹ค.
    • ๋‹จ์–ด์˜ ์œ ์‚ฌ๋„๋ฅผ ์ •ํ™•ํžˆ ํ‘œํ˜„ํ•  ์ˆ˜ ์—†๋‹ค.

So the next step is to represent words as vectors. The traditional NLP way (representing words as discrete symbols) is to make each word a one-hot vector, but this has problems:

  • ๋‹จ์–ด์˜ ์ˆ˜๊ฐ€ ๋„ˆ๋ฌด ๋งŽ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  ๋ชจ๋“  ๋ฒกํ„ฐ๊ฐ€ orthogonalํ•˜๋‹ค. (one-hot ๋ฒกํ„ฐ๋‹ˆ๊นŒ..)
  • ๋ฒกํ„ฐ๋กœ ๋‚˜ํƒ€๋ƒˆ๋Š”๋ฐ ์œ ์‚ฌ๋„๋”ฐ์œ„ ๋ฒ„๋ ธ๋‹ค.

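A quick numpy sketch of the orthogonality problem (the vocabulary is a toy choice): any two distinct one-hot vectors have dot product 0, so even closely related words look completely unrelated.

```python
import numpy as np

# One-hot vectors for a toy 4-word vocabulary: the identity matrix's
# rows are exactly the one-hot vectors.
vocab = ["motel", "hotel", "cat", "dog"]
one_hot = np.eye(len(vocab))
motel, hotel = one_hot[0], one_hot[1]

# Distinct one-hot vectors are orthogonal, so similarity is always 0.
print(motel @ hotel)  # 0.0
```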
๊ทธ๋ž˜์„œ ์œ ์‚ฌ๋„๋ฅผ ๋ฒกํ„ฐ ์ž์ฒด๊ฐ€ ํฌํ•จํ•  ์ˆ˜ ์žˆ๋„๋ก encodingํ•˜์ž! ์ด๋Ÿฌํ•œ ์ƒ๊ฐ์— ๋Œ€ํ•ด ์•„์ฃผ ํฐ ์ธ์‚ฌ์ดํŠธ๋ฅผ J. R. Firth๋ž€ ์‚ฌ๋žŒ์ด ์ฃผ์—ˆ๋Š”๋ฐ, ์ด๋Š”

Distributional semantics: A wordโ€™s meaning is given by the words that frequently appear close-by

That is, similar words tend to occur in similar positions. So we extract a word's meaning from its context: we embed each word using its contexts and represent it as a dense vector. After training, if you visualize the n dimensions (with PCA or the like), similar words turn out to be clustered together.

distributional semantics
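As a minimal sketch of distributional semantics (with a tiny made-up corpus; real systems learn dense embeddings from far larger corpora), even raw co-occurrence counts make words that appear in similar contexts get similar vectors.

```python
import numpy as np

corpus = [
    "i like deep learning",
    "i like nlp",
    "i enjoy flying",
]
tokens = sorted({w for line in corpus for w in line.split()})
index = {w: i for i, w in enumerate(tokens)}

# Count co-occurrences within a window of 1 word on each side.
counts = np.zeros((len(tokens), len(tokens)))
for line in corpus:
    words = line.split()
    for i, w in enumerate(words):
        for j in (i - 1, i + 1):
            if 0 <= j < len(words):
                counts[index[w], index[words[j]]] += 1

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "like" and "enjoy" share a context ("i"), so their count vectors are
# more similar than those of "like" and "flying", which share none.
print(cosine(counts[index["like"]], counts[index["enjoy"]]))
print(cosine(counts[index["like"]], counts[index["flying"]]))
```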

word2vec overview

word2vec์˜ ๋ฉ”์ธ ์•„์ด๋””์–ด๋Š” ์ด๊ฑฐ๋‹ค.

  • ํฐ corpus์˜ ๋ฐ์ดํ„ฐ ์•ˆ์—์„œ ๋ชจ๋“  ๋‹จ์–ด๋ฅผ vector๋กœ ํ‘œํ˜„ํ•˜์ž.
  • word vector์˜ ์œ ์‚ฌ๋„๋ฅผ ์ด์šฉํ•ด ํ•ด๋‹น ๋‹จ์–ด๊ฐ€ ํ•ด๋‹น context์— ์žˆ์„ ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•˜์ž.
  • ๊ณ„์† ํ™•๋ฅ ์„ maximizeํ•˜๊ธฐ ์œ„ํ•ด word vector๋ฅผ ์กฐ์ ˆํ•˜์ž.
๊ฐ๊ฐ์˜ ๋‹จ์–ด์— ๋Œ€ํ•œ ํ™•๋ฅ 

์ž ๊ทธ๋Ÿผ ์‹ค์ œ๋กœ ์ž์„ธํ•˜๊ฒŒ ์‚ดํŽด๋ณด์ž. objective function (cost, error function)์œผ๋กœ๋Š” ์•„๋ž˜ ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.(\(J(\theta)\)) \(T\)๋Š” ์ „์ฒด ๋‹จ์–ด์˜ ๊ฐฏ์ˆ˜์ด๋‹ค. \(t\)๋Š” ๋‹จ์–ด์˜ position์ด๋‹ค. \(m\)์€ window size์ด๋‹ค.

\[L(\theta) = \prod_{t=1}^T \prod_{-m \leq j \leq m,\ j \neq 0} P(w_{t + j} \mid w_t; \theta)\] \[J(\theta) = - \frac 1 T \log L(\theta)\]

\(\theta\) is the variable being optimized, and \(L\) is the likelihood; \(J\) is just the averaged, negated log of \(L\). Minimizing the objective function therefore amounts to maximizing predictive accuracy.

์ž ๊ทธ๋Ÿผ ํ™•๋ฅ ์€ ์–ด๋–ป๊ฒŒ ๊ณ„์‚ฐํ•˜๋ƒ๋ฉด, ์ผ๋‹จ ๋จผ์ € \(\vec u\)์™€ \(\vec v\)๋ฅผ ๋จผ์ € ์ •์˜ํ•œ๋‹ค.

  • \(\vec u_w\)๋Š” ๋‹จ์–ด๊ฐ€ context word์ผ๋•Œ ์“ฐ๋Š” ๋ฒกํ„ฐ์ด๋‹ค.
  • \(\vec v_w\)๋Š” ๋‹จ์–ด๊ฐ€ center word์ผ๋•Œ ์“ฐ๋Š” ๋ฒกํ„ฐ์ด๋‹ค.

Then, when the center word is c and the context word is o, the probability is computed as below.

\[P(o \mid c) = \frac {\exp(u_o^T v_c)} {\sum_{w\in V} \exp(u_w^T v_c)}\]

softmax ์‹๊ณผ ๋น„์Šทํ•˜๋‹ค. ์—ฌ๋‹ด์œผ๋กœ softmax์˜ soft๋Š” ํ™•๋ฅ ์ด๋ผ softํ•˜๊ฒŒ ๋ถ„ํฌ์‹œํ‚จ๋‹ค๋Š” ๋ง์ด๊ณ , softmax์˜ max๋Š” ์ œ์ผ ํ™•๋ฅ ์„ ์ฆํญ์‹œํ‚จ๋‹ค๋Š” ๋ง์ด๋‹ค.

Optimization

์—ฌํŠผ ์ด๋ ‡๊ฒŒ ์‹๋“ค์„ ์ •ํ–ˆ์œผ๋‹ˆ ํ•™์Šต์„ ์œ„ํ•ด์„œ๋Š” optimization์„ ํ•ด์•ผํ•œ๋‹ค. \(\theta\)๋Š” \(2dV\)์˜ ์ฐจ์›์ด ๋˜๊ณ , (V๊ฐœ์˜ ๋‹จ์–ด์— ๋Œ€ํ•ด d์ฐจ์›์˜ ๋ฒกํ„ฐ๋“ค์ด 2๊ฐœ์”ฉ(u, v) ์žˆ๋‹ค) ๊ทธ๋ƒฅ \(\theta\)๋ฅผ ๋ฐ”๊พธ๋ฉด์„œ \(J\)๋ฅผ minimize์‹œํ‚ค๋ฉด ๋œ๋‹ค๊ณ  ํ•œ๋‹ค. ํŽธ๋ฏธ๋ถ„ ํ•˜๋Š” ๊ฑด ๋‚˜์ค‘์— ๋‹ค์‹œ ๋ด๋„ ์•Œ๊ฑฐ๋ผ ์ƒ๊ฐํ•˜๊ณ .. ์ ์–ด๋ณด๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

Since \(J(\theta) = - \frac 1 T \log L(\theta)\) and \(L\) contains a \(\prod\), the log moves inside and the \(\prod\) turns into a \(\sum\). Then
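Concretely, pushing the \(\log\) inside turns the objective into a double sum of log-probabilities:

\[J(\theta) = - \frac 1 T \sum_{t=1}^T \sum_{-m \leq j \leq m,\ j \neq 0} \log P(w_{t + j} \mid w_t; \theta)\]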

\[\frac \partial {\partial v_c} \log P(o \mid c) = u_o - \sum_{x=1}^V P(x \mid c)\, u_x\]

Here \(u_o\) is the actual context word, and the term after it is the expected context word. In other words, the gradient step shrinks the difference between the actual context word and the expected context word.
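Since this gradient is easy to get wrong, here is a small numpy sketch that checks the analytic form against finite differences (V, d, and the observed index o are arbitrary toy choices):

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 8, 3
U = rng.normal(size=(V, d))  # context vectors u_w, one per row
v_c = rng.normal(size=d)     # center word vector
o = 2                        # index of the observed context word

def log_prob(v):
    """log P(o|c) as a function of the center vector."""
    scores = U @ v
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradient: u_o minus the expected context vector.
probs = np.exp(U @ v_c) / np.exp(U @ v_c).sum()
analytic = U[o] - probs @ U

# Central finite differences, one component of v_c at a time.
eps = 1e-6
numeric = np.array([
    (log_prob(v_c + eps * e) - log_prob(v_c - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```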

Anyway, for the actual implementation you use numpy, matplotlib, jupyter, gensim, and sklearn, and just doing it in Colab should be fine.

April 6, 2019
Tags: cs224n