📃 Summarizing the GloVe Paper

This paper is one of the suggested readings from lecture 2 of cs224n. It was written at Stanford, and since I heard it gets used a lot as a base, I decided to write up a separate summary! To be honest, this post is just a quick write-up for my own future reference, so the explanations are on the thin side.

Introduction

There are two main families of methods for learning word vectors: global matrix factorization methods and local context window methods. The former includes models like LSA, the latter includes models like skip-gram. Models like LSA make efficient use of global statistical information but do poorly on analogy tasks. Models like skip-gram, by contrast, do well on analogies but have a hard time exploiting the corpus-wide statistics.

So this paper introduces a weighted least squares model trained on global word-word co-occurrence counts, which the authors say makes good use of that statistical information as well. [1]

The GloVe Model

Let \(X\) denote the matrix of word-word co-occurrence counts, whose entry \(X_{ij}\) is the basic unit: how many times word \(j\) occurs in the context of word \(i\). Also let \(X_i = \sum_k X_{ik}\). The probability that word \(j\) appears in the context of word \(i\) is then

\[P_{ij} = P(j|i) = \frac {X_{ij}} {X_i}\]
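As a quick illustration (my own sketch, not from the paper): building such a co-occurrence matrix from a toy corpus might look like the following, assuming unit counts and a symmetric window of size 2 (the paper actually weights a context word \(d\) positions away by \(1/d\)).

    import numpy as np

    # Toy corpus and a symmetric context window of 2; the paper builds X the same
    # way, but from a huge corpus and with 1/d distance weighting instead of unit counts.
    corpus = ["the cat sat on the mat", "the dog sat on the rug"]
    window = 2

    vocab = sorted({w for sent in corpus for w in sent.split()})
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    X = np.zeros((V, V))  # X[i, j]: how often word j occurs in the context of word i
    for sent in corpus:
        words = sent.split()
        for pos, w in enumerate(words):
            for ctx in range(max(0, pos - window), min(len(words), pos + window + 1)):
                if ctx != pos:
                    X[idx[w], idx[words[ctx]]] += 1

    X_sum = X.sum(axis=1)              # X_i = sum_k X_ik
    P = X / X_sum[:, None]             # P_ij = P(j | i) = X_ij / X_i
    print(P[idx["cat"], idx["sat"]])   # P(sat | cat)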

GloVe builds on the ratio of these co-occurrence probabilities: for three words \(i\), \(j\), \(k\), we can write

\[F(w_i, w_j, \tilde {w}_k ) = \frac {P_{ik}} {P_{jk}}\]

Here \(\tilde{w}\) is a context word vector; think of it roughly like word2vec keeping separate \(u\) and \(v\) vectors. Now, since \(F\) should depend on the difference between the two target words, we rewrite the above as

\[F(w_i - w_j, \tilde {w}_k ) = \frac {P_{ik}} {P_{jk}}\]

and since the arguments are vectors while the right-hand side is a scalar, we take a dot product:

\[F((w_i - w_j)^\intercal \tilde {w}_k ) = \frac {P_{ik}} {P_{jk}}\]

๊ทผ๋ฐ ์—ฌ๊ธฐ์„œ ์•Œ์•„์•ผ ํ•  ์ ์ด, word-word co-occurance matrix \(X\)๋ž‘, word \(w\), context word \(\tilde{w}\)๋ž‘ ๊ตฌ๋ถ„์ด ๋ชจํ˜ธํ•˜๋‹ค. ๊ทธ๋ž˜์„œ \(X\)๋ฅผ symmetricํ•˜๊ฒŒ, \(w\)๋Š” \(\tilde{w}\)์™€ ๋ฐ”๊ฟ”์“ธ ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์•ผํ•œ๋‹ค. ๊ทธ๋Ÿฌ๊ธฐ ์œ„ํ•ด์„œ \(F\)๊ฐ€ group \((\mathbb{R}, +)\)์™€ \((\mathbb{R}, \times)\)์— ๋Œ€ํ•ด homomorphismํ•จ์„ ํ•„์š”๋กœ ํ•œ๋‹ค. 2 ๊ทธ๋Ÿฌํ•œ homomorphism์„ ๋ณด์žฅ๋ฐ›์œผ๋ฉด ์ด๋ ‡๊ฒŒ ์ˆ˜์ •์ด ๊ฐ€๋Šฅํ•˜๋‹ค.

\[F((w_i - w_j)^\intercal \tilde {w}_k ) = F(w_i^\intercal \tilde {w}_k - w_j^\intercal \tilde {w}_k) = \frac {F(w_i^\intercal \tilde {w}_k)} {F(w_j^\intercal \tilde {w}_k)}\]
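Concretely, the homomorphism in play is the exponential, which maps differences in \((\mathbb{R}, +)\) to quotients in \((\mathbb{R}_{>0}, \times)\):

\[\exp(a - b) = \frac{\exp(a)}{\exp(b)}, \qquad a = w_i^\intercal \tilde{w}_k, \quad b = w_j^\intercal \tilde{w}_k\]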

With \(F = \exp\), the ratio is satisfied by taking \(F(w_i^\intercal \tilde{w}_k) = P_{ik}\) itself, which (using the definition of \(P_{ik}\) from the top) gives

\[w_i^\intercal \tilde {w}_k = \log P_{ik} = \log {X_{ik}} - \log {X_i}\]

๊ทผ๋ฐ ์ด๊ฒŒ \(\log {X_i}\) ํ•ญ์ด \(k\)$์— ๋…๋ฆฝ์ ์ด๋ผ ์ด๋Ÿฐ์‹์œผ๋กœ bias๋กœ ์ •๋ฆฌ๊ฐ€๋Šฅํ•˜๋‹ค.

\[w_i^\intercal \tilde {w}_k + b_i + \tilde{b}_k = \log {X_{ik}}\]

๊ทผ๋ฐ ์—ฌ๊ธฐ์„œ ๋˜ ๋ฌธ์ œ์ ์ด \(X\)๊ฐ€ 0์ด ๋‚˜์˜ฌ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ..์ธ๋ฐ, ์ด๊ฑธ \(\log (X_{ik}) \rightarrow \log (1 + X_{ik})\)์™€ ๊ฐ™์€ ํ˜•์‹์œผ๋กœ \(X\)์˜ sparsity๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ divergence๋ฅผ ํ”ผํ•œ๋‹ค๊ณ  ํ•œ๋‹ค. ์ž ์—ฌํŠผ ์—ฌ๊ธฐ์„œ cost function์„ ๋ฝ‘์•„๋‚ด๋Š”๋ฐ, ๊ทธ ์‹์ด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[J = \sum_{i,j = 1}^V f(X_{ij}) ( w_i^\intercal \tilde{w}_j +b_i + \tilde{b}_j - \log {X_{ij}} )^2\]

The rough intuition I took away from this: "\(\log X_{ij}\) is the actual (log) co-occurrence, and \(w_i^\intercal \tilde{w}_j + b_i + \tilde{b}_j\) is the co-occurrence predicted from the word vectors, so the cost is the squared difference between the two, weighted by \(f(X_{ij})\) and summed?" That's just my own reading, though, so I can't promise it's exactly right.
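To make this concrete, here is a minimal sketch (mine, not the paper's released code) that just evaluates \(J\) for randomly initialized parameters. The weighting function is the one the paper proposes, \(f(x) = (x / x_{max})^\alpha\) capped at 1, with \(x_{max} = 100\) and \(\alpha = 0.75\); actual training would minimize this with AdaGrad.

    import numpy as np

    def f(x, x_max=100.0, alpha=0.75):
        """GloVe weighting: 0 at x = 0, grows as (x/x_max)^alpha, capped at 1."""
        return np.minimum((x / x_max) ** alpha, 1.0)

    def glove_cost(X, W, W_tilde, b, b_tilde):
        """J = sum over nonzero X_ij of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
        i, j = np.nonzero(X)  # f(0) = 0, so zero entries contribute nothing
        pred = (W[i] * W_tilde[j]).sum(axis=1) + b[i] + b_tilde[j]  # "predicted" log co-occurrence
        return np.sum(f(X[i, j]) * (pred - np.log(X[i, j])) ** 2)

    # Random initialization, just to show the shapes and the call.
    V, d = 7, 5
    rng = np.random.default_rng(0)
    W, W_tilde = 0.1 * rng.normal(size=(V, d)), 0.1 * rng.normal(size=(V, d))
    b, b_tilde = np.zeros(V), np.zeros(V)
    X = rng.integers(0, 5, size=(V, V)).astype(float)  # stand-in for a real co-occurrence matrix
    print(glove_cost(X, W, W_tilde, b, b_tilde))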

Relationship to Other Models

For the skip-gram model, let \(Q_{ij}\) be the probability that word \(j\) appears in the context of word \(i\); it takes the form

\[Q_{ij} = \frac {\exp (w_i^\intercal \tilde w _j)} {\sum_{k=1}^V \exp(w_i^\intercal \tilde w _k)}\]

This is a softmax, and the objective function built from it is

\[J = - \sum_{\substack{i \in \text{corpus} \\ j \in \text{context}(i)}} \log Q_{ij}\]

If the co-occurrence matrix \(X\) is precomputed, this can be regrouped as follows, which is much faster to evaluate (75–90% of \(X\) is zero, after all):

\[J = - \sum_{i = 1}^V \sum_{j=1}^V X_{ij} \log Q_{ij}\]
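A quick numerical check of this regrouping (my own toy setup, with unit counts and random vectors): summing \(\log Q_{ij}\) over every (center, context) position in the corpus gives exactly the same value as weighting each distinct \((i, j)\) pair by \(X_{ij}\), except that the grouped form needs only one softmax per nonzero entry of \(X\).

    import numpy as np

    rng = np.random.default_rng(1)
    V, d = 7, 5
    W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))

    def log_Q(i, j):
        """log of the softmax probability of context word j given center word i."""
        scores = W_tilde @ W[i]                    # w_i . w~_k for every k
        return scores[j] - np.log(np.exp(scores).sum())

    # Toy corpus of word indices, window of 1, unit counts.
    corpus = [0, 1, 2, 3, 1, 4, 2, 0, 5, 6]
    window = 1

    pairs = []                                     # every (center, context) occurrence
    X = np.zeros((V, V))
    for pos, i in enumerate(corpus):
        for ctx in range(max(0, pos - window), min(len(corpus), pos + window + 1)):
            if ctx != pos:
                pairs.append((i, corpus[ctx]))
                X[i, corpus[ctx]] += 1

    J_positions = -sum(log_Q(i, j) for i, j in pairs)                            # sum over positions
    J_grouped = -sum(X[i, j] * log_Q(i, j) for i, j in zip(*np.nonzero(X)))      # grouped by X_ij
    print(np.isclose(J_positions, J_grouped))      # True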

Using the earlier definitions (\(P_{ij} = X_{ij}/X_i\)), this can be rewritten as

\[J = - \sum_{i = 1}^V X_i \sum_{j=1}^V P_{ij} \log Q_{ij} = \sum_{i=1}^V X_{i} H(P_i, Q_i)\]

Here \(H(P_i, Q_i)\) is the cross entropy between \(P_i\) and \(Q_i\). But cross entropy is only one of many ways to measure the distance between distributions, and it reportedly puts too much weight on unlikely events in some cases. So let's switch to a different distance:

\[\hat J = \sum_{i, j}^V X_{i} (\hat P_{ij} - \hat Q_{ij}) ^2\]

where \(\hat P_{ij} = X_{ij}\) and \(\hat Q_{ij} = \exp(w_i^\intercal \tilde{w}_j)\) are now unnormalized distributions. This still has a problem, though: \(X_{ij}\) often takes very large values. So instead we minimize the squared error of the logarithms of \(\hat P\) and \(\hat Q\):

\[\begin{aligned} \hat J &= \sum_{i, j}^V X_{i} (\log \hat P_{ij} - \log \hat Q_{ij}) ^2 \\ &= \sum_{i, j}^V X_{i} (w_i^\intercal \tilde {w}_j - \log X_{ij}) ^2 \end{aligned}\]

Finally, since the fixed weight \(X_i\) isn't guaranteed to be the best choice, it is generalized to a weighting function \(f(X_{ij})\):

\[\hat J = \sum_{i, j}^V f(X_{ij}) (w_i^\intercal \tilde {w}_j - \log X_{ij}) ^2\]

And in the end this is the same form as the GloVe cost function from before.


I haven't gotten to the rest yet; it looked too hard ㅠㅠ

  1. http://nlp.stanford.edu/projects/glove/ — the source code is available here.

  2. https://en.wikipedia.org/wiki/Group_homomorphism — briefly and clearly explained here.

April 8, 2019
Tags: paper