📃 Notes on the Word2Vec Paper

CS224n์„ ๋“ฃ๊ธฐ ์‹œ์ž‘ํ•˜๊ณ  ๋‚˜์„œ ๊ฐ™์ด ๋‚˜์˜ค๋Š” suggested readings๋ฅผ ๊ฐ€๋” ์ฝ๋Š”๋ฐ, word2vec์˜ ๋…ผ๋ฌธ์œผ๋กœ ์œ ๋ช…ํ•œ Efficient Estimation of Word Representation in Vector Space๋„ ๊ทธ ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ๊ณต๋ถ€๋ฅผ ํ•˜๋ฉด์„œ ๋‹ค์‹œ ๋ณผ๋งŒํ•œ ๋‚ด์šฉ, ์ƒ๊ฐ๋‚ฌ๋˜ ๋‚ด์šฉ๋“ค์„ ์ ์–ด๋‘๊ธฐ ์œ„ํ•ด ์ด ํฌ์ŠคํŠธ๋ฅผ ์ž‘์„ฑํ–ˆ๋‹ค.

word2vec

word2vec is one of the word embedding methods; the paper briefly describes its proposed models as being "for computing continuous vector representations of words from very large data sets".

์ฒ˜์Œ์œผ๋กœ NLP ์ชฝ์œผ๋กœ ๊ณต๋ถ€๋ฅผ ํ•˜๋Š” ๊ฑฐ๋ผ ์–ด๋ ต๊ธฐ๋„ ํ•˜๊ณ  ์žฌ๋ฐŒ๊ธฐ๋„ํ•œ ๊ฐœ๋…์ด ๋งŽ์ด ๋‚˜์™”๋‹ค. ๊ทธ๋ž˜๋„ CS231n์˜ ๊ฐ•์˜ ๋ช‡๊ฐœ๋ฅผ ์ฐพ์•„๋ณด๊ณ  ์ •๋ฆฌํ–ˆ์—ˆ๋Š”๋ฐ, ๊ทธ ์ดํ›„์— ๋ณด๋‹ˆ๊นŒ ๋‚˜๋ฆ„ ๊ทธ๋Ÿญ์ €๋Ÿญ ์ดํ•ดํ• ๋งŒํ•œ ๋…ผ๋ฌธ์ด์—ˆ๋˜ ๊ฒƒ ๊ฐ™๋‹ค. CS224n ๊ต์ˆ˜๋‹˜์ด ์„ค๋ช… ์ž˜ ํ•˜์‹  ๊ฒƒ๋„ ์žˆ๊ฒ ์ง€๋งŒ..?

Anyway, I'm only writing down the parts worth revisiting, and skipping things like the conclusion.

Introduction

The paper came out in 2013, and at the time many NLP methods treated each word as an atomic unit. Such methods have a hard time capturing similarity between words, and their performance depends heavily on the amount of data. They do have advantages, though: they are simple, robust, and performed reasonably well. N-gram models are one example.[1] However, approaches that rely on machine learning rather than purely statistical techniques, using distributed representations of words, emerged as a promising direction.[2]

So the paper brings up embedding words as vectors of around 50–100 dimensions, and points out a surprising property of word vectors. The word addition and subtraction example that comes up all the time when explaining word2vec is introduced here:

vector(โ€Kingโ€) - vector(โ€Manโ€) + vec- tor(โ€Womanโ€) results in a vector that is closest to the vector representation of the word Queen

๋‹จ์–ด๋ฅผ ์—ฐ์†์ ์ธ ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„ํ•˜๋Š” ๊ฒƒ์ด ์ด ๋…ผ๋ฌธ์ด ์ฒ˜์Œ์€ ์•„๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ํ•ด๋‹น ๋ฐฉ์‹์„ ์ œ์•ˆํ–ˆ๋˜ ๋งŽ์€ ๋…ผ๋ฌธ๋“ค ์ค‘ NNLM์— ๊ด€ํ•œ ๋…ผ๋ฌธ2์—์„œ ์ œ์•ˆํ•œ ๋ฐฉ์‹์ด ๋งŽ์€ ์ฃผ๋ชฉ์„ ๋ฐ›์•˜๋‹ค. feedforward neural network ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•œ๋‹ค๊ณ  ํ•œ๋‹ค.

Model Architecture

A neural network is used to implement the distributed representation. Before getting to the architectures, the paper says it will discuss computational complexity first, and then, while introducing each model, it explains how accuracy was raised while keeping complexity down. All of the models that follow have a training complexity of the following form.

O = E × T × Q, where E is the number of training epochs, T is the number of words in the training set, and Q is defined separately for each model architecture. E is usually chosen in the range 3–50, and T goes up to one billion words. Training uses SGD and backprop.

NNLM

NNLM์€ input, projection, hidden, output layer๋กœ ๊ตฌ์„ฑ์ด ๋˜์–ด ์žˆ๋‹ค. input layer์—์„œ ๊ฐ€ ๋‹จ์–ด์˜ ์ˆ˜๋ผ๊ณ  ํ•˜๋ฉด, 1-of- coding(one hot)์„ ์‚ฌ์šฉํ•ด์„œ ๋‹จ์–ด๋ฅผ ์ธ์ฝ”๋”ฉํ•˜๊ณ , ๊ณต์œ ๋˜๋Š” projection์šฉ ํ–‰๋ ฌ์„ ์‚ฌ์šฉํ•ด์„œ ์ฐจ์›์˜ projection layer๋กœ projectํ•œ๋‹ค๊ณ  ํ•œ๋‹ค. ์ฐจ์›์ด๋‹ˆ๊นŒ ํ•œ๋ฒˆ์— ๊ฐœ์˜ ๋‹จ์–ด๋ฅผ ์‚ฌ์šฉ๊ฐ€๋Šฅํ•˜๋‹ค. ๋ณดํ†ต ์€ 10์ •๋„๋กœ ์“ฐ๊ณ , projection layer๋Š” 500 ~ 2000 ์ฐจ์›์ฏค ์“ด๋‹ค๊ณ  ํ•œ๋‹ค. hiddden layer๋Š” 500 ~ 1000์ฐจ์›์ฏค, output layer์— ๋Œ€ํ•ด์„œ๋Š” ๋ผ๊ณ ๋งŒ ๋‚˜์™€์žˆ๋‹ค. ์ž ์ด๋ ‡๊ฒŒ ๋˜๋‹ˆ๊นŒ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

Since the H × V term is a very expensive computation, hierarchical softmax is used (I should write that up some other time), or the model is simply not normalized during training. Either way, with such tricks the output term shrinks to about H × log2(V), so N × D × H becomes the most expensive part of the complexity.

word2vec์—์„œ๋Š” huffman binary tree๋ฅผ ์‚ฌ์šฉํ•˜๋Š”๋ฐ, ์ด๋Š” ๋ฅผ ์ •๋„๋กœ ์ค„์—ฌ์ค€๋‹ค๊ณ  ํ•œ๋‹ค. ์ด๋ ‡๊ฒŒ ์ ์–ด๋‘๊ณ .. ์‚ฌ์‹ค์€ ๋Š” bottleneck์ด ์•„๋‹ˆ๋‹ˆ ๊ทธ๋ ‡๊ฒŒ๊นŒ์ง€ ์ค‘์š”ํ•˜์ง„ ์•Š๋‹ค๊ณ  ํ•œ๋‹ค.

RNNLM

RNN์˜ ํ™œ์šฉ์ด Feedforward NNLM์˜ ํ•œ๊ณ„(context length๋ฅผ ๋ช…์‹œํ•œ๋‹ค๋˜๊ฐ€)๋ฅผ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด ์ œ์•ˆ๋˜์—ˆ๋‹ค. ๊ทธ ๊ฒฐ๊ณผ ์ด๋ก ์ ์œผ๋กœ ์ผ๋ฐ˜ NN๋ณด๋‹ค ๋ณต์žกํ•œ ํŒจํ„ด์— ๋Œ€ํ•ด ํ›จ์”ฌ ํšจ์œจ์ ์ธ ํ‘œํ˜„์ด ๊ฐ€๋Šฅํ–ˆ๋‹ค.

New Log-linear Models

์ด์ „ Model Architecture์—์„œ ์†Œ๊ฐœํ–ˆ๋˜ ๊ฒƒ๋“ค์€ neural net์ด ๋งค๋ ฅ์ ์ž„์„ ์•Œ๊ฒŒ ํ•ด์ฃผ์—ˆ์ง€๋งŒ, ๋Œ€๋ถ€๋ถ„์˜ ๋ณต์žก์„ฑ์€ non-linear hiddne layer์—์„œ ์˜ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์—ˆ๋‹ค. () ๊ทธ๋ž˜์„œ ์ƒˆ๋กœ์šด ๋ชจ๋ธ์—์„œ๋Š” NNLM์„ ๋‘ ๋‹จ๊ณ„๋กœ ๋‚˜๋ˆ„์–ด์„œ ํ•™์Šต์„ ํ•œ๋‹ค๊ณ  ํ•œ๋‹ค. ์šฐ์„  continous word vector๋ฅผ ๊ฐ„๋‹จํ•œ ๋ชจ๋ธ๋กœ ํ•™์Šต์‹œํ‚จ ๋‹ค์Œ์— N-gram NNLM์„ ๊ทธ ์œ„์—์„œ ํ•™์Šต์‹œํ‚จ๋‹ค.

Continuous Bag-of-Words Model

๊ทธ๋ž˜์„œ ์ด ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆํ•œ ์•„ํ‚คํ…์ณ ์ค‘ ํ•˜๋‚˜๊ฐ€ feedforward NNLM๊ณผ ๋น„์Šทํ•˜์ง€๋งŒ, non-linear hidden layer๋ฅผ ์—†์• ๊ณ , projection matrix ๋ฟ๋งŒ ์•„๋‹Œ layer๊นŒ์ง€ ๋ชจ๋“  ๋‹จ์–ด๋“ค์ด ๊ณต์œ ํ•˜๊ฒŒ ํ•œ ๋ชจ๋ธ์ด๋‹ค. ๊ทธ๋ž˜์„œ ๋ชจ๋“  ๋‹จ์–ด๊ฐ€ ๋˜‘๊ฐ™์ด project๋œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด๋ฅผ ์„ค๋ช…ํ•˜๋ฉด์„œ ์•„๋ž˜์ฒ˜๋Ÿผ ์„ค๋ช…์„ ํ•˜๋Š”๋ฐ,

Furthermore, we also use words from the future

This seems to mean that, unlike an N-gram model, it looks not only at the words before the current word but also at the words after it. The complexity then becomes Q = N × D + D × log2(V).

์•ž ๋’ค๋กœ 4๊ฐœ์˜ ๋‹จ์–ด๋ฅผ ๊ฐ€์ ธ์˜ค๋„๋ก window size๋ฅผ ๊ฒฐ์ •ํ–ˆ์„ ๋•Œ ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ์ด์—ˆ๋‹ค๊ณ  ํ•œ๋‹ค.

Continuous Skip-gram Model

๋‘๋ฒˆ์งธ ์•„ํ‚คํ…์ณ๋Š” CBOW์™€ ์œ ์‚ฌํ•œ๋ฐ, ํ•œ ๋‹จ์–ด๋กœ ๊ฐ™์€ ๋ฌธ์žฅ์•ˆ์˜ ์ฃผ์œ„์˜ ๋‹จ์–ด๋ฅผ classification์„ maximizeํ•˜๋Š” ๋ชจ๋ธ์ด๋‹ค. training complexity๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค. ๊ฐ€ maximum distance of the words๋ผ๊ณ  ํ•˜๋Š”๋ฐ, window size์™€ ๋น„์Šทํ•œ ์˜๋ฏธ๋กœ ๋ฐ›์•„๋“ค์ด๋ฉด ๋  ๊ฒƒ ๊ฐ™๋‹ค.

๋ญ ์ด๋ ‡๊ฒŒ ๋งํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค ์‰ฝ๊ฒŒ ๋งํ•˜๋ฉด CBOW๋Š” ํ˜„์žฌ ๋‹จ์–ด๋ฅผ ์ฃผ์œ„ ๋‹จ์–ด(context)๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์˜ˆ์ธกํ•˜๊ณ , Skip-Gram์€ ํ˜„์žฌ ๋‹จ์–ด๋กœ ์ฃผ์œ„ ๋‹จ์–ด(context)๋ฅผ ์˜ˆ์ธกํ•œ๋‹ค.

[Figure: the CBOW and Skip-gram model architectures]

  1. T. Brants et al., 2007 — an N-gram paper cited as an example of the statistical approach.

  2. A Neural Probabilistic Language Model — the NNLM paper referenced by word2vec; one of the references attached where distributed representations are explained.

Written on April 6, 2019
Tags: machine learning nlp paper