📃 Summarizing the Negative Sampling Paper

cs224n์„ ๊ณต๋ถ€ํ•˜๋ฉด์„œ ๊ด€๋ จ๋œ ๋…ผ๋ฌธ์ด suggested readings์— ์žˆ์–ด์„œ ์ฒœ์ฒœํžˆ ์ฝ์–ด๋ณด๋Š” ์ค‘์ธ๋ฐ ์ด๋ฒˆ์— ์ฝ์€ ๋…ผ๋ฌธ์€ negative sampling์œผ๋กœ ์œ ๋ช…ํ•œ Distributed Representations of Words and Phrases and their Compositionality์ด๋‹ค.

Introduction

์›๋ž˜ ํ†ต๊ณ„ํ•™์„ ํ™œ์šฉํ•œ ๋งŽ์€ ๋ชจ๋ธ๋“ค์ด ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€๋Š”๋ฐ, ๊ทธ๋Ÿฌ๋˜ ์ค‘ neural net์„ ํ™œ์šฉํ•œ ๋ชจ๋ธ์ด ๋‚˜์˜ค๊ณ , ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€์ง€๋งŒ, ์ˆ˜๋งŽ์€ matrix multiplication์ด ์žˆ๋‹ค ํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋” ์ตœ๊ทผ์—๋Š” skip-gram๊ฐ™์€ ๋ชจ๋ธ์ด ๋‚˜์˜ค๋ฉด์„œ matrix multiplication๋„ ์ค„์ด๋ฉด์„œ ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ๋ชจ๋ธ๋„ ๋‚˜์™”๋‹ค.

์ด ๋…ผ๋ฌธ์—์„œ๋Š” ํ•ด๋‹น skip-gram ๋ชจ๋ธ์˜ ์—ฌ๋Ÿฌ ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค€๋‹ค๊ณ  ํ•œ๋‹ค. ์ž์ฃผ ๋‚˜์˜ค๋Š” ๋‹จ์–ด๋ฅผ subsamplingํ•˜๊ฑฐ๋‚˜, NCE(Noise Contrastive Estimation)์„ ์ ์šฉํ•ด๋ณด๊ฑฐ๋‚˜, hierachical softmax๋„ ์ ์šฉํ•ด๋ณธ๋‹ค. ์„ฑ๋Šฅ ํ…Œ์ŠคํŠธ๋Š” ์ง์ ‘ analogical reasoning task๋ฅผ ๋งŒ๋“ค์—ˆ๋‹ค๊ณ  ํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด์„œ ์•„๋ž˜์™€ ๊ฐ™์ด ํ…Œ์ŠคํŠธ๋ฅผ ํ•œ๋‹ค.

"Montreal":"Montreal Canadiens"::"Toronto":"Toronto Maple Leafs"

Skip-Gram ๋ชจ๋ธ

Hierarchical Softmax

Softmax๋Š” ์ปดํ“จํ„ฐ์˜ ์—ฐ์‚ฐ๋Ÿ‰์ด ๋„ˆ๋ฌด ๋งŽ์•„์„œ ๊ทผ์‚ฌ์‹œํ‚จ๊ฒƒ์ด hierarchical softmax์ด๋‹ค. NNLM์—์„œ ์ฒ˜์Œ ์†Œ๊ฐœ๋˜์—ˆ๋‹ค. ์ผ๋‹จ ์ด์ ๋ถ€ํ„ฐ ๋งํ•ด๋ณด์ž๋ฉด, ๋ณต์žก๋„๊ฐ€ ์›๋ž˜ ๊ฐœ์˜ output node๋ฅผ ์—ฐ์‚ฐํ•˜๋Š” ๊ฒƒ ๋Œ€์‹  ๊ฐœ์˜ output node๋งŒ ๊ณ„์‚ฐํ•ด๋„ ๋œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

HS uses a binary tree representation of the output layer. Walking from the root to a leaf, you keep multiplying the probability of each branch you take, which is how the computation is reduced. I didn't dig into it in detail, so refer to other resources ㅠㅠ
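
For reference, the formula from the paper, in the authors' notation: the probability of a word is the product of sigmoid branch decisions along the root-to-leaf path,

$$p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\left( [\![ n(w, j+1) = \mathrm{ch}(n(w, j)) ]\!] \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \right)$$

where $n(w, j)$ is the $j$-th node on the path from the root to $w$, $L(w)$ is the length of that path, $\mathrm{ch}(n)$ is an arbitrary fixed child of $n$, and $[\![ x ]\!]$ is $1$ if $x$ is true and $-1$ otherwise.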

Experiments actually using HS reportedly improve both training time and accuracy. Using a binary Huffman tree assigns shorter codes to frequent words, which makes training even faster.

Negative Sampling

Instead of hierarchical softmax, Noise Contrastive Estimation (NCE) can be used. NCE approximately maximizes the log probability of the softmax, which is somewhat inefficient, so Negative Sampling (NEG) can be used as a simplified version of NCE. Every $\log P(w_O \mid w_I)$ term in the skip-gram objective is replaced with the expression below:

$$\log \sigma({v'_{w_O}}^{\top} v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-{v'_{w_i}}^{\top} v_{w_I}) \right]$$

ํฐ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ๋Š” k๋ฅผ 2 ~ 5์ •๋„๋กœ ์„ ํƒํ•˜๊ณ  ์ž‘์€ ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•ด์„œ๋Š” k๋ฅผ 5 ~ 20์ •๋„๋กœ ์„ ํƒํ•œ๋‹ค๊ณ  ํ•œ๋‹ค.

NCE์™€ NEG์˜ ๋‹ค๋ฅธ ์ ์€ NCE๋Š” sample๊ณผ noise distribution์˜ numberical probabilities๋ฅผ ํ•„์š”๋กœ ํ•˜๊ณ , skip-gram์— ์ค‘์š”ํ•˜์ง€ ์•Š์€ softmax์˜ log probability๋ฅผ maximizeํ•˜์ง€๋งŒ, NEG๋Š” ์ƒ˜ํ”Œ๋งŒ์„ ํ•„์š”๋กœ ํ•œ๋‹ค๋Š” ์ ์ด๋‹ค. ํ›จ์”ฌ ํšจ์œจ์ ์ด๋‹ค.

์‹คํ—˜์— ๋Œ€ํ•œ ์–˜๊ธฐ๋„ ๊ฐ„๋žตํ•˜๊ฒŒ ๋‚˜์™€์žˆ๋Š”๋ฐ, Unigram distribution์˜ 3/4์Šน์„ ์‚ฌ์šฉํ•  ๋•Œ ์ œ์ผ ์„ฑ๋Šฅ์ด ์ž˜ ๋‚˜์™”๋‹ค๊ณ  ํ•œ๋‹ค.

Subsampling of Frequent Words

in, the, a ๊ฐ™์€ ๋‹จ์–ด(frequent word)๋“ค์€ ์ผ๋ฐ˜์ ์ธ ๋‹จ์–ด(rare word)๋“ค๋ณด๋‹ค ํ›จ์”ฌ ์ ์€ ์ •๋ณด๋ฅผ ์ค€๋‹ค. ๊ทธ๋ž˜์„œ ์ด๋“ค์„ ์ผ์ •ํ™•๋ฅ ๋กœ ๋นผ๊ณ  ํ›ˆ๋ จ์„ ์‹œ์ผฐ๋”๋‹ˆ, ์–ด์ฐจํ”ผ frequent word๋“ค์€ ๋ฒกํ„ฐ๊ฐ’์ด ๊ฑฐ์˜ ๋ฐ”๋€Œ์ง€ ์•Š๊ณ , rare word๋Š” ํ›จ์”ฌ ๋น ๋ฅด๊ฒŒ ํ•™์Šต๋˜์—ˆ๋‹ค๊ณ  ํ•œ๋‹ค.

์–ด๋–ค ๋ฐฉ์‹์œผ๋กœ ์ œ์™ธ๋ฅผ ์‹œํ‚ค๋ƒ๋ฉด, ๋‹จ์–ด ์— ๋Œ€ํ•ด์„œ ์•„๋ž˜์™€ ๊ฐ™์€ ํ™•๋ฅ ๋กœ ์ œ์™ธ๋ฅผ ์‹œํ‚จ๋‹ค.

๋Š” ๊ฐ๊ฐ์˜ ๋‹จ์–ด์˜ ๋นˆ๋„์ด๊ณ , t๋Š” threshold์ด๋‹ค. ๋Œ€๋žต ์ •๋„๋ผ๊ณ  ํ•œ๋‹ค.

Empirical Results

They ran experiments with HS, NCE, NEG, and subsampling. The analogy evaluation was split into syntactic questions (like quick:quickly::slow:slowly) and semantic questions.[1]

๋ญ ์‹ค์ œ ๊ฒฐ๊ณผ๋Š” ๊ฑด๋„ˆ๋›ฐ์—ˆ๊ณ  ์—ฌ๊ธฐ์„œ๋Š” ์ฝ”๋“œ ๋งํฌ๋งŒ ๋ดค๋‹ค.

  1. https://code.google.com/archive/p/word2vec/ (see here for the detailed code)

Written on April 7, 2019
Tags: machine learning nlp paper