📕 CS224n Lecture 1 Introduction and Word Vectors

I started a CS224n study group last week! Unlike when I took CS231n, I figured that doing it with other people might actually get me through to the end, so I pulled together some people around me. This post is a write-up of my notes for Lecture 1.

Introduction

A brief overview of what the course will cover:

  • NLP with Deep Learning
  • The big picture of understanding human language (insight..?)
  • Actually implementing things in PyTorch
    • word meaning
    • dependency parsing
    • machine translation
    • question answering, and so on…

They say the course uses PyTorch starting with this offering.

Word Vectors

Human Language and word meaning

Language itself is quite ambiguous. It is a means of conveying information, and also a means of socializing (a way for people to network). Analyzing language with computers is fairly recent compared to fields like computer vision. One thing worth keeping in mind: compared with other means of exchanging information, passing information through language is quite slow.

So what is the "meaning" that we convey through language and words? It is "the something we are trying to express with a word, phrase, etc." The common way to represent meaning is roughly the pairing signifier (symbol) <=> signified (idea or thing).

๊ทผ๋ฐ ๊ทธ๋Ÿผ meaning์„ ์ปดํ“จํ„ฐ๋กœ๋Š” ์–ด๋–ป๊ฒŒ ์‚ฌ์šฉํ•˜๊ณ , ์ดํ•ดํ•ด๋ณผ ์ˆ˜ ์žˆ์„๊นŒ. ๊ฐ„๋‹จํ•˜๊ฒŒ wordnet๋ฅผ ์‚ฌ์šฉํ•ด๋ณผ ์ˆ˜ ์žˆ๊ฒ ๋‹ค. ์œ ์˜์–ด ๋“ฑ์„ ์ˆ˜๋งŽ์ด ์ •๋ฆฌํ•ด๋†“์€ ์‚ฌ์ „๊ณผ ์œ ์‚ฌํ•œ ๋ฆฌ์ŠคํŠธ์ด๋‹ค. (nltk์•ˆ์— ํฌํ•จ๋˜์–ด ์žˆ๋‹ค)

  • ๊ทธ๋Ÿผ ์ด wordnet์„ ์‚ฌ์šฉํ•œ๋‹ค๊ณ  ์ข‹์„๊นŒ? ๋ฌธ์ œ์ ์€ ์—†์„๊นŒ?
    • ๋‰˜์•™์Šค๊ฐ€ ์—†์–ด์ง„๋‹ค.
    • ์ƒˆ๋กœ์šด ๋‹จ์–ด๋“ค์ด ์—†๋‹ค.
    • ์ฃผ๊ด€์ ์ด๋‹ค.
    • ์ธ๊ฐ„์˜ ๋…ธ๋™์ด ๋‹ค์†Œ ๋งŽ.....์ด ๋“ค์–ด๊ฐ„๋‹ค.
    • ๋‹จ์–ด์˜ ์œ ์‚ฌ๋„๋ฅผ ์ •ํ™•ํžˆ ํ‘œํ˜„ํ•  ์ˆ˜ ์—†๋‹ค.

So the next step was to represent words numerically: represent each word as a vector! In traditional NLP (representing words as discrete symbols), each word becomes a one-hot vector, but this has problems:

  • ๋‹จ์–ด์˜ ์ˆ˜๊ฐ€ ๋„ˆ๋ฌด ๋งŽ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  ๋ชจ๋“  ๋ฒกํ„ฐ๊ฐ€ orthogonalํ•˜๋‹ค. (one-hot ๋ฒกํ„ฐ๋‹ˆ๊นŒ..)
  • ๋ฒกํ„ฐ๋กœ ๋‚˜ํƒ€๋ƒˆ๋Š”๋ฐ ์œ ์‚ฌ๋„๋”ฐ์œ„ ๋ฒ„๋ ธ๋‹ค.
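The orthogonality problem is easy to see in a few lines of numpy; the toy vocabulary below is made up purely for illustration:

```python
import numpy as np

# Toy vocabulary (illustrative only).
vocab = ["hotel", "motel", "cat"]
V = len(vocab)

def one_hot(word):
    # A one-hot vector: V-dimensional, with a single 1 at the word's index.
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0
    return vec

hotel, motel = one_hot("hotel"), one_hot("motel")

# Any two distinct one-hot vectors are orthogonal: their dot product is 0,
# so "hotel" and "motel" look no more similar than "hotel" and "cat".
print(hotel @ motel)  # 0.0
```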

๊ทธ๋ž˜์„œ ์œ ์‚ฌ๋„๋ฅผ ๋ฒกํ„ฐ ์ž์ฒด๊ฐ€ ํฌํ•จํ•  ์ˆ˜ ์žˆ๋„๋ก encodingํ•˜์ž! ์ด๋Ÿฌํ•œ ์ƒ๊ฐ์— ๋Œ€ํ•ด ์•„์ฃผ ํฐ ์ธ์‚ฌ์ดํŠธ๋ฅผ J. R. Firth๋ž€ ์‚ฌ๋žŒ์ด ์ฃผ์—ˆ๋Š”๋ฐ, ์ด๋Š”

Distributional semantics: A word's meaning is given by the words that frequently appear close-by

Similar words tend to occur in similar positions, so we extract a word's meaning from its context. That is, we embed each word using its context, representing it as a dense vector. After training, visualizing the n-dimensional vectors (with PCA and the like) shows that similar words are clustered together.
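A quick sketch of why dense vectors help: with hand-made toy embeddings (the numbers below are invented, not learned), cosine similarity can express the graded similarity that one-hot vectors cannot:

```python
import numpy as np

# Hand-made 4-d "embeddings", purely illustrative
# (real word2vec vectors are learned and typically 100-300 dimensional).
emb = {
    "hotel": np.array([0.9, 0.1, 0.8, 0.0]),
    "motel": np.array([0.8, 0.2, 0.7, 0.1]),
    "cat":   np.array([0.0, 0.9, 0.1, 0.8]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Unlike one-hot vectors, dense vectors give graded similarity scores.
print(cosine(emb["hotel"], emb["motel"]))  # large: similar words
print(cosine(emb["hotel"], emb["cat"]))    # much smaller
```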

distributional semantics

word2vec overview

word2vec์˜ ๋ฉ”์ธ ์•„์ด๋””์–ด๋Š” ์ด๊ฑฐ๋‹ค.

  • Represent every word in a large corpus as a vector.
  • Use the similarity of word vectors to compute the probability that a word occurs in a given context.
  • Keep adjusting the word vectors to maximize that probability.
๊ฐ๊ฐ์˜ ๋‹จ์–ด์— ๋Œ€ํ•œ ํ™•๋ฅ 

์ž ๊ทธ๋Ÿผ ์‹ค์ œ๋กœ ์ž์„ธํ•˜๊ฒŒ ์‚ดํŽด๋ณด์ž. objective function (cost, error function)์œผ๋กœ๋Š” ์•„๋ž˜ ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.() ๋Š” ์ „์ฒด ๋‹จ์–ด์˜ ๊ฐฏ์ˆ˜์ด๋‹ค. ๋Š” ๋‹จ์–ด์˜ position์ด๋‹ค. ์€ window size์ด๋‹ค.

$\theta$ is the set of variables to be optimized, and $L(\theta)$ is the likelihood,

$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j} \mid w_t; \theta),$$

so the objective is sometimes written $J(\theta) = -\frac{1}{T} \log L(\theta)$. Minimizing the objective function amounts to maximizing predictive accuracy.
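As a quick sanity check on the objective, here is a tiny made-up example (the probabilities and the value of $T$ are invented for illustration): assigning higher probability to the observed context words gives a lower $J$.

```python
import numpy as np

# Made-up model probabilities for the true context words (illustrative only).
good_probs = np.array([0.4, 0.3, 0.5, 0.35])
bad_probs = np.array([0.01, 0.02, 0.05, 0.01])
T = 2  # pretend the toy corpus had two center-word positions

def J(probs, T):
    # Average negative log likelihood over the corpus.
    return -np.sum(np.log(probs)) / T

# The model that predicts the observed context words with higher
# probability achieves a lower (better) objective value.
print(J(good_probs, T) < J(bad_probs, T))  # True
```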

์ž ๊ทธ๋Ÿผ ํ™•๋ฅ ์€ ์–ด๋–ป๊ฒŒ ๊ณ„์‚ฐํ•˜๋ƒ๋ฉด, ์ผ๋‹จ ๋จผ์ € ์™€ ๋ฅผ ๋จผ์ € ์ •์˜ํ•œ๋‹ค.

  • $u_w$ is the vector used when the word $w$ is a context word.
  • $v_w$ is the vector used when the word $w$ is a center word.

Then, when the center word is $c$ and the context word is $o$, the probability is computed as below.

$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$

This has the same form as the softmax function. As an aside, the "soft" in softmax means it distributes the probability mass softly (smaller scores still get some probability), and the "max" means it amplifies the probability of the largest score.
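A minimal softmax sketch showing both of those properties (the scores are arbitrary toy values):

```python
import numpy as np

def softmax(x):
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([3.0, 1.0, 0.2])
p = softmax(scores)

# "soft": even the smallest score keeps some probability mass.
# "max": p[0] / p[1] = e^(3-1), about 7.39, far more than the raw
# score ratio of 3.0 -- the largest score's probability is amplified.
print(p.sum())      # 1.0 (up to floating point)
print(p[0] / p[1])  # about 7.39
```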

Optimization

์—ฌํŠผ ์ด๋ ‡๊ฒŒ ์‹๋“ค์„ ์ •ํ–ˆ์œผ๋‹ˆ ํ•™์Šต์„ ์œ„ํ•ด์„œ๋Š” optimization์„ ํ•ด์•ผํ•œ๋‹ค. ๋Š” ์˜ ์ฐจ์›์ด ๋˜๊ณ , (V๊ฐœ์˜ ๋‹จ์–ด์— ๋Œ€ํ•ด d์ฐจ์›์˜ ๋ฒกํ„ฐ๋“ค์ด 2๊ฐœ์”ฉ(u, v) ์žˆ๋‹ค) ๊ทธ๋ƒฅ ๋ฅผ ๋ฐ”๊พธ๋ฉด์„œ ๋ฅผ minimize์‹œํ‚ค๋ฉด ๋œ๋‹ค๊ณ  ํ•œ๋‹ค. ํŽธ๋ฏธ๋ถ„ ํ•˜๋Š” ๊ฑด ๋‚˜์ค‘์— ๋‹ค์‹œ ๋ด๋„ ์•Œ๊ฑฐ๋ผ ์ƒ๊ฐํ•˜๊ณ .. ์ ์–ด๋ณด๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

์ด๊ณ , ์ด ๋ฅผ ํฌํ•จํ•˜๋‹ˆ๊นŒ ๋กœ๊ทธ๊ฐ€ ์•ˆ์œผ๋กœ ๋“ค์–ด๊ฐ€๋ฉด์„œ ๊ฐ€ ์œผ๋กœ ๋ฐ”๋€๋‹ค. ๊ทธ ๋•Œ

์ด๋‹ค. ๊ฐ€ ์‹ค์ œ context word์ด๊ณ , ๊ทธ ๋’ค์˜ ํ•ญ์ด expected context word์ด๋‹ค. ์ฆ‰, ์‹ค์ œ context word์™€ expected context word์˜ ์ฐจ์ด๋ฅผ ์ค„์ธ๋‹ค.
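That gradient can be checked numerically with a small numpy sketch (the vocabulary size, dimension, and random vectors below are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 4                  # toy vocabulary size and embedding dimension
U = rng.normal(size=(V, d))  # context ("outside") vectors u_w, one per row
v_c = rng.normal(size=d)     # center word vector
o = 2                        # index of the observed context word

# P(x | c) for every word x, via a softmax over the scores u_x^T v_c.
scores = U @ v_c
p = np.exp(scores - scores.max())
p /= p.sum()

# Analytic gradient of log P(o | c) w.r.t. v_c:
# actual context word minus the expectation over predicted context words.
grad = U[o] - p @ U

# Sanity check against a central-difference numerical gradient.
def log_p_o(vc):
    s = U @ vc
    return s[o] - np.log(np.exp(s).sum())

eps = 1e-6
num = np.array([(log_p_o(v_c + eps * np.eye(d)[i]) -
                 log_p_o(v_c - eps * np.eye(d)[i])) / (2 * eps)
                for i in range(d)])
print(np.allclose(grad, num, atol=1e-5))  # True
```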

Anyway, the actual implementation uses numpy, matplotlib, jupyter, gensim, and sklearn for reference; just running it on Colab should be fine.

Written on April 6, 2019
Tags: cs224n machine learning nlp