📕 CS224n Lecture 8: Machine Translation, Seq2seq, Attention

Notes from the eighth lecture of CS224n! We look at machine translation, then at seq2seq and attention.

Machine Translation

Pre-neural translation

Machine translation is the task of rendering text in a source language into a target language. Up to the 1950s it was mostly rule-based (largely dictionary-driven mappings). From the 1990s to the 2010s, statistical machine translation was used: a probability model learned from data, abbreviated SMT.

The objective is argmax_y P(x|y) P(y), where P(x|y) is the translation model and P(y) is the LM. Models like this need a truly enormous amount of data..
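As a rough sketch of this noisy-channel setup (toy, hand-assigned log-probabilities; `best_translation`, `tm`, and `lm` are all hypothetical names, not anything from the lecture):

```python
# Noisy-channel decoding sketch: pick y maximizing log P(x|y) + log P(y).
# All scores below are made-up toy log-probabilities.
def best_translation(candidates, tm_logprob, lm_logprob):
    return max(candidates, key=lambda y: tm_logprob[y] + lm_logprob[y])

tm = {"he broke into the house": -1.2,   # log P(x|y), translation model
      "he broke in the house": -0.9}
lm = {"he broke into the house": -2.0,   # log P(y), language model
      "he broke in the house": -4.5}

print(best_translation(tm.keys(), tm, lm))  # -> "he broke into the house"
```

The LM term is what lets a slightly worse translation-model score win when the output is more fluent.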

alignment

SMT์—์„œ๋Š” alignment๋ฅผ ํ•™์Šตํ•ด์•ผํ•œ๋‹ค. ๋กœ ๋‚˜ํƒ€๋‚ด๊ณ , word๋ฅผ ๋งคํ•‘ํ•˜๊ณ  ๋‚˜์„œ ๊ฐ๊ฐ์˜ ์–ธ์–ด์— ๋งž๋Š” ์–ด์ˆœ์œผ๋กœ ๋ฐฐ์—ดํ•˜๊ธฐ ์œ„ํ•ด alignment๋ฅผ ๋”ฐ๋กœ ํ•™์Šตํ•œ๋‹ค.

alignment

๊ทผ๋ฐ ์–ด๋–ค ๋‹จ์–ด๋“ค์€ counterpart๋„ ์—†๊ณ , align์„ ํ•˜๋Š” ๊ฒƒ์ด โ€œone to manyโ€, โ€œmany to manyโ€, โ€œmany to oneโ€ ๋“ฑ๋“ฑ ์‹ค์ œ๋กœ ๋งคํ•‘๋˜์ง€ ์•Š๋Š” ๊ฒฝ์šฐ๊นŒ์ง€ ๋„ˆ๋ฌด ๋งŽ์•„์„œ ์‰ฝ์ง€ ์•Š๋‹ค. ํ™•๋ฅ ์ ์ธ ๋ชจ๋ธ์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ ์ž์ฒด๊ฐ€ ๋ชจ๋“  ๋‹จ์–ด๋“ค์„ ๋Œ์•„์•ผ ํ•˜๋Š” ๊ฒƒ์ธ๋ฐ, ๋„ˆ๋ฌด ๊ณ„์‚ฐ ๋น„์šฉ์ด ํฌ๋‹ค.

NMT

์ž ๊ทธ๋ž˜์„œ NMT(neural machine translation)์„ ํ•œ๋‹ค.

NMT

This is solved with seq2seq. As an aside, problems tackled with seq2seq include summarization, dialogue, parsing, and code generation. (It is a kind of conditional LM.)

As above, generation continues until <END> appears. I wondered what happens if it never shows up, but apparently some length limit is imposed.

decoding์„ ์œ„์ฒ˜๋Ÿผ ํ•˜๋Š” ๋ฐฉ์‹์ด greedy decoding์ธ๋ฐ, ์ด๊ฒŒ ๋ฌธ์ œ์ ์ด ์žˆ๋‹ค. ์•ž์˜ ๊ฒƒ๋งŒ ๋ณด๊ณ  ์˜ˆ์ธก์„ ํ•˜๋‹ˆ ๊ทธ๋ ‡๊ฒŒ ๋œ๋‹ค.

So beam search decoding is used instead, which works as follows.

Beam Search Decoding
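The same idea sketched in code, under toy assumptions (a hypothetical `step` function returning a next-token distribution; hypothesis scores are summed log-probabilities, length-normalized at the end):

```python
import math

# Beam search sketch: keep the k highest-scoring partial hypotheses
# at each step instead of a single greedy one.
def beam_search(step, start, end, k=2, max_len=20):
    beams = [([start], 0.0)]                    # (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for toks, score in beams:
            for tok, p in step(toks[-1]).items():
                cand = (toks + [tok], score + math.log(p))
                (finished if tok == end else candidates).append(cand)
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    finished += beams
    # normalize by length so longer hypotheses are not unfairly penalized
    best = max(finished, key=lambda c: c[1] / (len(c[0]) - 1))
    return best[0][1:]
```

With k=1 this degenerates to greedy decoding; larger k trades compute for a wider search.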

Pros and cons

Compared to SMT, NMT has many advantages. It can produce an LM that reads far more fluently to humans, and its ability to use context also looks superior, as does its ability to exploit phrase similarities. It also uses no subcomponents: SMT by its nature requires many separately engineered subcomponents, whereas NMT just optimizes a single neural network end-to-end.

But compared to SMT it is much harder to debug, and harder to control. There is also no ideal evaluation method. BLEU is used instead; it is not perfect either, but it is used because there is no better alternative for producing a number.

BLEU

Short for Bilingual Evaluation Understudy. It first scores a machine translation against a human translation by n-gram similarity, then applies a penalty to translations that are too short. That is how the similarity between the human translation and the machine translation is judged.
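A stripped-down, sentence-level sketch of that scoring (real BLEU is computed at the corpus level, usually with smoothing and multiple references; this toy version is only to show the two ingredients):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# BLEU-like score: geometric mean of modified n-gram precisions (n=1..4)
# times a brevity penalty for candidates shorter than the reference.
def bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * geo_mean
```

A perfect match scores 1.0, and a too-short candidate is pulled down by the brevity penalty.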

์ž ๊ทธ๋ž˜์„œ ์ž˜ ํ’€๋ ธ๋Š”๊ฐ€

์œ„์—์„œ ๋งํ•œ ๊ฒƒ์œผ๋กœ ์ƒ๋‹นํžˆ ๋งŽ์€ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ–ˆ์„ ๊ฒƒ ๊ฐ™์ง€๋งŒ,

  • Words not in the vocabulary are hard to produce.
  • For very long sentences, it is hard to make use of the full context.
  • Common sense is hard to use.
    • For example, "paper jam" (paper stuck in a printer) gets translated as jam literally made out of paper.
  • It picks up biases from the training data. The lecture showed gender-neutral words acquiring bias from the training data. (nurse -> she, programmer -> he??)
  • Given an uninterpretable sentence, it emits an arbitrary one.

๊ทธ๋ž˜์„œ ์œ„์˜ ๋ฌธ์ œ๋“ค์ด ์žˆ์–ด ๋” ๋‚˜์€ NMT๋ฅผ ๋งŒ๋“ค๊ณ ์ž ๊ณ ์•ˆํ•ด ๋‚ธ ๊ธฐ๋ฒ•์ด attention์ด๋‹ค.

Attention

seq2seq์˜ ๋ฌธ์ œ์ ์€ encoder์—์„œ decoder๋กœ ๋„˜์–ด๊ฐˆ ๋•Œ, ํ•˜๋‚˜์˜ hidden state๋งŒ์„ ๊ฐ€์ง€๊ธฐ ๋•Œ๋ฌธ์— information bottleneck์ด ๋  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์ด๋‹ค. ๊ทธ๋ž˜์„œ ์ด ์ ์„ decoder์˜ ๊ฐ step์„ encoder๋กœ ์ง์ ‘ ์—ฐ๊ฒฐํ•˜์ž๋Š” ์ ์ด๋‹ค.1

Attention
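A bare-bones sketch of dot-product attention over toy encoder states (no learned parameters here; real NMT attention typically scores the decoder state against the encoder states through learned projections):

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Dot-product attention: score each encoder state against the decoder state,
# softmax the scores, and return the weighted sum (the context vector).
def attention(dec_state, enc_states):
    scores = [sum(d * e for d, e in zip(dec_state, h)) for h in enc_states]
    weights = softmax(scores)
    context = [sum(w * h[j] for w, h in zip(weights, enc_states))
               for j in range(len(dec_state))]
    return context, weights
```

The `weights` are what you visualize to get the soft-alignment plots mentioned below.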

So, the results?

Most importantly, NMT performance improved significantly. The bottleneck problem was also addressed, and since there are now direct connections, the vanishing gradient problem was greatly alleviated too. Even the hard-to-debug criticism of NMT is partly answered: visualizing the attention weights yields something like an alignment.

  1. https://arxiv.org/pdf/1706.03762.pdf — "Attention Is All You Need", the paper that introduced the Transformer, an architecture built entirely on attention. (Attention for seq2seq NMT itself was proposed earlier, by Bahdanau et al. 2015.) Searching around, it shows up in CS224n's later suggested readings; I'd like to read it ahead of time..

Written on May 26, 2019
Tags: cs224n machine learning nlp