📃 Revealing the Dark Secrets of BERT (Review)

A paper that analyzes BERT's attention heads quantitatively and qualitatively using the GLUE tasks and subsets of them. Accepted at EMNLP 2019.

Main contributions:

  • Analysis of BERT's capacity to capture different kinds of linguistic information by encoding it in its self-attention weights.
  • Evidence of BERT's over-parametrization and a suggested simple way of improving its performance.

Methodology

The paper focuses on the following three research questions:

  • What are the common attention patterns, how do they change during fine-tuning, and how does that impact the performance on a given task?
  • What linguistic knowledge is encoded in self-attention weights of the fine-tuned models and what portion of it comes from the pre-trained BERT?
  • How different are the self-attention patterns of different heads, and how important are they for a given task?

The experimental setup is as follows:

  • Uses huggingface/pytorch-pretrained-bert with BERT base uncased (a minimal attention-extraction sketch follows after this list).
  • GLUE tasks used: MRPC, STS-B, SST-2, QQP, RTE, QNLI, MNLI
  • Winograd was excluded because of its dataset size, and CoLA was not used because it is dropped from the newer version of GLUE.
  • Fine-tuning hyperparameters follow the original BERT paper.
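
The experiments are built on extracting per-head attention maps. As a minimal sketch (using the current transformers package rather than the older pytorch-pretrained-bert repo named above, which I assume is equivalent for BERT base uncased; the sentence is an arbitrary example):

```python
# Minimal sketch: extract per-head self-attention maps from BERT base uncased.
# Assumes the current `transformers` API; the paper used the older
# pytorch-pretrained-bert package, but the pre-trained weights are the same.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape [batch, num_heads, seq_len, seq_len].
attentions = torch.stack(outputs.attentions)
print(attentions.shape)  # [12, 1, 12, seq_len, seq_len] for BERT base
```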

Experiments

BERT's self-attention patterns

  • Extracting BERT's self-attention maps reveals the following recurring patterns:
    • Vertical: attention concentrated on tokens like [CLS] and [SEP].
    • Diagonal: attention to the previous/following tokens.
    • Vertical + diagonal: a mix of the two.
    • Block: intra-sentence attention.
    • Heterogeneous: no obvious structure.
  • The share of heterogeneous heads varies between 32% and 61% depending on the task, but they are common overall.
  • The authors therefore argue that heterogeneous attention is where potentially meaningful structural information could be captured (a rough pattern-typing sketch follows below).
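
The paper itself types these patterns with a small classifier trained on annotated attention maps. Purely as an illustration (my own heuristic, not the paper's classifier; the function name and thresholds are made up), vertical and diagonal patterns can be flagged by where the attention mass concentrates:

```python
# Rough heuristic (not the paper's classifier): guess the pattern type of
# one [seq_len, seq_len] attention map from where its probability mass sits.
import torch

def guess_pattern(attn, special_positions, band=2, threshold=0.5):
    """attn: [seq_len, seq_len] map for a single head.
    special_positions: indices of [CLS]/[SEP]; band/threshold are illustrative."""
    seq_len = attn.size(0)
    total = attn.sum()

    # Mass attending to special tokens -> vertical stripes in the map.
    vertical_mass = attn[:, special_positions].sum() / total

    # Mass in a narrow band around the diagonal -> diagonal pattern.
    idx = torch.arange(seq_len)
    band_mask = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= band
    diagonal_mass = attn[band_mask].sum() / total

    if vertical_mass > threshold and diagonal_mass > threshold:
        return "vertical + diagonal"
    if vertical_mass > threshold:
        return "vertical"
    if diagonal_mass > threshold:
        return "diagonal"
    # Separating block from heterogeneous would additionally need the
    # sentence boundary, so both fall through here.
    return "block / heterogeneous"
```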

Relation-specific heads in BERT

Tests whether BERT heads capture frame semantic relations in the sense of FrameNet (Baker et al., 1998); the setup imposes quite a few conditions on the examples (a rough sketch of this kind of check appears after the list below).

  • They find many examples like the one in the paper's figure and interpret this as some evidence for relation-specific heads.
  • A demonstration for more general settings is left to future work.
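
As a rough sketch of this kind of check (my own simplification, not the paper's exact procedure): given two token positions that stand in a frame relation, look for the head whose attention weight between them is largest.

```python
# Illustrative sketch: find the (layer, head) with the largest attention weight
# from a frame-evoking token to a related token. `strongest_head` is a made-up
# helper, not part of the paper's released code.
import torch

def strongest_head(attentions, src, tgt):
    """attentions: tuple of per-layer tensors [1, num_heads, seq, seq],
    as returned by transformers with output_attentions=True."""
    stacked = torch.stack(attentions).squeeze(1)   # [layers, heads, seq, seq]
    weights = stacked[:, :, src, tgt]              # attention from src to tgt
    layer, head = divmod(int(weights.argmax()), weights.size(1))
    return layer, head, float(weights.max())
```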

Change in self-attention patterns after fine-tuning

They extract per-head attention weights before and after fine-tuning and compute the cosine similarity between them.
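
A minimal sketch of that comparison, assuming the attention maps for the same input come from two models (pre-trained and fine-tuned) run with output_attentions=True; per_head_cosine is a made-up helper name:

```python
# Minimal sketch: per-head cosine similarity between pre-trained and fine-tuned
# attention maps computed on the same tokenized input.
import torch
import torch.nn.functional as F

def per_head_cosine(attn_pretrained, attn_finetuned):
    """Each argument: tuple of per-layer tensors [1, num_heads, seq, seq].
    Returns a [num_layers, num_heads] matrix of cosine similarities."""
    pre = torch.stack(attn_pretrained).squeeze(1)  # [layers, heads, seq, seq]
    fin = torch.stack(attn_finetuned).squeeze(1)
    pre = pre.flatten(start_dim=2)                 # [layers, heads, seq*seq]
    fin = fin.flatten(start_dim=2)
    return F.cosine_similarity(pre, fin, dim=-1)   # [layers, heads]
```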

Except for QQP, the last two layers change the most.

Attention to linguistic features

  • [CLS] receives substantial attention only in the earlier layers.
  • From there on, attention to [SEP] dominates (a minimal sketch of this per-layer measurement follows below).
  • SST-2 appears to show unusually large values because its inputs contain only one [SEP] token (single-sentence inputs).
  • Given this trend, the linguistic reasoning seems to come from pre-trained BERT rather than being learned task-specifically.
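
A minimal sketch of that per-layer measurement, assuming the attentions tuple from output_attentions=True; attention_to_token is a made-up helper name:

```python
# Minimal sketch: average attention mass directed at one special token
# ([CLS] or [SEP]) per layer.
import torch

def attention_to_token(attentions, token_position):
    """attentions: tuple of per-layer tensors [1, num_heads, seq, seq].
    Returns a [num_layers] tensor of mean attention onto token_position."""
    stacked = torch.stack(attentions).squeeze(1)   # [layers, heads, seq, seq]
    # Average over heads and over all source positions attending to the token.
    return stacked[:, :, :, token_position].mean(dim=(1, 2))
```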

Token to token attention

Skipped.

Disabling self-attention heads

The related work section already cites papers on masking self-attention heads.

  • As in those works, performance holds up when heads are disabled, and sometimes even improves.
  • Dropping entire layers also works reasonably well (a minimal head-masking sketch follows after this list).
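
A minimal head-masking sketch using the head_mask argument of transformers' BertModel (the layer/head indices and the input sentence are arbitrary; the paper measures the effect on GLUE scores after fine-tuning, which is omitted here):

```python
# Minimal sketch: disable individual self-attention heads at inference time
# via the `head_mask` argument of transformers' BertModel.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")

# head_mask: [num_layers, num_heads]; 1.0 keeps a head, 0.0 switches it off.
head_mask = torch.ones(model.config.num_hidden_layers,
                       model.config.num_attention_heads)
head_mask[10, 3] = 0.0  # example: disable head 3 in layer 10

with torch.no_grad():
    outputs = model(**inputs, head_mask=head_mask)
```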

Discussion

Even the base model turns out to be heavily over-parameterized.


The paper shows BERT's over-parameterization from multiple angles. A few of the analyses left me puzzled, but it is a "here is what we observed" paper rather than a "do it this way!" paper, which made it an enjoyable read. Worth consulting when working on model compression.

July 6, 2020
Tags: paper