Overview

MMLU (Massive Multitask Language Understanding) is a benchmark for large language models published in September 2020 by Dan Hendrycks and colleagues[1]. It was designed to measure the multitask accuracy of language models. Models are evaluated only in zero-shot and few-shot settings, which resembles how human knowledge and problem-solving ability are assessed.

Benchmark composition

Question structure

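  • Total questions: 15,908 multiple-choice questions
  • Few-shot development set (Dev): 5 questions per subject (used as in-context examples)
  • Validation set (Validation): 1,540 questions (for tuning hyperparameters and selecting settings)
  • Test set (Test): 14,079 questions, as reported in the original paper (for the actual performance evaluation)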

Subject coverage

The benchmark covers 57 subjects, grouped into four broad categories:

  1. Humanities
    • History, philosophy, law, etc.
  2. Social Sciences
    • Economics, sociology, psychology, etc.
  3. STEM
    • Mathematics, physics, computer science, biology, etc.
  4. Other
    • Nutrition, religious studies, etc.

Across subjects, question difficulty ranges from high-school level to expert level.

Example questions

MMLU consists of four-option multiple-choice questions drawn from a range of academic fields. The following are examples of actual questions[2]:

Abstract Algebra

Find all c in Z_3 such that Z_3[x]/(x² + c) is a field.

(A) 0
(B) 1 ✓
(C) 2
(D) 0 and 1
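The marked answer can be checked by brute force. The sketch below is not part of the benchmark; it relies on two standard facts: the quotient of Z_3[x] by a polynomial is a field exactly when the polynomial is irreducible, and a degree-2 polynomial over a field is irreducible exactly when it has no root.

```python
# Brute-force check of the abstract algebra example.
# x^2 + c is irreducible over Z_3 exactly when it has no root in Z_3.
P = 3  # modulus of the base field Z_3


def has_root(c: int) -> bool:
    """Return True if x^2 + c has a root in Z_3."""
    return any((x * x + c) % P == 0 for x in range(P))


# Values of c for which Z_3[x]/(x^2 + c) is a field.
field_values = [c for c in range(P) if not has_root(c)]
print(field_values)  # [1]
```

Only c = 1 survives, which matches choice (B): for c = 0 the polynomial has the root 0, and for c = 2 it has the roots 1 and 2.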

International Law

Would a reservation to the definition of torture in the ICCPR be acceptable in contemporary practice?

(A) This is an acceptable reservation if the reserving country’s legislation employs a different definition
(B) This is an unacceptable reservation because it contravenes the object and purpose of the ICCPR ✓
(C) This is an unacceptable reservation because the definition of torture in the ICCPR is consistent with customary international law
(D) This is an acceptable reservation because under general international law States have the right to enter reservations to treaties

Professional Medicine

A 33-year-old man undergoes a radical thyroidectomy for thyroid cancer… Which vessel damage caused the findings?

(A) Branch of the external carotid artery
(B) Branch of the internal carotid artery
(C) Branch of the thyrocervical trunk ✓
(D) Branch of the vertebral artery

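Evaluation method

Basic settings

  • Zero-shot: the model is evaluated using only the knowledge it acquired during pretraining, with no solved examples in the prompt
  • Few-shot: the model is evaluated after being shown a small number of solved examples
  • Baseline: 25% (random guessing on a four-option question)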
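The difference between the two settings comes down to how the prompt is assembled. The following is a minimal sketch, assuming items are dictionaries with `question`, `choices`, and `answer` fields; the header wording, the `build_prompt` and `format_item` helpers, and the toy `shot` item are illustrative assumptions, not the official evaluation harness.

```python
from typing import Dict, List

LETTERS = "ABCD"


def format_item(item: Dict, include_answer: bool) -> str:
    """Render one question; solved examples also show the answer letter."""
    lines = [item["question"]]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(item["choices"])]
    answer = f" {LETTERS[item['answer']]}" if include_answer else ""
    lines.append(f"Answer:{answer}")
    return "\n".join(lines)


def build_prompt(subject: str, shots: List[Dict], target: Dict) -> str:
    """Zero-shot when `shots` is empty, few-shot otherwise."""
    # Header wording is an assumption made for this sketch.
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [format_item(s, include_answer=True) for s in shots]
    blocks.append(format_item(target, include_answer=False))
    return header + "\n\n" + "\n\n".join(blocks)


# Hypothetical solved example (not taken from the benchmark).
shot = {"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"], "answer": 1}
# Target question taken from the abstract algebra example in this article.
target = {
    "question": "Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.",
    "choices": ["0", "1", "2", "0 and 1"],
    "answer": 1,
}

print(build_prompt("abstract algebra", [], target))      # zero-shot prompt
print(build_prompt("abstract algebra", [shot], target))  # one-shot prompt
```

In both cases the prompt ends with an open `Answer:` line, and the model's next token is compared against the answer letter.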

Evaluation metric

Model accuracy is reported as a percentage, with per-subject scores given alongside the overall average.
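As an illustration of the metric, the sketch below computes per-subject accuracy and two possible overall averages from a list of (subject, gold index, predicted index) records. The `score` function and the toy records are assumptions for this example. Which overall average a given report uses (over all questions, or over subjects) is not stated here, so both are shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def score(records: Iterable[Tuple[str, int, int]]) -> Tuple[Dict[str, float], float, float]:
    """Return (per-subject accuracy, micro average, macro average) in percent.

    Each record is (subject, gold_index, predicted_index).
    Assumes at least one record is given.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, gold, pred in records:
        total[subject] += 1
        correct[subject] += int(gold == pred)
    per_subject = {s: 100.0 * correct[s] / total[s] for s in total}
    # Micro average: every question counts equally.
    micro = 100.0 * sum(correct.values()) / sum(total.values())
    # Macro average: every subject counts equally.
    macro = sum(per_subject.values()) / len(per_subject)
    return per_subject, micro, macro


# Toy predictions, made up for illustration only.
records = [
    ("abstract_algebra", 1, 1),
    ("abstract_algebra", 0, 2),
    ("international_law", 1, 1),
    ("international_law", 1, 1),
    ("international_law", 3, 3),
    ("international_law", 2, 0),
]
per_subject, micro, macro = score(records)
print(per_subject, round(micro, 2), round(macro, 2))
```

With unequal subject sizes the two averages differ, so a reported overall score is only comparable across models when the same averaging is used.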

Performance trends

Early performance (2020-2021)

When the paper was published, most language models scored close to random chance (25%), and even the strongest results remained far below expert level:

  • GPT-3 175B (few-shot): 43.9%
  • GPT-3 175B (fine-tuned): 53.9%
  • In the few-shot setting, even the largest GPT-3 model improved on random guessing by only about 20 percentage points

The authors estimated the accuracy of human experts at about 89.8%.

Recent performance (2024-2025)

Since 2024, the latest large language models have approached or exceeded the estimated human expert level (89.8%)[3]:

Model               MMLU score   Release
GPT-5               91.4%        2025
GPT-4.1             90.2%        2025
Claude Opus 4       88.8%        2025
Claude 3.5 Sonnet   88.7%        2024
GPT-4o              88.7%        2024
Llama 3.1 405B      88.6%        2024

Key results:

  • GPT-5 has the highest listed score, 91.4%, above the estimated human expert level
  • GPT-4.1 also exceeds the human expert estimate, at 90.2%
  • The major 2024-2025 models cluster in the 88-91% range, indicating close competition at the top

Impact and use

Industry impact

  • Downloaded more than 100 million times as of July 2024[4]
  • Established as a standard benchmark for developing and evaluating language models
  • Provides a common reference point for comparing performance across models

Follow-up research

The success of MMLU has led to several derived benchmarks:

  • MMLU-Pro (2024): designed for harder and more robust evaluation[5]
  • Various other multitask evaluation benchmarks

Limitations

Data quality issues

A 2024 analysis found the following problems in MMLU[6]:

  • Of 5,700 questions analyzed, about 6.5% had an error in the ground-truth answer
  • Quality issues included incorrect answer keys and ambiguously worded questions

Data contamination issues

  • Data contamination: benchmark questions may be included in a model's training data
  • A model can then obtain a high score through memorization rather than actual understanding
  • This affects the reliability of the benchmark
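One simple heuristic for spotting possible contamination is to look for long word n-grams that a benchmark question shares with training documents. The sketch below is an illustrative assumption, not a method described in the source: the `ngrams` and `overlap_ratio` helpers, the whitespace tokenization, and the default window of 8 words are all arbitrary choices for this example.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lower-cased, whitespace-tokenized word n-grams of `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(question: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the question's n-grams that occur in any corpus document."""
    q = ngrams(question, n)
    if not q:
        return 0.0  # question shorter than n words: nothing to compare
    seen = set()
    for doc in corpus_docs:
        seen |= q & ngrams(doc, n)
    return len(seen) / len(q)


question = "Find all c in Z3 such that Z3[x]/(x^2+c) is a field."
# Hypothetical training documents, made up for illustration.
leaked_doc = "Exam dump: " + question + " Answer B"
clean_doc = "A field is a commutative ring in which every nonzero element has an inverse."

print(overlap_ratio(question, [leaked_doc]))
print(overlap_ratio(question, [clean_doc]))
```

A ratio near 1 suggests the question appears almost verbatim in the corpus. A low ratio does not rule out contamination, since paraphrased or translated copies would not be caught by exact n-gram matching.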

Performance saturation and complementary benchmarks

  • As of 2025, MMLU is still widely used, but its ability to discriminate between models is declining as the latest models approach or exceed the human expert estimate (89.8%)
  • As a result, harder variants such as MMLU-Pro have emerged and are used alongside it

Dataset download

The MMLU dataset can be downloaded in the following ways:

Direct download

Hugging Face

The dataset can be downloaded easily through the Hugging Face datasets library:

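from datasets import load_dataset
# A config name is required: "all" loads every subject; a subject name such as "abstract_algebra" loads one.
dataset = load_dataset("cais/mmlu", "all")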

Dataset structure:

  • question: text of the multiple-choice question
  • subject: subject label of the question
  • choices: list of the four answer options
  • answer: index of the correct answer (0-3, corresponding to A-D)
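The fields above can be used as follows. This is a small sketch on a hand-written row that mirrors the documented structure, so it runs without downloading anything; the `validate_row`, `answer_letter`, and `answer_text` helpers are hypothetical names introduced for this example.

```python
LETTERS = ("A", "B", "C", "D")
REQUIRED_FIELDS = ("question", "subject", "choices", "answer")


def validate_row(row: dict) -> None:
    """Raise ValueError if the row does not match the documented structure."""
    missing = [f for f in REQUIRED_FIELDS if f not in row]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if len(row["choices"]) != 4:
        raise ValueError("expected exactly 4 choices")
    if row["answer"] not in range(4):
        raise ValueError("answer must be an index from 0 to 3")


def answer_letter(row: dict) -> str:
    """Map the answer index (0-3) to its letter (A-D)."""
    return LETTERS[row["answer"]]


def answer_text(row: dict) -> str:
    """Return the text of the correct choice."""
    return row["choices"][row["answer"]]


# Hand-written row based on the abstract algebra example in this article.
row = {
    "question": "Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.",
    "subject": "abstract_algebra",
    "choices": ["0", "1", "2", "0 and 1"],
    "answer": 1,
}

validate_row(row)
print(answer_letter(row), answer_text(row))  # B 1
```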

References

  1. Hendrycks, D., et al. (2020). “Measuring Massive Multitask Language Understanding”. arXiv:2009.03300. Presented at ICLR 2021.

  2. Wikipedia, “MMLU” entry (source of the quoted examples).

  3. Compiled from official announcements and benchmark results from GraphLogic AI, DataCamp, Artificial Analysis, OpenAI, Anthropic, and others (2024-2025).

  4. Wikipedia, “MMLU” entry (as of July 2024).

  5. arXiv:2406.01574. “MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark” (NeurIPS 2024).

  6. Results of an analysis study from June 2024.