๊ฐœ์š”

Learned Sparse Retrieval (LSR)์€ ์‹ ๊ฒฝ๋ง์„ ์‚ฌ์šฉํ•˜์—ฌ ์ฟผ๋ฆฌ์™€ ๋ฌธ์„œ๋ฅผ ํฌ์†Œ ๋ฒกํ„ฐ(sparse vector)๋กœ ํ‘œํ˜„ํ•˜๋Š” ์ •๋ณด ๊ฒ€์ƒ‰ ๊ธฐ๋ฒ•์ด๋‹ค1. ์ „ํ†ต์ ์ธ ์–ดํœ˜ ๊ธฐ๋ฐ˜ ๋ฐฉ๋ฒ•(TF-IDF, BM25)๊ณผ ๋ฐ€์ง‘ ๋ฒกํ„ฐ ์ž„๋ฒ ๋”ฉ์˜ ์žฅ์ ์„ ๊ฒฐํ•ฉํ•˜์—ฌ, ์—ญ์ƒ‰์ธ(inverted index)์˜ ํšจ์œจ์„ฑ์„ ์œ ์ง€ํ•˜๋ฉด์„œ๋„ ์˜๋ฏธ์  ๋งค์นญ ๋Šฅ๋ ฅ์„ ํ–ฅ์ƒ์‹œํ‚จ๋‹ค.

LSR์€ ์ „ํ†ต์ ์ธ ํฌ์†Œ ๋ฒกํ„ฐ์™€ ๋‹ฌ๋ฆฌ ์‹ ๊ฒฝ๋ง์„ ํ†ตํ•ด ํ•™์Šต๋˜๋ฉฐ, ๋‘ ๊ฐ€์ง€ ํ•ต์‹ฌ ๊ธฐ๋Šฅ์„ ์ œ๊ณตํ•œ๋‹ค:

  • Term weighting: ๋ฌธ์„œ์™€ ์ฟผ๋ฆฌ์—์„œ ๊ฐ ๋‹จ์–ด์˜ ์ค‘์š”๋„๋ฅผ ํ•™์Šต
  • Term expansion: ์›๋ณธ ํ…์ŠคํŠธ์— ์—†๋Š” ์˜๋ฏธ์ ์œผ๋กœ ๊ด€๋ จ๋œ ๋‹จ์–ด๋ฅผ ์ถ”๊ฐ€

๋ฐฐ๊ฒฝ๊ณผ ๋™๊ธฐ

์ „ํ†ต์  ํฌ์†Œ ๊ฒ€์ƒ‰์˜ ํ•œ๊ณ„

์ „ํ†ต์ ์ธ BM25 ๊ฐ™์€ ํฌ์†Œ ๊ฒ€์ƒ‰ ๋ฐฉ๋ฒ•์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฌธ์ œ๋ฅผ ๊ฐ€์ง„๋‹ค:

  • Vocabulary mismatch: โ€œ์ž๋™์ฐจโ€์™€ โ€œ์ฐจ๋Ÿ‰โ€์„ ์™„์ „ํžˆ ๋‹ค๋ฅธ ๋‹จ์–ด๋กœ ์ทจ๊ธ‰
  • ๋™์˜์–ด ๋ฏธ์ฒ˜๋ฆฌ: ์˜๋ฏธ์ ์œผ๋กœ ์œ ์‚ฌํ•œ ๋‹จ์–ด๋“ค์„ ๋ณ„๊ฐœ๋กœ ๊ฐ„์ฃผ
  • ๋ฌธ๋งฅ ๋ฌด์‹œ: ๋‹จ์–ด์˜ ์ค‘์š”๋„๊ฐ€ ๋ฌธ๋งฅ๊ณผ ๋ฌด๊ด€ํ•˜๊ฒŒ ๊ณ ์ •

๋ฐ€์ง‘ ๋ฒกํ„ฐ ๊ฒ€์ƒ‰์˜ ๋ฌธ์ œ

๋ฐ€์ง‘ ๋ฒกํ„ฐ ๊ธฐ๋ฐ˜ ๊ฒ€์ƒ‰(DPR, ANCE ๋“ฑ)์€ ์˜๋ฏธ์  ์œ ์‚ฌ์„ฑ์„ ์ž˜ ํฌ์ฐฉํ•˜์ง€๋งŒ:

  • ๋†’์€ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰: k-NN ๊ฒ€์ƒ‰์— ๋งŽ์€ ๋ฉ”๋ชจ๋ฆฌ ํ•„์š”
  • ๋А๋ฆฐ ๊ฒ€์ƒ‰ ์†๋„: ๊ทผ์‚ฌ ์ตœ๊ทผ์ ‘ ์ด์›ƒ ํƒ์ƒ‰์ด inverted index๋ณด๋‹ค ๋น„ํšจ์œจ์ 
  • ํ•ด์„ ์–ด๋ ค์›€: ์–ด๋–ค ์š”์†Œ๊ฐ€ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ์˜ํ–ฅ์„ ์ฃผ์—ˆ๋Š”์ง€ ํŒŒ์•… ๊ณค๋ž€

Learned Sparse Retrieval์˜ ํ•ด๊ฒฐ์ฑ…

LSR์€ ๋‹ค์Œ์„ ํ†ตํ•ด ๋‘ ๋ฐฉ์‹์˜ ์žฅ์ ์„ ๊ฒฐํ•ฉํ•œ๋‹ค:

  • ์‹ ๊ฒฝ๋ง์„ ํ†ตํ•ด ์˜๋ฏธ์  ์œ ์‚ฌ์„ฑ ํ•™์Šต
  • Inverted index๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํšจ์œจ์ ์ธ ๊ฒ€์ƒ‰
  • ํฌ์†Œ ํ‘œํ˜„์œผ๋กœ ํ•ด์„ ๊ฐ€๋Šฅ์„ฑ ์œ ์ง€

ํ†ตํ•ฉ ํ”„๋ ˆ์ž„์›Œํฌ

Nguyen et al.(2023)์€ ๋ชจ๋“  LSR ๋ฐฉ๋ฒ•์„ 4๊ฐ€์ง€ ํ•ต์‹ฌ ๊ตฌ์„ฑ ์š”์†Œ๋กœ ํ†ตํ•ฉํ•˜๋Š” ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์ œ์•ˆํ–ˆ๋‹ค2:

  1. Document term weighting: ๋ฌธ์„œ ๋‚ด ๊ฐ ๋‹จ์–ด์˜ ์ค‘์š”๋„ ๊ณ„์‚ฐ
  2. Query term weighting: ์ฟผ๋ฆฌ ๋‚ด ๊ฐ ๋‹จ์–ด์˜ ์ค‘์š”๋„ ๊ณ„์‚ฐ
  3. Document expansion: ๋ฌธ์„œ์— ๊ด€๋ จ ๋‹จ์–ด ์ถ”๊ฐ€
  4. Query expansion: ์ฟผ๋ฆฌ์— ๊ด€๋ จ ๋‹จ์–ด ์ถ”๊ฐ€
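Under this framework, every LSR method scores a query–document pair the same way: as a dot product of learned term weights over the vocabulary V, where f_Q and f_D denote the query and document encoders (a sketch of the formulation; individual methods differ only in how the encoders implement weighting and expansion):

```latex
\mathrm{sim}(q, d) \;=\; \sum_{t \in V} f_Q(q)_t \cdot f_D(d)_t
```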

๊ตฌ์„ฑ ์š”์†Œ๋ณ„ ๊ธฐ์—ฌ๋„

ํ†ต์ผ๋œ ํ™˜๊ฒฝ์—์„œ ์žฌํ•™์Šต ์‹คํ—˜ ๊ฒฐ๊ณผ:

  • Document term weighting: ํšจ๊ณผ์„ฑ์— ๊ฐ€์žฅ ์ค‘์š”ํ•œ ์š”์†Œ
  • Query term weighting: ์ž‘์ง€๋งŒ ๊ธ์ •์ ์ธ ์˜ํ–ฅ
  • Document/Query expansion: ์„œ๋กœ ์ƒ์‡„ ํšจ๊ณผ ์กด์žฌ

์‹ค์šฉ์  ๋ฐœ๊ฒฌ: ์ตœ์‹  ๋ชจ๋ธ์—์„œ query expansion์„ ์ œ๊ฑฐํ•˜๋ฉด ํšจ๊ณผ์„ฑ์€ ์œ ์ง€ํ•˜๋ฉด์„œ ์ง€์—ฐ ์‹œ๊ฐ„์„ ํฌ๊ฒŒ ์ค„์ผ ์ˆ˜ ์žˆ๋‹ค.

์ฃผ์š” ๋ชจ๋ธ

DeepCT (2019)

์ดˆ๊ธฐ learned sparse retrieval ๋ชจ๋ธ ์ค‘ ํ•˜๋‚˜๋กœ, BERT์˜ ๋ฌธ๋งฅ์  ํ‘œํ˜„์„ ํ™œ์šฉํ•œ๋‹ค.

ํ•ต์‹ฌ ์•„์ด๋””์–ด:

  • BERT์˜ ๋ฌธ๋งฅํ™”๋œ ์ž„๋ฒ ๋”ฉ ์œ„์— ์„ ํ˜• ํšŒ๊ท€ ๋ชจ๋ธ ์‚ฌ์šฉ
  • ๋ฌธ์„œ ๋‚ด ๊ฐ ๋‹จ์–ด์˜ ์ค‘์š”๋„๋ฅผ ์ •์ˆ˜ ๊ฐ’์œผ๋กœ ์˜ˆ์ธก
  • ๋‹จ์–ด๋ณ„๋กœ ๋…๋ฆฝ์ ์ธ ์ ์ˆ˜ ํ•™์Šต

ํ•œ๊ณ„:

  • ๊ฐ ๋‹จ์–ด์˜ ์ค‘์š”๋„์— ๋Œ€ํ•œ ground truth ์ •์˜๊ฐ€ ์–ด๋ ค์›€
  • ์ฟผ๋ฆฌ-๋ฌธ์„œ ๊ด€๋ จ์„ฑ์„ ์ง์ ‘์ ์œผ๋กœ ํ•™์Šตํ•˜์ง€ ์•Š์Œ

DeepImpact (2021)

A model that addresses DeepCT's limitations.

Improvements:

  • Uses query–document relevance directly as the training objective
  • Maps BERT embeddings through a two-layer network to a scalar score
  • Optimizes the sum of query-term impacts rather than independent per-term scores

Performance:

  • More efficient than DeepCT (mean response time 1.1 ms, tail 4.5 ms)
  • Less effective than SPLADEv2 or uniCOIL

uniCOIL (2021)

A simple extension of COIL that achieved state-of-the-art results on MS MARCO[3].

Characteristics:

  • Simplifies COIL's per-term vector outputs to scalar importance scores
  • Part of the Learned Term Impact (LTI) framework
  • Outperformed earlier methods by a wide margin on the MS MARCO dev queries

Effectiveness vs. efficiency:

  • More effective than DeepImpact, but slower
  • More than 10× slower than DeepImpact due to its more complex architecture

SPLADE (2021)

The most widely known learned sparse retrieval model[4].

Architecture:

  1. A BERT-based transformer tokenizes and encodes the input
  2. The MLM (masked language model) head projects each token onto the vocabulary (30,522 terms)
  3. A log-saturation activation limits the contribution of any single term
  4. Subtoken outputs are aggregated into the final sparse vector

ํ•ต์‹ฌ ํ˜์‹ :

  • ๋ช…์‹œ์  ํฌ์†Œ์„ฑ ์ •๊ทœํ™”: FLOPS regularizer๋กœ inverted index ๋น„์šฉ ์ง์ ‘ ์ถ”์ •
  • Log-saturation: ๋ฐ˜๋ณต ์ถœํ˜„ ๋‹จ์–ด์˜ ๊ณผ๋„ํ•œ ์˜ํ–ฅ ๋ฐฉ์ง€
  • End-to-end ๋‹จ์ผ ๋‹จ๊ณ„ ํ•™์Šต: ๋ณต์žกํ•œ ํŒŒ์ดํ”„๋ผ์ธ ์—†์ด ํ•™์Šต

์žฅ์ :

  • Exact term matching๊ณผ ํšจ์œจ์ ์ธ inverted index ์‚ฌ์šฉ
  • Term expansion์œผ๋กœ vocabulary mismatch ํ•ด๊ฒฐ
  • ํšจ๊ณผ์„ฑ๊ณผ ํšจ์œจ์„ฑ์˜ trade-off ์กฐ์ ˆ ๊ฐ€๋Šฅ

SPLADE v2 (2021)

An improved version of SPLADE that introduces several training techniques[5].

Main improvements:

  • Knowledge distillation: transfers knowledge from a teacher model
  • Hard negative mining: strengthens training with difficult negative examples
  • Better PLM initialization: optimized choice of pre-trained language model
  • Modified pooling mechanism: improved document representations

Performance:

  • Over 9% NDCG@10 improvement on TREC DL 2019
  • State-of-the-art results on the BEIR benchmark
  • Competitive with dense models

SPLADE v3 (2024)

The latest version, with substantial improvements to the training methodology[6].

Training improvements:

  • More hard negatives: 100 negatives per query (top-50 + 50 random)
  • Ensemble distillation: an ensemble of several cross-encoders instead of a single teacher
  • Hybrid loss function: a mix of KL divergence and MarginMSE
  • Self-distillation: negatives sampled with SPLADE++

Performance:

  • MRR@10 > 40 on the MS MARCO dev set
  • Statistically significantly better than BM25 and SPLADE++
  • Competitive with cross-encoder re-rankers

์ž‘๋™ ์›๋ฆฌ

1. ํฌ์†Œ ๋ฒกํ„ฐ ์ƒ์„ฑ

์ž…๋ ฅ ํ…์ŠคํŠธ: "machine learning algorithms"

โ†“ BERT tokenization

["machine", "learning", "algorithms"]

โ†“ BERT encoding + MLM head

์–ดํœ˜ ํฌ๊ธฐ(30,522) ๋ฒกํ„ฐ: [0, 0, ..., 0.8, 0, ..., 0.6, ..., 0.4, ..., 0.2, 0, ...]
                              โ†‘machine    โ†‘learning  โ†‘algorithm  โ†‘neural

โ†“ Sparsification (๋‚ฎ์€ ๊ฐ’ ์ œ๊ฑฐ)

{machine: 0.8, learning: 0.6, algorithm: 0.4, neural: 0.2}

ํŠน์ง•:

  • ์›๋ณธ์— ์—†๋Š” โ€œneuralโ€์ด term expansion์œผ๋กœ ์ถ”๊ฐ€๋จ
  • ๋Œ€๋ถ€๋ถ„์˜ ์ฐจ์›์ด 0 (ํฌ์†Œ์„ฑ)
  • ์ค‘์š”ํ•œ ๋‹จ์–ด๋งŒ non-zero ๊ฐ€์ค‘์น˜

2. ์ธ๋ฑ์‹ฑ

ํฌ์†Œ ๋ฒกํ„ฐ๋ฅผ inverted index์— ์ €์žฅ:

Inverted Index:
machine    โ†’ [doc1: 0.8, doc5: 0.6, ...]
learning   โ†’ [doc1: 0.6, doc3: 0.7, ...]
algorithm  โ†’ [doc1: 0.4, doc2: 0.5, ...]
neural     โ†’ [doc1: 0.2, doc4: 0.8, ...]

์ „ํ†ต์ ์ธ BM25์™€ ๋™์ผํ•œ ๊ตฌ์กฐ๋กœ ํšจ์œจ์  ๊ฒ€์ƒ‰ ๊ฐ€๋Šฅ.

3. ๊ฒ€์ƒ‰

์ฟผ๋ฆฌ๋„ ๋™์ผํ•˜๊ฒŒ ํฌ์†Œ ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜ ํ›„ dot product๋กœ ์ ์ˆ˜ ๊ณ„์‚ฐ:

Query: "deep learning"
Query vector: {deep: 0.9, learning: 0.7, neural: 0.3}

Document score = โˆ‘(query_weight ร— doc_weight)
               = (0.7 ร— 0.6) + (0.3 ร— 0.2)
               = 0.42 + 0.06
               = 0.48
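The score above can be computed with a few lines of Python (a minimal sketch; only terms shared by the query and the document contribute):

```python
def score(query_vec, doc_vec):
    """Dot product over the terms the query and document share."""
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

query = {"deep": 0.9, "learning": 0.7, "neural": 0.3}
doc = {"machine": 0.8, "learning": 0.6, "algorithm": 0.4, "neural": 0.2}
print(score(query, doc))  # ≈ 0.48 = (0.7 × 0.6) + (0.3 × 0.2)
```

"deep" contributes nothing because the document vector has no weight for it — exact term matching, just as in BM25.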

์„ฑ๋Šฅ ๋น„๊ต

BEIR ๋ฒค์น˜๋งˆํฌ (์ œ๋กœ์ƒท ํ‰๊ฐ€)

๋‹ค์–‘ํ•œ ๋„๋ฉ”์ธ์—์„œ์˜ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ:

๋ชจ๋ธํ‰๊ท  NDCG@10ํŠน์ง•
BM25~0.40๋ฒ ์ด์Šค๋ผ์ธ
DeepImpact~0.44๊ฐ€์žฅ ๋น ๋ฆ„
uniCOIL~0.47๊ท ํ˜•์žกํžŒ ์„ฑ๋Šฅ
SPLADE v2~0.49๋†’์€ ํšจ๊ณผ์„ฑ, ๋А๋ฆผ
SPLADE v3~0.51State-of-the-art
Dense (DPR)~0.45๋†’์€ ๋ฉ”๋ชจ๋ฆฌ

MS MARCO Passage Ranking

์ธ-๋„๋ฉ”์ธ ์„ฑ๋Šฅ:

๋ชจ๋ธMRR@10ํšจ์œจ์„ฑ
BM250.187๋งค์šฐ ๋น ๋ฆ„
DeepCT0.243๋น ๋ฆ„
DeepImpact0.326๋น ๋ฆ„ (1.1ms)
uniCOIL0.353์ค‘๊ฐ„
SPLADE v20.368๋А๋ฆผ
SPLADE v30.400+๋А๋ฆผ (10x+)

Recall ๋น„๊ต (Long Document Retrieval)

๊นŠ์ดDeepImpactuniCOILSPLADE
@1000.650.720.78
@5000.790.840.89
@10000.840.880.93

SPLADE๊ฐ€ ๋ชจ๋“  ๊นŠ์ด์—์„œ ๊ฐ€์žฅ ๋†’์€ recall ๋‹ฌ์„ฑ.

์žฅ์ 

1. ํšจ์œจ์„ฑ

  • Inverted index ํ™œ์šฉ: BM25์™€ ์œ ์‚ฌํ•œ ๊ฒ€์ƒ‰ ์†๋„
  • ๋‚ฎ์€ ๋ฉ”๋ชจ๋ฆฌ: ๋ฐ€์ง‘ ๋ฒกํ„ฐ ๋Œ€๋น„ 7-10% ์ˆ˜์ค€ ์ธ๋ฑ์Šค ํฌ๊ธฐ
  • ๋น ๋ฅธ ๊ฒ€์ƒ‰: k-NN๋ณด๋‹ค ํšจ์œจ์ 

2. ํšจ๊ณผ์„ฑ

  • ์˜๋ฏธ์  ๋งค์นญ: Term expansion์œผ๋กœ ๋™์˜์–ด, ๊ด€๋ จ์–ด ๊ฒ€์ƒ‰
  • ๋ฌธ๋งฅ ์ธ์‹: BERT ๊ธฐ๋ฐ˜์œผ๋กœ ๋ฌธ๋งฅ์— ๋”ฐ๋ฅธ ๊ฐ€์ค‘์น˜ ํ•™์Šต
  • ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ: BEIR, MS MARCO ๋“ฑ์—์„œ ๋ฐ€์ง‘ ๋ชจ๋ธ๊ณผ ๊ฒฝ์Ÿ

3. ํ•ด์„ ๊ฐ€๋Šฅ์„ฑ

  • ๋ช…์‹œ์  ๋‹จ์–ด ๊ฐ€์ค‘์น˜: ์–ด๋–ค ๋‹จ์–ด๊ฐ€ ์ค‘์š”ํ•œ์ง€ ํ™•์ธ ๊ฐ€๋Šฅ
  • ๋””๋ฒ„๊น… ์šฉ์ด: ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์˜ ๊ทผ๊ฑฐ ํŒŒ์•… ๊ฐ€๋Šฅ
  • ์„ค๋ช… ๊ฐ€๋Šฅํ•œ AI: ์‚ฌ์šฉ์ž์—๊ฒŒ ๊ฒ€์ƒ‰ ์ด์œ  ์ œ์‹œ ๊ฐ€๋Šฅ

4. ์œ ์—ฐ์„ฑ

  • Trade-off ์กฐ์ ˆ: ์ •๊ทœํ™” ๊ฐ•๋„๋กœ ํšจ์œจ์„ฑ-ํšจ๊ณผ์„ฑ ๊ท ํ˜• ์กฐ์ •
  • ๊ธฐ์กด ์ธํ”„๋ผ ํ™œ์šฉ: Lucene, Elasticsearch ๋“ฑ ๊ธฐ์กด ์‹œ์Šคํ…œ ์‚ฌ์šฉ
  • ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๊ฒ€์ƒ‰: ๋ฐ€์ง‘ ๋ฒกํ„ฐ์™€ ๊ฒฐํ•ฉ ๊ฐ€๋Šฅ (RRF ๋“ฑ)

๋‹จ์ 

1. ํ•™์Šต ๋น„์šฉ

  • ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹ ํ•„์š” (MS MARCO ๋“ฑ)
  • ๊ธด ํ•™์Šต ์‹œ๊ฐ„ (ํŠนํžˆ distillation ์‚ฌ์šฉ ์‹œ)
  • GPU ์ž์› ํ•„์š”

2. ์ง€์—ฐ ์‹œ๊ฐ„

  • ์ „ํ†ต์ ์ธ BM25๋ณด๋‹ค ๋А๋ฆผ (ํŠนํžˆ bi-encoder ๋ชจ๋“œ)
  • Term expansion์œผ๋กœ ์ธํ•œ ์ธ๋ฑ์Šค ํฌ๊ธฐ ์ฆ๊ฐ€
  • SPLADE v2/v3๋Š” DeepImpact๋ณด๋‹ค 10๋ฐฐ ์ด์ƒ ๋А๋ฆผ

3. ์–ธ์–ด ์˜์กด์„ฑ

  • ๋Œ€๋ถ€๋ถ„ ์˜์–ด ์ค‘์‹ฌ์œผ๋กœ ๊ฐœ๋ฐœ
  • ๋‹ค๊ตญ์–ด ์ง€์› ์ œํ•œ์ 
  • ์–ธ์–ด๋ณ„ ์žฌํ•™์Šต ํ•„์š”

4. ๋ณต์žก์„ฑ

  • ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹ ํ•„์š” (์ •๊ทœํ™” ๊ฐ•๋„, ํ•™์Šต๋ฅ  ๋“ฑ)
  • ์ „ํ†ต์  ๋ฐฉ๋ฒ•๋ณด๋‹ค ๊ตฌํ˜„ ๋ณต์žก
  • ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ ์ „๋ฌธ ์ง€์‹ ํ•„์š”

๊ตฌํ˜„ ๋ฐ ํ™œ์šฉ

ํ•™์Šต ํŒจ๋Ÿฌ๋‹ค์ž„

1. Contrastive Learning

  • Positive์™€ negative ์ƒ˜ํ”Œ ์Œ์œผ๋กœ ํ•™์Šต
  • ๊ด€๋ จ ๋ฌธ์„œ๋Š” ๊ฐ€๊น๊ฒŒ, ๋น„๊ด€๋ จ ๋ฌธ์„œ๋Š” ๋ฉ€๊ฒŒ

2. Distillation (๋” ํšจ๊ณผ์ )

  • Cross-encoder ๊ฐ™์€ ๊ฐ•๋ ฅํ•œ ๊ต์‚ฌ ๋ชจ๋ธ์˜ ์ ์ˆ˜ ํ•™์Šต
  • SPLADE v2 ์ดํ›„ ์ฃผ๋กœ ์‚ฌ์šฉ
  • ์•™์ƒ๋ธ” distillation์œผ๋กœ ๋”์šฑ ๊ฐœ์„  (v3)

์ •๊ทœํ™”

FLOPS Regularizer

Loss = Ranking_Loss + λ × FLOPS_cost

FLOPS_cost = ∑_j (ā_j)², where ā_j is the mean weight of vocabulary term j over a batch of documents — an estimate of the floating-point operations needed per query

  • Higher λ → sparser representations → faster retrieval, lower effectiveness
  • Lower λ → denser representations → slower retrieval, higher effectiveness

์‹ค์ œ ์‹œ์Šคํ…œ ์ ์šฉ

์ฃผ์š” ๊ตฌํ˜„:

  • Elasticsearch: ELSER (Elastic Learned Sparse Encoder)
  • OpenSearch: Neural Sparse Search (opensearch-neural-sparse-encoding ๋ชจ๋ธ)
  • Pinecone: Sparse vector index ์ง€์›
  • Qdrant: Sparse vector ์ง€์›

ํ™œ์šฉ ์‚ฌ๋ก€:

  • ์—”ํ„ฐํ”„๋ผ์ด์ฆˆ ๊ฒ€์ƒ‰: ๋ฌธ์„œ, ์ด๋ฉ”์ผ, ์ฝ”๋“œ ๊ฒ€์ƒ‰
  • E-commerce: ์ƒํ’ˆ ๊ฒ€์ƒ‰ ๋ฐ ์ถ”์ฒœ
  • ์งˆ์˜์‘๋‹ต ์‹œ์Šคํ…œ: RAG (Retrieval-Augmented Generation)
  • ๋ฒ•๋ฅ /์˜๋ฃŒ: ์ „๋ฌธ ์šฉ์–ด ๊ฒ€์ƒ‰

์ตœ์‹  ์—ฐ๊ตฌ ๋™ํ–ฅ

LLM๊ณผ์˜ ๊ฒฐํ•ฉ (2024)

Large Language Model์„ ํ™œ์šฉํ•œ sparse retrieval ํ–ฅ์ƒ:

  • Query rewriting: LLM์œผ๋กœ ์ฟผ๋ฆฌ ์žฌ์ž‘์„ฑ
  • Query expansion: LLM์ด ๊ด€๋ จ ์šฉ์–ด ์ œ์•ˆ
  • Document augmentation: LLM์œผ๋กœ ๋ฌธ์„œ ๋ณด๊ฐ•

Long Document ์ ์‘ (2023)

Proximal scoring์˜ ์ค‘์š”์„ฑ ๋ฐœ๊ฒฌ7:

  • ExactSDM: ์ •ํ™•ํ•œ ๊ทผ์ ‘์„ฑ ๋งค์นญ
  • SoftSDM: ์†Œํ”„ํŠธ ๊ทผ์ ‘์„ฑ ๋งค์นญ
  • ๊ธด ๋ฌธ์„œ์—์„œ LSR ์„ฑ๋Šฅ ํฌ๊ฒŒ ๊ฐœ์„ 

Multimodal LSR

ํ…์ŠคํŠธ ์ด์™ธ ๋„๋ฉ”์ธ์œผ๋กœ ํ™•์žฅ:

  • ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ๊ฒ€์ƒ‰
  • ๋น„๋””์˜ค ๊ฒ€์ƒ‰
  • ์˜ค๋””์˜ค ๊ฒ€์ƒ‰

๊ด€๋ จ ๊ธฐ์ˆ 

์ฐธ๊ณ  ์ž๋ฃŒ

์ฃผ์š” ๋…ผ๋ฌธ

๋ธ”๋กœ๊ทธ ๋ฐ ํŠœํ† ๋ฆฌ์–ผ

์ฝ”๋“œ ์ €์žฅ์†Œ

Footnotes

  1. Learned sparse retrieval - Wikipedia
  2. A Unified Framework for Learned Sparse Retrieval - Nguyen et al., 2023
  3. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques - Lin et al., 2021
  4. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking - Formal et al., 2021
  5. SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval - Formal et al., 2021
  6. SPLADE-v3: New baselines for SPLADE - Lassance et al., 2024
  7. Adapting Learned Sparse Retrieval for Long Documents - Nguyen et al., 2023