Overview

opensearch-py-ml is a Python client library for data analysis and machine learning on OpenSearch. It is a community-driven open-source fork of Eland, distributed under the Apache License 2.0.

Whereas Eland targets Elasticsearch only, opensearch-py-ml was developed specifically for OpenSearch and integrates closely with OpenSearch's ML Commons plugin.

Key Features

1. DataFrame API

Provides an API for working with an OpenSearch index as if it were a Pandas DataFrame, so large datasets can be analyzed efficiently from a Jupyter Notebook.

Characteristics:

  • Pandas-like interface
  • Data processing runs directly in OpenSearch
  • Supports complex filtering and aggregation
  • Handles large datasets without being limited by local memory

2. ML Commons Integration

Integrates with OpenSearch's ML Commons plugin so that machine learning models can be managed and used from Python.

Supported operations:

  • Registering pretrained models
  • Deploying and undeploying models
  • Generating embeddings and running inference
  • Managing model groups
  • Deleting models

3. SentenceTransformer Support

SentenceTransformer models can be uploaded and trained.

Capabilities:

  • Uploading SentenceTransformer models from Hugging Face
  • Fine-tuning models with synthetic queries
  • Generating text embeddings

Installation

# Install opensearch-py-ml
pip install opensearch-py-ml

# The OpenSearch Python client (opensearch-py) is installed along with it

Requirements:

  • Python 3.x
  • opensearch-py (installed automatically)
  • OpenSearch cluster (1.x and 2.x supported)

Usage Examples

Querying Data as a DataFrame

from opensearchpy import OpenSearch
import opensearch_py_ml as oml

# Connect to OpenSearch
# (verify_certs=False skips TLS certificate checks; use only for local testing)
client = OpenSearch(
    hosts=[{'host': 'localhost', 'port': 9200}],
    http_auth=('admin', 'admin'),
    use_ssl=True,
    verify_certs=False
)

# Create a DataFrame backed by the index
oml_df = oml.DataFrame(client, 'my-index')

# Look at the first rows
print(oml_df.head())

# Filter and aggregate
filtered = oml_df[oml_df['age'] > 30]
result = filtered.groupby('city').mean()
print(result)

Converting to and from Pandas

import pandas as pd
import opensearch_py_ml as oml

# `client` is the OpenSearch client created in the previous example

# Store a Pandas DataFrame in OpenSearch
pd_df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['Seoul', 'Busan', 'Incheon']
})

oml.pandas_to_opensearch(
    pd_df,
    client,
    'users-index'
)

# Convert from OpenSearch back to a Pandas DataFrame
oml_df = oml.DataFrame(client, 'users-index')
pd_df_result = oml.opensearch_to_pandas(oml_df)

Registering a Pretrained Model

from opensearchpy import OpenSearch
from opensearch_py_ml.ml_commons import MLCommonClient

# Create the OpenSearch client
client = OpenSearch(
    hosts=[{'host': 'localhost', 'port': 9200}],
    http_auth=('admin', 'admin')
)

# Create the ML Commons client
ml_client = MLCommonClient(client)

# Register a pretrained model
model_id = ml_client.register_pretrained_model(
    model_name="huggingface/sentence-transformers/all-MiniLM-L6-v2",
    model_version="1.0.0",
    model_format="TORCH_SCRIPT",
    deploy_model=True  # deploy automatically after registration
)

print(f"Model registered with ID: {model_id}")

Deploying and Undeploying a Model

# Deploy the model (load it into memory)
task_id = ml_client.deploy_model(model_id)

# Check the deployment status
task_info = ml_client.get_task_info(task_id)
print(f"Deployment status: {task_info['state']}")

# Retrieve model information
model_info = ml_client.get_model_info(model_id)
print(model_info)

# Undeploy the model (remove it from memory)
ml_client.undeploy_model(model_id)
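Deployment is asynchronous, so a single status check may still report an intermediate state. The following is a minimal polling sketch, assuming a task lookup that returns a dict with a `state` key, as the example above does. The helper `wait_for_task`, the set of terminal states, and the fake lookup used for the demonstration are assumptions, not part of the library.

```python
import time

# Assumed terminal task states; check the ML Commons docs for your version
TERMINAL_STATES = {"COMPLETED", "FAILED"}


def wait_for_task(fetch_task, task_id, timeout_s=60.0, interval_s=1.0):
    """Poll fetch_task(task_id) until the task reaches a terminal state."""
    deadline = time.monotonic() + timeout_s
    while True:
        info = fetch_task(task_id)
        if info.get("state") in TERMINAL_STATES:
            return info
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"task {task_id} still {info.get('state')!r} after {timeout_s}s"
            )
        time.sleep(interval_s)


# Stand-in for a real task lookup so the sketch runs without a cluster
_states = iter(["CREATED", "RUNNING", "COMPLETED"])


def fake_fetch(task_id):
    return {"task_id": task_id, "state": next(_states)}


result = wait_for_task(fake_fetch, "task-1", timeout_s=5.0, interval_s=0.01)
print(result["state"])
```

Against a real cluster, `ml_client.get_task_info` would be passed in place of `fake_fetch`.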

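Generating Embeddings

# Generate text embeddings
sentences = [
    "OpenSearch is a powerful search engine",
    "Machine learning models can be integrated easily"
]

response = ml_client.generate_embedding(
    model_id=model_id,
    sentences=sentences
)

# Assumption: the call returns the raw ML Commons predict response, with one
# entry per sentence under "inference_results"; adjust if your version differs
embeddings = [r["output"][0]["data"] for r in response["inference_results"]]

print(f"Generated {len(embeddings)} embeddings")
print(f"Embedding dimension: {len(embeddings[0])}")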

Managing Model Groups

# Create a model group (OpenSearch 2.8+)
model_group_id = ml_client.register_model_group(
    name="sentence-transformers",
    description="Sentence transformer models for semantic search"
)

# Register a model in the model group
model_id = ml_client.register_pretrained_model(
    model_name="huggingface/sentence-transformers/msmarco-distilbert-base-v4",
    model_version="1.0.0",
    model_format="TORCH_SCRIPT",
    model_group_id=model_group_id
)

Deleting a Model

# Delete the model (permanent; undeploy it first)
ml_client.delete_model(model_id)
print(f"Model {model_id} deleted")

Main Use Cases

1. Semantic Search

A SentenceTransformer model can be used to implement meaning-based search.

# 1. Register and deploy the model
model_id = ml_client.register_pretrained_model(
    model_name="huggingface/sentence-transformers/all-MiniLM-L6-v2",
    model_version="1.0.0",
    model_format="TORCH_SCRIPT",
    deploy_model=True
)

# 2. Generate document embeddings for indexing
documents = [
    "Python is a programming language",
    "OpenSearch is a search engine",
    "Machine learning helps computers learn"
]

# Assumption: generate_embedding returns the raw ML Commons predict response,
# with one entry per input text under "inference_results"
response = ml_client.generate_embedding(model_id, documents)
embeddings = [r["output"][0]["data"] for r in response["inference_results"]]

# 3. At search time, embed the query and run a similarity search with it
query = "What is a search tool?"
query_response = ml_client.generate_embedding(model_id, [query])
query_embedding = query_response["inference_results"][0]["output"][0]["data"]
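The example stops once the query embedding exists. One way to finish step 3 is an approximate k-NN query, sketched below under the assumption that the documents were indexed with their vectors in a `knn_vector` field. The field name `embedding` and the helper `knn_query_body` are hypothetical.

```python
def knn_query_body(vector, field="embedding", k=3):
    """Build an OpenSearch k-NN search body for a query vector."""
    if not vector:
        raise ValueError("vector must not be empty")
    return {
        "size": k,  # number of hits to return
        "query": {"knn": {field: {"vector": list(vector), "k": k}}},
    }


# A toy 3-dimensional vector stands in for a real query embedding
body = knn_query_body([0.1, 0.2, 0.3], field="embedding", k=2)
print(body)
```

The resulting body would be passed to `client.search(index=..., body=body)`; the index must have been created with k-NN enabled.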

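2. Large-Scale Log Analysis

Large volumes of log data stored in OpenSearch can be analyzed in a Pandas-like way.

import pandas as pd

# Open the log indices (wildcard pattern)
logs_df = oml.DataFrame(client, 'logs-2024-*')

# Filter error logs
error_logs = logs_df[logs_df['level'] == 'ERROR']

# Hourly error counts. pd.Grouper is a pandas feature, so pull only the
# timestamp column of the filtered rows into pandas first
# (assumes the filtered result fits in local memory)
pd_errors = oml.opensearch_to_pandas(error_logs[['timestamp']])
hourly_errors = pd_errors.groupby(
    pd.Grouper(key='timestamp', freq='1H')
).size()

print(hourly_errors)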

3. Recommendation Systems

Users and items can be embedded to build a recommendation system.

# Embed user profiles
user_profiles = [
    "User likes action movies and sci-fi",
    "User prefers comedy and romance"
]

user_embeddings = ml_client.generate_embedding(model_id, user_profiles)

# Embed item (movie) descriptions
movie_descriptions = [
    "Action-packed space adventure",
    "Romantic comedy about love"
]

movie_embeddings = ml_client.generate_embedding(model_id, movie_descriptions)

# Recommend by cosine similarity between user and movie embeddings
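The final step is left as a comment in the example. A minimal pure-Python sketch of cosine-similarity ranking follows. It assumes each embedding has already been extracted as a plain list of floats; the helper names and the toy vectors are illustrative, not part of the library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def recommend(user_vec, item_vecs):
    """Return item indices ordered from most to least similar."""
    scores = [cosine_similarity(user_vec, v) for v in item_vecs]
    return sorted(range(len(item_vecs)), key=lambda i: scores[i], reverse=True)


# Toy 2-dimensional vectors standing in for real embeddings
user = [1.0, 0.0]
items = [[0.0, 1.0], [0.9, 0.1]]
ranking = recommend(user, items)
print(ranking)
```

For more than a handful of items, the same ranking is usually delegated to a vector index rather than computed client-side.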

4. ETL Pipelines

Data can be extracted from various sources and loaded into OpenSearch.

import pandas as pd
import opensearch_py_ml as oml

# Read a CSV file
df = pd.read_csv('data.csv')

# Transform the data
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['category'] = df['category'].str.lower()

# Load into OpenSearch in batches of 1000 rows
# (`client` is an OpenSearch client created as in the earlier examples)
oml.pandas_to_opensearch(
    df,
    client,
    'processed-data',
    chunksize=1000
)
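As an illustration of what loading in batches of 1000 means, the sketch below splits any iterable of rows into fixed-size chunks. It shows the batching idea only; the helper `chunked` is hypothetical and is not how the library is implemented.

```python
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` items from `rows`."""
    if size <= 0:
        raise ValueError("size must be positive")
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# 2500 rows split into batches of 1000
batches = list(chunked(range(2500), 1000))
print([len(b) for b in batches])
```

Smaller batches lower peak memory and request size at the cost of more round trips to the cluster.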

Comparison with Eland

Aspect            | Eland                               | opensearch-py-ml
Target platform   | Elasticsearch                       | OpenSearch
License           | Apache 2.0                          | Apache 2.0
Maintainer        | Elastic                             | OpenSearch Community
ML plugin         | Elasticsearch ML                    | ML Commons
Model deployment  | Direct Hugging Face Hub integration | Via the ML Commons API
Supported versions| Elasticsearch 8+                    | OpenSearch 1.x, 2.x, 3.x
DataFrame API     | Pandas style                        | Pandas style (similar)

Version Compatibility

OpenSearch Version Support

opensearch-py-ml depends on the opensearch-py client, so it follows that client's compatibility.

Compatibility matrix:

opensearch-py-ml version | OpenSearch version | Notes
1.0.0                    | 2.4.0              | Listed in the official compatibility matrix
1.3.0 (latest)           | 1.x, 2.x, 3.x      | Supported through the opensearch-py 3.x dependency

Main features by version:

  • OpenSearch 1.x: DataFrame API, basic ML model upload
  • OpenSearch 2.x: advanced ML Commons features, model groups (2.8+)
  • OpenSearch 3.x: compatible together with opensearch-py 3.x (features removed in 3.0 are unavailable)

Notes:

  • opensearch-py 3.x.x supports OpenSearch 1.0.0 through 3.x
  • Features may be removed in a major OpenSearch upgrade, so check the release notes
  • For the latest compatibility information, see COMPATIBILITY.md in opensearch-py

Python Version

  • Python 3.7+
  • Recommended: Python 3.9 or later

Dependencies

  • opensearch-py: OpenSearch Python client
  • pandas: DataFrame manipulation
  • numpy: numerical operations

Limitations

  • Partial Pandas API compatibility: not every Pandas feature is supported
  • ML Commons dependency: ML features require the ML Commons plugin to be installed
  • OpenSearch only: not compatible with Elasticsearch
  • Sparse documentation: fewer docs and examples than Eland

ML Commons Settings

To use ML models, certain settings must be enabled on the OpenSearch cluster:

# Allow registering models from an external URL
PUT _cluster/settings
{
  "persistent": {
    "plugins.ml_commons.allow_registering_model_via_url": true
  }
}

# Allow registering models from a local file
PUT _cluster/settings
{
  "persistent": {
    "plugins.ml_commons.allow_registering_model_via_local_file": true
  }
}
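The same two settings can be applied from Python in one request. The sketch below builds the request body and passes it to the opensearch-py cluster settings call. The helper names are hypothetical, and `apply_settings` expects an already connected client, so only the body construction runs here.

```python
def ml_commons_settings_body(allow_url=True, allow_local_file=True):
    """Build a persistent cluster-settings body for model registration."""
    return {
        "persistent": {
            "plugins.ml_commons.allow_registering_model_via_url": allow_url,
            "plugins.ml_commons.allow_registering_model_via_local_file": allow_local_file,
        }
    }


def apply_settings(client, body):
    """Send the settings to the cluster (needs a connected OpenSearch client)."""
    return client.cluster.put_settings(body=body)


body = ml_commons_settings_body()
print(body)
```

Persistent settings survive a full cluster restart, so enable only the options that are actually needed.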
