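Overview
This article shows how to integrate Milvus with OpenAI to build high-quality semantic search.
It covers two topics:
- Generating text embeddings with OpenAI
- Running vector search with Milvus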
1. Generating text embeddings with OpenAI
To call the OpenAI API you first need an OpenAI API key. With the key in place, a few lines of Python are enough to generate embeddings through the OpenAI Embedding API. Example:
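# Set your OpenAI API key (openai.api_key) before generating embeddings.
# Note: openai.Embedding.create is the interface of the pre-1.0 openai Python SDK.
def embed(text):
    return openai.Embedding.create(
        input=text,
        engine=OPENAI_ENGINE)["data"][0]["embedding"]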
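The full script in section 2.2 sleeps between requests because free OpenAI accounts are rate limited. As a complement, here is a minimal sketch of a retry wrapper with exponential backoff that could be placed around a call such as embed(text). The names with_retries and flaky_embed are hypothetical and are not part of the OpenAI SDK; flaky_embed only simulates a call that fails twice before succeeding.

```python
import time


def with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Hypothetical stand-in for the real API call: fails twice, then succeeds.
calls = {"n": 0}


def flaky_embed():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return [0.1, 0.2, 0.3]


# Record the delays instead of actually sleeping, so the demo runs instantly.
delays = []
vector = with_retries(flaky_embed, sleep=delays.append)
print(vector, delays)
```

In real use one would probably catch only the SDK's rate-limit error rather than every exception, and call with_retries(lambda: embed(text)).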
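2. Running vector search with Milvus
Next, we write the embeddings generated by OpenAI into the Milvus vector database so that they can be searched. The example comes from the official Milvus website and implements book-title search with Milvus and OpenAI. The steps are:
- Connect to the Milvus vector database
- Create a Milvus collection
- Create a Milvus index
- Read data from a CSV file and call OpenAI to generate embeddings
- Insert the embeddings into Milvus
- Call OpenAI to generate an embedding for the user's search text
- Search Milvus with the user's embedding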
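To make the flow of these steps concrete before bringing in any external service, here is a minimal in-memory sketch. It is not the Milvus or OpenAI API: toy_embed is a hypothetical stand-in for the embedding call (it only hashes characters into a small unit vector), and InMemoryIndex is a hypothetical brute-force replacement for a Milvus collection.

```python
import math


def toy_embed(text, dim=8):
    """Hypothetical stand-in for the OpenAI call: hash characters into a unit vector."""
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[(ord(ch) + i) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class InMemoryIndex:
    """Brute-force stand-in for a vector collection (no real index structure)."""

    def __init__(self):
        self.rows = []

    def insert(self, idx, title, vector):
        self.rows.append((idx, title, vector))

    def search(self, vector, limit=5):
        # Score every stored row, then keep the closest `limit` rows.
        scored = sorted((l2_squared(vector, v), idx, title) for idx, title, v in self.rows)
        return [(idx, dist, title) for dist, idx, title in scored[:limit]]


# Embed and insert the data, then embed the query and search.
titles = ["Great Expectations", "Frankenstein", "Nicomachean Ethics"]
index = InMemoryIndex()
for idx, title in enumerate(titles):
    index.insert(idx, title, toy_embed(title))

hits = index.search(toy_embed("Frankenstein"), limit=2)
print(hits[0])
```

The toy embedding carries no semantic meaning, so only an exact match is guaranteed to rank first. The point is the shape of the pipeline: embed, insert, embed the query, then rank by distance.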
2.1 Installing the Python dependencies
pip install "openai<1.0"  # the script below uses the pre-1.0 openai.Embedding interface
pip install pymilvus
2.2 Implementing semantic search with Milvus
First, import the required modules:
import csv
import json
import random
import openai
import time
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility
# Extract the book titles
def csv_load(file):
    with open(file, newline='') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            yield row[1]
FILE = './content/books.csv' # Download it from https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks and save it in the folder that holds your script.
COLLECTION_NAME = 'title_db' # Collection name
DIMENSION = 1536 # Embeddings size
COUNT = 100 # How many titles to embed and insert.
MILVUS_HOST = 'localhost' # Milvus server URI
MILVUS_PORT = '19530'
OPENAI_ENGINE = 'text-embedding-ada-002' # Which engine to use
openai.api_key = 'sk-******' # Use your own Open AI API Key here
# Connect to Milvus
connections.connect(host=MILVUS_HOST, port=MILVUS_PORT)
# Remove collection if it already exists
if utility.has_collection(COLLECTION_NAME):
    utility.drop_collection(COLLECTION_NAME)
# Create collection which includes the id, title, and embedding.
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, description='Ids', is_primary=True, auto_id=False),
    FieldSchema(name='title', dtype=DataType.VARCHAR, description='Title texts', max_length=200),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='Embedding vectors', dim=DIMENSION)
]
schema = CollectionSchema(fields=fields, description='Title collection')
collection = Collection(name=COLLECTION_NAME, schema=schema)
# Create an index for the collection.
index_params = {
    'index_type': 'IVF_FLAT',
    'metric_type': 'L2',
    'params': {'nlist': 1024}
}
collection.create_index(field_name="embedding", index_params=index_params)
# Extract embedding from text using OpenAI
def embed(text):
    return openai.Embedding.create(
        input=text,
        engine=OPENAI_ENGINE)["data"][0]["embedding"]
# Insert each title and its embedding
for idx, text in enumerate(random.sample(sorted(csv_load(FILE)), k=COUNT)):  # Load COUNT random titles from the dataset
    ins = [[idx], [(text[:198] + '..') if len(text) > 200 else text], [embed(text)]]  # Insert the title id, the title text, and the title embedding vector
    collection.insert(ins)
    time.sleep(3)  # Free OpenAI account limited to 60 RPM
# Load the collection into memory for searching
collection.load()
# Search the database based on input text
def search(text):
    # Search parameters for the index
    search_params = {
        "metric_type": "L2"
    }
    results = collection.search(
        data=[embed(text)],  # Embedded search value
        anns_field="embedding",  # Search across embeddings
        param=search_params,
        limit=5,  # Limit to five results per search
        output_fields=['title']  # Include title field in result
    )
    ret = []
    for hit in results[0]:
        row = []
        row.extend([hit.id, hit.score, hit.entity.get('title')])  # Get the id, distance, and title for the results
        ret.append(row)
    return ret
search_terms = ['self-improvement', 'landscape']
for x in search_terms:
    print('Search term:', x)
    for result in search(x):
        print(result)
    print()
Save the code to a file, then run it with Python 3:
python filename.py
When the script finishes, the console shows:
Search term: self-improvement
[46, 0.37948882579803467, 'The Road Less Traveled: A New Psychology of Love Traditional Values and Spiritual Growth']
[24, 0.39301538467407227, 'The Leader In You: How to Win Friends Influence People and Succeed in a Changing World']
[35, 0.4081816077232361, 'Think and Grow Rich: The Landmark Bestseller Now Revised and Updated for the 21st Century']
[93, 0.4174671173095703, 'Great Expectations']
[10, 0.41889268159866333, 'Nicomachean Ethics']
Search term: landscape
[49, 0.3966977894306183, 'Traveller']
[20, 0.41044068336486816, 'A Parchment of Leaves']
[40, 0.4179283380508423, 'The Illustrated Garden Book: A New Anthology']
[97, 0.42227691411972046, 'Monsoon Summer']
[70, 0.42461898922920227, 'Frankenstein']
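Each result row holds the id, the distance, and the title. Because the collection uses the L2 metric, a smaller distance means a closer match. As a small worked example of how such a distance relates to cosine similarity, the sketch below checks the identity squared L2 = 2 - 2 * cosine for unit-length vectors. The vectors are made up for illustration. Whether your embeddings are unit length, and whether the reported score is the squared distance, are assumptions worth verifying for your model and Milvus version.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def l2_squared(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Two made-up vectors, normalized to unit length.
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])

d2 = l2_squared(a, b)
cos = cosine(a, b)

# For unit vectors both printed values agree (up to floating-point error).
print(round(d2, 6), round(2 - 2 * cos, 6))
```

Under those assumptions, ranking by L2 distance and ranking by cosine similarity produce the same order.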
3. Milvus Standalone Docker Compose YAML file
version: '3.5'

services:
  etcd:
    container_name: milvus-etcd
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd

  minio:
    container_name: milvus-minio
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
    command: minio server /minio_data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3

  standalone:
    container_name: milvus-standalone
    image: milvusdb/milvus:v2.2.13
    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
    ports:
      - "19530:19530"
      - "9091:9091"
    depends_on:
      - "etcd"
      - "minio"

networks:
  default:
    name: milvus
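Once the services are up (for example with docker compose up -d, assuming the file is saved as docker-compose.yml), it can help to confirm that the Milvus port is reachable before running the search script. The sketch below uses only the Python standard library. port_is_open is a hypothetical helper, and an open port shows only that something is listening, not that Milvus is fully ready.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


# 19530 is the port published by the standalone service in the compose file.
print("Milvus port reachable:", port_is_open("localhost", 19530))
```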
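4. Summary
By combining OpenAI-generated embeddings with Milvus's efficient vector search, we can build a powerful semantic search engine that applies to many scenarios, such as text retrieval, question answering, and recommendation systems.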