Experiment tracking with a vector DB#
This notebook illustrates experiment tracking for document-based models, using the LangChain API to integrate directly with vector databases. You can track documents as artifacts, complete with metadata such as loader type, producer information, and collection details.
In this tutorial:
- Install Milvus and configure a datastore profile
- Create an MLRun collection from a Milvus vector store
- Work with LangChain documents and MLRun artifacts
- Use text splitters with artifacts
- Use MLRunLoader, standalone and with DirectoryLoader
This tutorial uses Milvus on a local host for simplicity. To use Milvus without the local host, see Manage Milvus Connections.
SDK reference#
This tutorial uses the following MLRun SDK objects:
- ConfigProfile
- register_temporary_client_datastore_profile
- get_config_profile_attributes
- get_vector_store_collection
- log_document
- DocumentLoaderSpec
- MLRunLoader
Prerequisites#
Install Milvus on your cluster. To install Milvus in the milvus namespace on a cluster, SSH to your data node and run:
helm repo add milvus https://zilliztech.github.io/milvus-helm/
helm repo update
kubectl create namespace milvus
helm install test-milvus milvus/milvus -n milvus --set cluster.enabled=false --set standalone.persistence.enabled=false --set etcd.replicaCount=1 --set minio.mode=standalone --set pulsar.enabled=false --set minio.persistence.enabled=false --set etcd.persistence.enabled=false --set attu.enabled=true
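To verify the installation, you can check that the Milvus pods are running (standard kubectl; the pod names depend on the release name):
kubectl get pods -n milvus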
The URI for accessing Milvus from within the cluster is: http://test-milvus.milvus.svc.cluster.local:19530
Setup and imports#
#!pip install mlrun langchain langchain-milvus pymilvus langchain_community
import mlrun
import tempfile
from langchain.embeddings import FakeEmbeddings
from langchain_milvus import Milvus
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
from mlrun.artifacts import DocumentLoaderSpec, MLRunLoader
from mlrun.datastore.datastore_profile import (
ConfigProfile,
register_temporary_client_datastore_profile,
)
# Initialize project
project = mlrun.get_or_create_project("vectorstore-demo3")
> 2025-04-27 10:55:35,059 [info] Project loaded successfully: {"project_name":"vectorstore-demo3"}
Milvus configuration#
Create and register a profile representing a Milvus DB. This is done at the project level, once per project. Credentials for the DB can be passed here, provided the code is not committed to any repo; alternatively, provide them through project secrets. See ConfigProfile.
profile = ConfigProfile(
name="milvus-config", public={"MILVUS_DB": {"uri": "./milvus_demo.db"}}
)
# Register the profile temporarily for the current client session
register_temporary_client_datastore_profile(profile)
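If your Milvus runs in-cluster (as installed in the prerequisites) and requires authentication, the same profile mechanism can carry the cluster URI, with credentials kept in the private part so they stay out of any repo. This is a minimal sketch; the private argument and the token key are assumptions to verify against ConfigProfile and your Milvus authentication setup:
# Sketch: a profile for the in-cluster Milvus from the prerequisites.
# The private dict and the "token" key are assumptions - check ConfigProfile
# and your Milvus authentication configuration before relying on them.
cluster_profile = ConfigProfile(
    name="milvus-cluster-config",
    public={"MILVUS_DB": {"uri": "http://test-milvus.milvus.svc.cluster.local:19530"}},
    private={"MILVUS_DB": {"token": "<your-milvus-token>"}},  # hypothetical credential
)
register_temporary_client_datastore_profile(cluster_profile)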
Creating an MLRun collection from Milvus#
Create a collection (or use an existing one) to store the artifacts/documents in. Use the configuration stored earlier in the ConfigProfile to get the connection details. You still need to instantiate the actual vector-store class, since each vector DB has a different initialization method. See get_config_profile_attributes.
# Initialize embedding model (using FakeEmbeddings for demonstration)
embedding_model = FakeEmbeddings(size=3)
config = project.get_config_profile_attributes("milvus-config")
config
{'MILVUS_DB': {'uri': './milvus_demo.db'}}
Create the Milvus vector store#
In this step you also create the MLRun collection wrapper. See get_vector_store_collection.
vectorstore = Milvus(
collection_name="my_tutorial_collection",
embedding_function=embedding_model,
connection_args=config["MILVUS_DB"],
auto_id=True,
)
# Create MLRun collection wrapper
collection = project.get_vector_store_collection(vector_store=vectorstore)
Working with LangChain documents and MLRun artifacts#
# Create a sample document
def create_sample_document(content, dir=None):
with tempfile.NamedTemporaryFile(
mode="w", suffix=".txt", delete=False, dir=dir
) as temp_file:
temp_file.write(content)
return temp_file.name
# Create and log an MLRun artifact
file_path = create_sample_document("Sample content for demonstration")
artifact = project.log_document("sample-doc", local_path=file_path)
# Convert MLRun artifact to LangChain documents
langchain_docs = artifact.to_langchain_documents()
print("LangChain document content:", langchain_docs[0].page_content)
print("LangChain document metadata:", langchain_docs[0].metadata)
# Add LangChain documents to collection
milvus_ids = collection.add_documents(langchain_docs)
print("Documents added with IDs:", milvus_ids)
# Search in collection
results = collection.similarity_search("sample", k=1)
print("Search results:", [doc.page_content for doc in results])
LangChain document content: Sample content for demonstration
LangChain document metadata: {'source': 'vectorstore-demo3/sample-doc', 'original_source': '/tmp/tmp69ymvc77.txt', 'mlrun_tag': 'latest', 'mlrun_key': 'sample-doc', 'mlrun_project': 'vectorstore-demo3', 'mlrun_chunk': '0'}
Documents added with IDs: [457638264974082048]
Search results: ['Sample content for demonstration']
Working with MLRun artifacts in a collection#
# Add artifacts directly to collection
artifact1 = project.log_document(
"doc1", local_path=create_sample_document("First document")
)
artifact2 = project.log_document(
"doc2", local_path=create_sample_document("Second document")
)
# Add multiple artifacts at once
milvus_ids = collection.add_artifacts([artifact1, artifact2])
print("Artifacts added with IDs:", milvus_ids)
# Get back as LangChain documents
search_results = collection.similarity_search("first")
print("Retrieved document:", search_results[0].page_content)
Artifacts added with IDs: [457638265059803138, 457638265073958916]
Retrieved document: First document
Using text splitters with artifacts#
An artifact represents an original source document file (PDF, DOC, and so on). LangChain uses a text splitter to convert such sources into chunks of text called documents.
# Create a text splitter
splitter = CharacterTextSplitter(separator="\n", chunk_size=100, chunk_overlap=20)
# Create a longer document
long_text = "This is a longer document.\n" * 5
long_doc = project.log_document(
"long-doc", local_path=create_sample_document(long_text)
)
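Before logging anything to a collection, you can preview how the splitter will chunk the text by calling it directly. split_text is standard LangChain API; the exact chunk boundaries depend on the separator, chunk_size, and chunk_overlap configured above:
# Preview the chunks the splitter produces (no collection involved yet)
chunks = splitter.split_text(long_text)
print(f"Split into {len(chunks)} chunks")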
# Add artifact with splitting
collection_split = project.get_vector_store_collection(
vector_store=Milvus(
collection_name="split_collection",
embedding_function=embedding_model,
connection_args=config["MILVUS_DB"],
auto_id=False,
),
)
# Add with custom IDs for chunks
ids = collection_split.add_artifacts([long_doc], splitter=splitter, ids=["doc1"])
print("Generated chunk IDs:", ids)
Generated chunk IDs: ['doc1_1', 'doc1_2']
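The chunk IDs follow the pattern id_n, as seen above. Searching the split collection works the same way as before; note that with FakeEmbeddings the ranking is essentially random, so this only demonstrates the call:
# Search across the stored chunks (ranking is meaningless with FakeEmbeddings)
split_results = collection_split.similarity_search("longer document", k=2)
print("Chunk contents:", [doc.page_content for doc in split_results])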
Using MLRunLoader#
MLRunLoader is a wrapper. It receives a LangChain loader class as a parameter (for example, PyPDFLoader or CSVLoader from langchain_community.document_loaders), delegates all loading to that underlying loader, and in addition logs the source file as an MLRun artifact.
This flow uses MLRunLoader to create instances of a dynamically defined document loader.
# Create a document loader specification.
# loader_class_name is the LangChain loader class to instantiate;
# src_name is the constructor argument of that loader that receives the file path.
loader_spec = DocumentLoaderSpec(
    loader_class_name="langchain_community.document_loaders.TextLoader",
    src_name="file_path",
)
# Create and use MLRunLoader
file_path = create_sample_document("Content for MLRunLoader test")
loader = MLRunLoader(
source_path=file_path,
loader_spec=loader_spec,
artifact_key="loaded-doc",
producer=project,
)
# Load documents
documents = loader.load()
print("Loaded document content:", documents[0].page_content)
# Verify artifact creation
artifact = project.get_artifact("loaded-doc")
print("Created artifact key:", artifact.key)
Loaded document content: Content for MLRunLoader test
Created artifact key: loaded-doc
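The same pattern works with other LangChain loaders. For example, a spec for PDF files might look like the following sketch; it assumes the pypdf package is installed and that PyPDFLoader takes its source path as file_path:
# Sketch: a loader spec for PDFs instead of plain text (assumes pypdf is installed)
pdf_loader_spec = DocumentLoaderSpec(
    loader_class_name="langchain_community.document_loaders.PyPDFLoader",
    src_name="file_path",  # PyPDFLoader's constructor argument for the source path
)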
Using MLRunLoader with DirectoryLoader#
LangChain's DirectoryLoader loads all the files in a directory by calling the configured loader class on each file. When you pass MLRunLoader as the loader class, all the source files are also added as MLRun artifacts.
See DocumentLoaderSpec.
# Create a directory with multiple documents
temp_dir = tempfile.mkdtemp()
create_sample_document("First file content", dir=temp_dir)
create_sample_document("Second file content", dir=temp_dir)
# Configure loader specification
artifact_loader_spec = DocumentLoaderSpec(
loader_class_name="langchain_community.document_loaders.TextLoader",
src_name="file_path",
)
# Create directory loader with MLRunLoader
dir_loader = DirectoryLoader(
temp_dir,
glob="**/*.*",
loader_cls=MLRunLoader,
loader_kwargs={
"loader_spec": artifact_loader_spec,
"artifact_key": "dir_doc%%", # %% will be replaced with unique identifier
"producer": project,
"upload": False,
},
)
# Load all documents
documents = dir_loader.load()
print(f"Loaded {len(documents)} documents")
# List created artifacts
artifacts = project.list_artifacts(kind="document")
matching_artifacts = [
art for art in artifacts if art["metadata"]["key"].startswith("dir_doc")
]
print("Created artifacts:", [art["metadata"]["key"] for art in matching_artifacts])
Loaded 2 documents
Created artifacts: ['dir_doctmp_tmpxonub3vj_tmpuxptzapt.txt', 'dir_doctmp_tmpxonub3vj_tmpglg9pznc.txt']
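Each of these artifacts behaves like the single-document case shown earlier: for example, you can fetch one back with get_artifact and convert it to LangChain documents with to_langchain_documents, both used above:
# Fetch one of the directory artifacts and convert it back to LangChain documents
first_key = matching_artifacts[0]["metadata"]["key"]
doc_artifact = project.get_artifact(first_key)
print(doc_artifact.to_langchain_documents()[0].page_content)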