Experiment tracking with a vector DB#
This notebook illustrates experiment tracking for document-based models, using the LangChain API to integrate directly with vector databases. You can track documents as artifacts, complete with metadata such as loader type, producer information, and collection details.
In this tutorial:
- Install Milvus and configure a datastore profile
- Create an MLRun collection from a Milvus vector store
- Work with LangChain documents and MLRun artifacts
- Use text splitters with artifacts
- Use MLRunLoader, standalone and with DirectoryLoader
This tutorial uses Milvus on a local host for simplicity. To use Milvus without the local host, see Manage Milvus Connections.
SDK reference#
This tutorial uses the following MLRun SDK objects:
- ConfigProfile
- register_temporary_client_datastore_profile
- get_config_profile_attributes
- get_vector_store_collection
- log_document
- DocumentLoaderSpec
- MLRunLoader
Prerequisites#
Install Milvus on your cluster. To install Milvus in the milvus namespace on a cluster, SSH to your data node and run:
helm repo add milvus https://zilliztech.github.io/milvus-helm/
helm repo update
kubectl create namespace milvus
helm install test-milvus milvus/milvus -n milvus --set cluster.enabled=false --set standalone.persistence.enabled=false --set etcd.replicaCount=1 --set minio.mode=standalone --set pulsar.enabled=false --set minio.persistence.enabled=false --set etcd.persistence.enabled=false --set attu.enabled=true
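To verify the installation, you can check that the Milvus pods are running (standard kubectl; the pod names depend on the release name):
kubectl get pods -n milvus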
The URI for accessing Milvus from within the cluster is: http://test-milvus.milvus.svc.cluster.local:19530
Setup and imports#
#!pip install mlrun langchain langchain-milvus pymilvus langchain_community
import mlrun
import tempfile
from langchain.embeddings import FakeEmbeddings
from langchain_milvus import Milvus
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
from mlrun.artifacts import DocumentLoaderSpec, MLRunLoader
from mlrun.datastore.datastore_profile import (
ConfigProfile,
register_temporary_client_datastore_profile,
)
# Initialize project
project = mlrun.get_or_create_project("vectorstore-demo3")
> 2025-04-27 10:55:35,059 [info] Project loaded successfully: {"project_name":"vectorstore-demo3"}
Milvus configuration#
Create and register a profile representing a Milvus DB. This is done at the project level, once per project. Credentials for the DB can be passed here, provided the code is not committed to any repo; alternatively, provide them through project secrets. See ConfigProfile.
profile = ConfigProfile(
name="milvus-config", public={"MILVUS_DB": {"uri": "./milvus_demo.db"}}
)
# Register the profile temporarily for the current client session
register_temporary_client_datastore_profile(profile)
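If your Milvus runs in-cluster (as installed in the prerequisites) and requires authentication, the same profile mechanism can carry the cluster URI, with credentials kept in the private part so they stay out of any repo. This is a minimal sketch; the private argument and the token key are assumptions to verify against ConfigProfile and your Milvus authentication setup:
# Sketch: a profile for the in-cluster Milvus from the prerequisites.
# The private dict and the "token" key are assumptions - check ConfigProfile
# and your Milvus authentication configuration before relying on them.
cluster_profile = ConfigProfile(
    name="milvus-cluster-config",
    public={"MILVUS_DB": {"uri": "http://test-milvus.milvus.svc.cluster.local:19530"}},
    private={"MILVUS_DB": {"token": "<your-milvus-token>"}},  # hypothetical credential
)
register_temporary_client_datastore_profile(cluster_profile)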
Creating an MLRun collection from Milvus#
Create a collection (or use an existing one) to store the artifacts/documents in. Use the configuration stored earlier in the ConfigProfile to get the connection details. You still need to instantiate the actual vector-store class, since each vector DB has a different initialization method. See get_config_profile_attributes.
# Initialize embedding model (using FakeEmbeddings for demonstration)
embedding_model = FakeEmbeddings(size=3)
config = project.get_config_profile_attributes("milvus-config")
config
{'MILVUS_DB': {'uri': './milvus_demo.db'}}
Create the Milvus vector store#
In this step you also create the MLRun collection wrapper. See get_vector_store_collection.
vectorstore = Milvus(
collection_name="my_tutorial_collection",
embedding_function=embedding_model,
connection_args=config["MILVUS_DB"],
auto_id=True,
)
# Create MLRun collection wrapper
collection = project.get_vector_store_collection(vector_store=vectorstore)
Working with LangChain documents and MLRun artifacts#
# Create a sample document
def create_sample_document(content, dir=None):
with tempfile.NamedTemporaryFile(
mode="w", suffix=".txt", delete=False, dir=dir
) as temp_file:
temp_file.write(content)
return temp_file.name
# Create and log an MLRun artifact
file_path = create_sample_document("Sample content for demonstration")
artifact = project.log_document("sample-doc", local_path=file_path)
# Convert MLRun artifact to LangChain documents
langchain_docs = artifact.to_langchain_documents()
print("LangChain document content:", langchain_docs[0].page_content)
print("LangChain document metadata:", langchain_docs[0].metadata)
# Add LangChain documents to collection
milvus_ids = collection.add_documents(langchain_docs)
print("Documents added with IDs:", milvus_ids)
# Search in collection
results = collection.similarity_search("sample", k=1)
print("Search results:", [doc.page_content for doc in results])
LangChain document content: Sample content for demonstration
LangChain document metadata: {'source': 'vectorstore-demo3/sample-doc', 'original_source': '/tmp/tmp69ymvc77.txt', 'mlrun_tag': 'latest', 'mlrun_key': 'sample-doc', 'mlrun_project': 'vectorstore-demo3', 'mlrun_chunk': '0'}
Documents added with IDs: [457638264974082048]
Search results: ['Sample content for demonstration']
Working with MLRun artifacts in a collection#
# Add artifacts directly to collection
artifact1 = project.log_document(
"doc1", local_path=create_sample_document("First document")
)
artifact2 = project.log_document(
"doc2", local_path=create_sample_document("Second document")
)
# Add multiple artifacts at once
milvus_ids = collection.add_artifacts([artifact1, artifact2])
print("Artifacts added with IDs:", milvus_ids)
# Get back as LangChain documents
search_results = collection.similarity_search("first")
print("Retrieved document:", search_results[0].page_content)
Artifacts added with IDs: [457638265059803138, 457638265073958916]
Retrieved document: First document
Using text splitters with artifacts#
An artifact represents an original source document file (PDF, DOC, and so on). LangChain uses a text splitter to convert such sources into chunks of text called documents.
# Create a text splitter
splitter = CharacterTextSplitter(separator="\n", chunk_size=100, chunk_overlap=20)
# Create a longer document
long_text = "This is a longer document.\n" * 5
long_doc = project.log_document(
"long-doc", local_path=create_sample_document(long_text)
)
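Before logging anything to a collection, you can preview how the splitter will chunk the text by calling it directly. split_text is standard LangChain API; the exact chunk boundaries depend on the separator, chunk_size, and chunk_overlap configured above:
# Preview the chunks the splitter produces (no collection involved yet)
chunks = splitter.split_text(long_text)
print(f"Split into {len(chunks)} chunks")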
# Add artifact with splitting
collection_split = project.get_vector_store_collection(
vector_store=Milvus(
collection_name="split_collection",
embedding_function=embedding_model,
connection_args=config["MILVUS_DB"],
auto_id=False,
),
)
# Add with custom IDs for chunks
ids = collection_split.add_artifacts([long_doc], splitter=splitter, ids=["doc1"])
print("Generated chunk IDs:", ids)
Generated chunk IDs: ['doc1_1', 'doc1_2']
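The chunk IDs follow the pattern id_n, as seen above. Searching the split collection works the same way as before; note that with FakeEmbeddings the ranking is essentially random, so this only demonstrates the call:
# Search across the stored chunks (ranking is meaningless with FakeEmbeddings)
split_results = collection_split.similarity_search("longer document", k=2)
print("Chunk contents:", [doc.page_content for doc in split_results])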
Using MLRunLoader#
MLRunLoader is a wrapper. It receives a LangChain loader class as a parameter (for example, PyPDFLoader or CSVLoader from langchain_community.document_loaders), delegates all loading to that underlying loader, and in addition logs the source file as an MLRun artifact.
This flow uses MLRunLoader to create instances of a dynamically defined document loader.
# Create a document loader specification.
# loader_class_name is the LangChain loader class to instantiate;
# src_name is the constructor argument of that loader that receives the file path.
loader_spec = DocumentLoaderSpec(
    loader_class_name="langchain_community.document_loaders.TextLoader",
    src_name="file_path",
)
# Create and use MLRunLoader
file_path = create_sample_document("Content for MLRunLoader test")
loader = MLRunLoader(
source_path=file_path,
loader_spec=loader_spec,
artifact_key="loaded-doc",
producer=project,
)
# Load documents
documents = loader.load()
print("Loaded document content:", documents[0].page_content)
# Verify artifact creation
artifact = project.get_artifact("loaded-doc")
print("Created artifact key:", artifact.key)
Loaded document content: Content for MLRunLoader test
Created artifact key: loaded-doc
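The same pattern works with other LangChain loaders. For example, a spec for PDF files might look like the following sketch; it assumes the pypdf package is installed and that PyPDFLoader takes its source path as file_path:
# Sketch: a loader spec for PDFs instead of plain text (assumes pypdf is installed)
pdf_loader_spec = DocumentLoaderSpec(
    loader_class_name="langchain_community.document_loaders.PyPDFLoader",
    src_name="file_path",  # PyPDFLoader's constructor argument for the source path
)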
Using MLRunLoader with DirectoryLoader#
LangChain's DirectoryLoader loads all the files in a directory by calling the configured loader class on each file. When you pass MLRunLoader as the loader class, all the source files are also added as MLRun artifacts.
See DocumentLoaderSpec.
# Create a directory with multiple documents
temp_dir = tempfile.mkdtemp()
create_sample_document("First file content", dir=temp_dir)
create_sample_document("Second file content", dir=temp_dir)
# Configure loader specification
artifact_loader_spec = DocumentLoaderSpec(
loader_class_name="langchain_community.document_loaders.TextLoader",
src_name="file_path",
)
# Create directory loader with MLRunLoader
dir_loader = DirectoryLoader(
temp_dir,
glob="**/*.*",
loader_cls=MLRunLoader,
loader_kwargs={
"loader_spec": artifact_loader_spec,
"artifact_key": "dir_doc%%", # %% will be replaced with unique identifier
"producer": project,
"upload": False,
},
)
# Load all documents
documents = dir_loader.load()
print(f"Loaded {len(documents)} documents")
# List created artifacts
artifacts = project.list_artifacts(kind="document")
matching_artifacts = [
art for art in artifacts if art["metadata"]["key"].startswith("dir_doc")
]
print("Created artifacts:", [art["metadata"]["key"] for art in matching_artifacts])
Loaded 2 documents
Created artifacts: ['dir_doctmp_tmpxonub3vj_tmpuxptzapt.txt', 'dir_doctmp_tmpxonub3vj_tmpglg9pznc.txt']
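Each of these artifacts behaves like the single-document case shown earlier: for example, you can fetch one back with get_artifact and convert it to LangChain documents with to_langchain_documents, both used above:
# Fetch one of the directory artifacts and convert it back to LangChain documents
first_key = matching_artifacts[0]["metadata"]["key"]
doc_artifact = project.get_artifact(first_key)
print(doc_artifact.to_langchain_documents()[0].page_content)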