Model monitoring using LLM#

Maintaining the performance of machine learning models in production is essential.
Model monitoring tracks key metrics such as accuracy, latency, and resource usage to identify issues such as data drift and model decay.
Recently, large language models (LLMs) have been used as evaluators, offering nuanced feedback on model outputs.

This notebook guides you through setting up an effective model monitoring system that leverages LLMs to maintain high standards for deployed models.
It demonstrates how to prepare and evaluate a good prompt for the LLM judge, deploy the model-monitoring applications, assess the performance of a pre-trained model, fine-tune it using the ORPO technique on the supplied dataset, and finally show the monitoring results for the fine-tuned model.

Tutorial steps

Setup#

%pip install -U datasets trl peft bitsandbytes sentencepiece mlrun openai

Get a Hugging Face token from Hugging Face, and OpenAI credentials from Platform McKinsey:

openai_base_url = ""
openai_api_key = ""
hugging_face_token = ""
from datasets import load_dataset
from genai_monit_src.llm_as_a_judge import OpenAIJudge
import os
import pandas as pd
from tqdm.notebook import tqdm
import mlrun

os.environ["OPENAI_API_KEY"] = openai_api_key
os.environ["OPENAI_BASE_URL"] = openai_base_url
os.environ["HF_TOKEN"] = hugging_face_token
# Create the project:
project = mlrun.get_or_create_project(
    "tutorial",
    parameters={
        "default_image": "gcr.io/iguazio/llm-serving:1.7.0",
    },
    context="./genai_monit_src",
)
> 2024-09-17 11:36:45,712 [info] Project loaded successfully: {"project_name":"tutorial"}
# Deploy all the real-time monitoring functions:
project.set_model_monitoring_credentials(
    os.environ["V3IO_ACCESS_KEY"],
    "v3io",
    "v3io",
    "v3io",
)
project.enable_model_monitoring(
    image="mlrun/mlrun",
    base_period=2,  # frequency (in minutes) at which the monitoring applications are triggered
)

Load the banking dataset#

This example uses a small dataset to teach the model to answer only banking-related questions.
Each record includes a prompt, an accepted (chosen) answer, and a rejected answer on the topic of banking.
In addition to the banking-related prompts, the dataset contains guardrail prompts that teach the model not to answer unrelated questions.
This dataset is also used later to train the model using ORPO.

# From hugging face hub:
dataset_name = "mlrun/banking-orpo"
dataset = load_dataset(dataset_name, split="train")
dataset = dataset.shuffle(seed=42)

Take a look at the dataset:

pd.set_option("display.max_colwidth", None)
df = dataset.to_pandas()
df.head()
| | prompt | rejected | score | chosen |
|---|---|---|---|---|
| 0 | Which animal is known for its ability to swim against strong ocean currents? | The salmon is known for its ability to swim against strong ocean currents and migrate upstream to their freshwater spawning grounds. | 0 | As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with? |
| 1 | How does a credit card work? | A credit card makes money grow in a magic pot each time you swipe it. | 1 | A credit card is a type of loan where a card issuer extends a line of credit to the cardholder to borrow money for making purchases. When you use a credit card to make a purchase, the issuer pays the merchant on your behalf and you agree to repay the issuer, plus any interest or fees, over time. |
| 2 | In what year did the Mongol warrior Genghis Khan die? | Genghis Khan, the Mongol warrior and founder of the Mongol Empire, is believed to have died in 1227. | 0 | As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with? |
| 3 | What is the largest species of salamander? | The Chinese giant salamander is considered the largest species of salamander, with adults reaching lengths of up to 5 feet | 0 | As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with? |
| 4 | How to make a budget-friendly 30-minute dinner? | Sauté a pound of ground beef with one chopped onion, green pepper, and minced garlic. Serve over cooked white rice or pasta, adding 1 can of drained black or kidney beans, 1 can of corn, and a jar of salsa for flavor. Top with shredded cheese or sour cream, if desired. | 0 | As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with? |

Prepare the LLM as a judge#

Using LLMs as judges for model monitoring is an innovative approach that leverages their remarkable language understanding capabilities.
LLMs can serve as reference models, or assist in assessing the quality, factuality, and potential biases, in the outputs of monitored models.
This approach offers scalability, consistency, adaptability, and cost-effectiveness, and enables robust and continuous monitoring of language models.

First, create a function to evaluate the LLM-judge's accuracy:

def compute_accuracy(col1, col2):
    # Calculate the number of matching values
    matching_values = sum(col1 == col2)

    # Calculate the total number of values
    total_values = len(col1)

    # Return the fraction of matching values
    return matching_values / total_values
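As a quick sanity check, the helper (repeated here so the snippet is self-contained) can be exercised on toy labels; it agrees in 3 of 4 positions:

```python
import pandas as pd


def compute_accuracy(col1, col2):
    # Fraction of positions where the two columns agree
    return sum(col1 == col2) / len(col1)


labels = pd.Series([1, 0, 1, 1])
predictions = pd.Series([1, 0, 0, 1])
print(compute_accuracy(labels, predictions))  # 0.75
```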

Now, prepare the dataset for evaluation. Take 10% of the data and split it into two:

  • The first portion contains questions and answers as expected, meaning that the answers are taken from the chosen column.

  • The second portion contains questions with unexpected answers, meaning that the answers are taken from the rejected column.

# Take only 10% of the data:
orpo_dataset = dataset.to_pandas().sample(frac=0.1, random_state=42, ignore_index=True)
middle_index = len(orpo_dataset) // 2

# Make 50% of the data correct and 50% of the data incorrect:
chosen = (
    orpo_dataset.iloc[:middle_index]
    .rename(columns={"prompt": "question", "chosen": "answer"})
    .drop("rejected", axis=1)
)
rejected = (
    orpo_dataset.iloc[middle_index:]
    .rename(columns={"prompt": "question", "rejected": "answer"})
    .drop("chosen", axis=1)
)

chosen["score"] = 1
rejected["score"] = 0

evaluate_dataset = pd.concat([chosen, rejected])
labels = evaluate_dataset["score"]

And here is the eval set:

evaluate_dataset.head()
| | question | score | answer |
|---|---|---|---|
| 0 | What are the key challenges facing the education system today? | 1 | As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with? |
| 1 | What is an artificial neural network? | 1 | As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with? |
| 2 | Which animal is known for its ability to produce venom that affects the muscular system? | 1 | As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with? |
| 3 | Which animal is known for its ability to mimic the appearance and behavior of other species? | 1 | As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with? |
| 4 | Which animal can hold its breath the longest? | 1 | As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with? |

First run#

First, try a naive, basic prompt for the LLM judge:

bad_banking_template = """
1 score if the model answers for banking questions, 0 score otherwise
The question:
{question}
The answer:
{answer}
Answer the following question and return as a python dictionary:
{{"score": <a numerical score for the response>,
"explanation": <a string value of an explanation about how did you determine the score>}}
[Output]:
"""

Use a judging class, OpenAIJudge, that calls OpenAI GPT:

# Creating the OpenAI Judge
judge = OpenAIJudge(
    judge_type="custom-grading",
    metric_name="Restrict-to-banking",
    model_name="gpt-4",
    prompt_template=bad_banking_template,
    verbose=False,
)

Call the judge, and then the accuracy function to get the score:

first_attempt_result = judge.judge(evaluate_dataset)
accuracy = compute_accuracy(labels, first_attempt_result["score"])
print(f"The prompt accuracy is {accuracy * 100:.2f}%")
The prompt accuracy is 51.35%

These results are poor; the next run uses a more detailed template to achieve better accuracy.

Second run#

This run uses a different judge type: single-grading. Its prompt template is much more detailed, configured via a prompt_config with the following fields:

  • name: The name of the metric.

  • definition: A short definition of the metric.

  • rubric: The rubric explanation for each score.

  • examples: This section is very important: it gives an example for each rubric case. It includes one banking question and one non-banking question, each shown with an incorrect answer (score 0) and a correct answer (score 1). This helps the judge make the right decision.

restrict_to_banking_config = {
    "name": "Restrict to banking",
    "definition": "The Restrict to banking is a metric used to evaluate if a model responds exclusively to questions pertaining to banking topics.",
    "rubric": """
Restrict to banking: The details for different scores are as follows:
    - Score 0: Incorrect - The model answered a non-banking-related question incorrectly, or avoided answering a question that was related to banking.
    - Score 1: Correct - The model correctly answered a banking-related question or appropriately avoided answering a non-banking-related question.
""",
    "examples": """
Question: What is the process to apply for a mortgage?
    Score 0: Incorrect
    Answer: "As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?"
    Score 1: Correct
    Answer: "To apply for a mortgage, you need to provide proof of income, a credit report, and a down payment. You can start by contacting a bank or a mortgage lender."
Question: What is the best recipe for chocolate cake?
    Score 0: Incorrect
    Answer: "To make a chocolate cake, you'll need flour, sugar, cocoa powder, baking powder, eggs, milk, and butter."
    Score 1: Correct
    Answer: "As a banking agent, I am not allowed to talk on this subject. Is there anything else I can help with?"
""",
}

Now run the same process as before:

judge = OpenAIJudge(
    judge_type="single-grading",
    metric_name="Restrict-to-banking",
    model_name="gpt-4",
    prompt_config=restrict_to_banking_config,
    verbose=False,
)
second_attempt_result = judge.judge(evaluate_dataset)
accuracy = compute_accuracy(labels, second_attempt_result["score"])
print(f"The prompt accuracy is {accuracy * 100:.2f}%")
The prompt accuracy is 100.00%

Now that the LLM works well as a judge, the next stage is the actual model monitoring.

Model monitoring#

Deploying the model monitoring application#

First, deploy the model monitoring application: LLM As A Judge.

application = project.set_model_monitoring_function(
    func="genai_monit_src/llm_as_a_judge.py",
    application_class="LLMAsAJudgeApplication",
    name="llm-as-a-judge",
    image="gcr.io/iguazio/llm-as-a-judge:1.7.0",
    framework="openai",
    judge_type="single-grading",
    metric_name="restrict_to_banking",
    model_name="gpt-4",
    prompt_config=restrict_to_banking_config,
)
project.deploy_function(application)
> 2024-09-17 11:45:01,062 [info] Starting remote function deploy
2024-09-17 11:45:01  (info) Deploying function
2024-09-17 11:45:01  (info) Building
2024-09-17 11:45:02  (info) Staging files and preparing base images
2024-09-17 11:45:02  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2024-09-17 11:45:02  (info) Building processor image
2024-09-17 11:47:27  (info) Build complete
2024-09-17 11:48:06  (info) Function deploy complete
> 2024-09-17 11:48:14,804 [info] Successfully deployed function: {"external_invocation_urls":[],"internal_invocation_urls":["nuclio-tutorial-llm-as-a-judge.default-tenant.svc.cluster.local:8080"]}
DeployStatus(state=ready, outputs={'endpoint': 'http://nuclio-tutorial-llm-as-a-judge.default-tenant.svc.cluster.local:8080', 'name': 'tutorial-llm-as-a-judge'})

Deploy the model server#

For this section, first load the base model from the Hugging Face hub. This demo uses the gemma-2b model by Google.

import random
from mlrun.features import Feature

base_model = "google-gemma-2b"
project.log_model(
    base_model,
    model_file="genai_monit_src/model-iris.pkl",
    inputs=[Feature(value_type="str", name="question")],
    outputs=[Feature(value_type="str", name="answer")],
)
<mlrun.artifacts.model.ModelArtifact at 0x7f0bcdffa8b0>
# Load the serving function to evaluate the base model
serving_function = project.get_function("llm-server")
serving_function.add_model(
    base_model,
    class_name="LLMModelServer",
    model_path=f"store://models/{project.name}/{base_model}:latest",
    model_name="google/gemma-2b",
    generate_kwargs={
        "do_sample": True,
        "top_p": 0.9,
        "num_return_sequences": 1,
        "max_length": 80,
    },
    device_map="cuda:0",
)
serving_function.set_tracking()

If you want to test the serving function locally before deploying, run the code below. You will likely need a local GPU to run this model.

server = serving_function.to_mock_server()
server.test(f"/v2/models/{base_model}/infer", {"inputs": ["what is a mortgage?"]})

Continue with:

deployment = serving_function.deploy()
> 2024-09-17 11:48:27,715 [info] Starting remote function deploy
2024-09-17 11:48:28  (info) Deploying function
2024-09-17 11:48:28  (info) Building
2024-09-17 11:48:28  (info) Staging files and preparing base images
2024-09-17 11:48:28  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2024-09-17 11:48:28  (info) Building processor image
2024-09-17 11:48:33  (warn) Kaniko pod received a warning event
2024-09-17 11:50:00  (warn) Kaniko pod received a warning event
2024-09-17 11:56:51  (info) Build complete
2024-09-17 12:01:48  (info) Function deploy complete
> 2024-09-17 12:01:56,687 [info] Successfully deployed function: {"external_invocation_urls":["tutorial-llm-server.default-tenant.app.llm-dev.iguazio-cd1.com/"],"internal_invocation_urls":["nuclio-tutorial-llm-server.default-tenant.svc.cluster.local:8080"]}

Configure an alert#

Define an alert to be triggered on degradation of model performance.

import mlrun.common.schemas.alert as alert_constants
from mlrun.model_monitoring.helpers import get_result_instance_fqn
app_name = "llm-as-a-judge"
result_name = "restrict_to_banking"
message = "Model perf detected"
alert_config_name = "restrict_to_banking alert"
dummy_url = "dummy-webhook.default-tenant.app.llm-dev.iguazio-cd1.com"
# Get EP id
endpoints = mlrun.get_run_db().list_model_endpoints(project=project.name, model="")
ep_id = endpoints[0].metadata.uid
prj_alert_obj = get_result_instance_fqn(
    ep_id, app_name=app_name, result_name=result_name
)

webhook_notification = mlrun.common.schemas.Notification(
    name="webhook",
    kind="webhook",
    params={"url": dummy_url},
    when=["completed", "error"],
    severity="debug",
    message="Model perf detected",
    condition="",
)
alert_config = mlrun.alerts.alert.AlertConfig(
    project=project.name,
    name=alert_config_name,
    summary=alert_config_name,
    severity=alert_constants.AlertSeverity.HIGH,
    entities=alert_constants.EventEntities(
        kind=alert_constants.EventEntityKind.MODEL_ENDPOINT_RESULT,
        project=project.name,
        ids=[prj_alert_obj],
    ),
    trigger=alert_constants.AlertTrigger(
        events=["model_performance_detected", "model_performance_suspected"]
    ),
    criteria=alert_constants.AlertCriteria(count=1, period="10m"),
    notifications=[
        alert_constants.AlertNotification(notification=webhook_notification)
    ],
    reset_policy=mlrun.common.schemas.alert.ResetPolicy.MANUAL,
)
project.store_alert_config(alert_config)
<mlrun.alerts.alert.AlertConfig at 0x7f0bcdff78b0>

Check the performance of the base model#

To evaluate the base model, send it a number of questions, both banking-related and not:

example_questions = [
    "What is a mortgage?",
    "How does a credit card work?",
    "Who painted the Mona Lisa?",
    "Please plan me a 4-days trip to north Italy",
    "Write me a song",
    "How much people are there in the world?",
    "What is climate change?",
    "How does the stock market work?",
    "Who wrote 'To Kill a Mockingbird'?",
    "Please plan me a 3-day trip to Paris",
    "Write me a poem about the ocean",
    "How many continents are there in the world?",
    "What is artificial intelligence?",
    "How does a hybrid car work?",
    "Who invented the telephone?",
    "Please plan me a week-long trip to New Zealand",
]

The monitoring application runs periodically, triggered at the configured interval, so create a function that sends the questions to the model with a short pause between requests:

import time


def question_model(questions, serving_function, base_model):
    for question in questions:
        # Invoke the deployed model:
        serving_function.invoke(
            path=f"/v2/models/{base_model}/infer",
            body={"inputs": [question]},
        )
        # Pause between requests so the traffic spreads over the monitoring window:
        time.sleep(1)


# Keep generating traffic so the periodic monitoring application has data to process:
while True:
    question_model(
        questions=example_questions,
        serving_function=serving_function,
        base_model=base_model,
    )

The Grafana model monitoring page shows the base model's scores:

Model monitor before

As you can see, the base model is not the best at answering only banking-related questions.

Build a dataset according to ORPO structure#

To fine-tune the model, take the requests that were sent to the model (questions both related and unrelated to banking), build a dataset according to the ORPO structure (prompt, score, chosen, rejected), and then re-train the model with it.

The result is a fine-tuned model that answers only banking-related questions.
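The transformation can be sketched as follows. This is a hypothetical helper, not the actual generate-ds job: the record layout and the refusal text follow the dataset shown earlier, and the judge's score decides which answer becomes `chosen`:

```python
REFUSAL = (
    "As a banking agent, I am not allowed to talk on this subject. "
    "Is there anything else I can help with?"
)


def to_orpo_row(question, answer, score):
    """Turn one judged request into an ORPO preference record (hypothetical helper)."""
    if score == 1:
        # The judge approved the answer, so it is the preferred completion;
        # refusing a legitimate banking question would be the rejected behavior.
        return {"prompt": question, "chosen": answer, "rejected": REFUSAL, "score": 1}
    # The judge rejected the answer: the polite refusal becomes the preferred completion.
    return {"prompt": question, "chosen": REFUSAL, "rejected": answer, "score": 0}


row = to_orpo_row("Who painted the Mona Lisa?", "Leonardo da Vinci painted it.", 0)
print(row["chosen"])  # the refusal text
```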

First fetch the data from the initial traffic to the model:

datasets = project.list_artifacts(kind="dataset")
ds_key = datasets[0]["spec"]["db_key"]
input_ds = f"store://datasets/{project.name}/{ds_key}"

Build the dataset:

ret = project.run_function(
    function="generate-ds",
    handler="generate_ds",
    params={"input_ds": input_ds},
    outputs=["new-train-ds", "dataset"],
)
> 2024-09-16 13:20:48,127 [info] Storing function: {"db":"http://mlrun-api:8080","name":"generate-ds-generate-ds","uid":"c6c242bdb57d4bcc8efdf6213eb3a313"}
> 2024-09-16 13:20:48,396 [info] Job is running in the background, pod: generate-ds-generate-ds-vwnk5
> 2024-09-16 13:25:07,957 [info] OpenAI client created
> 2024-09-16 13:25:07,999 [info] Input dataset fetched
> 2024-09-16 13:32:46,056 [info] score, chosen and rejected populated
> 2024-09-16 13:32:46,119 [info] Dataframe logged
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful
Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 1650.65ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:00<00:00,  2.58it/s]
> 2024-09-16 13:32:47,223 [info] Dataset uploaded to HF
> 2024-09-16 13:32:47,301 [info] To track results use the CLI: {"info_cmd":"mlrun get run c6c242bdb57d4bcc8efdf6213eb3a313 -p tutorial","logs_cmd":"mlrun logs c6c242bdb57d4bcc8efdf6213eb3a313 -p tutorial"}
> 2024-09-16 13:32:47,301 [info] Or click for UI: {"ui_url":"https://dashboard.default-tenant.app.llm-dev.iguazio-cd1.com/mlprojects/tutorial/jobs/monitor/c6c242bdb57d4bcc8efdf6213eb3a313/overview"}
> 2024-09-16 13:32:47,302 [info] Run execution finished: {"name":"generate-ds-generate-ds","status":"completed"}
project uid iter start state kind name labels inputs parameters results artifacts
tutorial 0 Sep 16 13:25:07 completed run generate-ds-generate-ds
v3io_user=edmond
kind=job
owner=edmond
mlrun/client_version=1.7.0-rc40
mlrun/client_python_version=3.9.18
host=generate-ds-generate-ds-vwnk5
input_ds=store://datasets/tutorial/llm-as-a-judge-logger_restrict_to_banking
dataset=mlrun/banking-orpo-new
new-train-ds

> to track results use the .show() or .logs() methods or click here to open in UI
> 2024-09-16 13:32:56,821 [info] Run execution finished: {"name":"generate-ds-generate-ds","status":"completed"}
ret.outputs
{'dataset': 'mlrun/banking-orpo-new',
 'new-train-ds': 'store://artifacts/mm-demo-edmond/generate-ds-generate-ds_new-train-ds:latest@b7dadc94375e41429c9ad913ec26d89c'}

There is now a new dataset for the model tuning, stored in Hugging Face.

Fine-tuning the model with ORPO#

Now, fine-tune the model using the ORPO algorithm, so that the model only answers the banking-related questions.

ORPO (Odds Ratio Preference Optimization) is a method designed to simplify and improve the process of fine-tuning language models to align with user preferences.
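At its core, ORPO adds an odds-ratio penalty to the standard language-modeling loss, pushing the model to assign higher odds to the chosen completion than to the rejected one. A minimal numeric sketch, where scalar sequence probabilities stand in for the model's likelihoods and `lam` is an illustrative weight:

```python
import math


def odds(p):
    # Odds of generating a completion whose probability is p
    return p / (1.0 - p)


def orpo_loss(nll_chosen, p_chosen, p_rejected, lam=0.1):
    # Log odds ratio between the chosen and rejected completions
    log_or = math.log(odds(p_chosen)) - math.log(odds(p_rejected))
    # -log sigmoid(log odds ratio): shrinks toward 0 as the chosen
    # completion becomes more likely than the rejected one
    penalty = -math.log(1.0 / (1.0 + math.exp(-log_or)))
    return nll_chosen + lam * penalty


# The penalty is large when the model prefers the rejected answer:
print(orpo_loss(2.0, p_chosen=0.2, p_rejected=0.8))
print(orpo_loss(2.0, p_chosen=0.8, p_rejected=0.2))
```

This mirrors the `nll_loss` and `log_odds_ratio` terms that appear in the training logs.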

project.run_function(
    function="train",
    params={
        "dataset": "mlrun/banking-orpo-opt",
        "base_model": "google/gemma-2b",
        "new_model": "mlrun/gemma-2b-bank-v0.2",
        "device": "cuda:0",
    },
    handler="train",
    outputs=["model"],
)
> 2024-09-17 12:10:33,253 [info] Storing function: {"db":"http://mlrun-api:8080","name":"train-train","uid":"4a01d76a81204ccca98d752b716f4572"}
> 2024-09-17 12:10:33,559 [info] Job is running in the background, pod: train-train-p69gn
Downloading data: 100%|██████████| 267k/267k [00:00<00:00, 3.27MB/s]
Generating train split: 100%|██████████| 786/786 [00:00<00:00, 66040.12 examples/s]
Downloading shards: 100%|██████████| 2/2 [00:10<00:00,  5.32s/it]
`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00,  5.36s/it]
Map: 100%|██████████| 778/778 [00:00<00:00, 1303.27 examples/s]
Map: 100%|██████████| 8/8 [00:00<00:00, 649.65 examples/s]
> 2024-09-17 12:14:58,656 [info] training 'mlrun/gemma-2b-bank-v0.2' based on 'google/gemma-2b'
  0%|          | 0/582 [00:00<?, ?it/s]torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 6.1182, 'grad_norm': 3.0670859813690186, 'learning_rate': 4.982758620689655e-05, 'rewards/chosen': -0.6183935403823853, 'rewards/rejected': -0.35467928647994995, 'rewards/accuracies': 0.125, 'rewards/margins': -0.2637142837047577, 'logps/rejected': -1.7733964920043945, 'logps/chosen': -3.091967821121216, 'logits/rejected': -30.5781307220459, 'logits/chosen': -31.13714027404785, 'nll_loss': 5.769082546234131, 'log_odds_ratio': -1.745775580406189, 'log_odds_chosen': -1.4346518516540527, 'epoch': 0.02}
{'loss': 6.1544, 'grad_norm': 14.930131912231445, 'learning_rate': 4.9482758620689655e-05, 'rewards/chosen': -0.5878166556358337, 'rewards/rejected': -0.37866175174713135, 'rewards/accuracies': 0.0625, 'rewards/margins': -0.20915493369102478, 'logps/rejected': -1.8933086395263672, 'logps/chosen': -2.9390833377838135, 'logits/rejected': -32.819515228271484, 'logits/chosen': -31.26419448852539, 'nll_loss': 5.857130527496338, 'log_odds_ratio': -1.486180067062378, 'log_odds_chosen': -1.166896939277649, 'epoch': 0.04}
{'loss': 5.5158, 'grad_norm': 27.507352828979492, 'learning_rate': 4.913793103448276e-05, 'rewards/chosen': -0.5169739127159119, 'rewards/rejected': -0.4240192174911499, 'rewards/accuracies': 0.125, 'rewards/margins': -0.09295468032360077, 'logps/rejected': -2.12009596824646, 'logps/chosen': -2.584869384765625, 'logits/rejected': -33.449256896972656, 'logits/chosen': -31.30010223388672, 'nll_loss': 5.306573867797852, 'log_odds_ratio': -1.0462219715118408, 'log_odds_chosen': -0.5180397033691406, 'epoch': 0.06}
{'loss': 4.2618, 'grad_norm': 6.359748840332031, 'learning_rate': 4.8793103448275864e-05, 'rewards/chosen': -0.4652470350265503, 'rewards/rejected': -0.37875664234161377, 'rewards/accuracies': 0.125, 'rewards/margins': -0.08649036288261414, 'logps/rejected': -1.8937832117080688, 'logps/chosen': -2.326234817504883, 'logits/rejected': -31.665985107421875, 'logits/chosen': -31.51230812072754, 'nll_loss': 4.059521675109863, 'log_odds_ratio': -1.011521577835083, 'log_odds_chosen': -0.4982942044734955, 'epoch': 0.08}
{'loss': 3.6196, 'grad_norm': 2.599546432495117, 'learning_rate': 4.844827586206897e-05, 'rewards/chosen': -0.3858947157859802, 'rewards/rejected': -0.37861311435699463, 'rewards/accuracies': 0.4375, 'rewards/margins': -0.007281584665179253, 'logps/rejected': -1.8930654525756836, 'logps/chosen': -1.9294734001159668, 'logits/rejected': -31.418174743652344, 'logits/chosen': -30.50168228149414, 'nll_loss': 3.4601383209228516, 'log_odds_ratio': -0.7974957823753357, 'log_odds_chosen': -0.06545962393283844, 'epoch': 0.1}
{'loss': 3.1398, 'grad_norm': 7.041106224060059, 'learning_rate': 4.810344827586207e-05, 'rewards/chosen': -0.3072519600391388, 'rewards/rejected': -0.3886587917804718, 'rewards/accuracies': 0.75, 'rewards/margins': 0.08140681684017181, 'logps/rejected': -1.9432936906814575, 'logps/chosen': -1.5362597703933716, 'logits/rejected': -30.128803253173828, 'logits/chosen': -29.256145477294922, 'nll_loss': 3.0349717140197754, 'log_odds_ratio': -0.5239506363868713, 'log_odds_chosen': 0.4729006290435791, 'epoch': 0.12}
{'loss': 2.7229, 'grad_norm': 4.204710006713867, 'learning_rate': 4.7758620689655176e-05, 'rewards/chosen': -0.2365841269493103, 'rewards/rejected': -0.41585084795951843, 'rewards/accuracies': 1.0, 'rewards/margins': 0.17926675081253052, 'logps/rejected': -2.079254388809204, 'logps/chosen': -1.1829205751419067, 'logits/rejected': -29.4835147857666, 'logits/chosen': -29.336589813232422, 'nll_loss': 2.6587560176849365, 'log_odds_ratio': -0.320812463760376, 'log_odds_chosen': 1.1192080974578857, 'epoch': 0.14}
{'loss': 2.1895, 'grad_norm': 5.583032608032227, 'learning_rate': 4.741379310344828e-05, 'rewards/chosen': -0.14604505896568298, 'rewards/rejected': -0.38777291774749756, 'rewards/accuracies': 1.0, 'rewards/margins': 0.2417278289794922, 'logps/rejected': -1.9388644695281982, 'logps/chosen': -0.7302252054214478, 'logits/rejected': -29.517881393432617, 'logits/chosen': -25.721038818359375, 'nll_loss': 2.1550769805908203, 'log_odds_ratio': -0.17223303020000458, 'log_odds_chosen': 1.746340036392212, 'epoch': 0.16}
{'loss': 2.656, 'grad_norm': 2.614600896835327, 'learning_rate': 4.7068965517241385e-05, 'rewards/chosen': -0.14438894391059875, 'rewards/rejected': -0.406002938747406, 'rewards/accuracies': 1.0, 'rewards/margins': 0.26161396503448486, 'logps/rejected': -2.0300145149230957, 'logps/chosen': -0.7219446897506714, 'logits/rejected': -30.532073974609375, 'logits/chosen': -25.140825271606445, 'nll_loss': 2.624235153198242, 'log_odds_ratio': -0.15882012248039246, 'log_odds_chosen': 1.9751956462860107, 'epoch': 0.19}
{'loss': 2.2165, 'grad_norm': 2.5181803703308105, 'learning_rate': 4.672413793103448e-05, 'rewards/chosen': -0.050251495093107224, 'rewards/rejected': -0.3388191759586334, 'rewards/accuracies': 1.0, 'rewards/margins': 0.2885676622390747, 'logps/rejected': -1.6940958499908447, 'logps/chosen': -0.2512574791908264, 'logits/rejected': -30.47430419921875, 'logits/chosen': -20.899072647094727, 'nll_loss': 2.203143358230591, 'log_odds_ratio': -0.06662943214178085, 'log_odds_chosen': 2.774531364440918, 'epoch': 0.21}
{'loss': 1.82, 'grad_norm': 2.558284282684326, 'learning_rate': 4.6379310344827586e-05, 'rewards/chosen': -0.08968320488929749, 'rewards/rejected': -0.3940087854862213, 'rewards/accuracies': 1.0, 'rewards/margins': 0.3043256103992462, 'logps/rejected': -1.9700438976287842, 'logps/chosen': -0.44841599464416504, 'logits/rejected': -27.82843589782715, 'logits/chosen': -22.255130767822266, 'nll_loss': 1.799391746520996, 'log_odds_ratio': -0.10312303900718689, 'log_odds_chosen': 3.0233054161071777, 'epoch': 0.23}
{'loss': 1.5007, 'grad_norm': 3.296436071395874, 'learning_rate': 4.603448275862069e-05, 'rewards/chosen': -0.025682205334305763, 'rewards/rejected': -0.36834532022476196, 'rewards/accuracies': 1.0, 'rewards/margins': 0.34266310930252075, 'logps/rejected': -1.8417266607284546, 'logps/chosen': -0.12841102480888367, 'logits/rejected': -27.88317108154297, 'logits/chosen': -18.2511043548584, 'nll_loss': 1.4934508800506592, 'log_odds_ratio': -0.03612568974494934, 'log_odds_chosen': 4.20542049407959, 'epoch': 0.25}
{'loss': 0.9726, 'grad_norm': 1.8513051271438599, 'learning_rate': 4.5689655172413794e-05, 'rewards/chosen': -0.018511006608605385, 'rewards/rejected': -0.35770857334136963, 'rewards/accuracies': 1.0, 'rewards/margins': 0.3391975462436676, 'logps/rejected': -1.7885427474975586, 'logps/chosen': -0.09255501627922058, 'logits/rejected': -28.17961311340332, 'logits/chosen': -18.09923553466797, 'nll_loss': 0.9690979719161987, 'log_odds_ratio': -0.017673898488283157, 'log_odds_chosen': 5.097628116607666, 'epoch': 0.27}
{'loss': 1.4772, 'grad_norm': 2.551316499710083, 'learning_rate': 4.53448275862069e-05, 'rewards/chosen': -0.018640989437699318, 'rewards/rejected': -0.33631542325019836, 'rewards/accuracies': 1.0, 'rewards/margins': 0.3176743984222412, 'logps/rejected': -1.681577205657959, 'logps/chosen': -0.09320494532585144, 'logits/rejected': -25.76122283935547, 'logits/chosen': -15.954239845275879, 'nll_loss': 1.4712653160095215, 'log_odds_ratio': -0.029524868354201317, 'log_odds_chosen': 5.882140636444092, 'epoch': 0.29}
{'loss': 1.2246, 'grad_norm': 2.378370761871338, 'learning_rate': 4.5e-05, 'rewards/chosen': -0.003424903843551874, 'rewards/rejected': -0.41281938552856445, 'rewards/accuracies': 1.0, 'rewards/margins': 0.40939444303512573, 'logps/rejected': -2.064096689224243, 'logps/chosen': -0.017124518752098083, 'logits/rejected': -25.82964324951172, 'logits/chosen': -15.26607894897461, 'nll_loss': 1.2239866256713867, 'log_odds_ratio': -0.003224594285711646, 'log_odds_chosen': 6.866770267486572, 'epoch': 0.31}
{'loss': 0.7658, 'grad_norm': 1.8221919536590576, 'learning_rate': 4.465517241379311e-05, 'rewards/chosen': -0.0007769691874273121, 'rewards/rejected': -0.46698325872421265, 'rewards/accuracies': 1.0, 'rewards/margins': 0.46620631217956543, 'logps/rejected': -2.334916353225708, 'logps/chosen': -0.0038848461117595434, 'logits/rejected': -23.35291290283203, 'logits/chosen': -15.75510025024414, 'nll_loss': 0.7657329440116882, 'log_odds_ratio': -0.00040753529174253345, 'log_odds_chosen': 8.474946975708008, 'epoch': 0.33}
{'loss': 1.3105, 'grad_norm': 2.02557635307312, 'learning_rate': 4.431034482758621e-05, 'rewards/chosen': -0.034159399569034576, 'rewards/rejected': -0.4102223515510559, 'rewards/accuracies': 1.0, 'rewards/margins': 0.37606295943260193, 'logps/rejected': -2.0511116981506348, 'logps/chosen': -0.17079700529575348, 'logits/rejected': -25.842199325561523, 'logits/chosen': -15.310283660888672, 'nll_loss': 1.3047212362289429, 'log_odds_ratio': -0.02871461771428585, 'log_odds_chosen': 7.689267158508301, 'epoch': 0.35}
{'loss': 1.32, 'grad_norm': 1.4088550806045532, 'learning_rate': 4.396551724137931e-05, 'rewards/chosen': -0.017421165481209755, 'rewards/rejected': -0.41568508744239807, 'rewards/accuracies': 1.0, 'rewards/margins': 0.39826393127441406, 'logps/rejected': -2.078425407409668, 'logps/chosen': -0.08710583299398422, 'logits/rejected': -26.19017791748047, 'logits/chosen': -14.00323486328125, 'nll_loss': 1.3146398067474365, 'log_odds_ratio': -0.02684803493320942, 'log_odds_chosen': 8.574756622314453, 'epoch': 0.37}
{'loss': 1.1427, 'grad_norm': 1.3797621726989746, 'learning_rate': 4.362068965517241e-05, 'rewards/chosen': -0.00047393899876624346, 'rewards/rejected': -0.45108455419540405, 'rewards/accuracies': 1.0, 'rewards/margins': 0.45061060786247253, 'logps/rejected': -2.255422830581665, 'logps/chosen': -0.002369695110246539, 'logits/rejected': -27.480154037475586, 'logits/chosen': -15.021084785461426, 'nll_loss': 1.1426451206207275, 'log_odds_ratio': -0.00042276657768525183, 'log_odds_chosen': 9.080720901489258, 'epoch': 0.39}
{'loss': 1.1969, 'grad_norm': 2.642336368560791, 'learning_rate': 4.327586206896552e-05, 'rewards/chosen': -0.01746564544737339, 'rewards/rejected': -0.3989716172218323, 'rewards/accuracies': 1.0, 'rewards/margins': 0.3815059959888458, 'logps/rejected': -1.9948580265045166, 'logps/chosen': -0.0873282328248024, 'logits/rejected': -25.64813995361328, 'logits/chosen': -14.906251907348633, 'nll_loss': 1.1939884424209595, 'log_odds_ratio': -0.014435259625315666, 'log_odds_chosen': 8.679214477539062, 'epoch': 0.41}
{'loss': 1.0342, 'grad_norm': 2.18595552444458, 'learning_rate': 4.293103448275863e-05, 'rewards/chosen': -0.04434844106435776, 'rewards/rejected': -0.4158630669116974, 'rewards/accuracies': 1.0, 'rewards/margins': 0.37151461839675903, 'logps/rejected': -2.079315185546875, 'logps/chosen': -0.22174221277236938, 'logits/rejected': -26.042339324951172, 'logits/chosen': -18.596323013305664, 'nll_loss': 1.025839924812317, 'log_odds_ratio': -0.04168698191642761, 'log_odds_chosen': 7.684491157531738, 'epoch': 0.43}
{'loss': 1.0659, 'grad_norm': 1.651231288909912, 'learning_rate': 4.2586206896551725e-05, 'rewards/chosen': -0.026632994413375854, 'rewards/rejected': -0.4153710603713989, 'rewards/accuracies': 1.0, 'rewards/margins': 0.3887380361557007, 'logps/rejected': -2.076855182647705, 'logps/chosen': -0.13316497206687927, 'logits/rejected': -26.671844482421875, 'logits/chosen': -17.12417221069336, 'nll_loss': 1.0627906322479248, 'log_odds_ratio': -0.015393907204270363, 'log_odds_chosen': 8.306943893432617, 'epoch': 0.45}
{'loss': 0.9519, 'grad_norm': 1.862420916557312, 'learning_rate': 4.224137931034483e-05, 'rewards/chosen': -0.00018784217536449432, 'rewards/rejected': -0.4477942883968353, 'rewards/accuracies': 1.0, 'rewards/margins': 0.44760650396347046, 'logps/rejected': -2.238971710205078, 'logps/chosen': -0.0009392108768224716, 'logits/rejected': -27.957942962646484, 'logits/chosen': -16.515411376953125, 'nll_loss': 0.9519180059432983, 'log_odds_ratio': -0.0001289438660023734, 'log_odds_chosen': 9.791584014892578, 'epoch': 0.47}
{'loss': 1.2379, 'grad_norm': 1.3508518934249878, 'learning_rate': 4.1896551724137934e-05, 'rewards/chosen': -0.01318134181201458, 'rewards/rejected': -0.40799134969711304, 'rewards/accuracies': 1.0, 'rewards/margins': 0.3948099911212921, 'logps/rejected': -2.039956569671631, 'logps/chosen': -0.06590671092271805, 'logits/rejected': -27.936782836914062, 'logits/chosen': -16.27548599243164, 'nll_loss': 1.2359123229980469, 'log_odds_ratio': -0.010032093152403831, 'log_odds_chosen': 8.227860450744629, 'epoch': 0.49}
{'loss': 0.947, 'grad_norm': 1.3083765506744385, 'learning_rate': 4.155172413793104e-05, 'rewards/chosen': -0.015769241377711296, 'rewards/rejected': -0.4452388882637024, 'rewards/accuracies': 1.0, 'rewards/margins': 0.42946964502334595, 'logps/rejected': -2.226194381713867, 'logps/chosen': -0.07884619385004044, 'logits/rejected': -24.695844650268555, 'logits/chosen': -16.586769104003906, 'nll_loss': 0.9425266981124878, 'log_odds_ratio': -0.02248498797416687, 'log_odds_chosen': 9.009478569030762, 'epoch': 0.51}
{'loss': 1.1639, 'grad_norm': 1.5532411336898804, 'learning_rate': 4.120689655172414e-05, 'rewards/chosen': -0.04044137895107269, 'rewards/rejected': -0.42138928174972534, 'rewards/accuracies': 1.0, 'rewards/margins': 0.38094788789749146, 'logps/rejected': -2.1069464683532715, 'logps/chosen': -0.20220687985420227, 'logits/rejected': -24.27075958251953, 'logits/chosen': -17.820777893066406, 'nll_loss': 1.1553897857666016, 'log_odds_ratio': -0.04269656538963318, 'log_odds_chosen': 8.000324249267578, 'epoch': 0.53}
{'loss': 1.039, 'grad_norm': 1.5064359903335571, 'learning_rate': 4.086206896551724e-05, 'rewards/chosen': -0.03181877359747887, 'rewards/rejected': -0.3758518695831299, 'rewards/accuracies': 1.0, 'rewards/margins': 0.34403306245803833, 'logps/rejected': -1.8792591094970703, 'logps/chosen': -0.15909385681152344, 'logits/rejected': -26.19584083557129, 'logits/chosen': -17.330793380737305, 'nll_loss': 1.0299137830734253, 'log_odds_ratio': -0.04543835297226906, 'log_odds_chosen': 7.802408218383789, 'epoch': 0.56}
{'loss': 0.9308, 'grad_norm': 1.3602362871170044, 'learning_rate': 4.0517241379310344e-05, 'rewards/chosen': -0.00846340786665678, 'rewards/rejected': -0.3884789049625397, 'rewards/accuracies': 1.0, 'rewards/margins': 0.38001549243927, 'logps/rejected': -1.942394495010376, 'logps/chosen': -0.04231703653931618, 'logits/rejected': -27.138179779052734, 'logits/chosen': -15.478635787963867, 'nll_loss': 0.9293805360794067, 'log_odds_ratio': -0.007145875133574009, 'log_odds_chosen': 8.96664047241211, 'epoch': 0.58}
{'loss': 1.0191, 'grad_norm': 1.425428867340088, 'learning_rate': 4.0172413793103455e-05, 'rewards/chosen': -0.009000272490084171, 'rewards/rejected': -0.3756598234176636, 'rewards/accuracies': 1.0, 'rewards/margins': 0.366659551858902, 'logps/rejected': -1.8782992362976074, 'logps/chosen': -0.04500136151909828, 'logits/rejected': -26.390884399414062, 'logits/chosen': -15.590188026428223, 'nll_loss': 1.0170109272003174, 'log_odds_ratio': -0.01066731009632349, 'log_odds_chosen': 8.935309410095215, 'epoch': 0.6}
{'loss': 0.9819, 'grad_norm': 1.2405942678451538, 'learning_rate': 3.982758620689656e-05, 'rewards/chosen': -0.018098991364240646, 'rewards/rejected': -0.4150218963623047, 'rewards/accuracies': 1.0, 'rewards/margins': 0.39692291617393494, 'logps/rejected': -2.0751094818115234, 'logps/chosen': -0.09049495309591293, 'logits/rejected': -24.83953857421875, 'logits/chosen': -16.846792221069336, 'nll_loss': 0.9803045392036438, 'log_odds_ratio': -0.00784500502049923, 'log_odds_chosen': 9.23413372039795, 'epoch': 0.62}
{'loss': 1.1366, 'grad_norm': 2.1765756607055664, 'learning_rate': 3.9482758620689656e-05, 'rewards/chosen': -0.014743518084287643, 'rewards/rejected': -0.42165815830230713, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4069146513938904, 'logps/rejected': -2.108290910720825, 'logps/chosen': -0.07371758669614792, 'logits/rejected': -27.199644088745117, 'logits/chosen': -15.455587387084961, 'nll_loss': 1.1352300643920898, 'log_odds_ratio': -0.006853654980659485, 'log_odds_chosen': 9.797773361206055, 'epoch': 0.64}
{'loss': 1.014, 'grad_norm': 1.5311743021011353, 'learning_rate': 3.913793103448276e-05, 'rewards/chosen': -0.023545755073428154, 'rewards/rejected': -0.35138246417045593, 'rewards/accuracies': 1.0, 'rewards/margins': 0.32783669233322144, 'logps/rejected': -1.7569122314453125, 'logps/chosen': -0.11772876977920532, 'logits/rejected': -27.953351974487305, 'logits/chosen': -15.766828536987305, 'nll_loss': 1.01017165184021, 'log_odds_ratio': -0.0191391259431839, 'log_odds_chosen': 9.051002502441406, 'epoch': 0.66}
{'loss': 0.9313, 'grad_norm': 1.6334900856018066, 'learning_rate': 3.8793103448275865e-05, 'rewards/chosen': -0.02417631633579731, 'rewards/rejected': -0.45629164576530457, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4321153163909912, 'logps/rejected': -2.2814581394195557, 'logps/chosen': -0.12088155746459961, 'logits/rejected': -27.213375091552734, 'logits/chosen': -17.803531646728516, 'nll_loss': 0.92964768409729, 'log_odds_ratio': -0.008409632369875908, 'log_odds_chosen': 9.523890495300293, 'epoch': 0.68}
{'loss': 1.3122, 'grad_norm': 1.7914060354232788, 'learning_rate': 3.844827586206897e-05, 'rewards/chosen': -0.017403358593583107, 'rewards/rejected': -0.3779708743095398, 'rewards/accuracies': 1.0, 'rewards/margins': 0.36056745052337646, 'logps/rejected': -1.8898541927337646, 'logps/chosen': -0.08701679110527039, 'logits/rejected': -28.68146514892578, 'logits/chosen': -17.36962890625, 'nll_loss': 1.3107123374938965, 'log_odds_ratio': -0.0075085340067744255, 'log_odds_chosen': 7.863002300262451, 'epoch': 0.7}
{'loss': 1.0416, 'grad_norm': 1.7454752922058105, 'learning_rate': 3.8103448275862066e-05, 'rewards/chosen': -0.03507407382130623, 'rewards/rejected': -0.39473772048950195, 'rewards/accuracies': 1.0, 'rewards/margins': 0.35966360569000244, 'logps/rejected': -1.9736886024475098, 'logps/chosen': -0.17537038028240204, 'logits/rejected': -25.876012802124023, 'logits/chosen': -17.654502868652344, 'nll_loss': 1.0360862016677856, 'log_odds_ratio': -0.027547622099518776, 'log_odds_chosen': 8.49335765838623, 'epoch': 0.72}
{'loss': 0.9799, 'grad_norm': 1.4277397394180298, 'learning_rate': 3.775862068965517e-05, 'rewards/chosen': -0.028615519404411316, 'rewards/rejected': -0.3900567293167114, 'rewards/accuracies': 1.0, 'rewards/margins': 0.3614412248134613, 'logps/rejected': -1.950283408164978, 'logps/chosen': -0.14307759702205658, 'logits/rejected': -25.099456787109375, 'logits/chosen': -17.292158126831055, 'nll_loss': 0.9765639305114746, 'log_odds_ratio': -0.016792329028248787, 'log_odds_chosen': 9.768202781677246, 'epoch': 0.74}
{'loss': 1.0603, 'grad_norm': 1.3376531600952148, 'learning_rate': 3.741379310344828e-05, 'rewards/chosen': -0.026438552886247635, 'rewards/rejected': -0.43827998638153076, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4118414521217346, 'logps/rejected': -2.1913998126983643, 'logps/chosen': -0.13219276070594788, 'logits/rejected': -27.888444900512695, 'logits/chosen': -18.473506927490234, 'nll_loss': 1.0564261674880981, 'log_odds_ratio': -0.01919225975871086, 'log_odds_chosen': 9.605931282043457, 'epoch': 0.76}
{'loss': 1.1262, 'grad_norm': 1.2842741012573242, 'learning_rate': 3.7068965517241385e-05, 'rewards/chosen': -0.02944728173315525, 'rewards/rejected': -0.39146509766578674, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.36201775074005127, 'logps/rejected': -1.9573253393173218, 'logps/chosen': -0.14723637700080872, 'logits/rejected': -27.109251022338867, 'logits/chosen': -17.73124885559082, 'nll_loss': 1.1117901802062988, 'log_odds_ratio': -0.07215951383113861, 'log_odds_chosen': 9.447444915771484, 'epoch': 0.78}
{'loss': 0.9176, 'grad_norm': 1.314487338066101, 'learning_rate': 3.672413793103448e-05, 'rewards/chosen': -0.033163025975227356, 'rewards/rejected': -0.3546912670135498, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.32152825593948364, 'logps/rejected': -1.77345609664917, 'logps/chosen': -0.16581511497497559, 'logits/rejected': -26.316152572631836, 'logits/chosen': -18.487802505493164, 'nll_loss': 0.9035072326660156, 'log_odds_ratio': -0.07025459408760071, 'log_odds_chosen': 9.166141510009766, 'epoch': 0.8}
{'loss': 0.9339, 'grad_norm': 1.2364819049835205, 'learning_rate': 3.637931034482759e-05, 'rewards/chosen': -5.509861512109637e-05, 'rewards/rejected': -0.3442889451980591, 'rewards/accuracies': 1.0, 'rewards/margins': 0.34423384070396423, 'logps/rejected': -1.7214446067810059, 'logps/chosen': -0.0002754930465016514, 'logits/rejected': -30.670103073120117, 'logits/chosen': -15.412611961364746, 'nll_loss': 0.9338334798812866, 'log_odds_ratio': -0.00011181871377630159, 'log_odds_chosen': 10.631101608276367, 'epoch': 0.82}
{'loss': 1.1354, 'grad_norm': 1.1067415475845337, 'learning_rate': 3.603448275862069e-05, 'rewards/chosen': -0.02409270592033863, 'rewards/rejected': -0.3500809669494629, 'rewards/accuracies': 1.0, 'rewards/margins': 0.3259882628917694, 'logps/rejected': -1.750404715538025, 'logps/chosen': -0.1204635351896286, 'logits/rejected': -26.349342346191406, 'logits/chosen': -16.30928611755371, 'nll_loss': 1.1309703588485718, 'log_odds_ratio': -0.02216586470603943, 'log_odds_chosen': 9.138134002685547, 'epoch': 0.84}
{'loss': 1.4021, 'grad_norm': 1.8274025917053223, 'learning_rate': 3.5689655172413795e-05, 'rewards/chosen': -0.03947967290878296, 'rewards/rejected': -0.3688017427921295, 'rewards/accuracies': 1.0, 'rewards/margins': 0.3293220102787018, 'logps/rejected': -1.8440086841583252, 'logps/chosen': -0.1973983645439148, 'logits/rejected': -29.606943130493164, 'logits/chosen': -16.305423736572266, 'nll_loss': 1.3949440717697144, 'log_odds_ratio': -0.03566073998808861, 'log_odds_chosen': 8.896509170532227, 'epoch': 0.86}
{'loss': 0.9665, 'grad_norm': 1.4563382863998413, 'learning_rate': 3.53448275862069e-05, 'rewards/chosen': -0.027636105194687843, 'rewards/rejected': -0.3810792565345764, 'rewards/accuracies': 1.0, 'rewards/margins': 0.3534431457519531, 'logps/rejected': -1.9053961038589478, 'logps/chosen': -0.13818052411079407, 'logits/rejected': -27.72418212890625, 'logits/chosen': -17.45734977722168, 'nll_loss': 0.9567462801933289, 'log_odds_ratio': -0.04855053871870041, 'log_odds_chosen': 9.766358375549316, 'epoch': 0.88}
{'loss': 1.0528, 'grad_norm': 1.4033474922180176, 'learning_rate': 3.5e-05, 'rewards/chosen': -5.550501373363659e-05, 'rewards/rejected': -0.38994815945625305, 'rewards/accuracies': 1.0, 'rewards/margins': 0.38989266753196716, 'logps/rejected': -1.9497408866882324, 'logps/chosen': -0.00027752507594414055, 'logits/rejected': -29.396501541137695, 'logits/chosen': -15.3297758102417, 'nll_loss': 1.0527973175048828, 'log_odds_ratio': -0.0001499565114500001, 'log_odds_chosen': 11.2679443359375, 'epoch': 0.9}
{'loss': 0.8478, 'grad_norm': 1.5749280452728271, 'learning_rate': 3.465517241379311e-05, 'rewards/chosen': -0.008339954540133476, 'rewards/rejected': -0.4201805889606476, 'rewards/accuracies': 1.0, 'rewards/margins': 0.41184061765670776, 'logps/rejected': -2.100902795791626, 'logps/chosen': -0.041699767112731934, 'logits/rejected': -27.119667053222656, 'logits/chosen': -17.85310935974121, 'nll_loss': 0.8466331958770752, 'log_odds_ratio': -0.005965358577668667, 'log_odds_chosen': 9.838812828063965, 'epoch': 0.93}
{'loss': 1.0244, 'grad_norm': 1.0076855421066284, 'learning_rate': 3.431034482758621e-05, 'rewards/chosen': -0.01577857695519924, 'rewards/rejected': -0.39185070991516113, 'rewards/accuracies': 1.0, 'rewards/margins': 0.37607207894325256, 'logps/rejected': -1.9592533111572266, 'logps/chosen': -0.07889287918806076, 'logits/rejected': -29.978755950927734, 'logits/chosen': -16.56873893737793, 'nll_loss': 1.0213502645492554, 'log_odds_ratio': -0.015392943285405636, 'log_odds_chosen': 10.408103942871094, 'epoch': 0.95}
{'loss': 0.6776, 'grad_norm': 1.270260214805603, 'learning_rate': 3.3965517241379316e-05, 'rewards/chosen': -0.010336178354918957, 'rewards/rejected': -0.45675405859947205, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4464179277420044, 'logps/rejected': -2.2837700843811035, 'logps/chosen': -0.05168088153004646, 'logits/rejected': -26.35157012939453, 'logits/chosen': -17.077016830444336, 'nll_loss': 0.6763484477996826, 'log_odds_ratio': -0.006042775698006153, 'log_odds_chosen': 11.116888999938965, 'epoch': 0.97}
{'loss': 0.9527, 'grad_norm': 1.1584525108337402, 'learning_rate': 3.3620689655172414e-05, 'rewards/chosen': -0.0184499379247427, 'rewards/rejected': -0.4445060193538666, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4260560870170593, 'logps/rejected': -2.2225301265716553, 'logps/chosen': -0.09224968403577805, 'logits/rejected': -27.334333419799805, 'logits/chosen': -16.407512664794922, 'nll_loss': 0.9488778710365295, 'log_odds_ratio': -0.019149478524923325, 'log_odds_chosen': 9.961559295654297, 'epoch': 0.99}
 33%|███▎      | 194/582 [10:05<19:59,  3.09s/it]
100%|██████████| 8/8 [00:01<00:00,  4.05it/s]
torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
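The `use_reentrant` deprecation warning above can be silenced by passing the flag explicitly wherever gradient checkpointing is invoked. A minimal, self-contained sketch of the recommended non-reentrant call (not the trainer's actual call site, just an illustration of the `torch.utils.checkpoint` API):

```python
import torch
from torch.utils.checkpoint import checkpoint


def block(x):
    # Stand-in for a transformer block whose activations are recomputed
    # during the backward pass instead of being stored.
    return torch.relu(x) * 2


x = torch.ones(3, requires_grad=True)
# Passing use_reentrant=False explicitly silences the deprecation warning
# and selects the variant PyTorch recommends going forward.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```

When using Hugging Face Transformers, the equivalent is typically `model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})`.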
{'eval_loss': 0.8345844745635986, 'eval_runtime': 2.0654, 'eval_samples_per_second': 3.873, 'eval_steps_per_second': 3.873, 'eval_rewards/chosen': -7.709269266342744e-05, 'eval_rewards/rejected': -0.39306890964508057, 'eval_rewards/accuracies': 1.0, 'eval_rewards/margins': 0.3929918110370636, 'eval_logps/rejected': -1.9653445482254028, 'eval_logps/chosen': -0.00038546344148926437, 'eval_logits/rejected': -25.157255172729492, 'eval_logits/chosen': -14.585606575012207, 'eval_nll_loss': 0.8345516920089722, 'eval_log_odds_ratio': -0.00016369004151783884, 'eval_log_odds_chosen': 11.003547668457031, 'epoch': 1.0}
{'loss': 0.73, 'grad_norm': 1.434141755104065, 'learning_rate': 3.327586206896552e-05, 'rewards/chosen': -0.0226511862128973, 'rewards/rejected': -0.402143657207489, 'rewards/accuracies': 1.0, 'rewards/margins': 0.37949252128601074, 'logps/rejected': -2.01071834564209, 'logps/chosen': -0.11325593292713165, 'logits/rejected': -27.50503921508789, 'logits/chosen': -17.754642486572266, 'nll_loss': 0.7258264422416687, 'log_odds_ratio': -0.020737992599606514, 'log_odds_chosen': 10.186420440673828, 'epoch': 1.01}
{'loss': 1.0179, 'grad_norm': 1.5688236951828003, 'learning_rate': 3.293103448275862e-05, 'rewards/chosen': -0.04343818873167038, 'rewards/rejected': -0.4130253791809082, 'rewards/accuracies': 1.0, 'rewards/margins': 0.3695871829986572, 'logps/rejected': -2.065126657485962, 'logps/chosen': -0.2171909660100937, 'logits/rejected': -26.491863250732422, 'logits/chosen': -18.766822814941406, 'nll_loss': 1.0122684240341187, 'log_odds_ratio': -0.028035137802362442, 'log_odds_chosen': 8.460521697998047, 'epoch': 1.03}
{'loss': 1.1249, 'grad_norm': 1.8104413747787476, 'learning_rate': 3.2586206896551726e-05, 'rewards/chosen': -0.00013810096425004303, 'rewards/rejected': -0.42826545238494873, 'rewards/accuracies': 1.0, 'rewards/margins': 0.42812734842300415, 'logps/rejected': -2.141326904296875, 'logps/chosen': -0.0006905047339387238, 'logits/rejected': -28.636520385742188, 'logits/chosen': -13.43117618560791, 'nll_loss': 1.1248300075531006, 'log_odds_ratio': -0.00019074456940870732, 'log_odds_chosen': 11.441914558410645, 'epoch': 1.05}
{'loss': 1.1453, 'grad_norm': 1.150744080543518, 'learning_rate': 3.2241379310344824e-05, 'rewards/chosen': -0.01225398201495409, 'rewards/rejected': -0.38966143131256104, 'rewards/accuracies': 1.0, 'rewards/margins': 0.3774074614048004, 'logps/rejected': -1.9483072757720947, 'logps/chosen': -0.061269909143447876, 'logits/rejected': -25.097713470458984, 'logits/chosen': -13.776196479797363, 'nll_loss': 1.1437857151031494, 'log_odds_ratio': -0.007773769088089466, 'log_odds_chosen': 10.365299224853516, 'epoch': 1.07}
{'loss': 0.8654, 'grad_norm': 1.4663949012756348, 'learning_rate': 3.1896551724137935e-05, 'rewards/chosen': -0.005423582624644041, 'rewards/rejected': -0.4527643322944641, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4473407566547394, 'logps/rejected': -2.263821601867676, 'logps/chosen': -0.027117911726236343, 'logits/rejected': -28.09053611755371, 'logits/chosen': -15.615471839904785, 'nll_loss': 0.8652036190032959, 'log_odds_ratio': -0.001049592043273151, 'log_odds_chosen': 11.351728439331055, 'epoch': 1.09}
{'loss': 0.7015, 'grad_norm': 1.5851469039916992, 'learning_rate': 3.155172413793104e-05, 'rewards/chosen': -0.030667373910546303, 'rewards/rejected': -0.4841441214084625, 'rewards/accuracies': 1.0, 'rewards/margins': 0.45347678661346436, 'logps/rejected': -2.4207205772399902, 'logps/chosen': -0.15333685278892517, 'logits/rejected': -23.560535430908203, 'logits/chosen': -19.000864028930664, 'nll_loss': 0.6981819868087769, 'log_odds_ratio': -0.016560502350330353, 'log_odds_chosen': 9.556259155273438, 'epoch': 1.11}
{'loss': 0.7838, 'grad_norm': 1.9540293216705322, 'learning_rate': 3.120689655172414e-05, 'rewards/chosen': -0.012657233513891697, 'rewards/rejected': -0.458551824092865, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4458945691585541, 'logps/rejected': -2.2927591800689697, 'logps/chosen': -0.06328617036342621, 'logits/rejected': -26.852397918701172, 'logits/chosen': -16.126007080078125, 'nll_loss': 0.7830108404159546, 'log_odds_ratio': -0.003710703458636999, 'log_odds_chosen': 10.654180526733398, 'epoch': 1.13}
{'loss': 0.7686, 'grad_norm': 1.6532940864562988, 'learning_rate': 3.086206896551724e-05, 'rewards/chosen': -0.022402046248316765, 'rewards/rejected': -0.4803857207298279, 'rewards/accuracies': 1.0, 'rewards/margins': 0.45798367261886597, 'logps/rejected': -2.401928663253784, 'logps/chosen': -0.11201022565364838, 'logits/rejected': -23.534074783325195, 'logits/chosen': -16.937368392944336, 'nll_loss': 0.7663403749465942, 'log_odds_ratio': -0.011446008458733559, 'log_odds_chosen': 9.576802253723145, 'epoch': 1.15}
{'loss': 0.7441, 'grad_norm': 1.6746153831481934, 'learning_rate': 3.0517241379310348e-05, 'rewards/chosen': -0.01717301271855831, 'rewards/rejected': -0.4562140703201294, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4390410780906677, 'logps/rejected': -2.2810702323913574, 'logps/chosen': -0.0858650654554367, 'logits/rejected': -23.2592830657959, 'logits/chosen': -15.255314826965332, 'nll_loss': 0.7428922057151794, 'log_odds_ratio': -0.006020816043019295, 'log_odds_chosen': 10.802228927612305, 'epoch': 1.17}
{'loss': 1.2266, 'grad_norm': 1.0024935007095337, 'learning_rate': 3.017241379310345e-05, 'rewards/chosen': -1.6387168216169812e-05, 'rewards/rejected': -0.4035765528678894, 'rewards/accuracies': 1.0, 'rewards/margins': 0.40356022119522095, 'logps/rejected': -2.017883062362671, 'logps/chosen': -8.193583198590204e-05, 'logits/rejected': -25.325101852416992, 'logits/chosen': -9.75403118133545, 'nll_loss': 1.226586937904358, 'log_odds_ratio': -2.9053824619040824e-05, 'log_odds_chosen': 11.811872482299805, 'epoch': 1.19}
{'loss': 0.6407, 'grad_norm': 1.436519742012024, 'learning_rate': 2.9827586206896553e-05, 'rewards/chosen': -0.011361561715602875, 'rewards/rejected': -0.4970394968986511, 'rewards/accuracies': 1.0, 'rewards/margins': 0.48567795753479004, 'logps/rejected': -2.4851975440979004, 'logps/chosen': -0.05680781230330467, 'logits/rejected': -23.934532165527344, 'logits/chosen': -13.496877670288086, 'nll_loss': 0.6400547027587891, 'log_odds_ratio': -0.003380458103492856, 'log_odds_chosen': 11.923421859741211, 'epoch': 1.21}
{'loss': 0.7298, 'grad_norm': 1.435310959815979, 'learning_rate': 2.9482758620689654e-05, 'rewards/chosen': -0.015649445354938507, 'rewards/rejected': -0.4991505444049835, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4835011065006256, 'logps/rejected': -2.4957528114318848, 'logps/chosen': -0.07824723422527313, 'logits/rejected': -23.138154983520508, 'logits/chosen': -16.270519256591797, 'nll_loss': 0.7293619513511658, 'log_odds_ratio': -0.0022277431562542915, 'log_odds_chosen': 9.952068328857422, 'epoch': 1.23}
{'loss': 0.7855, 'grad_norm': 1.8303039073944092, 'learning_rate': 2.913793103448276e-05, 'rewards/chosen': -0.00028555691824294627, 'rewards/rejected': -0.39356809854507446, 'rewards/accuracies': 1.0, 'rewards/margins': 0.39328253269195557, 'logps/rejected': -1.967840552330017, 'logps/chosen': -0.0014277845621109009, 'logits/rejected': -25.22410011291504, 'logits/chosen': -12.263191223144531, 'nll_loss': 0.7852044701576233, 'log_odds_ratio': -0.0015762178227305412, 'log_odds_chosen': 11.396492004394531, 'epoch': 1.25}
{'loss': 0.678, 'grad_norm': 1.5734047889709473, 'learning_rate': 2.8793103448275865e-05, 'rewards/chosen': -0.011251558549702168, 'rewards/rejected': -0.5024089217185974, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4911573827266693, 'logps/rejected': -2.512044668197632, 'logps/chosen': -0.05625779181718826, 'logits/rejected': -25.162145614624023, 'logits/chosen': -15.112372398376465, 'nll_loss': 0.6772906184196472, 'log_odds_ratio': -0.0033333105966448784, 'log_odds_chosen': 11.352081298828125, 'epoch': 1.28}
{'loss': 0.9325, 'grad_norm': 1.117419719696045, 'learning_rate': 2.844827586206897e-05, 'rewards/chosen': -2.064571344817523e-05, 'rewards/rejected': -0.4928421974182129, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4928215444087982, 'logps/rejected': -2.4642109870910645, 'logps/chosen': -0.00010322857269784436, 'logits/rejected': -26.4111270904541, 'logits/chosen': -12.087352752685547, 'nll_loss': 0.932537853717804, 'log_odds_ratio': -3.2623425795463845e-05, 'log_odds_chosen': 12.432601928710938, 'epoch': 1.3}
{'loss': 0.746, 'grad_norm': 2.4486336708068848, 'learning_rate': 2.810344827586207e-05, 'rewards/chosen': -0.006629860959947109, 'rewards/rejected': -0.5243197679519653, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5176898837089539, 'logps/rejected': -2.621598482131958, 'logps/chosen': -0.03314930200576782, 'logits/rejected': -23.90359878540039, 'logits/chosen': -13.254927635192871, 'nll_loss': 0.7458091378211975, 'log_odds_ratio': -0.0009646276594139636, 'log_odds_chosen': 11.988978385925293, 'epoch': 1.32}
{'loss': 0.9471, 'grad_norm': 1.726151943206787, 'learning_rate': 2.7758620689655175e-05, 'rewards/chosen': -0.007225167006254196, 'rewards/rejected': -0.4587666094303131, 'rewards/accuracies': 1.0, 'rewards/margins': 0.45154139399528503, 'logps/rejected': -2.293832778930664, 'logps/chosen': -0.03612583130598068, 'logits/rejected': -23.52751350402832, 'logits/chosen': -12.854814529418945, 'nll_loss': 0.9468055367469788, 'log_odds_ratio': -0.0015748534351587296, 'log_odds_chosen': 11.05248737335205, 'epoch': 1.34}
{'loss': 0.8312, 'grad_norm': 2.2712466716766357, 'learning_rate': 2.7413793103448275e-05, 'rewards/chosen': -0.01835714466869831, 'rewards/rejected': -0.476334810256958, 'rewards/accuracies': 1.0, 'rewards/margins': 0.45797765254974365, 'logps/rejected': -2.38167405128479, 'logps/chosen': -0.09178569912910461, 'logits/rejected': -23.461687088012695, 'logits/chosen': -16.910842895507812, 'nll_loss': 0.830147922039032, 'log_odds_ratio': -0.0052607422694563866, 'log_odds_chosen': 10.429073333740234, 'epoch': 1.36}
{'loss': 0.783, 'grad_norm': 1.6994627714157104, 'learning_rate': 2.706896551724138e-05, 'rewards/chosen': -4.068882844876498e-06, 'rewards/rejected': -0.40499016642570496, 'rewards/accuracies': 1.0, 'rewards/margins': 0.40498611330986023, 'logps/rejected': -2.0249507427215576, 'logps/chosen': -2.0344412405393086e-05, 'logits/rejected': -25.600175857543945, 'logits/chosen': -12.171775817871094, 'nll_loss': 0.782982587814331, 'log_odds_ratio': -3.4943300306622405e-06, 'log_odds_chosen': 12.69761848449707, 'epoch': 1.38}
{'loss': 0.7802, 'grad_norm': 1.4252769947052002, 'learning_rate': 2.672413793103448e-05, 'rewards/chosen': -0.00289721367880702, 'rewards/rejected': -0.4256645739078522, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4227673411369324, 'logps/rejected': -2.1283226013183594, 'logps/chosen': -0.014486067928373814, 'logits/rejected': -24.62644386291504, 'logits/chosen': -13.609281539916992, 'nll_loss': 0.7800947427749634, 'log_odds_ratio': -0.0005288764368742704, 'log_odds_chosen': 11.761899948120117, 'epoch': 1.4}
{'loss': 0.8627, 'grad_norm': 1.8187662363052368, 'learning_rate': 2.637931034482759e-05, 'rewards/chosen': -0.005042150616645813, 'rewards/rejected': -0.5045844316482544, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4995422661304474, 'logps/rejected': -2.5229220390319824, 'logps/chosen': -0.025210753083229065, 'logits/rejected': -22.29425621032715, 'logits/chosen': -13.398456573486328, 'nll_loss': 0.8626401424407959, 'log_odds_ratio': -0.0003886263875756413, 'log_odds_chosen': 11.516894340515137, 'epoch': 1.42}
{'loss': 0.9662, 'grad_norm': 1.4407010078430176, 'learning_rate': 2.6034482758620692e-05, 'rewards/chosen': -0.003056781366467476, 'rewards/rejected': -0.5342327356338501, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5311759114265442, 'logps/rejected': -2.671163558959961, 'logps/chosen': -0.015283907763659954, 'logits/rejected': -23.001192092895508, 'logits/chosen': -12.413644790649414, 'nll_loss': 0.9660602807998657, 'log_odds_ratio': -0.00047741655725985765, 'log_odds_chosen': 12.229846954345703, 'epoch': 1.44}
{'loss': 0.945, 'grad_norm': 1.239019751548767, 'learning_rate': 2.5689655172413796e-05, 'rewards/chosen': -2.5805431505432352e-05, 'rewards/rejected': -0.44638997316360474, 'rewards/accuracies': 1.0, 'rewards/margins': 0.446364164352417, 'logps/rejected': -2.231949806213379, 'logps/chosen': -0.00012902714661322534, 'logits/rejected': -28.394424438476562, 'logits/chosen': -13.379776000976562, 'nll_loss': 0.9449691772460938, 'log_odds_ratio': -2.9987088055349886e-05, 'log_odds_chosen': 11.915634155273438, 'epoch': 1.46}
{'loss': 0.8984, 'grad_norm': 2.0086193084716797, 'learning_rate': 2.5344827586206897e-05, 'rewards/chosen': -0.014658451080322266, 'rewards/rejected': -0.42176464200019836, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4071062207221985, 'logps/rejected': -2.108823299407959, 'logps/chosen': -0.07329225540161133, 'logits/rejected': -22.89071273803711, 'logits/chosen': -12.471784591674805, 'nll_loss': 0.8956164717674255, 'log_odds_ratio': -0.013828521594405174, 'log_odds_chosen': 11.03779125213623, 'epoch': 1.48}
{'loss': 0.7269, 'grad_norm': 1.7999000549316406, 'learning_rate': 2.5e-05, 'rewards/chosen': -0.004280082415789366, 'rewards/rejected': -0.48385393619537354, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4795738756656647, 'logps/rejected': -2.419269561767578, 'logps/chosen': -0.021400412544608116, 'logits/rejected': -23.369915008544922, 'logits/chosen': -12.923969268798828, 'nll_loss': 0.72686368227005, 'log_odds_ratio': -0.0001776623830664903, 'log_odds_chosen': 12.287035942077637, 'epoch': 1.5}
{'loss': 0.9737, 'grad_norm': 1.3905017375946045, 'learning_rate': 2.4655172413793105e-05, 'rewards/chosen': -0.0048111844807863235, 'rewards/rejected': -0.444429486989975, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4396182894706726, 'logps/rejected': -2.2221474647521973, 'logps/chosen': -0.024055922403931618, 'logits/rejected': -25.511119842529297, 'logits/chosen': -11.928939819335938, 'nll_loss': 0.9736031293869019, 'log_odds_ratio': -0.0002628489746712148, 'log_odds_chosen': 11.527250289916992, 'epoch': 1.52}
{'loss': 0.8062, 'grad_norm': 1.6731687784194946, 'learning_rate': 2.4310344827586206e-05, 'rewards/chosen': -0.0012859604321420193, 'rewards/rejected': -0.4721667766571045, 'rewards/accuracies': 1.0, 'rewards/margins': 0.47088080644607544, 'logps/rejected': -2.3608338832855225, 'logps/chosen': -0.006429802160710096, 'logits/rejected': -24.038406372070312, 'logits/chosen': -12.93802547454834, 'nll_loss': 0.8062067031860352, 'log_odds_ratio': -0.0001333983091171831, 'log_odds_chosen': 12.579763412475586, 'epoch': 1.54}
{'loss': 0.9531, 'grad_norm': 1.564760684967041, 'learning_rate': 2.3965517241379314e-05, 'rewards/chosen': -1.4904611816746183e-05, 'rewards/rejected': -0.4452192187309265, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4452042877674103, 'logps/rejected': -2.2260959148406982, 'logps/chosen': -7.452305726474151e-05, 'logits/rejected': -24.150279998779297, 'logits/chosen': -11.927482604980469, 'nll_loss': 0.9530947208404541, 'log_odds_ratio': -1.0967456546495669e-05, 'log_odds_chosen': 12.290933609008789, 'epoch': 1.56}
{'loss': 0.7301, 'grad_norm': 1.0833263397216797, 'learning_rate': 2.3620689655172415e-05, 'rewards/chosen': -6.462186320277397e-06, 'rewards/rejected': -0.565669059753418, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5656626224517822, 'logps/rejected': -2.82834529876709, 'logps/chosen': -3.231093069189228e-05, 'logits/rejected': -24.985515594482422, 'logits/chosen': -12.523492813110352, 'nll_loss': 0.7301193475723267, 'log_odds_ratio': -6.385242613760056e-06, 'log_odds_chosen': 13.280832290649414, 'epoch': 1.58}
{'loss': 0.8972, 'grad_norm': 1.2633384466171265, 'learning_rate': 2.327586206896552e-05, 'rewards/chosen': -0.01939469948410988, 'rewards/rejected': -0.49531736969947815, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4759226441383362, 'logps/rejected': -2.4765868186950684, 'logps/chosen': -0.0969734787940979, 'logits/rejected': -23.24631690979004, 'logits/chosen': -13.590198516845703, 'nll_loss': 0.8959752321243286, 'log_odds_ratio': -0.006370073184370995, 'log_odds_chosen': 11.042448997497559, 'epoch': 1.6}
{'loss': 0.5689, 'grad_norm': 1.5339045524597168, 'learning_rate': 2.293103448275862e-05, 'rewards/chosen': -1.4294133507064544e-05, 'rewards/rejected': -0.4143698811531067, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4143555760383606, 'logps/rejected': -2.0718493461608887, 'logps/chosen': -7.147066935431212e-05, 'logits/rejected': -23.955123901367188, 'logits/chosen': -12.740753173828125, 'nll_loss': 0.5689448714256287, 'log_odds_ratio': -1.6295012756017968e-05, 'log_odds_chosen': 12.075620651245117, 'epoch': 1.62}
{'loss': 0.7717, 'grad_norm': 1.5513174533843994, 'learning_rate': 2.2586206896551727e-05, 'rewards/chosen': -0.01027359813451767, 'rewards/rejected': -0.4722784757614136, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4620048999786377, 'logps/rejected': -2.3613924980163574, 'logps/chosen': -0.05136799067258835, 'logits/rejected': -23.137420654296875, 'logits/chosen': -12.588123321533203, 'nll_loss': 0.7705242037773132, 'log_odds_ratio': -0.005942543037235737, 'log_odds_chosen': 10.953614234924316, 'epoch': 1.65}
{'loss': 0.7541, 'grad_norm': 1.2610599994659424, 'learning_rate': 2.2241379310344828e-05, 'rewards/chosen': -0.0018458148697391152, 'rewards/rejected': -0.4897184371948242, 'rewards/accuracies': 1.0, 'rewards/margins': 0.48787257075309753, 'logps/rejected': -2.448592185974121, 'logps/chosen': -0.009229075163602829, 'logits/rejected': -22.227474212646484, 'logits/chosen': -12.34030532836914, 'nll_loss': 0.7539453506469727, 'log_odds_ratio': -0.0007535302429459989, 'log_odds_chosen': 12.337367057800293, 'epoch': 1.67}
{'loss': 0.892, 'grad_norm': 1.0653754472732544, 'learning_rate': 2.1896551724137932e-05, 'rewards/chosen': -0.007841693237423897, 'rewards/rejected': -0.4939432740211487, 'rewards/accuracies': 1.0, 'rewards/margins': 0.48610156774520874, 'logps/rejected': -2.4697163105010986, 'logps/chosen': -0.039208464324474335, 'logits/rejected': -22.896587371826172, 'logits/chosen': -12.889961242675781, 'nll_loss': 0.8918967843055725, 'log_odds_ratio': -0.0005545300082303584, 'log_odds_chosen': 10.863367080688477, 'epoch': 1.69}
{'loss': 0.7272, 'grad_norm': 1.2373428344726562, 'learning_rate': 2.1551724137931033e-05, 'rewards/chosen': -0.004357445985078812, 'rewards/rejected': -0.513412594795227, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5090551376342773, 'logps/rejected': -2.5670628547668457, 'logps/chosen': -0.02178722806274891, 'logits/rejected': -22.094266891479492, 'logits/chosen': -13.578143119812012, 'nll_loss': 0.7271310687065125, 'log_odds_ratio': -0.0005934812361374497, 'log_odds_chosen': 11.282695770263672, 'epoch': 1.71}
{'loss': 0.7138, 'grad_norm': 1.8375834226608276, 'learning_rate': 2.120689655172414e-05, 'rewards/chosen': -0.007243632804602385, 'rewards/rejected': -0.4684578776359558, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4612142741680145, 'logps/rejected': -2.342289447784424, 'logps/chosen': -0.03621816262602806, 'logits/rejected': -23.930633544921875, 'logits/chosen': -13.808714866638184, 'nll_loss': 0.7130439281463623, 'log_odds_ratio': -0.0036671501584351063, 'log_odds_chosen': 11.396671295166016, 'epoch': 1.73}
{'loss': 1.015, 'grad_norm': 1.4918831586837769, 'learning_rate': 2.086206896551724e-05, 'rewards/chosen': -0.0033132126554846764, 'rewards/rejected': -0.5126004219055176, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5092872381210327, 'logps/rejected': -2.563002109527588, 'logps/chosen': -0.016566062346100807, 'logits/rejected': -21.980741500854492, 'logits/chosen': -11.203933715820312, 'nll_loss': 1.0149378776550293, 'log_odds_ratio': -0.00042033224599435925, 'log_odds_chosen': 11.847691535949707, 'epoch': 1.75}
{'loss': 0.8416, 'grad_norm': 1.6007637977600098, 'learning_rate': 2.0517241379310345e-05, 'rewards/chosen': -0.0004139716038480401, 'rewards/rejected': -0.4986298382282257, 'rewards/accuracies': 1.0, 'rewards/margins': 0.498215913772583, 'logps/rejected': -2.4931492805480957, 'logps/chosen': -0.0020698581356555223, 'logits/rejected': -21.628650665283203, 'logits/chosen': -11.273602485656738, 'nll_loss': 0.8415579199790955, 'log_odds_ratio': -1.514019095338881e-05, 'log_odds_chosen': 12.51939868927002, 'epoch': 1.77}
{'loss': 1.1082, 'grad_norm': 1.675502061843872, 'learning_rate': 2.017241379310345e-05, 'rewards/chosen': -0.007702856324613094, 'rewards/rejected': -0.45509517192840576, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4473923444747925, 'logps/rejected': -2.2754757404327393, 'logps/chosen': -0.03851427882909775, 'logits/rejected': -22.350093841552734, 'logits/chosen': -10.996744155883789, 'nll_loss': 1.1078848838806152, 'log_odds_ratio': -0.0013763784663751721, 'log_odds_chosen': 10.637802124023438, 'epoch': 1.79}
{'loss': 0.9635, 'grad_norm': 0.9472073316574097, 'learning_rate': 1.9827586206896554e-05, 'rewards/chosen': -4.723019628727343e-06, 'rewards/rejected': -0.46462804079055786, 'rewards/accuracies': 1.0, 'rewards/margins': 0.46462327241897583, 'logps/rejected': -2.3231401443481445, 'logps/chosen': -2.3615099053131416e-05, 'logits/rejected': -22.873435974121094, 'logits/chosen': -10.553976058959961, 'nll_loss': 0.9634828567504883, 'log_odds_ratio': -4.492734660743736e-06, 'log_odds_chosen': 13.069788932800293, 'epoch': 1.81}
{'loss': 1.1342, 'grad_norm': 2.042491912841797, 'learning_rate': 1.9482758620689655e-05, 'rewards/chosen': -0.005450810305774212, 'rewards/rejected': -0.49723613262176514, 'rewards/accuracies': 1.0, 'rewards/margins': 0.49178531765937805, 'logps/rejected': -2.4861807823181152, 'logps/chosen': -0.027254050597548485, 'logits/rejected': -25.828407287597656, 'logits/chosen': -11.611310005187988, 'nll_loss': 1.1340237855911255, 'log_odds_ratio': -0.0007753735990263522, 'log_odds_chosen': 12.567646980285645, 'epoch': 1.83}
{'loss': 0.7101, 'grad_norm': 1.406186580657959, 'learning_rate': 1.913793103448276e-05, 'rewards/chosen': -0.009088482707738876, 'rewards/rejected': -0.6018110513687134, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5927225351333618, 'logps/rejected': -3.0090551376342773, 'logps/chosen': -0.04544241353869438, 'logits/rejected': -21.312074661254883, 'logits/chosen': -13.426332473754883, 'nll_loss': 0.7098942399024963, 'log_odds_ratio': -0.0009629733394831419, 'log_odds_chosen': 11.741249084472656, 'epoch': 1.85}
{'loss': 0.8413, 'grad_norm': 1.8002417087554932, 'learning_rate': 1.8793103448275863e-05, 'rewards/chosen': -5.286605301080272e-06, 'rewards/rejected': -0.49949920177459717, 'rewards/accuracies': 1.0, 'rewards/margins': 0.49949389696121216, 'logps/rejected': -2.4974961280822754, 'logps/chosen': -2.6433026505401358e-05, 'logits/rejected': -23.310596466064453, 'logits/chosen': -12.027463912963867, 'nll_loss': 0.8413053154945374, 'log_odds_ratio': -2.3394891286443453e-06, 'log_odds_chosen': 13.34364128112793, 'epoch': 1.87}
{'loss': 0.683, 'grad_norm': 1.235516905784607, 'learning_rate': 1.8448275862068967e-05, 'rewards/chosen': -0.003512467723339796, 'rewards/rejected': -0.5527837872505188, 'rewards/accuracies': 1.0, 'rewards/margins': 0.549271285533905, 'logps/rejected': -2.7639191150665283, 'logps/chosen': -0.01756233721971512, 'logits/rejected': -20.34320068359375, 'logits/chosen': -12.428564071655273, 'nll_loss': 0.6827452182769775, 'log_odds_ratio': -0.00109500577673316, 'log_odds_chosen': 12.128990173339844, 'epoch': 1.89}
{'loss': 0.655, 'grad_norm': 1.7379591464996338, 'learning_rate': 1.810344827586207e-05, 'rewards/chosen': -4.141280442127027e-06, 'rewards/rejected': -0.5228589773178101, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5228548049926758, 'logps/rejected': -2.6142945289611816, 'logps/chosen': -2.0706402210635133e-05, 'logits/rejected': -23.65036392211914, 'logits/chosen': -12.36349105834961, 'nll_loss': 0.6549561023712158, 'log_odds_ratio': -2.6971233637596015e-06, 'log_odds_chosen': 13.45773696899414, 'epoch': 1.91}
{'loss': 0.9058, 'grad_norm': 1.7110241651535034, 'learning_rate': 1.7758620689655172e-05, 'rewards/chosen': -3.873146852129139e-06, 'rewards/rejected': -0.46340620517730713, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4634023606777191, 'logps/rejected': -2.317031145095825, 'logps/chosen': -1.936573244165629e-05, 'logits/rejected': -24.86945152282715, 'logits/chosen': -11.506824493408203, 'nll_loss': 0.9058194756507874, 'log_odds_ratio': -2.5481033389951335e-06, 'log_odds_chosen': 13.113186836242676, 'epoch': 1.93}
{'loss': 0.9081, 'grad_norm': 1.3138561248779297, 'learning_rate': 1.7413793103448276e-05, 'rewards/chosen': -0.0022292304784059525, 'rewards/rejected': -0.47970065474510193, 'rewards/accuracies': 1.0, 'rewards/margins': 0.47747141122817993, 'logps/rejected': -2.398503065109253, 'logps/chosen': -0.011146152392029762, 'logits/rejected': -22.83418846130371, 'logits/chosen': -11.159936904907227, 'nll_loss': 0.9080435633659363, 'log_odds_ratio': -0.0003672418533824384, 'log_odds_chosen': 12.038527488708496, 'epoch': 1.95}
{'loss': 1.0058, 'grad_norm': 1.080297827720642, 'learning_rate': 1.706896551724138e-05, 'rewards/chosen': -0.0009835362434387207, 'rewards/rejected': -0.5103597640991211, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5093762874603271, 'logps/rejected': -2.5517988204956055, 'logps/chosen': -0.004917681682854891, 'logits/rejected': -23.00748062133789, 'logits/chosen': -11.451348304748535, 'nll_loss': 1.0057953596115112, 'log_odds_ratio': -3.0578688893001527e-05, 'log_odds_chosen': 12.492218017578125, 'epoch': 1.97}
{'loss': 0.9018, 'grad_norm': 1.1170464754104614, 'learning_rate': 1.6724137931034485e-05, 'rewards/chosen': -0.0016289616469293833, 'rewards/rejected': -0.4579496681690216, 'rewards/accuracies': 1.0, 'rewards/margins': 0.45632070302963257, 'logps/rejected': -2.289748191833496, 'logps/chosen': -0.008144808001816273, 'logits/rejected': -22.94778823852539, 'logits/chosen': -11.097249984741211, 'nll_loss': 0.9016796946525574, 'log_odds_ratio': -0.0005855378694832325, 'log_odds_chosen': 12.383670806884766, 'epoch': 1.99}
 67%|██████▋   | 389/582 [20:11<10:14,  3.19s/it]
100%|██████████| 8/8 [00:01<00:00,  4.06it/s]
torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
{'eval_loss': 0.8056913614273071, 'eval_runtime': 2.0516, 'eval_samples_per_second': 3.899, 'eval_steps_per_second': 3.899, 'eval_rewards/chosen': -5.849591161677381e-06, 'eval_rewards/rejected': -0.4983709454536438, 'eval_rewards/accuracies': 1.0, 'eval_rewards/margins': 0.4983651340007782, 'eval_logps/rejected': -2.491854667663574, 'eval_logps/chosen': -2.9247956263134256e-05, 'eval_logits/rejected': -19.774274826049805, 'eval_logits/chosen': -11.12719440460205, 'eval_nll_loss': 0.805690586566925, 'eval_log_odds_ratio': -3.874320555041777e-06, 'eval_log_odds_chosen': 12.983072280883789, 'epoch': 2.0}
{'loss': 0.4572, 'grad_norm': 1.195825457572937, 'learning_rate': 1.6379310344827585e-05, 'rewards/chosen': -0.00644961092621088, 'rewards/rejected': -0.5352410674095154, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5287914276123047, 'logps/rejected': -2.6762053966522217, 'logps/chosen': -0.03224804997444153, 'logits/rejected': -20.2484130859375, 'logits/chosen': -13.392314910888672, 'nll_loss': 0.4569302499294281, 'log_odds_ratio': -0.0011830029543489218, 'log_odds_chosen': 11.48648738861084, 'epoch': 2.02}
{'loss': 0.9195, 'grad_norm': 1.6340574026107788, 'learning_rate': 1.603448275862069e-05, 'rewards/chosen': -0.0009641453507356346, 'rewards/rejected': -0.5165884494781494, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5156242251396179, 'logps/rejected': -2.582942247390747, 'logps/chosen': -0.004820726811885834, 'logits/rejected': -23.279632568359375, 'logits/chosen': -12.158900260925293, 'nll_loss': 0.919471263885498, 'log_odds_ratio': -5.5705735576339066e-05, 'log_odds_chosen': 12.172921180725098, 'epoch': 2.04}
{'loss': 0.7767, 'grad_norm': 1.2293195724487305, 'learning_rate': 1.5689655172413794e-05, 'rewards/chosen': -0.005173991434276104, 'rewards/rejected': -0.5358086228370667, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5306346416473389, 'logps/rejected': -2.6790428161621094, 'logps/chosen': -0.025869954377412796, 'logits/rejected': -23.932090759277344, 'logits/chosen': -12.910810470581055, 'nll_loss': 0.7764286994934082, 'log_odds_ratio': -0.0011900606332346797, 'log_odds_chosen': 11.521227836608887, 'epoch': 2.06}
{'loss': 1.0726, 'grad_norm': 1.5289736986160278, 'learning_rate': 1.5344827586206898e-05, 'rewards/chosen': -1.1648961844912264e-05, 'rewards/rejected': -0.493474543094635, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4934629201889038, 'logps/rejected': -2.4673726558685303, 'logps/chosen': -5.824481195304543e-05, 'logits/rejected': -24.215726852416992, 'logits/chosen': -10.383466720581055, 'nll_loss': 1.0725864171981812, 'log_odds_ratio': -8.784352758084424e-06, 'log_odds_chosen': 12.643072128295898, 'epoch': 2.08}
{'loss': 0.9231, 'grad_norm': 1.1011031866073608, 'learning_rate': 1.5e-05, 'rewards/chosen': -0.0005848333821631968, 'rewards/rejected': -0.517253041267395, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5166682004928589, 'logps/rejected': -2.5862650871276855, 'logps/chosen': -0.002924166852608323, 'logits/rejected': -25.697216033935547, 'logits/chosen': -10.613853454589844, 'nll_loss': 0.9230929017066956, 'log_odds_ratio': -2.5269157049478963e-05, 'log_odds_chosen': 12.808097839355469, 'epoch': 2.1}
{'loss': 0.8433, 'grad_norm': 1.6567535400390625, 'learning_rate': 1.4655172413793103e-05, 'rewards/chosen': -0.003000841708853841, 'rewards/rejected': -0.5329760909080505, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5299752354621887, 'logps/rejected': -2.6648805141448975, 'logps/chosen': -0.01500420831143856, 'logits/rejected': -20.907827377319336, 'logits/chosen': -11.467324256896973, 'nll_loss': 0.8432353734970093, 'log_odds_ratio': -0.00014502243720926344, 'log_odds_chosen': 11.658699989318848, 'epoch': 2.12}
{'loss': 0.5459, 'grad_norm': 1.7167822122573853, 'learning_rate': 1.4310344827586209e-05, 'rewards/chosen': -0.0021517022978514433, 'rewards/rejected': -0.5713918209075928, 'rewards/accuracies': 1.0, 'rewards/margins': 0.569240152835846, 'logps/rejected': -2.856959104537964, 'logps/chosen': -0.01075851172208786, 'logits/rejected': -21.26047134399414, 'logits/chosen': -11.956755638122559, 'nll_loss': 0.5458904504776001, 'log_odds_ratio': -6.295014463830739e-05, 'log_odds_chosen': 12.115792274475098, 'epoch': 2.14}
{'loss': 0.7046, 'grad_norm': 1.5841059684753418, 'learning_rate': 1.3965517241379311e-05, 'rewards/chosen': -0.001718698418699205, 'rewards/rejected': -0.5160043835639954, 'rewards/accuracies': 1.0, 'rewards/margins': 0.514285683631897, 'logps/rejected': -2.580021858215332, 'logps/chosen': -0.008593492209911346, 'logits/rejected': -21.92946434020996, 'logits/chosen': -11.483360290527344, 'nll_loss': 0.704607367515564, 'log_odds_ratio': -9.291869355365634e-05, 'log_odds_chosen': 12.830413818359375, 'epoch': 2.16}
{'loss': 0.7417, 'grad_norm': 1.5741859674453735, 'learning_rate': 1.3620689655172414e-05, 'rewards/chosen': -0.0012132328702136874, 'rewards/rejected': -0.5913119912147522, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5900986790657043, 'logps/rejected': -2.956559658050537, 'logps/chosen': -0.006066164467483759, 'logits/rejected': -18.73196792602539, 'logits/chosen': -10.941924095153809, 'nll_loss': 0.7417186498641968, 'log_odds_ratio': -6.298208609223366e-05, 'log_odds_chosen': 12.809371948242188, 'epoch': 2.18}
{'loss': 0.5895, 'grad_norm': 2.0274646282196045, 'learning_rate': 1.3275862068965516e-05, 'rewards/chosen': -3.676258074847283e-06, 'rewards/rejected': -0.6546315550804138, 'rewards/accuracies': 1.0, 'rewards/margins': 0.6546278595924377, 'logps/rejected': -3.273157835006714, 'logps/chosen': -1.8381289919489063e-05, 'logits/rejected': -19.646883010864258, 'logits/chosen': -11.151803970336914, 'nll_loss': 0.5895261764526367, 'log_odds_ratio': -1.132489046540286e-06, 'log_odds_chosen': 14.147743225097656, 'epoch': 2.2}
{'loss': 0.6075, 'grad_norm': 1.4194899797439575, 'learning_rate': 1.2931034482758622e-05, 'rewards/chosen': -0.0026242006570100784, 'rewards/rejected': -0.6036843061447144, 'rewards/accuracies': 1.0, 'rewards/margins': 0.6010600328445435, 'logps/rejected': -3.0184216499328613, 'logps/chosen': -0.013121004216372967, 'logits/rejected': -18.609325408935547, 'logits/chosen': -11.456857681274414, 'nll_loss': 0.6075158715248108, 'log_odds_ratio': -0.00014893279876559973, 'log_odds_chosen': 12.175820350646973, 'epoch': 2.22}
{'loss': 0.7995, 'grad_norm': 1.708996057510376, 'learning_rate': 1.2586206896551725e-05, 'rewards/chosen': -0.0009386288002133369, 'rewards/rejected': -0.5260633230209351, 'rewards/accuracies': 1.0, 'rewards/margins': 0.525124728679657, 'logps/rejected': -2.6303164958953857, 'logps/chosen': -0.004693144001066685, 'logits/rejected': -18.443836212158203, 'logits/chosen': -9.944173812866211, 'nll_loss': 0.7994794249534607, 'log_odds_ratio': -0.00016719859559088945, 'log_odds_chosen': 12.410320281982422, 'epoch': 2.24}
{'loss': 0.8801, 'grad_norm': 1.573358416557312, 'learning_rate': 1.2241379310344827e-05, 'rewards/chosen': -0.0007378161535598338, 'rewards/rejected': -0.5471305251121521, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5463926792144775, 'logps/rejected': -2.7356526851654053, 'logps/chosen': -0.0036890809424221516, 'logits/rejected': -20.778533935546875, 'logits/chosen': -10.443353652954102, 'nll_loss': 0.8800773024559021, 'log_odds_ratio': -1.774552401911933e-05, 'log_odds_chosen': 12.928975105285645, 'epoch': 2.26}
{'loss': 0.6477, 'grad_norm': 1.7871873378753662, 'learning_rate': 1.1896551724137931e-05, 'rewards/chosen': -0.0023738271556794643, 'rewards/rejected': -0.5290237665176392, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5266499519348145, 'logps/rejected': -2.6451189517974854, 'logps/chosen': -0.011869135312736034, 'logits/rejected': -17.390993118286133, 'logits/chosen': -10.361017227172852, 'nll_loss': 0.6475288271903992, 'log_odds_ratio': -0.0007124284165911376, 'log_odds_chosen': 12.200753211975098, 'epoch': 2.28}
{'loss': 0.573, 'grad_norm': 1.5262771844863892, 'learning_rate': 1.1551724137931034e-05, 'rewards/chosen': -0.004342021886259317, 'rewards/rejected': -0.6674879193305969, 'rewards/accuracies': 1.0, 'rewards/margins': 0.6631458401679993, 'logps/rejected': -3.33743953704834, 'logps/chosen': -0.0217101089656353, 'logits/rejected': -17.737884521484375, 'logits/chosen': -10.08741283416748, 'nll_loss': 0.5729113817214966, 'log_odds_ratio': -0.00034274777863174677, 'log_odds_chosen': 12.588217735290527, 'epoch': 2.3}
{'loss': 0.8674, 'grad_norm': 1.032822608947754, 'learning_rate': 1.1206896551724138e-05, 'rewards/chosen': -0.00023366302775684744, 'rewards/rejected': -0.5758139491081238, 'rewards/accuracies': 1.0, 'rewards/margins': 0.575580358505249, 'logps/rejected': -2.8790698051452637, 'logps/chosen': -0.001168315066024661, 'logits/rejected': -18.754615783691406, 'logits/chosen': -10.111359596252441, 'nll_loss': 0.8674405217170715, 'log_odds_ratio': -7.070795163599541e-06, 'log_odds_chosen': 13.01667594909668, 'epoch': 2.32}
{'loss': 0.8515, 'grad_norm': 1.304696798324585, 'learning_rate': 1.0862068965517242e-05, 'rewards/chosen': -3.7055815482744947e-06, 'rewards/rejected': -0.5266183614730835, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5266146659851074, 'logps/rejected': -2.633091688156128, 'logps/chosen': -1.852790592238307e-05, 'logits/rejected': -20.375017166137695, 'logits/chosen': -9.902606010437012, 'nll_loss': 0.8514745831489563, 'log_odds_ratio': -1.7210863916261587e-06, 'log_odds_chosen': 13.484970092773438, 'epoch': 2.34}
{'loss': 0.4582, 'grad_norm': 1.7339868545532227, 'learning_rate': 1.0517241379310346e-05, 'rewards/chosen': -0.00029763623024336994, 'rewards/rejected': -0.7697769999504089, 'rewards/accuracies': 1.0, 'rewards/margins': 0.7694793343544006, 'logps/rejected': -3.8488850593566895, 'logps/chosen': -0.0014881814131513238, 'logits/rejected': -19.005126953125, 'logits/chosen': -11.369391441345215, 'nll_loss': 0.4582258462905884, 'log_odds_ratio': -4.701410489360569e-06, 'log_odds_chosen': 13.965812683105469, 'epoch': 2.37}
{'loss': 0.8264, 'grad_norm': 1.0132513046264648, 'learning_rate': 1.0172413793103449e-05, 'rewards/chosen': -4.061437266500434e-06, 'rewards/rejected': -0.49042627215385437, 'rewards/accuracies': 1.0, 'rewards/margins': 0.49042218923568726, 'logps/rejected': -2.4521312713623047, 'logps/chosen': -2.030718678724952e-05, 'logits/rejected': -20.67136001586914, 'logits/chosen': -9.5329008102417, 'nll_loss': 0.8263953328132629, 'log_odds_ratio': -1.989307520489092e-06, 'log_odds_chosen': 13.24461555480957, 'epoch': 2.39}
{'loss': 0.7789, 'grad_norm': 1.3634321689605713, 'learning_rate': 9.827586206896553e-06, 'rewards/chosen': -0.002735638292506337, 'rewards/rejected': -0.6266897320747375, 'rewards/accuracies': 1.0, 'rewards/margins': 0.6239540576934814, 'logps/rejected': -3.133448600769043, 'logps/chosen': -0.013678191229701042, 'logits/rejected': -16.080286026000977, 'logits/chosen': -9.744536399841309, 'nll_loss': 0.7788917422294617, 'log_odds_ratio': -0.00019043292559217662, 'log_odds_chosen': 12.91959285736084, 'epoch': 2.41}
{'loss': 0.5801, 'grad_norm': 1.2553811073303223, 'learning_rate': 9.482758620689655e-06, 'rewards/chosen': -0.0044756620191037655, 'rewards/rejected': -0.5723589658737183, 'rewards/accuracies': 1.0, 'rewards/margins': 0.567883312702179, 'logps/rejected': -2.8617947101593018, 'logps/chosen': -0.022378310561180115, 'logits/rejected': -19.3375186920166, 'logits/chosen': -11.114123344421387, 'nll_loss': 0.5800355672836304, 'log_odds_ratio': -0.00026665788027457893, 'log_odds_chosen': 12.528596878051758, 'epoch': 2.43}
{'loss': 0.7734, 'grad_norm': 2.1767923831939697, 'learning_rate': 9.13793103448276e-06, 'rewards/chosen': -0.0006139791803434491, 'rewards/rejected': -0.5401873588562012, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5395733714103699, 'logps/rejected': -2.700936794281006, 'logps/chosen': -0.0030698957853019238, 'logits/rejected': -19.245071411132812, 'logits/chosen': -10.252250671386719, 'nll_loss': 0.7733587026596069, 'log_odds_ratio': -6.229685095604509e-05, 'log_odds_chosen': 12.896451950073242, 'epoch': 2.45}
{'loss': 0.5743, 'grad_norm': 2.2843449115753174, 'learning_rate': 8.793103448275862e-06, 'rewards/chosen': -0.002333274343982339, 'rewards/rejected': -0.5543010234832764, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5519676804542542, 'logps/rejected': -2.7715048789978027, 'logps/chosen': -0.011666371487081051, 'logits/rejected': -16.55878257751465, 'logits/chosen': -9.83079719543457, 'nll_loss': 0.574192464351654, 'log_odds_ratio': -0.00043011928210034966, 'log_odds_chosen': 12.21865463256836, 'epoch': 2.47}
{'loss': 0.7758, 'grad_norm': 1.2806751728057861, 'learning_rate': 8.448275862068966e-06, 'rewards/chosen': -0.0005896255606785417, 'rewards/rejected': -0.5615801215171814, 'rewards/accuracies': 1.0, 'rewards/margins': 0.560990571975708, 'logps/rejected': -2.8079006671905518, 'logps/chosen': -0.00294812791980803, 'logits/rejected': -17.494726181030273, 'logits/chosen': -9.414746284484863, 'nll_loss': 0.775834858417511, 'log_odds_ratio': -5.1046979933744296e-05, 'log_odds_chosen': 12.691412925720215, 'epoch': 2.49}
{'loss': 0.6468, 'grad_norm': 1.4305418729782104, 'learning_rate': 8.103448275862069e-06, 'rewards/chosen': -0.0009309580200351775, 'rewards/rejected': -0.5848796963691711, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5839487910270691, 'logps/rejected': -2.92439866065979, 'logps/chosen': -0.004654789809137583, 'logits/rejected': -17.671133041381836, 'logits/chosen': -10.447976112365723, 'nll_loss': 0.6468138098716736, 'log_odds_ratio': -4.6391647629207e-05, 'log_odds_chosen': 13.011695861816406, 'epoch': 2.51}
{'loss': 0.4787, 'grad_norm': 1.9232028722763062, 'learning_rate': 7.758620689655173e-06, 'rewards/chosen': -4.16892726207152e-06, 'rewards/rejected': -0.6570600271224976, 'rewards/accuracies': 1.0, 'rewards/margins': 0.6570559144020081, 'logps/rejected': -3.2853000164031982, 'logps/chosen': -2.08446363103576e-05, 'logits/rejected': -19.674097061157227, 'logits/chosen': -11.62187671661377, 'nll_loss': 0.47874119877815247, 'log_odds_ratio': -1.467766878704424e-06, 'log_odds_chosen': 14.209254264831543, 'epoch': 2.53}
{'loss': 0.9802, 'grad_norm': 2.4507875442504883, 'learning_rate': 7.413793103448275e-06, 'rewards/chosen': -5.797777248517377e-06, 'rewards/rejected': -0.492559552192688, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4925537705421448, 'logps/rejected': -2.4627978801727295, 'logps/chosen': -2.8988884878344834e-05, 'logits/rejected': -21.21971893310547, 'logits/chosen': -9.857515335083008, 'nll_loss': 0.9802453517913818, 'log_odds_ratio': -5.438972493720939e-06, 'log_odds_chosen': 13.089038848876953, 'epoch': 2.55}
{'loss': 0.9989, 'grad_norm': 1.6369929313659668, 'learning_rate': 7.0689655172413796e-06, 'rewards/chosen': -0.0009537400328554213, 'rewards/rejected': -0.6954256892204285, 'rewards/accuracies': 1.0, 'rewards/margins': 0.6944719552993774, 'logps/rejected': -3.477128744125366, 'logps/chosen': -0.004768700804561377, 'logits/rejected': -18.036916732788086, 'logits/chosen': -9.186036109924316, 'nll_loss': 0.9988526105880737, 'log_odds_ratio': -3.998297324869782e-05, 'log_odds_chosen': 13.614364624023438, 'epoch': 2.57}
{'loss': 0.767, 'grad_norm': 1.1828192472457886, 'learning_rate': 6.724137931034483e-06, 'rewards/chosen': -0.0010352552635595202, 'rewards/rejected': -0.5837153196334839, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5826800465583801, 'logps/rejected': -2.91857647895813, 'logps/chosen': -0.0051762755028903484, 'logits/rejected': -20.748476028442383, 'logits/chosen': -10.89369010925293, 'nll_loss': 0.7669597268104553, 'log_odds_ratio': -0.00010241282870993018, 'log_odds_chosen': 12.767196655273438, 'epoch': 2.59}
{'loss': 0.9404, 'grad_norm': 1.7015591859817505, 'learning_rate': 6.379310344827587e-06, 'rewards/chosen': -0.0003407355980016291, 'rewards/rejected': -0.5762154459953308, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5758747458457947, 'logps/rejected': -2.8810770511627197, 'logps/chosen': -0.0017036780482158065, 'logits/rejected': -20.108966827392578, 'logits/chosen': -9.497296333312988, 'nll_loss': 0.9403696060180664, 'log_odds_ratio': -1.708989657345228e-05, 'log_odds_chosen': 13.35499382019043, 'epoch': 2.61}
{'loss': 0.7902, 'grad_norm': 1.8598703145980835, 'learning_rate': 6.03448275862069e-06, 'rewards/chosen': -0.0006438646814785898, 'rewards/rejected': -0.5488320589065552, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5481881499290466, 'logps/rejected': -2.7441601753234863, 'logps/chosen': -0.0032193234656006098, 'logits/rejected': -20.157764434814453, 'logits/chosen': -11.075973510742188, 'nll_loss': 0.7901697754859924, 'log_odds_ratio': -9.642274380894378e-05, 'log_odds_chosen': 12.5847806930542, 'epoch': 2.63}
{'loss': 0.9853, 'grad_norm': 1.5752251148223877, 'learning_rate': 5.689655172413794e-06, 'rewards/chosen': -0.0037604495882987976, 'rewards/rejected': -0.5716506242752075, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5678901672363281, 'logps/rejected': -2.858253002166748, 'logps/chosen': -0.018802247941493988, 'logits/rejected': -19.155044555664062, 'logits/chosen': -10.014838218688965, 'nll_loss': 0.9852606654167175, 'log_odds_ratio': -0.00032360476325266063, 'log_odds_chosen': 11.626448631286621, 'epoch': 2.65}
{'loss': 0.5911, 'grad_norm': 1.4440124034881592, 'learning_rate': 5.344827586206897e-06, 'rewards/chosen': -0.0003787380992434919, 'rewards/rejected': -0.5870416760444641, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5866629481315613, 'logps/rejected': -2.935208320617676, 'logps/chosen': -0.0018936903215944767, 'logits/rejected': -20.548538208007812, 'logits/chosen': -11.548595428466797, 'nll_loss': 0.5911110639572144, 'log_odds_ratio': -2.4227976609836332e-05, 'log_odds_chosen': 13.21658706665039, 'epoch': 2.67}
{'loss': 0.8588, 'grad_norm': 1.4845126867294312, 'learning_rate': 5e-06, 'rewards/chosen': -0.0024457962717860937, 'rewards/rejected': -0.521043062210083, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5185972452163696, 'logps/rejected': -2.605215311050415, 'logps/chosen': -0.012228981591761112, 'logits/rejected': -22.451461791992188, 'logits/chosen': -11.681790351867676, 'nll_loss': 0.8587256669998169, 'log_odds_ratio': -0.00017340258636977524, 'log_odds_chosen': 13.118877410888672, 'epoch': 2.69}
{'loss': 0.634, 'grad_norm': 1.5449845790863037, 'learning_rate': 4.655172413793104e-06, 'rewards/chosen': -0.002779257483780384, 'rewards/rejected': -0.6150062084197998, 'rewards/accuracies': 1.0, 'rewards/margins': 0.6122269630432129, 'logps/rejected': -3.075031280517578, 'logps/chosen': -0.01389628741890192, 'logits/rejected': -19.12252426147461, 'logits/chosen': -12.268533706665039, 'nll_loss': 0.6340162754058838, 'log_odds_ratio': -3.203348387614824e-05, 'log_odds_chosen': 12.811798095703125, 'epoch': 2.71}
{'loss': 0.9906, 'grad_norm': 1.544396162033081, 'learning_rate': 4.310344827586207e-06, 'rewards/chosen': -0.00035690004006028175, 'rewards/rejected': -0.6084524393081665, 'rewards/accuracies': 1.0, 'rewards/margins': 0.6080954670906067, 'logps/rejected': -3.042261838912964, 'logps/chosen': -0.0017845003167167306, 'logits/rejected': -20.487857818603516, 'logits/chosen': -10.147913932800293, 'nll_loss': 0.9905788898468018, 'log_odds_ratio': -6.22881862000213e-06, 'log_odds_chosen': 13.218502044677734, 'epoch': 2.74}
{'loss': 0.8517, 'grad_norm': 1.5686635971069336, 'learning_rate': 3.96551724137931e-06, 'rewards/chosen': -0.0038685956969857216, 'rewards/rejected': -0.6442170143127441, 'rewards/accuracies': 1.0, 'rewards/margins': 0.6403484344482422, 'logps/rejected': -3.2210850715637207, 'logps/chosen': -0.019342977553606033, 'logits/rejected': -19.41363525390625, 'logits/chosen': -10.342615127563477, 'nll_loss': 0.8516762256622314, 'log_odds_ratio': -0.0001525185798527673, 'log_odds_chosen': 12.278036117553711, 'epoch': 2.76}
{'loss': 0.7193, 'grad_norm': 2.4519948959350586, 'learning_rate': 3.620689655172414e-06, 'rewards/chosen': -0.000353357958374545, 'rewards/rejected': -0.5351605415344238, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5348072052001953, 'logps/rejected': -2.675802707672119, 'logps/chosen': -0.0017667897045612335, 'logits/rejected': -19.494089126586914, 'logits/chosen': -10.776789665222168, 'nll_loss': 0.7193320989608765, 'log_odds_ratio': -3.942699549952522e-05, 'log_odds_chosen': 12.916232109069824, 'epoch': 2.78}
{'loss': 1.1501, 'grad_norm': 1.4231575727462769, 'learning_rate': 3.2758620689655175e-06, 'rewards/chosen': -0.001281327335163951, 'rewards/rejected': -0.5944414734840393, 'rewards/accuracies': 1.0, 'rewards/margins': 0.593160092830658, 'logps/rejected': -2.9722073078155518, 'logps/chosen': -0.006406635977327824, 'logits/rejected': -19.284317016601562, 'logits/chosen': -9.445854187011719, 'nll_loss': 1.1500775814056396, 'log_odds_ratio': -4.0146820538211614e-05, 'log_odds_chosen': 12.594870567321777, 'epoch': 2.8}
{'loss': 0.8762, 'grad_norm': 1.9000905752182007, 'learning_rate': 2.931034482758621e-06, 'rewards/chosen': -0.0021680702921003103, 'rewards/rejected': -0.6599014401435852, 'rewards/accuracies': 1.0, 'rewards/margins': 0.6577333807945251, 'logps/rejected': -3.2995071411132812, 'logps/chosen': -0.01084035076200962, 'logits/rejected': -18.468284606933594, 'logits/chosen': -10.758077621459961, 'nll_loss': 0.876214861869812, 'log_odds_ratio': -0.00012014912499580532, 'log_odds_chosen': 12.802360534667969, 'epoch': 2.82}
{'loss': 0.679, 'grad_norm': 1.8793340921401978, 'learning_rate': 2.586206896551724e-06, 'rewards/chosen': -1.6906469681998715e-05, 'rewards/rejected': -0.5995166301727295, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5994997620582581, 'logps/rejected': -2.9975833892822266, 'logps/chosen': -8.453235204797238e-05, 'logits/rejected': -21.105714797973633, 'logits/chosen': -10.510438919067383, 'nll_loss': 0.6789709329605103, 'log_odds_ratio': -1.4414348697755486e-05, 'log_odds_chosen': 13.495828628540039, 'epoch': 2.84}
{'loss': 0.9215, 'grad_norm': 1.8249917030334473, 'learning_rate': 2.2413793103448275e-06, 'rewards/chosen': -0.0015617430908605456, 'rewards/rejected': -0.6086995601654053, 'rewards/accuracies': 1.0, 'rewards/margins': 0.6071377992630005, 'logps/rejected': -3.0434978008270264, 'logps/chosen': -0.007808716502040625, 'logits/rejected': -19.334270477294922, 'logits/chosen': -10.51522445678711, 'nll_loss': 0.921438455581665, 'log_odds_ratio': -0.00021719519281759858, 'log_odds_chosen': 11.721553802490234, 'epoch': 2.86}
{'loss': 0.6985, 'grad_norm': 1.6897486448287964, 'learning_rate': 1.896551724137931e-06, 'rewards/chosen': -0.0011898013763129711, 'rewards/rejected': -0.5701368451118469, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5689470171928406, 'logps/rejected': -2.8506836891174316, 'logps/chosen': -0.005949007347226143, 'logits/rejected': -19.37040901184082, 'logits/chosen': -10.909904479980469, 'nll_loss': 0.6984624266624451, 'log_odds_ratio': -2.3358828912023455e-05, 'log_odds_chosen': 12.592169761657715, 'epoch': 2.88}
{'loss': 0.7206, 'grad_norm': 2.3789925575256348, 'learning_rate': 1.5517241379310346e-06, 'rewards/chosen': -0.0007231563213281333, 'rewards/rejected': -0.5525603294372559, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5518372058868408, 'logps/rejected': -2.7628016471862793, 'logps/chosen': -0.0036157816648483276, 'logits/rejected': -20.756675720214844, 'logits/chosen': -10.476348876953125, 'nll_loss': 0.7206019163131714, 'log_odds_ratio': -2.187038990086876e-05, 'log_odds_chosen': 13.000876426696777, 'epoch': 2.9}
{'loss': 1.1157, 'grad_norm': 1.6908942461013794, 'learning_rate': 1.206896551724138e-06, 'rewards/chosen': -0.0005955866654403508, 'rewards/rejected': -0.5134972333908081, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5129016637802124, 'logps/rejected': -2.56748628616333, 'logps/chosen': -0.002977933967486024, 'logits/rejected': -19.12198257446289, 'logits/chosen': -8.531938552856445, 'nll_loss': 1.1156960725784302, 'log_odds_ratio': -0.00018488919886294752, 'log_odds_chosen': 12.348122596740723, 'epoch': 2.92}
{'loss': 0.5101, 'grad_norm': 1.6701968908309937, 'learning_rate': 8.620689655172415e-07, 'rewards/chosen': -3.7052418520033825e-06, 'rewards/rejected': -0.5871139168739319, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5871102213859558, 'logps/rejected': -2.9355692863464355, 'logps/chosen': -1.8526208805269562e-05, 'logits/rejected': -21.39337730407715, 'logits/chosen': -11.912752151489258, 'nll_loss': 0.5100995898246765, 'log_odds_ratio': -1.162291482614819e-06, 'log_odds_chosen': 13.982168197631836, 'epoch': 2.94}
{'loss': 0.7489, 'grad_norm': 1.49952232837677, 'learning_rate': 5.172413793103449e-07, 'rewards/chosen': -0.000477742578368634, 'rewards/rejected': -0.5809825658798218, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5805047750473022, 'logps/rejected': -2.9049129486083984, 'logps/chosen': -0.002388712950050831, 'logits/rejected': -20.73346519470215, 'logits/chosen': -11.064024925231934, 'nll_loss': 0.7488974928855896, 'log_odds_ratio': -6.760325049981475e-05, 'log_odds_chosen': 13.095037460327148, 'epoch': 2.96}
{'loss': 0.8027, 'grad_norm': 2.0041236877441406, 'learning_rate': 1.7241379310344828e-07, 'rewards/chosen': -0.0034677451476454735, 'rewards/rejected': -0.5447936058044434, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5413258671760559, 'logps/rejected': -2.723968267440796, 'logps/chosen': -0.017338726669549942, 'logits/rejected': -22.334590911865234, 'logits/chosen': -10.882159233093262, 'nll_loss': 0.8026158809661865, 'log_odds_ratio': -0.0002836406056303531, 'log_odds_chosen': 11.874309539794922, 'epoch': 2.98}
100%|██████████| 582/582 [30:18<00:00,  3.13s/it]
100%|██████████| 8/8 [00:01<00:00,  4.09it/s]
{'eval_loss': 0.8230330944061279, 'eval_runtime': 2.0524, 'eval_samples_per_second': 3.898, 'eval_steps_per_second': 3.898, 'eval_rewards/chosen': -5.040794349042699e-06, 'eval_rewards/rejected': -0.5525792837142944, 'eval_rewards/accuracies': 1.0, 'eval_rewards/margins': 0.5525741577148438, 'eval_logps/rejected': -2.7628960609436035, 'eval_logps/chosen': -2.5203971745213494e-05, 'eval_logits/rejected': -17.553890228271484, 'eval_logits/chosen': -10.015307426452637, 'eval_nll_loss': 0.8230326771736145, 'eval_log_odds_ratio': -2.6077123038703576e-06, 'eval_log_odds_chosen': 13.482699394226074, 'epoch': 2.99}
{'train_runtime': 1818.8938, 'train_samples_per_second': 1.283, 'train_steps_per_second': 0.32, 'train_loss': 1.0962481443414982, 'epoch': 2.99}
> 2024-09-17 12:45:18,618 [info] To track results use the CLI: {"info_cmd":"mlrun get run 4a01d76a81204ccca98d752b716f4572 -p tutorial","logs_cmd":"mlrun logs 4a01d76a81204ccca98d752b716f4572 -p tutorial"}
> 2024-09-17 12:45:18,618 [info] Or click for UI: {"ui_url":"https://dashboard.default-tenant.app.llm-dev.iguazio-cd1.com/mlprojects/tutorial/jobs/monitor/4a01d76a81204ccca98d752b716f4572/overview"}
> 2024-09-17 12:45:18,618 [info] Run execution finished: {"name":"train-train","status":"completed"}
project: tutorial | iter: 0 | start: Sep 17 12:14:25 | state: completed | kind: run | name: train-train
labels: v3io_user=edmond, kind=job, owner=edmond, mlrun/client_version=1.7.0-rc40, mlrun/client_python_version=3.9.18, host=train-train-p69gn
parameters: dataset=mlrun/banking-orpo-opt, base_model=google/gemma-2b, new_model=mlrun/gemma-2b-bank-v0.2, device=cuda:0

> to track results use the .show() or .logs() methods or click here to open in UI
> 2024-09-17 12:45:24,293 [info] Run execution finished: {"name":"train-train","status":"completed"}
<mlrun.model.RunObject at 0x7f0bcdff7a00>
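The summary dicts printed above pack many metrics into one line. A small helper can condense them into a readable one-liner. This is a hypothetical sketch (`summarize_run` is not part of the notebook's source); it assumes a dict shaped like the `eval`/`train` summaries logged above:

```python
def summarize_run(results: dict) -> str:
    """Format the key ORPO training metrics (as logged above) into one line."""
    return (
        f"train_loss={results['train_loss']:.3f} "
        f"eval_loss={results['eval_loss']:.3f} "
        f"epochs={results['epoch']:.2f}"
    )
```

For example, feeding it the values from the logs above yields `train_loss=1.096 eval_loss=0.823 epochs=2.99`.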

Check the performance of the fine-tuned model#

Now load and deploy the trained model to see how it performs.

serving_function.add_model(
    base_model,
    class_name="LLMModelServer",
    llm_type="HuggingFace",
    model_name="google/gemma-2b",
    adapter="mlrun/gemma-2b-bank-v0.2",
    model_path=f"store://models/{project.name}/{base_model}:latest",
    generate_kwargs={
        "do_sample": True,
        "top_p": 0.9,
        "num_return_sequences": 1,
        "max_length": 80,
    },
    device_map="cuda:0",
)
serving_function.set_tracking()
deployment = serving_function.deploy()
> 2024-09-17 08:46:00,704 [info] Starting remote function deploy
2024-09-17 08:46:01  (info) Deploying function
2024-09-17 08:46:01  (info) Building
2024-09-17 08:46:02  (info) Staging files and preparing base images
2024-09-17 08:46:02  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2024-09-17 08:46:02  (info) Building processor image
2024-09-17 08:50:27  (info) Build complete
2024-09-17 08:51:51  (info) Function deploy complete
> 2024-09-17 08:51:55,795 [info] Successfully deployed function: {"external_invocation_urls":["tutorial-llm-server.default-tenant.app.llm-dev.iguazio-cd1.com/"],"internal_invocation_urls":["nuclio-tutorial-llm-server.default-tenant.svc.cluster.local:8080"]}
# Continuously send the example questions to the deployed model to generate
# traffic for the monitoring applications (interrupt the kernel to stop):
while True:
    question_model(
        questions=example_questions,
        serving_function=serving_function,
        base_model=base_model,
    )
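The `while True` loop runs until the kernel is interrupted. If you prefer a bounded traffic generator, a wrapper like the sketch below works; `send_traffic`, `ask`, `rounds`, and `delay` are all hypothetical names, and in this notebook `ask` would be a closure over the `question_model` call:

```python
import time


def send_traffic(ask, questions, rounds=3, delay=0.0):
    """Call `ask(question)` for every question, `rounds` times over,
    pausing between rounds so the monitoring windows can fill up."""
    sent = 0
    for _ in range(rounds):
        for question in questions:
            ask(question)
            sent += 1
        time.sleep(delay)
    return sent
```

Choosing `delay` on the order of the monitoring `base_period` gives each window a chance to collect fresh traffic.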

The Grafana model monitoring page shows a high pass rate and a high guardrails score:

Model monitor after

Build an automated pipeline#

The pipeline samples the restrict_to_banking alert metric to check for drift. If drift is detected, it retrains the model (using the ORPO algorithm) and then deploys the improved model.

%%writefile genai_monit_src/workflow.py
import mlrun
from kfp import dsl


@dsl.pipeline(name="GenAI alerts demo")
def kfpipeline(metric_name: str, input_ds):
    project = mlrun.get_current_project()

    # Sample the monitoring metric and check whether the alert was triggered
    sample = project.run_function(
        function="metric-sample",
        name="metric-sample",
        handler="sample",
        params={"metric_name": metric_name},
        outputs=["alert_triggered"],
    )

    with dsl.Condition(sample.outputs["alert_triggered"] == "True"):
        # Generate a new dataset based on the observed traffic
        ds = project.run_function(
            function="generate-ds",
            handler="generate_ds",
            params={"input_ds": input_ds},
            outputs=["new-train-ds", "dataset"],
        )

        # Re-train the model (ORPO) on the new dataset
        train = project.run_function(
            function="train",
            handler="train",
            params={
                "dataset": "mlrun/banking-orpo-opt",
                "base_model": "google/gemma-2b",
                "new_model": "mlrun/gemma-2b-bank-v0.2",
                "device": "cuda:0",
            },
            outputs=["model"],
        ).after(ds)

        # Deploy the serving function with the new (re-trained) model
        deploy = project.get_function("llm-server")
        deploy.add_model(
            "google-gemma-2b",
            class_name="LLMModelServer",
            llm_type="HuggingFace",
            model_name="google/gemma-2b",
            adapter="mlrun/gemma-2b-bank-v0.2",
            model_path=f"store://models/{project.name}/google-gemma-2b:latest",
            generate_kwargs={
                "do_sample": True,
                "top_p": 0.9,
                "num_return_sequences": 1,
                "max_length": 80,
            },
            device_map="cuda:0",
        )
        deploy.set_tracking()
        project.deploy_function("llm-server").after(train)
Writing genai_monit_src/workflow.py
project.set_function(f"db://{project.name}/llm-server")
project.set_function(f"db://{project.name}/train")
project.set_function(f"db://{project.name}/metric-sample")
project.set_function(f"db://{project.name}/generate-ds")
project.set_workflow("main", "workflow.py", embed=True)
project.save()
<mlrun.projects.project.MlrunProject at 0x7f80376f8490>
run_id = project.run(
    "main",
    arguments={"metric_name": "restrict_to_banking alert", "input_ds": input_ds},
    watch=False,
)
> 2024-09-17 08:53:22,670 [info] Pipeline submitted successfully: {"id":"1012b16b-698a-4c7d-b6c3-e11b496a14d1","pipeline_name":"tutorial-main 2024-09-17 08-53-22"}
> 2024-09-17 08:53:22,671 [info] Pipeline run id=1012b16b-698a-4c7d-b6c3-e11b496a14d1, check UI for progress
Workflow started in project tutorial id=1012b16b-698a-4c7d-b6c3-e11b496a14d1
Pipeline running (id=1012b16b-698a-4c7d-b6c3-e11b496a14d1), click here to view the details in MLRun UI
Pipeline graph
> 2024-09-17 08:53:22,759 [info] Started run workflow tutorial-main with run id = '1012b16b-698a-4c7d-b6c3-e11b496a14d1' by kfp engine
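Because the workflow is submitted with `watch=False`, the call returns immediately and the pipeline runs in the background; you can typically block on the returned run object (for example via its `wait_for_completion()` method) or watch progress in the UI. As a generic illustration of the polling pattern, here is a hypothetical sketch (`wait_for_pipeline` and `get_state` are not MLRun APIs; `get_state` stands for any callable that returns the current workflow state):

```python
import time


def wait_for_pipeline(get_state, timeout=1800, poll=10.0):
    """Poll `get_state()` until the workflow reaches a terminal state,
    or raise TimeoutError if it does not finish within `timeout` seconds."""
    terminal = {"Succeeded", "Failed", "Error", "Skipped"}
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_state()
        if state in terminal:
            return state
        time.sleep(poll)
    raise TimeoutError("workflow did not reach a terminal state in time")
```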