How to Evaluate Multi-Turn AI Conversations with Chainlit and Label Studio
This notebook demonstrates how to create a Label Studio project for evaluating chatbot conversations using the Chatbot Evaluation template.
This template allows you to:
- Review multi-turn conversations
- Rate assistant responses for accuracy, clarity, and helpfulness
- Evaluate grounding in documentation
- Assess tone and style
- Track whether questions were answered
Reference: Chatbot Evaluation Template
Label Studio Requirements
This tutorial showcases one or more features available only in Label Studio paid products. We recommend creating a Starter Cloud trial to follow the tutorial.
Setup and Installation
First, install the Label Studio SDK if you haven’t already.
For more information about the SDK, see the Label Studio Python SDK documentation.
%pip install label-studio-sdk
Configure Credentials
The following cell loads credentials from Google Colab Secrets when available, falling back to a .env file or environment variables for local development.
%pip install python-dotenv
# Load configuration with Google Colab Secrets support + fallback
IS_GOOGLE_COLAB = False
# Load from .env file if available (for local development)
try:
from dotenv import load_dotenv
load_dotenv()
except ImportError:
    pass  # dotenv not installed; system environment variables will be used
def get_credential(key, default=None):
    """Get credential from Colab Secrets first, then environment variables"""
    global IS_GOOGLE_COLAB
    try:
        # Try Google Colab Secrets first (most secure)
        from google.colab import userdata
        IS_GOOGLE_COLAB = True
        return userdata.get(key)
    except Exception:
        from os import environ
        IS_GOOGLE_COLAB = False
        # Fall back to environment variables (for local Jupyter)
        return environ.get(key, default)
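If you are following along in Colab, you can quickly confirm that the helper finds your credentials. This is an optional check that assumes you have stored LABEL_STUDIO_API_KEY (and optionally LABEL_STUDIO_URL) as Colab Secrets or environment variables:
# Optional check: confirm the credentials are reachable via the helper above
api_key = get_credential('LABEL_STUDIO_API_KEY')
url = get_credential('LABEL_STUDIO_URL', 'https://app.humansignal.com')
print(f"Running in Google Colab: {IS_GOOGLE_COLAB}")
print(f"Label Studio URL: {url}")
print(f"API key found: {bool(api_key)}")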
Set your environment variables before running:
# URL of your Label Studio instance
export LABEL_STUDIO_URL="https://app.humansignal.com"
# Your API key (find it in Account & Settings > Personal Access Token)
export LABEL_STUDIO_API_KEY="your-api-key-here"
How to get your API key:
- Open Label Studio in your browser
- Click on your profile (top-right)
- Go to “Account & Settings”
- Click “Access Token” (or “Personal Access Token”)
- Copy the existing token or create a new one
import os
from label_studio_sdk import LabelStudio
# Get credentials from environment variables
ls_api_key = os.environ.get('LABEL_STUDIO_API_KEY')
ls_url = os.environ.get('LABEL_STUDIO_URL', 'https://app.humansignal.com')
if not ls_api_key:
raise ValueError('❌ Please set LABEL_STUDIO_API_KEY environment variable.')
# Connect to Label Studio
try:
ls = LabelStudio(base_url=ls_url, api_key=ls_api_key)
print(f'✅ Connected to Label Studio at {ls_url}')
except Exception as e:
raise ConnectionError(f'❌ Failed to connect to Label Studio: {str(e)}')
Define the Chatbot Evaluation Label Config
This is the label config from the Evaluate Production Conversations for RLHF example. It includes:
- A chat interface for viewing conversations
- An overall conversation quality rating plus per-message ratings and classifications
- Additional comment fields
LABEL_CONFIG = """
<View>
<Style>
.chat {
border: 1px solid #ccc;
padding: 10px;
border-radius: 5px;
}
.evaluation {
border: 2px solid #cc854f;
background-color: #ffe4d0;
color: #664228;
padding: 10px;
border-radius: 5px;
margin-bottom: 20px;
}
<!-- Choice text -->
.evaluation span {
color: #664228;
}
<!-- Star rating -->
.evaluation .ant-rate-star.ant-rate-star-full span {
color: #f4aa2a;
}
<!-- Dark mode comment text and button color -->
[data-color-scheme="dark"] .evaluation .lsf-row p,
[data-color-scheme="dark"] .evaluation button span {
color: #f9f8f6
}
.overall-chat {
border-bottom: 1px solid #cc854f;
margin-bottom: 15px;
}
.instructions {
color: #664228;
background-color: #ffe4d0;
padding-top: 15px;
padding-bottom: 15px;
}
<!-- Allow enlarging the instruction text -->
.lsf-richtext__container.lsf-htx-richtext {
font-size: 16px !important;
line-height: 1.6;
}
<!-- Remove excess height from the chat to allow space for instruction text -->
.htx-chat {
--excess-height: 275px
}
</Style>
<View style="display: flex; gap: 24px;">
<!-- Left: conversation -->
<View className="chat" style="flex: 2;">
<View className="instructions">
<Text name="instructions" value="Review the conversation in detail.
As you read through it, click on individual messages to
provide feedback about accuracy, clarity, and intent." />
</View>
<Chat name="chat" value="$chat"
minMessages="2"
editable="false" />
</View>
<!-- Right: conversation-level evaluation -->
<View style="flex: 1;" className="evaluation">
<View style="position:sticky;top:14px">
<!-- Evaluate the whole conversation -->
<View className="overall-chat" style="margin-top:auto">
<Header size="4">Overall quality of this conversation</Header>
<Rating name="rating" toName="chat" />
<View style="padding-top:15px">
<Text name="add_comment" value="Add a comment (optional)" />
<TextArea name="conversation_comment" toName="chat" />
</View>
</View>
<!-- Only visible when no message is selected -->
<View visibleWhen="no-region-selected">
<View style="padding-top:15px">
</View>
</View>
<!-- Only visible when a user message is selected, and only applies to selected message -->
<View visibleWhen="region-selected" whenRole="user">
<Header value="Classify the user message"/>
<Choices name="request_classification" toName="chat" perRegion="true" >
<Choice value="Question" />
<Choice value="Clarifying Question" />
<Choice value="Command or Request" />
<Choice value="Positive Feedback" />
<Choice value="Negative Feedback" />
<Choice value="Off-topic / Chit-chat" />
</Choices>
</View>
<!-- Only visible when an assistant message is selected, and only applies to selected message -->
<View visibleWhen="region-selected" whenRole="assistant">
<Header value="Rate assistant's clarity"/>
<Rating name="assistant_response_clarity" toName="chat" perRegion="true" />
<Header value="Rate assistant's accuracy"/>
<Rating name="assistant_response_accuracy" toName="chat" perRegion="true" />
<Header value="Classify the message tone"/>
<Choices name="q" toName="chat" perRegion="true" >
<Choice value="Professional" />
<Choice value="Casual" />
</Choices>
<Header value="Add a comment (optional)"/>
<TextArea perRegion="true" name="message_comment" toName="chat" />
</View>
</View>
</View>
</View>
</View>
"""
print("Label config loaded successfully")
Label config loaded successfully
With the label config defined, we can now use it to create the chatbot evaluation project.
## Create Project with Label Config
# Define project parameters
PROJECT_TITLE = "Chatbot Conversation Evaluation"
PROJECT_DESCRIPTION = "Evaluate multi-turn chatbot conversations for accuracy, clarity, and helpfulness"
# Create the project using Label Studio SDK
project = ls.projects.create(
title=PROJECT_TITLE,
description=PROJECT_DESCRIPTION,
label_config=LABEL_CONFIG
)
## Get Project ID and URL
# Store project ID and build direct URL
project_id = project.id
project_url = f"{ls_url}/projects/{project_id}"
# Save project ID to .env file
with open('.env', 'a') as f:
f.write(f"LABEL_STUDIO_PROJECT_ID={project_id}\n")
print(f"📋 Project Details:")
print(f" ID: {project_id}")
print(f" Direct URL: {project_url}")
print(f"\n🔗 Click here to open the project:")
print(f" {project_url}")
Part 2: Set Up Chainlit Integration
Now we’ll set up a Chainlit chatbot that automatically syncs conversations to Label Studio.
What We’ll Build
- A chatbot UI using Chainlit
- Automatic conversation logging to JSON
- Auto-sync to Label Studio when users disconnect
- Support for conversation resumption with versioning
Step 1: Install Additional Dependencies
We need Chainlit for the chat UI and Ollama for a local LLM; the OpenAI and Anthropic clients are installed as optional providers.
%pip install chainlit ollama openai anthropic
Step 2: Create Helper Files
We’ll create three Python files:
- conversation_logger.py - Saves conversations to JSON
- auto_sync.py - Automatically syncs to Label Studio
- chatbot_ui_auto_sync.py - Main chatbot application
Note: Run these cells to create the files in your working directory.
%%writefile conversation_logger.py
"""Conversation logger for saving chats to JSON"""
import json
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Optional
class ConversationLogger:
"""Logs conversations to JSON files"""
def __init__(self, output_dir: Path = Path("data/conversations")):
self.output_dir = output_dir
self.output_dir.mkdir(parents=True, exist_ok=True)
def save_conversation(
self,
messages: List[Dict[str, str]],
session_id: str,
model: str,
metadata: Optional[Dict] = None
) -> Path:
"""Save conversation to JSON file"""
# Check if metadata contains auto_save flag
is_auto_save = metadata and metadata.get('auto_save', False)
conversation_data = {
"session_id": session_id,
"timestamp": datetime.utcnow().isoformat() + "Z",
"model": model,
"messages": messages,
"turn_count": len([m for m in messages if m["role"] == "user"]),
"metadata": {k: v for k, v in (metadata or {}).items() if k != 'auto_save'}
}
# For auto-save: use session ID only (continuous updates)
# For manual save: add timestamp (creates snapshot)
if is_auto_save:
filename = f"conversation_{session_id}.json"
else:
filename = f"conversation_{session_id}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
filepath = self.output_dir / filename
with open(filepath, 'w', encoding='utf-8') as f:
json.dump(conversation_data, f, indent=2, ensure_ascii=False)
return filepath
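If you want to sanity-check the logger before wiring up the chatbot, a minimal sketch with a made-up conversation looks like this:
from pathlib import Path
from conversation_logger import ConversationLogger
# Save a tiny made-up conversation to verify the logger writes JSON as expected
logger = ConversationLogger(output_dir=Path("data/conversations"))
path = logger.save_conversation(
    messages=[
        {"role": "user", "content": "What does Label Studio do?"},
        {"role": "assistant", "content": "It helps teams label and evaluate data, including chat conversations."},
    ],
    session_id="logger-smoke-test",
    model="example-model",
)
print(f"Wrote {path}")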
%%writefile auto_sync.py
"""Automatic Label Studio sync helper"""
import os
import json
from pathlib import Path
from typing import Optional
from datetime import datetime
from label_studio_sdk.client import LabelStudio
from dotenv import load_dotenv
load_dotenv()
class LabelStudioSync:
"""Helper class to push conversations to Label Studio"""
def __init__(
self,
url: Optional[str] = None,
api_key: Optional[str] = None,
project_id: Optional[int] = None
):
self.url = url or os.getenv('LABEL_STUDIO_URL', 'https://app.humansignal.com')
self.api_key = api_key or os.getenv('LABEL_STUDIO_API_KEY')
        project_id_env = os.getenv('LABEL_STUDIO_PROJECT_ID')
        self.project_id = project_id or (int(project_id_env) if project_id_env else None)
if not self.project_id:
print("⚠️ LABEL_STUDIO_PROJECT_ID not set - auto-sync disabled")
self.client = None
if not self.url:
print("⚠️ LABEL_STUDIO_URL not set - auto-sync disabled")
self.client = None
if not self.api_key:
print("⚠️ LABEL_STUDIO_API_KEY not set - auto-sync disabled")
self.client = None
else:
self.client = LabelStudio(base_url=self.url, api_key=self.api_key)
def is_enabled(self) -> bool:
"""Check if auto-sync is enabled"""
        return self.client is not None and bool(self.project_id)
async def push_conversation(self, conversation_file: Path) -> bool:
"""Push a single conversation to Label Studio"""
if not self.is_enabled():
return False
try:
# Load conversation
with open(conversation_file, 'r') as f:
data = json.load(f)
# Format as Label Studio task
task = {
'data': {
'chat': data['messages'], # Changed from 'messages' to 'chat' to match label config
'text': 'Review the conversation below and evaluate the quality of the chat interaction.',
'session_id': data.get('session_id', 'unknown'),
'thread_id': data.get('metadata', {}).get('thread_id', data.get('session_id')),
'model': data.get('model', 'unknown'),
'turn_count': data.get('turn_count', 0),
'timestamp': data.get('timestamp', ''),
'version': data.get('metadata', {}).get('version', 1),
},
'meta': {
'filename': conversation_file.name,
'imported_at': datetime.utcnow().isoformat() + 'Z',
'auto_synced': True
}
}
# Check if already imported
session_id = data.get('session_id')
existing = self.client.tasks.list(project=self.project_id)
for existing_task in existing:
if hasattr(existing_task, 'data') and \
existing_task.data.get('session_id') == session_id:
print(f"⏭️ Session {session_id} already in Label Studio")
return False
# Import task
self.client.projects.import_tasks(id=self.project_id, request=[task])
print(f"✅ Auto-synced {session_id} to Label Studio")
return True
except Exception as e:
print(f"❌ Failed to sync {conversation_file.name}: {e}")
return False
# Global instance
_sync = None
def get_sync() -> LabelStudioSync:
"""Get the global sync instance"""
global _sync
if _sync is None:
_sync = LabelStudioSync()
return _sync
async def auto_push_conversation(conversation_file: Path):
"""Push a conversation to Label Studio (async wrapper)"""
sync = get_sync()
if sync.is_enabled():
await sync.push_conversation(conversation_file)
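You can also exercise the sync helper directly from the notebook (top-level await works in Jupyter and Colab). This sketch assumes the credentials and project ID are available as environment variables (Step 4 below sets them explicitly) and that at least one conversation file exists, for example the one written by the logger check above:
from pathlib import Path
from auto_sync import auto_push_conversation
# Push the most recently saved conversation file, if any
files = sorted(Path("data/conversations").glob("conversation_*.json"))
if files:
    await auto_push_conversation(files[-1])
else:
    print("No conversation files found - run the logger example first")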
%%writefile chatbot_ui_auto_sync.py
"""
Chainlit Chatbot with Automatic Label Studio Sync
Handles resumed conversations with versioned tasks
"""
import os
import uuid
import json
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Optional
from dotenv import load_dotenv
load_dotenv()
import chainlit as cl
try:
import openai
OPENAI_AVAILABLE = True
except ImportError:
OPENAI_AVAILABLE = False
try:
import anthropic
ANTHROPIC_AVAILABLE = True
except ImportError:
ANTHROPIC_AVAILABLE = False
try:
import ollama
OLLAMA_AVAILABLE = True
except ImportError:
OLLAMA_AVAILABLE = False
from conversation_logger import ConversationLogger
from auto_sync import get_sync
# Configuration
MIN_TURNS_FOR_SYNC = 2 # Minimum conversation length to sync
def get_available_models() -> Dict[str, List[str]]:
"""Return available models by provider"""
models = {}
if OPENAI_AVAILABLE and os.getenv("OPENAI_API_KEY"):
models["OpenAI"] = ["gpt-4", "gpt-3.5-turbo"]
if ANTHROPIC_AVAILABLE and os.getenv("ANTHROPIC_API_KEY"):
models["Anthropic"] = ["claude-3-sonnet-20240229"]
if OLLAMA_AVAILABLE:
try:
ollama_models = ollama.list()
if ollama_models and ollama_models.get('models'):
models["Ollama"] = [m['name'] for m in ollama_models['models']]
else:
models["Ollama"] = ["llama3.2:3b"]
except:
models["Ollama"] = ["llama3.2:3b"]
return models
async def generate_response(messages: List[Dict[str, str]], model: str) -> str:
"""Generate response from specified model"""
provider, model_name = model.split("/", 1)
if provider == "OpenAI":
client = openai.OpenAI()
response = client.chat.completions.create(
model=model_name,
messages=messages
)
return response.choices[0].message.content
elif provider == "Anthropic":
client = anthropic.Anthropic()
response = client.messages.create(
model=model_name,
messages=messages,
max_tokens=1024
)
return response.content[0].text
elif provider == "Ollama":
msg = cl.Message(content="")
await msg.send()
full_response = ""
stream = ollama.chat(
model=model_name,
messages=messages,
stream=True
)
for chunk in stream:
content = chunk['message']['content']
full_response += content
await msg.stream_token(content)
await msg.update()
return full_response
return "Error: Unknown provider"
def get_or_create_thread_id() -> str:
"""Get persistent thread ID for this conversation"""
thread_id = cl.user_session.get("thread_id")
if not thread_id:
thread_id = str(uuid.uuid4())[:16]
cl.user_session.set("thread_id", thread_id)
return thread_id
@cl.on_chat_start
async def start():
"""Initialize chat session"""
available_models = get_available_models()
if not available_models:
await cl.Message(
content="⚠️ **No LLM providers configured!**\n\n"
"Set up Ollama: `brew install ollama && ollama pull llama3.2:3b`"
).send()
return
model_list = []
for provider, models in available_models.items():
for model in models:
model_list.append(f"{provider}/{model}")
thread_id = get_or_create_thread_id()
cl.user_session.set("messages", [])
cl.user_session.set("logger", ConversationLogger())
cl.user_session.set("model", model_list[0] if model_list else None)
cl.user_session.set("available_models", model_list)
cl.user_session.set("is_resumed", False)
sync = get_sync()
sync_status = f"✅ Auto-sync enabled (Project {sync.project_id})" if sync.is_enabled() else "💾 Auto-sync disabled"
await cl.Message(
content=f"💬 **Multi-Turn Chat Feedback**\n\n"
f"**Thread:** `{thread_id}`\n"
f"**Model:** `{model_list[0] if model_list else 'None'}`\n\n"
f"{sync_status}\n\n"
f"Ask me anything!"
).send()
@cl.on_chat_resume
async def on_resume(thread: Dict):
"""Handle conversation resumption"""
available_models = get_available_models()
model_list = []
for provider, models in available_models.items():
for model in models:
model_list.append(f"{provider}/{model}")
thread_id = thread.get("id")
steps = thread.get("steps", [])
messages = []
for step in steps:
if step.get("type") in ["user_message", "assistant_message"]:
role = "user" if step["type"] == "user_message" else "assistant"
messages.append({"role": role, "content": step.get("output", "")})
cl.user_session.set("thread_id", thread_id)
cl.user_session.set("messages", messages)
cl.user_session.set("logger", ConversationLogger())
cl.user_session.set("model", model_list[0] if model_list else None)
cl.user_session.set("is_resumed", True)
turn_count = len([m for m in messages if m["role"] == "user"])
await cl.Message(
content=f"🔄 **Resumed** | Thread: `{thread_id}` | Previous turns: {turn_count}"
).send()
@cl.on_message
async def main(message: cl.Message):
"""Handle incoming messages"""
messages = cl.user_session.get("messages")
model = cl.user_session.get("model")
if not model:
await cl.Message(content="⚠️ No model selected").send()
return
messages.append({"role": "user", "content": message.content})
try:
response = await generate_response(messages, model)
messages.append({"role": "assistant", "content": response})
cl.user_session.set("messages", messages)
# Auto-save after each response
logger = cl.user_session.get("logger")
thread_id = get_or_create_thread_id()
logger.save_conversation(
messages=messages,
session_id=thread_id,
model=model,
metadata={"auto_save": True, "last_updated": datetime.utcnow().isoformat()}
)
except Exception as e:
await cl.Message(content=f"❌ Error: {str(e)}").send()
@cl.on_chat_end
async def on_chat_end():
"""Auto-push to Label Studio (with versioning for resumes)"""
messages = cl.user_session.get("messages")
thread_id = get_or_create_thread_id()
model = cl.user_session.get("model")
is_resumed = cl.user_session.get("is_resumed", False)
if not messages:
return
turn_count = len([m for m in messages if m["role"] == "user"])
if turn_count < MIN_TURNS_FOR_SYNC:
print(f"⏭️ Only {turn_count} turns, skipping sync")
return
sync = get_sync()
if not sync.is_enabled():
print(f"ℹ️ Auto-sync disabled")
return
try:
# Find existing versions
existing_tasks = sync.client.tasks.list(project=sync.project_id)
existing_versions = []
for task in existing_tasks:
if hasattr(task, 'data'):
task_thread_id = task.data.get('thread_id') or task.data.get('session_id')
if task_thread_id and task_thread_id.split('_v')[0] == thread_id.split('_v')[0]:
existing_versions.append(task)
version = len(existing_versions) + 1
versioned_session_id = f"{thread_id}_v{version}"
# Save with version
logger = cl.user_session.get("logger")
filepath = logger.save_conversation(
messages=messages,
session_id=versioned_session_id,
model=model,
metadata={
"auto_save": True,
"version": version,
"thread_id": thread_id,
"was_resumed": is_resumed
}
)
# Create task
conversation_data = json.loads(filepath.read_text())
task = {
'data': {
'chat': conversation_data['messages'], # Changed from 'messages' to 'chat' to match label config
'text': 'Review the conversation below and evaluate the quality of the chat interaction.',
'session_id': versioned_session_id,
'thread_id': thread_id,
'model': model,
'turn_count': turn_count,
'version': version,
'timestamp': datetime.utcnow().isoformat() + 'Z',
},
'meta': {
'auto_synced': True,
'is_resume': is_resumed,
'version': version,
}
}
sync.client.projects.import_tasks(id=sync.project_id, request=[task])
if version > 1:
print(f"✅ Created version {version} (RESUME)")
else:
print(f"✅ Created version 1 (NEW)")
except Exception as e:
print(f"❌ Sync failed: {e}")
if __name__ == "__main__":
pass
Step 3: Set Up Ollama (Local LLM)
For this example, we'll use Ollama, which runs a local LLM on your machine.
Install Ollama:
# macOS
brew install ollama
# Or download from https://ollama.ai
Pull a model:
ollama pull llama3.2:3b # 3B model works on 16GB RAM
Verify it’s running:
ollama list
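If you want to confirm the model responds from Python before launching the chatbot, a quick non-streaming check looks like this (adjust the model name if you pulled a different one):
import ollama
# One-off request to confirm the local model answers; assumes llama3.2:3b has been pulled
reply = ollama.chat(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(reply["message"]["content"])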
Step 4: Set Environment Variables
Set the project ID from earlier so the chatbot knows where to sync conversations.
# Set environment variables for auto-sync
import os
# Set all required environment variables for auto-sync
os.environ['LABEL_STUDIO_URL'] = ls_url
os.environ['LABEL_STUDIO_API_KEY'] = ls_api_key
os.environ['LABEL_STUDIO_PROJECT_ID'] = str(project_id)
print(f"✅ Environment configured:")
print(f" LABEL_STUDIO_URL: {ls_url}")
print(f" LABEL_STUDIO_PROJECT_ID: {project_id}")
print(f" LABEL_STUDIO_API_KEY: {'*' * 20}... (hidden)")
print(f"\n🔄 Auto-sync: ENABLED")
print(f" Conversations will automatically sync to Label Studio when you close the chat!")
Step 5: Run the Chainlit Chatbot
Now run the chatbot! It will automatically sync conversations to Label Studio.
To run from terminal:
chainlit run chatbot_ui_auto_sync.py --port 8087
Then:
- Open http://localhost:8087 in your browser
- Have a conversation (at least 2 turns)
- Close the browser tab
- Check Label Studio - your conversation will be there!
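If you want to double-check from the notebook that the conversation arrived, you can list the project's tasks with the SDK (this mirrors the lookup the sync helper performs):
# List the tasks in the project and show which sessions have been synced
for task in ls.tasks.list(project=project_id):
    data = task.data or {}
    print(f"Task {task.id}: session={data.get('session_id')}, turns={data.get('turn_count')}, version={data.get('version')}")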
Features:
- ✅ Automatic conversation capture
- ✅ Auto-sync to Label Studio on disconnect
- ✅ Conversation resumption with versioning
- ✅ Local LLM (free and private!)
What happens:
- Each message auto-saves to data/conversations/
- When you close the chat, it pushes to Label Studio
- If you resume the chat later, it creates a new version (v2, v3, etc.)
Ready to annotate! Visit your Label Studio project to start evaluating the conversations.
Verify Files Created
Let’s check that all necessary files were created.
from pathlib import Path
required_files = [
'conversation_logger.py',
'auto_sync.py',
'chatbot_ui_auto_sync.py'
]
print("📁 Checking files...")
for file in required_files:
if Path(file).exists():
print(f" ✅ {file}")
else:
print(f" ❌ {file} - MISSING!")
print(f"\n📋 Project URL: {project_url}")
print(f"\n🚀 Ready to run:")
print(f" chainlit run chatbot_ui_auto_sync.py --port 8087")
Summary
You’ve successfully set up:
- ✅ Label Studio Project - Created with chatbot evaluation template
- ✅ Conversation Logger - Saves chats to JSON automatically
- ✅ Auto-Sync - Pushes conversations to Label Studio
- ✅ Chainlit Chatbot - Full UI with local LLM support
Complete workflow:
User chats → Auto-save to JSON → Close browser → Auto-push to Label Studio → Ready to annotate!
Key Features:
- 🔒 Private - Local LLM, no data sent to cloud
- 🔄 Versioning - Resume conversations safely
- ⚡ Automatic - Zero manual export needed
Next Steps:
- Run the chatbot: chainlit run chatbot_ui_auto_sync.py --port 8087
- Have some conversations
- Annotate them in Label Studio
- Export annotations for model training or system prompt adjustment
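Once annotations start coming in, you can pull them back with the SDK for downstream use. The snippet below is a minimal sketch, assuming the client exposes annotations.list(id=<task_id>) alongside the tasks.list call used earlier; check your SDK version if the call differs:
# Sketch: collect completed annotations (e.g. for fine-tuning or prompt analysis)
results = []
for task in ls.tasks.list(project=project_id):
    # Assumption: annotations.list(id=...) returns the annotations attached to a task
    for annotation in ls.annotations.list(id=task.id):
        results.append({
            "session_id": (task.data or {}).get("session_id"),
            "result": annotation.result,  # raw labeling results for this annotation
        })
print(f"Collected {len(results)} annotations from project {project_id}")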