Bonito Docs
Everything you need to connect your cloud AI providers, deploy models, and route requests through a single gateway.
Getting Started
Bonito is a unified AI gateway that connects your AI providers (AWS Bedrock, Azure OpenAI, Google Cloud Vertex AI, OpenAI Direct, Anthropic Direct, and Groq) and lets you manage all your models from a single dashboard. You get one API endpoint, one place to track costs, and one control plane for your entire AI stack.
Quick start (5 minutes)
1. Sign up at getbonito.com/register — one account covers your entire organization.
2. Go to Providers → Add Provider and connect at least one cloud provider (AWS, Azure, or GCP).
3. Bonito validates your credentials and syncs all available models automatically.
4. Enable the models you want — click Enable on any model or use bulk activation for up to 20 at once.
5. Go to Gateway → Create Key to generate an API key.
6. Point any OpenAI-compatible SDK at https://getbonito.com/v1 with your new key.
# Make your first request through Bonito
curl -X POST https://getbonito.com/v1/chat/completions \
-H "Authorization: Bearer YOUR_BONITO_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "anthropic.claude-3-sonnet-20240229-v1:0",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 256
}'
Provider Setup
Connect your cloud provider accounts so Bonito can discover your available models, route requests, and track costs. Each provider requires different credentials.
AWS Bedrock
To connect AWS, you need an IAM Access Key ID and Secret Access Key. Bonito validates them using STS and checks Bedrock permissions automatically.
| Field | Description |
|---|---|
| Access Key ID | Your IAM access key |
| Secret Access Key | Your IAM secret key |
Azure OpenAI
Azure requires a service principal with access to your Azure OpenAI resource. The endpoint must be a custom-subdomain URL (e.g., https://your-resource.openai.azure.com/), not a generic regional endpoint.
| Field | Description |
|---|---|
| Tenant ID | Azure AD tenant ID |
| Client ID | Service principal application ID |
| Client Secret | Service principal secret |
| Subscription ID | Your Azure subscription |
| Resource Group | Resource group with your OpenAI resource |
| Endpoint URL | Custom subdomain endpoint URL |
https://eastus.api.cognitive.microsoft.com/ will not work. You must use an Azure OpenAI resource with a custom subdomain.
Google Cloud (Vertex AI)
GCP requires your Project ID and a Service Account JSON key file. Paste the entire JSON contents — Bonito validates the format in the browser before sending.
| Field | Description |
|---|---|
| Project ID | Your GCP project ID |
| Service Account JSON | Full JSON key file contents |
OpenAI Direct
For teams using OpenAI directly (not through Azure). Connect with just an API key to access GPT-4o, GPT-4o mini, and other OpenAI models.
| Field | Description |
|---|---|
| API Key | Your API key from platform.openai.com |
Anthropic Direct
For teams that want to use Claude models directly through Anthropic without going through AWS Bedrock. Connect with just an API key.
| Field | Description |
|---|---|
| API Key | Your API key from console.anthropic.com |
Groq
Ultra-fast inference for open-source models like Llama 3.3 and Mixtral. Groq's LPU hardware delivers extremely low latency. Connect with just an API key.
| Field | Description |
|---|---|
| API Key | Your API key from console.groq.com |
Permissions & IAM
Bonito supports two IAM setup modes for every provider. Choose based on your security requirements.
Quick Start
Attach a single managed role with broad permissions. Fast to set up, ideal for evaluation and testing.
Enterprise (Recommended)
Separate least-privilege policies per capability. Only grant the exact permissions each feature needs.
AWS Bedrock permissions
In Enterprise mode, each capability has its own policy so you only grant what you need:
| Policy | Actions | Required? |
|---|---|---|
| Core | ListFoundationModels, GetFoundationModel, InvokeModel, InvokeModelWithResponseStream, sts:GetCallerIdentity | Always |
| Provisioning | Create/Get/Update/Delete/ListProvisionedModelThroughput | If deploying reserved capacity |
| Model Activation | PutFoundationModelEntitlement | If enabling models from Bonito UI |
| Cost Tracking | ce:GetCostAndUsage, GetCostForecast, GetDimensionValues, GetTags | If you want spend visibility |
Example IAM policy (core only — minimum to get started):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"bedrock:ListFoundationModels",
"bedrock:GetFoundationModel",
"bedrock:InvokeModel",
"bedrock:InvokeModelWithResponseStream"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": ["sts:GetCallerIdentity"],
"Resource": "*"
}
]
}
Azure permissions
Quick Start: Assign Cognitive Services Contributor on the Azure OpenAI resource.
Enterprise: Create a custom role with only the exact permissions Bonito uses — account read, deployments read/write/delete, models read, and inference actions.
az role definition create --role-definition '{
"Name": "Bonito AI Operator",
"Actions": [
"Microsoft.CognitiveServices/accounts/read",
"Microsoft.CognitiveServices/accounts/deployments/read",
"Microsoft.CognitiveServices/accounts/deployments/write",
"Microsoft.CognitiveServices/accounts/deployments/delete",
"Microsoft.CognitiveServices/accounts/models/read"
],
"DataActions": [
"Microsoft.CognitiveServices/accounts/OpenAI/deployments/chat/completions/action",
"Microsoft.CognitiveServices/accounts/OpenAI/deployments/completions/action",
"Microsoft.CognitiveServices/accounts/OpenAI/deployments/embeddings/action"
],
"AssignableScopes": ["/subscriptions/YOUR_SUBSCRIPTION_ID"]
}'
Optionally add Cost Management Reader at subscription scope for spend visibility.
GCP permissions
Quick Start: Assign roles/aiplatform.user to the service account.
Enterprise: Create a custom role with discovery (publishers.get, publisherModels.get), invocation (endpoints.predict), endpoint management (create/get/list/update/delete/deploy/undeploy), model metadata (models.list, models.get), and project validation (resourcemanager.projects.get).
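The Enterprise custom role can be defined in a YAML file and created with `gcloud iam roles create --file`. The sketch below maps the capabilities listed above to permission strings; the exact permission names (especially the publisher-model ones) are assumptions — verify them against the GCP IAM permissions reference before applying:

```yaml
# bonito-role.yaml — least-privilege custom role sketch for Bonito.
# Create with (project ID is a placeholder):
#   gcloud iam roles create bonitoOperator --project=YOUR_PROJECT --file=bonito-role.yaml
title: Bonito AI Operator
stage: GA
includedPermissions:
  # Discovery (permission names assumed from the capability list above)
  - aiplatform.publisherModels.get
  # Invocation
  - aiplatform.endpoints.predict
  # Endpoint management
  - aiplatform.endpoints.create
  - aiplatform.endpoints.get
  - aiplatform.endpoints.list
  - aiplatform.endpoints.update
  - aiplatform.endpoints.delete
  - aiplatform.endpoints.deploy
  - aiplatform.endpoints.undeploy
  # Model metadata
  - aiplatform.models.list
  - aiplatform.models.get
  # Project validation
  - resourcemanager.projects.get
```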
Set iam_mode = "least_privilege" for Enterprise or "managed" for Quick Start.
Model Management
Once a provider is connected, Bonito automatically syncs all available models. You can view, search, filter, and enable models from a single catalog.
One-click model activation
Models with a 🔒 icon exist in your provider's catalog but aren't yet enabled in your cloud account. Instead of switching to each provider's console, enable them directly from Bonito:
1. Go to the Models page and find the model you want to enable.
2. Click the Enable button on the model card.
3. Bonito handles the provider-specific activation (Bedrock entitlements, Azure deployments, GCP API enablement).
4. Some models may require approval from the provider and won't activate instantly.
Playground
Test any enabled chat model directly in the browser. The Playground supports single-model chat and side-by-side comparison mode (up to 4 models). Token usage and cost appear after each response. Only chat-capable, enabled models are shown in the picker.
Deployments
Deploy AI models directly into your cloud from the Bonito UI — no console-hopping required. Bonito creates real deployments in your cloud account.
| Provider | Deployment Type | What Bonito Creates |
|---|---|---|
| AWS Bedrock | On-demand or Provisioned Throughput | On-demand: validates access. PT: creates reserved capacity with commitment (1 week–6 months) |
| Azure OpenAI | Model deployment with TPM capacity | Creates a deployment on your Azure OpenAI resource (Standard or GlobalStandard tier) |
| GCP Vertex AI | Serverless (no provisioning needed) | Verifies access — GCP models are serverless by default |
Note: Provisioned Throughput deployments require the bedrock:CreateProvisionedModelThroughput IAM permission.
Gateway API
Bonito provides an OpenAI-compatible API endpoint so you can use any connected model with tools that support the OpenAI format. One API key, all your providers.
Endpoint
POST https://getbonito.com/v1/chat/completions
Authentication
Generate API keys from the Gateway page in the dashboard. Include your key in the Authorization header:
Authorization: Bearer YOUR_BONITO_API_KEY
Example: Python (OpenAI SDK)
from openai import OpenAI
client = OpenAI(
base_url="https://getbonito.com/v1",
api_key="YOUR_BONITO_API_KEY"
)
response = client.chat.completions.create(
model="anthropic.claude-3-sonnet-20240229-v1:0",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Example: curl
curl https://getbonito.com/v1/chat/completions \
-H "Authorization: Bearer YOUR_BONITO_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "anthropic.claude-3-sonnet-20240229-v1:0",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 256
}'
Model names
Use the provider-native model IDs shown on the Models page in Bonito. For example: anthropic.claude-3-sonnet-20240229-v1:0 for AWS Bedrock, gpt-4o for Azure, gemini-1.5-pro for GCP.
When using a routing policy, pass the policy name as the model field instead of a specific model ID.
Routing Policies
Routing policies let you automatically select the best model for each request based on your priorities. Create policies from Routing → Create Policy in the dashboard.
Cost-Optimized
Automatically selects the cheapest capable model for each request. Route routine traffic to economy models and save 40–70% versus using a single premium model for everything.
Failover Chain
Define a primary model and one or more fallbacks. If the primary fails or is unavailable, Bonito automatically tries the next model in the chain. Great for high-availability use cases.
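The failover behavior described above can be sketched in a few lines. This is an illustration of the chain semantics, not Bonito's implementation — `send` stands in for whatever function actually makes the provider request:

```python
def call_with_failover(chain, send):
    """Try each model in the chain in order; return (model, response) for
    the first one that succeeds. `chain` is a list of model IDs and `send`
    is a callable that makes the request and raises on failure."""
    last_error = None
    for model in chain:
        try:
            return model, send(model)
        except Exception as exc:  # provider error, timeout, rate limit, ...
            last_error = exc
    raise RuntimeError(f"all models in chain failed: {last_error}")
```

With a chain like `["primary-model", "fallback-model"]`, a timeout on the primary transparently yields the fallback's response.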
A/B Testing
Split traffic between models using percentage weights (must sum to 100). Test new models in production with controlled rollout — e.g., send 90% to your current model and 10% to a new one.
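A sketch of the weight rules above — weights must sum to 100, and hashing the request ID gives each request a sticky bucket so the same caller keeps hitting the same arm. This illustrates the percentage-split semantics, not Bonito's internal routing code:

```python
import hashlib

def validate_weights(weights):
    # Bonito requires A/B weights to sum to exactly 100
    if sum(weights.values()) != 100:
        raise ValueError("A/B weights must sum to 100")

def pick_model(weights, request_id):
    """Deterministically bucket a request into one of the weighted models."""
    validate_weights(weights)
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    cumulative = 0
    for model, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return model
```

For the 90/10 rollout example, `{"current-model": 90, "new-model": 10}` sends roughly 10% of request IDs to the new model.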
Cross-Region Inference
Bonito automatically creates cross-region inference profiles for AWS Bedrock models. If us-east-1 is overloaded or experiencing downtime, requests seamlessly route to us-west-2 or eu-west-1 without any action on your part.
How it works
- Bonito handles the us. prefix routing at the gateway level, so cross-region inference profiles are created and managed automatically.
- No configuration needed — just connect your AWS provider and Bonito takes care of the rest.
- If a primary region is throttled or returns errors, requests are retried in an alternate region transparently.
True high-availability
Cross-region inference combines with multi-provider failover for true high-availability. For example, if Claude on Bedrock fails across all regions, Bonito can fall back to Anthropic Direct automatically. No other platform (Portkey, LiteLLM, Helicone) offers automatic cross-region inference profiles with intelligent failover built in.
Model Aliases
Bonito supports shorthand model aliases that resolve to provider-specific versioned model IDs. This means you can reference models by simple names and switch providers without changing your code.
How aliases work
When you send a request with an alias like claude-sonnet, Bonito resolves it to the correct model ID for your active provider. On Bedrock, that becomes anthropic.claude-3-sonnet-20240229-v1:0. On Anthropic Direct, it becomes claude-3-sonnet-20240229.
- Switch providers without changing a single line of code in your application.
- Aliases are updated automatically when new model versions are released.
- Works with routing policies — aliases resolve before routing rules are applied.
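The resolution step can be pictured as a lookup table keyed by alias and provider. The table below is a hypothetical sketch using the claude-sonnet example from this section; Bonito maintains the real mapping server-side:

```python
# Hypothetical alias table mirroring the resolution described above
ALIASES = {
    "claude-sonnet": {
        "bedrock": "anthropic.claude-3-sonnet-20240229-v1:0",
        "anthropic": "claude-3-sonnet-20240229",
    },
}

def resolve(model, provider):
    """Resolve an alias to the provider-specific model ID.
    Unknown names pass through unchanged, so provider-native IDs keep working."""
    return ALIASES.get(model, {}).get(provider, model)
```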
from openai import OpenAI
client = OpenAI(
base_url="https://getbonito.com/v1",
api_key="YOUR_BONITO_API_KEY"
)
# Use an alias instead of a provider-specific model ID
response = client.chat.completions.create(
model="claude-sonnet", # Resolves automatically
messages=[{"role": "user", "content": "Hello!"}]
)
Notifications
Bonito sends in-app notifications for important events across the platform so you never miss a deployment status change or cost alert.
Notification types
- Deployment lifecycle — creation, scaling, completion, and failure alerts for deployments across all providers.
- Spend alerts — get notified when costs approach or exceed your configured budget thresholds.
- Model activation — confirmation when models are enabled or if activation requires provider approval.
- Provider health — alerts when a provider connection has issues or needs credential rotation.
The notification bell in the dashboard header shows your unread count. Click to see the full list with read/unread states. You can configure alert rules for budget thresholds with email and in-app delivery preferences.
Cost Management
Monitor AI spending across all connected providers from a single dashboard. Bonito pulls real cost data from your cloud accounts and shows breakdowns by model, provider, and time period.
What you get
- Aggregated costs across AWS, Azure, and GCP with daily/weekly/monthly views.
- Cost forecast with projected spending and confidence bounds.
- Per-model and per-provider breakdowns to identify expensive workloads.
- Budget alerts — set thresholds and get notified before you exceed them.
- Optimization recommendations — Bonito suggests cheaper model alternatives and cross-provider routing opportunities.
CLI Tool
bonito-cli is a Python CLI for managing your Bonito resources from the terminal. It's useful for scripting, CI/CD pipelines, and terminal-first workflows.
Installation
pip install bonito-cli
Authentication
# Login with your Bonito credentials
bonito auth login

# Or set your API key directly
export BONITO_API_KEY=your-key-here
Common commands
# List connected providers
bonito providers list

# List available models
bonito models list

# Create a gateway API key
bonito gateway keys create --name "my-key"

# List routing policies
bonito routing list

# Check costs
bonito costs summary
Run bonito --help for the full list of commands and options.
Integrations
Claude Cowork
Claude Cowork is Anthropic's agentic desktop application. The Bonito plugin gives Claude deep knowledge of your AI infrastructure, so you can deploy providers, create agents, configure routing, and analyze costs through natural conversation.
Install the Plugin
Install from the Claude plugin marketplace or via Claude Code:
claude plugin install bonito
What the Plugin Adds
6 domain skills that Claude draws on automatically when relevant:
- deploy-stack: Deploy infrastructure from a bonito.yaml config
- manage-providers: Connect and manage cloud AI providers
- create-agent: Create BonBon agents and Bonobot orchestrators
- gateway-routing: Configure failover, cost-optimized routing, A/B testing
- cost-analysis: Analyze spending and recommend optimizations
- debug-issues: Troubleshoot gateway, provider, and agent problems
MCP Server
The plugin connects to the Bonito MCP server, which exposes 18 tools for direct API access. Install via PyPI or Docker:
pip install bonito-mcp
Or run with Docker:
docker run -e BONITO_API_KEY=your-key -p 8080:8080 bonitoai/mcp-server
Claude Desktop Configuration
Add this to your Claude Desktop MCP config:
{
"mcpServers": {
"bonito": {
"command": "bonito-mcp",
"env": {
"BONITO_API_KEY": "your-bonito-api-key"
}
}
}
}
Declarative Config (bonito.yaml)
Define your entire AI stack in a single bonito.yaml file. Providers, agents, MCP servers, knowledge bases, routing, and triggers — all in one place. Deploy with a single command.
bonito deploy -f bonito.yaml
Example: AWS Bedrock stack
A complete stack using AWS Bedrock with two agents and a knowledge base:
version: "1"
name: my-ai-stack
gateway:
providers:
- name: aws
priority: 1
models:
- anthropic.claude-3-sonnet-20240229-v1:0
- amazon.nova-pro-v1:0
- amazon.titan-embed-text-v2:0
region: us-east-1
access_key: ${AWS_ACCESS_KEY_ID}
secret_key: ${AWS_SECRET_ACCESS_KEY}
routing:
strategy: cost-optimized
fallback: true
agents:
support-bot:
type: bonbon
mode: simple
display_name: Support Agent
model:
primary: anthropic.claude-3-sonnet-20240229-v1:0
fallback: amazon.nova-pro-v1:0
system_prompt: |
You are a helpful support agent...
rag:
knowledge_base: company-docs
code-reviewer:
type: bonbon
mode: advanced
display_name: Code Reviewer
model:
primary: anthropic.claude-3-sonnet-20240229-v1:0
mcp_servers:
- github
knowledge_bases:
company-docs:
description: Internal documentation
sources:
- type: directory
path: ./docs/
glob: "**/*.md"
embedding:
model: amazon.titan-embed-text-v2:0
provider: aws
Example: Multi-provider stack
A stack using Groq, Anthropic, OpenAI, and GCP with multi-agent orchestration:
version: "1"
name: multi-provider-stack
gateway:
providers:
- name: groq
priority: 1
models: [llama-3.3-70b-versatile, mixtral-8x7b-32768]
api_key: ${GROQ_API_KEY}
- name: anthropic
priority: 1
models: [claude-sonnet-4-20250514]
api_key: ${ANTHROPIC_API_KEY}
- name: openai
priority: 2
models: [gpt-4o, gpt-4o-mini]
api_key: ${OPENAI_API_KEY}
- name: gcp
priority: 2
models: [gemini-1.5-pro, gemini-1.5-flash]
project_id: ${GCP_PROJECT_ID}
service_account: ${GCP_SERVICE_ACCOUNT_JSON}
routing:
strategy: cost-optimized
fallback: true
retry_attempts: 2
agents:
orchestrator:
type: bonobot
display_name: Operations Center
model:
primary: claude-sonnet-4-20250514
fallback: gpt-4o
delegates:
- agent: fast-responder
domains: [triage, alerts, summaries]
- agent: deep-analyst
domains: [code-review, analysis, research]
fast-responder:
type: bonbon
mode: simple
model:
primary: groq/llama-3.3-70b-versatile
system_prompt: |
You are a fast-response agent for triage...
deep-analyst:
type: bonbon
mode: advanced
model:
primary: claude-sonnet-4-20250514
fallback: gpt-4o
mcp_servers:
- github
- jira
Use environment variable references like ${AWS_ACCESS_KEY_ID} to keep secrets out of your config file. Bonito resolves them at deploy time from your environment or a .env file.
BonBon Agents
BonBon is Bonito's managed agent service. Create AI agents with custom system prompts, connect them to knowledge bases for RAG, and deploy them as embeddable chat widgets or API endpoints — all without managing infrastructure.
Agent tiers
Simple — $49/mo
Pre-built agent with a system prompt, optional knowledge base, and an embeddable widget. Ideal for FAQ bots, customer support, and internal assistants. Deploy in minutes.
Advanced — $99/mo
Agent with MCP tool integration, multiple knowledge bases, webhook triggers, and custom workflows. Built for agents that need to interact with external systems.
Creating an agent
1. Go to Agents → Create Agent in the dashboard.
2. Choose a tier (Simple or Advanced) and give your agent a name.
3. Write a system prompt that defines your agent's personality, constraints, and behavior.
4. Optionally attach a knowledge base for RAG-powered responses.
5. Select the backing model (or models for Advanced tier).
6. Deploy — Bonito gives you a widget embed code and an API endpoint.
# Create an agent via CLI
bonito agents create \
  --name "Support Bot" \
  --tier simple \
  --model "anthropic.claude-3-sonnet-20240229-v1:0" \
  --system-prompt "You are a helpful support agent for Acme Corp..."

# List your agents
bonito agents list

# Get agent details
bonito agents get --id ag_abc123
System prompts
System prompts define how your agent behaves. Write clear instructions about the agent's role, tone, constraints, and what it should or shouldn't do. You can update the system prompt at any time without redeploying.
You are a customer support agent for Acme Corp.

Rules:
- Only answer questions about Acme products and services.
- If you don't know the answer, say so and offer to connect with a human.
- Be friendly, concise, and professional.
- Never make up pricing or feature information — use the knowledge base.
- Respond in the same language as the customer.
RAG integration
Connect a knowledge base to give your agent access to your documents. When a user asks a question, Bonito retrieves relevant chunks from your KB and includes them in the context before the model generates a response.
# Attach a knowledge base to an agent
bonito agents update --id ag_abc123 \
  --knowledge-base kb_xyz789

# Attach multiple KBs (Advanced tier only)
bonito agents update --id ag_abc123 \
  --knowledge-base kb_xyz789 \
  --knowledge-base kb_docs456
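The retrieve-then-generate flow described above can be sketched with a deliberately naive scorer. Bonito's actual retrieval uses embeddings; this stand-in ranks chunks by word overlap with the question purely to illustrate how retrieved chunks end up in the model's context:

```python
def build_rag_prompt(question, chunks, top_k=5):
    """Rank chunks by word overlap with the question (a toy stand-in for
    embedding similarity) and prepend the top-k to the model context."""
    q_words = set(question.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    context = "\n\n".join(scored[:top_k])
    return f"Use the following context to answer.\n\n{context}\n\nQuestion: {question}"
```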
Widget embedding
Every BonBon agent gets an embeddable chat widget. Add it to any website with a single script tag:
<!-- Add to your website -->
<script
  src="https://getbonito.com/widget.js"
  data-agent-id="ag_abc123"
  data-theme="dark"
  data-position="bottom-right"
  async
></script>
You can also call the agent's API endpoint directly: POST /v1/agents/ag_abc123/chat with the same OpenAI-compatible format.
Bonobot Orchestrator
Bonobot is Bonito's multi-agent orchestration layer. It acts as a front-door agent that classifies user intent, delegates to specialized sub-agents, and synthesizes their responses into a unified reply. Think of it as a dispatcher that routes conversations to the right expert.
How it works
1. A user sends a message to the Bonobot endpoint.
2. The orchestrator classifies the user's intent using a fast classification model.
3. Based on the intent, it delegates the request to one or more specialized BonBon agents.
4. Each sub-agent processes its part using its own system prompt, model, and knowledge base.
5. Bonobot synthesizes the responses and returns a single, coherent answer.
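The delegation step can be sketched as a lookup against the delegation map, with the "general" entry acting as the catch-all. This mirrors the single-intent case; multi-delegation (covered under Response synthesis) would return several agents:

```python
def delegate(intent, delegation_map):
    """Return the agent ID for a classified intent, falling back to the
    'general' entry when no specific intent matches."""
    entry = delegation_map.get(intent) or delegation_map.get("general")
    if entry is None:
        raise KeyError(f"no agent for intent {intent!r} and no 'general' fallback")
    return entry["agent_id"]
```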
Delegation map
The delegation map defines which sub-agents handle which intents. Configure it as a JSON mapping of intent patterns to agent IDs:
{
"delegation_map": {
"billing": {
"agent_id": "ag_billing01",
"description": "Handles billing, invoices, and payment questions"
},
"technical_support": {
"agent_id": "ag_techsup01",
"description": "Handles technical issues, bugs, and troubleshooting"
},
"sales": {
"agent_id": "ag_sales01",
"description": "Handles pricing, demos, and feature inquiries"
},
"general": {
"agent_id": "ag_general01",
"description": "Fallback for anything that doesn't match a specific intent"
}
}
}
Creating a Bonobot
# Create an orchestrator
bonito bonobot create \
  --name "Customer Hub" \
  --classifier-model "anthropic.claude-3-haiku-20240307-v1:0" \
  --delegation-map ./delegation.json

# Update the delegation map
bonito bonobot update --id bot_abc123 \
  --delegation-map ./updated-delegation.json

# Test intent classification
bonito bonobot classify --id bot_abc123 \
  --message "I need a refund for my last invoice"
Response synthesis
When a request touches multiple intents (e.g., "I want to upgrade my plan and fix a bug"), Bonobot can delegate to multiple agents in parallel and merge their responses. Enable multi-delegation in the orchestrator settings:
{
"multi_delegation": true,
"synthesis_model": "anthropic.claude-3-sonnet-20240229-v1:0",
"synthesis_prompt": "Combine the following specialist responses into a single, coherent answer."
}
Code Review
Bonito's GitHub App provides AI-powered code review on pull requests. Install it from the Bonito dashboard, connect your GitHub repos, and get automatic reviews on every PR.
What it does
- Automatically reviews PRs for security vulnerabilities, performance issues, and code quality.
- Posts structured findings directly as PR comments with clear explanations and suggestions.
- Multiple review personas available — default professional tone, or fun characters for a lighter touch.
- Configure which repos and branches trigger automatic reviews from the dashboard.
Getting started
1. Go to Code Review in the Bonito dashboard.
2. Click Install GitHub App and authorize access to your repositories.
3. Select which repos and branches should trigger automatic reviews.
4. Open a pull request — Bonito reviews it and posts comments within minutes.
MCP Integration
MCP (Model Context Protocol) lets your BonBon agents call external tools — databases, APIs, code execution, file systems, and more. Register MCP servers with Bonito and connect them to your agents so they can take actions, not just answer questions.
Supported transports
SSE (Server-Sent Events)
Connect to remote MCP servers over HTTP. Best for hosted tools, third-party integrations, and production deployments.
stdio (Standard I/O)
Run MCP servers as local subprocesses. Best for development, local tools, and self-hosted servers.
Registering an MCP server
{
"name": "github-tools",
"transport": "sse",
"url": "https://mcp.example.com/github/sse",
"headers": {
"Authorization": "Bearer ghp_xxxxxxxxxxxx"
},
"tools": ["create_issue", "search_repos", "get_pull_request"]
}
# Register via CLI
bonito mcp register \
--name "github-tools" \
--transport sse \
--url "https://mcp.example.com/github/sse"
# List registered MCP servers
bonito mcp list
# Test a tool call
bonito mcp test --server "github-tools" --tool "search_repos" \
--params '{"query": "bonito"}'
stdio configuration
For stdio-based MCP servers, provide the command and arguments to launch the server process:
{
"name": "sqlite-tools",
"transport": "stdio",
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-sqlite", "./data/mydb.sqlite"],
"env": {
"NODE_ENV": "production"
}
}
Connecting MCP to agents
Once registered, attach MCP servers to your BonBon agents. The agent's model will automatically see the available tools and can call them during conversations:
# Connect MCP server to an agent
bonito agents update --id ag_abc123 \
  --mcp-server "github-tools" \
  --mcp-server "sqlite-tools"

# Remove an MCP server from an agent
bonito agents update --id ag_abc123 \
  --remove-mcp-server "sqlite-tools"
Knowledge Bases
Knowledge Bases power RAG (Retrieval-Augmented Generation) for your BonBon agents. Upload documents, and Bonito chunks, embeds, and indexes them so your agents can retrieve relevant context at query time.
Creating a knowledge base
1. Go to Knowledge Bases → Create in the dashboard, or use the CLI.
2. Give it a name and optional description.
3. Choose an embedding model (defaults to a high-quality model on your connected providers).
4. Configure chunking strategy (size, overlap).
5. Upload your documents.
# Create a knowledge base
bonito kb create --name "Product Docs" \
  --embedding-model "amazon.titan-embed-text-v2:0" \
  --chunk-size 512 \
  --chunk-overlap 50

# Upload documents
bonito kb upload --id kb_xyz789 ./docs/*.pdf
bonito kb upload --id kb_xyz789 ./faq.md
bonito kb upload --id kb_xyz789 https://example.com/api-reference.html

# Check indexing status
bonito kb status --id kb_xyz789
Supported file formats
Bonito accepts PDF, Markdown, plain text, HTML, DOCX, and CSV files. Each file is parsed, split into chunks, and embedded using your chosen embedding model.
Chunking strategies
| Strategy | Description | Best For |
|---|---|---|
| fixed | Split by token count with configurable overlap | General purpose, predictable chunk sizes |
| semantic | Split at natural boundaries (paragraphs, sections) | Long-form documents, articles |
| sentence | Split by sentence with grouping | FAQ, short-form content |
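The fixed strategy is easy to picture: windows of `chunk-size` tokens, each overlapping the previous by `chunk-overlap` tokens. A minimal sketch (the defaults mirror the CLI example above; Bonito's real chunker operates on parsed documents, not raw token lists):

```python
def chunk_fixed(tokens, size=512, overlap=50):
    """Split a token list into fixed-size windows with the given overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # how far each new window advances
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]
```

With the defaults, a 1,200-token document yields three chunks, and the last 50 tokens of each chunk reappear at the start of the next so no sentence is cut off without context.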
Embedding models
Bonito uses embedding models from your connected providers. Any enabled embedding model can be used:
- AWS Bedrock: amazon.titan-embed-text-v2:0, cohere.embed-english-v3
- Azure OpenAI: text-embedding-ada-002, text-embedding-3-small, text-embedding-3-large
- GCP Vertex AI: textembedding-gecko, text-embedding-004
Querying
You can query a knowledge base directly to test retrieval before connecting it to an agent:
# Query a knowledge base directly
bonito kb query --id kb_xyz789 \
  --query "What is the refund policy?" \
  --top-k 5

# Connect KB to an agent (see BonBon Agents section)
bonito agents update --id ag_abc123 \
  --knowledge-base kb_xyz789
Managed Inference
Managed Inference gives you zero-config access to AI models without connecting any cloud provider. No API keys, no cloud accounts, no setup — just start making requests. Bonito handles provider selection, routing, and billing.
How it works
1. Sign up for Bonito and create a gateway API key — that's it.
2. Use any supported model by name in your API requests.
3. Bonito routes your request to the optimal provider automatically.
4. You're billed through Bonito based on token usage — no separate cloud bills.
from openai import OpenAI
# No cloud provider setup needed — just your Bonito key
client = OpenAI(
base_url="https://getbonito.com/v1",
api_key="YOUR_BONITO_API_KEY"
)
# Use any supported model
response = client.chat.completions.create(
model="claude-3-sonnet", # Bonito routes to the best provider
messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response.choices[0].message.content)
Supported models
Managed Inference supports a curated set of popular models across providers. Use simplified model names — Bonito resolves them to provider-specific IDs:
| Model | Provider | Use Case |
|---|---|---|
| claude-3-sonnet | Anthropic | General purpose, balanced performance |
| claude-3-haiku | Anthropic | Fast, cost-effective tasks |
| gpt-4o | OpenAI | Multimodal, high performance |
| gpt-4o-mini | OpenAI | Lightweight, budget-friendly |
| gemini-1.5-pro | Google | Long context, multimodal |
When to use Managed vs BYOC
Managed Inference
- Quick start — no cloud setup
- Don't want to manage API keys
- Prototyping and development
- Teams without cloud accounts
Bring Your Own Cloud
- Data residency requirements
- Existing cloud commitments/discounts
- Full provider control
- Enterprise compliance needs
Triggers
Triggers let you invoke BonBon agents automatically based on events — incoming webhooks, cron schedules, slash commands, or custom events. Instead of waiting for a user to open a chat widget, triggers bring your agents into workflows programmatically.
Trigger types
Webhook
HTTP endpoint that invokes your agent when called. Connect to GitHub, Stripe, Slack, or any service that sends webhooks. The request payload is passed as context to the agent.
Scheduled (Cron)
Run your agent on a schedule using cron expressions. Great for daily reports, periodic data processing, health checks, and recurring tasks.
Slash Command
Register slash commands in Slack or Discord that invoke your agent. Users type /ask-support how do I reset my password? and get an agent response inline.
Event
Trigger agents from internal Bonito events — new document indexed in a KB, agent error threshold exceeded, or cost alert fired.
Creating triggers
# Create a webhook trigger
bonito triggers create \
  --agent-id ag_abc123 \
  --type webhook \
  --name "GitHub PR Review"
# Output: Webhook URL → https://getbonito.com/hooks/tr_wh_abc123

# Create a scheduled trigger (daily at 9 AM UTC)
bonito triggers create \
  --agent-id ag_abc123 \
  --type cron \
  --schedule "0 9 * * *" \
  --name "Daily Summary" \
  --input "Generate a summary of yesterday's support tickets"

# Create a slash command trigger
bonito triggers create \
  --agent-id ag_abc123 \
  --type slash-command \
  --platform slack \
  --command "/ask-support" \
  --name "Slack Support"

# List triggers for an agent
bonito triggers list --agent-id ag_abc123
Webhook payload
When a webhook trigger fires, the HTTP request body is passed to the agent as context. You can define a template to extract specific fields:
{
"trigger_id": "tr_wh_abc123",
"payload_template": "New {{event}} from {{repository.full_name}}: {{pull_request.title}}",
"headers_to_forward": ["X-GitHub-Event"],
"secret": "whsec_xxxxxxxx"
}
Set the secret field to verify webhook signatures. Bonito validates the HMAC signature on incoming requests and rejects unverified payloads.
Observability
Bonito provides built-in observability for every request flowing through the gateway and every agent interaction. Track tokens, latency, costs, and errors across your entire AI stack without any additional tooling.
Request tracing
Every API request gets a unique trace ID. View the full lifecycle of a request — from gateway receipt through provider routing to response delivery:
# View recent requests
bonito logs list --limit 20

# Get details for a specific trace
bonito logs get --trace-id tr_xxxxxxxxxxxx

# Filter by model, status, or time range
bonito logs list \
  --model "claude-3-sonnet" \
  --status error \
  --since "2024-01-15T00:00:00Z"
What's tracked
| Metric | Description |
|---|---|
| Tokens (in/out) | Input and output token counts per request |
| Latency | Time to first token (TTFT) and total response time |
| Cost | Estimated cost per request based on provider pricing |
| Model | Which model and provider served the request |
| Status | Success, error, rate-limited, or timed-out |
| Agent | Which BonBon agent handled the request (if applicable) |
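The per-request cost estimate in the table above is a function of token counts and per-token prices. A sketch of the arithmetic — the price table here is illustrative, not Bonito's live pricing data:

```python
# Hypothetical per-1K-token prices (input, output) in USD, for illustration only
PRICES_PER_1K = {"claude-3-sonnet": (0.003, 0.015)}

def estimate_cost(model, tokens_in, tokens_out):
    """Estimate request cost from token counts and per-1K-token prices."""
    p_in, p_out = PRICES_PER_1K[model]
    return tokens_in / 1000 * p_in + tokens_out / 1000 * p_out
```

At these example rates, a request with 1,000 input and 1,000 output tokens would be estimated at $0.018.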
Per-agent analytics
Each BonBon agent has its own analytics dashboard showing conversation volume, average response time, token usage, cost breakdown, and error rates over time.
# Get agent analytics
bonito agents analytics --id ag_abc123 \
  --period 7d

# Export analytics as CSV
bonito agents analytics --id ag_abc123 \
  --period 30d \
  --format csv > agent-analytics.csv
Cost monitoring
Observability integrates with Cost Management to show real-time spend. Set budget alerts per agent, per model, or globally:
# Set a per-agent budget alert
bonito alerts create \
  --scope agent \
  --agent-id ag_abc123 \
  --threshold 500 \
  --period monthly \
  --notify email,in-app

# Set a global daily spend alert
bonito alerts create \
  --scope global \
  --threshold 100 \
  --period daily
Troubleshooting
"Connected! Found 0 models"
Your credentials connected successfully, but model listing failed silently. Check:
- Azure: Make sure your Endpoint URL is an Azure OpenAI resource endpoint with a custom subdomain, not a generic regional endpoint. Also verify the resource group is correct.
- GCP: Ensure the Vertex AI API is enabled in your project.
- AWS: Verify your IAM user has Bedrock permissions and you're in a region where Bedrock is available.
Models showing a 🔒 lock icon
The model exists in your provider's catalog but isn't enabled. Click the Enable button in Bonito, or enable it directly in your cloud console (AWS Bedrock → Model Access, Azure → create a deployment, GCP → enable the Vertex AI API).
Playground returns a 500 error
This typically means the model isn't a chat model (embedding or completion-only models can't be used in the Playground), the model isn't enabled in your account, or the model isn't available in your region.
Rate limit or timeout errors
Wait 30–60 seconds and try again. This can happen when making many rapid changes. If 502 errors persist, the backend service may need attention — contact support@getbonito.com.
Need more help?
Can't find what you're looking for? Our team is here to help.