Agents based on Large Language Models (LLMs) can behave in non-deterministic ways, and ensuring that they behave safely and as intended is a significant
challenge. Unlike traditional software, where inputs and outputs can be tightly controlled to guarantee safe and predictable behavior, LLM-based agents
process free-form instructions. This is especially true when the model is given "uncontrolled" instructions: users can, intentionally or unintentionally,
provide inputs that lead to harmful, biased, or otherwise undesirable outputs. The resulting risks range from generating inappropriate content to taking
unintended actions that could compromise security or cause financial loss.
The first category of risk concerns the content the AI generates:
- Leakage of sensitive information the model was trained on, such as PII or proprietary business secrets.
- Harmful material such as hate speech, biased language, or explicit content.
- Misinformation due to "hallucinations" where the model confidently states false information.
- Unprofessional or off-brand responses that damage a company’s reputation.
The second category of risk is specific to agents that can take actions, for example through tool calls, API requests, or system interactions:
- Irreversible and destructive actions, such as permanent data deletion or unauthorized purchases.
- Excessive resource consumption leading to unexpected financial costs.
- Attack vectors for security vulnerabilities that could be exploited by malicious actors.
Ask Yourself Whether
- Malicious LLM behavior can impact your organization.
- The LLM was trained with private or sensitive data.
There is a risk if you answered yes to any of those questions.
Recommended Secure Coding Practices
- Insert guardrails to prevent the LLM from generating harmful or unsafe content.
- Use existing guardrail frameworks; do not reinvent the wheel unless necessary.
- Restrict the LLM's capabilities (for example, MCP tool access) and knowledge (for example, RAG data sources) to what the use case requires, as shown in the sketch below.
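To illustrate the last point, the sketch below uses the same agents SDK as the examples that follow to give an agent a single, read-only capability instead of broad system access. The lookup_order_status tool and its behavior are hypothetical placeholders, not part of any real API.
from agents import Agent, function_tool

@function_tool
def lookup_order_status(order_id: str) -> str:
    """Read-only lookup of an order's shipping status."""
    # Hypothetical helper: would query an internal, read-only API.
    return f"Order {order_id}: status unknown"

# The agent is given only the capability it strictly needs; no write,
# delete, or payment tools are exposed.
support_agent = Agent(
    name="Support Agent",
    instructions="Answer order-status questions using only the provided tool.",
    tools=[lookup_order_status],
)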
Sensitive Code Example
In the following code example, a primary "triage" agent simply dispatches instructions to secondary agents. No guardrail ensures that unintended
user instructions are caught and handled properly.
from agents import (
    Agent,
)

secondary_agent_1 = Agent(...)
secondary_agent_2 = Agent(...)

# No input guardrail: every user instruction is handed off to the
# secondary agents as-is.
triage_agent = Agent(
    name="Triage Agent",
    instructions="You determine which agent to use based on the user's question",
    handoffs=[secondary_agent_1, secondary_agent_2],
)
Compliant Solution
This compliant solution adds a guardrail to the triage agent. It ensures that user instructions are properly vetted before being passed to
secondary agents, and that only appropriate instructions are passed along.
Depending on your use case, you may want to implement the guardrail function as an output guardrail instead of an input guardrail (a sketch of that variant follows the code below).
from agents import (
    Agent,
    InputGuardrail,
    GuardrailFunctionOutput,
    Runner,
)
from pydantic import BaseModel

secondary_agent_1 = Agent(...)
secondary_agent_2 = Agent(...)

# Structured verdict returned by the guardrail agent.
class ExampleOutput(BaseModel):
    is_example: bool
    reasoning: str

# Dedicated agent whose only job is to classify the incoming request.
guardrail_agent = Agent(
    name="Guardrail check",
    instructions="Check if the user is asking about the Example Topic.",
    output_type=ExampleOutput,
)

async def example_guardrail(ctx, agent, input_data):
    result = await Runner.run(guardrail_agent, input_data, context=ctx.context)
    final_output = result.final_output_as(ExampleOutput)
    return GuardrailFunctionOutput(
        output_info=final_output,
        # The tripwire fires when the request is not about the expected topic.
        tripwire_triggered=not final_output.is_example,
    )

triage_agent = Agent(
    name="Triage Agent",
    instructions="You determine which agent to use based on the user's question",
    handoffs=[secondary_agent_1, secondary_agent_2],
    input_guardrails=[
        InputGuardrail(guardrail_function=example_guardrail),
    ],
)
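As mentioned above, the same check can also run on the triage agent's final answer rather than on the user's input. The sketch below is one possible output-guardrail variant: it reuses the guardrail_agent and ExampleOutput defined above and assumes the SDK's OutputGuardrail mirrors InputGuardrail, so treat it as an illustration rather than a drop-in snippet.
from agents import Agent, GuardrailFunctionOutput, OutputGuardrail, Runner

async def example_output_guardrail(ctx, agent, output):
    # Vet the agent's final answer with the same guardrail agent as above.
    # Assumes the triage agent produces a plain-text answer.
    result = await Runner.run(guardrail_agent, output, context=ctx.context)
    final_output = result.final_output_as(ExampleOutput)
    return GuardrailFunctionOutput(
        output_info=final_output,
        tripwire_triggered=not final_output.is_example,
    )

triage_agent = Agent(
    name="Triage Agent",
    instructions="You determine which agent to use based on the user's question",
    handoffs=[secondary_agent_1, secondary_agent_2],
    output_guardrails=[
        OutputGuardrail(guardrail_function=example_output_guardrail),
    ],
)
In either variant, a triggered tripwire typically aborts the run with a guardrail tripwire exception, which the calling code should catch so it can return a safe refusal instead of the unvetted result.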
See