A lightweight service for detecting prompt injection attacks and jailbreak attempts in AI agent interactions. PromptGuard uses the Llama Prompt Guard 2 model to classify user input as "safe" or "unsafe" before processing.
PromptGuard is the first layer of defense in a defense-in-depth security strategy for AI agents. It specifically targets:
- Prompt injection attacks: Attempts to manipulate the AI system through crafted inputs
- Jailbreak attempts: Techniques to bypass safety restrictions
- Malicious user input: Content designed to exploit the AI system
The service provides an OpenAI-compatible API endpoint (/v1/chat/completions) that:
- Extracts the user message from the conversation
- Runs inference using the Llama Prompt Guard 2 model
- Returns a classification: "safe" or "unsafe"
- Provides confidence scores for monitoring
The service is designed to be lightweight (an 86M-parameter model) and can run on CPU, making it suitable for production deployments.
```bash
# Build the PromptGuard service image
make build-promptguard-image

# Push to registry
make push-promptguard-image
```

PromptGuard requires the following environment variables:

```bash
export PROMPTGUARD_ENABLED=true         # Required Makefile variable to enable PromptGuard
export HF_TOKEN=your-huggingface-token  # Required for gated models
```

Optional environment variables:
- `PROMPTGUARD_MODEL_ID`: Model identifier (defaults to `meta-llama/Llama-Prompt-Guard-2-86M`)
Configure PromptGuard in `helm/values.yaml`:

```yaml
promptGuard:
  enabled: true
  replicas: 1
  modelId: "meta-llama/Llama-Prompt-Guard-2-86M"
  huggingfaceToken: ""  # Required for gated models
  resources:
    limits:
      cpu: "2"
      memory: 2Gi
    requests:
      cpu: "1"
      memory: 1Gi
```
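With these values in place, redeploying the chart picks up the PromptGuard service. A minimal sketch, assuming the chart lives in `helm/` and the release is named `self-service-agent` (adjust both for your deployment):

```bash
# Deploy or upgrade with PromptGuard enabled; pass the HF token for gated models
helm upgrade --install self-service-agent ./helm \
  --namespace <namespace> \
  --set promptGuard.enabled=true \
  --set promptGuard.huggingfaceToken="$HF_TOKEN"
```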
PromptGuard is automatically integrated when configured in agent YAML files:

```yaml
# agent-service/config/agents/your-agent.yaml
input_shields:
  - "meta-llama/Llama-Prompt-Guard-2-86M"  # First layer: attack detection
  - "meta-llama/Llama-Guard-3-8B"          # Second layer: content safety
```

The agent service will route shield requests to the PromptGuard service at:
http://self-service-agent-promptguard.<namespace>.svc.cluster.local:8000/v1
- `GET /health`: Health check endpoint
- `GET /v1/models`: List available models (OpenAI-compatible)
- `POST /v1/chat/completions`: Classify user input as safe/unsafe
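To verify a running deployment, the health and model-listing endpoints can be probed from inside the cluster (a sketch; substitute your namespace):

```bash
# Liveness check
curl -s http://self-service-agent-promptguard.<namespace>.svc.cluster.local:8000/health

# List the models the service exposes (OpenAI-compatible)
curl -s http://self-service-agent-promptguard.<namespace>.svc.cluster.local:8000/v1/models
```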