Bedrock invoke_agent streaming starts only after full response generation (delayed first chunk) #4744
Describe the bug
When invoking a Bedrock agent with streaming enabled, the API call blocks until the agent finishes generating the full response. Only after generation completes does the first chunk event arrive, after which the remaining chunks are delivered in a rapid burst.
This defeats the purpose of streaming, because users receive no partial output while the response is being generated.
Environment
SDK: boto3
Service: Amazon Bedrock Agent Runtime
Python: 3.12
boto3: 1.42.59
Region: us-east-1
Minimal Reproducible Code
import boto3
import time

client = boto3.client(
    "bedrock-agent-runtime",
    region_name="us-east-1",
    aws_access_key_id="...",
    aws_secret_access_key="...",
)

start = time.time()
print("Invoking agent...")
response = client.invoke_agent(
    agentId="HBULA1EYN8",
    agentAliasId="92E64FKIFG",
    sessionId="test-session",
    enableTrace=True,
    sessionState={"files": []},
    inputText="Explain quantum computing in simple terms",
    streamingConfigurations={"streamFinalResponse": True},
)
print(f"Time taken to receive first response: {time.time() - start:.2f}s")

for event in response["completion"]:
    print(f"{time.time() - start:.2f}s -> {event.keys()}")
Note: I have added the bedrock:InvokeModelWithResponseStream permission, as called for in the invoke_agent documentation (https://docs.aws.amazon.com/boto3/latest/reference/services/bedrock-agent-runtime/client/invoke_agent.html).
Observed Output
Invoking agent...
Time taken to receive first response: 1.13s
1.33s -> dict_keys(['trace'])
7.18s -> dict_keys(['chunk'])
7.19s -> dict_keys(['chunk'])
7.19s -> dict_keys(['chunk'])
...
Key observations:
- trace events arrive early (~1.3 s)
- chunk events begin only after ~7 s
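To make this observation measurable on any run, a small helper like the one below (hypothetical name, standard library only) records the elapsed time until the first event of each type appears. It works on any iterable of completion events; here it is shown against a simulated stream that mimics the reported behavior:

```python
import time

def first_event_times(completion):
    """Record elapsed seconds until the first event of each type appears."""
    start = time.time()
    first_seen = {}
    for event in completion:
        for key in event:
            # setdefault keeps only the first occurrence of each event type
            first_seen.setdefault(key, time.time() - start)
    return first_seen

def simulated_completion():
    """Fake event stream: an early trace, then chunks after a pause
    (the pause stands in for the ~6 s generation delay)."""
    yield {"trace": {}}
    time.sleep(0.2)
    yield {"chunk": {"bytes": b"partial "}}
    yield {"chunk": {"bytes": b"text"}}

times = first_event_times(simulated_completion())
print(sorted(times))  # ['chunk', 'trace']
```

Run against the real `response["completion"]` stream, this would show the gap between the first trace event and the first chunk event directly.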
Expected Behavior
Streaming should begin as soon as the model starts generating tokens, for example:
0.8s -> dict_keys(['chunk'])
0.9s -> dict_keys(['chunk'])
1.0s -> dict_keys(['chunk'])
Actual Behavior
Agent executes for several seconds
↓
Full response generated internally
↓
Only then streaming of chunks begins
This results in a perceived delay and defeats the purpose of using a streaming API for interactive applications.
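For context on why the burst delivery is so noticeable: each chunk event carries the response text as raw bytes, so an interactive consumer typically decodes and displays fragments as they arrive. A minimal sketch of such a consumer (the function name is hypothetical; the `chunk`/`bytes` event shape matches the invoke_agent response), demonstrated on a hand-built event list:

```python
def stream_text(completion):
    """Yield decoded text fragments as chunk events arrive,
    skipping trace and any other event types."""
    for event in completion:
        chunk = event.get("chunk")
        if chunk and "bytes" in chunk:
            yield chunk["bytes"].decode("utf-8")

# With the current behavior, all of these fragments arrive in one burst
# after generation completes; with true streaming they would trickle in.
events = [
    {"trace": {}},
    {"chunk": {"bytes": b"Quantum computing "}},
    {"chunk": {"bytes": b"uses qubits."}},
]
print("".join(stream_text(events)))  # Quantum computing uses qubits.
```

With the current behavior, a consumer like this still renders nothing until the full response has been generated, which is exactly the problem for interactive use.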
Regression Issue
- Select this option if this issue appears to be a regression.
Expected Behavior
See the expected behavior described above.
Current Behavior
See the actual behavior described above.
Reproduction Steps
Run the minimal reproducible code shown above under "Describe the bug".
Possible Solution
No response
Additional Information/Context
No response
SDK version used
1.42.59
Environment details (OS name and version, etc.)
macOS