
Bedrock invoke_agent streaming starts only after full response generation (delayed first chunk) #4744

@pradeepdev-1995

Description

When invoking a Bedrock agent with streaming enabled, the call blocks until the agent has generated the full response internally. Only then does the first chunk event arrive, after which the remaining chunks are delivered in quick succession.

This defeats the purpose of streaming because users do not receive any partial output during generation.

Environment

- SDK: boto3
- Service: Amazon Bedrock Agent Runtime
- Python: 3.12
- boto3: 1.42.59
- Region: us-east-1

Minimal Reproducible Code

import boto3
import time

client = boto3.client(
    "bedrock-agent-runtime",
    region_name="us-east-1",
    aws_access_key_id="...",
    aws_secret_access_key="..."
)

start = time.time()
print("Invoking agent...")

response = client.invoke_agent(
    agentId="HBULA1EYN8",
    agentAliasId="92E64FKIFG",
    sessionId="test-session",
    enableTrace=True,
    sessionState={'files': []},
    inputText="Explain quantum computing in simple terms",
    streamingConfigurations={
        "streamFinalResponse": True
    }
)

print(f"Time until invoke_agent returned: {time.time() - start:.2f}s")

for event in response["completion"]:
    print(f"{time.time() - start:.2f}s -> {event.keys()}")

Note: I have already granted the bedrock:InvokeModelWithResponseStream permission, as noted in the invoke_agent documentation (https://docs.aws.amazon.com/boto3/latest/reference/services/bedrock-agent-runtime/client/invoke_agent.html).
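For reference, the chunk payloads can be decoded into text as they arrive. A minimal sketch, assuming each chunk event carries raw UTF-8 bytes under `chunk.bytes` (the documented invoke_agent event shape); the `decode_completion` helper is my own, not part of boto3:

```python
def decode_completion(events):
    """Collect decoded text from 'chunk' events in an invoke_agent
    completion stream, ignoring 'trace' and any other event types."""
    parts = []
    for event in events:
        if "chunk" in event:
            # Each chunk event carries raw UTF-8 bytes of generated text.
            parts.append(event["chunk"]["bytes"].decode("utf-8"))
    return "".join(parts)

# With a live call this would be: decode_completion(response["completion"])
```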

Observed Output

Invoking agent...
Time taken to receive first response: 1.13s
1.33s -> dict_keys(['trace'])
7.18s -> dict_keys(['chunk'])
7.19s -> dict_keys(['chunk'])
7.19s -> dict_keys(['chunk'])

...

Key observation:

- trace events arrive early (~1.3s)
- chunk events begin only after ~7 seconds
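To make the delay measurable rather than eyeballed from log lines, the per-event timestamps can be reduced to a first-arrival time per event type. A small sketch; the `first_arrivals` and `timed` helpers are hypothetical, not part of boto3:

```python
import time


def first_arrivals(timed_events):
    """Given (elapsed_seconds, event_dict) pairs, return the first
    arrival time for each event type ('trace', 'chunk', ...)."""
    first = {}
    for elapsed, event in timed_events:
        for key in event:
            # Record only the first time each event type is seen.
            first.setdefault(key, elapsed)
    return first


def timed(completion, start):
    """Wrap a completion iterator, yielding (elapsed, event) pairs."""
    for event in completion:
        yield time.time() - start, event

# Live usage: first_arrivals(timed(response["completion"], start))
```

On the output above this would report roughly {'trace': 1.33, 'chunk': 7.18}, making the ~6-second gap between the first trace and the first chunk explicit.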
Expected Behavior

Streaming should begin as soon as the model starts generating tokens, for example:

0.8s -> dict_keys(['chunk'])
0.9s -> dict_keys(['chunk'])
1.0s -> dict_keys(['chunk'])

Actual Behavior

Agent executes for several seconds
↓
Full response generated internally
↓
Only then streaming of chunks begins

This results in a perceived delay and defeats the purpose of using a streaming API for interactive applications.



Possible Solution

No response

Additional Information/Context

No response

SDK version used

1.42.59

Environment details (OS name and version, etc.)

macOS


Labels

bedrock · bug (This issue is a confirmed bug.) · p3 (This is a minor priority issue.) · service-api (This issue is caused by the service API, not the SDK implementation.)
