
Top 3 Critical Gaps: Concurrency Safety, Unbounded Memory Growth, Resource Lifecycle #1365

@MervinPraison

Description


Deep Architecture Analysis — Top 3 Critical Gaps

A comprehensive analysis of the PraisonAI codebase (core SDK, wrapper, LLM integration, process orchestration, memory subsystem, session management) identified three systemic gaps that impact production reliability. These are not cosmetic — each directly undermines the project's stated principles of being multi-agent safe, async-safe, and production-ready.


Gap 1: Concurrency & Async Safety Gaps in Multi-Agent Workflows

Severity: CRITICAL | Principle violated: "Multi-agent + async safe by default"

PraisonAI's core value proposition is multi-agent orchestration, but several synchronization primitives have race conditions that can cause data corruption, deadlocks, and silent failures under concurrent load.

1a. DualLock creates different Lock instances per event loop — defeats synchronization

File: src/praisonai-agents/praisonaiagents/agent/async_safety.py (lines 42-57)

def _get_async_lock(self) -> asyncio.Lock:
    current_loop = asyncio.get_running_loop()
    current_loop_id = id(current_loop)
    if self._loop_id != current_loop_id:
        self._async_lock = asyncio.Lock()  # RACE: two coroutines create different locks
        self._loop_id = current_loop_id

Problem: The comparison and assignment are not atomic. Two coroutines entering this block concurrently each create their own asyncio.Lock() instance, so they end up synchronizing on different locks. As a result, AsyncSafeState (used for chat history and agent state) is not actually safe.

Impact: Chat history corruption, lost state updates in concurrent agent execution.
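
A minimal sketch of the recommended fix: create the asyncio.Lock once in __init__ so every coroutine synchronizes on the same instance. Class and method names below are illustrative, not the actual PraisonAI API, and the sketch assumes the state object is used from a single event loop on Python 3.10+ (where asyncio.Lock no longer binds to a loop at construction time):

```python
import asyncio

class EagerLockState:
    """Sketch: one asyncio.Lock created eagerly in __init__, shared by
    all coroutines, instead of a lazily created per-loop lock."""
    def __init__(self, initial):
        self._value = initial
        self._lock = asyncio.Lock()  # created once, never replaced

    async def update(self, fn):
        async with self._lock:  # every coroutine contends on the SAME lock
            self._value = fn(self._value)
            return self._value

async def main():
    state = EagerLockState(0)
    # 100 concurrent increments; with a single shared lock none are lost
    await asyncio.gather(*(state.update(lambda v: v + 1) for _ in range(100)))
    return state._value

result = asyncio.run(main())
print(result)  # 100
```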

1b. Lazy async lock initialization race in Process orchestration

File: src/praisonai-agents/praisonaiagents/process/process.py (lines 599-602)

if self._state_lock is None:
    self._state_lock = asyncio.Lock()  # NOT atomic — two coroutines create separate locks
async with self._state_lock:

Problem: Same pattern — self._state_lock initialized to None at line 48, then lazily created without protection. Two concurrent workflow steps can each create their own lock.

Impact: Task status updates race against each other. Only lines 599-623 use the lock; other status mutations at lines 519-547 are completely unprotected.

1c. Nested event loop creation causes deadlocks

File: src/praisonai-agents/praisonaiagents/agent/execution_mixin.py (lines 332-352)

# When called from async context, creates NEW event loop in thread
loop = asyncio.new_event_loop()
loop.run_until_complete(coro)  # Can deadlock

And at line 441:

results = asyncio.run(collect_all())  # Fails if already in async context

Problem: _execute_backend_sync() and _delegate_streaming_to_backend() use asyncio.run() or new_event_loop().run_until_complete() which fail or deadlock when called from within an existing async context.
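
A context-aware helper avoids both failure modes: call asyncio.run() when no loop is running, and offload to a fresh loop in a worker thread when one is. This is a sketch only, not the existing PraisonAI code:

```python
import asyncio
import concurrent.futures

def run_coro_sync(coro):
    """Sketch: run a coroutine from synchronous code without deadlocking.

    If an event loop is already running in this thread, asyncio.run()
    would raise and run_until_complete() on a new loop would block the
    running loop, so we offload to a worker thread instead.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop running in this thread: asyncio.run() is safe
        return asyncio.run(coro)
    # A loop IS running: run the coroutine on its own loop in a worker thread
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()

async def answer():
    await asyncio.sleep(0)
    return 42

print(run_coro_sync(answer()))  # 42, called from a sync context
```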

1d. Session state read-modify-write race conditions

File: src/praisonai-agents/praisonaiagents/session.py (lines 385-399)

def increment_state(self, key, increment=1, default=0):
    current_value = self.get_state(key, default)  # Thread A reads 5
    self.set_state(key, current_value + increment)  # Thread A writes 6
    # Thread B also read 5, writes 6 — increment lost

Problem: Classic TOCTOU (time-of-check-time-of-use). No locking around read-modify-write. Same issue exists for set_state() and the _agents dict (lines 103, 201, 325).
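
The RLock fix from the recommendations can be sketched as follows; SessionState is a hypothetical stand-in for the real Session class:

```python
import threading

class SessionState:
    """Sketch: guard read-modify-write with an RLock so concurrent
    increments are never lost. Illustrative, not the real API."""
    def __init__(self):
        self._state = {}
        self._lock = threading.RLock()

    def increment_state(self, key, increment=1, default=0):
        with self._lock:  # read and write happen atomically
            value = self._state.get(key, default) + increment
            self._state[key] = value
            return value

s = SessionState()
threads = [
    threading.Thread(target=lambda: [s.increment_state("n") for _ in range(1000)])
    for _ in range(8)
]
for t in threads: t.start()
for t in threads: t.join()
print(s._state["n"])  # 8000: no lost updates
```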

1e. Process task retry counter non-atomic increments

File: src/praisonai-agents/praisonaiagents/process/process.py (lines 46, 270-297)

self.task_retry_counter[task.name] += 1  # Non-atomic read-modify-write

Impact: In concurrent task execution, retry count increments are silently lost, so some retries are skipped entirely.

1f. Tool execution injection context lost in executor

File: src/praisonai-agents/praisonaiagents/agent/tool_execution.py (lines 193-205)

with with_injection_context(state):
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(self._execute_tool_impl, ...)
        # injection context exits BEFORE executor thread runs

Problem: The with_injection_context context manager exits when the with block ends, but the executor thread hasn't started yet. Tools that depend on injection context get None.
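
The copy_context fix from the recommendations can be sketched like this: snapshot the caller's context before submit() and replay it inside the executor thread. Variable and function names below are illustrative:

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the injection-context state
request_state = contextvars.ContextVar("request_state", default=None)

def tool_impl():
    # Runs in the executor thread; sees the value captured at submit time
    return request_state.get()

def run_tool_with_context(value):
    """Sketch: capture the caller's contextvars and run the tool inside
    that snapshot, so the context survives crossing the thread boundary."""
    token = request_state.set(value)
    try:
        ctx = contextvars.copy_context()  # snapshot BEFORE the thread runs
        with ThreadPoolExecutor(max_workers=1) as pool:
            return pool.submit(ctx.run, tool_impl).result()
    finally:
        request_state.reset(token)

print(run_tool_with_context({"user": "alice"}))  # {'user': 'alice'}
```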

Recommended fix direction

  • Initialize asyncio.Lock() in __init__, not lazily
  • Use contextvars for async-safe state instead of DualLock
  • Add threading.RLock around all session state mutations
  • Use copy_context().run() for executor-submitted callables to preserve context

Gap 2: Unbounded Memory & Context Growth — No Automatic Lifecycle Management

Severity: HIGH | Principle violated: "Performance-first", "Production-ready"

The memory, session, and knowledge subsystems have no automatic pruning, expiry, or lifecycle management. For any agent running beyond a demo, resources grow without bound — increasing LLM costs, degrading latency, and eventually causing OOM.

2a. Chat history grows unbounded with no default context management

File: src/praisonai-agents/praisonaiagents/agent/agent.py (line 1581)

self.__chat_history_state = AsyncSafeState([])  # Grows forever

File: src/praisonai-agents/praisonaiagents/agent/memory_mixin.py (lines 141-210)

The prune_history() method exists but is manual-only — never called automatically. There is a ContextManager with strategies (SLIDING_WINDOW, SUMMARIZE, TRUNCATE) at context/manager.py, but it's disabled by default (context=None in Agent init).

Impact: Every LLM call includes the entire conversation history; for a 100-turn conversation, that's 100+ messages sent with every API call. Costs scale linearly with turn count, and the conversation eventually hits provider context limits, triggering emergency truncation (which loses important context).

Recommendation: Default to context="sliding_window" or context="auto_compact" rather than None.
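
For reference, a sliding-window strategy is only a few lines. This sketch keeps the system prompt plus the newest messages; it is illustrative, not the actual ContextManager implementation:

```python
def sliding_window(history, max_messages=20):
    """Sketch of a sliding-window strategy: always retain system
    messages, then fill the remaining budget with the newest turns."""
    if len(history) <= max_messages:
        return history
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    return system + rest[-(max_messages - len(system)):]

history = [{"role": "system", "content": "You are helpful."}]
history += [{"role": "user", "content": f"turn {i}"} for i in range(100)]
pruned = sliding_window(history, max_messages=10)
print(len(pruned))  # 10: the system prompt plus the 9 newest turns
```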

2b. Long-term memory and vector stores grow indefinitely

Files:

  • src/praisonai-agents/praisonaiagents/memory/storage.py (lines 47-104)
  • src/praisonai-agents/praisonaiagents/memory/core.py (lines 77-82)

# Auto-promotion from STM → LTM when quality score ≥ 7.5
if score >= 7.5:
    self.memory_adapter.store_long_term(content, ...)  # Never deleted

Problems:

  • SQLite long_term_memory.db grows without limit
  • ChromaDB collections keep all historical embeddings indefinitely
  • No TTL, no archival, no expiry mechanism exists anywhere in the codebase
  • MongoDB connection pool fixed at 50 with maxIdleTimeMS=30000 may exhaust under load
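
A TTL-based expiry pass could be sketched like this, assuming each stored row carries a created_at timestamp (field and function names are illustrative, not the actual storage schema):

```python
import time

def purge_expired(rows, ttl_seconds, now=None):
    """Sketch: drop long-term memory rows older than the TTL.
    A real fix would issue a DELETE against SQLite/ChromaDB instead."""
    now = now if now is not None else time.time()
    return [r for r in rows if now - r["created_at"] <= ttl_seconds]

rows = [{"id": 1, "created_at": 0}, {"id": 2, "created_at": 95}]
kept = purge_expired(rows, ttl_seconds=10, now=100)
print([r["id"] for r in kept])  # [2]: row 1 is 100s old and expires
```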

2c. Memory adapter dual-persistence doubles storage

File: src/praisonai-agents/praisonaiagents/memory/core.py (lines 58-74)

# Primary adapter
memory_id = self.memory_adapter.store_short_term(content, ...)

# Backward-compat fallback — stores SAME content AGAIN in SQLite
if self._sqlite_adapter != self.memory_adapter:
    fallback_id = self._sqlite_adapter.store_short_term(content, ...)

Impact: Every memory entry stored twice — 2x disk usage, 2x write latency. This backward-compatibility shim should be removed.

2d. Checkpoint list grows unbounded — config exists but is never enforced

File: src/praisonai-agents/praisonaiagents/checkpoints/service.py (lines 67-100)

self.config = CheckpointConfig(max_checkpoints=max_checkpoints)  # Set...
self._checkpoints: List[Checkpoint] = []  # ...but never enforced

The max_checkpoints config parameter is accepted but never checked. Checkpoints accumulate in memory and on disk forever.
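
Enforcing max_checkpoints can be as simple as a bounded deque. A sketch with hypothetical names (a real fix would also delete evicted checkpoints from disk):

```python
from collections import deque

class CheckpointService:
    """Sketch: a deque with maxlen evicts the oldest checkpoint
    automatically once the configured limit is exceeded."""
    def __init__(self, max_checkpoints=5):
        self._checkpoints = deque(maxlen=max_checkpoints)

    def save(self, checkpoint):
        # deque(maxlen=N) silently drops the oldest entry beyond N
        self._checkpoints.append(checkpoint)

svc = CheckpointService(max_checkpoints=3)
for i in range(10):
    svc.save(f"ckpt-{i}")
print(list(svc._checkpoints))  # ['ckpt-7', 'ckpt-8', 'ckpt-9']
```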

2e. No pre-call context length validation

File: src/praisonai-agents/praisonaiagents/llm/llm.py (lines 4380-4441)

_build_completion_params() assembles the request payload without checking whether the total token count exceeds the model's context window before sending the API call. Failure only happens when the provider rejects the request — wasting a round-trip and providing a poor error message.
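
A pre-call validation sketch using a rough characters-per-token heuristic; a real implementation would use the model's tokenizer (e.g. tiktoken), and all names here are assumptions:

```python
def estimate_tokens(messages):
    """Very rough heuristic: ~4 characters per token."""
    return sum(len(m["content"]) for m in messages) // 4

def validate_context(messages, context_window, reserved_for_output=1024):
    """Sketch: reject oversized prompts locally, before wasting a
    round-trip on a request the provider will refuse anyway."""
    estimated = estimate_tokens(messages)
    budget = context_window - reserved_for_output
    if estimated > budget:
        raise ValueError(
            f"Estimated {estimated} prompt tokens exceeds budget of {budget} "
            f"(window {context_window} minus {reserved_for_output} reserved "
            "for the response)")
    return estimated

msgs = [{"role": "user", "content": "x" * 8000}]  # ~2000 tokens
try:
    validate_context(msgs, context_window=2048)
except ValueError as e:
    print("rejected before the API call:", e)
```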

2f. Knowledge retrieval results unbounded

File: src/praisonai-agents/praisonaiagents/knowledge/retrieval.py (lines 144+)

reciprocal_rank_fusion() merges results from multiple retrievers without deduplication or limit checks. Combined results can exceed the agent's context window. The reranker (which would filter by quality) is disabled by default (enabled: False at knowledge.py line 115).
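
The dedup-and-cap fix can be sketched as a standard RRF implementation; the signature is hypothetical and may differ from the real praisonaiagents function:

```python
def reciprocal_rank_fusion(result_lists, k=60, limit=10):
    """Sketch of RRF with deduplication and a result cap: a document's
    score is the sum of 1/(k + rank) over every list it appears in,
    so duplicates merge into one entry instead of repeating."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:limit]  # dedup via dict keys, capped at `limit`

lists = [["a", "b", "c"], ["b", "c", "d"]]
print(reciprocal_rank_fusion(lists, limit=3))  # ['b', 'c', 'a']
```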

Recommended fix direction

  • Default context to "sliding_window" instead of None
  • Add TTL-based expiry to long-term memory stores
  • Remove dual-persistence backward-compatibility shim
  • Enforce max_checkpoints config
  • Add pre-call token count estimation and truncation
  • Enable reranker by default or cap retrieval result count

Gap 3: Incomplete Resource Lifecycle — Timeouts Don't Cancel, Close Doesn't Cleanup

Severity: HIGH | Principle violated: "Production-ready", "Safe by default"

Multiple subsystems have timeout or shutdown mechanisms that don't actually stop running work, leading to resource leaks, orphaned threads, and zombie processes in production.

3a. Workflow timeout doesn't cancel running tasks

File: src/praisonai-agents/praisonaiagents/process/process.py (lines 429-433)

if self.workflow_timeout is not None:
    elapsed = time.monotonic() - workflow_start
    if elapsed > self.workflow_timeout:
        logging.warning("Workflow timeout...")
        break  # Just breaks the loop — doesn't cancel running tasks

Problem: break exits the orchestration loop but doesn't:

  • Cancel the currently executing task/agent
  • Clean up resources (file handles, connections)
  • Wait for the current task to reach a safe state

The timed-out task continues running in the background, consuming LLM API quota and possibly producing side effects.
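
A cooperative-cancellation sketch using asyncio.wait_for, which cancels the wrapped task and awaits it (so finally-blocks run) before raising. This is illustrative only; the real orchestration loop would apply it per task:

```python
import asyncio

async def run_with_timeout(task_coro, timeout):
    """Sketch: a timeout that actually cancels the running task.
    asyncio.wait_for cancels the task on timeout and awaits the
    cancellation, so cleanup code inside the task completes."""
    task = asyncio.create_task(task_coro)
    try:
        return await asyncio.wait_for(task, timeout)
    except asyncio.TimeoutError:
        # The task has already been cancelled and awaited here
        return "cancelled"

async def slow_step():
    try:
        await asyncio.sleep(10)  # stands in for a long LLM call
        return "done"
    finally:
        pass  # cleanup here runs because the cancellation is awaited

print(asyncio.run(run_with_timeout(slow_step(), timeout=0.05)))  # cancelled
```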

3b. Tool execution timeout doesn't stop the thread

File: src/praisonai-agents/praisonaiagents/agent/tool_execution.py (lines 196-205)

with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(self._execute_tool_impl, function_name, arguments)
    try:
        result = future.result(timeout=tool_timeout)
    except concurrent.futures.TimeoutError:
        result = {"error": f"Tool timed out..."}
# ThreadPoolExecutor.__exit__ returns immediately — thread still runs!

Problem: Python's ThreadPoolExecutor has no mechanism to interrupt a running thread. After timeout, the thread continues executing (possibly holding connections, file handles, or making API calls). The executor context manager exits, but the thread is not cancelled — it runs to completion silently.

Impact: For I/O-heavy tools (web scraping, database queries), a "timed out" tool continues consuming resources indefinitely.
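
One way to make timeouts actually stop work is to run the tool in a subprocess, which, unlike a thread, can be terminated. A minimal sketch with hypothetical function names:

```python
import multiprocessing as mp
import time

def _slow_tool(queue, seconds):
    # Stands in for an I/O-heavy tool; must be module-level so it is
    # picklable under the spawn start method
    time.sleep(seconds)
    queue.put("done")

def run_tool_in_subprocess(seconds, timeout):
    """Sketch: run the tool in a subprocess so a timeout can actually
    kill it. Threads cannot be interrupted; processes can."""
    queue = mp.Queue()
    proc = mp.Process(target=_slow_tool, args=(queue, seconds))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()  # actually stops the work, unlike a thread
        proc.join()
        return {"error": "tool timed out and was terminated"}
    return {"result": queue.get()}

if __name__ == "__main__":
    print(run_tool_in_subprocess(seconds=10, timeout=0.2))
```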

3c. Agent close() doesn't fully clean up

File: src/praisonai-agents/praisonaiagents/agent/agent.py (lines 4605-4623)

def close(self):
    self._cleanup_server_registrations()
    self._closed = True
    # Doesn't: wait for pending RPC calls, unsubscribe from streams,
    # close LLM connections, flush pending memory writes

async def aclose(self):
    for task in self._background_tasks:
        task.cancel()  # Cancel but don't await — no guarantee of cleanup

Problem:

  • Background tasks cancelled but not awaited (no await task after cancel)
  • StreamEventEmitter subscriptions not unsubscribed
  • Pending memory writes not flushed
  • LLM client connections not closed

3d. Global OpenAI client replaced without closing old instance

File: src/praisonai-agents/praisonaiagents/llm/openai_client.py (lines 2198-2227)

with _global_client_lock:
    if _global_client is None or _global_client_params != current_params:
        _global_client = OpenAIClient(...)  # Old client just garbage collected
        _global_client_params = current_params

When model parameters change mid-session, the old OpenAIClient is discarded without closing its HTTP connections. Under frequent model switches, this leaks HTTP connections.

3e. Streaming fallback loses accumulated state

File: src/praisonai-agents/praisonaiagents/llm/llm.py (lines 2141-2248)

When streaming encounters a JSON parse error mid-stream, the code falls back to non-streaming:

response_text = ""  # Reset — all accumulated content LOST
tool_calls = []     # Reset
# Re-sends the entire request non-streaming

Any content already streamed to the user is lost and re-generated, causing duplicate output or missing segments.
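
A buffering fix could look like this minimal sketch, where StreamBuffer is a hypothetical helper rather than an existing class:

```python
class StreamBuffer:
    """Sketch: accumulate streamed chunks so a mid-stream fallback can
    resume from the last good state instead of resetting to ''."""
    def __init__(self):
        self.chunks = []

    def feed(self, chunk):
        self.chunks.append(chunk)

    def text(self):
        return "".join(self.chunks)

buf = StreamBuffer()
for chunk in ["The answer", " is", " 42"]:
    buf.feed(chunk)
# On a parse error, fall back WITHOUT discarding what was already streamed:
already_streamed = buf.text()
print(already_streamed)  # The answer is 42
```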

3f. Daemon server threads killed without cleanup

File: src/praisonai-agents/praisonaiagents/agent/execution_mixin.py (lines 1163, 1266)

server_thread = threading.Thread(target=run_server, daemon=True)
server_thread.start()
# No stop mechanism, no graceful shutdown, no connection draining

Daemon threads are forcibly terminated on process exit without closing connections or saving state.

Recommended fix direction

  • Implement cooperative cancellation for workflow tasks (cancellation tokens / asyncio.Task.cancel())
  • For tool timeouts, run tools in subprocesses (not threads) so they can be killed
  • await cancelled tasks in aclose() with a short grace period
  • Close old LLM clients explicitly before replacing
  • Buffer streamed content so fallback can resume from last good state
  • Add graceful shutdown hooks for server threads

Summary

| Gap | Severity | Core Principle Violated | Key Files |
|---|---|---|---|
| 1. Concurrency/Async Safety | CRITICAL | "Multi-agent + async safe by default" | async_safety.py, process.py, session.py, execution_mixin.py |
| 2. Unbounded Memory Growth | HIGH | "Performance-first", "Production-ready" | memory/core.py, agent.py, checkpoints/service.py, knowledge/retrieval.py |
| 3. Resource Lifecycle Gaps | HIGH | "Production-ready", "Safe by default" | tool_execution.py, process.py, agent.py, openai_client.py |

These three gaps are interconnected: concurrency bugs cause data corruption in the memory system, unbounded memory growth makes the lifecycle gaps worse (more resources to leak), and incomplete cleanup amplifies the concurrency issues (orphaned threads modifying shared state).

Addressing Gap 1 first is recommended — it has the highest blast radius and affects the correctness of every multi-agent workflow.

Metadata

Labels: bug (Something isn't working), claude (Auto-trigger Claude analysis), enhancement (New feature or request), performance