
Feature/cog 4463 graph aware embeddings implement minimal graphdb in postgres#2584

Merged
lxobr merged 23 commits into dev from
feature/cog-4463-graph-aware-embeddings-implement-minimal-graphdb-in-postgres
Apr 10, 2026

Conversation


@DanielNScott DanielNScott commented Apr 9, 2026

Description

Implements a PostgreSQL-based graph database adapter for Cognee.

The adapter lives in cognee/infrastructure/databases/graph/postgres/ and implements GraphDBInterface using two tables (graph_node, graph_edge). All graph operations are implemented using SQL or SQLAlchemy.

Key design decisions:

  • Composite primary key (source_id, target_id, relationship_name) on triplets
    • enforces one-edge-per-triple semantics
    • matches the MERGE behavior of Neo4j and Kuzu
  • Covering indexes on edges
    • the index already carries the columns a neighbor lookup needs, so no additional heap fetch is required
    • as a result, neighbor lookups are fast
  • Adapter creates its own SQLAlchemy engine from connection string
    • matches the convention of all other graph adapters
    • no cross-layer dependency on the relational engine
  • Factory requires GRAPH_DATABASE_URL when GRAPH_DATABASE_PROVIDER=postgres
    • consistent with Neo4j and Kuzu configuration
  • Schema definitions in tables.py, adapter logic in adapter.py
  • query() raises NotImplementedError by design
    • raw Cypher not supported on postgres backend
    • arbitrary Cypher is not practically implementable via SQL
    • documented in CLAUDE.md
  • get_neighborhood uses a recursive CTE with UNION ALL
    • single query, matching get_nodeset_subgraph pattern
    • benefits: fewer round trips than the three-query pattern other adapters use
    • liabilities: harder to debug and can't independently optimize node vs edge fetch
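Under the design above, the two-table layout might look roughly like this in DDL form (the adapter actually defines these via SQLAlchemy Core in tables.py; column names beyond the composite key are illustrative, not copied from the PR):

```sql
CREATE TABLE IF NOT EXISTS graph_node (
    id          varchar PRIMARY KEY,
    name        varchar,
    type        varchar,
    properties  jsonb,
    updated_at  timestamptz DEFAULT now()
);

CREATE TABLE IF NOT EXISTS graph_edge (
    source_id          varchar NOT NULL,
    target_id          varchar NOT NULL,
    relationship_name  varchar NOT NULL,
    properties         jsonb,
    -- composite PK: one edge per (source, target, relationship) triple,
    -- mirroring Neo4j/Kuzu MERGE semantics
    PRIMARY KEY (source_id, target_id, relationship_name)
);

-- covering indexes so neighbor lookups can be satisfied from the index alone
CREATE INDEX IF NOT EXISTS ix_graph_edge_source
    ON graph_edge (source_id) INCLUDE (target_id, relationship_name);
CREATE INDEX IF NOT EXISTS ix_graph_edge_target
    ON graph_edge (target_id) INCLUDE (source_id, relationship_name);
```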

Also includes:

  • Config save/restore in test_graphdb_shared.py so E2E tests don't leak mutated global state
  • GRAPH_DATABASE_URL added to CI postgres E2E step in graph_db_tests.yml

Acceptance Criteria

  • PostgresAdapter implements all required GraphDBInterface methods except query()
  • 26 unit tests pass against a live Postgres instance (test_postgres_adapter.py)
  • E2E test (test_graphdb_postgres.py) passes through full add/cognify/search pipeline
  • CI configuration updated in graph_db_tests.yml for both unit and E2E test steps
  • CI passes on upstream

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Code refactoring
  • Other (please specify):

Screenshots

pytest: 1080 passed, 0 failed, 15 skipped

Pre-submission Checklist

  • I have tested my changes thoroughly before submitting this PR (See CONTRIBUTING.md)
  • This PR contains minimal changes necessary to address the issue/feature
  • My code follows the project's coding standards and style guidelines
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if applicable)
  • All new and existing tests pass
  • I have searched existing PRs to ensure this change hasn't been submitted already
  • I have linked any relevant issues in the description
  • My commits have clear and descriptive messages

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

Summary by CodeRabbit

  • New Features

    • Added PostgreSQL as a supported graph database backend with CRUD, traversal, metrics, filtering, and pagination features.
  • Documentation

    • Added PostgreSQL configuration example and clarified limitations (no raw Cypher, no natural-language graph search, no Graphiti).
  • Tests

    • Added adapter and e2e tests for PostgreSQL and CI support to run Postgres-specific test jobs with configurable Python version.
  • Behavior Changes

    • Cypher-based retrievers and Graphiti/temporal indexing now explicitly mark Postgres as unsupported.

DanielNScott and others added 18 commits March 26, 2026 08:38
Problems:
- Need to migrate basic graph functionality to relational DB

Solutions:
- New adapter implementing GraphDBInterface
- Selection branch included in get_graph_engine.py

Decisions:
- re-use SQLAlchemy setup and configuration
- Remove .query() (raw Cypher) functionality
Problems:
- Cypher search retriever exposes raw cypher interface via .query()
- Natural language retriever writes, uses Cypher via .query()
- Temporal awareness (Graphiti) subsystem directly uses cypher, neo4j

Solutions:
- Raise incompatibility errors proactively across these consumers

Decisions:
- None of these are core functionality, so the cost should be low
- Graphiti already hard-codes multiple configs without guards
- Complex Cypher search cannot be replaced on the postgres backend at present
Problems:
- asyncpg misparses ::jsonb cast as named parameter
- prune_system drops graph tables via delete_database()
- cached adapter instance has no tables after prune
- no CI coverage for postgres graph backend
- no unit or e2e tests for postgres adapter

Solutions:
- replace ::jsonb with CAST(:param AS jsonb) throughout
- add _ensure_initialized() calling idempotent CREATE IF NOT EXISTS
- guard is_empty and delete_graph with _ensure_initialized
- add run-postgres-tests job to graph_db_tests.yml CI workflow
- add 26 unit tests against real Postgres
- add shared e2e test and postgres wrapper

Decisions:
- _ensure_initialized over defensive checks on every method
- real Postgres in tests, not SQLite (dialect differences block it)
- shared test function for future backend reuse (test_graphdb_shared)
- CI uses Postgres 16 service container, matching Kuzu/Neo4j pattern
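The ::jsonb rewrite above can be illustrated with SQLAlchemy text() statements (statement text here is hypothetical, not copied from the adapter):

```python
from sqlalchemy import text

# asyncpg prepares statements itself and can misread the "::jsonb" shorthand
# cast adjacent to a named parameter, so the cast is spelled out explicitly.
broken = text("UPDATE graph_node SET properties = :props::jsonb WHERE id = :id")
fixed = text("UPDATE graph_node SET properties = CAST(:props AS jsonb) WHERE id = :id")
```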
Problems:
- some white-space and wrapping non-compliance

Solutions:
- fix via ruff
Problems:
- Inadequate batching
   - add_nodes loops N awaits, one INSERT round-trip each
   - add_edges same per-row loop pattern
   - has_edges loops N SELECT EXISTS queries
- raw text() SQL construction, not SQLAlchemy

Solutions:
- multi-row insert instead of looping
   - add_nodes uses pg_insert with multi-row VALUES batch
   - add_edges uses pg_insert with multi-row VALUES batch
   - has_edges uses SQLAlchemy values() with EXISTS join

- added SQLAlchemy Core table definitions for DML
   - will remove next commit to maintain single source of truth

Decisions:
- maximum batching optimization
   - multi-row INSERT over executemany for single-statement execution
   - executemany pipelines but still N server-side executions
   - looped awaits worst, executemany middle, VALUES best batching

- chose SQLAlchemy Core over raw text() for proactive safety
   - text() with bind params was safe but not project convention
   - SQLAlchemy eliminates all raw SQL string construction

- keep problematic reproduction of DDL logic for commit specificity
   - will fix in follow up
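The batching decision above can be sketched with SQLAlchemy's postgresql insert; the table shape is a simplified stand-in for the real schema:

```python
from sqlalchemy import Column, MetaData, String, Table
from sqlalchemy.dialects import postgresql
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.dialects.postgresql import insert as pg_insert

# Simplified stand-in for the node table; real columns differ.
meta = MetaData()
node = Table(
    "graph_node", meta,
    Column("id", String, primary_key=True),
    Column("properties", JSONB),
)

rows = [
    {"id": "n1", "properties": {"a": 1}},
    {"id": "n2", "properties": {"b": 2}},
]

# One multi-row VALUES statement: a single server-side execution, versus N
# executions with executemany or N awaited round trips in a per-row loop.
stmt = pg_insert(node).values(rows)
stmt = stmt.on_conflict_do_update(
    index_elements=[node.c.id],
    set_={"properties": stmt.excluded.properties},
)
sql = str(stmt.compile(dialect=postgresql.dialect()))
```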
Problems:
- raw DDL strings and Core table objects duplicated schema

Solutions:
- replaced _SCHEMA_DDL list with Core Table and Index definitions
- initialize() now uses _meta.create_all() instead of raw DDL
- removed asyncio import (unused)
- replaced postgresql TIMESTAMP with generic DateTime(timezone=True)
Note:
- Most problems below were cosmetic or stylistic rather than substantive.

Problems:
- SQL injection via unvalidated attribute names in get_filtered_graph_data
- empty attribute_filters can produce invalid SQL syntax
- local "values" var shadows sqlalchemy values import in add_edges
- JSON serialize then deserialize indirection in add_nodes
- get_nodeset_subgraph issues four sequential DB queries
- get_filtered_graph_data issued two sequential DB queries
- public methods missing docstrings, unlike Neo4j and Kuzu adapters
- section banners inconsistent with other adapters
- narrating inline comments restate what code already says
- python-version input missing from graph_db_tests workflow
- SearchTypeNotSupported imported inline instead of at module level

Solutions:
- add attribute whitelist, raise ValueError for invalid filter keys
- fall back to get_graph_data when attribute_filters is empty
- rename local values to rows in add_edges
- inline property dict construction, remove _split_core_and_extra
- combine get_nodeset_subgraph into single CTE query
- combine get_filtered_graph_data into single CTE query
- add Parameters/Returns docstrings to all public methods
- remove section banners throughout adapter
- trim narrating comments, keep complex clarifying ones
- add python-version input with default 3.11 to workflow_call
- move SearchTypeNotSupported to module-level import

Reasoning and Related Notes:
- prune_system metadata=True drops relational DB including graph tables
   - _ensure_initialized kept defensive as a result
- JSON round-trip in add_nodes retained for UUID and datetime serialization
   - specifically, enforces JSONB compatibility (postgres wants python dict)
   - alternative is in-place modification
   - adding rules for doing this would replicate JSONEncoder
   - JSONEncoder outputs strings, so we need to convert back if used
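The whitelist guard described above might reduce to something like this (names are hypothetical; the real method also builds the filtered query):

```python
# Hypothetical validation step: filter keys are checked against known column
# names before they are ever used to build SQL, so user input cannot reach
# an identifier position.
ALLOWED_FILTER_ATTRIBUTES = {"id", "name", "type"}

def validate_attribute_filters(attribute_filters: dict) -> dict:
    invalid = set(attribute_filters) - ALLOWED_FILTER_ATTRIBUTES
    if invalid:
        raise ValueError(f"Unsupported filter attributes: {sorted(invalid)}")
    return attribute_filters
```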
Problems:
- tuple merge order lets properties dict override authoritative node ID
- __init__ and _session missing type annotations
- magic number 6 in test_graphdb_shared unexplained
- raised in review but incorrect: a request to add a weighted-edges test

Solutions:
- swap merge order so tuple ID always wins
- add type annotations to __init__ and _session
- add comment explaining 3 searches x 2 entries = 6
- no duplication of weighted edges test necessary or appropriate

Reasoning and Related Notes:
- weighted edges test is backend-agnostic, exercises get_graph_from_model only
   - running it under Postgres job would not test adapter round-trip
   - no backend actually has a round-trip weighted edges test
   - therefore kuzu and neo4j cases are incorrect and duplicates

- recursive CTE performance in get_graph_metrics is not part of pipeline
   - Kuzu returns -1 for components, Neo4j uses GDS plugin
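The merge-order fix above is just dictionary spread order (a sketch; the adapter's row-building code differs in detail):

```python
def build_node_row(node_id: str, properties: dict) -> dict:
    # The tuple ID is authoritative: placing it after the spread means a
    # stray properties["id"] can never override it. The buggy order was
    # {"id": node_id, **properties}, which let properties win.
    return {**properties, "id": node_id}
```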
Major problems:
- adapter assumed relational layer is postgres and shared
- users could not run graph-DB on a separate postgres instance
- factory fell back to relational config instead of requiring graph URL
- sqlite relational + postgres graph would pass sqlite engine to adapter

Minor problems:
- table definitions mixed with adapter logic in one file
- _ensure_initialized wrapper added no behavior over initialize()
- pg_insert and String imports were missing
- tests created a new event loop per test, leaving the engine pool stale

Solutions:
- adapter creates its own engine from connection string
- factory requires GRAPH_DATABASE_URL, matching other providers
- extract table and index definitions into tables.py
- inline initialize() calls, remove _ensure_initialized
- add missing pg_insert and String imports
- add session-scoped event_loop fixture for tests
- add GRAPH_DATABASE_URL to CI postgres E2E step

Architectural Trade-Off Decisions:
- separate pool over shared pool, matching adapter convention
- hybrid branch can inject shared engine via factory as needed

Reasoning and Related Notes:
- addresses items from PR #2506 review (dexters1)
- all other adapters create their own clients in __init__
- pytest-asyncio 0.21 creates a new event loop per test by default
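The factory requirement above can be sketched as a small validation helper (function name and messages are hypothetical; the adapter then builds its own async engine and pool from the returned URL):

```python
from typing import Optional

from sqlalchemy.engine import make_url

def require_graph_database_url(url: Optional[str]) -> str:
    """Require GRAPH_DATABASE_URL when GRAPH_DATABASE_PROVIDER=postgres,
    instead of silently falling back to the relational config."""
    if not url:
        raise ValueError(
            "GRAPH_DATABASE_PROVIDER=postgres requires GRAPH_DATABASE_URL"
        )
    # make_url only parses; it does not import the DB driver.
    parsed = make_url(url)
    if not parsed.drivername.startswith("postgresql"):
        raise ValueError(f"Expected a postgresql URL, got {parsed.drivername}")
    return url
```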
…al-graphdb-in-postgres' of https://github.com/DanielNScott/cognee into feature/cog-4463-graph-aware-embeddings-implement-minimal-graphdb-in-postgres
…dev merge.

Problems:
- get_neighborhood abstract method missing after dev merge
- get_nodeset_subgraph rejected node_name_filter_operator kwarg
- E2E test could leak mutated config to subsequent tests when run under pytest

Solutions:
- implement get_neighborhood with recursive CTE in single query
- add node_name_filter_operator param to get_nodeset_subgraph
- capture and restore graph provider and root dirs in finally block

Reasoning and Related Notes:
- recursive CTE + UNION ALL matches get_nodeset_subgraph pattern
- other adapters use three queries; single query avoids round trips
- config restore addresses CodeRabbit review item 5 on PR #2506
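The single-query neighborhood fetch above could take roughly this shape (illustrative SQL; the adapter's actual CTE, parameters, and depth handling may differ):

```sql
WITH RECURSIVE frontier(id, depth) AS (
    -- seed with the starting node
    SELECT CAST(:node_id AS varchar), 0
    UNION ALL
    -- expand one hop in either direction until the depth limit
    SELECT CASE WHEN e.source_id = f.id THEN e.target_id ELSE e.source_id END,
           f.depth + 1
    FROM frontier f
    JOIN graph_edge e ON e.source_id = f.id OR e.target_id = f.id
    WHERE f.depth < :max_depth
)
SELECT DISTINCT n.id, n.properties
FROM frontier f
JOIN graph_node n ON n.id = f.id;
```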
Problems:
- postgres tests in unit/ fail without live database in CI
- unit tests should mock DBs per project conventions

Solutions:
- moved 3 test files to cognee/tests/e2e/postgres/
- created __init__.py for e2e/ and e2e/postgres/
- updated import path in test_graphdb_postgres.py
- updated 2 test paths in graph_db_tests.yml

Architectural Trade-Off Decisions:
- chose e2e/ over integration/ to avoid automatic CI sweep

Reasoning and Related Notes:
- requested by Igor (dexters1) during PR #2506 review
- mirrors identical move on sister hybrid-adapter branch
…mbeddings-implement-minimal-graphdb-in-postgres' into feature/cog-4463-graph-aware-embeddings-implement-minimal-graphdb-in-postgres

pull-checklist bot commented Apr 9, 2026

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.


github-actions bot commented Apr 9, 2026

Hello @DanielNScott, thank you for submitting a PR! We will respond as soon as possible.


coderabbitai bot commented Apr 9, 2026

Walkthrough

Adds PostgreSQL as a new graph backend: introduces an async PostgresAdapter and schema, integrates provider into the graph engine factory, updates docs and CI, adds Postgres E2E and adapter tests, and adds runtime guards preventing Cypher/Graphiti workflows on Postgres. (48 words)

Changes

  • Adapter & Schema: cognee/infrastructure/databases/graph/postgres/adapter.py, cognee/infrastructure/databases/graph/postgres/tables.py
    New PostgresAdapter implementing GraphDBInterface using async SQLAlchemy/JSONB; defines graph_node/graph_edge tables, indexes, batch upserts, recursive CTEs for traversal/metrics, many CRUD/query methods; query() raises NotImplementedError.
  • Engine Factory: cognee/infrastructure/databases/graph/get_graph_engine.py
    Factory branch to instantiate PostgresAdapter when GRAPH_DATABASE_PROVIDER=postgres; validates GRAPH_DATABASE_URL and updates the supported-provider error text.
  • Retrieval & Temporal Guards: cognee/modules/retrieval/cypher_search_retriever.py, cognee/modules/retrieval/natural_language_retriever.py, cognee/tasks/temporal_awareness/index_graphiti_objects.py
    Added guards that detect PostgresAdapter and raise SearchTypeNotSupported or RuntimeError to prevent Cypher/Graphiti-based flows on Postgres; minor log-text correction.
  • E2E & Adapter Tests: cognee/tests/e2e/postgres/test_graphdb_postgres.py, cognee/tests/e2e/postgres/test_graphdb_shared.py, cognee/tests/e2e/postgres/test_postgres_adapter.py
    New shared E2E harness and Postgres-specific test entrypoint; comprehensive adapter tests covering CRUD, neighbors, metrics, triplet pagination, cleanup, and expected error cases.
  • CI Workflow: .github/workflows/graph_db_tests.yml
    Added python-version workflow input and a conditional run-postgres-tests job that starts a Postgres 16 service and runs adapter/E2E tests when requested.
  • Documentation: CLAUDE.md
    Documented PostgreSQL as a supported graph provider (GRAPH_DATABASE_PROVIDER=postgres) with an example GRAPH_DATABASE_URL and noted limitations (no raw Cypher, NL search, or Graphiti).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

core-team

Suggested reviewers

  • hajdul88
  • borisarzentar
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 50.79%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: the title accurately describes the main change, implementing a PostgreSQL-based graph database adapter for Cognee as a new feature.
  • Description check ✅ Passed: the PR description addresses all template sections: a clear description of the changes, explicit acceptance criteria, feature classification, test results, and a completed pre-submission checklist.



gitguardian bot commented Apr 9, 2026

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request

  • GitGuardian id: 9573981 · Status: Triggered · Secret: Generic Password · Commit: da6a001 · Filename: .env.template
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
cognee/infrastructure/databases/graph/postgres/adapter.py (2)

613-685: Recursive CTE for connected components may be expensive on large graphs.

The connected components computation uses a recursive CTE that can be costly for graphs with many nodes/edges. This is acceptable for moderate sizes but may need optimization (e.g., materialized views, incremental updates) for production workloads.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 613 -
685, get_graph_metrics currently computes connected components with an expensive
recursive CTE (the "component" CTE inside get_graph_metrics), which can be
unacceptable on large graphs; modify get_graph_metrics to avoid running the
recursive CTE by default and instead either (A) skip component computation
unless a new flag (e.g., compute_components=True) is passed, (B) only run the
CTE when num_nodes is below a safe threshold (use num_nodes from the earlier
count), or (C) query a precomputed/materialized table (e.g., graph_component or
a cache) if available—implement one of these approaches, update the metrics keys
accordingly (sizes_of_connected_components and num_connected_components), and
document the new behavior so callers can opt in when they know the dataset is
small or choose to use a precomputed source instead of the in-query recursive
CTE.

542-611: Consider implementing node_name_filter_operator or documenting current behavior.

The parameter node_name_filter_operator (default "OR") is accepted but not used in the query, which always uses AND name = ANY(:names) (effectively OR semantics). This matches the default, but if AND logic is needed later, the parameter is misleading.

💡 Option: document the current behavior
     async def get_nodeset_subgraph(
         self, node_type: Type[Any], node_name: List[str], node_name_filter_operator: str = "OR"
     ) -> Tuple[List[Tuple[str, dict]], List[Tuple[str, str, str, dict]]]:
         """Retrieve a subgraph containing matching nodes, their neighbors, and interconnecting edges.

         Parameters:
         -----------
             node_type: The DataPoint subclass whose __name__ is the type label.
             node_name: List of node names to match.
+            node_name_filter_operator: Currently unused; OR semantics are always applied.

         Returns:
         --------
             A tuple of (nodes, edges) for the subgraph.
         """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 542 -
611, The parameter node_name_filter_operator on get_nodeset_subgraph is accepted
but unused; either remove it or implement proper behavior—update
get_nodeset_subgraph to build the WHERE clause for primary_nodes dynamically: if
node_name_filter_operator == "OR" keep WHERE type = :label AND name =
ANY(:names); if "AND" build a conjunctive predicate (e.g., WHERE type = :label
AND name = :name_0 AND name = :name_1 ... with individual params for each name)
or validate/raise for unsupported values; ensure you adjust the session.execute
parameter binding to match the chosen SQL and update callers or docstring
accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CLAUDE.md`:
- Around line 286-290: The Postgres example snippet sets
GRAPH_DATABASE_PROVIDER=postgres but omits the required GRAPH_DATABASE_URL;
update the snippet in CLAUDE.md to include GRAPH_DATABASE_URL with an example
connection string (e.g.
GRAPH_DATABASE_URL=postgresql://user:password@host:port/dbname) alongside
GRAPH_DATABASE_PROVIDER=postgres so the Postgres graph factory accepts the
configuration; ensure the example mentions that this URL is required when using
GRAPH_DATABASE_PROVIDER=postgres.

In `@cognee/infrastructure/databases/graph/postgres/tables.py`:
- Around line 15-16: The updated_at column currently uses
server_default=func.now() but lacks onupdate, and the upsert statements in the
Postgres adapter assign stmt.excluded.updated_at (the stale insert timestamp)
instead of the server time; change the Column("updated_at",
DateTime(timezone=True), ...) in tables.py to include onupdate=func.now(), and
in the adapter's node upsert and edge upsert logic replace any use of
stmt.excluded.updated_at with func.now() so that upserts set updated_at to the
server current time.

In `@cognee/tasks/temporal_awareness/index_graphiti_objects.py`:
- Around line 23-31: Replace the negative Postgres-only guard with a positive
Neo4j check: instead of "if isinstance(graph_engine, PostgresAdapter): raise
RuntimeError(...)" ensure you verify that graph_engine is a Neo4j backend (e.g.,
isinstance(graph_engine, Neo4jAdapter) or graph_engine.provider == "neo4j") and
otherwise raise the same unsupported-feature exception class used by the
retrievers (import that exception and use it in place of RuntimeError) so
non-Neo4j backends (Kuzu, Neptune, etc.) get a clear, consistent
unsupported-feature error; keep references to the existing symbols graph_engine
and PostgresAdapter while adding the Neo4j adapter check and the retriever
exception import.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e1cf24ef-0c83-4ae5-9e9e-a2cc92985494

📥 Commits

Reviewing files that changed from the base of the PR and between 66b8c0c and da6a001.

📒 Files selected for processing (14)
  • .github/workflows/graph_db_tests.yml
  • CLAUDE.md
  • cognee/infrastructure/databases/graph/get_graph_engine.py
  • cognee/infrastructure/databases/graph/postgres/__init__.py
  • cognee/infrastructure/databases/graph/postgres/adapter.py
  • cognee/infrastructure/databases/graph/postgres/tables.py
  • cognee/modules/retrieval/cypher_search_retriever.py
  • cognee/modules/retrieval/natural_language_retriever.py
  • cognee/tasks/temporal_awareness/index_graphiti_objects.py
  • cognee/tests/e2e/__init__.py
  • cognee/tests/e2e/postgres/__init__.py
  • cognee/tests/e2e/postgres/test_graphdb_postgres.py
  • cognee/tests/e2e/postgres/test_graphdb_shared.py
  • cognee/tests/e2e/postgres/test_postgres_adapter.py

…et_subgraph.

Problems:
- node_name_filter_operator param accepted but only OR was implemented
- AND callers would silently get OR behavior

Solutions:
- branch neighbor_ids CTE on filter operator, matching Kuzu and Neo4j
- AND uses GROUP BY / HAVING COUNT(DISTINCT) = primary_count
- OR unchanged, uses DISTINCT on any neighbor
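The AND branch above might look roughly like this (names assumed from the schema described in this PR, with primary_nodes as an earlier CTE):

```sql
-- Keep only neighbors connected to every primary node: a neighbor reached
-- from k distinct primaries satisfies AND only when k equals the primary count.
SELECT CASE WHEN e.source_id = p.id THEN e.target_id ELSE e.source_id END AS neighbor_id
FROM primary_nodes p
JOIN graph_edge e ON e.source_id = p.id OR e.target_id = p.id
GROUP BY neighbor_id
HAVING COUNT(DISTINCT p.id) = (SELECT COUNT(*) FROM primary_nodes);
```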
Problems:
- CLAUDE.md postgres snippet missing required GRAPH_DATABASE_URL
- upserts used stale insert timestamp instead of server time for updated_at
- Graphiti gate excluded only Postgres instead of requiring Neo4j

Solutions:
- add GRAPH_DATABASE_URL to CLAUDE.md postgres config example
- use func.now() in node and edge upsert conflict handlers
- check graph_database_provider != "neo4j" instead of isinstance(PostgresAdapter)
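The func.now() fix above, sketched with a simplified table (the real columns live in tables.py):

```python
from sqlalchemy import Column, DateTime, MetaData, String, Table, func
from sqlalchemy.dialects import postgresql
from sqlalchemy.dialects.postgresql import insert as pg_insert

# Simplified stand-in for the node table.
meta = MetaData()
node = Table(
    "graph_node", meta,
    Column("id", String, primary_key=True),
    Column("updated_at", DateTime(timezone=True), server_default=func.now()),
)

stmt = pg_insert(node).values([{"id": "n1"}])
stmt = stmt.on_conflict_do_update(
    index_elements=[node.c.id],
    # func.now() is evaluated server-side at update time; using
    # stmt.excluded.updated_at would re-apply the stale INSERT timestamp.
    set_={"updated_at": func.now()},
)
sql = str(stmt.compile(dialect=postgresql.dialect()))
```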

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cognee/infrastructure/databases/graph/postgres/adapter.py`:
- Around line 625-632: The node tuples currently include the id twice (as the
first element and again inside the data dict) causing inconsistent shapes; in
the node-building loops inside get_graph_data/get_filtered_graph_data (and the
similar block referenced at lines ~779-781 used by get_neighborhood) remove the
'id' key from the payload before appending the tuple so each node is (id,
props_without_id). Concretely, when creating data = {"id": row[1], ...} or when
merging row[4], ensure you do NOT set or you pop the 'id' key from data so
nodes.append((row[1], data)) yields a payload without an 'id' field.
- Around line 829-830: The pagination ORDER BY in the SQL fragment currently
uses ORDER BY e.source_id, e.target_id which is not unique; update the ordering
to include the edge primary-key column relationship_name so the sort is total
and stable for OFFSET/LIMIT pagination (e.g., ORDER BY e.source_id, e.target_id,
e.relationship_name). Locate the query building code that emits the ORDER BY
clause in the Postgres graph adapter (the SQL near ORDER BY e.source_id,
e.target_id) and modify it to include e.relationship_name (or all edge PK
columns) to guarantee deterministic paginated triplet results.
- Around line 35-44: The adapter must perform one-time schema creation before
any DB operation: add an async lazy-init guard (e.g. private attributes
_initialized: bool and _init_lock: asyncio.Lock) and call initialize() from
inside the asynccontextmanager _session() if _initialized is False, acquiring
_init_lock to ensure only one coroutine runs initialize(), set _initialized =
True after success, and then yield the session; update _session to await
initialize() under the lock so callers of add_nodes/get_nodes/add_edges (via
GraphDBInterface) never hit missing relation errors.
- Around line 95-98: The add_node/add_nodes upsert currently replaces core
fields and properties when a partial payload is provided; change the upsert
logic in methods add_node and add_nodes so omitted core fields (e.g., "name",
"type") preserve existing values (use COALESCE or equivalent to keep existing
column values) and merge properties instead of replacing them (use jsonb
concatenation/merge like existing_properties || new_properties or
jsonb_strip_nulls(jsonb_concat) semantics) so partial calls (e.g.,
add_node("n1") or add_node("n1", {"foo":1})) patch the node rather than blanking
fields; apply the same fix to the other upsert block referenced around lines
124-145.
- Around line 136-145: The upsert fails when the VALUES batch contains duplicate
conflict keys; before building the pg_insert statement in add_nodes (where you
build stmt = pg_insert(_node_table).values(rows) ...) collapse the incoming
nodes list into a dictionary keyed by node["id"] (or the node id field used) and
keep the last occurrence, then build rows from dict.values(); likewise in
add_edges (the block around where edges are inserted, lines ~257-264)
deduplicate by the tuple (source_id, target_id, relationship_name) and keep the
last occurrence before creating the insert rows; this ensures each conflict key
appears only once in the VALUES batch and prevents the "ON CONFLICT DO UPDATE
command cannot affect row a second time" error.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5630ab52-ecda-4842-8138-5a88170a98dd

📥 Commits

Reviewing files that changed from the base of the PR and between da6a001 and f784a7a.

📒 Files selected for processing (1)
  • cognee/infrastructure/databases/graph/postgres/adapter.py

Comment on lines +95 to +98
if isinstance(node, str):
    props = properties or {}
    props.setdefault("id", node)
    await self.add_nodes([(node, props)])

@coderabbitai coderabbitai bot Apr 10, 2026


⚠️ Potential issue | 🔴 Critical

Partial add_node() upserts currently wipe existing node fields.

Passing a string id or a tuple without name/type writes name="", type="" and replaces properties on conflict. That means calls like add_node("n1") or add_node("n1", {"foo": 1}) can silently blank out previously stored metadata instead of doing a MERGE-style update. Preserve existing core fields when they are omitted, and patch properties instead of replacing them with the partial payload.

Based on learnings: In cognee's graph database adapters (Neo4j, Kuzu, Memgraph, Neptune), the add_node() method performs an upsert operation using MERGE semantics: it creates the node if it doesn't exist, or updates the existing node's properties if it does exist.

Also applies to: 124-145

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 95 -
98, The add_node/add_nodes upsert currently replaces core fields and properties
when a partial payload is provided; change the upsert logic in methods add_node
and add_nodes so omitted core fields (e.g., "name", "type") preserve existing
values (use COALESCE or equivalent to keep existing column values) and merge
properties instead of replacing them (use jsonb concatenation/merge like
existing_properties || new_properties or jsonb_strip_nulls(jsonb_concat)
semantics) so partial calls (e.g., add_node("n1") or add_node("n1", {"foo":1}))
patch the node rather than blanking fields; apply the same fix to the other
upsert block referenced around lines 124-145.

Contributor Author

Non issue.

The pipeline always passes full DataPoint objects via add_nodes(). The string/tuple form of add_node() is only used in tests. No production code path produces a partial node that would trigger the wipe.


🧩 Analysis chain

Let me verify this claim by checking the actual callers in the production code path.


🏁 Script executed:

#!/bin/bash
# Find all callers of add_node and add_nodes outside of test files
rg -n 'add_node\b|add_nodes\b' --type=py -g '!**/test*' -g '!**/tests*' -A2 -B2

Length of output: 15801


@DanielNScott, the search results surface a production-code caller that contradicts that claim. cognee/modules/data/methods/add_model_class_to_graph.py calls add_node with a raw string and a partial property dict (no name) in three places:

# Line 19
await graph.add_node(model_name, dict(type="model"))

# Line 51
await graph.add_node(str(item_type), dict(type="value"))

# Line 63
await graph.add_node(str(field_type), dict(type="value"))

These are not test-only calls — they are production module paths. Under the current upsert logic, any subsequent call to one of these nodes would write name="", overwriting whatever name was stored previously. For the other adapters (Neo4j, Kuzu, Neptune) this is safe because their add_node only accepts DataPoint; PostgresAdapter is the one that extends the signature to also accept str.

If add_model_class_to_graph is never invoked when GRAPH_DATABASE_PROVIDER=postgres, you're right that it's not a runtime risk today — but the interface allows it and the code path exists. At minimum, the add_nodes upsert could use COALESCE to preserve existing name/type on conflict, which would make partial callers safe at no additional cost:

stmt = pg_insert(_node_table).values(rows)
stmt = stmt.on_conflict_do_update(
    index_elements=["id"],
    set_={
        "name": func.coalesce(stmt.excluded.name, _node_table.c.name),
        "type": func.coalesce(stmt.excluded.type, _node_table.c.type),
        "properties": stmt.excluded.properties,
        "updated_at": func.now(),
    },
)

Happy to defer to your judgment if you're confident add_model_class_to_graph won't be routed to the Postgres adapter, but wanted to flag the evidence so the decision is made with full visibility.
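To make the proposed COALESCE / `||` semantics concrete, here is a pure-Python model of the intended merge behavior (the column names follow `graph_node`; the function itself is illustrative, not the adapter's code):

```python
def merge_upsert(existing, incoming):
    """Model of the proposed ON CONFLICT update: COALESCE keeps the existing
    name/type when the incoming value is None, and properties are merged like
    jsonb `existing || incoming` (incoming keys win, others survive)."""
    if existing is None:
        return dict(incoming)
    return {
        "id": existing["id"],
        "name": incoming.get("name") if incoming.get("name") is not None else existing.get("name"),
        "type": incoming.get("type") if incoming.get("type") is not None else existing.get("type"),
        "properties": {**(existing.get("properties") or {}), **(incoming.get("properties") or {})},
    }

stored = {"id": "n1", "name": "Model", "type": "model", "properties": {"a": 1}}
patch = {"id": "n1", "name": None, "type": None, "properties": {"foo": 1}}
print(merge_upsert(stored, patch))
# name and type survive the partial call; properties gain "foo" without losing "a"
```

Under this model, `add_node("n1", {"foo": 1})` patches the stored node instead of blanking it, which is the MERGE behavior the other adapters exhibit.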


🧠 Learnings used
Learnt from: Geoff-Robin
Repo: topoteretes/cognee PR: 1501
File: cognee/tasks/web_scraper/web_scraper_task.py:157-181
Timestamp: 2025-10-08T09:47:12.913Z
Learning: In cognee's graph database adapters (Neo4j, Kuzu, Memgraph, Neptune), the `add_node()` method performs an upsert operation using MERGE semantics: it creates the node if it doesn't exist, or updates the existing node's properties if it does exist. This allows `add_node()` to be used for both creation and updates.

Learnt from: DanielNScott
Repo: topoteretes/cognee PR: 2584
File: cognee/infrastructure/databases/graph/postgres/adapter.py:35-44
Timestamp: 2026-04-10T03:13:45.843Z
Learning: In `topoteretes/cognee`, the factory function `get_graph_engine()` in `cognee/infrastructure/databases/graph/get_graph_engine.py` always calls `await graph_client.initialize()` (via a `hasattr` guard) before returning the adapter. This means all graph adapters, including `PostgresAdapter`, have their schema/tables created before any caller can invoke graph operations. Concerns about missing schema at runtime (e.g., "relation does not exist") do not apply to code paths that go through the factory.

Learnt from: CR
Repo: topoteretes/cognee PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-27T06:30:27.190Z
Learning: Implement database adapters by extending GraphDBInterface or VectorDBInterface

Learnt from: hajdul88
Repo: topoteretes/cognee PR: 1990
File: cognee/modules/search/methods/search.py:124-127
Timestamp: 2026-01-13T12:41:17.082Z
Learning: In Python 3.6+ (PEP 530), you can use await inside asynchronous comprehensions (e.g., [await func(x) for x in items]) within async def functions. This is valid and executes sequentially. If parallel execution is desired, collect coroutines and use asyncio.gather, since awaiting inside a comprehension runs each iteration in sequence unless you explicitly parallelize.

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (6)
cognee/infrastructure/databases/graph/postgres/adapter.py (6)

829-830: ⚠️ Potential issue | 🟡 Minor

ORDER BY is not unique; paginated results may skip or duplicate rows.

The edge primary key is (source_id, target_id, relationship_name), but the ORDER BY omits relationship_name. If two edges share the same endpoints, OFFSET/LIMIT pagination is non-deterministic. Add relationship_name to the ordering.

Suggested fix
-                    ORDER BY e.source_id, e.target_id
+                    ORDER BY e.source_id, e.target_id, e.relationship_name
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 829 -
830, The SQL ORDER BY used when fetching edges is not deterministic because it
orders only by e.source_id, e.target_id but the edge primary key includes
relationship_name; update the ORDER BY in the query that ends with "OFFSET :off
LIMIT :lim" (in postgres adapter code handling edge pagination) to include
e.relationship_name after e.target_id so pagination is stable and rows are
neither skipped nor duplicated.
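A minimal simulation of why the tie-breaking column matters (pure Python stands in for the database here, and the two scan orders are contrived to expose the failure mode):

```python
# Two edges share (source_id, target_id); only relationship_name distinguishes
# them, mirroring the composite primary key described above.
edges = [
    ("a", "b", "likes"),
    ("a", "b", "follows"),
    ("a", "c", "likes"),
]

def fetch_page(scan_order, key, offset, limit):
    """Model one query: the database may scan ties in any order, then sorts."""
    return sorted(scan_order, key=key)[offset:offset + limit]

partial_key = lambda e: (e[0], e[1])        # ORDER BY source_id, target_id
full_key = lambda e: (e[0], e[1], e[2])     # ..., relationship_name

# Two consecutive page requests that happen to see ties in different scan orders:
scan_a = [edges[0], edges[1], edges[2]]
scan_b = [edges[1], edges[0], edges[2]]

pages_partial = fetch_page(scan_a, partial_key, 0, 1) + fetch_page(scan_b, partial_key, 1, 1)
print(pages_partial)  # ("a", "b", "likes") twice: one edge duplicated, "follows" skipped

pages_full = fetch_page(scan_a, full_key, 0, 1) + fetch_page(scan_b, full_key, 1, 1)
print(pages_full)     # deterministic: "follows" then "likes"
```

Ordering by the full primary key makes every tie unique, so OFFSET/LIMIT pages reassemble the complete edge set regardless of scan order.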

625-632: ⚠️ Potential issue | 🟡 Minor

Node tuple shape inconsistent with get_graph_data().

Here the payload includes "id": row[1] (line 629), but get_graph_data() returns (id, data) where data excludes the id key (line 447). Since get_nodeset_subgraph documentation says it returns the same format, remove id from data to match.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 625 -
632, The node dict created in get_nodeset_subgraph (the code building nodes in
adapter.py that currently sets data = {"id": row[1], "name": row[2], "type":
row[3]}) must not include the "id" key to match get_graph_data()'s (id, data)
shape; change the data construction in that loop (the block that builds data for
nodes before nodes.append((row[1], data))) to omit "id" and only include "name"
and "type" (and the merged metadata), ensuring nodes remain tuples of (id, data)
where data does not contain id.

777-781: ⚠️ Potential issue | 🟡 Minor

Node payload includes id via _parse_node_row, inconsistent with get_graph_data().

_parse_node_row() includes id in the returned dict (line 52), so nodes here have shape (id, {"id": ..., "name": ..., ...}). This differs from get_graph_data() which returns (id, {"name": ..., "type": ...}). Either adjust _parse_node_row() for this context or build the payload inline without id.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 777 -
781, The nodes list currently stores tuples like (id, parsed_dict) but
parsed_dict from _parse_node_row(...) already contains an "id" key, making the
payload shape inconsistent with get_graph_data(); update the loop in adapter.py
where nodes are appended so that for each row you call parsed =
self._parse_node_row(row) and then remove or exclude parsed["id"] before
appending (e.g., del parsed["id"] or build a new dict with only the fields
expected by get_graph_data()), ensuring nodes become (id, {"name":...,
"type":...}) and leaving _parse_node_row unchanged.

35-44: ⚠️ Potential issue | 🟠 Major

Schema must be created before any operation, not just is_empty()/delete_graph().

initialize() is only called from is_empty() and delete_graph(). If a caller invokes add_nodes(), get_nodes(), or add_edges() on a fresh database before either of those methods, the query will fail with relation "graph_node" does not exist. Add a lazy-init guard in _session() to ensure the schema exists before any database operation.

Suggested lazy-init pattern
+import asyncio
+
 class PostgresAdapter(GraphDBInterface):
     """Graph-as-tables adapter backed by Postgres, accessed via SQLAlchemy async sessions."""

     _ALLOWED_FILTER_ATTRS = {"id", "name", "type"}

     def __init__(self, connection_string: str) -> None:
         """Create engine and sessionmaker from a Postgres connection string."""
         self.db_uri = connection_string
         self.engine = create_async_engine(self.db_uri)
         self.sessionmaker = async_sessionmaker(bind=self.engine, expire_on_commit=False)
+        self._initialized = False
+        self._init_lock = asyncio.Lock()

     # ...

@asynccontextmanager
     async def _session(self) -> AsyncIterator[Any]:
         """Yield an async session from the underlying engine."""
+        if not self._initialized:
+            async with self._init_lock:
+                if not self._initialized:
+                    await self.initialize()
+                    self._initialized = True
         async with self.sessionmaker() as session:
             yield session
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 35 -
44, The schema creation must be enforced before any DB operation: add a
lazy-init guard in the AsyncContextManager _session() so it ensures initialize()
has run before yielding a session; implement a boolean flag (e.g.,
self._initialized) plus an asyncio.Lock (e.g., self._init_lock) on the adapter
and in _session() check if not self._initialized then acquire self._init_lock
and inside the lock re-check the flag and call await self.initialize(); set
self._initialized = True after initialize() completes, then proceed to open the
session as now; this ensures calls to add_nodes(), get_nodes(), add_edges() (and
any other DB ops) on a fresh DB will create tables via _meta.create_all before
queries run and avoids race conditions.
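The double-checked locking pattern suggested above can be exercised standalone; this sketch swaps the schema creation for a counter so the once-only guarantee is checkable (class and attribute names are illustrative):

```python
import asyncio

class Adapter:
    def __init__(self):
        self._initialized = False
        self._init_lock = asyncio.Lock()
        self.init_calls = 0

    async def initialize(self):
        await asyncio.sleep(0)  # stand-in for CREATE TABLE work
        self.init_calls += 1

    async def _session(self):
        if not self._initialized:              # fast path: no lock once initialized
            async with self._init_lock:
                if not self._initialized:      # re-check under the lock
                    await self.initialize()
                    self._initialized = True
        return "session"

async def main():
    adapter = Adapter()
    # 50 coroutines race to open a session on a "fresh database"
    await asyncio.gather(*(adapter._session() for _ in range(50)))
    return adapter.init_calls

print(asyncio.run(main()))  # 1
```

The unlocked pre-check keeps the steady-state cost to a single attribute read; the re-check inside the lock is what prevents two racing coroutines from both running `initialize()`.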

257-268: ⚠️ Potential issue | 🟠 Major

Same batch upsert issues apply to edges.

If edges contains duplicate (source_id, target_id, relationship_name) tuples, the INSERT will fail. Deduplicate by the composite key before building the VALUES list.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 257 -
268, The upsert can fail if the input edges list contains duplicate (source_id,
target_id, relationship_name) tuples; before building the VALUES for
pg_insert(_edge_table) (the rows variable used to create stmt), deduplicate
edges by the composite key (source_id, target_id, relationship_name) and
collapse duplicates into a single row (e.g., keep the last/merged properties and
updated_at) so that rows passed to stmt contain unique composite keys; modify
the code that builds rows from edges to use a dict or set keyed by (source_id,
target_id, relationship_name) and then use its values when creating stmt and
executing the session.commit.

136-149: ⚠️ Potential issue | 🟠 Major

Batch upsert issues: (1) duplicate keys in VALUES fail, (2) partial payloads overwrite existing fields.

Two unresolved concerns from prior review:

  1. Duplicate keys: If nodes contains the same id twice, PostgreSQL raises ON CONFLICT DO UPDATE command cannot affect row a second time. Deduplicate by id before building the VALUES list.

  2. Partial overwrites: When a node tuple omits name or type, the upsert writes empty strings and replaces properties. This differs from MERGE semantics used by Neo4j/Kuzu adapters. Consider using COALESCE to preserve existing values when the new value is empty, and merge properties with || instead of replacing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 136 -
149, The upsert fails on duplicate ids and overwrites existing fields with empty
values; before building rows deduplicate input nodes by id (e.g., in the
codepath that constructs rows passed to pg_insert(_node_table)) so each id
appears only once, and change the ON CONFLICT update (the stmt created with
pg_insert and stmt.on_conflict_do_update) to preserve existing non-empty values
and merge properties: use COALESCE/NULLIF semantics for "name" and "type" (so
existing values are kept when the incoming value is empty/null) and merge JSONB
properties rather than replace them (use properties || excluded.properties or
equivalent JSONB merge) and still set updated_at = func.now(); these changes
should be made where stmt is defined and executed (the same block using async
with self._session()).
🧹 Nitpick comments (1)
cognee/tasks/temporal_awareness/index_graphiti_objects.py (1)

23-32: Core guard logic is correct; consider using SearchTypeNotSupported for consistency.

The positive Neo4j check addresses the main concern from prior feedback. However, similar guards in cypher_search_retriever.py and natural_language_retriever.py raise SearchTypeNotSupported instead of RuntimeError. Using the same exception class would provide a consistent interface for callers handling unsupported-feature errors.

Suggested alignment with other retrieval guards
     from cognee.infrastructure.databases.graph.config import get_graph_context_config
+    from cognee.modules.retrieval.exceptions import SearchTypeNotSupported

     graph_provider = get_graph_context_config()["graph_database_provider"]
     if graph_provider != "neo4j":
-        raise RuntimeError(
+        raise SearchTypeNotSupported(
             "Graphiti temporal awareness requires the Neo4j graph backend. "
             "Set GRAPH_DATABASE_PROVIDER=neo4j to use this feature."
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `cognee/tasks/temporal_awareness/index_graphiti_objects.py` around lines 23 -
32, The guard currently checks graph_provider from get_graph_context_config()
and raises a RuntimeError when not "neo4j"; replace this with raising
SearchTypeNotSupported for consistency with cypher_search_retriever.py and
natural_language_retriever.py: import SearchTypeNotSupported from its module
(where other retrievers use it) and change the RuntimeError raise in
index_graphiti_objects.py (the graph_provider check block) to raise
SearchTypeNotSupported with the same explanatory message.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 49784333-06cb-4409-be35-6ad36cc66829

📥 Commits

Reviewing files that changed from the base of the PR and between f784a7a and 6b2af3f.

📒 Files selected for processing (3)
  • CLAUDE.md
  • cognee/infrastructure/databases/graph/postgres/adapter.py
  • cognee/tasks/temporal_awareness/index_graphiti_objects.py
✅ Files skipped from review due to trivial changes (1)
  • CLAUDE.md

Problems:
- duplicate keys in one INSERT batch cause Postgres ON CONFLICT error
- node tuple props included id in some methods but not others
- triplet pagination ORDER BY not unique, could skip or duplicate rows

Solutions:
- dedup node rows by id and edge rows by composite key before INSERT
- remove id from props dict in get_nodeset_subgraph and get_neighborhood
- add relationship_name to triplet ORDER BY for deterministic pagination
@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (2)
cognee/infrastructure/databases/graph/postgres/adapter.py (2)

40-44: ⚠️ Potential issue | 🟠 Major

Add one-time lazy schema initialization inside _session().

Calls like add_nodes()/add_edges() can run before initialize() and fail on a fresh DB (relation does not exist). Guard _session() with a one-time async init lock.

Suggested fix
+import asyncio
 ...
 class PostgresAdapter(GraphDBInterface):
 ...
     def __init__(self, connection_string: str) -> None:
         """Create engine and sessionmaker from a Postgres connection string."""
         self.db_uri = connection_string
         self.engine = create_async_engine(self.db_uri)
         self.sessionmaker = async_sessionmaker(bind=self.engine, expire_on_commit=False)
+        self._initialized = False
+        self._init_lock = asyncio.Lock()
 ...
@asynccontextmanager
     async def _session(self) -> AsyncIterator[Any]:
         """Yield an async session from the underlying engine."""
+        if not self._initialized:
+            async with self._init_lock:
+                if not self._initialized:
+                    await self.initialize()
+                    self._initialized = True
         async with self.sessionmaker() as session:
             yield session
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 40 -
44, The _session() async context manager must ensure the DB schema is
initialized once before any user calls like add_nodes()/add_edges() run; add an
instance-level async one-time init guard: create an _init_lock (asyncio.Lock)
and an _initialized boolean on the adapter, and at the start of _session()
acquire the lock, check _initialized, if False call initialize() (await it), set
_initialized=True, then release the lock (or use async with); finally proceed to
yield the session as before—this guarantees initialize() is called exactly once
and prevents race conditions when multiple coroutines open sessions
concurrently.

95-100: ⚠️ Potential issue | 🔴 Critical

Partial add_node() upserts can still erase existing node fields.

add_node("id") currently pushes empty name/type and replaces properties on conflict, which can wipe existing metadata instead of patching it.

Suggested fix
         rows = []
         for node in nodes:
             if isinstance(node, tuple):
                 props = {**(node[1] or {}), "id": node[0]}
 ...
             rows.append(
                 {
                     "id": str(props.get("id", "")),
-                    "name": str(props.get("name", "")),
-                    "type": str(props.get("type", "")),
+                    "name": (
+                        str(props["name"])
+                        if props.get("name") is not None and "name" in props
+                        else None
+                    ),
+                    "type": (
+                        str(props["type"])
+                        if props.get("type") is not None and "type" in props
+                        else None
+                    ),
                     "properties": json.loads(json.dumps(extra, cls=JSONEncoder)),
                     "created_at": now,
                     "updated_at": now,
                 }
             )
 ...
         stmt = stmt.on_conflict_do_update(
             index_elements=["id"],
             set_={
-                "name": stmt.excluded.name,
-                "type": stmt.excluded.type,
-                "properties": stmt.excluded.properties,
+                "name": func.coalesce(stmt.excluded.name, _node_table.c.name),
+                "type": func.coalesce(stmt.excluded.type, _node_table.c.type),
+                "properties": func.coalesce(_node_table.c.properties, text("'{}'::jsonb")).op("||")(
+                    func.coalesce(stmt.excluded.properties, text("'{}'::jsonb"))
+                ),
                 "updated_at": func.now(),
             },
         )

Based on learnings: In cognee's graph database adapters (Neo4j, Kuzu, Memgraph, Neptune), add_node() uses MERGE semantics (create-or-update without blanking existing fields).

Also applies to: 127-146

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 95 -
100, The add_node logic in adapter.py (functions/methods add_node and add_nodes)
currently inserts a node with empty/default name/type or replaces the entire
properties map on conflict, which can overwrite existing metadata; change the
upsert semantics to perform a merge/patch instead of full replace: when a node
is provided as a string or with partial properties, ensure you preserve existing
node fields by merging incoming properties into existing properties (e.g., fetch
and update only provided keys or use DB upsert semantics that merge maps) rather
than replacing the properties column; apply the same merge behavior to the other
add_node/add_nodes block mentioned (the similar code around the other section
lines 127-146).
🧹 Nitpick comments (1)
cognee/infrastructure/databases/graph/postgres/adapter.py (1)

568-580: Validate node_name_filter_operator explicitly before branching.

Any value other than "OR" currently falls through to AND behavior. Raise ValueError for unsupported operators to keep API behavior explicit.

Suggested fix
+        if node_name_filter_operator not in {"OR", "AND"}:
+            raise ValueError(
+                f"Unsupported node_name_filter_operator: {node_name_filter_operator!r}. "
+                "Expected 'OR' or 'AND'."
+            )
         if node_name_filter_operator == "OR":
             neighbor_cte = """

As per coding guidelines: "Prefer explicit, structured error handling in Python code."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 568 -
580, Validate the node_name_filter_operator explicitly before the branching that
builds neighbor_cte: check if node_name_filter_operator is either "OR" or "AND"
(or other supported tokens) and if not raise a ValueError with a clear message;
then proceed with the existing conditional that assigns neighbor_cte for the
"OR" branch and the other branch for "AND". Ensure this validation occurs near
the existing code that references node_name_filter_operator so the error is
raised before any SQL string is constructed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eadb0e74-6eae-4a5a-a3d8-297b1e2c1e55

📥 Commits

Reviewing files that changed from the base of the PR and between 6b2af3f and 1941db7.

📒 Files selected for processing (1)
  • cognee/infrastructure/databases/graph/postgres/adapter.py

@DanielNScott

@lxobr @dexters1

New branch which won't auto-fail checks by being PR'd from outside Cognee. Sorry Igor for the hassle.

I think everything looks good. The failing git-guardian is a fake / dummy credential, the not-all-tests-passed failure is... exactly that, and the two actual failures appear to be totally unrelated to the changes here.

@dexters1
Copy link
Copy Markdown
Collaborator


Can I update this branch with changes from dev? Would be great if we can fix the unrelated failures (@siillee and me) so the full CI/CD can run normally before merge (Some tests don't start until previous tests pass)

@DanielNScott
Copy link
Copy Markdown
Contributor Author


Yeah, certainly. Looks like Lazar has tried to do so. If these fail despite the dev pulls, and you want me to just fix whatever the issue is, I can do that, too. The only reason I didn't do it yesterday was to avoid potential conflicts with other branches doing the fix.

Copy link
Copy Markdown
Collaborator

@lxobr lxobr left a comment


Looks good, thanks for bringing this one over the line

@lxobr lxobr merged commit da9db6c into dev Apr 10, 2026
108 of 111 checks passed
@lxobr lxobr deleted the feature/cog-4463-graph-aware-embeddings-implement-minimal-graphdb-in-postgres branch April 10, 2026 14:06
Vasilije1990 added a commit that referenced this pull request Apr 11, 2026
…h db and (#2587)

## Description

Adds a PostgreSQL-based hybrid adapter that performs graph and vector
operations against a single Postgres database, enabling combined queries
(JOINs across graph_node/graph_edge and vector collection tables) and
single-transaction writes.

This PR builds on PR #2584 (postgres graph adapter) and integrates it
with the existing PGVector adapter via a new `PostgresHybridAdapter` in
`cognee/infrastructure/databases/hybrid/postgres/`.

Key design decisions:
- Composition of PGVector and Postgres adapters in the hybrid adapter
   - holds a PostgresAdapter and a PGVectorAdapter
   - delegates standard operations to each
   - includes hybrid logic for joint operations
- Combined write methods for single-transaction graph+vector inserts
   - `add_nodes_with_vectors` and `add_edges_with_vectors`
- Combined search method
   - `search_graph_with_distances` JOINs graph and vector tables via pgvector cosine distance
- Unified engine integration
   - `USE_UNIFIED_PROVIDER=pghybrid` env var activates hybrid mode
   - `get_unified_engine()` assembles the hybrid adapter from graph and vector sub-adapters
   - `add_data_points` routes to hybrid writes when capability is available
- Graph adapter creates its own engine from the relational connection string
   - matches the decoupled constructor convention from PR #2584
   - hybrid adapter uses `self._graph._session()` for combined writes
- Cypher, natural language, and Graphiti retrievers rejected by both PostgresAdapter and PostgresHybridAdapter
- CI updated with hybrid adapter tests and hybrid E2E test steps
   - uses `pgvector/pgvector:pg17` image for pgvector extension support
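
The composition pattern described above can be sketched roughly as follows; the stub classes and method bodies are illustrative stand-ins, not Cognee's actual interfaces, and the real adapter performs the combined write inside one transaction via `self._graph._session()`.

```python
class GraphStub:
    """Stand-in for the graph adapter."""
    def __init__(self):
        self.nodes = []
    def add_node(self, node):
        self.nodes.append(node)

class VectorStub:
    """Stand-in for the vector adapter."""
    def __init__(self):
        self.points = []
    def index(self, node_id, embedding):
        self.points.append((node_id, embedding))

class PostgresHybridAdapter:
    """Holds a graph adapter and a vector adapter, delegating standard
    operations to each and combining them for hybrid writes."""
    def __init__(self, graph, vector):
        self._graph = graph
        self._vector = vector

    # Standard operations delegate straight through to the owning adapter.
    def add_node(self, node):
        self._graph.add_node(node)

    # Hybrid write: graph insert plus vector index together (the real
    # adapter wraps this in a single Postgres transaction).
    def add_nodes_with_vectors(self, nodes_with_embeddings):
        for node, embedding in nodes_with_embeddings:
            self._graph.add_node(node)
            self._vector.index(node["id"], embedding)

hybrid = PostgresHybridAdapter(GraphStub(), VectorStub())
hybrid.add_nodes_with_vectors([({"id": "n1"}, [0.1, 0.2])])
```

Because both stores live in the same Postgres database, the delegation boundary is purely organizational: joint queries and single-transaction writes stay possible.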

Also included from the PR #2584 merge:
- All adapter fixes verified present after merge
- `get_neighborhood` with recursive CTE
- `node_name_filter_operator` OR/AND support in `get_nodeset_subgraph`
- Batch dedup, node tuple shape consistency, triplet ordering
- `func.now()` on upsert timestamps
- Config save/restore in E2E tests

That is, this PR should include all content from the previous merge tweaks,
along with the independent tweaks from our previous version of this work,
#2506.
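
The recursive-CTE `get_neighborhood` mentioned above can be illustrated in miniature. This sketch uses SQLite (which also supports `WITH RECURSIVE`) over a simplified `graph_edge` table, so it only approximates the shape of the Postgres query: a single statement walks outward from a start node, with `UNION ALL` plus a depth counter bounding the recursion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE graph_edge (source_id TEXT, target_id TEXT, relationship_name TEXT)"
)
conn.executemany(
    "INSERT INTO graph_edge VALUES (?, ?, ?)",
    [("a", "b", "links"), ("b", "c", "links"), ("c", "d", "links"), ("x", "y", "links")],
)

# One round trip: seed with the start node at depth 0, then repeatedly
# join edges onto the frontier until the depth limit is reached.
rows = conn.execute(
    """
    WITH RECURSIVE neighborhood(node_id, depth) AS (
        SELECT :start, 0
        UNION ALL
        SELECT e.target_id, n.depth + 1
        FROM graph_edge e
        JOIN neighborhood n ON e.source_id = n.node_id
        WHERE n.depth < :max_depth
    )
    SELECT DISTINCT node_id FROM neighborhood
    """,
    {"start": "a", "max_depth": 2},
).fetchall()

reachable = sorted(r[0] for r in rows)
```

This is the trade-off noted in the design notes: fewer round trips than a three-query pattern, at the cost of a query that is harder to debug and whose node and edge fetches cannot be optimized independently.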

## Acceptance Criteria

- [x] `PostgresHybridAdapter` implements both `GraphDBInterface` and `VectorDBInterface`
- [x] Combined write methods perform graph+vector inserts in a single transaction
- [x] Combined search JOINs graph and vector tables
- [x] `USE_UNIFIED_PROVIDER=pghybrid` activates hybrid mode through the unified engine
- [x] `add_data_points` routes to hybrid writes when the capability is present
- [x] 26 postgres adapter unit tests pass
- [x] Retriever gates cover both PostgresAdapter and PostgresHybridAdapter
- [x] All 123 fixes from PR #2584 verified present after merge
- [x] CI passes on upstream

## Type of Change
- [ ] Bug fix (non-breaking change that fixes an issue)
- [x] New feature (non-breaking change that adds functionality)
- [ ] Code refactoring
- [ ] Other (please specify):

## Screenshots
26 passed, 8 warnings in 7.26s (will update with full passage shortly,
while CI/CD is running).

## Pre-submission Checklist
- [x] **I have tested my changes thoroughly before submitting this PR** (See `CONTRIBUTING.md`)
- [x] **This PR contains minimal changes necessary to address the issue/feature**
- [x] My code follows the project's coding standards and style guidelines
- [x] I have added tests that prove my fix is effective or that my feature works
- [ ] I have added necessary documentation (if applicable)
- [x] All new and existing tests pass
- [x] I have searched existing PRs to ensure this change hasn't been submitted already
- [x] I have linked any relevant issues in the description
- [x] My commits have clear and descriptive messages

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * PostgreSQL hybrid adapter: combined graph+vector read/write, unified "pghybrid" provider, and graph search with distances.
  * Graph interface supports batch triplet retrieval.

* **Tests**
  * New E2E and unit tests validating the hybrid adapter and hybrid write flows.

* **Bug Fixes**
  * More robust edge formatting and stricter backend guards for Cypher retrievers.
  * Improved vector metadata caching and reflection.

* **Chores**
  * CI updated to run pgvector/Postgres 17 tests, DCO workflow adjusted for trusted authors, version bumped to 0.5.8.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
