Conversation
Problems:
- Need to migrate basic graph functionality to a relational DB

Solutions:
- New adapter implementing GraphDBInterface
- Selection branch included in get_graph_engine.py

Decisions:
- Re-use SQLAlchemy setup and configuration
- Remove .query() (raw Cypher) functionality
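The factory's selection branch can be sketched roughly as follows. This is a minimal illustration only; the function and adapter names are stand-ins, and the real get_graph_engine.py is more involved:

```python
# Hypothetical sketch of a provider-selection branch in a graph engine
# factory; names are illustrative, not cognee's actual API.
def select_adapter_class(provider: str) -> str:
    adapters = {
        "neo4j": "Neo4jAdapter",
        "kuzu": "KuzuAdapter",
        "postgres": "PostgresAdapter",  # new relational-backed adapter
    }
    try:
        return adapters[provider]
    except KeyError:
        raise ValueError(f"Unsupported graph database provider: {provider}") from None

print(select_adapter_class("postgres"))  # -> PostgresAdapter
```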
Problems:
- Cypher search retriever exposes a raw Cypher interface via .query()
- Natural language retriever writes and uses Cypher via .query()
- Temporal awareness (Graphiti) subsystem directly uses Cypher and Neo4j

Solutions:
- Raise incompatibility errors proactively across these consumers

Decisions:
- None of these are core functionality, so cost should be low
- Graphiti already hard-codes multiple configs without guards...
- Complex Cypher search cannot be replaced in Postgres at present
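The proactive-guard pattern amounts to a backend check at the top of each Cypher-dependent consumer. A minimal sketch, assuming a hypothetical exception name and a simple provider string (the real guards in cognee differ):

```python
# Sketch of the proactive incompatibility guard described above; the
# exception class and provider check are illustrative assumptions.
class CypherNotSupportedError(RuntimeError):
    """Raised when a Cypher-only feature is used with a non-Cypher backend."""

def ensure_cypher_backend(provider: str) -> None:
    # Fail fast before any query is attempted, rather than surfacing an
    # obscure error from deep inside the adapter.
    if provider != "neo4j":
        raise CypherNotSupportedError(
            f"Raw Cypher via .query() is not supported on provider '{provider}'"
        )

ensure_cypher_backend("neo4j")  # passes silently
try:
    ensure_cypher_backend("postgres")
except CypherNotSupportedError as err:
    print(err)  # -> Raw Cypher via .query() is not supported on provider 'postgres'
```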
Problems:
- asyncpg misparses the ::jsonb cast as a named parameter
- prune_system drops graph tables via delete_database()
- cached adapter instance has no tables after prune
- no CI coverage for the Postgres graph backend
- no unit or e2e tests for the Postgres adapter

Solutions:
- replace ::jsonb with CAST(:param AS jsonb) throughout
- add _ensure_initialized() calling idempotent CREATE IF NOT EXISTS
- guard is_empty and delete_graph with _ensure_initialized
- add run-postgres-tests job to the graph_db_tests.yml CI workflow
- add 26 unit tests against real Postgres
- add shared e2e test and Postgres wrapper

Decisions:
- _ensure_initialized over defensive checks on every method
- real Postgres in tests, not SQLite (dialect differences block it)
- shared test function for future backend reuse (test_graphdb_shared)
- CI uses a Postgres 16 service container, matching the Kuzu/Neo4j pattern
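The ::jsonb fix is a mechanical rewrite: SQLAlchemy's named-parameter syntax (`:param`) followed by PostgreSQL's shorthand cast (`::jsonb`) confuses asyncpg's parameter parsing, while the functionally identical `CAST(:param AS jsonb)` does not. A small sketch of the transformation (the helper function is hypothetical; the actual fix was applied by hand):

```python
import re

def replace_jsonb_casts(sql: str) -> str:
    """Rewrite ':param::jsonb' casts, which asyncpg can misparse as named
    parameters, into the equivalent 'CAST(:param AS jsonb)' form.
    Illustrative helper only; the real change was edited in place."""
    return re.sub(r":(\w+)::jsonb", r"CAST(:\1 AS jsonb)", sql)

before = "INSERT INTO graph_node (id, properties) VALUES (:id, :props::jsonb)"
print(replace_jsonb_casts(before))
# -> INSERT INTO graph_node (id, properties) VALUES (:id, CAST(:props AS jsonb))
```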
Problems:
- some whitespace and line-wrapping non-compliance

Solutions:
- fix via ruff
Problems:
- inadequate batching: add_nodes loops N awaits, one INSERT round trip each
- add_edges has the same per-row loop pattern
- has_edges loops N SELECT EXISTS queries
- raw text() SQL construction instead of SQLAlchemy

Solutions:
- multi-row insert instead of looping
- add_nodes uses pg_insert with a multi-row VALUES batch
- add_edges uses pg_insert with a multi-row VALUES batch
- has_edges uses SQLAlchemy values() with an EXISTS join
- added SQLAlchemy Core table definitions for DML
- will remove next commit to maintain a single source of truth

Decisions:
- maximum batching optimization
- multi-row INSERT over executemany for single-statement execution
- executemany pipelines but still incurs N server-side executions
- looped awaits are worst, executemany is middle, VALUES batching is best
- chose SQLAlchemy Core over raw text() for proactive safety
- text() with bind params was safe but not project convention
- SQLAlchemy eliminates all raw SQL string construction
- keep the problematic duplication of DDL logic for commit specificity
- will fix in a follow-up
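The three batching tiers above (looped single-row inserts, executemany, one multi-row VALUES statement) can be shown side by side. This sketch uses stdlib sqlite3 purely to keep the example self-contained; the adapter itself targets Postgres via SQLAlchemy's pg_insert, where the round-trip cost of the looped form is far higher over a network:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id TEXT PRIMARY KEY, name TEXT)")

rows = [("n1", "alpha"), ("n2", "beta"), ("n3", "gamma")]

# Worst: looped single-row inserts, N statements and N round trips.
for r in rows:
    conn.execute("INSERT INTO node VALUES (?, ?)", r)
conn.execute("DELETE FROM node")

# Middle: executemany pipelines the batch but still runs N server-side executions.
conn.executemany("INSERT INTO node VALUES (?, ?)", rows)
conn.execute("DELETE FROM node")

# Best: one multi-row VALUES statement carrying the whole batch.
placeholders = ", ".join(["(?, ?)"] * len(rows))
flat = [value for row in rows for value in row]
conn.execute(f"INSERT INTO node VALUES {placeholders}", flat)

print(conn.execute("SELECT COUNT(*) FROM node").fetchone()[0])  # -> 3
```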
Problems:
- raw DDL strings and Core table objects duplicated the schema

Solutions:
- replaced the _SCHEMA_DDL list with Core Table and Index definitions
- initialize() now uses _meta.create_all() instead of raw DDL
- removed unused asyncio import
- replaced PostgreSQL TIMESTAMP with generic DateTime(timezone=True)
Note:
- All problems below were cosmetic or stylistic, not substantive.

Problems:
- SQL injection via unvalidated attribute names in get_filtered_graph_data
- empty attribute_filters can produce invalid SQL syntax
- local "values" var shadows the sqlalchemy values import in add_edges
- JSON serialize-then-deserialize indirection in add_nodes
- get_nodeset_subgraph issues four sequential DB queries
- get_filtered_graph_data issued two sequential DB queries
- public methods missing docstrings, unlike the Neo4j and Kuzu adapters
- section banners inconsistent with other adapters
- narrating inline comments restate what the code already says
- python-version input missing from the graph_db_tests workflow
- SearchTypeNotSupported imported inline instead of at module level

Solutions:
- add an attribute whitelist, raise ValueError for invalid filter keys
- fall back to get_graph_data when attribute_filters is empty
- rename the local values to rows in add_edges
- inline property dict construction, remove _split_core_and_extra
- combine get_nodeset_subgraph into a single CTE query
- combine get_filtered_graph_data into a single CTE query
- add Parameters/Returns docstrings to all public methods
- remove section banners throughout the adapter
- trim narrating comments, keep complex clarifying ones
- add python-version input with default 3.11 to workflow_call
- move SearchTypeNotSupported to a module-level import

Reasoning and Related Notes:
- prune_system metadata=True drops the relational DB, including graph tables
- _ensure_initialized kept defensive as a result
- JSON round-trip in add_nodes retained for UUID and datetime serialization
- specifically, it enforces JSONB compatibility (Postgres wants a Python dict)
- the alternative is in-place modification
- adding rules for doing this would replicate JSONEncoder
- JSONEncoder outputs strings, so we need to convert back if used
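The whitelist fix for the filter-key injection is simple: attribute names cannot be bound as SQL parameters, so any name interpolated into the query text must first be validated against a fixed set. A minimal sketch under assumed names (the helper and clause shape are illustrative, not the adapter's exact code):

```python
_ALLOWED_FILTER_ATTRS = {"id", "name", "type"}

def build_filter_clause(attribute_filters: dict) -> str:
    """Validate attribute names against a whitelist before interpolating
    them into SQL text; values themselves remain bound parameters.
    Sketch of the approach described above, not the adapter's real code."""
    clauses = []
    for attr in attribute_filters:
        if attr not in _ALLOWED_FILTER_ATTRS:
            raise ValueError(f"Unsupported filter attribute: {attr}")
        clauses.append(f"{attr} = ANY(:{attr}_values)")
    return " AND ".join(clauses)

print(build_filter_clause({"type": ["Entity"]}))  # -> type = ANY(:type_values)
```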
Problems:
- tuple merge order lets the properties dict override the authoritative node ID
- __init__ and _session missing type annotations
- magic number 6 in test_graphdb_shared unexplained
- raised but incorrect: add weighted-edges test

Solutions:
- swap merge order so the tuple ID always wins
- add type annotations to __init__ and _session
- add a comment explaining 3 searches x 2 entries = 6
- no duplication of the weighted-edges test necessary or appropriate

Reasoning and Related Notes:
- the weighted-edges test is backend-agnostic and exercises get_graph_from_model only
- running it under the Postgres job would not test the adapter round trip
- no backend actually has a round-trip weighted-edges test
- therefore the Kuzu and Neo4j cases are incorrect duplicates
- recursive CTE performance in get_graph_metrics is not part of the pipeline
- Kuzu returns -1 for components; Neo4j uses the GDS plugin
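The merge-order bug is a plain consequence of Python dict-merge semantics: the right-hand operand wins on key collisions. A two-line illustration of why the authoritative tuple ID must be merged last (variable names here are illustrative):

```python
node_id = "n1"
properties = {"id": "stale-id", "name": "alpha"}

# Wrong order: the properties dict overrides the authoritative tuple ID.
bad = {"id": node_id, **properties}

# Fixed order: the tuple ID is merged last, so it always wins.
good = {**properties, "id": node_id}

print(bad["id"], good["id"])  # -> stale-id n1
```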
…ment-minimal-graphdb-in-postgres
Major problems:
- adapter assumed the relational layer is Postgres and shared
- users could not run the graph DB on a separate Postgres instance
- factory fell back to relational config instead of requiring a graph URL
- sqlite relational + Postgres graph would pass the sqlite engine to the adapter

Minor problems:
- table definitions mixed with adapter logic in one file
- _ensure_initialized wrapper added no behavior over initialize()
- pg_insert and String imports were missing
- tests created a new event loop per test, staling the engine pool

Solutions:
- adapter creates its own engine from a connection string
- factory requires GRAPH_DATABASE_URL, matching other providers
- extract table and index definitions into tables.py
- inline initialize() calls, remove _ensure_initialized
- add missing pg_insert and String imports
- add a session-scoped event_loop fixture for tests
- add GRAPH_DATABASE_URL to the CI Postgres E2E step

Architectural Trade-Off Decisions:
- separate pool over shared pool, matching adapter convention
- hybrid branch can inject a shared engine via the factory as needed

Reasoning and Related Notes:
- addresses items from PR #2506 review (dexters1)
- all other adapters create their own clients in __init__
- pytest-asyncio 0.21 creates a new event loop per test by default
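The "require GRAPH_DATABASE_URL" change is a fail-fast check in the factory rather than a silent fallback to the relational config. A sketch with hypothetical function and error wording (cognee's actual factory code differs):

```python
import os

# Illustrative fail-fast requirement; names are assumptions, not the
# project's actual factory API.
def require_graph_database_url(env=os.environ) -> str:
    url = env.get("GRAPH_DATABASE_URL")
    if not url:
        raise RuntimeError(
            "GRAPH_DATABASE_URL must be set for the postgres graph provider; "
            "falling back to the relational config is not supported"
        )
    return url

print(require_graph_database_url({"GRAPH_DATABASE_URL": "postgresql+asyncpg://user:pw@host/db"}))
# -> postgresql+asyncpg://user:pw@host/db
```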
…al-graphdb-in-postgres' of https://github.com/DanielNScott/cognee into feature/cog-4463-graph-aware-embeddings-implement-minimal-graphdb-in-postgres
…dev merge.

Problems:
- get_neighborhood abstract method missing after dev merge
- get_nodeset_subgraph rejected the node_name_filter_operator kwarg
- E2E test could leak config to subsequent tests when invoked directly via pytest

Solutions:
- implement get_neighborhood with a recursive CTE in a single query
- add node_name_filter_operator param to get_nodeset_subgraph
- capture and restore graph provider and root dirs in a finally block

Reasoning and Related Notes:
- recursive CTE + UNION ALL matches the get_nodeset_subgraph pattern
- other adapters use three queries; a single query avoids round trips
- config restore addresses CodeRabbit review item 5 on PR #2506
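The single-query neighborhood walk relies on a recursive CTE that expands outward from a start node up to a depth limit. The sketch below demonstrates the shape on stdlib SQLite, which also supports `WITH RECURSIVE`; the adapter's Postgres query differs in schema and details:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE edge (source_id TEXT, target_id TEXT);
INSERT INTO edge VALUES ('a','b'), ('b','c'), ('c','d'), ('x','y');
""")

# Recursive CTE walking outward from 'a' up to depth 2, in the spirit of
# the single-query get_neighborhood described above (SQLite stand-in).
rows = conn.execute("""
WITH RECURSIVE hood(id, depth) AS (
    SELECT 'a', 0
    UNION ALL
    SELECT e.target_id, h.depth + 1
    FROM edge e JOIN hood h ON e.source_id = h.id
    WHERE h.depth < 2
)
SELECT DISTINCT id FROM hood ORDER BY id;
""").fetchall()

print([r[0] for r in rows])  # -> ['a', 'b', 'c']
```

Note that 'd' (depth 3) and the disconnected 'x'/'y' pair are excluded, which is exactly the bounded-neighborhood behavior a single round trip can deliver.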
…ment-minimal-graphdb-in-postgres
…ment-minimal-graphdb-in-postgres
Problems:
- Postgres tests in unit/ fail without a live database in CI
- unit tests should mock DBs per project conventions

Solutions:
- moved 3 test files to cognee/tests/e2e/postgres/
- created __init__.py for e2e/ and e2e/postgres/
- updated the import path in test_graphdb_postgres.py
- updated 2 test paths in graph_db_tests.yml

Architectural Trade-Off Decisions:
- chose e2e/ over integration/ to avoid the automatic CI sweep

Reasoning and Related Notes:
- requested by Igor (dexters1) during PR #2506 review
- mirrors an identical move on the sister hybrid-adapter branch
…mbeddings-implement-minimal-graphdb-in-postgres' into feature/cog-4463-graph-aware-embeddings-implement-minimal-graphdb-in-postgres
Hello @DanielNScott, thank you for submitting a PR! We will respond as soon as possible.
Walkthrough

Adds PostgreSQL as a new graph backend: introduces an async PostgresAdapter and schema, integrates the provider into the graph engine factory, updates docs and CI, adds Postgres E2E and adapter tests, and adds runtime guards preventing Cypher/Graphiti workflows on Postgres.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
| GitGuardian id | GitGuardian status | Secret | Commit | Filename | |
|---|---|---|---|---|---|
| 9573981 | Triggered | Generic Password | da6a001 | .env.template | View secret |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secret safely. Learn here the best practices.
- Revoke and rotate this secret.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation.
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Actionable comments posted: 3
🧹 Nitpick comments (2)
cognee/infrastructure/databases/graph/postgres/adapter.py (2)
613-685: Recursive CTE for connected components may be expensive on large graphs.

The connected components computation uses a recursive CTE that can be costly for graphs with many nodes/edges. This is acceptable for moderate sizes but may need optimization (e.g., materialized views, incremental updates) for production workloads.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 613 - 685, get_graph_metrics currently computes connected components with an expensive recursive CTE (the "component" CTE inside get_graph_metrics), which can be unacceptable on large graphs; modify get_graph_metrics to avoid running the recursive CTE by default and instead either (A) skip component computation unless a new flag (e.g., compute_components=True) is passed, (B) only run the CTE when num_nodes is below a safe threshold (use num_nodes from the earlier count), or (C) query a precomputed/materialized table (e.g., graph_component or a cache) if available—implement one of these approaches, update the metrics keys accordingly (sizes_of_connected_components and num_connected_components), and document the new behavior so callers can opt in when they know the dataset is small or choose to use a precomputed source instead of the in-query recursive CTE.
542-611: Consider implementing `node_name_filter_operator` or documenting current behavior.

The parameter `node_name_filter_operator` (default "OR") is accepted but not used in the query, which always uses `AND name = ANY(:names)` (effectively OR semantics). This matches the default, but if AND logic is needed later, the parameter is misleading.

💡 Option: document the current behavior

```diff
 async def get_nodeset_subgraph(
     self, node_type: Type[Any], node_name: List[str], node_name_filter_operator: str = "OR"
 ) -> Tuple[List[Tuple[str, dict]], List[Tuple[str, str, str, dict]]]:
     """Retrieve a subgraph containing matching nodes, their neighbors, and interconnecting edges.

     Parameters:
     -----------
     node_type: The DataPoint subclass whose __name__ is the type label.
     node_name: List of node names to match.
+    node_name_filter_operator: Currently unused; OR semantics are always applied.

     Returns:
     --------
     A tuple of (nodes, edges) for the subgraph.
     """
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 542 - 611, The parameter node_name_filter_operator on get_nodeset_subgraph is accepted but unused; either remove it or implement proper behavior—update get_nodeset_subgraph to build the WHERE clause for primary_nodes dynamically: if node_name_filter_operator == "OR" keep WHERE type = :label AND name = ANY(:names); if "AND" build a conjunctive predicate (e.g., WHERE type = :label AND name = :name_0 AND name = :name_1 ... with individual params for each name) or validate/raise for unsupported values; ensure you adjust the session.execute parameter binding to match the chosen SQL and update callers or docstring accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CLAUDE.md`:
- Around line 286-290: The Postgres example snippet sets
GRAPH_DATABASE_PROVIDER=postgres but omits the required GRAPH_DATABASE_URL;
update the snippet in CLAUDE.md to include GRAPH_DATABASE_URL with an example
connection string (e.g.
GRAPH_DATABASE_URL=postgresql://user:password@host:port/dbname) alongside
GRAPH_DATABASE_PROVIDER=postgres so the Postgres graph factory accepts the
configuration; ensure the example mentions that this URL is required when using
GRAPH_DATABASE_PROVIDER=postgres.
In `@cognee/infrastructure/databases/graph/postgres/tables.py`:
- Around line 15-16: The updated_at column currently uses
server_default=func.now() but lacks onupdate, and the upsert statements in the
Postgres adapter assign stmt.excluded.updated_at (the stale insert timestamp)
instead of the server time; change the Column("updated_at",
DateTime(timezone=True), ...) in tables.py to include onupdate=func.now(), and
in the adapter's node upsert and edge upsert logic replace any use of
stmt.excluded.updated_at with func.now() so that upserts set updated_at to the
server current time.
In `@cognee/tasks/temporal_awareness/index_graphiti_objects.py`:
- Around line 23-31: Replace the negative Postgres-only guard with a positive
Neo4j check: instead of "if isinstance(graph_engine, PostgresAdapter): raise
RuntimeError(...)" ensure you verify that graph_engine is a Neo4j backend (e.g.,
isinstance(graph_engine, Neo4jAdapter) or graph_engine.provider == "neo4j") and
otherwise raise the same unsupported-feature exception class used by the
retrievers (import that exception and use it in place of RuntimeError) so
non-Neo4j backends (Kuzu, Neptune, etc.) get a clear, consistent
unsupported-feature error; keep references to the existing symbols graph_engine
and PostgresAdapter while adding the Neo4j adapter check and the retriever
exception import.
---
Nitpick comments:
In `@cognee/infrastructure/databases/graph/postgres/adapter.py`:
- Around line 613-685: get_graph_metrics currently computes connected components
with an expensive recursive CTE (the "component" CTE inside get_graph_metrics),
which can be unacceptable on large graphs; modify get_graph_metrics to avoid
running the recursive CTE by default and instead either (A) skip component
computation unless a new flag (e.g., compute_components=True) is passed, (B)
only run the CTE when num_nodes is below a safe threshold (use num_nodes from
the earlier count), or (C) query a precomputed/materialized table (e.g.,
graph_component or a cache) if available—implement one of these approaches,
update the metrics keys accordingly (sizes_of_connected_components and
num_connected_components), and document the new behavior so callers can opt in
when they know the dataset is small or choose to use a precomputed source
instead of the in-query recursive CTE.
- Around line 542-611: The parameter node_name_filter_operator on
get_nodeset_subgraph is accepted but unused; either remove it or implement
proper behavior—update get_nodeset_subgraph to build the WHERE clause for
primary_nodes dynamically: if node_name_filter_operator == "OR" keep WHERE type
= :label AND name = ANY(:names); if "AND" build a conjunctive predicate (e.g.,
WHERE type = :label AND name = :name_0 AND name = :name_1 ... with individual
params for each name) or validate/raise for unsupported values; ensure you
adjust the session.execute parameter binding to match the chosen SQL and update
callers or docstring accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e1cf24ef-0c83-4ae5-9e9e-a2cc92985494
📒 Files selected for processing (14)
.github/workflows/graph_db_tests.yml
CLAUDE.md
cognee/infrastructure/databases/graph/get_graph_engine.py
cognee/infrastructure/databases/graph/postgres/__init__.py
cognee/infrastructure/databases/graph/postgres/adapter.py
cognee/infrastructure/databases/graph/postgres/tables.py
cognee/modules/retrieval/cypher_search_retriever.py
cognee/modules/retrieval/natural_language_retriever.py
cognee/tasks/temporal_awareness/index_graphiti_objects.py
cognee/tests/e2e/__init__.py
cognee/tests/e2e/postgres/__init__.py
cognee/tests/e2e/postgres/test_graphdb_postgres.py
cognee/tests/e2e/postgres/test_graphdb_shared.py
cognee/tests/e2e/postgres/test_postgres_adapter.py
…et_subgraph.

Problems:
- node_name_filter_operator param accepted but only OR was implemented
- AND callers would silently get OR behavior

Solutions:
- branch the neighbor_ids CTE on the filter operator, matching Kuzu and Neo4j
- AND uses GROUP BY / HAVING COUNT(DISTINCT) = primary_count
- OR unchanged, uses DISTINCT on any neighbor
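The AND branch can be shown in miniature: a neighbor qualifies only if it is adjacent to every primary node, which `GROUP BY` plus `HAVING COUNT(DISTINCT ...) = primary_count` expresses in one pass. The sketch below uses stdlib SQLite as a stand-in for the Postgres CTE branch described above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE edge (source_id TEXT, target_id TEXT);
-- n1 neighbors both p1 and p2; n2 neighbors only p1.
INSERT INTO edge VALUES ('p1','n1'), ('p2','n1'), ('p1','n2');
""")

# AND semantics: keep neighbors adjacent to every primary node, via
# GROUP BY / HAVING COUNT(DISTINCT source) = number of primaries.
rows = conn.execute("""
SELECT target_id
FROM edge
WHERE source_id IN ('p1', 'p2')
GROUP BY target_id
HAVING COUNT(DISTINCT source_id) = 2;
""").fetchall()

print([r[0] for r in rows])  # -> ['n1']
```

Dropping the HAVING clause (and adding DISTINCT) recovers the OR branch, which keeps any neighbor of any primary node.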
Problems:
- CLAUDE.md Postgres snippet missing required GRAPH_DATABASE_URL
- upserts used the stale insert timestamp instead of server time for updated_at
- Graphiti gate excluded only Postgres instead of requiring Neo4j

Solutions:
- add GRAPH_DATABASE_URL to the CLAUDE.md Postgres config example
- use func.now() in node and edge upsert conflict handlers
- check graph_database_provider != "neo4j" instead of isinstance(PostgresAdapter)
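The updated_at fix swaps the excluded (insert-payload) timestamp for the server clock in the conflict handler, so a re-upserted row gets a fresh timestamp. A stdlib SQLite stand-in for the Postgres `func.now()` conflict handler (SQLite's `excluded` and `datetime('now')` play the analogous roles):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id TEXT PRIMARY KEY, name TEXT, updated_at TEXT)")

# On conflict, take the new name from the insert payload but stamp
# updated_at with the server clock, not the stale payload timestamp.
sql = """
INSERT INTO node (id, name, updated_at) VALUES (?, ?, ?)
ON CONFLICT(id) DO UPDATE SET
    name = excluded.name,
    updated_at = datetime('now')
"""
conn.execute(sql, ("n1", "alpha", "2020-01-01 00:00:00"))
conn.execute(sql, ("n1", "alpha-renamed", "2020-01-01 00:00:00"))

row = conn.execute("SELECT name, updated_at FROM node WHERE id = 'n1'").fetchone()
print(row[0])                           # -> alpha-renamed
print(row[1] != "2020-01-01 00:00:00")  # -> True
```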
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cognee/infrastructure/databases/graph/postgres/adapter.py`:
- Around line 625-632: The node tuples currently include the id twice (as the
first element and again inside the data dict) causing inconsistent shapes; in
the node-building loops inside get_graph_data/get_filtered_graph_data (and the
similar block referenced at lines ~779-781 used by get_neighborhood) remove the
'id' key from the payload before appending the tuple so each node is (id,
props_without_id). Concretely, when creating data = {"id": row[1], ...} or when
merging row[4], ensure you do NOT set or you pop the 'id' key from data so
nodes.append((row[1], data)) yields a payload without an 'id' field.
- Around line 829-830: The pagination ORDER BY in the SQL fragment currently
uses ORDER BY e.source_id, e.target_id which is not unique; update the ordering
to include the edge primary-key column relationship_name so the sort is total
and stable for OFFSET/LIMIT pagination (e.g., ORDER BY e.source_id, e.target_id,
e.relationship_name). Locate the query building code that emits the ORDER BY
clause in the Postgres graph adapter (the SQL near ORDER BY e.source_id,
e.target_id) and modify it to include e.relationship_name (or all edge PK
columns) to guarantee deterministic paginated triplet results.
- Around line 35-44: The adapter must perform one-time schema creation before
any DB operation: add an async lazy-init guard (e.g. private attributes
_initialized: bool and _init_lock: asyncio.Lock) and call initialize() from
inside the asynccontextmanager _session() if _initialized is False, acquiring
_init_lock to ensure only one coroutine runs initialize(), set _initialized =
True after success, and then yield the session; update _session to await
initialize() under the lock so callers of add_nodes/get_nodes/add_edges (via
GraphDBInterface) never hit missing relation errors.
- Around line 95-98: The add_node/add_nodes upsert currently replaces core
fields and properties when a partial payload is provided; change the upsert
logic in methods add_node and add_nodes so omitted core fields (e.g., "name",
"type") preserve existing values (use COALESCE or equivalent to keep existing
column values) and merge properties instead of replacing them (use jsonb
concatenation/merge like existing_properties || new_properties or
jsonb_strip_nulls(jsonb_concat) semantics) so partial calls (e.g.,
add_node("n1") or add_node("n1", {"foo":1})) patch the node rather than blanking
fields; apply the same fix to the other upsert block referenced around lines
124-145.
- Around line 136-145: The upsert fails when the VALUES batch contains duplicate
conflict keys; before building the pg_insert statement in add_nodes (where you
build stmt = pg_insert(_node_table).values(rows) ...) collapse the incoming
nodes list into a dictionary keyed by node["id"] (or the node id field used) and
keep the last occurrence, then build rows from dict.values(); likewise in
add_edges (the block around where edges are inserted, lines ~257-264)
deduplicate by the tuple (source_id, target_id, relationship_name) and keep the
last occurrence before creating the insert rows; this ensures each conflict key
appears only once in the VALUES batch and prevents the "ON CONFLICT DO UPDATE
command cannot affect row a second time" error.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5630ab52-ecda-4842-8138-5a88170a98dd
📒 Files selected for processing (1)
cognee/infrastructure/databases/graph/postgres/adapter.py
if isinstance(node, str):
    props = properties or {}
    props.setdefault("id", node)
    await self.add_nodes([(node, props)])
Partial add_node() upserts currently wipe existing node fields.
Passing a string id or a tuple without name/type writes name="", type="" and replaces properties on conflict. That means calls like add_node("n1") or add_node("n1", {"foo": 1}) can silently blank out previously stored metadata instead of doing a MERGE-style update. Preserve existing core fields when they are omitted, and patch properties instead of replacing them with the partial payload.
Based on learnings: In cognee's graph database adapters (Neo4j, Kuzu, Memgraph, Neptune), the add_node() method performs an upsert operation using MERGE semantics: it creates the node if it doesn't exist, or updates the existing node's properties if it does exist.
Also applies to: 124-145
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 95 -
98, The add_node/add_nodes upsert currently replaces core fields and properties
when a partial payload is provided; change the upsert logic in methods add_node
and add_nodes so omitted core fields (e.g., "name", "type") preserve existing
values (use COALESCE or equivalent to keep existing column values) and merge
properties instead of replacing them (use jsonb concatenation/merge like
existing_properties || new_properties or jsonb_strip_nulls(jsonb_concat)
semantics) so partial calls (e.g., add_node("n1") or add_node("n1", {"foo":1}))
patch the node rather than blanking fields; apply the same fix to the other
upsert block referenced around lines 124-145.
Non issue.
The pipeline always passes full DataPoint objects via add_nodes(). The string/tuple form of add_node() is only used in tests. No production code path produces a partial node that would trigger the wipe.
🧩 Analysis chain
Let me verify this claim by checking the actual callers in the production code path.
🏁 Script executed:
#!/bin/bash
# Find all callers of add_node and add_nodes outside of test files
rg -n 'add_node\b|add_nodes\b' --type=py -g '!**/test*' -g '!**/tests*' -A2 -B2
Length of output: 15801
@DanielNScott, the search results surface a production-code caller that contradicts that claim. cognee/modules/data/methods/add_model_class_to_graph.py calls add_node with a raw string and a partial property dict (no name) in three places:
# Line 19
await graph.add_node(model_name, dict(type="model"))
# Line 51
await graph.add_node(str(item_type), dict(type="value"))
# Line 63
await graph.add_node(str(field_type), dict(type="value"))

These are not test-only calls — they are production module paths. Under the current upsert logic, any subsequent call to one of these nodes would write name="", overwriting whatever name was stored previously. For the other adapters (Neo4j, Kuzu, Neptune) this is safe because their add_node only accepts DataPoint; PostgresAdapter is the one that extends the signature to also accept str.
If add_model_class_to_graph is never invoked when GRAPH_DATABASE_PROVIDER=postgres, you're right that it's not a runtime risk today — but the interface allows it and the code path exists. At minimum, the add_nodes upsert could use COALESCE to preserve existing name/type on conflict, which would make partial callers safe at no additional cost:
stmt = pg_insert(_node_table).values(rows)
stmt = stmt.on_conflict_do_update(
index_elements=["id"],
set_={
"name": func.coalesce(stmt.excluded.name, _node_table.c.name),
"type": func.coalesce(stmt.excluded.type, _node_table.c.type),
"properties": stmt.excluded.properties,
"updated_at": func.now(),
},
)Happy to defer to your judgment if you're confident add_model_class_to_graph won't be routed to the Postgres adapter, but wanted to flag the evidence so the decision is made with full visibility.
🧠 Learnings used
Learnt from: Geoff-Robin
Repo: topoteretes/cognee PR: 1501
File: cognee/tasks/web_scraper/web_scraper_task.py:157-181
Timestamp: 2025-10-08T09:47:12.913Z
Learning: In cognee's graph database adapters (Neo4j, Kuzu, Memgraph, Neptune), the `add_node()` method performs an upsert operation using MERGE semantics: it creates the node if it doesn't exist, or updates the existing node's properties if it does exist. This allows `add_node()` to be used for both creation and updates.
Learnt from: DanielNScott
Repo: topoteretes/cognee PR: 2584
File: cognee/infrastructure/databases/graph/postgres/adapter.py:35-44
Timestamp: 2026-04-10T03:13:45.843Z
Learning: In `topoteretes/cognee`, the factory function `get_graph_engine()` in `cognee/infrastructure/databases/graph/get_graph_engine.py` always calls `await graph_client.initialize()` (via a `hasattr` guard) before returning the adapter. This means all graph adapters, including `PostgresAdapter`, have their schema/tables created before any caller can invoke graph operations. Concerns about missing schema at runtime (e.g., "relation does not exist") do not apply to code paths that go through the factory.
Learnt from: CR
Repo: topoteretes/cognee PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-27T06:30:27.190Z
Learning: Implement database adapters by extending GraphDBInterface or VectorDBInterface
Learnt from: hajdul88
Repo: topoteretes/cognee PR: 1990
File: cognee/modules/search/methods/search.py:124-127
Timestamp: 2026-01-13T12:41:17.082Z
Learning: In Python 3.6+ (PEP 530), you can use await inside asynchronous comprehensions (e.g., [await func(x) for x in items]) within async def functions. This is valid and executes sequentially. If parallel execution is desired, collect coroutines and use asyncio.gather, since awaiting inside a comprehension runs each iteration in sequence unless you explicitly parallelize.
♻️ Duplicate comments (6)
cognee/infrastructure/databases/graph/postgres/adapter.py (6)
829-830: ⚠️ Potential issue | 🟡 Minor

ORDER BY is not unique; paginated results may skip or duplicate rows.

The edge primary key is `(source_id, target_id, relationship_name)`, but the ORDER BY omits `relationship_name`. If two edges share the same endpoints, OFFSET/LIMIT pagination is non-deterministic. Add `relationship_name` to the ordering.

Suggested fix

```diff
- ORDER BY e.source_id, e.target_id
+ ORDER BY e.source_id, e.target_id, e.relationship_name
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 829 - 830, The SQL ORDER BY used when fetching edges is not deterministic because it orders only by e.source_id, e.target_id but the edge primary key includes relationship_name; update the ORDER BY in the query that ends with "OFFSET :off LIMIT :lim" (in postgres adapter code handling edge pagination) to include e.relationship_name after e.target_id so pagination is stable and rows are neither skipped nor duplicated.
625-632: ⚠️ Potential issue | 🟡 Minor

Node tuple shape inconsistent with `get_graph_data()`.

Here the payload includes `"id": row[1]` (line 629), but `get_graph_data()` returns `(id, data)` where `data` excludes the `id` key (line 447). Since the `get_nodeset_subgraph` documentation says it returns the same format, remove `id` from `data` to match.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 625 - 632, The node dict created in get_nodeset_subgraph (the code building nodes in adapter.py that currently sets data = {"id": row[1], "name": row[2], "type": row[3]}) must not include the "id" key to match get_graph_data()'s (id, data) shape; change the data construction in that loop (the block that builds data for nodes before nodes.append((row[1], data))) to omit "id" and only include "name" and "type" (and the merged metadata), ensuring nodes remain tuples of (id, data) where data does not contain id.
777-781: ⚠️ Potential issue | 🟡 Minor

Node payload includes `id` via `_parse_node_row`, inconsistent with `get_graph_data()`.

`_parse_node_row()` includes `id` in the returned dict (line 52), so nodes here have shape `(id, {"id": ..., "name": ..., ...})`. This differs from `get_graph_data()`, which returns `(id, {"name": ..., "type": ...})`. Either adjust `_parse_node_row()` for this context or build the payload inline without `id`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 777 - 781, The nodes list currently stores tuples like (id, parsed_dict) but parsed_dict from _parse_node_row(...) already contains an "id" key, making the payload shape inconsistent with get_graph_data(); update the loop in adapter.py where nodes are appended so that for each row you call parsed = self._parse_node_row(row) and then remove or exclude parsed["id"] before appending (e.g., del parsed["id"] or build a new dict with only the fields expected by get_graph_data()), ensuring nodes become (id, {"name":..., "type":...}) and leaving _parse_node_row unchanged.
35-44: ⚠️ Potential issue | 🟠 Major

Schema must be created before any operation, not just `is_empty()`/`delete_graph()`.

`initialize()` is only called from `is_empty()` and `delete_graph()`. If a caller invokes `add_nodes()`, `get_nodes()`, or `add_edges()` on a fresh database before either of those methods, the query will fail with `relation "graph_node" does not exist`. Add a lazy-init guard in `_session()` to ensure the schema exists before any database operation.

Suggested lazy-init pattern:

```diff
+import asyncio
+
 class PostgresAdapter(GraphDBInterface):
     """Graph-as-tables adapter backed by Postgres, accessed via SQLAlchemy async sessions."""

     _ALLOWED_FILTER_ATTRS = {"id", "name", "type"}

     def __init__(self, connection_string: str) -> None:
         """Create engine and sessionmaker from a Postgres connection string."""
         self.db_uri = connection_string
         self.engine = create_async_engine(self.db_uri)
         self.sessionmaker = async_sessionmaker(bind=self.engine, expire_on_commit=False)
+        self._initialized = False
+        self._init_lock = asyncio.Lock()

     # ...

     @asynccontextmanager
     async def _session(self) -> AsyncIterator[Any]:
         """Yield an async session from the underlying engine."""
+        if not self._initialized:
+            async with self._init_lock:
+                if not self._initialized:
+                    await self.initialize()
+                    self._initialized = True
         async with self.sessionmaker() as session:
             yield session
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 35 - 44, The schema creation must be enforced before any DB operation: add a lazy-init guard in the AsyncContextManager _session() so it ensures initialize() has run before yielding a session; implement a boolean flag (e.g., self._initialized) plus an asyncio.Lock (e.g., self._init_lock) on the adapter and in _session() check if not self._initialized then acquire self._init_lock and inside the lock re-check the flag and call await self.initialize(); set self._initialized = True after initialize() completes, then proceed to open the session as now; this ensures calls to add_nodes(), get_nodes(), add_edges() (and any other DB ops) on a fresh DB will create tables via _meta.create_all before queries run and avoids race conditions.
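The guard in the suggested diff can be exercised standalone. The sketch below (a hypothetical `LazyInitMixin` with a simulated `initialize()` and an `init_calls` counter, none of which are Cognee code) shows that the double-checked lock collapses concurrent first calls into a single initialization:

```python
import asyncio


class LazyInitMixin:
    """One-time async initialization guard: initialize() runs exactly once,
    even when many coroutines race to open the first session."""

    def __init__(self) -> None:
        self._initialized = False
        self._init_lock = asyncio.Lock()
        self.init_calls = 0  # for demonstration only

    async def initialize(self) -> None:
        await asyncio.sleep(0)  # stand-in for the schema-creation round trip
        self.init_calls += 1

    async def _ensure_initialized(self) -> None:
        if not self._initialized:  # fast path, no lock once initialized
            async with self._init_lock:
                if not self._initialized:  # re-check under the lock
                    await self.initialize()
                    self._initialized = True


async def main() -> int:
    adapter = LazyInitMixin()
    # Twenty concurrent "sessions" all hit the guard at once.
    await asyncio.gather(*(adapter._ensure_initialized() for _ in range(20)))
    return adapter.init_calls


print(asyncio.run(main()))  # exactly one initialize() despite 20 racers
```

The fast-path check before taking the lock matters: after the first call, every later session skips the lock acquisition entirely.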
257-268: ⚠️ Potential issue | 🟠 Major

Same batch upsert issues apply to edges.

If `edges` contains duplicate `(source_id, target_id, relationship_name)` tuples, the INSERT will fail. Deduplicate by the composite key before building the VALUES list.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 257 - 268, The upsert can fail if the input edges list contains duplicate (source_id, target_id, relationship_name) tuples; before building the VALUES for pg_insert(_edge_table) (the rows variable used to create stmt), deduplicate edges by the composite key (source_id, target_id, relationship_name) and collapse duplicates into a single row (e.g., keep the last/merged properties and updated_at) so that rows passed to stmt contain unique composite keys; modify the code that builds rows from edges to use a dict or set keyed by (source_id, target_id, relationship_name) and then use its values when creating stmt and executing the session.commit.
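The dedup-before-INSERT step described above can be sketched in plain Python with a dict keyed on the composite key, last write wins. The row dict shape (`source_id`, `target_id`, `relationship_name`, `properties` keys) is an assumption for illustration:

```python
def dedup_edges(edges):
    """Collapse duplicate (source_id, target_id, relationship_name) rows,
    keeping the last occurrence so its properties win."""
    unique = {}
    for edge in edges:
        key = (edge["source_id"], edge["target_id"], edge["relationship_name"])
        unique[key] = edge  # later duplicates overwrite earlier ones
    return list(unique.values())


rows = [
    {"source_id": "a", "target_id": "b", "relationship_name": "knows", "properties": {"w": 1}},
    {"source_id": "a", "target_id": "c", "relationship_name": "knows", "properties": {}},
    {"source_id": "a", "target_id": "b", "relationship_name": "knows", "properties": {"w": 2}},
]
deduped = dedup_edges(rows)
# The (a, b, knows) duplicate collapsed; the later properties survive.
print(len(deduped), deduped[0]["properties"])
```

Because Python dicts preserve insertion order, deduplication also keeps the original relative order of first appearances, which keeps the generated VALUES list deterministic.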
136-149: ⚠️ Potential issue | 🟠 Major

Batch upsert issues: (1) duplicate keys in VALUES fail, (2) partial payloads overwrite existing fields.

Two unresolved concerns from prior review:

- Duplicate keys: If `nodes` contains the same `id` twice, PostgreSQL raises `ON CONFLICT DO UPDATE command cannot affect row a second time`. Deduplicate by `id` before building the VALUES list.
- Partial overwrites: When a node tuple omits `name` or `type`, the upsert writes empty strings and replaces `properties`. This differs from the MERGE semantics used by the Neo4j/Kuzu adapters. Consider using `COALESCE` to preserve existing values when the new value is empty, and merge properties with `||` instead of replacing them.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 136 - 149, The upsert fails on duplicate ids and overwrites existing fields with empty values; before building rows deduplicate input nodes by id (e.g., in the codepath that constructs rows passed to pg_insert(_node_table)) so each id appears only once, and change the ON CONFLICT update (the stmt created with pg_insert and stmt.on_conflict_do_update) to preserve existing non-empty values and merge properties: use COALESCE/NULLIF semantics for "name" and "type" (so existing values are kept when the incoming value is empty/null) and merge JSONB properties rather than replace them (use properties || excluded.properties or equivalent JSONB merge) and still set updated_at = func.now(); these changes should be made where stmt is defined and executed (the same block using async with self._session()).
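Both fixes (dedup by `id`, and COALESCE-style merging rather than blanket replacement) can be sketched together in plain Python. The row shape is an assumption, and in the real adapter the field merge happens server-side in the ON CONFLICT clause rather than in Python:

```python
def dedup_nodes(rows):
    """Keep one row per id; merge later duplicates into earlier ones so a
    partial payload does not blank out fields supplied by a fuller one."""
    by_id = {}
    for row in rows:
        if row["id"] in by_id:
            prev = by_id[row["id"]]
            # COALESCE-style: keep the previous value when the new one is empty
            prev["name"] = row["name"] or prev["name"]
            prev["type"] = row["type"] or prev["type"]
            # jsonb ||-style: right-hand keys override, all others survive
            prev["properties"] = {**prev["properties"], **row["properties"]}
        else:
            by_id[row["id"]] = dict(row)
    return list(by_id.values())


rows = [
    {"id": "n1", "name": "Alice", "type": "Person", "properties": {"age": 30}},
    {"id": "n1", "name": "", "type": "", "properties": {"city": "Oslo"}},
]
merged = dedup_nodes(rows)[0]
print(merged["name"], merged["type"], sorted(merged["properties"]))
```

The second row's empty `name`/`type` do not erase the first row's values, while its extra property is merged in — the same behavior the review asks the SQL upsert to have.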
🧹 Nitpick comments (1)
cognee/tasks/temporal_awareness/index_graphiti_objects.py (1)
23-32: Core guard logic is correct; consider using `SearchTypeNotSupported` for consistency.

The positive Neo4j check addresses the main concern from prior feedback. However, similar guards in `cypher_search_retriever.py` and `natural_language_retriever.py` raise `SearchTypeNotSupported` instead of `RuntimeError`. Using the same exception class would provide a consistent interface for callers handling unsupported-feature errors.

Suggested alignment with other retrieval guards:

```diff
 from cognee.infrastructure.databases.graph.config import get_graph_context_config
+from cognee.modules.retrieval.exceptions import SearchTypeNotSupported

 graph_provider = get_graph_context_config()["graph_database_provider"]
 if graph_provider != "neo4j":
-    raise RuntimeError(
+    raise SearchTypeNotSupported(
         "Graphiti temporal awareness requires the Neo4j graph backend. "
         "Set GRAPH_DATABASE_PROVIDER=neo4j to use this feature."
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/tasks/temporal_awareness/index_graphiti_objects.py` around lines 23 - 32, The guard currently checks graph_provider from get_graph_context_config() and raises a RuntimeError when not "neo4j"; replace this with raising SearchTypeNotSupported for consistency with cypher_search_retriever.py and natural_language_retriever.py: import SearchTypeNotSupported from its module (where other retrievers use it) and change the RuntimeError raise in index_graphiti_objects.py (the graph_provider check block) to raise SearchTypeNotSupported with the same explanatory message.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@cognee/infrastructure/databases/graph/postgres/adapter.py`:
- Around line 829-830: The SQL ORDER BY used when fetching edges is not
deterministic because it orders only by e.source_id, e.target_id but the edge
primary key includes relationship_name; update the ORDER BY in the query that
ends with "OFFSET :off LIMIT :lim" (in postgres adapter code handling edge
pagination) to include e.relationship_name after e.target_id so pagination is
stable and rows are neither skipped nor duplicated.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 49784333-06cb-4409-be35-6ad36cc66829
📒 Files selected for processing (3)
- CLAUDE.md
- cognee/infrastructure/databases/graph/postgres/adapter.py
- cognee/tasks/temporal_awareness/index_graphiti_objects.py
✅ Files skipped from review due to trivial changes (1)
- CLAUDE.md
Problems: - duplicate keys in one INSERT batch cause Postgres ON CONFLICT error - node tuple props included id in some methods but not others - triplet pagination ORDER BY not unique, could skip or duplicate rows Solutions: - dedup node rows by id and edge rows by composite key before INSERT - remove id from props dict in get_nodeset_subgraph and get_neighborhood - add relationship_name to triplet ORDER BY for deterministic pagination
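The pagination fix above amounts to ordering by the full composite primary key so every row has a unique, stable position. A hedged sketch of the corrected edge query (table and column names follow the review comments; the surrounding SQL is an assumption):

```sql
-- Unstable: (source_id, target_id) is not unique when two edges share
-- endpoints but differ in relationship_name; tied rows can swap across
-- pages, so pagination may skip or duplicate them.
-- Stable: order by the complete composite primary key.
SELECT e.source_id, e.target_id, e.relationship_name, e.properties
FROM graph_edge e
ORDER BY e.source_id, e.target_id, e.relationship_name
OFFSET :off LIMIT :lim;
```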
♻️ Duplicate comments (2)
cognee/infrastructure/databases/graph/postgres/adapter.py (2)
40-44: ⚠️ Potential issue | 🟠 Major

Add one-time lazy schema initialization inside `_session()`.

Calls like `add_nodes()`/`add_edges()` can run before `initialize()` and fail on a fresh DB (relation does not exist). Guard `_session()` with a one-time async init lock.

Suggested fix:

```diff
+import asyncio
 ...
 class PostgresAdapter(GraphDBInterface):
     ...
     def __init__(self, connection_string: str) -> None:
         """Create engine and sessionmaker from a Postgres connection string."""
         self.db_uri = connection_string
         self.engine = create_async_engine(self.db_uri)
         self.sessionmaker = async_sessionmaker(bind=self.engine, expire_on_commit=False)
+        self._initialized = False
+        self._init_lock = asyncio.Lock()
     ...
     @asynccontextmanager
     async def _session(self) -> AsyncIterator[Any]:
         """Yield an async session from the underlying engine."""
+        if not self._initialized:
+            async with self._init_lock:
+                if not self._initialized:
+                    await self.initialize()
+                    self._initialized = True
         async with self.sessionmaker() as session:
             yield session
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 40 - 44, The _session() async context manager must ensure the DB schema is initialized once before any user calls like add_nodes()/add_edges() run; add an instance-level async one-time init guard: create an _init_lock (asyncio.Lock) and an _initialized boolean on the adapter, and at the start of _session() acquire the lock, check _initialized, if False call initialize() (await it), set _initialized=True, then release the lock (or use async with); finally proceed to yield the session as before—this guarantees initialize() is called exactly once and prevents race conditions when multiple coroutines open sessions concurrently.
95-100: ⚠️ Potential issue | 🔴 Critical

Partial `add_node()` upserts can still erase existing node fields.

`add_node("id")` currently pushes empty `name`/`type` and replaces `properties` on conflict, which can wipe existing metadata instead of patching it.

Suggested fix:

```diff
 rows = []
 for node in nodes:
     if isinstance(node, tuple):
         props = {**(node[1] or {}), "id": node[0]}
     ...
     rows.append(
         {
             "id": str(props.get("id", "")),
-            "name": str(props.get("name", "")),
-            "type": str(props.get("type", "")),
+            "name": (
+                str(props["name"])
+                if props.get("name") is not None and "name" in props
+                else None
+            ),
+            "type": (
+                str(props["type"])
+                if props.get("type") is not None and "type" in props
+                else None
+            ),
             "properties": json.loads(json.dumps(extra, cls=JSONEncoder)),
             "created_at": now,
             "updated_at": now,
         }
     )
 ...
 stmt = stmt.on_conflict_do_update(
     index_elements=["id"],
     set_={
-        "name": stmt.excluded.name,
-        "type": stmt.excluded.type,
-        "properties": stmt.excluded.properties,
+        "name": func.coalesce(stmt.excluded.name, _node_table.c.name),
+        "type": func.coalesce(stmt.excluded.type, _node_table.c.type),
+        "properties": func.coalesce(_node_table.c.properties, text("'{}'::jsonb")).op("||")(
+            func.coalesce(stmt.excluded.properties, text("'{}'::jsonb"))
+        ),
         "updated_at": func.now(),
     },
 )
```

Based on learnings: in cognee's graph database adapters (Neo4j, Kuzu, Memgraph, Neptune), `add_node()` uses MERGE semantics (create-or-update without blanking existing fields).

Also applies to: 127-146
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 95 - 100, The add_node logic in adapter.py (functions/methods add_node and add_nodes) currently inserts a node with empty/default name/type or replaces the entire properties map on conflict, which can overwrite existing metadata; change the upsert semantics to perform a merge/patch instead of full replace: when a node is provided as a string or with partial properties, ensure you preserve existing node fields by merging incoming properties into existing properties (e.g., fetch and update only provided keys or use DB upsert semantics that merge maps) rather than replacing the properties column; apply the same merge behavior to the other add_node/add_nodes block mentioned (the similar code around the other section lines 127-146).
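In raw SQL, the suggested merge-style upsert corresponds roughly to the statement below. This is a sketch under the assumption of the `graph_node` columns named in the review, not the adapter's actual SQLAlchemy-generated statement:

```sql
-- Merge-style upsert: an empty incoming name/type falls back to the stored
-- value (NULLIF turns '' into NULL so COALESCE can skip it), and properties
-- are merged with || instead of replaced wholesale.
INSERT INTO graph_node (id, name, type, properties, created_at, updated_at)
VALUES (:id, :name, :type, CAST(:properties AS jsonb), now(), now())
ON CONFLICT (id) DO UPDATE SET
    name       = COALESCE(NULLIF(excluded.name, ''), graph_node.name),
    type       = COALESCE(NULLIF(excluded.type, ''), graph_node.type),
    properties = COALESCE(graph_node.properties, '{}'::jsonb)
                 || COALESCE(excluded.properties, '{}'::jsonb),
    updated_at = now();
```

The `excluded` pseudo-table holds the row that failed to insert, which is what makes last-write-wins merging expressible in a single statement.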
🧹 Nitpick comments (1)
cognee/infrastructure/databases/graph/postgres/adapter.py (1)
568-580: Validate `node_name_filter_operator` explicitly before branching.

Any value other than `"OR"` currently falls through to AND behavior. Raise `ValueError` for unsupported operators to keep API behavior explicit.

Suggested fix:

```diff
+if node_name_filter_operator not in {"OR", "AND"}:
+    raise ValueError(
+        f"Unsupported node_name_filter_operator: {node_name_filter_operator!r}. "
+        "Expected 'OR' or 'AND'."
+    )
 if node_name_filter_operator == "OR":
     neighbor_cte = """
```

As per coding guidelines: "Prefer explicit, structured error handling in Python code."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/infrastructure/databases/graph/postgres/adapter.py` around lines 568 - 580, Validate the node_name_filter_operator explicitly before the branching that builds neighbor_cte: check if node_name_filter_operator is either "OR" or "AND" (or other supported tokens) and if not raise a ValueError with a clear message; then proceed with the existing conditional that assigns neighbor_cte for the "OR" branch and the other branch for "AND". Ensure this validation occurs near the existing code that references node_name_filter_operator so the error is raised before any SQL string is constructed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: eadb0e74-6eae-4a5a-a3d8-297b1e2c1e55
📒 Files selected for processing (1)
cognee/infrastructure/databases/graph/postgres/adapter.py
New branch which won't auto-fail checks by being PR'd from outside Cognee. Sorry, Igor, for the hassle. I think everything looks good. The failing GitGuardian check is a fake / dummy credential, the not-all-tests-passed failure is... exactly that, and the two actual failures appear to be totally unrelated to the changes here.
Can I update this branch with changes from dev? It would be great if we can fix the unrelated failures (@siillee and me) so the full CI/CD can run normally before merge (some tests don't start until previous tests pass).
…ment-minimal-graphdb-in-postgres
…ment-minimal-graphdb-in-postgres
Yeah, certainly. Looks like Lazar has tried to do so. If these fail despite the dev pulls, and you want me to just fix whatever the issue is, I can do that, too. The only reason I didn't do it yesterday was to avoid potential conflicts with other branches doing the fix.
lxobr
left a comment
Looks good, thanks for bringing this one over the line
…h db and (#2587)

## Description

Adds a PostgreSQL-based hybrid adapter that performs graph and vector operations against a single Postgres database, enabling combined queries (JOINs across graph_node/graph_edge and vector collection tables) and single-transaction writes. This PR builds on PR #2584 (postgres graph adapter) and integrates it with the existing PGVector adapter via a new `PostgresHybridAdapter` in `cognee/infrastructure/databases/hybrid/postgres/`.

Key design decisions:

- Composition of the PGVector and Postgres adapters in the hybrid adapter
  - holds a PostgresAdapter and PGVectorAdapter
  - delegates standard operations to each
  - includes hybrid logic for joint operations
- Combined write methods for single-transaction graph+vector inserts
  - `add_nodes_with_vectors` and `add_edges_with_vectors`
- Combined search method
  - `search_graph_with_distances` JOINs graph and vector tables via pgvector cosine distance
- Unified engine integration
  - `USE_UNIFIED_PROVIDER=pghybrid` env var activates hybrid mode
  - `get_unified_engine()` assembles the hybrid adapter from graph and vector sub-adapters
  - `add_data_points` routes to hybrid writes when the capability is available
- Graph adapter creates its own engine from the relational connection string
  - matches the decoupled constructor convention from PR #2584
  - hybrid adapter uses `self._graph._session()` for combined writes
- Cypher, natural language, and Graphiti retrievers rejected by both PostgresAdapter and PostgresHybridAdapter
- CI updated with hybrid adapter tests and hybrid E2E test steps
  - uses `pgvector/pgvector:pg17` image for pgvector extension support

Also includes from the PR #2584 merge:

- All adapter fixes verified present after merge
- `get_neighborhood` with recursive CTE
- `node_name_filter_operator` OR/AND support in `get_nodeset_subgraph`
- Batch dedup, node tuple shape consistency, triplet ordering
- `func.now()` on upsert timestamps
- Config save/restore in E2E tests

That is, this should include all content from previous merge tweaks, along with the independent tweaks that were on our previous version of this, #2506.

## Acceptance Criteria

- [x] `PostgresHybridAdapter` implements both `GraphDBInterface` and `VectorDBInterface`
- [x] Combined write methods perform graph+vector inserts in a single transaction
- [x] Combined search JOINs graph and vector tables
- [x] `USE_UNIFIED_PROVIDER=pghybrid` activates hybrid mode through the unified engine
- [x] `add_data_points` routes to hybrid writes when the capability is present
- [x] 26 postgres adapter unit tests pass
- [x] Retriever gates cover both PostgresAdapter and PostgresHybridAdapter
- [x] All 123 fixes from PR #2584 verified present after merge
- [x] CI passes on upstream

## Type of Change

- [ ] Bug fix (non-breaking change that fixes an issue)
- [x] New feature (non-breaking change that adds functionality)
- [ ] Code refactoring
- [ ] Other (please specify):

## Screenshots

26 passed, 8 warnings in 7.26s (will update with full passage shortly, while CI/CD is running).

## Pre-submission Checklist

- [x] **I have tested my changes thoroughly before submitting this PR** (See `CONTRIBUTING.md`)
- [x] **This PR contains minimal changes necessary to address the issue/feature**
- [x] My code follows the project's coding standards and style guidelines
- [x] I have added tests that prove my fix is effective or that my feature works
- [ ] I have added necessary documentation (if applicable)
- [x] All new and existing tests pass
- [x] I have searched existing PRs to ensure this change hasn't been submitted already
- [x] I have linked any relevant issues in the description
- [x] My commits have clear and descriptive messages

## DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

## Summary by CodeRabbit

* **New Features**
  * PostgreSQL hybrid adapter: combined graph+vector read/write, unified "pghybrid" provider, and graph search with distances.
  * Graph interface supports batch triplet retrieval.
* **Tests**
  * New E2E and unit tests validating hybrid adapter and hybrid write flows.
* **Bug Fixes**
  * More robust edge formatting and stricter backend guards for Cypher retrievers.
  * Improved vector metadata caching and reflection.
* **Chores**
  * CI updated to run pgvector/Postgres 17 tests, DCO workflow adjusted for trusted authors, version bumped to 0.5.8.
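The composition the #2587 description outlines (a hybrid adapter holding one graph and one vector sub-adapter, delegating plain calls to each and adding combined methods) can be sketched with stand-in classes. Everything here other than the delegation pattern itself — the stub class names, method signatures, and data shapes — is illustrative, not Cognee's actual interfaces:

```python
import asyncio


class GraphStub:
    """Stand-in for the Postgres graph adapter (illustrative only)."""

    def __init__(self):
        self.nodes = []

    async def add_nodes(self, nodes):
        self.nodes.extend(nodes)


class VectorStub:
    """Stand-in for the PGVector adapter (illustrative only)."""

    def __init__(self):
        self.vectors = {}

    async def index_vectors(self, items):
        self.vectors.update(items)


class HybridAdapter:
    """Composition: hold one graph and one vector sub-adapter, delegate
    plain calls to each, and add combined methods for joint writes."""

    def __init__(self, graph, vector):
        self._graph = graph
        self._vector = vector

    async def add_nodes(self, nodes):
        # Plain graph call: straight delegation.
        await self._graph.add_nodes(nodes)

    async def add_nodes_with_vectors(self, nodes, embeddings):
        # The real adapter runs both writes in one transaction via the graph
        # adapter's session; here they simply run back to back.
        await self._graph.add_nodes(nodes)
        await self._vector.index_vectors(embeddings)


hybrid = HybridAdapter(GraphStub(), VectorStub())
asyncio.run(hybrid.add_nodes_with_vectors(["n1", "n2"], {"n1": [0.1], "n2": [0.2]}))
print(len(hybrid._graph.nodes), len(hybrid._vector.vectors))
```

Composition keeps each sub-adapter independently testable while the hybrid class owns only the joint logic (combined writes and cross-table search).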
Description
Implements a PostgreSQL-based graph database adapter for Cognee.
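For orientation, the adapter's two tables, `graph_node` and `graph_edge`, might look roughly like this. The column set is inferred from the review discussion above; the authoritative DDL lives in `tables.py` and may differ:

```sql
CREATE TABLE IF NOT EXISTS graph_node (
    id         TEXT PRIMARY KEY,
    name       TEXT,
    type       TEXT,
    properties JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE IF NOT EXISTS graph_edge (
    source_id         TEXT NOT NULL REFERENCES graph_node (id),
    target_id         TEXT NOT NULL REFERENCES graph_node (id),
    relationship_name TEXT NOT NULL,
    properties        JSONB,
    created_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
    -- composite primary key; also the sort key for stable triplet pagination
    PRIMARY KEY (source_id, target_id, relationship_name)
);
```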
The adapter lives in `cognee/infrastructure/databases/graph/postgres/` and implements `GraphDBInterface` using two tables (`graph_node`, `graph_edge`). All graph operations are implemented using SQL or SQLAlchemy.

Key design decisions:

- composite key `(source_id, target_id, relationship_name)` on triplets
- connects via `GRAPH_DATABASE_URL` when `GRAPH_DATABASE_PROVIDER=postgres`
- table definitions in `tables.py`, adapter logic in `adapter.py`
- `query()` raises `NotImplementedError` by design
- `get_neighborhood` uses a recursive CTE with UNION ALL
- `get_nodeset_subgraph` pattern

Also includes:
- config save/restore in `test_graphdb_shared.py` so E2E tests don't leak mutated global state
- `GRAPH_DATABASE_URL` added to CI postgres E2E step in `graph_db_tests.yml`

Acceptance Criteria
- [x] `PostgresAdapter` implements all required `GraphDBInterface` methods except `query()`
- [x] 26 unit tests pass (`test_postgres_adapter.py`)
- [x] E2E test (`test_graphdb_postgres.py`) passes through full add/cognify/search pipeline
- [x] CI jobs in `graph_db_tests.yml` for both unit and E2E test steps

Type of Change
Screenshots
pytest: 1080 passed, 0 failed, 15 skipped
Pre-submission Checklist
- [x] **I have tested my changes thoroughly before submitting this PR** (See `CONTRIBUTING.md`)

DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.
Summary by CodeRabbit
New Features
Documentation
Tests
Behavior Changes