Feature/cog-4469 graph aware embeddings integrate joint postgres graph db and… (#2587)
Conversation
Problems:
- Need to migrate basic graph functionality to relational DB

Solutions:
- New adapter implementing GraphDBInterface
- Selection branch included in get_graph_engine.py

Decisions:
- re-use SQLAlchemy setup and configuration
- Remove .query() (raw Cypher) functionality
Problems:
- Cypher search retriever exposes raw Cypher interface via .query()
- Natural language retriever writes and uses Cypher via .query()
- Temporal awareness (Graphiti) subsystem directly uses Cypher and neo4j

Solutions:
- Raise incompatibility errors proactively across these consumers

Decisions:
- None of these are core functionality, so cost should be low
- Graphiti already has unguarded hard-coding of multiple configs...
- Complex Cypher search cannot be replaced in postgres at present
Problems:
- asyncpg misparses ::jsonb cast as named parameter
- prune_system drops graph tables via delete_database()
- cached adapter instance has no tables after prune
- no CI coverage for postgres graph backend
- no unit or e2e tests for postgres adapter

Solutions:
- replace ::jsonb with CAST(:param AS jsonb) throughout
- add _ensure_initialized() calling idempotent CREATE IF NOT EXISTS
- guard is_empty and delete_graph with _ensure_initialized
- add run-postgres-tests job to graph_db_tests.yml CI workflow
- add 26 unit tests against real Postgres
- add shared e2e test and postgres wrapper

Decisions:
- _ensure_initialized over defensive checks on every method
- real Postgres in tests, not SQLite (dialect differences block it)
- shared test function for future backend reuse (test_graphdb_shared)
- CI uses Postgres 16 service container, matching Kuzu/Neo4j pattern
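The `::jsonb` workaround above can be illustrated with a small rewrite helper. The helper and the SQL text are hypothetical, not the adapter's actual code; they only show the shape of the `CAST(:param AS jsonb)` substitution:

```python
import re


def rewrite_jsonb_casts(sql: str) -> str:
    """Rewrite ':param::jsonb' as 'CAST(:param AS jsonb)' so the '::' cast
    cannot be misread as a named bind parameter by the driver layer."""
    return re.sub(r":(\w+)::jsonb", r"CAST(:\1 AS jsonb)", sql)


broken = "INSERT INTO graph_node (properties) VALUES (:props::jsonb)"
print(rewrite_jsonb_casts(broken))
# → INSERT INTO graph_node (properties) VALUES (CAST(:props AS jsonb))
```

The commit applied this substitution by hand throughout the adapter rather than via such a helper; the regex is just a compact way to state the rule.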
Problems:
- some white-space and wrapping non-compliance

Solutions:
- fix via ruff
Problems:
- want joint graph and vector interaction in SQL to the extent possible
- currently, graph and vector writes are separate transactions
- search requires two round-trips: vector search, then graph search

Solutions:
- add PostgresHybridAdapter composing PostgresAdapter and PGVectorAdapter
- implement combined write methods with single-transaction atomicity
- implement batching to avoid loops on session.execute calls later
- implement combined search by joining graph and vector tables
- raise exceptions in get_graph_engine and create_vector_engine if given pghybrid
- add pghybrid construction in get_unified_engine

Decisions:
- composition rather than inheritance on subclasses of SQLAlchemyAdapter
- otherwise subtle configuration bugs can occur (diamond problem)
- hybrid only allowed via get_unified_engine, not individual factories
- per-row inserts for now, batch optimization deferred
- table name validation via regex to prevent SQL injection
Problems:
- pghybrid delegates query() to PostgresAdapter which raises generic error
- isinstance checks only catch PostgresAdapter, not PostgresHybridAdapter
- users would see NotImplementedError instead of descriptive message

Solutions:
- add PostgresHybridAdapter to isinstance checks in three files
- cypher search, natural language, and graphiti now guard both adapters
Problems:
- no test coverage for hybrid adapter

Solutions:
- add unit tests covering graph, vector, combined, and content paths
- content integrity test verifies IDs and payloads across both tables
- tests use local SQLAlchemyAdapter to avoid stale event loop issues
- fixture teardown cleans up vector collection tables between tests

Decisions:
- use uuid5 for test IDs because pgvector tables require UUID columns
- bypass PGVectorAdapter constructor to share connection pool with graph
- e2e and CI deferred because cognify pipeline calls get_graph_engine directly
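The uuid5 decision gives deterministic, reproducible test IDs that still satisfy pgvector's UUID columns. The namespace below is illustrative; any fixed namespace works:

```python
import uuid

# Any fixed namespace works; determinism is what matters for test fixtures.
TEST_NAMESPACE = uuid.NAMESPACE_DNS

node_id = uuid.uuid5(TEST_NAMESPACE, "test-node-1")
assert node_id == uuid.uuid5(TEST_NAMESPACE, "test-node-1")  # reproducible
assert isinstance(node_id, uuid.UUID)  # valid value for a UUID column
```

Unlike `uuid4()`, the same fixture name always maps to the same ID, so assertions can reference IDs without threading them through the test.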
Problems:
- 36+ graph and 60+ vector call sites block pghybrid error guard
- hybrid adapter missing embedding_engine attribute for indexing
- dotenv override=True in cognee init overwrites test env vars
- cached configs survive env var changes across test boundaries

Solutions:
- add USE_UNIFIED_PROVIDER env var read at top of both factories
- get_graph_engine returns PostgresAdapter when flag is pghybrid
- create_vector_engine returns PGVectorAdapter when flag is pghybrid
- get_unified_engine reads flag for hybrid detection and construction
- expose embedding_engine on hybrid adapter from vector adapter
- e2e test sets env vars after import then clears all five caches

Decisions:
- explicit env var flag over auto-detection from two provider configs
- each factory returns independent component, not shared hybrid instance
- env vars set after cognee import to survive dotenv override=True
- hybrid combined queries only available through get_unified_engine
Problems:
- inadequate batching
- add_nodes loops N awaits, one INSERT round-trip each
- add_edges has the same per-row loop pattern
- has_edges loops N SELECT EXISTS queries
- raw text() SQL construction, not SQLAlchemy

Solutions:
- multi-row insert instead of looping
- add_nodes uses pg_insert with multi-row VALUES batch
- add_edges uses pg_insert with multi-row VALUES batch
- has_edges uses SQLAlchemy values() with EXISTS join
- added SQLAlchemy Core table definitions for DML
- these will be removed next commit to maintain a single source of truth

Decisions:
- maximum batching optimization
- multi-row INSERT over executemany for single-statement execution
- executemany pipelines but still performs N server-side executions
- looped awaits worst, executemany middle, VALUES best for batching
- chose SQLAlchemy Core over raw text() for proactive safety
- text() with bind params was safe but not project convention
- SQLAlchemy eliminates all raw SQL string construction
- keep problematic reproduction of DDL logic for commit specificity
- will fix in follow-up
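The real code uses SQLAlchemy's `pg_insert`; the pure-Python sketch below only illustrates why one multi-row VALUES statement beats N round-trips. The builder, table, and columns are hypothetical:

```python
def build_multirow_insert(table: str, columns: list, rows: list) -> tuple:
    """Build one INSERT with a multi-row VALUES clause and positional
    parameters, replacing N single-row statements (one round-trip each)."""
    params = []
    groups = []
    for row in rows:
        placeholders = []
        for col in columns:
            params.append(row[col])
            placeholders.append(f"${len(params)}")  # asyncpg-style positionals
        groups.append("(" + ", ".join(placeholders) + ")")
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES " + ", ".join(groups)
    return sql, params


sql, params = build_multirow_insert(
    "graph_node", ["id", "label"], [{"id": 1, "label": "a"}, {"id": 2, "label": "b"}]
)
print(sql)     # → INSERT INTO graph_node (id, label) VALUES ($1, $2), ($3, $4)
print(params)  # → [1, 'a', 2, 'b']
```

The hierarchy in the decisions above follows directly: looped awaits pay N network round-trips, `executemany` pipelines the rows but still executes N statements server-side, and a single multi-row VALUES statement is one execution end to end.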
Problems:
- raw DDL strings and Core table objects duplicated schema

Solutions:
- replaced _SCHEMA_DDL list with Core Table and Index definitions
- initialize() now uses _meta.create_all() instead of raw DDL
- removed asyncio import (unused)
- replaced postgresql TIMESTAMP with generic DateTime(timezone=True)
Note:
- All problems below were cosmetic or stylistic, not substantive.

Problems:
- SQL injection via unvalidated attribute names in get_filtered_graph_data
- empty attribute_filters can produce invalid SQL syntax
- local "values" var shadows sqlalchemy values import in add_edges
- JSON serialize then deserialize indirection in add_nodes
- get_nodeset_subgraph issues four sequential DB queries
- get_filtered_graph_data issued two sequential DB queries
- public methods missing docstrings, unlike Neo4j and Kuzu adapters
- section banners inconsistent with other adapters
- narrating inline comments restate what code already says
- python-version input missing from graph_db_tests workflow
- SearchTypeNotSupported imported inline instead of at module level

Solutions:
- add attribute whitelist, raise ValueError for invalid filter keys
- fall back to get_graph_data when attribute_filters is empty
- rename local values to rows in add_edges
- inline property dict construction, remove _split_core_and_extra
- combine get_nodeset_subgraph into single CTE query
- combine get_filtered_graph_data into single CTE query
- add Parameters/Returns docstrings to all public methods
- remove section banners throughout adapter
- trim narrating comments, keep complex clarifying ones
- add python-version input with default 3.11 to workflow_call
- move SearchTypeNotSupported to module-level import

Reasoning and Related Notes:
- prune_system metadata=True drops relational DB including graph tables
- _ensure_initialized kept defensive as a result
- JSON round-trip in add_nodes retained for UUID and datetime serialization
- specifically, enforces JSONB compatibility (postgres wants python dict)
- alternative is in-place modification
- adding rules for doing this would replicate JSONEncoder
- JSONEncoder outputs strings, so we need to convert back if used
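The attribute-whitelist fix can be sketched as follows. The allowed attribute set and the function name are illustrative, not the adapter's actual identifiers:

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
ALLOWED_ATTRIBUTES = {"name", "type", "created_at"}  # illustrative whitelist


def validate_filter_keys(attribute_filters: dict) -> None:
    """Reject attribute names outside the whitelist or that are not plain
    identifiers, since filter keys end up interpolated into SQL text."""
    for key in attribute_filters:
        if key not in ALLOWED_ATTRIBUTES or not _IDENTIFIER.match(key):
            raise ValueError(f"Invalid attribute filter key: {key!r}")


validate_filter_keys({"name": ["Alice"]})  # passes silently
try:
    validate_filter_keys({"name); DROP TABLE x;--": []})
except ValueError as err:
    print(err)
```

Bind parameters cover attribute *values*, but attribute *names* become SQL identifiers and cannot be parameterized, which is why validation has to happen before string construction.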
Problems:
- tuple merge order lets properties dict override authoritative node ID
- __init__ and _session missing type annotations
- magic number 6 in test_graphdb_shared unexplained
- raised in review but incorrect: add weighted edges test

Solutions:
- swap merge order so tuple ID always wins
- add type annotations to __init__ and _session
- add comment explaining 3 searches x 2 entries = 6
- no duplication of weighted edges test necessary or appropriate

Reasoning and Related Notes:
- weighted edges test is backend-agnostic, exercises get_graph_from_model only
- running it under the Postgres job would not test adapter round-trip
- no backend actually has a round-trip weighted edges test
- therefore kuzu and neo4j cases are incorrect and duplicates
- recursive CTE performance in get_graph_metrics is not part of pipeline
- Kuzu returns -1 for components, Neo4j uses GDS plugin
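The merge-order bug is the plain dict-unpacking pitfall: later keys win. A hypothetical node tuple makes it concrete:

```python
node_id, properties = "node-1", {"id": "spoofed-id", "label": "Entity"}

# Properties merged last silently override the authoritative tuple ID:
wrong = {"id": node_id, **properties}

# Swapped order ensures the tuple ID always wins:
right = {**properties, "id": node_id}

print(wrong["id"], right["id"])  # → spoofed-id node-1
```

The fix is purely the ordering swap; no filtering of the properties dict is needed.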
…integrate-joint-postgres-graph-db-and
…al-graphdb-in-postgres' into feature/cog-4469-graph-aware-embeddings-integrate-joint-postgres-graph-db-and
Problems:
- postgres tests in unit/ fail without live database in CI
- integration/ is auto-swept by CI which lacks postgres service
- Igor requested separate postgres test directory and github action

Solutions:
- created cognee/tests/e2e/postgres/ for service-dependent tests
- moved 5 test files from unit/ and tests/ root to e2e/postgres
- updated imports to reflect new module paths
- updated graph_db_tests.yml to reference new file locations

Architectural Trade-Off Decisions:
- chose e2e/ over integration/ to avoid automatic CI sweep
- graph_db_tests.yml invokes explicitly so tests still run in CI

Reasoning and Related Notes:
- dexters1 (Igor) requested own github action for postgres tests
- all 39 tests pass locally against live postgres
Problems:
- get_triplets_batch ordering unstable with duplicate source/target pairs
- None embeddable data passed to embedding engine can cause failures
- bare except in Triplet_text deletion can hide real database errors
- pghybrid DSN built from None config fields gives unhelpful errors
- graphiti guard only blocks postgres, not kuzu or other non-neo4j backends

Solutions:
- added relationship_name to ORDER BY for deterministic pagination
- filtered None values from get_embeddable_data before calling embed_data
- narrowed except to log and distinguish missing-table from real errors
- added config validation before DSN construction in get_unified_engine
- replaced postgres blocklist with positive isinstance Neo4jAdapter check
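The ordering fix is the standard stable-pagination move: make the sort key a full tie-breaking tuple. An illustrative in-memory version of the idea (the real fix adds the column to SQL ORDER BY):

```python
def triplet_order_key(row: dict) -> tuple:
    """Including relationship_name breaks ties between duplicate
    source/target pairs, so paginated batches are deterministic."""
    return (row["source"], row["target"], row["relationship_name"])


rows = [
    {"source": "a", "target": "b", "relationship_name": "likes"},
    {"source": "a", "target": "b", "relationship_name": "knows"},
]
ordered = sorted(rows, key=triplet_order_key)
print([r["relationship_name"] for r in ordered])  # → ['knows', 'likes']
```

Without the tie-breaker, rows sharing a source/target pair may appear in different orders across queries, so consecutive LIMIT/OFFSET pages can drop or repeat triplets.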
Problems:
- edge property test assertion used or, passing on partial data
- hybrid search results built from fragile positional row indexes
- hybrid adapter __init__ lacked type hints on parameters
- get_graph_engine error message exceeded 100-char line limit

Solutions:
- split or assertion into two separate asserts for weight and edge_text
- switched to result.mappings() for named column access in hybrid search
- added TYPE_CHECKING imports and type annotations to __init__
- extracted provider list to variable before f-string interpolation
…face.

Problems:
- get_triplets_batch not declared on interface, callers use hasattr checks
- Neptune adapter does not implement it and cannot be required to

Solutions:
- added get_triplets_batch to GraphDBInterface as non-abstract optional method
- default raises NotImplementedError following existing feedback_weights pattern
- docstring documents which adapters implement it
Problems:
- CI had no coverage for pghybrid hybrid adapter or e2e path
- postgres:16 image lacks pgvector extension needed by hybrid tests
- TEST_DATA_DIR resolved to wrong path after file move to e2e/postgres
- hybrid adapter search wrapper missing node_name_filter_operator param
- combined_write stored belongs_to_set as null instead of empty list
- nodeset filtering test failed because entities lack belongs_to_set

Solutions:
- added hybrid adapter and pghybrid e2e steps to graph_db_tests.yml
- upgraded CI service image to pgvector/pgvector:pg17
- fixed TEST_DATA_DIR to resolve relative to tests/ root via parents[2]
- added node_name_filter_operator passthrough in hybrid search wrapper
- normalized belongs_to_set to empty list in combined_write payload
- disabled nodeset test with TODO noting cross-backend pipeline gap

Reasoning and Related Notes:
- nodeset gap affects all backends, not a postgres-specific issue
- entities from extract_graph_from_data never inherit belongs_to_set
Problems:
- add_data_points split unified engine into separate graph and vector calls
- hybrid adapter's combined_write methods weren't called from the pipeline
- vector payload format did not match PGVectorAdapter's IndexSchema format

Solutions:
- added HYBRID_WRITE capability check to route nodes and edges atomically
- add_nodes_with_vectors and add_edges_with_vectors now called when hybrid
- non-hybrid backends fall back to existing separate graph and vector calls
- payload now uses IndexSchema and serialize_data identical to PGVectorAdapter
- updated test assertions to match IndexSchema payload format

Reasoning and Related Notes:
- import and re-use IndexSchema to keep linkage with PGVectorAdapter
Problems:
- hybrid joint-add edge type payloads differed from PGVectorAdapter
- no test verified that hybrid and separate paths produce identical payloads

Solutions:
- hybrid edge type payloads now use IndexSchema and serialize_data
- added test inserting a 3-node 3-edge graph via both paths and comparing
- test asserts identical vector payload keys and values for nodes and edges

Reasoning and Related Notes:
- hybrid and separate paths do not share serialization code yet
- test will catch drift if either path changes independently
- didn't want to change PGVectorAdapter in this PR
…t routing.

Problems:
- hybrid adapter CI step missing embedding API secrets
- get_nodeset_subgraph override drops node_name_filter_operator parameter
- vector ON CONFLICT DO NOTHING leaves stale embeddings on reprocess
- delete_nodes_with_vectors docstring promised all-collection deletion
- hybrid path inserted triplets as graph nodes instead of vector-only

Solutions:
- added EMBEDDING_* env vars to hybrid adapter CI step
- added node_name_filter_operator to get_nodeset_subgraph matching interface
- changed both node and edge vector inserts to ON CONFLICT DO UPDATE
- corrected docstring to reflect graph-only behavior when collections is None
- triplets now always use vector-only index_data_points regardless of backend
…ment-minimal-graphdb-in-postgres
Problems:
- hybrid per-row inserts deadlock under concurrent cognify writes
- PGVectorAdapter reflects full schema on every search and has_collection call
- get_unified_engine creates a new PGVectorAdapter per call for pghybrid

Solutions:
- batch all hybrid inserts into one statement per table
- cache reflected table metadata on PGVectorAdapter instance
- hybrid adapter reuses cached PGVectorAdapter via get_vector_engine()
- targeted reflect(only=[name]) on cache miss instead of full reflect

Architectural Trade-Off Decisions:
- reuse get_vector_engine requires VECTOR_DB_PROVIDER=pgvector for pghybrid
- accepted over adapter-level cache dict for consistency with other backends

Reasoning and Related Notes:
- deadlocks occurred in 2 of 5 benchmark iterations before batching fix
- pgv_vector_get_table dropped from 33s to 0s with metadata cache
- hybrid_write dropped from 30s to 14s with batched inserts
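The metadata cache can be sketched independently of SQLAlchemy; `reflect` below stands in for the targeted `reflect(only=[name])` call, and the class is illustrative:

```python
class ReflectedTableCache:
    """Cache reflected table metadata on the adapter instance and hit the
    database only on a miss, via a targeted per-table reflection."""

    def __init__(self, reflect):
        self._reflect = reflect  # callable performing the real reflection
        self._tables = {}

    def get_table(self, name: str):
        if name not in self._tables:  # miss: reflect just this one table
            self._tables[name] = self._reflect(name)
        return self._tables[name]


calls = []
cache = ReflectedTableCache(lambda name: calls.append(name) or f"table:{name}")
cache.get_table("chunks")
cache.get_table("chunks")  # served from cache, no second reflection
print(calls)  # → ['chunks']
```

This is where the reported 33s-to-0s improvement comes from: full-schema reflection on every search becomes at most one targeted reflection per table per adapter lifetime.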
…mbeddings-integrate-joint-postgres-graph-db-and' into feature/cog-4469-graph-aware-embeddings-integrate-joint-postgres-graph-db-and Take upstream version on everything
…ph-aware-embeddings-integrate-joint-postgres-graph-db-and Same set of conflicts as previously... issue was dev merge order across branches.
Walkthrough: Adds a Postgres hybrid adapter and wiring so graph and vector functionality can share a Postgres/pgvector backend; updates engine factories, adapters, retrievers, write paths, tests, CI workflows, and a DCO check to support the new pghybrid provider.
| GitGuardian id | GitGuardian status | Secret | Commit | Filename | |
|---|---|---|---|---|---|
| 9573981 | Triggered | Generic Password | 4347a95 | .github/workflows/graph_db_tests.yml | View secret |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secret safely. Learn here the best practices.
- Revoke and rotate this secret.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation.
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Problems:
- unified engine mock returned truthy AsyncMock for has_capability
- add_data_points routed to hybrid writes, skipping mocked add_nodes

Solutions:
- set has_capability to return False in unified engine mock
Problems:
- unified engine mock returned truthy AsyncMock for has_capability
- add_data_points routed to hybrid writes, skipping mocked add_nodes
- no test coverage for hybrid write path

Solutions:
- set has_capability to return False in non-hybrid mock
- add test_add_data_points_hybrid_write_path verifying hybrid routing
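The truthy-mock pitfall behind this fix is generic to unittest.mock; the sketch below uses `MagicMock` (the capability-check name mirrors the one above, and the same truthiness trap applies to `AsyncMock` attributes):

```python
from unittest.mock import MagicMock

engine = MagicMock()
# An unconfigured mock call returns another (truthy) mock, so a guard like
# `if engine.has_capability("HYBRID_WRITE"):` silently takes the hybrid branch:
assert engine.has_capability("HYBRID_WRITE")

# Pinning the return value forces the non-hybrid path under test:
engine.has_capability.return_value = False
assert not engine.has_capability("HYBRID_WRITE")
```

Any boolean capability gate in code under test needs its mock return value pinned explicitly, or the test exercises a branch the author never intended.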
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)
173-180:⚠️ Potential issue | 🔴 CriticalPopulate the metadata cache only after the DDL transaction commits.
`has_collection()` and `get_table()` now trust `self._metadata.tables` without taking `VECTOR_DB_LOCK`. Because Line 178 reflects the new table into that cache before the surrounding `engine.begin()` block commits, a concurrent reader can observe the table as "present" and then fail with `relation does not exist` on its own session. Move the cache fill to after the `CREATE TABLE` transaction exits, or make readers participate in the same lock.

🔧 Minimal fix
```diff
+    created = False
     async with self.engine.begin() as connection:
         if len(Base.metadata.tables.keys()) > 0:
             await connection.run_sync(
                 Base.metadata.create_all, tables=[PGVectorDataPoint.__table__]
             )
-            await connection.run_sync(
-                self._metadata.reflect, only=[collection_name]
-            )
+            created = True
+
+    if created:
+        async with self.engine.begin() as connection:
+            await connection.run_sync(
+                self._metadata.reflect, only=[collection_name]
+            )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py` around lines 173 - 180, The metadata cache is being populated inside the CREATE TABLE transaction (via self._metadata.reflect) which can expose the new table to readers (has_collection()/get_table()) before the DDL commit and cause "relation does not exist" errors; fix by moving the self._metadata.reflect call out of the async with self.engine.begin() block (or alternatively acquire VECTOR_DB_LOCK around the reflect) so the cache is updated only after the transaction commits; update PGVectorAdapter to call Base.metadata.create_all (or keep it) inside the transaction but perform self._metadata.reflect (or any cache fill) after the context exits so readers observing self._metadata.tables see only committed tables.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/approve_dco.yaml:
- Around line 23-27: The trustedRoles array currently includes 'CONTRIBUTOR',
which incorrectly treats GitHub author_association "CONTRIBUTOR" as a trusted
role; remove 'CONTRIBUTOR' from the trustedRoles declaration so only OWNER,
MEMBER, and COLLABORATOR are considered trusted, and ensure any logic using
isTrusted (computed from trustedRoles.includes(authorAssociation)) continues to
enforce DCO checks for contributors; also update the nearby comment if needed to
match the corrected trustedRoles list.
In `@cognee/infrastructure/databases/graph/get_graph_engine.py`:
- Around line 98-108: The pghybrid provider check (USE_UNIFIED_PROVIDER ==
"pghybrid") must be moved out of the `@lru_cache`'d _create_graph_engine() so the
choice isn't frozen by the cache; instead, read and evaluate
os.environ["USE_UNIFIED_PROVIDER"] before calling _create_graph_engine(), and
when pghybrid is requested resolve get_relational_engine() immediately and
validate the relational backend is PostgreSQL (failing fast with a clear error
if not) and then construct
PostgresAdapter(connection_string=get_relational_engine().db_uri) outside the
cached function; keep _create_graph_engine() purely responsible for creating
graph engines for a given resolved provider.
In `@cognee/infrastructure/databases/hybrid/postgres/adapter.py`:
- Around line 595-610: The current SQL in the node_id is None branch (the SQL
using gn, vc and LEFT JOIN graph_edge ge) applies ORDER BY distance and LIMIT
after the graph_edge fan-out, causing duplicate rows per node; fix by first
selecting the top N nodes ranked by distance from the {table} vector column
(i.e., produce a subquery of node ids and their distance by joining vc to
graph_node and applying ORDER BY distance LIMIT :limit), then join that limited
set to graph_node and LEFT JOIN graph_edge to attach edge context (or
alternatively use DISTINCT ON (gn.id) ordered by distance) so you return one
representative edge-context row per ranked node instead of limiting post-fanout.
- Around line 518-540: The current transaction opens via async with
self._graph._session() and then attempts DELETE on Triplet_text which, if the
table is missing, marks the whole transaction as failed even though the
exception is caught; fix by isolating the Triplet_text delete in a SAVEPOINT
inside that session (execute "SAVEPOINT sp_triplet", run the DELETE, on
relation-missing error ROLLBACK TO SAVEPOINT sp_triplet and suppress only that
case, otherwise re-raise, then RELEASE SAVEPOINT sp_triplet) or alternatively
probe for the table existence before entering the transaction (use a prior query
against pg_catalog to check Triplet_text) and skip the DELETE entirely if
absent; apply this change around the Triplet_text block that uses
session.execute and await session.commit() so earlier deletes are not rolled
back by the missing-table error.
In `@cognee/infrastructure/databases/unified/get_unified_engine.py`:
- Around line 12-17: The USE_UNIFIED_PROVIDER override in _is_hybrid_provider
currently treats any non-empty value as hybrid; change it to only enable hybrid
when the env value matches a supported hybrid provider. Inside
_is_hybrid_provider, read os.environ.get("USE_UNIFIED_PROVIDER", "") ->
normalise (strip and lower) and check membership against a defined set like
supported_hybrid_providers = {"<providerA>", "<providerB>"} (replace with your
actual supported hybrid identifiers); return True only if the normalised env
value is in that set, otherwise fall through to the existing
graph_config/vector_config logic. Ensure you reference the same function name
_is_hybrid_provider and keep the os import/location as in the diff.
In `@cognee/infrastructure/databases/vector/create_vector_engine.py`:
- Around line 78-98: The pghybrid override is evaluated inside the cached
_create_vector_engine() call so changing USE_UNIFIED_PROVIDER or Postgres
settings can return a stale adapter and bypass pgvector validation; fix by
resolving the unified_provider (read USE_UNIFIED_PROVIDER and, if pghybrid, call
get_relational_config and construct the connection_string) before invoking the
cached factory and then pass the resolved mode/value into
_create_vector_engine's cache key (e.g., add a unified_provider or provider_mode
parameter to _create_vector_engine) so the LRU cache differentiates pghybrid vs
default and retains the normal pgvector validation path (use PGVectorAdapter
only when the resolved mode indicates pghybrid).
In `@cognee/memify_pipelines/consolidate_entity_descriptions.py`:
- Around line 69-83: The code indexes edge[1]/edge[2] in the consolidation logic
without validating the shape, which can raise IndexError on short/malformed
edges; update the branch in consolidate_entity_descriptions.py to first check
the arity and types (e.g., ensure isinstance(edge, (list, tuple)) and len(edge)
>= 3) before any positional indexing, and on failure either skip the record
(continue) or handle it via a clear fallback path that avoids direct indexing;
preserve assignment to target_id and rel_name (the existing symbols) only after
the guard so result[target_id] = rel_name is safe.
In `@cognee/tasks/temporal_awareness/index_graphiti_objects.py`:
- Line 25: Wrap the top-level import of Neo4jAdapter in a try-except block so an
ImportError from the optional neo4j dependency is caught and re-raised as a
RuntimeError with actionable guidance; specifically, locate the line importing
Neo4jAdapter from cognee.infrastructure.databases.graph.neo4j_driver.adapter and
replace it with a guarded import that catches ImportError and raises
RuntimeError suggesting installing the optional neo4j dependency and referencing
the project docs or extras (e.g., "pip install ...") so the module fails with a
clear, actionable message instead of leaking raw ImportError.
In `@cognee/tests/e2e/postgres/test_pghybrid.py`:
- Around line 19-42: Move the module-level environment mutations and
cache-clears into a controlled setup function (e.g., main() or a pytest fixture)
so they run only when the script is intentionally invoked; specifically,
relocate the os.environ assignments for USE_UNIFIED_PROVIDER,
ENABLE_BACKEND_ACCESS_CONTROL, DB_PROVIDER, DB_HOST, DB_PORT, DB_USERNAME,
DB_PASSWORD, DB_NAME and the subsequent cache_clear() calls on
get_relational_config, get_graph_config, create_relational_engine,
_create_graph_engine, and _create_vector_engine into that setup function/fixture
and call it from main() (or register as a fixture) instead of running at import
time.
In `@cognee/tests/e2e/postgres/test_postgres_hybrid_adapter.py`:
- Around line 425-432: The test test_add_edges_with_vectors is flaky because it
relies on leftover PGVectorAdapter connection-pool state; make it deterministic
by isolating collection state per run: modify the test (or the adapter fixture)
to use a unique collection/schema name (e.g., append a UUID or timestamp) or
ensure a deterministic teardown that drops the vector table/collection and
resets the PGVectorAdapter connection pool before and after the test (or
recreate the adapter instance per test). Update the test_add_edges_with_vectors
(and/or the adapter fixture) to create and cleanup a dedicated collection/schema
or call the adapter reset method so no state is shared across tests.
---
Outside diff comments:
In `@cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py`:
- Around line 173-180: The metadata cache is being populated inside the CREATE
TABLE transaction (via self._metadata.reflect) which can expose the new table to
readers (has_collection()/get_table()) before the DDL commit and cause "relation
does not exist" errors; fix by moving the self._metadata.reflect call out of the
async with self.engine.begin() block (or alternatively acquire VECTOR_DB_LOCK
around the reflect) so the cache is updated only after the transaction commits;
update PGVectorAdapter to call Base.metadata.create_all (or keep it) inside the
transaction but perform self._metadata.reflect (or any cache fill) after the
context exits so readers observing self._metadata.tables see only committed
tables.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: aa9642a5-a4bd-4528-81d8-c6fb681a12cc
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (17)
- .github/workflows/approve_dco.yaml
- .github/workflows/graph_db_tests.yml
- cognee/infrastructure/databases/graph/get_graph_engine.py
- cognee/infrastructure/databases/graph/graph_db_interface.py
- cognee/infrastructure/databases/hybrid/postgres/__init__.py
- cognee/infrastructure/databases/hybrid/postgres/adapter.py
- cognee/infrastructure/databases/unified/get_unified_engine.py
- cognee/infrastructure/databases/vector/create_vector_engine.py
- cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py
- cognee/memify_pipelines/consolidate_entity_descriptions.py
- cognee/modules/retrieval/cypher_search_retriever.py
- cognee/modules/retrieval/natural_language_retriever.py
- cognee/tasks/storage/add_data_points.py
- cognee/tasks/temporal_awareness/index_graphiti_objects.py
- cognee/tests/e2e/postgres/test_pghybrid.py
- cognee/tests/e2e/postgres/test_postgres_hybrid_adapter.py
- pyproject.toml
```js
// 1. Check if user is an org member via author_association
// (available on the PR payload without extra API calls or permissions)
// OWNER, MEMBER, and COLLABORATOR are trusted roles.
const trustedRoles = ['OWNER', 'MEMBER', 'COLLABORATOR', 'CONTRIBUTOR'];
const isTrusted = trustedRoles.includes(authorAssociation);
```
🧩 Analysis chain
🌐 Web query:
In GitHub pull_request payloads, what does author_association = "CONTRIBUTOR" mean, and is it considered equivalent to OWNER/MEMBER/COLLABORATOR for trust decisions?
💡 Result:
Web search failed: Server error: no LLM provider could handle the message
🏁 Script executed:
# First, locate and read CONTRIBUTING.md to understand DCO requirements
find . -name "CONTRIBUTING.md" -type f | head -5
Repository: topoteretes/cognee
Length of output: 79
🌐 Web query:
GitHub pull request author_association field values meaning OWNER MEMBER COLLABORATOR CONTRIBUTOR documentation
💡 Result:
In GitHub’s author_association field (an enum named CommentAuthorAssociation), these values mean:
- OWNER: Author is the owner of the repository. [1]
- MEMBER: Author is a member of the organization that owns the repository. [1]
- COLLABORATOR: Author has been invited to collaborate on the repository. [1]
- CONTRIBUTOR: Author has previously committed to the repository. [1]
Source: GitHub GraphQL enum docs for CommentAuthorAssociation. [1]
References
[1] https://docs.github.com/en/enterprise-cloud@latest/graphql/reference/enums#commentauthorassociation
🏁 Script executed:
# Read the actual file under review to confirm current state
cat -n .github/workflows/approve_dco.yaml | head -40
Repository: topoteretes/cognee
Length of output: 2117
🏁 Script executed:
# Read CONTRIBUTING.md to verify DCO policy requirements
cat -n CONTRIBUTING.md
Repository: topoteretes/cognee
Length of output: 7165
Remove CONTRIBUTOR from trusted roles to enforce DCO policy
Line 26 includes CONTRIBUTOR in trustedRoles, allowing external contributors who have previously committed to bypass DCO enforcement (lines 29–35). However, CONTRIBUTOR in GitHub's author_association field means "has previously committed to the repository," not organizational membership or trusted collaborator status. This contradicts the comment on line 25 ("OWNER, MEMBER, and COLLABORATOR are trusted roles") and violates the DCO requirement stated in CONTRIBUTING.md section 5, which mandates DCO affirmation for all contributions.
Suggested fix
- // OWNER, MEMBER, and COLLABORATOR are trusted roles.
- const trustedRoles = ['OWNER', 'MEMBER', 'COLLABORATOR', 'CONTRIBUTOR'];
+ // OWNER, MEMBER, and COLLABORATOR are trusted roles.
+ const trustedRoles = ['OWNER', 'MEMBER', 'COLLABORATOR'];
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// 1. Check if user is an org member via author_association
// (available on the PR payload without extra API calls or permissions)
// OWNER, MEMBER, and COLLABORATOR are trusted roles.
const trustedRoles = ['OWNER', 'MEMBER', 'COLLABORATOR', 'CONTRIBUTOR'];
const isTrusted = trustedRoles.includes(authorAssociation);

// 1. Check if user is an org member via author_association
// (available on the PR payload without extra API calls or permissions)
// OWNER, MEMBER, and COLLABORATOR are trusted roles.
const trustedRoles = ['OWNER', 'MEMBER', 'COLLABORATOR'];
const isTrusted = trustedRoles.includes(authorAssociation);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/approve_dco.yaml around lines 23 - 27, The trustedRoles
array currently includes 'CONTRIBUTOR', which incorrectly treats GitHub
author_association "CONTRIBUTOR" as a trusted role; remove 'CONTRIBUTOR' from
the trustedRoles declaration so only OWNER, MEMBER, and COLLABORATOR are
considered trusted, and ensure any logic using isTrusted (computed from
trustedRoles.includes(authorAssociation)) continues to enforce DCO checks for
contributors; also update the nearby comment if needed to match the corrected
trustedRoles list.
Not our contribution, outside scope.
@DanielNScott, understood — I'll drop this for the current PR. Would you like me to open a follow-up GitHub issue to track removing CONTRIBUTOR from the trustedRoles array in .github/workflows/approve_dco.yaml, so it doesn't get lost?
🧠 Learnings used
Learnt from: CR
Repo: topoteretes/cognee PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-05T15:51:58.120Z
Learning: Sign commits and affirm the DCO as described in CONTRIBUTING.md
Learnt from: DanielNScott
Repo: topoteretes/cognee PR: 2584
File: cognee/infrastructure/databases/graph/postgres/adapter.py:35-44
Timestamp: 2026-04-10T03:13:45.843Z
Learning: In `topoteretes/cognee`, the factory function `get_graph_engine()` in `cognee/infrastructure/databases/graph/get_graph_engine.py` always calls `await graph_client.initialize()` (via a `hasattr` guard) before returning the adapter. This means all graph adapters, including `PostgresAdapter`, have their schema/tables created before any caller can invoke graph operations. Concerns about missing schema at runtime (e.g., "relation does not exist") do not apply to code paths that go through the factory.
async with self._graph._session() as session:
    if edge_type_ids:
        table = _validate_table_name("EdgeType_relationship_name")
        await session.execute(
            text(f"DELETE FROM {table} WHERE CAST(id AS text) = ANY(:ids)"),
            {"ids": edge_type_ids},
        )

    if triplet_ids:
        try:
            table = _validate_table_name("Triplet_text")
            await session.execute(
                text(f"DELETE FROM {table} WHERE CAST(id AS text) = ANY(:ids)"),
                {"ids": triplet_ids},
            )
        except Exception as e:
            error_msg = str(e).lower()
            if "does not exist" in error_msg or "relation" in error_msg:
                logger.debug("Triplet_text table not found, skipping: %s", e)
            else:
                logger.warning("Unexpected error deleting from Triplet_text: %s", e)

    await session.commit()
The swallowed Triplet_text error still aborts the transaction.
If DELETE FROM Triplet_text fails because the table is missing, PostgreSQL marks the whole transaction failed. Catching that exception and continuing to commit() can roll back the earlier deletes or raise a second error. Probe for the collection before opening the transaction, or use a savepoint and only suppress the expected "relation does not exist" case there.
💡 One safe approach
async def delete_edges_with_vectors(
self,
edge_type_ids: Optional[List[str]] = None,
triplet_ids: Optional[List[str]] = None,
) -> None:
+ triplet_table_exists = (
+ await self._vector.has_collection("Triplet_text") if triplet_ids else False
+ )
+
async with self._graph._session() as session:
if edge_type_ids:
table = _validate_table_name("EdgeType_relationship_name")
await session.execute(
text(f"DELETE FROM {table} WHERE CAST(id AS text) = ANY(:ids)"),
{"ids": edge_type_ids},
)
- if triplet_ids:
- try:
- table = _validate_table_name("Triplet_text")
- await session.execute(
- text(f"DELETE FROM {table} WHERE CAST(id AS text) = ANY(:ids)"),
- {"ids": triplet_ids},
- )
- except Exception as e:
- error_msg = str(e).lower()
- if "does not exist" in error_msg or "relation" in error_msg:
- logger.debug("Triplet_text table not found, skipping: %s", e)
- else:
- logger.warning("Unexpected error deleting from Triplet_text: %s", e)
+ if triplet_ids and triplet_table_exists:
+ table = _validate_table_name("Triplet_text")
+ await session.execute(
+ text(f"DELETE FROM {table} WHERE CAST(id AS text) = ANY(:ids)"),
+ {"ids": triplet_ids},
+ )
      await session.commit()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cognee/infrastructure/databases/hybrid/postgres/adapter.py` around lines 518
- 540, The current transaction opens via async with self._graph._session() and
then attempts DELETE on Triplet_text which, if the table is missing, marks the
whole transaction as failed even though the exception is caught; fix by
isolating the Triplet_text delete in a SAVEPOINT inside that session (execute
"SAVEPOINT sp_triplet", run the DELETE, on relation-missing error ROLLBACK TO
SAVEPOINT sp_triplet and suppress only that case, otherwise re-raise, then
RELEASE SAVEPOINT sp_triplet) or alternatively probe for the table existence
before entering the transaction (use a prior query against pg_catalog to check
Triplet_text) and skip the DELETE entirely if absent; apply this change around
the Triplet_text block that uses session.execute and await session.commit() so
earlier deletes are not rolled back by the missing-table error.
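The savepoint pattern suggested above can be sketched with stdlib sqlite3 as a stand-in for Postgres (table and savepoint names here are illustrative, not the repo's actual schema): the delete against the possibly-missing table is isolated behind a savepoint, so suppressing that one failure does not lose the earlier delete in the same transaction.

```python
# Savepoint sketch: isolate the delete on a possibly-missing table so a
# failure there cannot abort the rest of the transaction.
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute("CREATE TABLE edge_type (id TEXT)")
conn.executemany("INSERT INTO edge_type VALUES (?)", [("e1",), ("e2",)])

conn.execute("BEGIN")
conn.execute("DELETE FROM edge_type WHERE id = 'e1'")  # earlier delete to keep
conn.execute("SAVEPOINT sp_triplet")
try:
    conn.execute("DELETE FROM triplet_text WHERE id = 't1'")  # table missing
    conn.execute("RELEASE SAVEPOINT sp_triplet")
except sqlite3.OperationalError as e:
    if "no such table" in str(e):  # Postgres analogue: "relation ... does not exist"
        conn.execute("ROLLBACK TO SAVEPOINT sp_triplet")  # suppress only this case
        conn.execute("RELEASE SAVEPOINT sp_triplet")
    else:
        raise  # unexpected errors must still fail the outer transaction
conn.execute("COMMIT")

remaining = [row[0] for row in conn.execute("SELECT id FROM edge_type")]
print(remaining)  # ['e2'] — the earlier delete survived
```

In Postgres the savepoint is mandatory, not optional: once any statement errors, the whole transaction is aborted until a ROLLBACK TO SAVEPOINT, which is why the try/except-and-continue in the current code cannot work there.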
Non issue, not touched.
@DanielNScott, understood — I'll defer to your judgment on this one.
(╯°□°)╯
🧠 Learnings used
Learnt from: DanielNScott
Repo: topoteretes/cognee PR: 2584
File: cognee/infrastructure/databases/graph/postgres/adapter.py:35-44
Timestamp: 2026-04-10T03:13:45.843Z
Learning: In `topoteretes/cognee`, the factory function `get_graph_engine()` in `cognee/infrastructure/databases/graph/get_graph_engine.py` always calls `await graph_client.initialize()` (via a `hasattr` guard) before returning the adapter. This means all graph adapters, including `PostgresAdapter`, have their schema/tables created before any caller can invoke graph operations. Concerns about missing schema at runtime (e.g., "relation does not exist") do not apply to code paths that go through the factory.
Learnt from: CR
Repo: topoteretes/cognee PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-27T06:30:27.190Z
Learning: Implement database adapters by extending GraphDBInterface or VectorDBInterface
Learnt from: hajdul88
Repo: topoteretes/cognee PR: 2259
File: cognee/modules/retrieval/utils/used_graph_elements.py:77-83
Timestamp: 2026-03-03T14:44:13.269Z
Learning: In temporal retriever dictionaries, `triplets` and `vector_search_results` are mutually exclusive—a temporal dict contains either triplets (edge-based results) or vector_search_results (RAG-based results), but never both. The `extract_from_temporal_dict` function in `cognee/modules/retrieval/utils/used_graph_elements.py` correctly returns immediately after processing triplets without falling back to vector_search_results.
Learnt from: hajdul88
Repo: topoteretes/cognee PR: 1990
File: cognee/modules/search/methods/search.py:124-127
Timestamp: 2026-01-13T12:41:17.082Z
Learning: In Python 3.6+ (PEP 530), you can use await inside asynchronous comprehensions (e.g., [await func(x) for x in items]) within async def functions. This is valid and executes sequentially. If parallel execution is desired, collect coroutines and use asyncio.gather, since awaiting inside a comprehension runs each iteration in sequence unless you explicitly parallelize.
@pytest.mark.asyncio
async def test_add_edges_with_vectors(adapter):
    """Edges should be inserted into graph and edge types into vector table.

    Note: this test passes in isolation but may fail when run in suite
    due to PGVectorAdapter connection pool state from prior tests.
    This is a pre-existing PGVectorAdapter issue, not a hybrid adapter bug.
    """
This test is documented as flaky.
The note says it can fail in-suite because it depends on leftover PGVector connection-pool state from earlier tests. That makes CI order-dependent instead of behavior-dependent. Please isolate the collection state per test (unique collection names, dedicated schema, or deterministic teardown) before keeping this in the normal suite.
As per coding guidelines, "Avoid external state in Python tests; rely on test fixtures and CI-provided environment variables".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cognee/tests/e2e/postgres/test_postgres_hybrid_adapter.py` around lines 425 -
432, The test test_add_edges_with_vectors is flaky because it relies on leftover
PGVectorAdapter connection-pool state; make it deterministic by isolating
collection state per run: modify the test (or the adapter fixture) to use a
unique collection/schema name (e.g., append a UUID or timestamp) or ensure a
deterministic teardown that drops the vector table/collection and resets the
PGVectorAdapter connection pool before and after the test (or recreate the
adapter instance per test). Update the test_add_edges_with_vectors (and/or the
adapter fixture) to create and cleanup a dedicated collection/schema or call the
adapter reset method so no state is shared across tests.
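One way to get the per-test isolation the prompt asks for is a unique collection name per test; a hypothetical sketch (the helper name and the "Triplet_text" prefix are assumptions for illustration):

```python
# Per-test isolation sketch: each test derives its own collection name,
# so no test can observe leftover state from a previous one.
import uuid

def make_collection_name(prefix: str = "Triplet_text") -> str:
    # uuid4 hex suffix makes collisions between test runs effectively impossible
    return f"{prefix}_{uuid.uuid4().hex}"

# In a pytest suite this would typically back a fixture, e.g.:
#
# @pytest.fixture
# def collection_name():
#     return make_collection_name()

a, b = make_collection_name(), make_collection_name()
print(a.startswith("Triplet_text_"), a != b)  # True True
```

Deterministic teardown (dropping the collection in fixture finalization) pairs well with this, since unique names alone leave tables behind in a long-lived test database.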
Non issue in practice, good coding hygiene, fixed.
@DanielNScott, understood — acknowledged and moving on.
🧠 Learnings used
Learnt from: hajdul88
Repo: topoteretes/cognee PR: 251
File: cognee/tests/infrastructure/databases/test_index_graph_edges.py:0-0
Timestamp: 2024-12-04T18:37:55.092Z
Learning: In the `index_graph_edges` function, both graph engine and vector engine initialization failures are handled within the same try-except block, so a single test covers both cases.
Learnt from: DanielNScott
Repo: topoteretes/cognee PR: 2584
File: cognee/infrastructure/databases/graph/postgres/adapter.py:35-44
Timestamp: 2026-04-10T03:13:45.843Z
Learning: In `topoteretes/cognee`, the factory function `get_graph_engine()` in `cognee/infrastructure/databases/graph/get_graph_engine.py` always calls `await graph_client.initialize()` (via a `hasattr` guard) before returning the adapter. This means all graph adapters, including `PostgresAdapter`, have their schema/tables created before any caller can invoke graph operations. Concerns about missing schema at runtime (e.g., "relation does not exist") do not apply to code paths that go through the factory.
Learnt from: hajdul88
Repo: topoteretes/cognee PR: 2259
File: cognee/tests/test_edge_centered_payload.py:146-147
Timestamp: 2026-03-03T14:48:18.555Z
Learning: In the cognee project, every graph edge must produce exactly one triplet embedding. The test `cognee/tests/test_edge_centered_payload.py` enforces strict equality: `len(triplets) == len(edges)`. This is the expected behavior, not a "too strict" test.
Learnt from: DanielNScott
Repo: topoteretes/cognee PR: 2506
File: cognee/tests/test_graphdb_shared.py:31-49
Timestamp: 2026-04-08T01:51:17.734Z
Learning: In cognee/tests/, the files `test_graphdb_shared.py` and `test_graphdb_postgres.py` are standalone script helpers (not pytest test cases), even though they are prefixed with "test_" and are Python files. The test directory contains a pre-existing mixture of pytest and non-pytest (standalone script) test code. Standard pytest isolation rules (e.g., restoring global config state) are less critical for these standalone scripts.
Learnt from: hajdul88
Repo: topoteretes/cognee PR: 1990
File: cognee/modules/search/methods/search.py:124-127
Timestamp: 2026-01-13T12:41:17.082Z
Learning: In Python 3.6+ (PEP 530), you can use await inside asynchronous comprehensions (e.g., [await func(x) for x in items]) within async def functions. This is valid and executes sequentially. If parallel execution is desired, collect coroutines and use asyncio.gather, since awaiting inside a comprehension runs each iteration in sequence unless you explicitly parallelize.
🧹 Nitpick comments (1)
cognee/tests/unit/tasks/storage/test_add_data_points.py (1)
346-353: Nice hybrid-path coverage; consider asserting payloads too. The path assertions are solid. Optional improvement: verify add_edges_with_vectors args, not just call count, so malformed edge payloads don't slip through.
Suggested assertion hardening:
  # Hybrid path: add_edges_with_vectors called, not add_edges
  assert graph_engine.add_edges_with_vectors.await_count == 2
  graph_engine.add_edges.assert_not_awaited()
+ expected_main_edges = ensure_default_edge_properties([edge1])
+ expected_custom_edges = ensure_default_edge_properties(custom_edges)
+ first_call_edges = graph_engine.add_edges_with_vectors.await_args_list[0].args[0]
+ assert expected_main_edges[0] in first_call_edges
+ assert expected_custom_edges[0] in first_call_edges
+ assert graph_engine.add_edges_with_vectors.await_args_list[1].args[0] == expected_custom_edges
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/tests/unit/tasks/storage/test_add_data_points.py` around lines 346 - 353, The test currently only checks that graph_engine.add_edges_with_vectors was called twice; update it to also assert the actual payloads passed to add_edges_with_vectors to prevent malformed edge data—inspect graph_engine.add_edges_with_vectors.call_args_list (or use the async mock's await_args_list / assert_awaited_with equivalents) and compare each call's arguments to the expected edge objects/vectors for the hybrid path, validating keys, source/target IDs and vector shapes/lengths so both calls' payloads are explicitly verified.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@cognee/tests/unit/tasks/storage/test_add_data_points.py`:
- Around line 346-353: The test currently only checks that
graph_engine.add_edges_with_vectors was called twice; update it to also assert
the actual payloads passed to add_edges_with_vectors to prevent malformed edge
data—inspect graph_engine.add_edges_with_vectors.call_args_list (or use the
async mock's await_args_list / assert_awaited_with equivalents) and compare each
call's arguments to the expected edge objects/vectors for the hybrid path,
validating keys, source/target IDs and vector shapes/lengths so both calls'
payloads are explicitly verified.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 354cb22a-814f-453a-9b81-24a11e0995a1
📒 Files selected for processing (1)
cognee/tests/unit/tasks/storage/test_add_data_points.py
…ation. Mostly tweaks; the lockfile issue was a real error.
Problems:
- uv.lock out of sync with pyproject.toml, failing CI
- any non-empty USE_UNIFIED_PROVIDER triggered hybrid mode
- edge indexing crashed on malformed edge tuples
- test_pghybrid set env vars at module level, polluting pytest collection
- add_data_points mock missing has_capability, no hybrid path test
Solutions:
- regenerate uv.lock
- gate USE_UNIFIED_PROVIDER on known UNIFIED_PROVIDERS set
- skip malformed edges in format_edges with arity guard
- move env setup and cache clears into main() in test_pghybrid
- fix unified engine mock and add hybrid write path test
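The arity guard mentioned in this commit can be sketched as follows; the function signature and the dictionary keys are assumptions for illustration, not the repo's exact format_edges:

```python
# Arity-guard sketch: skip edge tuples that don't carry at least
# (source, target, relationship) instead of crashing on unpacking.
def format_edges(edges):
    formatted = []
    for edge in edges:
        if not isinstance(edge, (list, tuple)) or len(edge) < 3:
            continue  # malformed edge: wrong shape or too few elements
        source_id, target_id, relationship = edge[:3]
        properties = edge[3] if len(edge) > 3 else {}
        formatted.append(
            {
                "source_id": source_id,
                "target_id": target_id,
                "relationship": relationship,
                "properties": properties,
            }
        )
    return formatted

edges = [("a", "b", "knows"), ("bad",), None, ("a", "c", "likes", {"w": 1})]
print(len(format_edges(edges)))  # 2 — malformed entries are skipped
```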
…actories.
Problems:
- pghybrid env check frozen inside lru_cache after first call
- subsequent env changes ignored for remainder of process
Solutions:
- move pghybrid check into uncached create_graph_engine wrapper
- move pghybrid check into uncached create_vector_engine wrapper
- remove duplicate checks from cached _create_* functions
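The caching bug fixed here generalizes: any environment check inside an lru_cache'd function is evaluated once and then frozen. A minimal sketch of the fix, with illustrative names (the repo's real factories build adapters, not strings):

```python
# Env checks belong in the *uncached* wrapper; only the expensive
# construction behind it should be memoized.
import os
from functools import lru_cache

@lru_cache(maxsize=None)
def _create_graph_engine_cached(provider: str) -> str:
    return f"engine:{provider}"  # stands in for expensive adapter setup

def create_graph_engine() -> str:
    # Re-read the environment on every call, so later changes are seen.
    provider = os.environ.get("USE_UNIFIED_PROVIDER", "")
    if provider == "pghybrid":
        # hybrid is only constructed via the unified-engine factory
        raise RuntimeError("pghybrid is not allowed via the graph factory")
    return _create_graph_engine_cached(provider or "default")

os.environ["USE_UNIFIED_PROVIDER"] = ""
print(create_graph_engine())  # engine:default
os.environ["USE_UNIFIED_PROVIDER"] = "pghybrid"
try:
    create_graph_engine()
except RuntimeError:
    print("rejected")  # env change is seen despite the cache
```

Had the pghybrid check lived inside `_create_graph_engine_cached`, the first call's result would be replayed for the rest of the process regardless of the environment.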
Problems:
- LEFT JOIN graph_edge produced multiple rows per node
- LIMIT counted joined rows, not distinct nodes
Solutions:
- select top-k nodes by distance in CTE, then join edges outside LIMIT
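The LIMIT bug and its CTE fix can be demonstrated with stdlib sqlite3 (table and column names are illustrative, mirroring the commit message rather than the repo schema): the naive query lets one well-connected node's edge rows exhaust the limit, while the CTE ranks and limits nodes first and joins edges outside it.

```python
# Demonstration: LIMIT on a joined query counts rows, not distinct nodes.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER, distance REAL);
    CREATE TABLE graph_edge (source_id INTEGER, target_id INTEGER);
    INSERT INTO node VALUES (1, 0.1), (2, 0.2), (3, 0.3);
    INSERT INTO graph_edge VALUES (1, 2), (1, 3), (2, 3);
""")

# Naive query: node 1's two edge rows consume the whole LIMIT,
# so node 2 is dropped even though we asked for two nodes.
naive = conn.execute("""
    SELECT node.id FROM node
    LEFT JOIN graph_edge ON graph_edge.source_id = node.id
    ORDER BY node.distance LIMIT 2
""").fetchall()

# CTE fix: pick the top-k nodes by distance first, then attach edges
# outside the LIMIT, so edge multiplicity cannot eat into k.
fixed = conn.execute("""
    WITH top_nodes AS (
        SELECT id, distance FROM node ORDER BY distance LIMIT 2
    )
    SELECT top_nodes.id FROM top_nodes
    LEFT JOIN graph_edge ON graph_edge.source_id = top_nodes.id
    ORDER BY top_nodes.distance
""").fetchall()

print(sorted({r[0] for r in naive}))  # [1] — only one distinct node
print(sorted({r[0] for r in fixed}))  # [1, 2] — two distinct nodes
```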
Actionable comments posted: 2
♻️ Duplicate comments (1)
cognee/infrastructure/databases/hybrid/postgres/adapter.py (1)
518-540: ⚠️ Potential issue | 🟠 Major
Don't swallow a failed Triplet_text delete inside this transaction. If lines 529-532 raise "relation does not exist", PostgreSQL aborts the whole transaction. Catching that exception and then continuing to commit() can undo the earlier EdgeType_relationship_name delete, and the else branch currently logs unexpected errors instead of re-raising them. Preflight the table before opening the transaction or isolate just this delete in a SAVEPOINT and re-raise non-missing-table failures.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/infrastructure/databases/hybrid/postgres/adapter.py` around lines 518 - 540, The Triplet_text DELETE is currently caught and suppressed inside the transaction which can abort and then cause session.commit() to commit nothing; change it to use a SAVEPOINT around the Triplet_text delete so only that operation is isolated: inside the async with self._graph._session() as session create a SAVEPOINT before executing the Triplet_text DELETE (the same block that uses _validate_table_name("Triplet_text") and text(...)), attempt the delete, on exception inspect the error message and if it indicates "does not exist" rollback to the SAVEPOINT and continue, otherwise rollback to the SAVEPOINT and re-raise the exception so the outer transaction fails; leave the EdgeType_relationship_name delete and final await session.commit() unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cognee/infrastructure/databases/hybrid/postgres/adapter.py`:
- Around line 263-287: vector grouping and embedding use only
DataPoint.get_embeddable_data() and string names derived via rsplit, causing
incorrect/misaligned embeddings and crashes when metadata is missing; change the
grouping key to a tuple (type(dp).__name__, field_name) and when building
valid_items use getattr(dp, field_name, None) (or a small helper that pulls the
actual field value) as the text to embed instead of
DataPoint.get_embeddable_data(), skip entries where getattr returns None, call
self._vector.embed_data with those actual texts, and when creating vector
indexes pass the explicit type_name and field_name from the tuple (not
collection.rsplit("_", 1)), and ensure you guard access to dp.metadata when
iterating index_fields.
In `@cognee/infrastructure/databases/vector/create_vector_engine.py`:
- Around line 29-46: The pghybrid branch is hand-building a PGVector DSN and
bypasses the existing pgvector credential validation, which can produce
malformed connection strings; update the unified_provider == "pghybrid" branch
to either call the shared pgvector helper (reuse the existing
_create_vector_engine or the helper that builds/validates the pgvector DSN) or
mirror its validation by verifying relational_config.db_host, db_port, db_name,
db_username, and db_password are present and raising a clear exception if any
are missing, then construct the connection_string and pass it to PGVectorAdapter
exactly as the shared helper does so the same guardrails (and error messages)
apply.
---
Duplicate comments:
In `@cognee/infrastructure/databases/hybrid/postgres/adapter.py`:
- Around line 518-540: The Triplet_text DELETE is currently caught and
suppressed inside the transaction which can abort and then cause
session.commit() to commit nothing; change it to use a SAVEPOINT around the
Triplet_text delete so only that operation is isolated: inside the async with
self._graph._session() as session create a SAVEPOINT before executing the
Triplet_text DELETE (the same block that uses
_validate_table_name("Triplet_text") and text(...)), attempt the delete, on
exception inspect the error message and if it indicates "does not exist"
rollback to the SAVEPOINT and continue, otherwise rollback to the SAVEPOINT and
re-raise the exception so the outer transaction fails; leave the
EdgeType_relationship_name delete and final await session.commit() unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 50c62b72-55c6-453c-8942-7725cfc753df
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
cognee/infrastructure/databases/graph/get_graph_engine.py
cognee/infrastructure/databases/hybrid/postgres/adapter.py
cognee/infrastructure/databases/unified/get_unified_engine.py
cognee/infrastructure/databases/vector/create_vector_engine.py
cognee/memify_pipelines/consolidate_entity_descriptions.py
cognee/tests/e2e/postgres/test_pghybrid.py
🚧 Files skipped from review as they are similar to previous changes (4)
- cognee/infrastructure/databases/graph/get_graph_engine.py
- cognee/infrastructure/databases/unified/get_unified_engine.py
- cognee/memify_pipelines/consolidate_entity_descriptions.py
- cognee/tests/e2e/postgres/test_pghybrid.py
if unified_provider == "pghybrid":
    from cognee.infrastructure.databases.relational import get_relational_config

    embedding_engine = get_embedding_engine()
    relational_config = get_relational_config()
    connection_string = (
        f"postgresql+asyncpg://{relational_config.db_username}:{relational_config.db_password}"
        f"@{relational_config.db_host}:{relational_config.db_port}"
        f"/{relational_config.db_name}"
    )

    from .pgvector.PGVectorAdapter import PGVectorAdapter

    return PGVectorAdapter(
        connection_string,
        vector_db_key,
        embedding_engine,
    )
Please keep the pghybrid override on the same validated pgvector path.
This branch now hand-builds the PGVector connection string but skips the normal pgvector credential guard in _create_vector_engine(). With incomplete relational settings it can return a malformed DSN instead of failing fast. Reusing the shared pgvector helper here—or at least mirroring the db_host/db_port/db_name/db_username/db_password validation—would keep the hybrid path predictable.
As per coding guidelines, "Prefer explicit, structured error handling in Python code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cognee/infrastructure/databases/vector/create_vector_engine.py` around lines
29 - 46, The pghybrid branch is hand-building a PGVector DSN and bypasses the
existing pgvector credential validation, which can produce malformed connection
strings; update the unified_provider == "pghybrid" branch to either call the
shared pgvector helper (reuse the existing _create_vector_engine or the helper
that builds/validates the pgvector DSN) or mirror its validation by verifying
relational_config.db_host, db_port, db_name, db_username, and db_password are
present and raising a clear exception if any are missing, then construct the
connection_string and pass it to PGVectorAdapter exactly as the shared helper
does so the same guardrails (and error messages) apply.
Non issue, relational database would have already failed if mis-specified.
@DanielNScott, that makes sense — since the relational engine is already initialised (and its config validated) before the pghybrid vector-engine path is reached, a mis-specified DSN would have surfaced earlier in the startup sequence. Fair enough, I'll defer to your judgement here.
✏️ Learnings added
Learnt from: DanielNScott
Repo: topoteretes/cognee PR: 2587
File: cognee/infrastructure/databases/vector/create_vector_engine.py:29-46
Timestamp: 2026-04-10T17:49:13.498Z
Learning: In `topoteretes/cognee`, the `pghybrid` branch in `create_vector_engine` (`cognee/infrastructure/databases/vector/create_vector_engine.py`) intentionally skips explicit credential validation when building the PGVector DSN. The relational database engine is initialised (and its config validated) earlier in the startup sequence, so any mis-specified relational config (db_host, db_port, db_name, db_username, db_password) would already have caused a failure before this code path is reached. A redundant guard here is therefore unnecessary by design.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: DanielNScott
Repo: topoteretes/cognee PR: 2584
File: cognee/infrastructure/databases/graph/postgres/adapter.py:35-44
Timestamp: 2026-04-10T03:13:45.843Z
Learning: In `topoteretes/cognee`, the factory function `get_graph_engine()` in `cognee/infrastructure/databases/graph/get_graph_engine.py` always calls `await graph_client.initialize()` (via a `hasattr` guard) before returning the adapter. This means all graph adapters, including `PostgresAdapter`, have their schema/tables created before any caller can invoke graph operations. Concerns about missing schema at runtime (e.g., "relation does not exist") do not apply to code paths that go through the factory.
Learnt from: hajdul88
Repo: topoteretes/cognee PR: 1990
File: cognee/modules/search/methods/search.py:124-127
Timestamp: 2026-01-13T12:41:17.082Z
Learning: In Python 3.6+ (PEP 530), you can use await inside asynchronous comprehensions (e.g., [await func(x) for x in items]) within async def functions. This is valid and executes sequentially. If parallel execution is desired, collect coroutines and use asyncio.gather, since awaiting inside a comprehension runs each iteration in sequence unless you explicitly parallelize.
Consolidated information on PR checks and resolution below. Mostly tweaks related to unrealistic edge cases, but easy to fix. A few actual issues were found, most notably the lockfile conflict.
CodeRabbit issues:
Assessments / resolutions:
Test failures:
Test fixes:
Problems:
- PostgresHybridAdapter missing get_neighborhood, failing instantiation
- hybrid test fixture used removed relational_engine constructor
- hybrid test fixture used __new__ hack to bypass PGVectorAdapter cache
- flaky test due to shared connection pool state across tests
Solutions:
- add get_neighborhood delegation to hybrid adapter
- use connection_string constructor for both graph and vector adapters
- add session-scoped event_loop fixture
- clean up via graph_adapter.sessionmaker instead of relational_engine
- remove flakiness note from test_add_edges_with_vectors
Reasoning and Related Notes:
- search_graph_with_distances LIMIT fix uses CTE to rank nodes before joining edges
- CodeRabbit item on get_embeddable_data returning only first field is a pre-existing DataPoint behavior, not introduced by hybrid adapter
- CodeRabbit item on pghybrid skipping pgvector credential validation is low risk since relational config is already validated
Problems:
- get_embeddable_data always returned first indexed field
- multi-field DataPoints embedded wrong text for non-first collections
Solutions:
- use getattr(dp, field_name) to embed the correct field per collection
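The per-field fix in this commit can be sketched as follows; the DataPoint class and helper name here are simplified stand-ins, not the repo's actual implementation:

```python
# Per-field grouping sketch: key each vector group by (type name, field
# name) and embed the actual field value via getattr, instead of whatever
# get_embeddable_data() returns for the first indexed field.
from collections import defaultdict

class DataPoint:
    metadata = {"index_fields": ["name", "description"]}

    def __init__(self, name, description):
        self.name = name
        self.description = description

def group_embeddables(data_points):
    groups = defaultdict(list)
    for dp in data_points:
        for field_name in (dp.metadata or {}).get("index_fields", []):
            value = getattr(dp, field_name, None)
            if value is None:
                continue  # skip missing fields instead of crashing
            # key by (type, field) so each collection embeds its own text
            groups[(type(dp).__name__, field_name)].append(value)
    return dict(groups)

dp = DataPoint("Ada", "a person entity")
print(group_embeddables([dp]))
# {('DataPoint', 'name'): ['Ada'], ('DataPoint', 'description'): ['a person entity']}
```

The `(dp.metadata or {})` guard also covers the metadata=None case flagged in the review below.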
Actionable comments posted: 1
♻️ Duplicate comments (1)
cognee/infrastructure/databases/hybrid/postgres/adapter.py (1)
270-276: ⚠️ Potential issue | 🟡 Minor
Guard against None metadata. dp.metadata is typed as Optional[MetaData] in the DataPoint model. If a DataPoint has metadata=None explicitly set, line 271 will raise AttributeError when calling .get().
🛡️ Suggested defensive fix
  for dp in data_points:
-     for field_name in dp.metadata.get("index_fields", []):
+     index_fields = (dp.metadata or {}).get("index_fields", [])
+     for field_name in index_fields:
          if getattr(dp, field_name, None) is None:
              continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/infrastructure/databases/hybrid/postgres/adapter.py` around lines 270 - 276, The loop grouping data points assumes dp.metadata is non-None and calls dp.metadata.get(...), which will raise AttributeError if metadata is None; update the loop in adapter.py (the block that iterates over data_points and builds vector_groups) to first guard dp.metadata (e.g., skip when dp.metadata is falsy) before accessing .get("index_fields", []), so DataPoint instances with metadata=None are safely skipped and vector_groups construction remains correct.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cognee/infrastructure/databases/hybrid/postgres/adapter.py`:
- Around line 73-76: The initialize method's docstring claims it initializes
"both underlying adapters" but only calls self._graph.initialize(); either
update the docstring to reflect that only the graph adapter is initialized
(e.g., "Initialize the graph adapter") or add the missing call to initialize the
vector adapter (e.g., call self._vector.initialize() or the actual vector
adapter field) so the implementation matches the docstring; edit the initialize
method's docstring and/or add the vector adapter initialization to the body
(refer to initialize, self._graph and the vector adapter attribute name in this
class).
---
Duplicate comments:
In `@cognee/infrastructure/databases/hybrid/postgres/adapter.py`:
- Around line 270-276: The loop grouping data points assumes dp.metadata is
non-None and calls dp.metadata.get(...), which will raise AttributeError if
metadata is None; update the loop in adapter.py (the block that iterates over
data_points and builds vector_groups) to first guard dp.metadata (e.g., skip
when dp.metadata is falsy) before accessing .get("index_fields", []), so
DataPoint instances with metadata=None are safely skipped and vector_groups
construction remains correct.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 33de770b-6e67-4d41-9539-1f4ec8d9c9f3
📒 Files selected for processing (2)
cognee/infrastructure/databases/hybrid/postgres/adapter.py
cognee/tests/e2e/postgres/test_postgres_hybrid_adapter.py
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/tests/e2e/postgres/test_postgres_hybrid_adapter.py
Problems:
- PGVectorAdapter.py line wrapping failed ruff-format in CI pre-commit hook

Solutions:
- apply ruff format to PGVectorAdapter.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py`:
- Line 62: The _metadata cache stored on the adapter (self._metadata) is not
invalidated after prune() calls delete_database(), causing has_collection() to
return stale results; update prune() so that after calling delete_database() you
clear or reset the cache (e.g., reassign self._metadata = MetaData() or None) so
subsequent has_collection() reflections use fresh metadata; ensure you reference
the adapter's self._metadata, the prune() method, and the delete_database() call
when making this change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: fa0c7046-6953-44bd-85e5-7ee6e5bb0418
📒 Files selected for processing (1)
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py
…ure/cog-4469-graph-aware-embeddings-integrate-joint-postgres-graph-db-and
…ing.

Problems:
- _metadata cache not invalidated after prune, has_collection returns stale True
- hybrid adapter docstring says "both" but only initializes graph adapter

Solutions:
- call self._metadata.clear() before delete_database in prune
- update docstring to reflect that vector adapter initializes lazily
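The cache-invalidation fix can be sketched with a stand-in for SQLAlchemy's `MetaData`. The real adapter reflects collection tables into `self._metadata`; everything below is a simplified illustration of the ordering, not the adapter's actual code:

```python
import asyncio

class FakeMetaData:
    """Stand-in for sqlalchemy.MetaData: a dict of reflected tables."""
    def __init__(self):
        self.tables = {}

    def clear(self):
        self.tables.clear()

class VectorAdapterSketch:
    def __init__(self):
        self._metadata = FakeMetaData()

    async def delete_database(self):
        pass  # the real adapter drops all collection tables here

    async def prune(self):
        # The fix: clear the reflection cache before dropping tables, so
        # a later has_collection() re-reflects instead of returning a
        # stale True.
        self._metadata.clear()
        await self.delete_database()

    def has_collection(self, name: str) -> bool:
        return name in self._metadata.tables

adapter = VectorAdapterSketch()
adapter._metadata.tables["docs"] = object()
print(adapter.has_collection("docs"))   # → True
asyncio.run(adapter.prune())
print(adapter.has_collection("docs"))   # → False
```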
…ure/cog-4469-graph-aware-embeddings-integrate-joint-postgres-graph-db-and
Problems:
- all postgres adapter tests fail in CI with "No module named asyncpg"
- cognee_setup action does not install postgres extra by default
- not sure how this happened, worked on previous branch? merge issue somewhere?

Solutions:
- pass extra-dependencies: postgres to cognee_setup in graph_db_tests.yml
Description
Adds a PostgreSQL-based hybrid adapter that performs graph and vector operations against a single Postgres database, enabling combined queries (JOINs across graph_node/graph_edge and vector collection tables) and single-transaction writes.
This PR builds on PR #2584 (postgres graph adapter) and integrates it with the existing PGVector adapter via a new
`PostgresHybridAdapter` in `cognee/infrastructure/databases/hybrid/postgres/`.

Key design decisions:
- `add_nodes_with_vectors` and `add_edges_with_vectors`
- `search_graph_with_distances` JOINs graph and vector tables via pgvector cosine distance
- `USE_UNIFIED_PROVIDER=pghybrid` env var activates hybrid mode
- `get_unified_engine()` assembles the hybrid adapter from graph and vector sub-adapters
- `add_data_points` routes to hybrid writes when capability is available
- `self._graph._session()` for combined writes
- `pgvector/pgvector:pg17` image for pgvector extension support

Also includes from PR #2584 merge:
- `get_neighborhood` with recursive CTE
- `node_name_filter_operator` OR/AND support in `get_nodeset_subgraph`
- `func.now()` on upsert timestamps

That is, this should include all content from previous merge tweaks, along with the independent tweaks that were on our previous version of this, #2506.
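The single-transaction write path described above can be sketched as follows. The hybrid adapter composes the two sub-adapters and runs both inserts on one graph session; the method names on the sub-adapters (`add_nodes`, `insert_vectors`) and the `session=` parameter are illustrative assumptions, not the real adapter signatures:

```python
class HybridAdapterSketch:
    """Composition, not inheritance: the graph and vector adapters are
    held side by side, avoiding diamond-problem configuration bugs that
    multiple inheritance from SQLAlchemyAdapter subclasses could cause."""

    def __init__(self, graph_adapter, vector_adapter):
        self._graph = graph_adapter
        self._vector = vector_adapter

    async def add_nodes_with_vectors(self, nodes, vectors):
        # One session, one transaction: if the vector insert fails, the
        # graph insert rolls back with it.
        async with self._graph._session() as session:
            async with session.begin():
                await self._graph.add_nodes(nodes, session=session)
                await self._vector.insert_vectors(vectors, session=session)
```

Because both writes share the session opened via `self._graph._session()`, atomicity comes from Postgres itself rather than from any cross-adapter coordination.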
Acceptance Criteria
- `PostgresHybridAdapter` implements both `GraphDBInterface` and `VectorDBInterface`
- `USE_UNIFIED_PROVIDER=pghybrid` activates hybrid mode through unified engine
- `add_data_points` routes to hybrid writes when capability is present

Type of Change
Screenshots
26 passed, 8 warnings in 7.26s (will update with full passage shortly, while CI/CD is running).
Pre-submission Checklist
CONTRIBUTING.md)DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.
Summary by CodeRabbit