
Feature/cog 4469 graph aware embeddings integrate joint postgres graph db and #2587

Merged
Vasilije1990 merged 57 commits into dev from
feature/cog-4469-graph-aware-embeddings-integrate-joint-postgres-graph-db-and
Apr 11, 2026

Conversation

@DanielNScott
Contributor

@DanielNScott DanielNScott commented Apr 10, 2026

Description

Adds a PostgreSQL-based hybrid adapter that performs graph and vector operations against a single Postgres database, enabling combined queries (JOINs across graph_node/graph_edge and vector collection tables) and single-transaction writes.

This PR builds on PR #2584 (postgres graph adapter) and integrates it with the existing PGVector adapter via a new PostgresHybridAdapter in cognee/infrastructure/databases/hybrid/postgres/.

Key design decisions:

  • Composition of PGVector and PostgresAdapters in Hybrid
    • holds a PostgresAdapter and PGVectorAdapter
    • delegates standard operations to each
    • includes hybrid logic for joint operations
  • Combined write methods for single-transaction graph+vector inserts
    • add_nodes_with_vectors and add_edges_with_vectors
  • Combined search method
    • search_graph_with_distances JOINs graph and vector tables via pgvector cosine distance
  • Unified engine integration
    • USE_UNIFIED_PROVIDER=pghybrid env var activates hybrid mode
    • get_unified_engine() assembles the hybrid adapter from graph and vector sub-adapters
    • add_data_points routes to hybrid writes when capability is available
  • Graph adapter creates its own engine from the relational connection string
  • Cypher, natural language, and Graphiti retrievers are rejected by both PostgresAdapter and PostgresHybridAdapter
  • CI updated with hybrid adapter tests and hybrid E2E test steps
    • uses pgvector/pgvector:pg17 image for pgvector extension support
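The composition pattern described above can be sketched as follows. This is a hypothetical illustration only: the class and method names (`PostgresHybridAdapter`, `add_nodes_with_vectors`) follow the PR description, but the internals and stub sub-adapters are invented for the example, and the real adapter performs both writes in a single database transaction.

```python
import asyncio


class GraphStub:
    """Stand-in for PostgresAdapter (graph operations)."""

    async def add_nodes(self, nodes):
        return ("graph", len(nodes))


class VectorStub:
    """Stand-in for PGVectorAdapter (vector operations)."""

    async def create_collection(self, name):
        return ("vector", name)


class PostgresHybridAdapter:
    """Holds both sub-adapters and delegates standard operations to each."""

    def __init__(self, graph_adapter, vector_adapter):
        self.graph = graph_adapter
        self.vector = vector_adapter

    # Delegated operations: each call goes straight to one sub-adapter.
    async def add_nodes(self, nodes):
        return await self.graph.add_nodes(nodes)

    async def create_collection(self, name):
        return await self.vector.create_collection(name)

    # Hybrid operation: in the real adapter both writes share one
    # transaction; here we just call both to show the control flow.
    async def add_nodes_with_vectors(self, nodes, collection):
        graph_result = await self.graph.add_nodes(nodes)
        vector_result = await self.vector.create_collection(collection)
        return graph_result, vector_result


adapter = PostgresHybridAdapter(GraphStub(), VectorStub())
result = asyncio.run(adapter.add_nodes_with_vectors([{"id": 1}], "docs"))
print(result)  # (('graph', 1), ('vector', 'docs'))
```

Composition (rather than inheriting from both adapters) keeps each sub-adapter's configuration isolated, which matches the diamond-problem rationale given in the commit notes below.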

Also includes from PR #2584 merge:

  • All adapter fixes verified present after merge
  • get_neighborhood with recursive CTE
  • node_name_filter_operator OR/AND support in get_nodeset_subgraph
  • Batch dedup, node tuple shape consistency, triplet ordering
  • func.now() on upsert timestamps
  • Config save/restore in E2E tests

That is, this should include all content from previous merge tweaks, along with the independent tweaks that were on our previous version of this, #2506.
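The `get_neighborhood` recursive CTE mentioned above might look roughly like the statement below. This is a hedged sketch: the table and column names (`graph_node`, `graph_edge`, `source_node_id`, `target_node_id`) are assumptions based on the PR description, not the actual schema.

```python
# Illustrative recursive CTE: walk outgoing edges from a start node up to
# a maximum depth. Bind parameters :start_id and :max_depth are assumed.
NEIGHBORHOOD_SQL = """
WITH RECURSIVE neighborhood AS (
    SELECT id, 0 AS depth
    FROM graph_node
    WHERE id = :start_id
    UNION
    SELECT e.target_node_id, n.depth + 1
    FROM graph_edge e
    JOIN neighborhood n ON e.source_node_id = n.id
    WHERE n.depth < :max_depth
)
SELECT id FROM neighborhood;
"""
```

The `UNION` (rather than `UNION ALL`) deduplicates revisited nodes, which also guarantees termination on cyclic graphs.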

Acceptance Criteria

  • PostgresHybridAdapter implements both GraphDBInterface and VectorDBInterface
  • Combined write methods perform graph+vector inserts in single transaction
  • Combined search JOINs graph and vector tables
  • USE_UNIFIED_PROVIDER=pghybrid activates hybrid mode through unified engine
  • add_data_points routes to hybrid writes when capability is present
  • 26 postgres adapter unit tests pass
  • Retriever gates cover both PostgresAdapter and PostgresHybridAdapter
  • All 123 fixes from PR #2584 (Feature/cog-4463: graph-aware embeddings, implement minimal graph DB in Postgres) verified present after merge
  • CI passes on upstream

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Code refactoring
  • Other (please specify):

Screenshots

26 passed, 8 warnings in 7.26s (will update with full passage shortly, while CI/CD is running).

Pre-submission Checklist

  • I have tested my changes thoroughly before submitting this PR (See CONTRIBUTING.md)
  • This PR contains minimal changes necessary to address the issue/feature
  • My code follows the project's coding standards and style guidelines
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if applicable)
  • All new and existing tests pass
  • I have searched existing PRs to ensure this change hasn't been submitted already
  • I have linked any relevant issues in the description
  • My commits have clear and descriptive messages

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

Summary by CodeRabbit

  • New Features

    • PostgreSQL hybrid adapter: combined graph+vector read/write, unified "pghybrid" provider, and graph search with distances.
    • Graph interface supports batch triplet retrieval.
  • Tests

    • New E2E and unit tests validating hybrid adapter and hybrid write flows.
  • Bug Fixes

    • More robust edge formatting and stricter backend guards for Cypher retrievers.
    • Improved vector metadata caching and reflection.
  • Chores

    • CI updated to run pgvector/Postgres 17 tests, DCO workflow adjusted for trusted authors, version bumped to 0.5.8.

DanielNScott and others added 30 commits March 24, 2026 12:31
Problems:
- Need to migrate basic graph functionality to relational DB

Solutions:
- New adapter implementing GraphDBInterface
- Selection branch included in get_graph_engine.py

Decisions:
- re-use SQLAlchemy setup and configuration
- Remove .query() (raw Cypher) functionality
Problems:
- Cypher search retriever exposes raw cypher interface via .query()
- Natural language retriever writes, uses Cypher via .query()
- Temporal awareness (Graphiti) subsystem directly uses cypher, neo4j

Solutions:
- Raise incompatibility errors proactively across these consumers

Decisions:
- None of these are core functionality, cost should be low
- Graphiti already has unguarded hard-coding of multiple configs...
- Complex cypher search cannot be replaced in postgres at present
Problems:
- asyncpg misparses ::jsonb cast as named parameter
- prune_system drops graph tables via delete_database()
- cached adapter instance has no tables after prune
- no CI coverage for postgres graph backend
- no unit or e2e tests for postgres adapter

Solutions:
- replace ::jsonb with CAST(:param AS jsonb) throughout
- add _ensure_initialized() calling idempotent CREATE IF NOT EXISTS
- guard is_empty and delete_graph with _ensure_initialized
- add run-postgres-tests job to graph_db_tests.yml CI workflow
- add 26 unit tests against real Postgres
- add shared e2e test and postgres wrapper

Decisions:
- _ensure_initialized over defensive checks on every method
- real Postgres in tests, not SQLite (dialect differences block it)
- shared test function for future backend reuse (test_graphdb_shared)
- CI uses Postgres 16 service container, matching Kuzu/Neo4j pattern
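The `::jsonb` fix in this commit can be illustrated as below. The statement text and parameter names are made up for the example; the point is the syntactic rewrite asyncpg needs, since its parser can mistake `:param::jsonb` for a malformed named parameter.

```python
# Before: asyncpg can misparse the "::jsonb" shorthand cast after a
# named bind parameter. After: an explicit CAST avoids the ambiguity.
UNSAFE = "INSERT INTO graph_node (properties) VALUES (:props::jsonb)"
SAFE = "INSERT INTO graph_node (properties) VALUES (CAST(:props AS jsonb))"


def use_explicit_cast(sql: str, param: str) -> str:
    """Rewrite ':param::jsonb' to 'CAST(:param AS jsonb)'."""
    return sql.replace(f":{param}::jsonb", f"CAST(:{param} AS jsonb)")


print(use_explicit_cast(UNSAFE, "props") == SAFE)  # True
```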
Problems:
- some white-space and wrapping non-compliance

Solutions:
- fix via ruff
Problems:
- want joint graph and vector interaction in SQL to the extent possible
- currently, graph and vector writes are separate transactions
- search requires two round-trips, vector search then graph search

Solutions:
- add PostgresHybridAdapter composing PostgresAdapter and PGVectorAdapter
- implement combined write methods with single-transaction atomicity
   - implement batching to avoid loops on session.execute calls later
- implement combined search by joining graph and vector tables
- exceptions in get_graph_engine and create_vector_engine if given pghybrid
- add pghybrid construction in get_unified_engine

Decisions:
- composition rather than inheritance on subclasses of SQLAlchemyAdapter
   - otherwise subtle configuration bugs can occur (diamond problem)
- hybrid only allowed via get_unified_engine, not individual factories
- per-row inserts for now, batch optimization deferred
- table name validation via regex to prevent SQL injection
Problems:
- pghybrid delegates query() to PostgresAdapter which raises generic error
- isinstance checks only catch PostgresAdapter, not PostgresHybridAdapter
- users would see NotImplementedError instead of descriptive message

Solutions:
- add PostgresHybridAdapter to isinstance checks in three files
- cypher search, natural language, and graphiti now guard both adapters
Problems:
- no test coverage for hybrid adapter

Solutions:
- add unit tests covering graph, vector, combined, and content paths
- content integrity test verifies IDs and payloads across both tables
- tests use local SQLAlchemyAdapter to avoid stale event loop issues
- fixture teardown cleans up vector collection tables between tests

Decisions:
- use uuid5 for test IDs because pgvector tables require UUID columns
- bypass PGVectorAdapter constructor to share connection pool with graph
- e2e and CI deferred because cognify pipeline calls get_graph_engine directly
Problems:
- 36+ graph and 60+ vector call sites block pghybrid error guard
- hybrid adapter missing embedding_engine attribute for indexing
- dotenv override=True in cognee init overwrites test env vars
- cached configs survive env var changes across test boundaries

Solutions:
- add USE_UNIFIED_PROVIDER env var read at top of both factories
- get_graph_engine returns PostgresAdapter when flag is pghybrid
- create_vector_engine returns PGVectorAdapter when flag is pghybrid
- get_unified_engine reads flag for hybrid detection and construction
- expose embedding_engine on hybrid adapter from vector adapter
- e2e test sets env vars after import then clears all five caches

Decisions:
- explicit env var flag over auto-detection from two provider configs
- each factory returns independent component, not shared hybrid instance
- env vars set after cognee import to survive dotenv override=True
- hybrid combined queries only available through get_unified_engine
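The explicit-flag pattern described in this commit can be sketched as a small factory check. This is illustrative only: the real factories return `PostgresAdapter` / `PGVectorAdapter` instances rather than strings, and the fallback provider name here is an assumption.

```python
import os


def get_graph_engine_provider() -> str:
    """Read USE_UNIFIED_PROVIDER at the top of the factory, as the
    commit describes, before falling back to the normal provider config."""
    if os.environ.get("USE_UNIFIED_PROVIDER") == "pghybrid":
        # Hybrid mode: the graph factory hands back the Postgres component.
        return "postgres-graph"
    return os.environ.get("GRAPH_DATABASE_PROVIDER", "kuzu")


os.environ["USE_UNIFIED_PROVIDER"] = "pghybrid"
print(get_graph_engine_provider())  # postgres-graph
```

Reading the flag inside the function (rather than at import time) matters for the test-isolation problems listed above: cached module-level reads would survive env var changes across test boundaries.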
Problems:
- Inadequate batching
   - add_nodes loops N awaits, one INSERT round-trip each
   - add_edges same per-row loop pattern
   - has_edges loops N SELECT EXISTS queries
- raw text() SQL construction, not SQLAlchemy

Solutions:
- multi-row insert instead of looping
   - add_nodes uses pg_insert with multi-row VALUES batch
   - add_edges uses pg_insert with multi-row VALUES batch
   - has_edges uses SQLAlchemy values() with EXISTS join

- added SQLAlchemy Core table definitions for DML
   - will remove next commit to maintain single source of truth

Decisions:
- maximum batching optimization
   - multi-row INSERT over executemany for single-statement execution
   - executemany pipelines but still N server-side executions
   - looped awaits worst, executemany middle, VALUES best batching

- chose SQLAlchemy Core over raw text() for proactive safety
   - text() with bind params was safe but not project convention
   - SQLAlchemy eliminates all raw SQL string construction

- keep problematic reproduction of DDL logic for commit specificity
   - will fix in follow up
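The row-shaping half of the batching change above can be sketched in pure Python: collapse the per-row loop into one list of parameter dicts, deduplicated by id, which would then feed a single multi-row statement (e.g. `sqlalchemy.dialects.postgresql.insert(table).values(rows)`). Field names are illustrative, not the adapter's actual schema.

```python
def build_node_rows(nodes):
    """Dedup by node id (last write wins) and shape one row per node so
    all inserts go out in a single multi-row VALUES statement."""
    deduped = {}
    for node_id, properties in nodes:
        deduped[node_id] = {"id": node_id, "properties": properties}
    return list(deduped.values())


rows = build_node_rows([("a", {"x": 1}), ("b", {"y": 2}), ("a", {"x": 3})])
print(len(rows))               # 2: duplicate "a" collapsed
print(rows[0]["properties"])   # {'x': 3}: last write wins
```

Deduplication before the insert is what makes the single-statement `ON CONFLICT` path well-defined: Postgres rejects a multi-row upsert that touches the same key twice in one command.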
Problems:
- raw DDL strings and Core table objects duplicated schema

Solutions:
- replaced _SCHEMA_DDL list with Core Table and Index definitions
- initialize() now uses _meta.create_all() instead of raw DDL
- removed asyncio import (unused)
- replaced postgresql TIMESTAMP with generic DateTime(timezone=True)
Note:
- All problems below were cosmetic or stylistic, not substantive.

Problems:
- SQL injection via unvalidated attribute names in get_filtered_graph_data
- empty attribute_filters can produce invalid SQL syntax
- local "values" var shadows sqlalchemy values import in add_edges
- JSON serialize then deserialize indirection in add_nodes
- get_nodeset_subgraph issues four sequential DB queries
- get_filtered_graph_data issued two sequential DB queries
- public methods missing docstrings, unlike Neo4j and Kuzu adapters
- section banners inconsistent with other adapters
- narrating inline comments restate what code already says
- python-version input missing from graph_db_tests workflow
- SearchTypeNotSupported imported inline instead of at module level

Solutions:
- add attribute whitelist, raise ValueError for invalid filter keys
- fall back to get_graph_data when attribute_filters is empty
- rename local values to rows in add_edges
- inline property dict construction, remove _split_core_and_extra
- combine get_nodeset_subgraph into single CTE query
- combine get_filtered_graph_data into single CTE query
- add Parameters/Returns docstrings to all public methods
- remove section banners throughout adapter
- trim narrating comments, keep complex clarifying ones
- add python-version input with default 3.11 to workflow_call
- move SearchTypeNotSupported to module-level import

Reasoning and Related Notes:
- prune_system metadata=True drops relational DB including graph tables
   - _ensure_initialized kept defensive as a result
- JSON round-trip in add_nodes retained for UUID and datetime serialization
   - specifically, enforces JSONB compatibility (postgres wants python dict)
   - alternative is in-place modification
   - adding rules for doing this would replicate JSONEncoder
   - JSONEncoder outputs strings, so we need to convert back if used
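The JSON round-trip reasoning above can be shown concretely: dump with a default encoder that stringifies UUIDs and datetimes, then load back to a plain dict that the JSONB bind parameter accepts. A minimal sketch, using `default=str` as a stand-in for whatever encoder the adapter actually uses.

```python
import datetime
import json
import uuid


def jsonb_compatible(properties: dict) -> dict:
    """Round-trip through JSON so UUIDs/datetimes become strings and the
    result is a plain dict, which is what a JSONB parameter wants."""
    return json.loads(json.dumps(properties, default=str))


props = {
    "id": uuid.UUID("12345678-1234-5678-1234-567812345678"),
    "created": datetime.datetime(2026, 3, 24, 12, 31),
}
out = jsonb_compatible(props)
print(out["id"])        # 12345678-1234-5678-1234-567812345678
print(out["created"])   # 2026-03-24 12:31:00
```

As the notes say, the alternative is in-place conversion, but that amounts to re-implementing the encoder's type rules by hand.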
Problems:
- tuple merge order lets properties dict override authoritative node ID
- __init__ and _session missing type annotations
- magic number 6 in test_graphdb_shared unexplained
- raised but incorrect: add weighted edges test

Solutions:
- swap merge order so tuple ID always wins
- add type annotations to __init__ and _session
- add comment explaining 3 searches x 2 entries = 6
- no duplication of weighted edges test necessary or appropriate

Reasoning and Related Notes:
- weighted edges test is backend-agnostic, exercises get_graph_from_model only
   - running it under Postgres job would not test adapter round-trip
   - no backend actually has a round-trip weighted edges test
   - therefore kuzu and neo4j cases are incorrect and duplicates

- recursive CTE performance in get_graph_metrics is not part of pipeline
   - Kuzu returns -1 for components, Neo4j uses GDS plugin
…al-graphdb-in-postgres' into feature/cog-4469-graph-aware-embeddings-integrate-joint-postgres-graph-db-and
Problems:
- postgres tests in unit/ fail without live database in CI
- integration/ is auto-swept by CI which lacks postgres service
- Igor requested separate postgres test directory and github action

Solutions:
- created cognee/tests/e2e/postgres/ for service-dependent tests
- moved 5 test files from unit/ and tests/ root to e2e/postgres
- updated imports to reflect new module paths
- updated graph_db_tests.yml to reference new file locations

Architectural Trade-Off Decisions:
- chose e2e/ over integration/ to avoid automatic CI sweep
- graph_db_tests.yml invokes explicitly so tests still run in CI

Reasoning and Related Notes:
- dexters1 (Igor) requested own github action for postgres tests
- all 39 tests pass locally against live postgres
Problems:
- get_triplets_batch ordering unstable with duplicate source/target pairs
- None embeddable data passed to embedding engine can cause failures
- bare except in Triplet_text deletion can hide real database errors
- pghybrid DSN built from None config fields gives unhelpful errors
- graphiti guard only blocks postgres, not kuzu or other non-neo4j backends

Solutions:
- added relationship_name to ORDER BY for deterministic pagination
- filtered None values from get_embeddable_data before calling embed_data
- narrowed except to log and distinguish missing-table from real errors
- added config validation before DSN construction in get_unified_engine
- replaced postgres blocklist with positive isinstance Neo4jAdapter check
Problems:
- edge property test assertion used or, passing on partial data
- hybrid search results built from fragile positional row indexes
- hybrid adapter __init__ lacked type hints on parameters
- get_graph_engine error message exceeded 100-char line limit

Solutions:
- split or assertion into two separate asserts for weight and edge_text
- switched to result.mappings() for named column access in hybrid search
- added TYPE_CHECKING imports and type annotations to __init__
- extracted provider list to variable before f-string interpolation
…face.

Problems:
- get_triplets_batch not declared on interface, callers use hasattr checks
- Neptune adapter does not implement it and cannot be required to

Solutions:
- added get_triplets_batch to GraphDBInterface as non-abstract optional method
- default raises NotImplementedError following existing feedback_weights pattern
- docstring documents which adapters implement it
Problems:
- CI had no coverage for pghybrid hybrid adapter or e2e path
- postgres:16 image lacks pgvector extension needed by hybrid tests
- TEST_DATA_DIR resolved to wrong path after file move to e2e/postgres
- hybrid adapter search wrapper missing node_name_filter_operator param
- combined_write stored belongs_to_set as null instead of empty list
- nodeset filtering test failed because entities lack belongs_to_set

Solutions:
- added hybrid adapter and pghybrid e2e steps to graph_db_tests.yml
- upgraded CI service image to pgvector/pgvector:pg17
- fixed TEST_DATA_DIR to resolve relative to tests/ root via parents[2]
- added node_name_filter_operator passthrough in hybrid search wrapper
- normalized belongs_to_set to empty list in combined_write payload
- disabled nodeset test with TODO noting cross-backend pipeline gap

Reasoning and Related Notes:
- nodeset gap affects all backends, not a postgres-specific issue
- entities from extract_graph_from_data never inherit belongs_to_set
Problems:
- add_data_points split unified engine into separate graph and vector calls
- hybrid adapter's combined_write methods weren't called from the pipeline
- vector payload format did not match PGVectorAdapter's IndexSchema format

Solutions:
- added HYBRID_WRITE capability check to route nodes and edges atomically
- add_nodes_with_vectors and add_edges_with_vectors now called when hybrid
- non-hybrid backends fall back to existing separate graph and vector calls
- payload now uses IndexSchema and serialize_data identical to PGVectorAdapter
- updated test assertions to match IndexSchema payload format

Reasoning and Related Notes:
- Import and re-use IndexSchema to keep linkage with PGVectorAdapter
Problems:
- hybrid joint-add edge type payloads differed from PGVectorAdapter
- no test verified that hybrid and separate paths produce identical payloads

Solutions:
- hybrid edge type payloads now use IndexSchema and serialize_data
- added test inserting a 3-node 3-edge graph via both paths and comparing
- test asserts identical vector payload keys and values for nodes and edges

Reasoning and Related Notes:
- hybrid and separate paths do not share serialization code yet
- test will catch drift if either path changes independently
- didn't want to change PGVectorAdapter in this PR
…t routing.

Problems:
- hybrid adapter CI step missing embedding API secrets
- get_nodeset_subgraph override drops node_name_filter_operator parameter
- vector ON CONFLICT DO NOTHING leaves stale embeddings on reprocess
- delete_nodes_with_vectors docstring promised all-collection deletion
- hybrid path inserted triplets as graph nodes instead of vector-only

Solutions:
- added EMBEDDING_* env vars to hybrid adapter CI step
- added node_name_filter_operator to get_nodeset_subgraph matching interface
- changed both node and edge vector inserts to ON CONFLICT DO UPDATE
- corrected docstring to reflect graph-only behavior when collections is None
- triplets now always use vector-only index_data_points regardless of backend
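The upsert change in this commit (DO NOTHING left stale embeddings on reprocess, so it becomes DO UPDATE) has roughly this shape. Table and column names are assumptions for illustration.

```python
# Before the fix, a conflicting id was silently skipped (DO NOTHING),
# leaving the old embedding in place. DO UPDATE overwrites it instead.
UPSERT_SQL = """
INSERT INTO {collection} (id, vector, payload)
VALUES (:id, :vector, :payload)
ON CONFLICT (id) DO UPDATE
SET vector = EXCLUDED.vector,
    payload = EXCLUDED.payload;
"""
print("DO UPDATE" in UPSERT_SQL)  # True
```

`EXCLUDED` is Postgres's alias for the row that was proposed for insertion, so the update always carries the freshly computed embedding.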
Problems:
- hybrid per-row inserts deadlock under concurrent cognify writes
- PGVectorAdapter reflects full schema on every search and has_collection call
- get_unified_engine creates a new PGVectorAdapter per call for pghybrid

Solutions:
- batch all hybrid inserts into one statement per table
- cache reflected table metadata on PGVectorAdapter instance
- hybrid adapter reuses cached PGVectorAdapter via get_vector_engine()
- targeted reflect(only=[name]) on cache miss instead of full reflect

Architectural Trade-Off Decisions:
- reuse get_vector_engine requires VECTOR_DB_PROVIDER=pgvector for pghybrid
- accepted over adapter-level cache dict for consistency with other backends

Reasoning and Related Notes:
- deadlocks occurred in 2 of 5 benchmark iterations before batching fix
- pgv_vector_get_table dropped from 33s to 0s with metadata cache
- hybrid_write dropped from 30s to 14s with batched inserts
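The metadata-cache pattern behind the 33s-to-0s improvement above can be sketched as a cache-miss wrapper: check the cached tables first and reflect only the missing table on a miss, instead of reflecting the whole schema on every call. Names are illustrative; the real code caches a SQLAlchemy `MetaData` on the adapter instance.

```python
class TableCache:
    """Cache reflected table objects; reflect one table on a miss."""

    def __init__(self, reflect_one):
        self._tables = {}
        self._reflect_one = reflect_one  # callable: name -> table object
        self.reflect_calls = 0

    def get_table(self, name):
        if name not in self._tables:  # cache miss: targeted reflection
            self.reflect_calls += 1
            self._tables[name] = self._reflect_one(name)
        return self._tables[name]


cache = TableCache(lambda name: f"<table {name}>")
cache.get_table("docs")
cache.get_table("docs")
print(cache.reflect_calls)  # 1: second lookup served from the cache
```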
…mbeddings-integrate-joint-postgres-graph-db-and' into feature/cog-4469-graph-aware-embeddings-integrate-joint-postgres-graph-db-and

Take upstream version on everything
…ph-aware-embeddings-integrate-joint-postgres-graph-db-and

Same set of conflicts as previously... issue was dev merge order across
branches.
@pull-checklist

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

@github-actions
Contributor

Hello @DanielNScott, thank you for submitting a PR! We will respond as soon as possible.

@coderabbitai
Contributor

coderabbitai bot commented Apr 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a Postgres hybrid adapter and wiring so graph and vector functionality can share a Postgres/pgvector backend; updates engine factories, adapters, retrievers, write paths, tests, CI workflows, and a DCO check to support the new pghybrid unified provider and hybrid write capability.

Changes

  • Workflows & metadata (.github/workflows/approve_dco.yaml, .github/workflows/graph_db_tests.yml, pyproject.toml): DCO check now uses PR author_association; CI Postgres job image switched to pgvector/pgvector:pg17 and new pghybrid e2e steps added; project version bumped to 0.5.8.
  • Hybrid adapter implementation (cognee/infrastructure/databases/hybrid/postgres/adapter.py): New PostgresHybridAdapter composing PostgresAdapter + PGVectorAdapter: delegated graph/vector APIs, hybrid transactional write/delete operations, combined graph+vector search with distances, and SQL identifier validation.
  • Engine factories / provider overrides (cognee/infrastructure/databases/graph/get_graph_engine.py, cognee/infrastructure/databases/vector/create_vector_engine.py, cognee/infrastructure/databases/unified/get_unified_engine.py): Added USE_UNIFIED_PROVIDER=pghybrid early-return/override paths; factories can directly construct Postgres-backed adapters; unified factory now async-creates/awaits PostgresHybridAdapter.
  • Adapters & interfaces (cognee/infrastructure/databases/graph/graph_db_interface.py, cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py): Graph interface: added async get_triplets_batch declaration (raises NotImplementedError by default). PGVectorAdapter: persistent MetaData cache, per-table reflection, improved reflection error handling, and post-create reflection caching.
  • Task & pipeline changes (cognee/tasks/storage/add_data_points.py, cognee/memify_pipelines/consolidate_entity_descriptions.py, cognee/tasks/temporal_awareness/index_graphiti_objects.py): Storage writes now branch to hybrid methods when EngineCapability.HYBRID_WRITE present; edge formatting broadened to handle multiple tuple shapes; Neo4j-only checks replaced by runtime Neo4jAdapter type validation.
  • Retrievers (cognee/modules/retrieval/cypher_search_retriever.py, cognee/modules/retrieval/natural_language_retriever.py): Broadened Postgres backend guards to treat PostgresHybridAdapter like PostgresAdapter and raise SearchTypeNotSupported for raw Cypher paths.
  • E2E & unit tests (cognee/tests/e2e/postgres/test_pghybrid.py, cognee/tests/e2e/postgres/test_postgres_hybrid_adapter.py, cognee/tests/unit/tasks/storage/test_add_data_points.py): Added pghybrid e2e entrypoint and comprehensive PostgresHybridAdapter e2e tests; unit tests updated to assert HYBRID_WRITE routing and hybrid-method usage.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • topoteretes/cognee#2223: earlier unified engine/hybrid wiring changes that overlap provider/engine selection and capability plumbing.
  • topoteretes/cognee#2584: touches Postgres graph adapter and engine-factory logic related to pghybrid wiring.
  • topoteretes/cognee#2543: modifies format_edges, overlapping edge-tuple parsing changes.

Suggested labels

core-team

Suggested reviewers

  • hajdul88
  • lxobr
  • borisarzentar
🚥 Pre-merge checks: 1 passed, 2 failed

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 32.22%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Title check — ❓ Inconclusive: the title is vague and uses a truncated feature name that doesn't clearly convey the primary change of adding a PostgreSQL hybrid adapter. Resolution: revise the title, e.g. 'Add PostgreSQL hybrid adapter for combined graph and vector operations' or 'Implement PostgresHybridAdapter for joint graph-vector queries'.

✅ Passed checks (1 passed)

  • Description check — ✅ Passed: the description is comprehensive and follows the template structure, with a clear explanation of the feature, acceptance criteria, type of change, and pre-submission checklist.



@gitguardian

gitguardian bot commented Apr 10, 2026

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request

  • GitGuardian id 9573981 — status: Triggered — type: Generic Password — commit 4347a95 — file: .github/workflows/graph_db_tests.yml
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely, following best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


Problems:
- unified engine mock returned truthy AsyncMock for has_capability
- add_data_points routed to hybrid writes, skipping mocked add_nodes

Solutions:
- set has_capability to return False in unified engine mock
Problems:
- unified engine mock returned truthy AsyncMock for has_capability
- add_data_points routed to hybrid writes, skipping mocked add_nodes
- no test coverage for hybrid write path

Solutions:
- set has_capability to return False in non-hybrid mock
- add test_add_data_points_hybrid_write_path verifying hybrid routing
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)

173-180: ⚠️ Potential issue | 🔴 Critical

Populate the metadata cache only after the DDL transaction commits.

has_collection() and get_table() now trust self._metadata.tables without taking VECTOR_DB_LOCK. Because Line 178 reflects the new table into that cache before the surrounding engine.begin() block commits, a concurrent reader can observe the table as "present" and then fail with relation does not exist on its own session. Delay the cache fill until after the CREATE TABLE transaction exits, or make readers participate in the same lock.

🔧 Minimal fix
+                    created = False
                     async with self.engine.begin() as connection:
                         if len(Base.metadata.tables.keys()) > 0:
                             await connection.run_sync(
                                 Base.metadata.create_all, tables=[PGVectorDataPoint.__table__]
                             )
-                            await connection.run_sync(
-                                self._metadata.reflect, only=[collection_name]
-                            )
+                            created = True
+
+                    if created:
+                        async with self.engine.begin() as connection:
+                            await connection.run_sync(
+                                self._metadata.reflect, only=[collection_name]
+                            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py` around
lines 173 - 180, The metadata cache is being populated inside the CREATE TABLE
transaction (via self._metadata.reflect) which can expose the new table to
readers (has_collection()/get_table()) before the DDL commit and cause "relation
does not exist" errors; fix by moving the self._metadata.reflect call out of the
async with self.engine.begin() block (or alternatively acquire VECTOR_DB_LOCK
around the reflect) so the cache is updated only after the transaction commits;
update PGVectorAdapter to call Base.metadata.create_all (or keep it) inside the
transaction but perform self._metadata.reflect (or any cache fill) after the
context exits so readers observing self._metadata.tables see only committed
tables.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/approve_dco.yaml:
- Around line 23-27: The trustedRoles array currently includes 'CONTRIBUTOR',
which incorrectly treats GitHub author_association "CONTRIBUTOR" as a trusted
role; remove 'CONTRIBUTOR' from the trustedRoles declaration so only OWNER,
MEMBER, and COLLABORATOR are considered trusted, and ensure any logic using
isTrusted (computed from trustedRoles.includes(authorAssociation)) continues to
enforce DCO checks for contributors; also update the nearby comment if needed to
match the corrected trustedRoles list.

In `@cognee/infrastructure/databases/graph/get_graph_engine.py`:
- Around line 98-108: The pghybrid provider check (USE_UNIFIED_PROVIDER ==
"pghybrid") must be moved out of the `@lru_cache`'d _create_graph_engine() so the
choice isn't frozen by the cache; instead, read and evaluate
os.environ["USE_UNIFIED_PROVIDER"] before calling _create_graph_engine(), and
when pghybrid is requested resolve get_relational_engine() immediately and
validate the relational backend is PostgreSQL (failing fast with a clear error
if not) and then construct
PostgresAdapter(connection_string=get_relational_engine().db_uri) outside the
cached function; keep _create_graph_engine() purely responsible for creating
graph engines for a given resolved provider.

In `@cognee/infrastructure/databases/hybrid/postgres/adapter.py`:
- Around line 595-610: The current SQL in the node_id is None branch (the SQL
using gn, vc and LEFT JOIN graph_edge ge) applies ORDER BY distance and LIMIT
after the graph_edge fan-out, causing duplicate rows per node; fix by first
selecting the top N nodes ranked by distance from the {table} vector column
(i.e., produce a subquery of node ids and their distance by joining vc to
graph_node and applying ORDER BY distance LIMIT :limit), then join that limited
set to graph_node and LEFT JOIN graph_edge to attach edge context (or
alternatively use DISTINCT ON (gn.id) ordered by distance) so you return one
representative edge-context row per ranked node instead of limiting post-fanout.
- Around line 518-540: The current transaction opens via async with
self._graph._session() and then attempts DELETE on Triplet_text which, if the
table is missing, marks the whole transaction as failed even though the
exception is caught; fix by isolating the Triplet_text delete in a SAVEPOINT
inside that session (execute "SAVEPOINT sp_triplet", run the DELETE, on
relation-missing error ROLLBACK TO SAVEPOINT sp_triplet and suppress only that
case, otherwise re-raise, then RELEASE SAVEPOINT sp_triplet) or alternatively
probe for the table existence before entering the transaction (use a prior query
against pg_catalog to check Triplet_text) and skip the DELETE entirely if
absent; apply this change around the Triplet_text block that uses
session.execute and await session.commit() so earlier deletes are not rolled
back by the missing-table error.

In `@cognee/infrastructure/databases/unified/get_unified_engine.py`:
- Around line 12-17: The USE_UNIFIED_PROVIDER override in _is_hybrid_provider
currently treats any non-empty value as hybrid; change it to only enable hybrid
when the env value matches a supported hybrid provider. Inside
_is_hybrid_provider, read os.environ.get("USE_UNIFIED_PROVIDER", "") ->
normalise (strip and lower) and check membership against a defined set like
supported_hybrid_providers = {"<providerA>", "<providerB>"} (replace with your
actual supported hybrid identifiers); return True only if the normalised env
value is in that set, otherwise fall through to the existing
graph_config/vector_config logic. Ensure you reference the same function name
_is_hybrid_provider and keep the os import/location as in the diff.

In `@cognee/infrastructure/databases/vector/create_vector_engine.py`:
- Around line 78-98: The pghybrid override is evaluated inside the cached
_create_vector_engine() call so changing USE_UNIFIED_PROVIDER or Postgres
settings can return a stale adapter and bypass pgvector validation; fix by
resolving the unified_provider (read USE_UNIFIED_PROVIDER and, if pghybrid, call
get_relational_config and construct the connection_string) before invoking the
cached factory and then pass the resolved mode/value into
_create_vector_engine's cache key (e.g., add a unified_provider or provider_mode
parameter to _create_vector_engine) so the LRU cache differentiates pghybrid vs
default and retains the normal pgvector validation path (use PGVectorAdapter
only when the resolved mode indicates pghybrid).

In `@cognee/memify_pipelines/consolidate_entity_descriptions.py`:
- Around line 69-83: The code indexes edge[1]/edge[2] in the consolidation logic
without validating the shape, which can raise IndexError on short/malformed
edges; update the branch in consolidate_entity_descriptions.py to first check
the arity and types (e.g., ensure isinstance(edge, (list, tuple)) and len(edge)
>= 3) before any positional indexing, and on failure either skip the record
(continue) or handle it via a clear fallback path that avoids direct indexing;
preserve assignment to target_id and rel_name (the existing symbols) only after
the guard so result[target_id] = rel_name is safe.

In `@cognee/tasks/temporal_awareness/index_graphiti_objects.py`:
- Line 25: Wrap the top-level import of Neo4jAdapter in a try-except block so an
ImportError from the optional neo4j dependency is caught and re-raised as a
RuntimeError with actionable guidance; specifically, locate the line importing
Neo4jAdapter from cognee.infrastructure.databases.graph.neo4j_driver.adapter and
replace it with a guarded import that catches ImportError and raises
RuntimeError suggesting installing the optional neo4j dependency and referencing
the project docs or extras (e.g., "pip install ...") so the module fails with a
clear, actionable message instead of leaking raw ImportError.

In `@cognee/tests/e2e/postgres/test_pghybrid.py`:
- Around line 19-42: Move the module-level environment mutations and
cache-clears into a controlled setup function (e.g., main() or a pytest fixture)
so they run only when the script is intentionally invoked; specifically,
relocate the os.environ assignments for USE_UNIFIED_PROVIDER,
ENABLE_BACKEND_ACCESS_CONTROL, DB_PROVIDER, DB_HOST, DB_PORT, DB_USERNAME,
DB_PASSWORD, DB_NAME and the subsequent cache_clear() calls on
get_relational_config, get_graph_config, create_relational_engine,
_create_graph_engine, and _create_vector_engine into that setup function/fixture
and call it from main() (or register as a fixture) instead of running at import
time.

In `@cognee/tests/e2e/postgres/test_postgres_hybrid_adapter.py`:
- Around line 425-432: The test test_add_edges_with_vectors is flaky because it
relies on leftover PGVectorAdapter connection-pool state; make it deterministic
by isolating collection state per run: modify the test (or the adapter fixture)
to use a unique collection/schema name (e.g., append a UUID or timestamp) or
ensure a deterministic teardown that drops the vector table/collection and
resets the PGVectorAdapter connection pool before and after the test (or
recreate the adapter instance per test). Update the test_add_edges_with_vectors
(and/or the adapter fixture) to create and cleanup a dedicated collection/schema
or call the adapter reset method so no state is shared across tests.
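Several of the prompts above hinge on an arity guard before positional indexing. A minimal sketch of that guard (function and variable names here are illustrative, not the actual consolidate_entity_descriptions.py code):

```python
# Hypothetical sketch of an arity guard before positional edge indexing.
# Assumed edge shape: (source_id, target_id, relationship_name, ...).

def collect_relationship_names(edges):
    """Map target node id -> relationship name, skipping malformed edges."""
    result = {}
    for edge in edges:
        # Guard: only index positionally once shape and type are verified.
        if not isinstance(edge, (list, tuple)) or len(edge) < 3:
            continue  # skip short/malformed records instead of raising IndexError
        target_id, rel_name = edge[1], edge[2]
        result[target_id] = rel_name
    return result

edges = [
    ("n1", "n2", "mentions"),
    ("broken",),                            # too short: skipped
    "not-an-edge",                          # wrong type: skipped
    ["n3", "n4", "cites", {"weight": 1}],   # extra payload is fine
]
print(collect_relationship_names(edges))  # → {'n2': 'mentions', 'n4': 'cites'}
```

The guard-then-assign ordering matters: target_id and rel_name are bound only after the check, so the dict write can never raise on a short record.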

---

Outside diff comments:
In `@cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py`:
- Around line 173-180: The metadata cache is being populated inside the CREATE
TABLE transaction (via self._metadata.reflect) which can expose the new table to
readers (has_collection()/get_table()) before the DDL commit and cause "relation
does not exist" errors; fix by moving the self._metadata.reflect call out of the
async with self.engine.begin() block (or alternatively acquire VECTOR_DB_LOCK
around the reflect) so the cache is updated only after the transaction commits;
update PGVectorAdapter to call Base.metadata.create_all (or keep it) inside the
transaction but perform self._metadata.reflect (or any cache fill) after the
context exits so readers observing self._metadata.tables see only committed
tables.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aa9642a5-a4bd-4528-81d8-c6fb681a12cc

📥 Commits

Reviewing files that changed from the base of the PR and between da9db6c and 71d9b47.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (17)
  • .github/workflows/approve_dco.yaml
  • .github/workflows/graph_db_tests.yml
  • cognee/infrastructure/databases/graph/get_graph_engine.py
  • cognee/infrastructure/databases/graph/graph_db_interface.py
  • cognee/infrastructure/databases/hybrid/postgres/__init__.py
  • cognee/infrastructure/databases/hybrid/postgres/adapter.py
  • cognee/infrastructure/databases/unified/get_unified_engine.py
  • cognee/infrastructure/databases/vector/create_vector_engine.py
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py
  • cognee/memify_pipelines/consolidate_entity_descriptions.py
  • cognee/modules/retrieval/cypher_search_retriever.py
  • cognee/modules/retrieval/natural_language_retriever.py
  • cognee/tasks/storage/add_data_points.py
  • cognee/tasks/temporal_awareness/index_graphiti_objects.py
  • cognee/tests/e2e/postgres/test_pghybrid.py
  • cognee/tests/e2e/postgres/test_postgres_hybrid_adapter.py
  • pyproject.toml

Comment on lines +23 to +27
// 1. Check if user is an org member via author_association
// (available on the PR payload without extra API calls or permissions)
// OWNER, MEMBER, and COLLABORATOR are trusted roles.
const trustedRoles = ['OWNER', 'MEMBER', 'COLLABORATOR', 'CONTRIBUTOR'];
const isTrusted = trustedRoles.includes(authorAssociation);
Contributor

@coderabbitai coderabbitai bot Apr 10, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In GitHub pull_request payloads, what does author_association = "CONTRIBUTOR" mean, and is it considered equivalent to OWNER/MEMBER/COLLABORATOR for trust decisions?

💡 Result:

Web search failed: Server error: no LLM provider could handle the message


🏁 Script executed:

# First, locate and read CONTRIBUTING.md to understand DCO requirements
find . -name "CONTRIBUTING.md" -type f | head -5

Repository: topoteretes/cognee

Length of output: 79


🌐 Web query:

GitHub pull request author_association field values meaning OWNER MEMBER COLLABORATOR CONTRIBUTOR documentation

💡 Result:

In GitHub’s author_association field (an enum named CommentAuthorAssociation), these values mean:

  • OWNER: Author is the owner of the repository. [1]
  • MEMBER: Author is a member of the organization that owns the repository. [1]
  • COLLABORATOR: Author has been invited to collaborate on the repository. [1]
  • CONTRIBUTOR: Author has previously committed to the repository. [1]

Source: GitHub GraphQL enum docs for CommentAuthorAssociation. [1]

References
[1] https://docs.github.com/en/enterprise-cloud@latest/graphql/reference/enums#commentauthorassociation


🏁 Script executed:

# Read the actual file under review to confirm current state
cat -n .github/workflows/approve_dco.yaml | head -40

Repository: topoteretes/cognee

Length of output: 2117


🏁 Script executed:

# Read CONTRIBUTING.md to verify DCO policy requirements
cat -n CONTRIBUTING.md

Repository: topoteretes/cognee

Length of output: 7165


Remove CONTRIBUTOR from trusted roles to enforce DCO policy

Line 26 includes CONTRIBUTOR in trustedRoles, allowing external contributors who have previously committed to bypass DCO enforcement (lines 29–35). However, CONTRIBUTOR in GitHub's author_association field means "has previously committed to the repository," not organizational membership or trusted collaborator status. This contradicts the comment on line 25 ("OWNER, MEMBER, and COLLABORATOR are trusted roles") and violates the DCO requirement stated in CONTRIBUTING.md section 5, which mandates DCO affirmation for all contributions.

Suggested fix
-            //    OWNER, MEMBER, and COLLABORATOR are trusted roles.
-            const trustedRoles = ['OWNER', 'MEMBER', 'COLLABORATOR', 'CONTRIBUTOR'];
+            //    OWNER, MEMBER, and COLLABORATOR are trusted roles.
+            const trustedRoles = ['OWNER', 'MEMBER', 'COLLABORATOR'];
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// 1. Check if user is an org member via author_association
// (available on the PR payload without extra API calls or permissions)
// OWNER, MEMBER, and COLLABORATOR are trusted roles.
const trustedRoles = ['OWNER', 'MEMBER', 'COLLABORATOR', 'CONTRIBUTOR'];
const isTrusted = trustedRoles.includes(authorAssociation);
// 1. Check if user is an org member via author_association
// (available on the PR payload without extra API calls or permissions)
// OWNER, MEMBER, and COLLABORATOR are trusted roles.
const trustedRoles = ['OWNER', 'MEMBER', 'COLLABORATOR'];
const isTrusted = trustedRoles.includes(authorAssociation);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/approve_dco.yaml around lines 23 - 27, The trustedRoles
array currently includes 'CONTRIBUTOR', which incorrectly treats GitHub
author_association "CONTRIBUTOR" as a trusted role; remove 'CONTRIBUTOR' from
the trustedRoles declaration so only OWNER, MEMBER, and COLLABORATOR are
considered trusted, and ensure any logic using isTrusted (computed from
trustedRoles.includes(authorAssociation)) continues to enforce DCO checks for
contributors; also update the nearby comment if needed to match the corrected
trustedRoles list.

Contributor Author


Not our contribution, outside scope.

Contributor


@DanielNScott, understood — I'll drop this for the current PR. Would you like me to open a follow-up GitHub issue to track removing CONTRIBUTOR from the trustedRoles array in .github/workflows/approve_dco.yaml, so it doesn't get lost?


🧠 Learnings used
Learnt from: CR
Repo: topoteretes/cognee PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-05T15:51:58.120Z
Learning: Sign commits and affirm the DCO as described in CONTRIBUTING.md

Learnt from: DanielNScott
Repo: topoteretes/cognee PR: 2584
File: cognee/infrastructure/databases/graph/postgres/adapter.py:35-44
Timestamp: 2026-04-10T03:13:45.843Z
Learning: In `topoteretes/cognee`, the factory function `get_graph_engine()` in `cognee/infrastructure/databases/graph/get_graph_engine.py` always calls `await graph_client.initialize()` (via a `hasattr` guard) before returning the adapter. This means all graph adapters, including `PostgresAdapter`, have their schema/tables created before any caller can invoke graph operations. Concerns about missing schema at runtime (e.g., "relation does not exist") do not apply to code paths that go through the factory.

Comment on lines +518 to +540
async with self._graph._session() as session:
if edge_type_ids:
table = _validate_table_name("EdgeType_relationship_name")
await session.execute(
text(f"DELETE FROM {table} WHERE CAST(id AS text) = ANY(:ids)"),
{"ids": edge_type_ids},
)

if triplet_ids:
try:
table = _validate_table_name("Triplet_text")
await session.execute(
text(f"DELETE FROM {table} WHERE CAST(id AS text) = ANY(:ids)"),
{"ids": triplet_ids},
)
except Exception as e:
error_msg = str(e).lower()
if "does not exist" in error_msg or "relation" in error_msg:
logger.debug("Triplet_text table not found, skipping: %s", e)
else:
logger.warning("Unexpected error deleting from Triplet_text: %s", e)

await session.commit()
Contributor

@coderabbitai coderabbitai bot Apr 10, 2026


⚠️ Potential issue | 🟠 Major

The swallowed Triplet_text error still aborts the transaction.

If DELETE FROM Triplet_text fails because the table is missing, PostgreSQL marks the whole transaction failed. Catching that exception and continuing to commit() can roll back the earlier deletes or raise a second error. Probe for the collection before opening the transaction, or use a savepoint and only suppress the expected "relation does not exist" case there.

💡 One safe approach
     async def delete_edges_with_vectors(
         self,
         edge_type_ids: Optional[List[str]] = None,
         triplet_ids: Optional[List[str]] = None,
     ) -> None:
+        triplet_table_exists = (
+            await self._vector.has_collection("Triplet_text") if triplet_ids else False
+        )
+
         async with self._graph._session() as session:
             if edge_type_ids:
                 table = _validate_table_name("EdgeType_relationship_name")
                 await session.execute(
                     text(f"DELETE FROM {table} WHERE CAST(id AS text) = ANY(:ids)"),
                     {"ids": edge_type_ids},
                 )

-            if triplet_ids:
-                try:
-                    table = _validate_table_name("Triplet_text")
-                    await session.execute(
-                        text(f"DELETE FROM {table} WHERE CAST(id AS text) = ANY(:ids)"),
-                        {"ids": triplet_ids},
-                    )
-                except Exception as e:
-                    error_msg = str(e).lower()
-                    if "does not exist" in error_msg or "relation" in error_msg:
-                        logger.debug("Triplet_text table not found, skipping: %s", e)
-                    else:
-                        logger.warning("Unexpected error deleting from Triplet_text: %s", e)
+            if triplet_ids and triplet_table_exists:
+                table = _validate_table_name("Triplet_text")
+                await session.execute(
+                    text(f"DELETE FROM {table} WHERE CAST(id AS text) = ANY(:ids)"),
+                    {"ids": triplet_ids},
+                )

             await session.commit()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/infrastructure/databases/hybrid/postgres/adapter.py` around lines 518
- 540, The current transaction opens via async with self._graph._session() and
then attempts DELETE on Triplet_text which, if the table is missing, marks the
whole transaction as failed even though the exception is caught; fix by
isolating the Triplet_text delete in a SAVEPOINT inside that session (execute
"SAVEPOINT sp_triplet", run the DELETE, on relation-missing error ROLLBACK TO
SAVEPOINT sp_triplet and suppress only that case, otherwise re-raise, then
RELEASE SAVEPOINT sp_triplet) or alternatively probe for the table existence
before entering the transaction (use a prior query against pg_catalog to check
Triplet_text) and skip the DELETE entirely if absent; apply this change around
the Triplet_text block that uses session.execute and await session.commit() so
earlier deletes are not rolled back by the missing-table error.
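For reference, the SAVEPOINT variant suggested above can be sketched as follows. SQLite stands in purely for illustration; in the adapter the same pattern would run through the async session against PostgreSQL, where a missing relation raises a ProgrammingError rather than sqlite3.OperationalError, and where the savepoint is what keeps the failed statement from aborting the outer transaction.

```python
# Sketch: isolate a possibly-failing DELETE in a SAVEPOINT so the outer
# transaction's earlier deletes survive.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # manage transactions explicitly
conn.execute("CREATE TABLE edge_type (id TEXT)")
conn.executemany("INSERT INTO edge_type VALUES (?)", [("e1",), ("e2",)])

conn.execute("BEGIN")
conn.execute("DELETE FROM edge_type WHERE id = 'e1'")  # must not be undone

conn.execute("SAVEPOINT sp_triplet")
try:
    conn.execute("DELETE FROM triplet_text WHERE id = 't1'")  # table missing
    conn.execute("RELEASE SAVEPOINT sp_triplet")
except sqlite3.OperationalError:
    # Suppress only the missing-table case; the outer transaction stays valid.
    conn.execute("ROLLBACK TO SAVEPOINT sp_triplet")
    conn.execute("RELEASE SAVEPOINT sp_triplet")

conn.execute("COMMIT")
remaining = [row[0] for row in conn.execute("SELECT id FROM edge_type")]
print(remaining)  # → ['e2']
```

The alternative the review also accepts, probing pg_catalog for the table before opening the transaction, trades the savepoint round-trips for one extra query up front.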

Contributor Author


Non issue, not touched.

Contributor


@DanielNScott, understood — I'll defer to your judgment on this one.

(╯°□°)╯


🧠 Learnings used
Learnt from: DanielNScott
Repo: topoteretes/cognee PR: 2584
File: cognee/infrastructure/databases/graph/postgres/adapter.py:35-44
Timestamp: 2026-04-10T03:13:45.843Z
Learning: In `topoteretes/cognee`, the factory function `get_graph_engine()` in `cognee/infrastructure/databases/graph/get_graph_engine.py` always calls `await graph_client.initialize()` (via a `hasattr` guard) before returning the adapter. This means all graph adapters, including `PostgresAdapter`, have their schema/tables created before any caller can invoke graph operations. Concerns about missing schema at runtime (e.g., "relation does not exist") do not apply to code paths that go through the factory.

Learnt from: CR
Repo: topoteretes/cognee PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-27T06:30:27.190Z
Learning: Implement database adapters by extending GraphDBInterface or VectorDBInterface

Learnt from: hajdul88
Repo: topoteretes/cognee PR: 2259
File: cognee/modules/retrieval/utils/used_graph_elements.py:77-83
Timestamp: 2026-03-03T14:44:13.269Z
Learning: In temporal retriever dictionaries, `triplets` and `vector_search_results` are mutually exclusive—a temporal dict contains either triplets (edge-based results) or vector_search_results (RAG-based results), but never both. The `extract_from_temporal_dict` function in `cognee/modules/retrieval/utils/used_graph_elements.py` correctly returns immediately after processing triplets without falling back to vector_search_results.

Learnt from: hajdul88
Repo: topoteretes/cognee PR: 1990
File: cognee/modules/search/methods/search.py:124-127
Timestamp: 2026-01-13T12:41:17.082Z
Learning: In Python 3.6+ (PEP 530), you can use await inside asynchronous comprehensions (e.g., [await func(x) for x in items]) within async def functions. This is valid and executes sequentially. If parallel execution is desired, collect coroutines and use asyncio.gather, since awaiting inside a comprehension runs each iteration in sequence unless you explicitly parallelize.

Comment on lines +425 to +432
@pytest.mark.asyncio
async def test_add_edges_with_vectors(adapter):
"""Edges should be inserted into graph and edge types into vector table.

Note: this test passes in isolation but may fail when run in suite
due to PGVectorAdapter connection pool state from prior tests.
This is a pre-existing PGVectorAdapter issue, not a hybrid adapter bug.
"""
Contributor

@coderabbitai coderabbitai bot Apr 10, 2026


⚠️ Potential issue | 🟠 Major

This test is documented as flaky.

The note says it can fail in-suite because it depends on leftover PGVector connection-pool state from earlier tests. That makes CI order-dependent instead of behavior-dependent. Please isolate the collection state per test (unique collection names, dedicated schema, or deterministic teardown) before keeping this in the normal suite.

As per coding guidelines, "Avoid external state in Python tests; rely on test fixtures and CI-provided environment variables".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/tests/e2e/postgres/test_postgres_hybrid_adapter.py` around lines 425 -
432, The test test_add_edges_with_vectors is flaky because it relies on leftover
PGVectorAdapter connection-pool state; make it deterministic by isolating
collection state per run: modify the test (or the adapter fixture) to use a
unique collection/schema name (e.g., append a UUID or timestamp) or ensure a
deterministic teardown that drops the vector table/collection and resets the
PGVectorAdapter connection pool before and after the test (or recreate the
adapter instance per test). Update the test_add_edges_with_vectors (and/or the
adapter fixture) to create and cleanup a dedicated collection/schema or call the
adapter reset method so no state is shared across tests.
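An isolation helper along the lines suggested above might look like this; the adapter method names (create_vector_collection, drop_vector_collection) are assumptions for illustration, and a pytest fixture would wrap the same context manager.

```python
# Hypothetical per-test collection isolation: unique names plus deterministic
# teardown, so no test depends on leftover collection or pool state.
import uuid
from contextlib import asynccontextmanager

def unique_collection_name(prefix: str = "Triplet_text") -> str:
    """Append a UUID suffix so sequential or concurrent runs never collide."""
    return f"{prefix}_{uuid.uuid4().hex[:12]}"

@asynccontextmanager
async def isolated_collection(adapter, prefix: str = "Triplet_text"):
    """Create a dedicated collection and drop it even if the test fails."""
    name = unique_collection_name(prefix)
    await adapter.create_vector_collection(name)  # assumed adapter API
    try:
        yield name
    finally:
        await adapter.drop_vector_collection(name)  # assumed adapter API
```

With this in place, test order stops mattering: each test sees only the collection it created.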

Contributor Author

@DanielNScott DanielNScott Apr 10, 2026


Non issue in practice, good coding hygiene, fixed.

Contributor


@DanielNScott, understood — acknowledged and moving on.


🧠 Learnings used
Learnt from: hajdul88
Repo: topoteretes/cognee PR: 251
File: cognee/tests/infrastructure/databases/test_index_graph_edges.py:0-0
Timestamp: 2024-12-04T18:37:55.092Z
Learning: In the `index_graph_edges` function, both graph engine and vector engine initialization failures are handled within the same try-except block, so a single test covers both cases.

Learnt from: DanielNScott
Repo: topoteretes/cognee PR: 2584
File: cognee/infrastructure/databases/graph/postgres/adapter.py:35-44
Timestamp: 2026-04-10T03:13:45.843Z
Learning: In `topoteretes/cognee`, the factory function `get_graph_engine()` in `cognee/infrastructure/databases/graph/get_graph_engine.py` always calls `await graph_client.initialize()` (via a `hasattr` guard) before returning the adapter. This means all graph adapters, including `PostgresAdapter`, have their schema/tables created before any caller can invoke graph operations. Concerns about missing schema at runtime (e.g., "relation does not exist") do not apply to code paths that go through the factory.

Learnt from: hajdul88
Repo: topoteretes/cognee PR: 2259
File: cognee/tests/test_edge_centered_payload.py:146-147
Timestamp: 2026-03-03T14:48:18.555Z
Learning: In the cognee project, every graph edge must produce exactly one triplet embedding. The test `cognee/tests/test_edge_centered_payload.py` enforces strict equality: `len(triplets) == len(edges)`. This is the expected behavior, not a "too strict" test.

Learnt from: DanielNScott
Repo: topoteretes/cognee PR: 2506
File: cognee/tests/test_graphdb_shared.py:31-49
Timestamp: 2026-04-08T01:51:17.734Z
Learning: In cognee/tests/, the files `test_graphdb_shared.py` and `test_graphdb_postgres.py` are standalone script helpers (not pytest test cases), even though they are prefixed with "test_" and are Python files. The test directory contains a pre-existing mixture of pytest and non-pytest (standalone script) test code. Standard pytest isolation rules (e.g., restoring global config state) are less critical for these standalone scripts.

Learnt from: hajdul88
Repo: topoteretes/cognee PR: 1990
File: cognee/modules/search/methods/search.py:124-127
Timestamp: 2026-01-13T12:41:17.082Z
Learning: In Python 3.6+ (PEP 530), you can use await inside asynchronous comprehensions (e.g., [await func(x) for x in items]) within async def functions. This is valid and executes sequentially. If parallel execution is desired, collect coroutines and use asyncio.gather, since awaiting inside a comprehension runs each iteration in sequence unless you explicitly parallelize.

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
cognee/tests/unit/tasks/storage/test_add_data_points.py (1)

346-353: Nice hybrid-path coverage; consider asserting payloads too.

The path assertions are solid. Optional improvement: verify add_edges_with_vectors args, not just call count, so malformed edge payloads don’t slip through.

Suggested assertion hardening
     # Hybrid path: add_edges_with_vectors called, not add_edges
     assert graph_engine.add_edges_with_vectors.await_count == 2
     graph_engine.add_edges.assert_not_awaited()
+    expected_main_edges = ensure_default_edge_properties([edge1])
+    expected_custom_edges = ensure_default_edge_properties(custom_edges)
+    first_call_edges = graph_engine.add_edges_with_vectors.await_args_list[0].args[0]
+    assert expected_main_edges[0] in first_call_edges
+    assert expected_custom_edges[0] in first_call_edges
+    assert graph_engine.add_edges_with_vectors.await_args_list[1].args[0] == expected_custom_edges
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/tests/unit/tasks/storage/test_add_data_points.py` around lines 346 -
353, The test currently only checks that graph_engine.add_edges_with_vectors was
called twice; update it to also assert the actual payloads passed to
add_edges_with_vectors to prevent malformed edge data—inspect
graph_engine.add_edges_with_vectors.call_args_list (or use the async mock's
await_args_list / assert_awaited_with equivalents) and compare each call's
arguments to the expected edge objects/vectors for the hybrid path, validating
keys, source/target IDs and vector shapes/lengths so both calls' payloads are
explicitly verified.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@cognee/tests/unit/tasks/storage/test_add_data_points.py`:
- Around line 346-353: The test currently only checks that
graph_engine.add_edges_with_vectors was called twice; update it to also assert
the actual payloads passed to add_edges_with_vectors to prevent malformed edge
data—inspect graph_engine.add_edges_with_vectors.call_args_list (or use the
async mock's await_args_list / assert_awaited_with equivalents) and compare each
call's arguments to the expected edge objects/vectors for the hybrid path,
validating keys, source/target IDs and vector shapes/lengths so both calls'
payloads are explicitly verified.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 354cb22a-814f-453a-9b81-24a11e0995a1

📥 Commits

Reviewing files that changed from the base of the PR and between 71d9b47 and 5e23c76.

📒 Files selected for processing (1)
  • cognee/tests/unit/tasks/storage/test_add_data_points.py

…ation.

Mostly tweaks; the lockfile issue was a real error.

Problems:
- uv.lock out of sync with pyproject.toml, failing CI
- any non-empty USE_UNIFIED_PROVIDER triggered hybrid mode
- edge indexing crashed on malformed edge tuples
- test_pghybrid set env vars at module level, polluting pytest collection
- add_data_points mock missing has_capability, no hybrid path test

Solutions:
- regenerate uv.lock
- gate USE_UNIFIED_PROVIDER on known UNIFIED_PROVIDERS set
- skip malformed edges in format_edges with arity guard
- move env setup and cache clears into main() in test_pghybrid
- fix unified engine mock and add hybrid write path test
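The "gate USE_UNIFIED_PROVIDER on known UNIFIED_PROVIDERS set" fix can be sketched as a set-membership check; the UNIFIED_PROVIDERS contents here are illustrative of the approach, not necessarily the merged identifiers.

```python
import os

# Only recognised values enable hybrid mode; any other non-empty value is
# ignored rather than silently activating the hybrid adapter.
UNIFIED_PROVIDERS = {"pghybrid"}

def is_hybrid_provider(environ=os.environ) -> bool:
    value = environ.get("USE_UNIFED_PROVIDER" if False else "USE_UNIFIED_PROVIDER", "")
    return value.strip().lower() in UNIFIED_PROVIDERS

print(is_hybrid_provider({"USE_UNIFIED_PROVIDER": "pghybrid"}))  # → True
print(is_hybrid_provider({"USE_UNIFIED_PROVIDER": "yes"}))       # → False
print(is_hybrid_provider({}))                                    # → False
```

Normalising with strip() and lower() keeps " PgHybrid " working while still rejecting truthy-looking strings like "1" or "true".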
…actories.

Problems:
- pghybrid env check frozen inside lru_cache after first call
- subsequent env changes ignored for remainder of process

Solutions:
- move pghybrid check into uncached create_graph_engine wrapper
- move pghybrid check into uncached create_vector_engine wrapper
- remove duplicate checks from cached _create_* functions
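The freezing behaviour and the wrapper fix can be sketched with `functools.lru_cache` (function names are illustrative, not the actual cognee factories):

```python
import os
from functools import lru_cache

@lru_cache(maxsize=1)
def create_engine_frozen():
    # Buggy shape: the env var is read inside the cached function, so the
    # value seen on the first call is frozen for the rest of the process.
    return "engine-for-" + os.environ.get("USE_UNIFIED_PROVIDER", "default")

@lru_cache(maxsize=None)
def _create_engine_cached(provider: str):
    # Heavy construction keyed on the provider decided by the caller;
    # caching per provider value stays correct.
    return "engine-for-" + provider

def create_engine():
    # Fixed shape: an uncached wrapper re-reads the env var on every call,
    # so changing USE_UNIFIED_PROVIDER mid-process takes effect.
    return _create_engine_cached(os.environ.get("USE_UNIFIED_PROVIDER", "default"))
```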
Problems:
- LEFT JOIN graph_edge produced multiple rows per node
- LIMIT counted joined rows, not distinct nodes

Solutions:
- select top-k nodes by distance in CTE, then join edges outside LIMIT
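The shape of the fix can be illustrated against in-memory SQLite, with a plain `distance` column standing in for the pgvector cosine-distance expression (tables and data here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE graph_node (id TEXT PRIMARY KEY, distance REAL);
    CREATE TABLE graph_edge (source TEXT, target TEXT);
    INSERT INTO graph_node VALUES ('a', 0.1), ('b', 0.2), ('c', 0.9);
    INSERT INTO graph_edge VALUES ('a', 'b'), ('a', 'c'), ('b', 'c');
""")

# Buggy shape: LIMIT applies after the join, so node 'a' (two edges)
# consumes both result rows and node 'b' is never returned.
naive = conn.execute("""
    SELECT n.id FROM graph_node n
    LEFT JOIN graph_edge e ON e.source = n.id
    ORDER BY n.distance LIMIT 2
""").fetchall()

# Fixed shape: rank nodes in a CTE first, then join edges outside the LIMIT,
# so LIMIT counts distinct nodes rather than joined rows.
fixed = conn.execute("""
    WITH top_nodes AS (
        SELECT id, distance FROM graph_node
        ORDER BY distance LIMIT 2
    )
    SELECT t.id, e.target FROM top_nodes t
    LEFT JOIN graph_edge e ON e.source = t.id
    ORDER BY t.distance
""").fetchall()
```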
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
cognee/infrastructure/databases/hybrid/postgres/adapter.py (1)

518-540: ⚠️ Potential issue | 🟠 Major

Don't swallow a failed Triplet_text delete inside this transaction.

If Lines 529-532 raise relation does not exist, PostgreSQL aborts the whole transaction. Catching that exception and then continuing to commit() can undo the earlier EdgeType_relationship_name delete, and the else branch currently logs unexpected errors instead of re-raising them. Preflight the table before opening the transaction or isolate just this delete in a SAVEPOINT and re-raise non-missing-table failures.
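The savepoint pattern the comment describes can be illustrated with the stdlib sqlite3 module (table names are placeholders; SQLite does not abort the enclosing transaction on error the way Postgres does, but the savepoint mechanics carry over):

```python
import sqlite3

# Manual transaction control so BEGIN/COMMIT/SAVEPOINT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE edge_type (name TEXT)")
conn.execute("INSERT INTO edge_type VALUES ('knows'), ('likes')")

conn.execute("BEGIN")
conn.execute("DELETE FROM edge_type WHERE name = 'likes'")  # must survive

# Isolate the delete against a possibly-missing table in a SAVEPOINT so a
# failure cannot undo the earlier delete in the same transaction.
conn.execute("SAVEPOINT triplet_delete")
try:
    conn.execute("DELETE FROM triplet_text")  # table does not exist
except sqlite3.OperationalError as exc:
    if "no such table" in str(exc):
        # Missing table is expected on fresh databases: undo only the
        # savepoint and keep the outer transaction alive.
        conn.execute("ROLLBACK TO SAVEPOINT triplet_delete")
    else:
        raise  # any other failure should abort the whole transaction
conn.execute("RELEASE SAVEPOINT triplet_delete")
conn.execute("COMMIT")

remaining = [row[0] for row in conn.execute("SELECT name FROM edge_type")]
```

In Postgres the caught error text would be "relation ... does not exist", and without the savepoint the whole transaction would enter an aborted state, which is exactly the hazard flagged above.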

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/infrastructure/databases/hybrid/postgres/adapter.py` around lines 518
- 540, The Triplet_text DELETE is currently caught and suppressed inside the
transaction which can abort and then cause session.commit() to commit nothing;
change it to use a SAVEPOINT around the Triplet_text delete so only that
operation is isolated: inside the async with self._graph._session() as session
create a SAVEPOINT before executing the Triplet_text DELETE (the same block that
uses _validate_table_name("Triplet_text") and text(...)), attempt the delete, on
exception inspect the error message and if it indicates "does not exist"
rollback to the SAVEPOINT and continue, otherwise rollback to the SAVEPOINT and
re-raise the exception so the outer transaction fails; leave the
EdgeType_relationship_name delete and final await session.commit() unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cognee/infrastructure/databases/hybrid/postgres/adapter.py`:
- Around line 263-287: vector grouping and embedding use only
DataPoint.get_embeddable_data() and string names derived via rsplit, causing
incorrect/misaligned embeddings and crashes when metadata is missing; change the
grouping key to a tuple (type(dp).__name__, field_name) and when building
valid_items use getattr(dp, field_name, None) (or a small helper that pulls the
actual field value) as the text to embed instead of
DataPoint.get_embeddable_data(), skip entries where getattr returns None, call
self._vector.embed_data with those actual texts, and when creating vector
indexes pass the explicit type_name and field_name from the tuple (not
collection.rsplit("_", 1)), and ensure you guard access to dp.metadata when
iterating index_fields.

In `@cognee/infrastructure/databases/vector/create_vector_engine.py`:
- Around line 29-46: The pghybrid branch is hand-building a PGVector DSN and
bypasses the existing pgvector credential validation, which can produce
malformed connection strings; update the unified_provider == "pghybrid" branch
to either call the shared pgvector helper (reuse the existing
_create_vector_engine or the helper that builds/validates the pgvector DSN) or
mirror its validation by verifying relational_config.db_host, db_port, db_name,
db_username, and db_password are present and raising a clear exception if any
are missing, then construct the connection_string and pass it to PGVectorAdapter
exactly as the shared helper does so the same guardrails (and error messages)
apply.
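A sketch of the mirrored validation (the config class and helper are stand-ins, not cognee's actual API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RelationalConfig:
    # Minimal stand-in for cognee's relational config object.
    db_host: Optional[str]
    db_port: Optional[int]
    db_name: Optional[str]
    db_username: Optional[str]
    db_password: Optional[str]

def build_pgvector_dsn(cfg: RelationalConfig) -> str:
    """Fail fast on incomplete credentials before building the DSN."""
    required = ("db_host", "db_port", "db_name", "db_username", "db_password")
    missing = [name for name in required if getattr(cfg, name) in (None, "")]
    if missing:
        raise ValueError(f"Incomplete relational config, missing: {', '.join(missing)}")
    return (
        f"postgresql+asyncpg://{cfg.db_username}:{cfg.db_password}"
        f"@{cfg.db_host}:{cfg.db_port}/{cfg.db_name}"
    )
```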



Run ID: 50c62b72-55c6-453c-8942-7725cfc753df

📥 Commits

Reviewing files that changed from the base of the PR and between 5e23c76 and e1ed4d4.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • cognee/infrastructure/databases/graph/get_graph_engine.py
  • cognee/infrastructure/databases/hybrid/postgres/adapter.py
  • cognee/infrastructure/databases/unified/get_unified_engine.py
  • cognee/infrastructure/databases/vector/create_vector_engine.py
  • cognee/memify_pipelines/consolidate_entity_descriptions.py
  • cognee/tests/e2e/postgres/test_pghybrid.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • cognee/infrastructure/databases/graph/get_graph_engine.py
  • cognee/infrastructure/databases/unified/get_unified_engine.py
  • cognee/memify_pipelines/consolidate_entity_descriptions.py
  • cognee/tests/e2e/postgres/test_pghybrid.py

Comment on lines +29 to +46
if unified_provider == "pghybrid":
from cognee.infrastructure.databases.relational import get_relational_config

embedding_engine = get_embedding_engine()
relational_config = get_relational_config()
connection_string = (
f"postgresql+asyncpg://{relational_config.db_username}:{relational_config.db_password}"
f"@{relational_config.db_host}:{relational_config.db_port}"
f"/{relational_config.db_name}"
)

from .pgvector.PGVectorAdapter import PGVectorAdapter

return PGVectorAdapter(
connection_string,
vector_db_key,
embedding_engine,
)
Contributor

@coderabbitai coderabbitai bot Apr 10, 2026


⚠️ Potential issue | 🟡 Minor

Please keep the pghybrid override on the same validated pgvector path.

This branch now hand-builds the PGVector connection string but skips the normal pgvector credential guard in _create_vector_engine(). With incomplete relational settings it can return a malformed DSN instead of failing fast. Reusing the shared pgvector helper here—or at least mirroring the db_host/db_port/db_name/db_username/db_password validation—would keep the hybrid path predictable.

As per coding guidelines, "Prefer explicit, structured error handling in Python code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/infrastructure/databases/vector/create_vector_engine.py` around lines
29 - 46, The pghybrid branch is hand-building a PGVector DSN and bypasses the
existing pgvector credential validation, which can produce malformed connection
strings; update the unified_provider == "pghybrid" branch to either call the
shared pgvector helper (reuse the existing _create_vector_engine or the helper
that builds/validates the pgvector DSN) or mirror its validation by verifying
relational_config.db_host, db_port, db_name, db_username, and db_password are
present and raising a clear exception if any are missing, then construct the
connection_string and pass it to PGVectorAdapter exactly as the shared helper
does so the same guardrails (and error messages) apply.

Contributor Author


Non-issue; the relational database would already have failed if it were mis-specified.


@DanielNScott, that makes sense — since the relational engine is already initialised (and its config validated) before the pghybrid vector-engine path is reached, a mis-specified DSN would have surfaced earlier in the startup sequence. Fair enough, I'll defer to your judgement here.


✏️ Learnings added
Learnt from: DanielNScott
Repo: topoteretes/cognee PR: 2587
File: cognee/infrastructure/databases/vector/create_vector_engine.py:29-46
Timestamp: 2026-04-10T17:49:13.498Z
Learning: In `topoteretes/cognee`, the `pghybrid` branch in `create_vector_engine` (`cognee/infrastructure/databases/vector/create_vector_engine.py`) intentionally skips explicit credential validation when building the PGVector DSN. The relational database engine is initialised (and its config validated) earlier in the startup sequence, so any mis-specified relational config (db_host, db_port, db_name, db_username, db_password) would already have caused a failure before this code path is reached. A redundant guard here is therefore unnecessary by design.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: DanielNScott
Repo: topoteretes/cognee PR: 2584
File: cognee/infrastructure/databases/graph/postgres/adapter.py:35-44
Timestamp: 2026-04-10T03:13:45.843Z
Learning: In `topoteretes/cognee`, the factory function `get_graph_engine()` in `cognee/infrastructure/databases/graph/get_graph_engine.py` always calls `await graph_client.initialize()` (via a `hasattr` guard) before returning the adapter. This means all graph adapters, including `PostgresAdapter`, have their schema/tables created before any caller can invoke graph operations. Concerns about missing schema at runtime (e.g., "relation does not exist") do not apply to code paths that go through the factory.

Learnt from: hajdul88
Repo: topoteretes/cognee PR: 1990
File: cognee/modules/search/methods/search.py:124-127
Timestamp: 2026-01-13T12:41:17.082Z
Learning: In Python 3.6+ (PEP 530), you can use await inside asynchronous comprehensions (e.g., [await func(x) for x in items]) within async def functions. This is valid and executes sequentially. If parallel execution is desired, collect coroutines and use asyncio.gather, since awaiting inside a comprehension runs each iteration in sequence unless you explicitly parallelize.

@DanielNScott

Consolidated information on PR checks and their resolution below. Mostly tweaks related to unrealistic edge cases, all easy to fix. A few actual issues were found, most notably the lock file conflict.

CodeRabbit issues:

  1. DCO check grants trust to CONTRIBUTOR role, which includes any prior PR author
  2. pghybrid env var read once in graph factory, frozen by lru_cache thereafter
  3. Failed DELETE on missing Triplet_text table may abort the enclosing transaction
  4. search_graph_with_distances LIMIT counts joined rows, not distinct nodes
  5. Typo in USE_UNIFIED_PROVIDER silently activates hybrid mode
  6. pghybrid env var read once in vector factory, frozen by lru_cache thereafter
  7. Positional edge indexing crashes on short or malformed edge tuples
  8. Neo4jAdapter import fails if neo4j package not installed
  9. test_pghybrid mutates os.environ at import time, affecting all collected tests
  10. Hybrid adapter test fails when run after tests that leave PGVector pool open

Assessments / resolutions:

  1. Not related to this PR
  2. Fixed -- moved pghybrid check into uncached wrapper in create_graph_engine.
  3. Not a real problem, verified empirically as non-issue.
  4. Fixed -- select top-k nodes by distance in CTE, then join edges outside LIMIT.
  5. Fixed -- gate USE_UNIFIED_PROVIDER on known UNIFIED_PROVIDERS set.
  6. Fixed -- moved pghybrid check into uncached wrapper in create_vector_engine.
  7. Fixed -- added arity guard to skip malformed edge tuples in format_edges.
  8. Reasonable behaviour.
  9. Fixed -- moved env setup and cache clears into main() in test_pghybrid.
  10. Non-issue; incorrect configuration or test cleanup led to the failure.
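The arity guard from item 7 can be sketched as follows (the expected tuple shape and output format are assumptions for illustration, not cognee's actual `format_edges`):

```python
def format_edges(edges):
    """Skip malformed edge tuples instead of crashing on positional access.

    Assumes edges shaped like (source_id, target_id, relationship_name,
    properties); shorter or non-sequence entries are dropped.
    """
    formatted = []
    for edge in edges:
        if not isinstance(edge, (tuple, list)) or len(edge) < 3:
            continue  # arity guard: malformed edge, skip rather than raise
        source, target, relationship = edge[0], edge[1], edge[2]
        properties = edge[3] if len(edge) > 3 else {}
        formatted.append({
            "source": source,
            "target": target,
            "relationship": relationship,
            "properties": properties,
        })
    return formatted
```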

Test failures:

  1. add_data_points tests fail because unified engine mock lacks has_capability flag
  2. No test coverage for the hybrid write path in add_data_points

Test fixes:
  1. Fixed -- added has_capability flag
  2. Fixed -- added test_add_data_points_hybrid_write_path verifying hybrid routing

Problems:
- PostgresHybridAdapter missing get_neighborhood, failing instantiation
- hybrid test fixture used removed relational_engine constructor
- hybrid test fixture used __new__ hack to bypass PGVectorAdapter cache
- flaky test due to shared connection pool state across tests

Solutions:
- add get_neighborhood delegation to hybrid adapter
- use connection_string constructor for both graph and vector adapters
- add session-scoped event_loop fixture
- clean up via graph_adapter.sessionmaker instead of relational_engine
- remove flakiness note from test_add_edges_with_vectors

Reasoning and Related Notes:
- search_graph_with_distances LIMIT fix uses CTE to rank nodes before joining edges
- CodeRabbit item on get_embeddable_data returning only first field is a pre-existing
  DataPoint behavior, not introduced by hybrid adapter
- CodeRabbit item on pghybrid skipping pgvector credential validation is low risk
  since relational config is already validated
Problems:
- get_embeddable_data always returned first indexed field
- multi-field DataPoints embedded wrong text for non-first collections

Solutions:
- use getattr(dp, field_name) to embed the correct field per collection
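A sketch of the corrected grouping (`DataPoint` here is a minimal stand-in for cognee's model, and the grouping key follows the (type name, field name) tuple suggested in review):

```python
from collections import defaultdict

class DataPoint:
    # Minimal stand-in: real DataPoints are Pydantic models with richer metadata.
    def __init__(self, name, description, index_fields):
        self.name = name
        self.description = description
        self.metadata = {"index_fields": index_fields}

def group_embeddable_texts(data_points):
    """Group texts to embed by (type name, field name).

    Uses getattr(dp, field_name) so each collection embeds its own field,
    rather than whatever the first indexed field happens to be.
    """
    groups = defaultdict(list)
    for dp in data_points:
        # Guard None metadata before reading index_fields.
        for field_name in (dp.metadata or {}).get("index_fields", []):
            value = getattr(dp, field_name, None)
            if value is None:
                continue  # skip unset fields instead of embedding None
            groups[(type(dp).__name__, field_name)].append(value)
    return dict(groups)
```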

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
cognee/infrastructure/databases/hybrid/postgres/adapter.py (1)

270-276: ⚠️ Potential issue | 🟡 Minor

Guard against None metadata.

dp.metadata is typed as Optional[MetaData] in the DataPoint model. If a DataPoint has metadata=None explicitly set, line 271 will raise AttributeError when calling .get().

🛡️ Suggested defensive fix
         for dp in data_points:
-            for field_name in dp.metadata.get("index_fields", []):
+            index_fields = (dp.metadata or {}).get("index_fields", [])
+            for field_name in index_fields:
                 if getattr(dp, field_name, None) is None:
                     continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/infrastructure/databases/hybrid/postgres/adapter.py` around lines 270
- 276, The loop grouping data points assumes dp.metadata is non-None and calls
dp.metadata.get(...), which will raise AttributeError if metadata is None;
update the loop in adapter.py (the block that iterates over data_points and
builds vector_groups) to first guard dp.metadata (e.g., skip when dp.metadata is
falsy) before accessing .get("index_fields", []), so DataPoint instances with
metadata=None are safely skipped and vector_groups construction remains correct.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cognee/infrastructure/databases/hybrid/postgres/adapter.py`:
- Around line 73-76: The initialize method's docstring claims it initializes
"both underlying adapters" but only calls self._graph.initialize(); either
update the docstring to reflect that only the graph adapter is initialized
(e.g., "Initialize the graph adapter") or add the missing call to initialize the
vector adapter (e.g., call self._vector.initialize() or the actual vector
adapter field) so the implementation matches the docstring; edit the initialize
method's docstring and/or add the vector adapter initialization to the body
(refer to initialize, self._graph and the vector adapter attribute name in this
class).



Run ID: 33de770b-6e67-4d41-9539-1f4ec8d9c9f3

📥 Commits

Reviewing files that changed from the base of the PR and between e1ed4d4 and 57e9665.

📒 Files selected for processing (2)
  • cognee/infrastructure/databases/hybrid/postgres/adapter.py
  • cognee/tests/e2e/postgres/test_postgres_hybrid_adapter.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • cognee/tests/e2e/postgres/test_postgres_hybrid_adapter.py

Problems:
- PGVectorAdapter.py line wrapping failed ruff-format in CI pre-commit hook

Solutions:
- apply ruff format to PGVectorAdapter.py

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py`:
- Line 62: The _metadata cache stored on the adapter (self._metadata) is not
invalidated after prune() calls delete_database(), causing has_collection() to
return stale results; update prune() so that after calling delete_database() you
clear or reset the cache (e.g., reassign self._metadata = MetaData() or None) so
subsequent has_collection() reflections use fresh metadata; ensure you reference
the adapter's self._metadata, the prune() method, and the delete_database() call
when making this change.
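The stale-cache bug and its fix can be modeled with a toy adapter (names are illustrative; the real adapter caches SQLAlchemy reflection metadata):

```python
class MetadataCachingAdapter:
    """Toy model of the stale-cache bug: reflection results cached on the
    adapter must be invalidated when prune() drops the underlying tables."""

    def __init__(self):
        self._tables = {"collection_a"}   # pretend database state
        self._metadata = None             # reflection cache

    def has_collection(self, name):
        if self._metadata is None:
            # "Reflect" the database once, then answer from the cache.
            self._metadata = set(self._tables)
        return name in self._metadata

    def prune(self):
        self._tables.clear()   # delete_database() equivalent
        self._metadata = None  # the fix: drop the cache alongside the data
```

Without the reset in `prune()`, `has_collection()` would keep answering from metadata reflected before the database was deleted.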


Run ID: fa0c7046-6953-44bd-85e5-7ee6e5bb0418

📥 Commits

Reviewing files that changed from the base of the PR and between 57e9665 and 5145cba.

📒 Files selected for processing (1)
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py

…ure/cog-4469-graph-aware-embeddings-integrate-joint-postgres-graph-db-and
…ing.

Problems:
- _metadata cache not invalidated after prune, has_collection returns stale True
- hybrid adapter docstring says "both" but only initializes graph adapter

Solutions:
- call self._metadata.clear() before delete_database in prune
- update docstring to reflect that vector adapter initializes lazily
…ure/cog-4469-graph-aware-embeddings-integrate-joint-postgres-graph-db-and
Problems:
- all postgres adapter tests fail in CI with "No module named asyncpg"
- cognee_setup action does not install postgres extra by default
- not sure how this happened; it worked on a previous branch, so possibly a
  merge issue somewhere?

Solutions:
- pass extra-dependencies: postgres to cognee_setup in graph_db_tests.yml
@Vasilije1990 Vasilije1990 merged commit 241e6f7 into dev Apr 11, 2026
127 of 135 checks passed
@Vasilije1990 Vasilije1990 deleted the feature/cog-4469-graph-aware-embeddings-integrate-joint-postgres-graph-db-and branch April 11, 2026 16:25