Open-source data lakehouse platform — one command, real data, production patterns.
10 services. 4 million rows. Sub-second queries. Zero cloud bills.
LakeForge is a fully integrated data lakehouse you spin up with a single command. It ingests a real dataset from HuggingFace (4M Amazon product reviews), transforms it through a dbt medallion architecture, and serves it through a custom React dashboard with sub-second ClickHouse queries.
This is not a tutorial or a collection of docker-compose snippets. It's a working analytical platform with health checks on every service, dependency ordering, real benchmarks, and a custom frontend.
```bash
git clone https://github.com/vinsblack/lakeforge
cd lakeforge
cp .env.example .env
docker compose up -d
```

HuggingFace (4M rows)
│
▼
┌──────────────────────────────────────────────────────┐
│ INGESTION │
│ Redpanda (Kafka-compatible, no JVM) │
└────────────────────────┬─────────────────────────────┘
▼
┌──────────────────────────────────────────────────────┐
│ STORAGE │
│ MinIO (S3-compatible) · Project Nessie (catalog) │
└────────────────────────┬─────────────────────────────┘
▼
┌──────────────────────────────────────────────────────┐
│ TRANSFORM │
│ dbt Core — bronze → silver → gold │
│ Prefect — orchestration + scheduling │
└────────────────────────┬─────────────────────────────┘
▼
┌──────────────────────────────────────────────────────┐
│ QUERY │
│ ClickHouse — sub-second OLAP │
└────────────────────────┬─────────────────────────────┘
▼
┌──────────────────────────────────────────────────────┐
│ OBSERVE + PRESENT │
│ Prometheus · Grafana · Superset · React Dashboard │
└──────────────────────────────────────────────────────┘
| Layer | Tool | Role | Status |
|---|---|---|---|
| Object Store | MinIO | S3-compatible, self-hosted | ✅ Active in pipeline |
| Transformation | dbt Core | SQL-first medallion architecture | ✅ Active in pipeline |
| Orchestration | Prefect | Pipeline scheduling with UI | ✅ Active in pipeline |
| OLAP Engine | ClickHouse | Fastest open-source analytical DB | ✅ Active in pipeline |
| Monitoring | Prometheus + Grafana | Platform observability | ✅ Active (ClickHouse metrics) |
| BI | Apache Superset | Self-service SQL and dashboards | ✅ Running, ready to configure |
| Frontend | React + Vite | Custom analytics dashboard | ✅ Active, queries gold layer |
| Streaming | Redpanda | Kafka-compatible broker, no JVM overhead | 🔧 Running, not yet in pipeline |
| Data Catalog | Project Nessie | Git-like versioning for data | 🔧 Running, not yet in pipeline |
Note: Redpanda and Nessie are deployed and healthy but not yet wired into the ingestion flow. The current pipeline loads data via batch (HuggingFace → MinIO → ClickHouse). Streaming ingestion through Redpanda and Iceberg table support via Nessie are on the roadmap.
Prerequisites: Docker Desktop 4.x+ (8 GB RAM recommended)
```bash
# Clone and start
git clone https://github.com/vinsblack/lakeforge
cd lakeforge
cp .env.example .env    # ← change passwords before production use
docker compose up -d
docker compose ps       # verify all services are healthy

# Load real data from HuggingFace (takes ~15 min)
./run_pipeline.sh       # Linux/Mac
.\run_pipeline.bat      # Windows

# Build dbt layers
./run_dbt.sh            # Linux/Mac
.\run_dbt.bat           # Windows
```

| Service | URL | Default credentials |
|---|---|---|
| LakeForge Dashboard | http://localhost:3001 | — |
| Redpanda Console | http://localhost:8080 | — |
| MinIO Console | http://localhost:9001 | lakeforge / see .env |
| Nessie Catalog | http://localhost:19120 | — |
| ClickHouse HTTP | http://localhost:8123 | lakeforge / see .env |
| Prefect UI | http://localhost:4200 | — |
| Prometheus | http://localhost:9090 | — |
| Grafana | http://localhost:3000 | admin / see .env |
| Superset | http://localhost:8088 | admin / see .env |
McAuley-Lab/Amazon-Reviews-2023 — 877K+ downloads on HuggingFace, backed by a peer-reviewed paper. We load 4 million Electronics reviews via chunked streaming (100K rows per chunk, ~100MB peak RAM). The pipeline is idempotent — safe to re-run without creating duplicates.
HuggingFace API (streaming, chunked)
│
▼
MinIO bronze bucket (Parquet, ~220 MB)
+ ClickHouse bronze.amazon_reviews (raw import)
│ dbt run
▼
lakeforge_silver.reviews_enriched (cleaned, typed, sentiment labels)
│ dbt run
▼
lakeforge_gold.reviews_kpis — monthly KPI aggregations
lakeforge_gold.top_products — top products by review volume
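The chunked-streaming ingestion described above (the real implementation lives in `pipelines/flows/amazon_reviews_pipeline.py`) can be sketched roughly as follows — a minimal sketch over a generic row iterator; the `chunked` helper and the stand-in data source are illustrative, not the project's actual code:

```python
from itertools import islice

def chunked(rows, chunk_size=100_000):
    """Yield lists of at most chunk_size rows from any iterator, so peak
    memory is bounded by one chunk instead of the whole 4M-row dataset."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

# In the real pipeline the iterator would come from HuggingFace's
# datasets library in streaming mode; a plain range() stands in here
# so the sketch runs offline.
total = 0
for chunk in chunked(range(250_000)):
    # Each chunk would be written to the MinIO bronze bucket as Parquet
    # and inserted into ClickHouse; re-running would first clear the
    # target (one way to get the idempotency the pipeline advertises).
    total += len(chunk)
```

Because only one chunk is materialized at a time, peak RAM tracks the chunk size rather than the dataset size.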
Bronze — raw data as ingested from HuggingFace. No transformations.
Silver — cleaned and enriched:
- Type casting, null handling, validation filters
- Full review text and title preserved for downstream analysis
- Date extraction (year, month, day of week)
- Review length and title length computation
- Sentiment labeling (`positive`/`neutral`/`negative` based on rating)
- Helpfulness tiering (`highly_helpful`/`helpful`/`not_helpful`)
- Incremental materialization with `delete+insert`

Gold — business-ready aggregations:
- `reviews_kpis`: monthly rollups with avg rating, sentiment split, verified %, avg review length
- `top_products`: top 10K products by volume with quality metrics
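The silver-layer labeling can be pictured with a small Python sketch — the thresholds below are assumptions for illustration only; the authoritative logic lives in `dbt/models/silver/reviews_enriched.sql`:

```python
def sentiment_label(rating: float) -> str:
    """Map a 1-5 star rating to a sentiment bucket.
    Cutoffs here are illustrative, not the project's exact SQL."""
    if rating >= 4:
        return "positive"
    if rating >= 3:
        return "neutral"
    return "negative"

def helpfulness_tier(helpful_votes: int) -> str:
    """Bucket reviews by helpful-vote count. Cutoffs are assumptions."""
    if helpful_votes >= 10:
        return "highly_helpful"
    if helpful_votes >= 1:
        return "helpful"
    return "not_helpful"
```

Keeping the labels as low-cardinality strings makes the downstream gold-layer `GROUP BY` aggregations cheap in ClickHouse.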
All queries measured on local Docker Desktop, 4 million rows, ClickHouse 24.1.
| Query | Description | Time |
|---|---|---|
| Q1 | Full scan COUNT(*) 4M rows | 139ms |
| Q2 | GROUP BY year + AVG(rating) | 112ms |
| Q3 | Top 100 products (GROUP BY + ORDER BY) | 195ms |
| Q4 | Sentiment distribution | 157ms |
| Q5 | Monthly time series — 8-year window | 142ms |
| Q6 | Verified vs unverified analysis | 129ms |
| Q7 | Gold layer KPI aggregation | 124ms |
| Q8 | Review length distribution | 303ms |
Run on your machine:
```bash
./docs/benchmarks/run_bench.sh    # Linux/Mac
.\docs\benchmarks\run_bench.ps1   # Windows
```

The React dashboard connects to ClickHouse through an nginx reverse proxy and visualizes the gold layer in real time:
- KPI cards — total reviews, average rating, positive %, verified %
- Yearly trend — bar chart of review volume over time
- Rating trend — area chart showing average rating evolution
- Sentiment analysis — positive vs negative % over time
- Top products table — sortable by volume, rating, sentiment
Built with Vite + React + Recharts. Dark theme. Served via nginx in Docker.
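The "credentials never in browser" property comes from terminating ClickHouse authentication at nginx. A minimal sketch of such a location block — the path, upstream name, and header values are assumptions; the project's actual `frontend/nginx.conf` may differ:

```nginx
location /api/query {
    # The browser sends only SQL; nginx attaches the ClickHouse
    # credentials here, so they never appear in the client bundle.
    proxy_pass http://clickhouse:8123/;
    proxy_set_header X-ClickHouse-User "lakeforge";
    # In practice the key would be templated in from .env at container
    # start (e.g. via envsubst), not hard-coded.
    proxy_set_header X-ClickHouse-Key  "changeme";
}
```

ClickHouse's HTTP interface accepts credentials via the `X-ClickHouse-User` and `X-ClickHouse-Key` headers, which is what makes this server-side injection possible.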
Most "lakehouse in Docker" repos on GitHub are bare docker-compose files with placeholder services. Here's what LakeForge ships that they don't:
| Capability | Other repos | LakeForge |
|---|---|---|
| Health checks | rare | ✅ every service |
| Dependency ordering | rare | ✅ correct boot sequence |
| Real data pipeline | ❌ dummy data | ✅ 4M rows from HuggingFace |
| Memory-efficient ingestion | ❌ | ✅ chunked streaming, ~100MB peak |
| Idempotent pipeline | ❌ | ✅ safe to re-run |
| dbt medallion arch | ❌ | ✅ bronze / silver / gold |
| dbt schema tests | ❌ | ✅ not_null, unique, accepted_values |
| Custom frontend | ❌ | ✅ React dashboard |
| Grafana dashboards | ❌ | ✅ ClickHouse monitoring provisioned |
| Published benchmarks | ❌ | ✅ 8 queries with timings |
| Cross-platform scripts | ❌ | ✅ Linux, Mac, Windows |
| CI pipeline | ❌ | ✅ GitHub Actions |
| Server-side auth | ❌ | ✅ credentials never in browser |
lakeforge/
├── docker-compose.yml # 10+ services, health checks, volumes
├── .env.example # All configurable credentials
├── Makefile # Common commands (make up/down/pipeline/bench)
├── run_pipeline.sh / .bat # Data ingestion (HuggingFace → ClickHouse)
├── run_dbt.sh / .bat # dbt transformations
├── infra/
│ ├── clickhouse/
│ │ ├── config.xml # S3 policy, Prometheus metrics, named collections
│ │ └── init.sql # Database initialization
│ ├── monitoring/
│ │ ├── prometheus.yml # Scrape configs (CH, Redpanda, MinIO)
│ │ └── grafana/
│ │ ├── datasources/ # Prometheus datasource
│ │ └── dashboards/ # Auto-provisioned ClickHouse dashboard
│ └── superset/
│ ├── Dockerfile # Pre-built image with clickhouse-connect
│ └── superset_config.py
├── pipelines/
│ ├── requirements.txt
│ └── flows/
│ └── amazon_reviews_pipeline.py # Chunked, idempotent, memory-efficient
├── dbt/
│ ├── dbt_project.yml # Vars: top_products_limit, min_reviews_for_top
│ ├── profiles.yml
│ └── models/
│ ├── bronze/
│ │ └── sources.yml # 11 columns documented + freshness check
│ ├── silver/
│ │ ├── reviews_enriched.sql # Incremental, sentiment, helpfulness
│ │ └── schema.yml # not_null, accepted_values tests
│ └── gold/
│ ├── reviews_kpis.sql # Monthly KPI rollups
│ ├── top_products.sql # Configurable via dbt vars
│ └── schema.yml # not_null, unique tests
├── frontend/
│ ├── Dockerfile # Multi-stage build (node → nginx)
│ ├── nginx.conf # Reverse proxy with server-side CH auth
│ └── src/
│ ├── api/clickhouse.js # Zero credentials in browser
│ └── App.jsx # Responsive, graceful per-section errors
├── docs/
│ └── benchmarks/ # Scripts: sh, ps1, bat + results.md
├── .github/
│ └── workflows/ci.yml # Validate, lint, build, integration test
├── CONTRIBUTING.md
└── LICENSE (MIT)
Planned features — contributions welcome:
- Apache Iceberg tables — integrate PyIceberg for ACID transactions and time travel through the Nessie catalog
- Streaming ingestion — produce HuggingFace data through Redpanda topics, consume into ClickHouse in real time
- Data quality — Great Expectations suites on bronze → silver boundary
- Data lineage — OpenLineage integration for end-to-end tracking
- CDC pipeline — Debezium connector for database change capture
- dbt tests — schema tests and data freshness checks
- Grafana dashboards — pre-built ClickHouse and pipeline monitoring panels
- Multi-dataset support — pluggable ingestion for other HuggingFace datasets
MIT — use freely, contributions welcome.
