Open-source data lakehouse platform — one command, real data, production patterns.
10 services. 4 million rows. Sub-second queries. Zero cloud bills.
LakeForge is a fully integrated data lakehouse you spin up with a single command. It ingests a real dataset from HuggingFace (4M Amazon product reviews), transforms it through a dbt medallion architecture, and serves it through a custom React dashboard with sub-second ClickHouse queries.
This is not a tutorial or a collection of docker-compose snippets. It's a working analytical platform with health checks on every service, dependency ordering, real benchmarks, and a custom frontend.
```bash
git clone https://github.com/vinsblack/lakeforge
cd lakeforge
cp .env.example .env
docker compose up -d
```

HuggingFace (4M rows)
│
▼
┌──────────────────────────────────────────────────────┐
│ INGESTION │
│ Redpanda (Kafka-compatible, no JVM) │
└────────────────────────┬─────────────────────────────┘
▼
┌──────────────────────────────────────────────────────┐
│ STORAGE │
│ MinIO (S3-compatible) · Project Nessie (catalog) │
└────────────────────────┬─────────────────────────────┘
▼
┌──────────────────────────────────────────────────────┐
│ TRANSFORM │
│ dbt Core — bronze → silver → gold │
│ Prefect — orchestration + scheduling │
└────────────────────────┬─────────────────────────────┘
▼
┌──────────────────────────────────────────────────────┐
│ QUERY │
│ ClickHouse — sub-second OLAP │
└────────────────────────┬─────────────────────────────┘
▼
┌──────────────────────────────────────────────────────┐
│ OBSERVE + PRESENT │
│ Prometheus · Grafana · Superset · React Dashboard │
└──────────────────────────────────────────────────────┘
| Layer | Tool | Role | Status |
|---|---|---|---|
| Object Store | MinIO | S3-compatible, self-hosted | ✅ Active in pipeline |
| Transformation | dbt Core | SQL-first medallion architecture | ✅ Active in pipeline |
| Orchestration | Prefect | Pipeline scheduling with UI | ✅ Active in pipeline |
| OLAP Engine | ClickHouse | Fastest open-source analytical DB | ✅ Active in pipeline |
| Monitoring | Prometheus + Grafana | Platform observability | ✅ Active (ClickHouse metrics) |
| BI | Apache Superset | Self-service SQL and dashboards | ✅ Running, ready to configure |
| Frontend | React + Vite | Custom analytics dashboard | ✅ Active, queries gold layer |
| Streaming | Redpanda | Kafka-compatible broker, no JVM overhead | 🔧 Running, not yet in pipeline |
| Data Catalog | Project Nessie | Git-like versioning for data | 🔧 Running, not yet in pipeline |
Note: Redpanda and Nessie are deployed and healthy but not yet wired into the ingestion flow. The current pipeline loads data via batch (HuggingFace → MinIO → ClickHouse). Streaming ingestion through Redpanda and Iceberg table support via Nessie are on the roadmap.
Prerequisites: Docker Desktop 4.x+ (8 GB RAM recommended)
```bash
# Clone and start
git clone https://github.com/vinsblack/lakeforge
cd lakeforge
cp .env.example .env    # ← change passwords before production use
docker compose up -d
docker compose ps       # verify all services are healthy

# Load real data from HuggingFace (takes ~15 min)
./run_pipeline.sh       # Linux/Mac
.\run_pipeline.bat      # Windows

# Build dbt layers
./run_dbt.sh            # Linux/Mac
.\run_dbt.bat           # Windows
```

| Service | URL | Default credentials |
|---|---|---|
| LakeForge Dashboard | http://localhost:3001 | — |
| Redpanda Console | http://localhost:8080 | — |
| MinIO Console | http://localhost:9001 | lakeforge / see .env |
| Nessie Catalog | http://localhost:19120 | — |
| ClickHouse HTTP | http://localhost:8123 | lakeforge / see .env |
| Prefect UI | http://localhost:4200 | — |
| Prometheus | http://localhost:9090 | — |
| Grafana | http://localhost:3000 | admin / see .env |
| Superset | http://localhost:8088 | admin / see .env |
McAuley-Lab/Amazon-Reviews-2023 — 877K+ downloads on HuggingFace, backed by a peer-reviewed paper. We load 4 million Electronics reviews via chunked streaming (100K rows per chunk, ~100MB peak RAM). The pipeline is idempotent — safe to re-run without creating duplicates.
HuggingFace API (streaming, chunked)
│
▼
MinIO bronze bucket (Parquet, ~220 MB)
+ ClickHouse bronze.amazon_reviews (raw import)
│ dbt run
▼
lakeforge_silver.reviews_enriched (cleaned, typed, sentiment labels)
│ dbt run
▼
lakeforge_gold.reviews_kpis — monthly KPI aggregations
lakeforge_gold.top_products — top products by review volume
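The chunked-streaming ingestion described above (the real implementation lives in `pipelines/flows/amazon_reviews_pipeline.py`) can be sketched roughly as follows — a minimal sketch over a generic row iterator; the `chunked` helper and the stand-in data source are illustrative, not the project's actual code:

```python
from itertools import islice

def chunked(rows, chunk_size=100_000):
    """Yield lists of at most chunk_size rows from any iterator, so peak
    memory is bounded by one chunk instead of the whole 4M-row dataset."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

# In the real pipeline the iterator would come from HuggingFace's
# datasets library in streaming mode; a plain range() stands in here
# so the sketch runs offline.
total = 0
for chunk in chunked(range(250_000)):
    # Each chunk would be written to the MinIO bronze bucket as Parquet
    # and inserted into ClickHouse; re-running would first clear the
    # target (one way to get the idempotency the pipeline advertises).
    total += len(chunk)
```

Because only one chunk is materialized at a time, peak RAM tracks the chunk size rather than the dataset size.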
Bronze — raw data as ingested from HuggingFace. No transformations.
Silver — cleaned and enriched:
- Type casting, null handling, validation filters
- Full review text and title preserved for downstream analysis
- Date extraction (year, month, day of week)
- Review length and title length computation
- Sentiment labeling (`positive`/`neutral`/`negative` based on rating)
- Helpfulness tiering (`highly_helpful`/`helpful`/`not_helpful`)
- Incremental materialization with `delete+insert`

Gold — business-ready aggregations:
- `reviews_kpis`: monthly rollups with avg rating, sentiment split, verified %, avg review length
- `top_products`: top 10K products by volume with quality metrics
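The silver-layer labeling can be pictured with a small Python sketch — the thresholds below are assumptions for illustration only; the authoritative logic lives in `dbt/models/silver/reviews_enriched.sql`:

```python
def sentiment_label(rating: float) -> str:
    """Map a 1-5 star rating to a sentiment bucket.
    Cutoffs here are illustrative, not the project's exact SQL."""
    if rating >= 4:
        return "positive"
    if rating >= 3:
        return "neutral"
    return "negative"

def helpfulness_tier(helpful_votes: int) -> str:
    """Bucket reviews by helpful-vote count. Cutoffs are assumptions."""
    if helpful_votes >= 10:
        return "highly_helpful"
    if helpful_votes >= 1:
        return "helpful"
    return "not_helpful"
```

Keeping the labels as low-cardinality strings makes the downstream gold-layer `GROUP BY` aggregations cheap in ClickHouse.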
All queries measured on local Docker Desktop, 4 million rows, ClickHouse 24.1.
| Query | Description | Time |
|---|---|---|
| Q1 | Full scan COUNT(*) 4M rows | 139ms |
| Q2 | GROUP BY year + AVG(rating) | 112ms |
| Q3 | Top 100 products (GROUP BY + ORDER BY) | 195ms |
| Q4 | Sentiment distribution | 157ms |
| Q5 | Monthly time series — 8-year window | 142ms |
| Q6 | Verified vs unverified analysis | 129ms |
| Q7 | Gold layer KPI aggregation | 124ms |
| Q8 | Review length distribution | 303ms |
Run on your machine:
```bash
./docs/benchmarks/run_bench.sh    # Linux/Mac
.\docs\benchmarks\run_bench.ps1   # Windows
```

The React dashboard connects to ClickHouse through an nginx reverse proxy and visualizes the gold layer in real time:
- KPI cards — total reviews, average rating, positive %, verified %
- Yearly trend — bar chart of review volume over time
- Rating trend — area chart showing average rating evolution
- Sentiment analysis — positive vs negative % over time
- Top products table — sortable by volume, rating, sentiment
Built with Vite + React + Recharts. Dark theme. Served via nginx in Docker.
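The "credentials never in browser" property comes from terminating ClickHouse authentication at nginx. A minimal sketch of such a location block — the path, upstream name, and header values are assumptions; the project's actual `frontend/nginx.conf` may differ:

```nginx
location /api/query {
    # The browser sends only SQL; nginx attaches the ClickHouse
    # credentials here, so they never appear in the client bundle.
    proxy_pass http://clickhouse:8123/;
    proxy_set_header X-ClickHouse-User "lakeforge";
    # In practice the key would be templated in from .env at container
    # start (e.g. via envsubst), not hard-coded.
    proxy_set_header X-ClickHouse-Key  "changeme";
}
```

ClickHouse's HTTP interface accepts credentials via the `X-ClickHouse-User` and `X-ClickHouse-Key` headers, which is what makes this server-side injection possible.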
Most "lakehouse in Docker" repos on GitHub are bare docker-compose files with placeholder services. Here's what LakeForge ships that they don't:
| Capability | Other repos | LakeForge |
|---|---|---|
| Health checks | rare | ✅ every service |
| Dependency ordering | rare | ✅ correct boot sequence |
| Real data pipeline | ❌ dummy data | ✅ 4M rows from HuggingFace |
| Memory-efficient ingestion | ❌ | ✅ chunked streaming, ~100MB peak |
| Idempotent pipeline | ❌ | ✅ safe to re-run |
| dbt medallion arch | ❌ | ✅ bronze / silver / gold |
| dbt schema tests | ❌ | ✅ not_null, unique, accepted_values |
| Custom frontend | ❌ | ✅ React dashboard |
| Grafana dashboards | ❌ | ✅ ClickHouse monitoring provisioned |
| Published benchmarks | ❌ | ✅ 8 queries with timings |
| Cross-platform scripts | ❌ | ✅ Linux, Mac, Windows |
| CI pipeline | ❌ | ✅ GitHub Actions |
| Server-side auth | ❌ | ✅ credentials never in browser |
lakeforge/
├── docker-compose.yml # 10+ services, health checks, volumes
├── .env.example # All configurable credentials
├── Makefile # Common commands (make up/down/pipeline/bench)
├── run_pipeline.sh / .bat # Data ingestion (HuggingFace → ClickHouse)
├── run_dbt.sh / .bat # dbt transformations
├── infra/
│ ├── clickhouse/
│ │ ├── config.xml # S3 policy, Prometheus metrics, named collections
│ │ └── init.sql # Database initialization
│ ├── monitoring/
│ │ ├── prometheus.yml # Scrape configs (CH, Redpanda, MinIO)
│ │ └── grafana/
│ │ ├── datasources/ # Prometheus datasource
│ │ └── dashboards/ # Auto-provisioned ClickHouse dashboard
│ └── superset/
│ ├── Dockerfile # Pre-built image with clickhouse-connect
│ └── superset_config.py
├── pipelines/
│ ├── requirements.txt
│ └── flows/
│ └── amazon_reviews_pipeline.py # Chunked, idempotent, memory-efficient
├── dbt/
│ ├── dbt_project.yml # Vars: top_products_limit, min_reviews_for_top
│ ├── profiles.yml
│ └── models/
│ ├── bronze/
│ │ └── sources.yml # 11 columns documented + freshness check
│ ├── silver/
│ │ ├── reviews_enriched.sql # Incremental, sentiment, helpfulness
│ │ └── schema.yml # not_null, accepted_values tests
│ └── gold/
│ ├── reviews_kpis.sql # Monthly KPI rollups
│ ├── top_products.sql # Configurable via dbt vars
│ └── schema.yml # not_null, unique tests
├── frontend/
│ ├── Dockerfile # Multi-stage build (node → nginx)
│ ├── nginx.conf # Reverse proxy with server-side CH auth
│ └── src/
│ ├── api/clickhouse.js # Zero credentials in browser
│ └── App.jsx # Responsive, graceful per-section errors
├── docs/
│ └── benchmarks/ # Scripts: sh, ps1, bat + results.md
├── .github/
│ └── workflows/ci.yml # Validate, lint, build, integration test
├── CONTRIBUTING.md
└── LICENSE (MIT)
Planned features — contributions welcome:
- Apache Iceberg tables — integrate PyIceberg for ACID transactions and time travel through the Nessie catalog
- Streaming ingestion — produce HuggingFace data through Redpanda topics, consume into ClickHouse in real time
- Data quality — Great Expectations suites on bronze → silver boundary
- Data lineage — OpenLineage integration for end-to-end tracking
- CDC pipeline — Debezium connector for database change capture
- dbt tests — schema tests and data freshness checks
- Grafana dashboards — pre-built ClickHouse and pipeline monitoring panels
- Multi-dataset support — pluggable ingestion for other HuggingFace datasets
MIT — use freely, contributions welcome.
