vinsblack/lakeforge

LakeForge

LakeForge Dashboard

Open-source data lakehouse platform — one command, real data, production patterns.
10 services. 4 million rows. Sub-300ms queries. Zero cloud bills.

CI · License: MIT

What is LakeForge?

LakeForge is a fully integrated data lakehouse you spin up with a single command. It ingests a real dataset from HuggingFace (4M Amazon product reviews), transforms it through a dbt medallion architecture, and serves it through a custom React dashboard with sub-second ClickHouse queries.

This is not a tutorial or a collection of docker-compose snippets. It's a working analytical platform with health checks on every service, dependency ordering, real benchmarks, and a custom frontend.

git clone https://github.com/vinsblack/lakeforge
cd lakeforge
cp .env.example .env
docker compose up -d

Architecture

HuggingFace (4M rows)
       │
       ▼
┌──────────────────────────────────────────────────────┐
│  INGESTION                                            │
│  Redpanda (Kafka-compatible, no JVM)                 │
└────────────────────────┬─────────────────────────────┘
                         ▼
┌──────────────────────────────────────────────────────┐
│  STORAGE                                              │
│  MinIO (S3-compatible)  ·  Project Nessie (catalog)  │
└────────────────────────┬─────────────────────────────┘
                         ▼
┌──────────────────────────────────────────────────────┐
│  TRANSFORM                                            │
│  dbt Core — bronze → silver → gold                   │
│  Prefect — orchestration + scheduling                │
└────────────────────────┬─────────────────────────────┘
                         ▼
┌──────────────────────────────────────────────────────┐
│  QUERY                                                │
│  ClickHouse — sub-second OLAP                        │
└────────────────────────┬─────────────────────────────┘
                         ▼
┌──────────────────────────────────────────────────────┐
│  OBSERVE + PRESENT                                    │
│  Prometheus · Grafana · Superset · React Dashboard   │
└──────────────────────────────────────────────────────┘

Stack

| Layer | Tool | Role | Status |
|---|---|---|---|
| Object Store | MinIO | S3-compatible, self-hosted | ✅ Active in pipeline |
| Transformation | dbt Core | SQL-first medallion architecture | ✅ Active in pipeline |
| Orchestration | Prefect | Pipeline scheduling with UI | ✅ Active in pipeline |
| OLAP Engine | ClickHouse | Sub-second open-source OLAP database | ✅ Active in pipeline |
| Monitoring | Prometheus + Grafana | Platform observability | ✅ Active (ClickHouse metrics) |
| BI | Apache Superset | Self-service SQL and dashboards | ✅ Running, ready to configure |
| Frontend | React + Vite | Custom analytics dashboard | ✅ Active, queries gold layer |
| Streaming | Redpanda | Kafka-compatible broker, no JVM overhead | 🔧 Running, not yet in pipeline |
| Data Catalog | Project Nessie | Git-like versioning for data | 🔧 Running, not yet in pipeline |

Note: Redpanda and Nessie are deployed and healthy but not yet wired into the ingestion flow. The current pipeline loads data via batch (HuggingFace → MinIO → ClickHouse). Streaming ingestion through Redpanda and Iceberg table support via Nessie are on the roadmap.


Quickstart

Prerequisites: Docker Desktop 4.x+ (8 GB RAM recommended)

# Clone and start
git clone https://github.com/vinsblack/lakeforge
cd lakeforge
cp .env.example .env          # ← change passwords before production use
docker compose up -d
docker compose ps              # verify all services are healthy
# Load real data from HuggingFace (takes ~15 min)
./run_pipeline.sh              # Linux/Mac
.\run_pipeline.bat             # Windows

# Build dbt layers
./run_dbt.sh                   # Linux/Mac
.\run_dbt.bat                  # Windows
| Service | URL | Default credentials |
|---|---|---|
| LakeForge Dashboard | http://localhost:3001 | — |
| Redpanda Console | http://localhost:8080 | — |
| MinIO Console | http://localhost:9001 | lakeforge / see .env |
| Nessie Catalog | http://localhost:19120 | — |
| ClickHouse HTTP | http://localhost:8123 | lakeforge / see .env |
| Prefect UI | http://localhost:4200 | — |
| Prometheus | http://localhost:9090 | — |
| Grafana | http://localhost:3000 | admin / see .env |
| Superset | http://localhost:8088 | admin / see .env |

Data Pipeline

Dataset

McAuley-Lab/Amazon-Reviews-2023 — 877K+ downloads on HuggingFace, peer-reviewed. We load 4 million Electronics reviews via chunked streaming (100K rows per chunk, ~100MB peak RAM). The pipeline is idempotent — safe to re-run without creating duplicates.
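The chunked-streaming approach keeps memory flat by materializing only one batch of rows at a time. The real logic lives in pipelines/flows/amazon_reviews_pipeline.py; the sketch below is a simplified, self-contained illustration (the generator of dicts stands in for the HuggingFace streaming dataset, and the 100-row chunk size is scaled down from the actual 100K):

```python
from itertools import islice

def chunked(rows, size):
    """Yield lists of at most `size` rows so only one chunk is in memory at a time."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

# Stand-in for the streamed HuggingFace dataset (illustrative fields only)
stream = ({"rating": i % 5 + 1, "text": f"review {i}"} for i in range(250))

for i, chunk in enumerate(chunked(stream, 100)):
    # In the real pipeline each chunk would be written to Parquet / inserted into ClickHouse here
    print(f"chunk {i}: {len(chunk)} rows")
```

Because each chunk is processed and released before the next is pulled, peak memory stays proportional to the chunk size rather than the dataset size.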

Pipeline flow

HuggingFace API (streaming, chunked)
       │
       ▼
  MinIO bronze bucket (Parquet, ~220 MB)
  + ClickHouse bronze.amazon_reviews (raw import)
       │  dbt run
       ▼
  lakeforge_silver.reviews_enriched (cleaned, typed, sentiment labels)
       │  dbt run
       ▼
  lakeforge_gold.reviews_kpis     — monthly KPI aggregations
  lakeforge_gold.top_products     — top products by review volume

Medallion architecture (dbt)

Bronze — raw data as ingested from HuggingFace. No transformations.

Silver — cleaned and enriched:

  • Type casting, null handling, validation filters
  • Full review text and title preserved for downstream analysis
  • Date extraction (year, month, day of week)
  • Review length and title length computation
  • Sentiment labeling (positive / neutral / negative based on rating)
  • Helpfulness tiering (highly_helpful / helpful / not_helpful)
  • Incremental materialization with delete+insert
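The sentiment and helpfulness labels above are simple rule-based CASE expressions in the silver model. A rough Python equivalent is shown below — note the exact thresholds are assumptions for illustration, not the project's actual SQL:

```python
def sentiment_label(rating: float) -> str:
    """Rule-based sentiment from star rating (thresholds assumed: >=4 positive, 3 neutral, else negative)."""
    if rating >= 4:
        return "positive"
    if rating == 3:
        return "neutral"
    return "negative"

def helpfulness_tier(helpful_votes: int) -> str:
    """Helpfulness tier from vote count (cutoffs assumed for illustration)."""
    if helpful_votes >= 10:
        return "highly_helpful"
    if helpful_votes >= 1:
        return "helpful"
    return "not_helpful"

print(sentiment_label(5), helpfulness_tier(0))  # positive not_helpful
```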

Gold — business-ready aggregations:

  • reviews_kpis: monthly rollups with avg rating, sentiment split, verified %, avg review length
  • top_products: top 10K products by volume with quality metrics
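Conceptually, reviews_kpis is a GROUP BY month over the silver table with a handful of aggregates. A minimal in-memory sketch of that rollup (the column names and the tiny dataset are illustrative, not the actual schema):

```python
from collections import defaultdict

# Tiny stand-in for lakeforge_silver.reviews_enriched
reviews = [
    {"month": "2023-01", "rating": 5, "sentiment": "positive", "verified": True},
    {"month": "2023-01", "rating": 2, "sentiment": "negative", "verified": False},
    {"month": "2023-02", "rating": 4, "sentiment": "positive", "verified": True},
]

buckets = defaultdict(list)
for r in reviews:
    buckets[r["month"]].append(r)

kpis = {
    month: {
        "reviews": len(rows),
        "avg_rating": sum(r["rating"] for r in rows) / len(rows),
        "positive_pct": 100 * sum(r["sentiment"] == "positive" for r in rows) / len(rows),
        "verified_pct": 100 * sum(r["verified"] for r in rows) / len(rows),
    }
    for month, rows in buckets.items()
}

print(kpis["2023-01"])  # {'reviews': 2, 'avg_rating': 3.5, 'positive_pct': 50.0, 'verified_pct': 50.0}
```

In the actual project this aggregation is a dbt SQL model materialized in ClickHouse; the Python here only illustrates the shape of the computation.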

Benchmarks

All queries measured on local Docker Desktop, 4 million rows, ClickHouse 24.1.

| Query | Description | Time |
|---|---|---|
| Q1 | Full scan COUNT(*), 4M rows | 139ms |
| Q2 | GROUP BY year + AVG(rating) | 112ms |
| Q3 | Top 100 products (GROUP BY + ORDER BY) | 195ms |
| Q4 | Sentiment distribution | 157ms |
| Q5 | Monthly time series, 8-year window | 142ms |
| Q6 | Verified vs unverified analysis | 129ms |
| Q7 | Gold layer KPI aggregation | 124ms |
| Q8 | Review length distribution | 303ms |
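The repository's bench scripts live in docs/benchmarks/. As a hypothetical illustration of how such timings can be taken client-side, the harness below times repeated runs and keeps the best wall-clock result; the best-of-N approach and the stub query runner are assumptions, not a description of the actual scripts:

```python
import time

def timed_ms(run_query, sql, repeats=3):
    """Run a query several times and return the best wall-clock time in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(sql)
        best = min(best, (time.perf_counter() - start) * 1000)
    return best

# Stub runner; a real one would POST the SQL to ClickHouse's HTTP port (8123)
def fake_runner(sql):
    return None

ms = timed_ms(fake_runner, "SELECT count() FROM bronze.amazon_reviews")
print(f"best of 3: {ms:.2f} ms")
```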

Run on your machine:

./docs/benchmarks/run_bench.sh     # Linux/Mac
.\docs\benchmarks\run_bench.ps1    # Windows

Custom Dashboard

The React dashboard connects to ClickHouse through an nginx reverse proxy and visualizes the gold layer in real time:

  • KPI cards — total reviews, average rating, positive %, verified %
  • Yearly trend — bar chart of review volume over time
  • Rating trend — area chart showing average rating evolution
  • Sentiment analysis — positive vs negative % over time
  • Top products table — sortable by volume, rating, sentiment

Built with Vite + React + Recharts. Dark theme. Served via nginx in Docker.


What makes LakeForge different

Most "lakehouse in Docker" repos on GitHub are bare docker-compose files with placeholder services. Here's what LakeForge ships that they don't:

| Capability | Other repos | LakeForge |
|---|---|---|
| Health checks | rare | ✅ every service |
| Dependency ordering | rare | ✅ correct boot sequence |
| Real data pipeline | ❌ dummy data | ✅ 4M rows from HuggingFace |
| Memory-efficient ingestion | ❌ | ✅ chunked streaming, ~100MB peak |
| Idempotent pipeline | ❌ | ✅ safe to re-run |
| dbt medallion arch | ❌ | ✅ bronze / silver / gold |
| dbt schema tests | ❌ | ✅ not_null, unique, accepted_values |
| Custom frontend | ❌ | ✅ React dashboard |
| Grafana dashboards | ❌ | ✅ ClickHouse monitoring provisioned |
| Published benchmarks | ❌ | ✅ 8 queries with timings |
| Cross-platform scripts | ❌ | ✅ Linux, Mac, Windows |
| CI pipeline | ❌ | ✅ GitHub Actions |
| Server-side auth | ❌ | ✅ credentials never in browser |

Project Structure

lakeforge/
├── docker-compose.yml              # 10+ services, health checks, volumes
├── .env.example                    # All configurable credentials
├── Makefile                        # Common commands (make up/down/pipeline/bench)
├── run_pipeline.sh / .bat          # Data ingestion (HuggingFace → ClickHouse)
├── run_dbt.sh / .bat               # dbt transformations
├── infra/
│   ├── clickhouse/
│   │   ├── config.xml              # S3 policy, Prometheus metrics, named collections
│   │   └── init.sql                # Database initialization
│   ├── monitoring/
│   │   ├── prometheus.yml          # Scrape configs (CH, Redpanda, MinIO)
│   │   └── grafana/
│   │       ├── datasources/        # Prometheus datasource
│   │       └── dashboards/         # Auto-provisioned ClickHouse dashboard
│   └── superset/
│       ├── Dockerfile              # Pre-built image with clickhouse-connect
│       └── superset_config.py
├── pipelines/
│   ├── requirements.txt
│   └── flows/
│       └── amazon_reviews_pipeline.py  # Chunked, idempotent, memory-efficient
├── dbt/
│   ├── dbt_project.yml             # Vars: top_products_limit, min_reviews_for_top
│   ├── profiles.yml
│   └── models/
│       ├── bronze/
│       │   └── sources.yml         # 11 columns documented + freshness check
│       ├── silver/
│       │   ├── reviews_enriched.sql  # Incremental, sentiment, helpfulness
│       │   └── schema.yml          # not_null, accepted_values tests
│       └── gold/
│           ├── reviews_kpis.sql    # Monthly KPI rollups
│           ├── top_products.sql    # Configurable via dbt vars
│           └── schema.yml          # not_null, unique tests
├── frontend/
│   ├── Dockerfile                  # Multi-stage build (node → nginx)
│   ├── nginx.conf                  # Reverse proxy with server-side CH auth
│   └── src/
│       ├── api/clickhouse.js       # Zero credentials in browser
│       └── App.jsx                 # Responsive, graceful per-section errors
├── docs/
│   └── benchmarks/                 # Scripts: sh, ps1, bat + results.md
├── .github/
│   └── workflows/ci.yml           # Validate, lint, build, integration test
├── CONTRIBUTING.md
└── LICENSE (MIT)

Roadmap

Planned features — contributions welcome:

  • Apache Iceberg tables — integrate PyIceberg for ACID transactions and time travel through the Nessie catalog
  • Streaming ingestion — produce HuggingFace data through Redpanda topics, consume into ClickHouse in real time
  • Data quality — Great Expectations suites on bronze → silver boundary
  • Data lineage — OpenLineage integration for end-to-end tracking
  • CDC pipeline — Debezium connector for database change capture
  • Expanded dbt tests — broader schema test and freshness coverage beyond the current suite
  • More Grafana dashboards — pipeline-level monitoring panels alongside the provisioned ClickHouse dashboard
  • Multi-dataset support — pluggable ingestion for other HuggingFace datasets

License

MIT — use freely, contributions welcome.


Built with ClickHouse · MinIO · dbt · Prefect · Redpanda · Nessie · Superset · React

About

Open-source data lakehouse platform — ClickHouse, dbt, MinIO, Prefect, React. 4M rows, sub-300ms queries, one command.
