This repository showcases a production-style, end-to-end Azure Data Engineering project with a strong focus on Azure Data Factory (ADF). The primary goal of this project is to master data ingestion, orchestration, and transformation patterns using ADF, while integrating other core Azure services to simulate a real-world enterprise data platform.
The project follows Modern Data Stack principles and implements the Medallion Architecture (Bronze, Silver, Gold) to deliver reliable, scalable, and analytics-ready data.
Azure Data Factory is used as the central orchestration engine because it is purpose-built for enterprise-scale data movement and workflow automation in Azure.
Why ADF fits this project:
- Native integration with Azure services
- Supports hybrid ingestion via Self-hosted Integration Runtime (SHIR)
- Enables dynamic pipelines using parameters, Lookup, and ForEach
- Cost-effective, scalable, and fully managed
- Ideal for building repeatable and reliable enterprise ETL/ELT pipelines
Project objectives:
- Master Azure Data Factory pipelines, triggers, data flows, and orchestration
- Implement dynamic, metadata-driven ingestion patterns
- Handle multiple real-world data sources (on-prem files, REST API, Azure SQL)
- Apply incremental loading and watermarking strategies
- Build a business-ready Gold layer optimized for analytics
Bronze layer (ingestion):
- On-Prem Connectivity: Configured a Self-hosted Integration Runtime (SHIR) to securely ingest files from local file systems into Azure.
- API Orchestration: Developed dynamic pipelines to fetch GitHub REST API data using Web Activity and JSON parsing.
- Incremental Loading (watermark-based): Built a watermarking mechanism for Azure SQL DB so that only new records are processed on each run, reducing compute costs.

Silver layer (transformation):
- Format Evolution: Converted raw CSV/JSON/SQL data into Delta Lake format.
- Data Quality: Implemented schema enforcement and data type casting within Spark-powered Mapping Data Flows.
- Idempotency: Used Alter Row transformations to perform MERGE operations, preventing duplicate entries during pipeline re-runs.

Gold layer (analytics):
- Aggregation Logic: Created a "Top 5 Performing Airlines" view by aggregating millions of rows into actionable insights.
- Advanced Window Functions: Used DENSE_RANK so that airlines tied on revenue share a rank and no rank numbers are skipped.
- Reporting Readiness: Optimized the final Delta sink for seamless Power BI integration.

Azure services used:
- Azure Data Factory – Data ingestion, orchestration, transformation
- Azure Data Lake Storage Gen2 – Centralized data lake
- Self-hosted Integration Runtime (SHIR) – Secure on-prem connectivity
- Azure SQL Database – Structured source with incremental loading
- Azure Repos & Pipelines – Version control & CI/CD (conceptual)
- Azure Logic Apps – Alerting & monitoring (failure notifications)
- All resources are deployed inside a single Azure Resource Group to ensure clean lifecycle management and cost visibility.

Self-hosted Integration Runtime (SHIR):
- Secure bridge between the local network and Azure
- Installed on a local machine
- Authenticated using secure registration keys
- Enables ingestion from:
  - Local CSV / Excel files
  - Network file shares
  - Local SQL Servers

Azure Data Lake Storage Gen2:
- Hierarchical namespace enabled
- Containers structured as:
  - bronze – Raw data
  - silver – Cleansed & standardized data
  - gold – Aggregated business views

On-prem file ingestion pipeline:
- Pipeline parameterized with a file list array
- ForEach activity iterates over the files in the list
- Single reusable Copy Activity handles every file
- Supports scalable batch ingestion with minimal maintenance
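
The parameter-plus-ForEach pattern above can be sketched as a pipeline definition. This is a minimal illustration rather than the repository's actual pipeline: it builds a trimmed dict modeled on ADF pipeline JSON, and the pipeline name, dataset names (`ds_onprem_file`, `ds_bronze_file`), dataset parameter (`p_file_name`), and file names are all hypothetical. A deployable definition would need more properties than shown here.

```python
import json


def build_onprem_ingest_pipeline(default_files):
    """Return a trimmed, ADF-style pipeline definition as a plain dict."""
    # One Copy activity, reused for every file; the current file name
    # arrives through the ForEach item expression.
    file_name_expr = {"value": "@item()", "type": "Expression"}
    copy_activity = {
        "name": "CopyOneFile",
        "type": "Copy",
        "inputs": [{
            "referenceName": "ds_onprem_file",  # hypothetical dataset
            "type": "DatasetReference",
            "parameters": {"p_file_name": file_name_expr},
        }],
        "outputs": [{
            "referenceName": "ds_bronze_file",  # hypothetical dataset
            "type": "DatasetReference",
            "parameters": {"p_file_name": file_name_expr},
        }],
        "typeProperties": {
            "source": {"type": "DelimitedTextSource"},
            "sink": {"type": "DelimitedTextSink"},
        },
    }
    return {
        "name": "pl_ingest_onprem_files",  # hypothetical pipeline name
        "properties": {
            "parameters": {
                "p_files": {"type": "array", "defaultValue": list(default_files)}
            },
            "activities": [{
                "name": "ForEachFile",
                "type": "ForEach",
                "typeProperties": {
                    # Iterate over the array passed in as a pipeline parameter.
                    "items": {
                        "value": "@pipeline().parameters.p_files",
                        "type": "Expression",
                    },
                    "isSequential": False,
                    "activities": [copy_activity],
                },
            }],
        },
    }


pipeline = build_onprem_ingest_pipeline(["airlines.csv", "flights.csv"])
print(json.dumps(pipeline, indent=2))
```

Adding a new file then only means adding a name to the `p_files` array; the activities themselves stay unchanged.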

REST API ingestion pipeline:
- Web Activity checks API availability before the copy runs
- REST Dataset for JSON ingestion
- Raw JSON stored in the Bronze layer
- Demonstrates real-world API ingestion patterns
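
The check-then-copy flow above can be approximated in plain Python. This is a sketch of the control flow only, not of ADF itself: the `fetch` callable, the example URL, and the target folder layout are assumptions made for illustration, and a stub stands in for the real HTTP call so the example runs offline.

```python
import json
import tempfile
from pathlib import Path


def ingest_api_to_bronze(fetch, url, bronze_dir, file_name):
    """Check availability, then store the raw JSON payload unchanged.

    `fetch(url)` is a hypothetical callable returning (status_code, body_text).
    """
    # Step 1: availability check (the role played by the Web Activity).
    status, _ = fetch(url)
    if status != 200:
        raise RuntimeError(f"API not available: HTTP {status}")

    # Step 2: fetch the payload (the role played by Copy + REST dataset).
    status, body = fetch(url)
    if status != 200:
        raise RuntimeError(f"API call failed: HTTP {status}")

    json.loads(body)  # validate that it parses; the raw text is what gets stored
    target = Path(bronze_dir) / file_name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(body, encoding="utf-8")
    return target


def fake_fetch(url):
    """Stub standing in for a real HTTP client."""
    return 200, '[{"id": 1, "name": "sample"}]'


bronze_root = tempfile.mkdtemp()
saved = ingest_api_to_bronze(
    fake_fetch,
    "https://example.invalid/data.json",  # placeholder URL
    Path(bronze_root) / "bronze" / "api",
    "data.json",
)
print(saved.read_text(encoding="utf-8"))
```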

Incremental SQL ingestion pipeline:
- Source: Azure SQL Database (Fact table)
- Metadata-driven watermark stored in JSON
- Only new records processed on each run
- ID-based watermark so that no newly inserted rows are skipped between runs
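
The ID-based watermark logic can be illustrated with a small, self-contained sketch. It is not the ADF pipeline itself: an in-memory SQLite database stands in for Azure SQL Database, and the table name `fact_bookings`, its columns, and the `last_id` key in the watermark file are hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def load_watermark(path):
    """Return the last processed ID, or 0 if no watermark file exists yet."""
    if path.exists():
        return json.loads(path.read_text())["last_id"]
    return 0


def incremental_load(conn, watermark_path):
    """Read only rows with an ID above the stored watermark, then advance it."""
    last_id = load_watermark(watermark_path)
    rows = conn.execute(
        "SELECT id, amount FROM fact_bookings WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    if rows:
        # Advance the watermark only after the new rows were read successfully.
        watermark_path.write_text(json.dumps({"last_id": rows[-1][0]}))
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_bookings (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO fact_bookings VALUES (?, ?)",
    [(1, 10.0), (2, 20.0), (3, 30.0)],
)
watermark_file = Path(tempfile.mkdtemp()) / "watermark.json"

first = incremental_load(conn, watermark_file)   # initial run: all rows
second = incremental_load(conn, watermark_file)  # nothing new
conn.execute("INSERT INTO fact_bookings VALUES (4, 40.0)")
third = incremental_load(conn, watermark_file)   # only the new row

print(len(first), len(second), len(third))  # prints: 3 0 1
```

Note that an ID-based watermark only picks up inserted rows with increasing IDs; updates to existing rows would need a different change indicator, such as a modified timestamp.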

Parent pipeline:
- Acts as the control tower for the child pipelines
- Sequential execution of:
  - On-Prem ingestion
  - API ingestion
  - Incremental SQL ingestion
- Improves reliability and dependency management
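
One way to picture the sequential chaining is as a list of Execute Pipeline activities, each depending on the success of the previous one. The sketch below builds a trimmed dict modeled on ADF pipeline JSON; the parent and child pipeline names are hypothetical, and the repository's real definition may differ.

```python
import json


def build_parent_pipeline(child_pipelines):
    """Chain Execute Pipeline activities so each waits for the previous one."""
    activities = []
    previous = None
    for child in child_pipelines:
        activity = {
            "name": f"Run_{child}",
            "type": "ExecutePipeline",
            # The first activity has no dependency; later ones run only
            # after the previous activity succeeded.
            "dependsOn": [] if previous is None else [
                {"activity": previous, "dependencyConditions": ["Succeeded"]}
            ],
            "typeProperties": {
                "pipeline": {"referenceName": child, "type": "PipelineReference"},
                "waitOnCompletion": True,
            },
        }
        activities.append(activity)
        previous = activity["name"]
    return {"name": "pl_parent", "properties": {"activities": activities}}


parent = build_parent_pipeline([
    "pl_ingest_onprem_files",  # hypothetical child pipeline names
    "pl_ingest_api",
    "pl_ingest_sql_incremental",
])
print(json.dumps(parent, indent=2))
```

Because every link uses the `Succeeded` condition, a failure in an earlier step stops the later ingestion steps from running.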

Silver layer data flow:
- Data cleansing & type casting
- Schema standardization
- Delta Lake format
- Upsert logic using primary keys
- Idempotent pipelines (safe re-runs)
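
Why a key-based upsert makes re-runs safe can be shown with a tiny in-memory model. This is only a conceptual sketch: a dict keyed by primary key stands in for the Delta table, and the column names and values are made up for illustration.

```python
def upsert(target, batch, key="id"):
    """Insert new rows and overwrite existing ones, matched on the primary key."""
    for row in batch:
        target[row[key]] = dict(row)
    return target


silver = {}
batch = [
    {"id": 1, "airline": "Airline A", "revenue": 100},
    {"id": 2, "airline": "Airline B", "revenue": 200},
]

upsert(silver, batch)
upsert(silver, batch)  # re-running the same batch adds no duplicates
print(len(silver))  # prints: 2

# A changed row with an existing key updates in place instead of appending.
upsert(silver, [{"id": 2, "airline": "Airline B", "revenue": 250}])
print(silver[2]["revenue"])  # prints: 250
```

A plain append would have produced four rows after the second run; matching on the key is what keeps the result identical no matter how often the batch is replayed.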

Gold layer data flow:
- Aggregated Top 5 Airlines by Revenue
- ADF transformations used:
  - Aggregate
  - Window (Dense Rank)
  - Filter
  - Select
- Gold layer optimized for BI & reporting
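
The Aggregate, Window (Dense Rank), Filter, and Select steps can be mirrored in a few lines of Python. This is a sketch of the logic under assumed inputs, not the data flow itself: the airline names and revenue figures are invented, and the real flow runs on Spark inside ADF.

```python
from collections import defaultdict


def top_n_by_revenue(rows, n=5):
    """Aggregate revenue per airline, dense-rank it, and keep ranks 1..n."""
    # Aggregate: total revenue per airline.
    totals = defaultdict(int)
    for airline, revenue in rows:
        totals[airline] += revenue

    # Window (dense rank): ties share a rank and no rank number is skipped.
    ordered = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    ranked = []
    rank = 0
    previous_total = None
    for airline, total in ordered:
        if total != previous_total:
            rank += 1
            previous_total = total
        ranked.append({"airline": airline, "revenue": total, "rank": rank})

    # Filter + Select: keep only the top n ranks.
    return [r for r in ranked if r["rank"] <= n]


rows = [  # hypothetical (airline, revenue) records
    ("Airline A", 100), ("Airline A", 50), ("Airline B", 150),
    ("Airline C", 120), ("Airline D", 90), ("Airline E", 80),
    ("Airline F", 70), ("Airline G", 60),
]
top = top_n_by_revenue(rows)
print([(r["airline"], r["rank"]) for r in top])
# prints: [('Airline A', 1), ('Airline B', 1), ('Airline C', 2),
#          ('Airline D', 3), ('Airline E', 4), ('Airline F', 5)]
```

Because Airline A and Airline B tie at rank 1, the "top 5" result contains six airlines; with a dense rank the filter is on rank values, not on row count.
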
Contributions are welcome! Feel free to open issues or submit pull requests for improvements.
If this project helped you learn Azure Data Engineering, please give it a ⭐!
This project demonstrates how Azure Data Factory can be used as a powerful, enterprise-grade orchestration engine to build scalable, automated, and reliable data pipelines that transform raw data into trusted, analytics-ready insights.
