A data analysis project examining global nutrition patterns, dietary trends, and their relationship with obesity rates across countries.
# Activate virtual environment
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies (if not already installed)
pip install -r requirements.txt
# Run the complete pipeline (preprocesses all data and creates master panel)
python run_pipeline.py
This will:
- ✅ Preprocess FAO nutrition data
- ✅ Preprocess obesity data
- ✅ Create food group mappings
- ✅ Create panel datasets
- ✅ Create final master panel
Output: data/processed/final/master_panel_final.csv
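run_pipeline.py itself is not reproduced in this README. As a rough sketch of how a runner of this kind could chain the five steps, assuming each step is a standalone script at the path listed in this README (STEPS, run_steps, and the start and runner parameters are illustrative names, not the project's actual code):

```python
import subprocess
import sys

# Assumed step order and script paths, taken from this README's project layout.
STEPS = [
    "scripts/preprocessing/preprocess_fao_data.py",
    "scripts/preprocessing/preprocess_obesity_data.py",
    "scripts/preprocessing/preprocess_food_group_mapping.py",
    "scripts/panels/create_panel_datasets.py",
    "scripts/panels/create_master_panel.py",
]


def run_steps(steps, start=0, runner=subprocess.run):
    """Run scripts in order, beginning at index `start`.

    Returns the index of the first step that exits non-zero, or None if
    every step succeeds. `runner` is injectable so the loop can be
    exercised without launching real processes.
    """
    for index in range(start, len(steps)):
        result = runner([sys.executable, steps[index]])
        if result.returncode != 0:
            return index  # stop here; later steps depend on this one
    return None
```

Under these assumptions, run_steps(STEPS) runs everything, and run_steps(STEPS, start=3) resumes at panel creation once an earlier failure has been fixed.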
# Exploratory Data Analysis
python scripts/analysis/perform_eda.py
# Create interactive visualizations
python scripts/analysis/interactive_plot.py
Step 0: Understand raw data (recommended first):
jupyter notebook notebooks/00_raw_data_exploration.ipynb
Main analysis notebook:
jupyter notebook notebooks/01_eda_visualization.ipynb
Note: See notebooks/README.md for complete notebook guide.
Bell labs/
│
├── run_pipeline.py                  # Main pipeline script - run this first!
│
├── data/
│   ├── raw/                         # Raw data files (FAO, WHO datasets)
│   │   ├── FoodBalanceSheet_data/
│   │   ├── Population_data/
│   │   └── data.csv                 # Obesity data
│   │
│   └── processed/                   # Cleaned and processed data (organized)
│       ├── cleaned/                 # Step 1-3: Cleaned raw data
│       │   ├── Cleaned_FAO_Nutrients.csv
│       │   ├── Cleaned_FAO_Population.csv
│       │   └── Cleaned_Obesity.csv
│       ├── mappings/                # Step 3: Mapping files
│       │   └── Item_to_FoodGroup.csv
│       ├── panels/                  # Step 4: Intermediate panels
│       └── final/                   # Step 5: Final dataset ⭐
│           └── master_panel_final.csv
│
├── scripts/                         # Processing scripts (organized by purpose)
│   ├── preprocessing/               # Step 1-3: Data preprocessing
│   │   ├── preprocess_fao_data.py
│   │   ├── preprocess_obesity_data.py
│   │   └── preprocess_food_group_mapping.py
│   ├── panels/                      # Step 4-5: Panel dataset creation
│   │   ├── create_panel_datasets.py
│   │   └── create_master_panel.py
│   └── analysis/                    # Step 6+: Analysis and visualization
│       ├── perform_eda.py
│       ├── extended_eda.py
│       └── interactive_plot.py
│
├── notebooks/                       # Jupyter notebooks for exploration
│   └── 02_eda_visualization.ipynb   # Main analysis notebook
│
├── doc/                             # Documentation (organized)
│   ├── README.md                    # Documentation guide
│   ├── guides/                      # How-to guides
│   │   └── methodology.md           # Detailed methodology
│   ├── reference/                   # Reference docs
│   │   ├── data_dictionary.md       # Variable descriptions
│   │   ├── data_analysis.md         # Dataset analysis
│   │   └── dataset_analysis.md      # Alternative analysis
│   └── notes/                       # Research notes
│       ├── research_notes.md        # Research findings
│       └── reseach_notes.md
│
└── requirements.txt                 # Python dependencies
- Preprocess FAO Data (preprocess_fao_data.py)
  - Cleans Food Balance Sheet data
  - Extracts nutrients (energy, protein, fat)
  - Extracts population data
  - Output: Cleaned_FAO_Nutrients.csv, Cleaned_FAO_Population.csv
- Preprocess Obesity Data (preprocess_obesity_data.py)
  - Cleans WHO obesity dataset
  - Standardizes country names
  - Output: Cleaned_Obesity.csv
- Create Food Group Mapping (preprocess_food_group_mapping.py)
  - Maps FAO items to food groups (Cereals, Meat, Dairy, etc.)
  - Output: Item_to_FoodGroup.csv
- Create Panel Datasets (create_panel_datasets.py)
  - Creates country-year panels for nutrients
  - Aggregates food groups by country-year
  - Output: nutrient_panel.csv, foodgroup_energy_panel.csv, etc.
- Create Master Panel (create_master_panel.py)
  - Merges all datasets into final panel
  - Handles missing data
  - Output: master_panel_final.csv ⭐
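The final merge step can be pictured as a join on the (country, year) key. The sketch below is only an illustration with made-up toy rows: the function name build_master_panel and the choice of an inner join are assumptions, and create_master_panel.py may well merge differently.

```python
import pandas as pd

KEYS = ["country", "year"]


def build_master_panel(nutrients, population, obesity):
    """Join three country-year tables on (country, year).

    Assumption: inner joins, so only country-years present in all three
    sources survive. validate= raises if a key is duplicated on a side.
    """
    panel = nutrients.merge(population, on=KEYS, how="inner", validate="one_to_one")
    panel = panel.merge(obesity, on=KEYS, how="inner", validate="one_to_one")
    return panel.sort_values(KEYS).reset_index(drop=True)


# Toy inputs (invented values, for illustration only).
nutrients = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2010, 2011, 2010],
    "energy_kcal_day": [2500.0, 2550.0, 3100.0],
})
population = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2010, 2011, 2010],
    "population": [1_000_000, 1_010_000, 5_000_000],
})
obesity = pd.DataFrame({
    "country": ["A", "B"],
    "year": [2010, 2010],
    "obesity_pct": [10.0, 25.0],
})

panel = build_master_panel(nutrients, population, obesity)
```

With these toy inputs, country A in 2011 drops out because it has no obesity row. A left join would keep it with a missing obesity_pct instead, which matters if gaps are to be interpolated afterwards.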
If you need to run steps individually:
python scripts/preprocessing/preprocess_fao_data.py
python scripts/preprocessing/preprocess_obesity_data.py
python scripts/preprocessing/preprocess_food_group_mapping.py
python scripts/panels/create_panel_datasets.py
python scripts/panels/create_master_panel.py
File: data/processed/final/master_panel_final.csv
Structure: Country-year panel (171 countries, 2010-2022)
Key Variables:
- country, year: Identifiers
- energy_kcal_day, protein_g_day, fat_g_day: Nutrients (per capita/day)
- Cereals, Meat, Dairy & Eggs, etc.: Food group energy (kcal/capita/day)
- Cereals_share, Meat_share, etc.: Food group shares (%)
- population: Total population
- obesity_pct: Obesity prevalence (%)
See data/processed/README.md for detailed variable descriptions.
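To make the relationship between the food group energy columns and the _share columns concrete, here is a minimal sketch. It assumes each share is that group's energy as a percentage of the summed energy across the listed groups. The README does not state the denominator actually used, so treat add_food_group_shares as a hypothetical helper.

```python
import pandas as pd


def add_food_group_shares(panel, groups):
    """Add a '<group>_share' column (in %) for each food group column.

    Assumption: the denominator is the row-wise sum over `groups`.
    Rows whose total is zero get NaN shares rather than a division error.
    """
    out = panel.copy()
    total = out[groups].sum(axis=1)
    total = total.where(total > 0)  # zero totals become NaN
    for group in groups:
        out[f"{group}_share"] = 100 * out[group] / total
    return out


# Toy rows (invented values, kcal/capita/day).
toy = pd.DataFrame({
    "country": ["A", "B"],
    "year": [2010, 2010],
    "Cereals": [1000.0, 0.0],
    "Meat": [500.0, 0.0],
    "Dairy & Eggs": [500.0, 0.0],
})
shares = add_food_group_shares(toy, ["Cereals", "Meat", "Dairy & Eggs"])
```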
python scripts/analysis/perform_eda.py
Generates:
- Summary statistics
- Correlation matrices
- Trend visualizations
- Outputs saved to data/outputs/
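For readers who want the same kind of numbers without running the script, the summary statistics and correlation matrix can be approximated directly with pandas. This is a sketch on invented toy rows, not the contents of perform_eda.py, and summarize is a hypothetical helper name.

```python
import pandas as pd


def summarize(panel, columns):
    """Return (summary statistics, Pearson correlation matrix) for `columns`."""
    subset = panel[columns]
    return subset.describe(), subset.corr()


# Toy rows (invented values); in practice, load master_panel_final.csv.
toy_panel = pd.DataFrame({
    "energy_kcal_day": [2000.0, 2500.0, 3000.0],
    "obesity_pct": [10.0, 15.0, 20.0],
})
stats, corr = summarize(toy_panel, ["energy_kcal_day", "obesity_pct"])
```

On the real file, the call would look like summarize(pd.read_csv("data/processed/final/master_panel_final.csv"), [...]) with whichever numeric columns are of interest.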
python scripts/analysis/interactive_plot.py
Creates interactive Plotly charts for:
- Energy vs Obesity trends
- Food group shares over time
- Country comparisons
Open notebooks/02_eda_visualization.ipynb for interactive exploration.
- Documentation Guide: doc/README.md - Overview of all documentation
- Methodology: doc/guides/methodology.md - Detailed methodology
- Data Dictionary: doc/reference/data_dictionary.md - Variable descriptions
- Research Notes: doc/notes/research_notes.md - Research findings
- Processed Data README: data/processed/README.md - Dataset documentation
- Python 3.8+
- See requirements.txt for package list
Main dependencies:
- pandas, numpy
- matplotlib, seaborn, plotly
- jupyter
- scikit-learn
- Data Sources: FAO Food Balance Sheets, WHO Global Health Observatory
- Year Coverage: 2010-2022 (common years across datasets)
- Country Coverage: 171 countries
- Missing Data: Handled via interpolation (max 2-year gaps)
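One way to read "interpolation (max 2-year gaps)" is that a run of missing years is filled linearly only when it is at most two years long and has observed values on both sides. The sketch below implements that reading. It is an assumption about the rule, not the pipeline's actual code, and it expects one value per consecutive year in chronological order.

```python
import pandas as pd


def fill_short_gaps(series, max_gap=2):
    """Linearly interpolate interior runs of NaN no longer than `max_gap`.

    Longer runs, and leading or trailing NaNs, are left untouched.
    """
    is_missing = series.isna()
    # Give every run of consecutive missing / non-missing values its own id.
    run_id = is_missing.astype(int).diff().ne(0).cumsum()
    run_length = run_id.map(run_id.value_counts())
    # Interpolate everything between observed values, then keep only the
    # results that fall inside sufficiently short gaps.
    filled = series.interpolate(method="linear", limit_area="inside")
    fillable = is_missing & (run_length <= max_gap)
    return series.where(~fillable, filled)


# Toy series: a 2-year gap followed by a 3-year gap (invented values).
values = pd.Series([1.0, None, None, 4.0, None, None, None, 8.0])
result = fill_short_gaps(values)
```

Here the two-year gap is filled and the three-year gap stays missing. For a country-year panel sorted by country and year, the function could be applied per country with panel.groupby("country")["obesity_pct"].transform(fill_short_gaps). Note that a plain interpolate(limit=2) would behave differently, since it partially fills longer gaps.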
After running the pipeline:
- Explore the data: Open notebooks/01_eda_visualization.ipynb
- Run EDA: python scripts/analysis/perform_eda.py
- Create visualizations: python scripts/analysis/interactive_plot.py
- Build models: Use data/processed/final/master_panel_final.csv for regression/ML analysis
Issue: ModuleNotFoundError
- Solution: Activate the virtual environment (source venv/bin/activate). If the error persists, reinstall dependencies with pip install -r requirements.txt
Issue: Missing raw data files
- Solution: Ensure data files are in the data/raw/ directory
Issue: Pipeline fails at a step
- Solution: Check error message, fix the issue, and re-run from that step
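For the missing raw data issue above, a quick pre-flight check can confirm the expected inputs exist before the pipeline is run. The entries in EXPECTED_RAW come from the project structure in this README. The helper missing_raw_inputs is a hypothetical name and is not part of the repository.

```python
from pathlib import Path

# Expected raw inputs, as listed under data/raw/ in the project structure.
EXPECTED_RAW = ["FoodBalanceSheet_data", "Population_data", "data.csv"]


def missing_raw_inputs(raw_dir="data/raw"):
    """Return the names from EXPECTED_RAW that do not exist under `raw_dir`."""
    raw_dir = Path(raw_dir)
    return [name for name in EXPECTED_RAW if not (raw_dir / name).exists()]
```

Running missing_raw_inputs() from the project root returns an empty list when everything is in place, and otherwise names what still needs to be downloaded or moved.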
Last Updated: 2025-01-20