ahmedunshur/synthetic-data-generation-pycon-somalia-2025

Synthetic Data Generation with Python and LLMs

Synthetic data refers to artificially generated data that mimics the statistical properties and patterns of real-world data, but without containing any original identifying information. It is primarily used in scenarios where real data is sensitive, scarce, or difficult to access due to privacy concerns, regulations, or logistical challenges.

Overview

This repository contains the demo code for the talk "Synthetic Data Generation with Python and LLMs," presented at PyCon Somalia 2025.

The project demonstrates two approaches to synthetic data generation, using diabetes patient records as an example:

  1. Using statistical methods to generate synthetic data with NumPy. See demo_1_using_numpy.ipynb.
  2. Using generative models and LLMs with LangChain and OpenAI models. See demo_2_using_llm.ipynb.
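The first, statistical approach can be sketched roughly as follows. This is a minimal illustration and not the code from demo_1_using_numpy.ipynb: the function name generate_patients, the distribution parameters, the 10/90 split between diabetes types, and the Patient_ID format are all assumptions chosen for the example, with no clinical basis.

```python
from datetime import date, timedelta

import numpy as np


def generate_patients(n, seed=0, today=date(2025, 1, 1)):
    """Return a list of n synthetic patient records as dicts (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # seeded generator for reproducible output

    # Assumption: roughly 10% Type 1 and 90% Type 2, sexes equally likely.
    diabetes_type = rng.choice(["Type 1", "Type 2"], size=n, p=[0.1, 0.9])
    sex = rng.choice(["Male", "Female"], size=n)

    # Assumption: normal distributions, clipped to plausible-looking ranges.
    hba1c = np.clip(rng.normal(8.0, 1.5, size=n), 5.0, 14.0)
    glucose = np.clip(rng.normal(150.0, 40.0, size=n), 70.0, 400.0)
    bmi = np.clip(rng.normal(29.0, 5.0, size=n), 15.0, 60.0)

    # Ages of about 18 to 90 years, expressed in days before `today`.
    age_days = rng.integers(18 * 365, 90 * 365, size=n)
    # Last visit 0 to 182 days ago, approximating "within the last 6 months".
    visit_days_ago = rng.integers(0, 183, size=n)

    records = []
    for i in range(n):
        records.append({
            "Patient_ID": f"P{i + 1:05d}",
            "Date_of_birth": (today - timedelta(days=int(age_days[i]))).isoformat(),
            "Sex": str(sex[i]),
            "Diabetes_type": str(diabetes_type[i]),
            "HbA1c_percent": round(float(hba1c[i]), 1),
            "Fasting_Glucose_mg_dL": round(float(glucose[i]), 1),
            "BMI_kg_m2": round(float(bmi[i]), 1),
            "Last_Visit_Date": (today - timedelta(days=int(visit_days_ago[i]))).isoformat(),
        })
    return records
```

Sampling each column independently keeps the sketch short, but it ignores correlations between fields (for example, between HbA1c and fasting glucose) that a more careful generator would need to model.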

Installation

Requires Python 3.10+ and uv.

git clone https://github.com/ahmedunshur/synthetic-data-generation-pycon-somalia-2025.git
cd synthetic-data-generation-pycon-somalia-2025
uv sync

Configuration

Set the OPENAI_API_KEY environment variable. It is required for the OpenAI models used in the demo_2_using_llm.ipynb notebook.

SECURITY NOTE: If you store the key in a .env file, make sure the file is NOT committed to version control. Add .env to .gitignore.
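A small check like the one below can fail early with a clear message when the key is missing. It is only a sketch: the helper name require_api_key is hypothetical and is not part of this repository or of any library.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error (hypothetical helper)."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the notebook.")
    return key
```

Calling require_api_key() at the top of a notebook surfaces a configuration problem before any model request is made.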

Dataset Description

The synthetic diabetes patient datasets contain the following fields:

Field Name             Description
Patient_ID             Unique patient identifier
Date_of_birth          Patient's date of birth in YYYY-MM-DD format
Sex                    Patient's biological sex (Male, Female)
Diabetes_type          Type of diabetes (Type 1, Type 2)
HbA1c_percent          Glycated hemoglobin level (%); indicates blood glucose control over 2-3 months
Fasting_Glucose_mg_dL  Fasting blood glucose (mg/dL)
BMI_kg_m2              Body Mass Index (kg/m²)
Last_Visit_Date        Most recent clinical visit date (within last 6 months) in YYYY-MM-DD format
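The field descriptions above can be turned into a simple record check, which is useful for LLM-generated rows that may drift from the schema. This is a sketch under stated assumptions: validate_record is a hypothetical helper, "within last 6 months" is approximated as 183 days, and no clinical range checks are attempted.

```python
from datetime import date

EXPECTED_FIELDS = (
    "Patient_ID", "Date_of_birth", "Sex", "Diabetes_type",
    "HbA1c_percent", "Fasting_Glucose_mg_dL", "BMI_kg_m2", "Last_Visit_Date",
)


def validate_record(record, today):
    """Return a list of problems found in one record (empty list if none)."""
    missing = [f for f in EXPECTED_FIELDS if f not in record]
    if missing:
        return ["missing fields: " + ", ".join(missing)]

    problems = []
    if record["Sex"] not in ("Male", "Female"):
        problems.append("Sex must be Male or Female")
    if record["Diabetes_type"] not in ("Type 1", "Type 2"):
        problems.append("Diabetes_type must be Type 1 or Type 2")

    # Both date fields must parse as ISO dates.
    for field in ("Date_of_birth", "Last_Visit_Date"):
        try:
            date.fromisoformat(record[field])
        except (TypeError, ValueError):
            problems.append(f"{field} is not a valid ISO date")

    # Assumption: "within last 6 months" is approximated as 0 to 183 days ago.
    if not any(p.startswith("Last_Visit_Date") for p in problems):
        days_ago = (today - date.fromisoformat(record["Last_Visit_Date"])).days
        if not 0 <= days_ago <= 183:
            problems.append("Last_Visit_Date must be within the last 6 months")
    return problems
```

Returning a list of messages instead of raising on the first failure makes it easier to report every issue in a generated row at once.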

Output

Generated synthetic datasets are saved to data/:

  • synthetic_diabetes_patient_dataset_generated_with_numpy.csv
  • synthetic_diabetes_patient_dataset_generated_with_llm.csv
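Writing and re-reading such a CSV file needs only the standard library. The sketch below assumes records are plain dicts sharing the same keys; save_records and load_records are hypothetical helpers, not functions from the notebooks.

```python
import csv
from pathlib import Path


def save_records(records, path):
    """Write a non-empty list of dict records to a CSV file and return the path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create data/ if it is missing
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
    return path


def load_records(path):
    """Read the CSV back as a list of dicts (all values come back as strings)."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Note that the csv module returns every value as a string, so numeric fields such as HbA1c_percent need to be converted after loading.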

Contributing

Contributions are welcome! Please open an issue to discuss changes before submitting a PR.

License

The code in this repository is licensed under the MIT License. See the LICENSE file for more details.

The presentation material for the PyCon Somalia 2025 talk is licensed under the Creative Commons Attribution 4.0 International License.


⚠️ DISCLAIMER

All datasets generated in this project are synthetic and do not represent real patients.

The generated data may not accurately reflect real-world clinical datasets. It is produced using simplified assumptions, and no physician or clinical expert was consulted in its generation.

THESE DATASETS MUST NOT BE USED FOR RESEARCH, MEDICAL, OR CLINICAL PURPOSES. THEY ARE GENERATED FOR DEMONSTRATION AND EDUCATIONAL PURPOSES ONLY.
