Synthetic data refers to artificially generated data that mimics the statistical properties and patterns of real-world data, but without containing any original identifying information. It is primarily used in scenarios where real data is sensitive, scarce, or difficult to access due to privacy concerns, regulations, or logistical challenges.
This repository contains the demo code for the talk "Synthetic Data Generation with Python and LLMs," presented at PyCon Somalia 2025.
The project demonstrates two approaches to synthetic data generation, using diabetes patient records as an example:
- Using statistical methods to generate synthetic data with NumPy. See demo_1_using_numpy.ipynb.
- Using generative models and LLMs with LangChain and OpenAI models. See demo_2_using_llm.ipynb.
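
The statistical approach can be illustrated with a minimal sketch. This is not the notebook's code: the function name `generate_patients`, the `Patient_ID` format, the fixed reference date, and every distribution parameter below are assumptions chosen for illustration only, with no clinical basis.

```python
from datetime import date, timedelta

import numpy as np


def generate_patients(n: int, seed: int = 42) -> list[dict]:
    """Draw n synthetic patient records from hand-picked distributions."""
    rng = np.random.default_rng(seed)
    today = date(2025, 1, 1)  # assumed fixed reference date, keeps runs reproducible

    # All ranges, means, and proportions here are illustrative assumptions.
    age_days = rng.integers(18 * 365, 90 * 365, size=n)
    sex = rng.choice(["Male", "Female"], size=n)
    diabetes_type = rng.choice(["Type 1", "Type 2"], size=n, p=[0.1, 0.9])
    hba1c = np.clip(rng.normal(7.5, 1.2, size=n), 4.0, 14.0)
    glucose = np.clip(rng.normal(140.0, 35.0, size=n), 60.0, 400.0)
    bmi = np.clip(rng.normal(29.0, 5.0, size=n), 15.0, 60.0)
    days_since_visit = rng.integers(0, 183, size=n)  # roughly the last 6 months

    records = []
    for i in range(n):
        records.append({
            "Patient_ID": f"P{i + 1:05d}",  # hypothetical ID format
            "Date_of_birth": (today - timedelta(days=int(age_days[i]))).isoformat(),
            "Sex": str(sex[i]),
            "Diabetes_type": str(diabetes_type[i]),
            "HbA1c_percent": round(float(hba1c[i]), 1),
            "Fasting_Glucose_mg_dL": round(float(glucose[i]), 1),
            "BMI_kg_m2": round(float(bmi[i]), 1),
            "Last_Visit_Date": (today - timedelta(days=int(days_since_visit[i]))).isoformat(),
        })
    return records
```

Sampling each column independently like this ignores correlations between fields (for example between HbA1c and fasting glucose), which is one reason such data may not resemble real clinical datasets.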
Requires Python 3.10+ and uv.
    git clone https://github.com/ahmedunshur/synthetic-data-generation-pycon-somalia-2025.git
    cd synthetic-data-generation-pycon-somalia-2025
    uv sync

Set the OPENAI_API_KEY environment variable, which is required for the OpenAI models.
It is used in the demo_2_using_llm.ipynb notebook.
SECURITY NOTE: If you store the key in a .env file, make sure it is NOT committed to version control. Add .env to .gitignore.
The synthetic diabetes patient datasets contain the following fields:
| Field Name | Description |
|---|---|
| Patient_ID | Unique patient identifier |
| Date_of_birth | Patient's date of birth in YYYY-MM-DD format |
| Sex | Patient's biological sex (Male, Female) |
| Diabetes_type | Type of diabetes (Type 1, Type 2) |
| HbA1c_percent | Glycated hemoglobin level (%); indicates blood glucose control over the past 2-3 months |
| Fasting_Glucose_mg_dL | Fasting blood glucose (mg/dL) |
| BMI_kg_m2 | Body Mass Index (kg/m²) |
| Last_Visit_Date | Most recent clinical visit date (within the last 6 months) in YYYY-MM-DD format |

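
The field definitions above can be turned into a simple per-record check, which is useful for LLM-generated rows in particular since they may drift from the requested format. This is a sketch, not code from the notebooks: `validate_record` is a hypothetical helper, and treating "within the last 6 months" as 183 days is an assumption.

```python
from datetime import date, timedelta

ALLOWED_SEX = {"Male", "Female"}
ALLOWED_TYPES = {"Type 1", "Type 2"}
NUMERIC_FIELDS = ("HbA1c_percent", "Fasting_Glucose_mg_dL", "BMI_kg_m2")
DATE_FIELDS = ("Date_of_birth", "Last_Visit_Date")


def validate_record(record: dict, today: date) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    if not str(record.get("Patient_ID", "")).strip():
        problems.append("Patient_ID is empty")
    if record.get("Sex") not in ALLOWED_SEX:
        problems.append("Sex must be Male or Female")
    if record.get("Diabetes_type") not in ALLOWED_TYPES:
        problems.append("Diabetes_type must be Type 1 or Type 2")

    # Numeric fields must parse as numbers and be positive.
    for field in NUMERIC_FIELDS:
        try:
            if float(record[field]) <= 0:
                problems.append(f"{field} must be positive")
        except (KeyError, TypeError, ValueError):
            problems.append(f"{field} is missing or not numeric")

    # Date fields must be ISO dates (YYYY-MM-DD).
    parsed = {}
    for field in DATE_FIELDS:
        try:
            parsed[field] = date.fromisoformat(str(record[field]))
        except (KeyError, ValueError):
            problems.append(f"{field} is not a YYYY-MM-DD date")

    last_visit = parsed.get("Last_Visit_Date")
    if last_visit is not None:
        # Assumption: "within the last 6 months" is approximated as 183 days.
        if not (today - timedelta(days=183) <= last_visit <= today):
            problems.append("Last_Visit_Date is not within the last 6 months")
    return problems
```

The function only checks format and allowed values; it makes no judgment about whether the numbers are clinically plausible.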
Generated synthetic datasets are saved to data/:
- synthetic_diabetes_patient_dataset_generated_with_numpy.csv
- synthetic_diabetes_patient_dataset_generated_with_llm.csv
Contributions are welcome! Please open an issue to discuss changes before submitting a PR.
The code in this repository is licensed under the MIT License. See the LICENSE file for more details.
The presentation material for the PyCon Somalia 2025 talk is licensed under the Creative Commons Attribution 4.0 International License.
All datasets generated in this project are synthetic and do not represent real patients.
The generated data may not accurately reflect real-world clinical datasets. It is produced using simplified assumptions, and no physician or clinical expert was consulted in its generation.
THESE DATASETS CANNOT BE USED FOR RESEARCH, MEDICAL, OR CLINICAL PURPOSES. THEY ARE GENERATED FOR DEMONSTRATION AND EDUCATIONAL PURPOSES ONLY.