# Synthetic Data That Behaves: A Practical Guide to Generating Realistic Healthcare-Like Data Without Violating Privacy
*Author: Abhishek Yadav. Originally published on Towards AI.*

A hands-on guide to building synthetic data that looks, feels, and behaves like the real world, without privacy risk.

Healthcare organizations sit on treasure chests of data: appointments, lab results, care journeys, billing patterns, social determinants. Yet the very rules designed to protect patient privacy often make it nearly impossible for analysts and data scientists to experiment freely.

And that's a problem. Innovation doesn't start in production systems. It starts with play: trying ideas, building prototypes, and running experiments without fear of leaking sensitive data. That's where synthetic data becomes one of the most powerful tools in a data scientist's toolkit.

In this guide, I'll show you a practical, no-nonsense approach to generating healthcare-like synthetic data that behaves like the real thing, preserving shape, distributions, trends, and correlations, without exposing a single patient. No GANs. No deep learning. No fragile black-box models. Just a clean, transparent method that you can explain to your compliance team in under two minutes.

## Why Synthetic Data Matters (More Than Ever)

Most organizations face three common constraints:

1. **Privacy laws delay or block innovation.** HIPAA, GDPR, and internal policies often prevent teams from sharing even de-identified datasets internally.
2. **Analysts can't experiment fast.** Requesting access to PHI can take weeks or months, killing momentum for early ideas.
3. **Teams need realistic datasets for prototypes and demos.** Executive demos, model validations, and vendor evaluations all need data, but not real patient data.
Synthetic data, done correctly, solves all three problems:

✔ Behaves like real data
✔ Contains no patient information
✔ Can be shared freely across teams
✔ Speeds up model development dramatically

## But Not All Synthetic Data Is Good Synthetic Data

If your synthetic data looks like uniform noise, you've only created fake data, not useful synthetic data. Good synthetic data has three properties:

1. **Statistical fidelity.** Distributions, seasonal trends, missingness patterns, and class ratios resemble the real data.
2. **Relationship fidelity.** Correlations stay intact; for example, older age relates to higher visit frequency, and chronic conditions relate to higher readmission risk.
3. **Behavioral fidelity.** The shape of the data over time looks right: wait times, appointment lead times, cancellations, and so on.

Our goal is not perfect replication of the real dataset. Our goal is behavioral realism.

## The Approach: A Transparent 3-Layer Synthetic Generator

Deep generative models (GANs, VAEs) can create beautiful synthetic datasets, but they are:

- Hard to tune
- Risk-prone (they may leak patterns from small datasets)
- A black box to compliance teams

Instead, I use a simple and explainable three-layer generator that works for most healthcare operations datasets.

### Layer 1: Distributions That Match Reality

Every variable gets a distribution based on real-world behavior: for example, log-normal ages and lead times, a weighted categorical provider mix, and beta-distributed no-show probabilities. This layer ensures your dataset *looks* real.

### Layer 2: Respecting Correlations

Example relationships we preserve:

- Older patients → higher visit frequency
- Chronic conditions → longer appointment lead times
- New patients → higher no-show probability
- Certain specialties → higher cancellation rates

We approximate correlations using:

- Spearman rank correlation for skewed healthcare variables
- A copula model to generate correlated samples
- Logical rules layered on top (e.g., `chronic_conditions > 3` → high visit risk)

The dataset now *behaves* like the real world.

### Layer 3: Realistic Time Behavior

Healthcare data is not static.
It has patterns and seasonality:

- Hourly cycles (e.g., AM peak, lunch dip, PM ramp)
- Weekly patterns (Monday load, weekend effect)
- Seasonality (flu season spikes, December cancellations)

We add these via simple, explainable curves and holiday flags, no black boxes or AI magic needed.

## A Minimal Python Generator You Can Trust

Here is a clean and readable version you can publish:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
N = 5000  # dataset size

# ---- Layer 1: Distributions ----
ages = np.random.lognormal(mean=3.6, sigma=0.4, size=N).astype(int)
ages = np.clip(ages, 18, 95)

lead_time = np.random.lognormal(mean=2.3, sigma=0.5, size=N).astype(int)

provider_type = np.random.choice(
    ['Primary Care', 'Cardiology', 'Dermatology', 'Neurology'],
    size=N,
    p=[0.55, 0.20, 0.15, 0.10]
)

no_show_prob = np.random.beta(2, 10, size=N)

# ---- Layer 2: Relationships ----
visit_frequency = np.round((ages / 30) + np.random.normal(0, 1, N)).clip(0)
chronic_conditions = np.random.poisson(lam=1.2, size=N)

# Rule-based adjustment: more chronic conditions -> slightly higher no-show odds
no_show_flag = (np.random.rand(N) < (no_show_prob + chronic_conditions * 0.02)).astype(int)

# ---- Layer 3: Time Behavior ----
days = pd.date_range(start="2023-01-01", periods=N, freq="h")  # hourly appointment slots
seasonality = np.sin(np.linspace(0, 6 * np.pi, N))  # three smooth seasonal cycles
wait_time = (lead_time * (1 + 0.3 * seasonality)).clip(0).astype(int)

# ---- Output ----
df = pd.DataFrame({
    'age': ages,
    'lead_time_days': lead_time,
    'provider': provider_type,
    'chronic_conditions': chronic_conditions,
    'visit_frequency': visit_frequency,
    'no_show': no_show_flag,
    'appointment_date': days,
    'adjusted_wait_time': wait_time
})
df.head()
```

This gives you:

- Realistic age curves
- Realistic lead times
- Realistic no-show patterns
- Seasonal behavior
- Clean correlations

A dataset that behaves, without ever touching PHI.

## How Close Does Synthetic Data Need to Be?
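One way to put a number on "close" is a two-sample Kolmogorov–Smirnov test between a real column and its synthetic counterpart. The sketch below simulates a stand-in for the protected data (the `real_ages` array is hypothetical, for illustration only); in practice it would come from the source system.

```python
# Sketch: quantifying "close enough" with a two-sample KS test.
# `real_ages` is a simulated stand-in for protected data here.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real_ages = np.clip(rng.lognormal(mean=3.6, sigma=0.4, size=5000), 18, 95)
synthetic_ages = np.clip(rng.lognormal(mean=3.6, sigma=0.4, size=5000), 18, 95)

stat, p_value = ks_2samp(real_ages, synthetic_ages)
print(f"KS statistic: {stat:.3f}")  # small statistic -> similar distribution shapes
```

A small KS statistic (well under ~0.1) says the marginal shapes line up; it deliberately says nothing about matching individual records, which is exactly the property we want.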
The rule I use when working with compliance and clinical partners:

> "The data should behave like the real world, not mimic any real patient."

You can validate this using three checks:

1. **Distribution plots.** Compare real vs. synthetic at a high level.
2. **Correlation matrices.** Compare Spearman correlation matrices; directions should match, and magnitudes should be plausible (not identical to any specific dataset).
3. **Downstream model accuracy.** Train a simple model (e.g., logistic regression for no-show). Performance trends on synthetic data should approximate the real world (e.g., which features matter) without producing identical metrics.

Quick visualization code:

```python
import matplotlib.pyplot as plt

# Histograms of key variables
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1); plt.hist(df['age'], bins=30, color='#4c78a8'); plt.title('Age')
plt.subplot(1, 3, 2); plt.hist(df['lead_time_days'], bins=30, color='#f58518'); plt.title('Lead Time (days)')
plt.subplot(1, 3, 3); plt.hist(df['chronic_conditions'], bins=15, color='#54a24b'); plt.title('Chronic Conditions')
plt.tight_layout(); plt.show()

# Spearman correlation heatmap
cols = ['age', 'lead_time_days', 'chronic_conditions', 'visit_frequency']
corr = df[cols].corr(method='spearman')
plt.figure(figsize=(5, 4))
plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar(); plt.xticks(range(len(cols)), cols, rotation=45); plt.yticks(range(len(cols)), cols)
plt.title('Spearman Correlation'); plt.tight_layout(); plt.show()
```

*Figure: distribution plots for Age, Lead Time, and Chronic Conditions (Matplotlib).*

*Figure: Spearman correlation heatmap (Matplotlib).*

## A Case Study: Appointment Optimization

A healthcare organization wanted to explore:

- reducing no-shows
- optimizing scheduling slots
- measuring wait-time bottlenecks

But analysts didn't have access to operational data until approvals were cleared.
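While waiting, early experiments such as a no-show prediction baseline can run entirely on synthetic data. Here's a minimal sketch of that idea; the synthetic frame is rebuilt inline so the snippet is self-contained, and scikit-learn is an assumed dependency not used elsewhere in this article.

```python
# Sketch: a no-show baseline trained purely on synthetic data.
# The frame mirrors the generator above but is rebuilt here so
# this snippet runs standalone.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
N = 5000
df = pd.DataFrame({
    'age': np.clip(rng.lognormal(3.6, 0.4, N), 18, 95),
    'lead_time_days': rng.lognormal(2.3, 0.5, N),
    'chronic_conditions': rng.poisson(1.2, N),
})
no_show_prob = rng.beta(2, 10, N)
df['no_show'] = (rng.random(N) < no_show_prob + df['chronic_conditions'] * 0.02).astype(int)

X = df[['age', 'lead_time_days', 'chronic_conditions']]
y = df['no_show']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"No-show AUC on synthetic data: {auc:.3f}")
```

The absolute AUC is not the point; the point is that the pipeline, feature engineering, and evaluation code are ready to swap in real data the day access is granted.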
By building a synthetic dataset like the one above, they could:

- Test feature engineering
- Build early prototypes
- Try no-show prediction models
- Experiment with scheduling simulations

When real data access was granted weeks later, 80% of the pipeline was already built. The team went from idea to deployment in 13 days instead of 2–3 months.

Synthetic data didn't replace real data. It unlocked speed.

## The Real Benefit: Freedom

[…]