[In-Depth Guide] The Complete CTGAN + SDV Pipeline for High-Fidelity Synthetic Data

In this tutorial, we build a complete, production-grade synthetic data pipeline using CTGAN and the SDV ecosystem. We start from raw mixed-type tabular data and progressively move toward constrained generation, conditional sampling, statistical validation, and downstream utility testing. Rather than stopping at sample generation, we focus on understanding how well synthetic data preserves structure, distributions, and predictive signal. This tutorial demonstrates how CTGAN can be used responsibly and rigorously in real-world data science workflows.

!pip -q install "ctgan" "sdv" "sdmetrics" "scikit-learn" "pandas" "numpy" "matplotlib"


import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")


import ctgan, sdv, sdmetrics
from ctgan import load_demo, CTGAN


from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer


from sdv.cag import Inequality, FixedCombinations
from sdv.sampling import Condition


from sdmetrics.reports.single_table import DiagnosticReport, QualityReport


from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


import matplotlib.pyplot as plt


print("Versions:")
print("ctgan:", ctgan.__version__)
print("sdv:", sdv.__version__)
print("sdmetrics:", sdmetrics.__version__)

We set up the environment by installing all required libraries and importing the full dependency stack. We explicitly load CTGAN, SDV, SDMetrics, and downstream ML tooling to ensure compatibility across the pipeline. We also surface library versions to make the experiment reproducible and debuggable.
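
For tighter reproducibility, we can also fix the random seeds before any training happens. The snippet below is a minimal sketch (torch is importable because it ships as a CTGAN dependency); note that GAN training on a GPU can remain nondeterministic even with seeds pinned.

import random
import torch

SEED = 42  # arbitrary fixed value
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)  # CTGAN trains with PyTorch under the hood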

real = load_demo().copy()
real.columns = [c.strip().replace(" ", "_") for c in real.columns]


target_col = "income"
real[target_col] = real[target_col].astype(str)


categorical_cols = real.select_dtypes(include=["object"]).columns.tolist()
numerical_cols = [c for c in real.columns if c not in categorical_cols]


print("Rows:", len(real), "Cols:", len(real.columns))
print("Categorical:", len(categorical_cols), "Numerical:", len(numerical_cols))
display(real.head())


ctgan_model = CTGAN(
   epochs=30,
   batch_size=500,
   verbose=True
)
ctgan_model.fit(real, discrete_columns=categorical_cols)
synthetic_ctgan = ctgan_model.sample(5000)
print("Standalone CTGAN sample:")
display(synthetic_ctgan.head())

We load the CTGAN Adult demo dataset and perform minimal normalization on column names and data types. We explicitly identify categorical and numerical columns, which is critical for both CTGAN training and evaluation. We then train a baseline standalone CTGAN model and generate synthetic samples for comparison.
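
Before moving on, a quick sanity check confirms the baseline output is plausible. This is a minimal, pandas-only sketch that compares numeric summary statistics side by side and verifies that the sample does not invent unseen categories:

summary = pd.concat(
   {"real": real[numerical_cols].describe().T,
    "synthetic": synthetic_ctgan[numerical_cols].describe().T},
   axis=1
)
display(summary[[("real", "mean"), ("synthetic", "mean"),
                 ("real", "std"), ("synthetic", "std")]])

for c in categorical_cols:
   unseen = set(synthetic_ctgan[c].unique()) - set(real[c].unique())
   if unseen:
      print(f"{c}: unseen categories in synthetic sample -> {unseen}")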

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=real)
metadata.update_column(column_name=target_col, sdtype="categorical")


constraints = []


if len(numerical_cols) >= 2:
   col_lo, col_hi = numerical_cols[0], numerical_cols[1]
   constraints.append(Inequality(low_column_name=col_lo, high_column_name=col_hi))
   print(f"Added Inequality constraint: {col_hi} > {col_lo}")


if len(categorical_cols) >= 2:
   c1, c2 = categorical_cols[0], categorical_cols[1]
   constraints.append(FixedCombinations(column_names=[c1, c2]))
   print(f"Added FixedCombinations constraint on: [{c1}, {c2}]")


synth = CTGANSynthesizer(
   metadata=metadata,
   epochs=30,
   batch_size=500
)


if constraints:
   synth.add_constraints(constraints)


synth.fit(real)


synthetic_sdv = synth.sample(num_rows=5000)
print("SDV CTGANSynthesizer sample:")
display(synthetic_sdv.head())

We construct a formal metadata object and attach explicit semantic types to the dataset. We introduce structural constraints using SDV’s constraint graph system, enforcing numeric inequalities and validity of categorical combinations. We then train a CTGAN-based SDV synthesizer that respects these constraints during generation.
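
Constraints are easy to declare but worth verifying empirically. The check below is a minimal sketch that reuses the col_lo/col_hi and c1/c2 names chosen above; note that Inequality is non-strict by default, hence the >= comparison.

if len(numerical_cols) >= 2:
   ok = (synthetic_sdv[col_hi] >= synthetic_sdv[col_lo]).all()
   print(f"Inequality {col_hi} >= {col_lo} holds for all rows: {ok}")

if len(categorical_cols) >= 2:
   real_pairs = set(map(tuple, real[[c1, c2]].drop_duplicates().to_numpy()))
   syn_pairs = set(map(tuple, synthetic_sdv[[c1, c2]].drop_duplicates().to_numpy()))
   print("All synthetic combinations appear in the real data:", syn_pairs <= real_pairs)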

loss_df = synth.get_loss_values()
display(loss_df.tail())


x_candidates = ["epoch", "step", "steps", "iteration", "iter", "batch", "update"]
xcol = next((c for c in x_candidates if c in loss_df.columns), None)


g_candidates = ["generator_loss", "gen_loss", "g_loss"]
d_candidates = ["discriminator_loss", "disc_loss", "d_loss"]
gcol = next((c for c in g_candidates if c in loss_df.columns), None)
dcol = next((c for c in d_candidates if c in loss_df.columns), None)


plt.figure(figsize=(10,4))


if xcol is None:
   x = np.arange(len(loss_df))
else:
   x = loss_df[xcol].to_numpy()


if gcol is not None:
   plt.plot(x, loss_df[gcol].to_numpy(), label=gcol)
if dcol is not None:
   plt.plot(x, loss_df[dcol].to_numpy(), label=dcol)


plt.xlabel(xcol if xcol is not None else "index")
plt.ylabel("loss")
plt.legend()
plt.title("CTGAN training losses (SDV wrapper)")
plt.show()


cond_col = categorical_cols[0]
common_value = real[cond_col].value_counts().index[0]
conditions = [Condition({cond_col: common_value}, num_rows=2000)]


synthetic_cond = synth.sample_from_conditions(
   conditions=conditions,
   max_tries_per_batch=200,
   batch_size=5000
)


print("Conditional sampling requested:", 2000, "got:", len(synthetic_cond))
print("Conditional sample distribution (top 5):")
print(synthetic_cond[cond_col].value_counts().head(5))
display(synthetic_cond.head())

We extract and visualize the dynamics of generator and discriminator losses using a version-robust plotting strategy. We perform conditional sampling to generate data under specific attribute constraints and verify that the conditions are satisfied. This demonstrates how CTGAN behaves under guided generation scenarios.
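
We can push the conditional check one level deeper by overlaying a numeric distribution from the conditional sample against the matching real subset. This is a minimal sketch that assumes the first numerical column is a reasonable probe:

probe = numerical_cols[0]
real_subset = real[real[cond_col] == common_value]

plt.figure(figsize=(10, 4))
plt.hist(real_subset[probe], bins=40, alpha=0.5, density=True, label="real (conditioned)")
plt.hist(synthetic_cond[probe], bins=40, alpha=0.5, density=True, label="synthetic (conditioned)")
plt.xlabel(probe)
plt.ylabel("density")
plt.legend()
plt.title(f"{probe} distribution for {cond_col} = {common_value}")
plt.show()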

metadata_dict = metadata.to_dict()


diagnostic = DiagnosticReport()
diagnostic.generate(real_data=real, synthetic_data=synthetic_sdv, metadata=metadata_dict, verbose=True)
print("Diagnostic score:", diagnostic.get_score())


quality = QualityReport()
quality.generate(real_data=real, synthetic_data=synthetic_sdv, metadata=metadata_dict, verbose=True)
print("Quality score:", quality.get_score())


def show_report_details(report, title):
   print(f"\n===== {title} details =====")
   props = report.get_properties()
   # get_properties() returns a DataFrame in recent sdmetrics versions
   names = props["Property"].tolist() if hasattr(props, "columns") else list(props)
   for p in names:
      print(f"\n--- {p} ---")
      details = report.get_details(property_name=p)
      try:
         display(details.head(10))
      except Exception:
         display(details)


show_report_details(diagnostic, "DiagnosticReport")
show_report_details(quality, "QualityReport")


train_real, test_real = train_test_split(
   real, test_size=0.25, random_state=42, stratify=real[target_col]
)


def make_pipeline(cat_cols, num_cols):
   pre = ColumnTransformer(
       transformers=[
           ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
           ("num", "passthrough", num_cols),
       ],
       remainder="drop"
   )
   clf = LogisticRegression(max_iter=200)
   return Pipeline([("pre", pre), ("clf", clf)])


pipe_syn = make_pipeline(categorical_cols, numerical_cols)
pipe_syn.fit(synthetic_sdv.drop(columns=[target_col]), synthetic_sdv[target_col])


# Column 1 of predict_proba is the lexicographically larger class label,
# which for the Adult income column is the positive ">50K" class
proba_syn = pipe_syn.predict_proba(test_real.drop(columns=[target_col]))[:, 1]
y_true = (test_real[target_col].astype(str).str.contains(">")).astype(int)
auc_syn = roc_auc_score(y_true, proba_syn)
print("Synthetic-train -> Real-test AUC:", auc_syn)


pipe_real = make_pipeline(categorical_cols, numerical_cols)
pipe_real.fit(train_real.drop(columns=[target_col]), train_real[target_col])


proba_real = pipe_real.predict_proba(test_real.drop(columns=[target_col]))[:, 1]
auc_real = roc_auc_score(y_true, proba_real)
print("Real-train -> Real-test AUC:", auc_real)


model_path = "ctgan_sdv_synth.pkl"
synth.save(model_path)
print("Saved synthesizer to:", model_path)


# Reload via the synthesizer's load() classmethod
synth_loaded = CTGANSynthesizer.load(filepath=model_path)


synthetic_loaded = synth_loaded.sample(1000)
print("Loaded synthesizer sample:")
display(synthetic_loaded.head())

We evaluate synthetic data using SDMetrics diagnostic and quality reports and a property-level inspection. We validate downstream usefulness by training a classifier on synthetic data and testing it on real data. Finally, we serialize the trained synthesizer and confirm that it can be reloaded and sampled reliably.
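
As a complement to the aggregate report scores, SDMetrics also exposes per-column metrics we can call directly. The sketch below computes the KS-based similarity score for each numeric column and summarizes the utility gap between the two classifiers trained above:

from sdmetrics.single_column import KSComplement

for c in numerical_cols:
   score = KSComplement.compute(real_data=real[c], synthetic_data=synthetic_sdv[c])
   print(f"KSComplement[{c}]: {score:.3f}")

print(f"Utility gap (real AUC - synthetic AUC): {auc_real - auc_syn:.3f}")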

In conclusion, we demonstrated that synthetic data generation with CTGAN becomes significantly more powerful when paired with metadata, constraints, and rigorous evaluation. By validating both statistical similarity and downstream task performance, we ensured that the synthetic data is not only realistic but also useful. This pipeline serves as a strong foundation for privacy-preserving analytics, data sharing, and simulation workflows. With careful configuration and evaluation, CTGAN can be safely deployed in real-world data science systems.

