On external validation in a held-out single-centre cohort (n=597 patients, 93 in-hospital deaths), the SYS Score / EndoSysScore (ESS) achieved a significantly higher AUC on the DeLong test than each of six comparator surgical risk scores (EuroSCORE II and five IE-specific scores). The result held across 30 random seeds and 1,000 bootstrap resamples.
| Score | AUC (95% CI) | ΔAUC (ESS − comparator) | DeLong p |
|---|---|---|---|
| EndoSysScore (ESS) | 0.882 (0.847–0.917) | reference | — |
| EuroSCORE II | 0.824 (0.783–0.866) | +0.058 | p<0.001 |
| EndoSCORE | 0.849 (0.810–0.887) | +0.033 | p=0.001 |
| RISK-E | 0.859 (0.817–0.897) | +0.023 | p=0.002 |
| AEPEI | 0.834 (0.787–0.873) | +0.049 | p<0.001 |
| APORTEI | 0.830 (0.784–0.874) | +0.052 | p<0.001 |
| STS-IE | 0.840 (0.796–0.880) | +0.042 | p<0.001 |
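
For readers who want to see what a DeLong comparison involves, here is a minimal pure-Python sketch of the test for two correlated AUCs computed on the same patients. It is illustrative only and is not taken from the deposited code; the function names (`delong_test`, `_placements`, `_cov`) are hypothetical, labels are assumed to be coded 0/1, and each class needs at least two cases.

```python
import math

def _placements(pos, neg):
    """Placement values: for each case, the fraction of opposite-class
    cases it is correctly ordered against (ties count as 0.5)."""
    def psi(x, y):
        return 1.0 if x > y else 0.5 if x == y else 0.0
    v10 = [sum(psi(x, y) for y in neg) / len(neg) for x in pos]
    v01 = [sum(psi(x, y) for x in pos) / len(pos) for y in neg]
    return v10, v01

def _cov(a, b):
    """Sample covariance with an (n - 1) denominator."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)

def delong_test(y_true, score_a, score_b):
    """Return (auc_a, auc_b, two-sided p) for paired scores on the same cases."""
    pos = [i for i, y in enumerate(y_true) if y == 1]
    neg = [i for i, y in enumerate(y_true) if y == 0]
    m, n = len(pos), len(neg)
    a10, a01 = _placements([score_a[i] for i in pos], [score_a[i] for i in neg])
    b10, b01 = _placements([score_b[i] for i in pos], [score_b[i] for i in neg])
    auc_a, auc_b = sum(a10) / m, sum(b10) / m
    # Variance of the AUC difference, including the covariance between the two scores.
    var = ((_cov(a10, a10) + _cov(b10, b10) - 2 * _cov(a10, b10)) / m
           + (_cov(a01, a01) + _cov(b01, b01) - 2 * _cov(a01, b01)) / n)
    if var <= 0:
        return auc_a, auc_b, float("nan")  # test undefined when the variance vanishes
    z = (auc_a - auc_b) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return auc_a, auc_b, p
```

The pairwise loops are quadratic in cohort size, which is acceptable for a cohort of a few hundred patients; larger cohorts would usually call for a rank-based implementation.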
Complete pipeline, fully reproducible from the open Zenodo deposit.
GIROC registry — 24 Italian cardiac surgery centres, 2010–2023. n=5,403 patients undergoing surgery for native or prosthetic infective endocarditis.
31 expert-defined clinical-consistency rules applied to harmonise the real cohort (e.g. dialysis implies elevated creatinine; pre-op intubation excludes ambulatory NYHA class).
Conditional Tabular GAN trained on the cleaned cohort. Three checkpoints generated (300, 600, 1000 epochs) and tested. 1000-epoch checkpoint selected based on downstream model performance.
The same 31 clinical rules applied to synthetic output; implausible synthetic records removed. ~85% retention rate.
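
A rule-based plausibility filter of this kind could be sketched as below. Only the two example rules named above are encoded. The field names, the creatinine threshold, and the reading of "ambulatory NYHA class" as classes I to III are all assumptions made for illustration, not the registry's actual definitions.

```python
# Hypothetical record fields: dialysis (bool), creatinine_mg_dl (float),
# preop_intubation (bool), nyha_class (int, 1-4).

CREATININE_ELEVATED_MG_DL = 2.0  # assumed cut-off, for illustration only

def dialysis_implies_high_creatinine(rec):
    """Rule passes unless the patient is on dialysis with a non-elevated creatinine."""
    return (not rec["dialysis"]) or rec["creatinine_mg_dl"] >= CREATININE_ELEVATED_MG_DL

def intubation_excludes_ambulatory_nyha(rec):
    """Rule passes unless an intubated patient carries an ambulatory NYHA class.
    Assumption: classes I-III count as ambulatory."""
    return (not rec["preop_intubation"]) or rec["nyha_class"] == 4

RULES = {
    "dialysis_implies_high_creatinine": dialysis_implies_high_creatinine,
    "intubation_excludes_ambulatory_nyha": intubation_excludes_ambulatory_nyha,
}

def violated_rules(rec):
    """Names of all rules the record breaks (empty list means plausible)."""
    return [name for name, rule in RULES.items() if not rule(rec)]

def filter_plausible(records):
    """Drop records that break any rule; also return the retention rate."""
    kept = [r for r in records if not violated_rules(r)]
    retention = len(kept) / len(records) if records else 0.0
    return kept, retention
```

Returning the names of violated rules, rather than a single boolean, makes it easy to report which rules reject the most synthetic records.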
10 senior cardiac surgeons each classified 200 randomly mixed real/synthetic profiles. Pooled accuracy = 52%, p=0.45 vs chance, AUROC 0.54, κ=0.04. The surgeons could not distinguish synthetic patients from real ones better than chance.
10 gradient-boosted models (n_estimators=500, max_depth=3, lr=0.03, strong L1/L2 regularisation), trained on an oversample of 88,000 events / 12,000 non-events drawn from the 1000-epoch synthetic pool. Blend recalibration (50% isotonic + 50% Platt) fitted on real development data only.
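
The 50/50 isotonic and Platt blend can be illustrated with a small stdlib-only sketch. This is a simplified stand-in, not the deposited implementation: the isotonic fit is a step-function pool-adjacent-violators estimate, the Platt fit is a plain logistic curve on the raw score fitted by gradient descent (without Platt's target smoothing), and all function names are hypothetical.

```python
import bisect
import math

def fit_isotonic(scores, labels):
    """Pool-adjacent-violators fit; returns a non-decreasing step function."""
    agg = {}
    for s, y in zip(scores, labels):  # pool tied scores first
        tot, cnt = agg.get(s, (0.0, 0))
        agg[s] = (tot + y, cnt + 1)
    blocks = []  # each block: [sum of labels, count, largest score in block]
    for s in sorted(agg):
        tot, cnt = agg[s]
        blocks.append([tot, cnt, s])
        # Merge while the previous block's mean exceeds the current block's mean.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            tot2, cnt2, smax = blocks.pop()
            blocks[-1][0] += tot2
            blocks[-1][1] += cnt2
            blocks[-1][2] = smax
    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    def predict(s):
        i = bisect.bisect_left(uppers, s)
        return values[min(i, len(values) - 1)]  # clamp above the largest score
    return predict

def fit_platt(scores, labels, iters=2000, lr=0.1):
    """Fit p = sigmoid(a * s + b) by gradient descent on the log-loss."""
    a, b, n = 1.0, 0.0, len(scores)
    for _ in range(iters):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return lambda s: 1.0 / (1.0 + math.exp(-(a * s + b)))

def fit_blend_recalibrator(scores, labels, w_iso=0.5):
    """Average the two calibrators; w_iso=0.5 gives the 50/50 blend."""
    iso = fit_isotonic(scores, labels)
    platt = fit_platt(scores, labels)
    return lambda s: w_iso * iso(s) + (1.0 - w_iso) * platt(s)
```

Averaging the two calibrators trades the flexibility of the isotonic fit against the smoothness of the sigmoid. Both are fitted here on the same data, which in the pipeline above would be the real development cohort.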
Final ESS = 0.60 × SYS-XGB + 0.40 × RISK-E. The blend weighting was selected on development data by grid search maximising DeLong superiority; stability verified across 30 seeds.
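
The blend formula and a weight search can be sketched as follows. The exact objective used for the grid search is not spelled out here, so this sketch uses AUC on development data as a stand-in, and it assumes both inputs are predicted probabilities on the same scale. The function names are hypothetical.

```python
def auc(y_true, scores):
    """Pairwise (Mann-Whitney) AUC; ties count as 0.5. Labels are 0/1."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def blend(sys_xgb, risk_e, w=0.60):
    """ESS-style blend: w * SYS-XGB + (1 - w) * RISK-E, per patient."""
    return [w * a + (1.0 - w) * b for a, b in zip(sys_xgb, risk_e)]

def grid_search_weight(y_true, sys_xgb, risk_e, step=0.05):
    """Try w = 0, step, ..., 1 and keep the first weight with the highest AUC.
    Stand-in objective: the criterion described above may differ."""
    n_steps = round(1.0 / step)
    best_w, best_auc = 0.0, -1.0
    for k in range(n_steps + 1):
        w = k * step
        current = auc(y_true, blend(sys_xgb, risk_e, w))
        if current > best_auc:
            best_w, best_auc = w, current
    return best_w, best_auc
```

Because the weight is tuned on development data, any gain it yields needs to be confirmed on data that played no part in the search, which is the role of the held-out centre.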
Held-out Centre 12 (n=597, 93 deaths). External AUC = 0.882 (95% CI 0.847–0.917). ESS was significantly superior on the DeLong test to all six comparators: EuroSCORE II, EndoSCORE, RISK-E, AEPEI, APORTEI and STS-IE (all p<0.005). Superiority held in 100% of the multi-seed and bootstrap runs.
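
A percentile bootstrap is one common way to obtain an AUC confidence interval and to check stability across resamples. The sketch below is an assumption about the general approach, not the deposited code: it does not claim that the interval reported above was computed this way, and the function names and defaults are hypothetical.

```python
import random

def auc(y_true, scores):
    """Pairwise (Mann-Whitney) AUC; ties count as 0.5. Labels are 0/1."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)  # fixed seed makes the resamples reproducible
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients with replacement
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return auc(y_true, scores), lo, hi
```

Running the same function with several seeds gives a quick view of how much the interval itself moves between bootstrap runs.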
Everything required to reproduce ESS — code, trained model artefacts, de-identified predictions, multi-seed validation results, the ablation study, and the manuscript draft — is openly deposited and citable.