Leakage-Safe Benchmark Design for Market-Stress Early Warning: An Economically Credible Evaluation
Market-stress early-warning systems are often evaluated under benchmark settings that permit information leakage or obscure the operational tradeoffs relevant to monitoring decisions. This paper develops a leakage-safe benchmark design for assessing machine-learning-based early-warning models in a financially meaningful and reproducible way. Using a unified walk-forward evaluation framework, the study compares alternative warning systems across benchmark specifications that vary in feature sets, forecast horizons, stress-event definitions, and threshold-selection rules. Performance is evaluated not only with conventional statistical measures, including ROC-AUC and PR-AUC, but also with operational criteria directly relevant to real-time monitoring, such as false alarms per year, event hit rate, median lead time, and alert-budget diagnostics. The results show that benchmark design materially affects model rankings and that systems appearing strong on such statistical metrics are not always preferred once operational tradeoffs are taken into account. Across specifications, leakage-safe evaluation reveals substantial heterogeneity in performance and highlights the importance of aligning model assessment with the practical objectives of early-warning deployment. These findings suggest that benchmark specification should be treated as part of the empirical question rather than as a neutral background choice. The paper contributes to the computational evaluation of financial early-warning systems by providing a transparent and economically credible framework for model comparison under temporal dependence.
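For concreteness, the following minimal Python sketch illustrates the kind of leakage-safe walk-forward protocol and operational metrics the abstract refers to: a rolling train/test split with a purge gap, an alert threshold calibrated on the training window only, and summary statistics for false alarms per year, hit rate, and lead time. The function names, the purge-gap and alert-budget rules, and the synthetic data are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def walk_forward_splits(n_obs, train_size, test_size, purge_gap):
    """Yield (train_idx, test_idx) pairs for a rolling walk-forward scheme.
    A purge gap of `purge_gap` observations separates each training window
    from its test window so forward-looking stress labels cannot leak."""
    start = 0
    while start + train_size + purge_gap + test_size <= n_obs:
        train_idx = np.arange(start, start + train_size)
        test_start = start + train_size + purge_gap
        test_idx = np.arange(test_start, test_start + test_size)
        yield train_idx, test_idx
        start += test_size  # roll both windows forward by one test block

def threshold_for_alert_budget(train_scores, periods_per_year, alerts_per_year):
    """Choose a warning threshold that spends roughly `alerts_per_year` alerts,
    calibrated on training-window scores only (no peeking at the test window)."""
    n_alerts = max(int(alerts_per_year * len(train_scores) / periods_per_year), 1)
    return np.sort(train_scores)[-n_alerts]

def operational_metrics(alerts, events, horizon, periods_per_year):
    """False alarms per year, event hit rate, and median lead time (in periods),
    treating an alert as useful if it fires within `horizon` periods of an onset."""
    onsets = np.where((events[1:] == 1) & (events[:-1] == 0))[0] + 1
    useful = np.zeros(len(alerts), dtype=bool)
    lead_times, hits = [], 0
    for t in onsets:
        lo = max(0, t - horizon)
        useful[lo:t] = True  # alerts in the pre-onset window are not false alarms
        window = alerts[lo:t]
        if window.any():
            hits += 1
            lead_times.append(t - (lo + int(np.argmax(window))))
    years = len(alerts) / periods_per_year
    return {
        "false_alarms_per_year": float((alerts & ~useful).sum()) / years,
        "hit_rate": hits / len(onsets) if len(onsets) else float("nan"),
        "median_lead_time": float(np.median(lead_times)) if lead_times else float("nan"),
    }

# Toy usage with synthetic data (a noisy stress score and rare event labels).
rng = np.random.default_rng(0)
n, periods_per_year, horizon = 2500, 250, 20
scores = rng.normal(size=n) + 0.01 * rng.normal(size=n).cumsum()
events = (rng.random(n) < 0.01).astype(int)

for train_idx, test_idx in walk_forward_splits(n, 1000, 250, purge_gap=horizon):
    thr = threshold_for_alert_budget(scores[train_idx], periods_per_year, alerts_per_year=6)
    print(operational_metrics(scores[test_idx] >= thr, events[test_idx], horizon, periods_per_year))
```

In this sketch, the two choices that keep the evaluation leakage-safe under temporal dependence are calibrating the threshold on the training window alone and inserting the purge gap between training and test observations.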