A Controlled Comparison of Deep Learning Architectures for Multi-Horizon Financial Forecasting: Evidence from 918 Experiments
Multi-horizon price forecasting is central to portfolio allocation, risk management, and algorithmic trading, yet deep learning architectures have advanced faster than rigorous financial benchmarking can properly evaluate them. Existing comparisons are often limited by inconsistent hyperparameter budgets, single-seed evaluation, narrow asset coverage, and a lack of statistical validation. This study presents a controlled comparison of nine architectures—Autoformer, DLinear, iTransformer, LSTM, ModernTCN, N-HiTS, PatchTST, TimesNet, and TimeXer—spanning four model families (Transformer, MLP, CNN, and RNN), evaluated across three asset classes (cryptocurrency, forex, and equity indices) and two forecasting horizons (h ∈ {4, 24} hours), for a total of 918 experiments. All runs follow a five-stage protocol: fixed-seed Bayesian hyperparameter optimization, configuration freezing per asset class, multi-seed final training, uncertainty-aware metric aggregation, and statistical validation. ModernTCN achieves the best mean rank (1.333) with a 75% first-place rate across 24 evaluation settings, followed by PatchTST (2.000), and the global leaderboard reveals a clear three-tier performance structure. Variance decomposition shows architecture explains 99.90% of raw RMSE variance versus 0.01% for seed randomness, and rankings remain stable across horizons despite 2–2.5× error amplification. Directional accuracy is statistically indistinguishable from 50% across all 54 model–category–horizon combinations, indicating that MSE-trained architectures lack directional skill at hourly resolution. These findings suggest that large-kernel temporal convolutions and patch-based Transformers consistently outperform alternatives, architectural inductive bias matters more than raw capacity, three-seed replication is sufficient, and directional forecasting requires explicit loss-function redesign; all code, data, trained models, and evaluation outputs are released for independent replication.
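The variance decomposition described above can be illustrated with a minimal sketch. The decomposition below splits total RMSE variance into a between-architecture and a within-architecture (seed) component via sums of squares, in the style of a one-way ANOVA; the architecture names, RMSE values, and noise scale here are hypothetical placeholders, not the paper's measured results.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Hypothetical RMSE table: large spread between architectures,
# tiny perturbation across random seeds (mimicking the paper's finding).
architectures = ["ModernTCN", "PatchTST", "LSTM"]
seeds = [0, 1, 2]
base = {"ModernTCN": 0.10, "PatchTST": 0.12, "LSTM": 0.30}  # placeholder values
rmse = {(a, s): base[a] + rng.normal(0, 1e-3)
        for a, s in product(architectures, seeds)}

values = np.array(list(rmse.values()))
grand = values.mean()

# Between-architecture sum of squares: each architecture's seed-mean vs. grand mean.
ss_arch = sum(
    len(seeds) * (np.mean([rmse[(a, s)] for s in seeds]) - grand) ** 2
    for a in architectures
)
# Within-architecture (seed) sum of squares: each run vs. its architecture's mean.
ss_seed = sum(
    (rmse[(a, s)] - np.mean([rmse[(a, t)] for t in seeds])) ** 2
    for a in architectures for s in seeds
)
ss_total = ss_arch + ss_seed

print(f"architecture share: {ss_arch / ss_total:.4f}")
print(f"seed share:         {ss_seed / ss_total:.4f}")
```

When between-architecture differences dwarf seed noise, the architecture share approaches 1, which is why a small number of seeds suffices to rank architectures reliably.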