Goal Analytics: A Fully Built Data Pipeline, Six Models, and a Real Backtest

Here’s everything that’s actually in the system now, how it’s validated, and what’s next.

A couple of months ago I wrote about the idea: a three-module system for forecasting the World Cup, built in public. A few days ago I posted the first set of predictions from Module 1 — Argentina at 28.4%, Spain at 19.1%, the rest of the field splitting the remainder.

Since then I went back and finished Module 1 properly. Two new models, a real validation harness, an optimized data pipeline, a live FIFA ranking feed, and a more precise README. We’re a day into the tournament now, so this felt like the right moment to write down what’s actually in the system before results start piling up.

→ Dashboard: https://goal-analytics-wc2026.streamlit.app/

→ Code: https://github.com/nithinnarla/goal-analytics/

The Shape of the System

Goal Analytics runs on two layers:

  1. Match-level models. Given two teams, produce a win/draw/loss probability and a scoreline distribution.
  2. Tournament-level simulation. Run the match-level model across the real 104-match WC2026 bracket, 10,000 times, to get title and group-stage probabilities.

Layer 1 isn’t one model — it’s six approaches run side by side: Elo, independent-Poisson scorelines, logistic regression, random forest, XGBoost, and Monte Carlo on top of Elo+Poisson. None of them is “the” model. They’re compared, and now — for the first time — backtested.

The Data Layer

A few pieces feed the system, and one of them got a real cleanup pass this round:

  • martj42/international_results — roughly 25,000 matches since 2010, pulled live, used for Elo history, recent-form features, and training the random forest and XGBoost models.
  • data/teams.py — hand-calibrated pre-tournament Elo, FIFA rank, and group assignment for all 48 teams.
  • data/knockout_fixtures.py — the actual WC2026 bracket from Round of 32 to the Final, including the “best 8 third-placed teams” advancement rules.
  • data/fifa_rankings.py (new) — scrapes the live FIFA World Ranking for a side-by-side comparison against the model’s own ranking. Informational only for now — it doesn’t feed any model yet.

The unglamorous fix: team-name aliasing. The historical results dataset uses names like “United States,” “Korea Republic,” “Türkiye,” and “Curaçao” — the hand-calibrated team table used different spellings for some of these. Mismatches like that silently drop matches from training data without throwing an error, which is the worst kind of bug because everything still runs and just quietly trains on less data. That’s now reconciled with accent-stripping and an explicit alias table, and the pipeline runs noticeably faster as a side effect of cleaning up redundant lookups.
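A minimal sketch of this kind of reconciliation, using only the standard library. The alias entries and the canonical spellings on the right-hand side are assumptions for illustration; the repo's actual alias table may look different.

```python
import unicodedata

# Hypothetical alias table: keys are accent-stripped, lowercased spellings,
# values are the canonical names assumed to be used by the team table.
ALIASES = {
    "korea republic": "South Korea",
    "turkiye": "Turkey",
    "usa": "United States",
}


def strip_accents(name: str) -> str:
    """Decompose characters and drop combining marks (e.g. 'ç' -> 'c')."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def canonical_name(name: str) -> str:
    """Map a raw dataset spelling to one canonical team name."""
    cleaned = strip_accents(name).strip()
    return ALIASES.get(cleaned.lower(), cleaned)


def unmatched(names, known_teams):
    """Return names that still fail to resolve, so drops are loud, not silent."""
    return sorted({n for n in names if canonical_name(n) not in known_teams})
```

Running something like `unmatched(...)` over the historical dataset before training is one way to turn a silent drop into a visible list of names that still need an alias.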

Feature Engineering

Every ML model — logistic regression, random forest, XGBoost — trains on the same six features per fixture:

  • elo_diff
  • home_advantage
  • form_diff (recent points-per-game, home minus away)
  • scored_diff
  • conceded_diff
  • elo_sq_diff (captures non-linearity in lopsided matchups)

Same inputs across all three learners, so any difference in their predictions is about the algorithm, not what they’re allowed to see.
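To make the feature list concrete, here is a sketch of how one fixture could be turned into those six numbers. The `TeamSnapshot` fields and the exact definition of `elo_sq_diff` (a signed square of the Elo gap) are assumptions; the post only names the features.

```python
from dataclasses import dataclass


@dataclass
class TeamSnapshot:
    """Hypothetical per-team inputs as of the match date."""
    elo: float
    points_per_game: float      # recent form
    scored_per_game: float
    conceded_per_game: float


def fixture_features(home: TeamSnapshot, away: TeamSnapshot, neutral: bool) -> dict:
    """Build the six features for one fixture, always home minus away."""
    elo_diff = home.elo - away.elo
    return {
        "elo_diff": elo_diff,
        "home_advantage": 0 if neutral else 1,
        "form_diff": home.points_per_game - away.points_per_game,
        "scored_diff": home.scored_per_game - away.scored_per_game,
        "conceded_diff": home.conceded_per_game - away.conceded_per_game,
        # Assumption: signed square, so a lopsided gap is amplified
        # while the direction of the gap is preserved.
        "elo_sq_diff": elo_diff * abs(elo_diff),
    }
```

Keeping every feature as "home minus away" means swapping the two teams flips the sign of each difference, which is an easy property to check.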

The Six Models

Elo

Standard expected-score formula. Draw probability is adjusted toward a roughly 22% World Cup baseline as the Elo gap shrinks. Two separate home-advantage rules apply: +100 Elo for host nations (Mexico, USA, Canada) playing in their own cities, and +100 Elo generally for any team in a non-neutral match.
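As a rough illustration, the standard Elo expected score with a +100 home bonus can be written as below. The way the draw share shrinks with the gap is an assumption made for this sketch (a simple linear taper from 22% at level ratings to 0% at certainty); the post does not spell out the actual adjustment.

```python
HOME_BONUS = 100.0     # from the text: +100 Elo for the side with home advantage
DRAW_BASELINE = 0.22   # from the text: roughly 22% draws between equal teams


def expected_score(elo_a: float, elo_b: float, a_at_home: bool = False) -> float:
    """Standard Elo expected score for team A (win = 1, draw = 0.5)."""
    gap = elo_a - elo_b + (HOME_BONUS if a_at_home else 0.0)
    return 1.0 / (1.0 + 10 ** (-gap / 400.0))


def win_draw_loss(elo_a: float, elo_b: float, a_at_home: bool = False):
    """Split the expected score into win/draw/loss probabilities.

    Assumption: draw probability tapers linearly from DRAW_BASELINE
    (expected score 0.5) to zero (expected score 0 or 1).
    """
    e = expected_score(elo_a, elo_b, a_at_home)
    p_draw = DRAW_BASELINE * (1.0 - 2.0 * abs(e - 0.5))
    p_win = e - p_draw / 2.0
    p_loss = 1.0 - e - p_draw / 2.0
    return p_win, p_draw, p_loss
```

With this split, `p_win + p_draw / 2` still equals the Elo expected score, so the draw adjustment does not change who the model favours.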

Independent Poisson scorelines

Expected goals come from the Elo gap (lambda = average_goals × 10^(elo_diff / 800)), and each side’s score is an independent Poisson draw on its own lambda, so the two scores aren’t correlated with each other. The full Dixon-Coles (1997) model adds a dependence term that adjusts the four lowest-scoring results (0–0, 1–0, 0–1, 1–1); with the usual negative estimate of that parameter, 0–0 and 1–1 draws come out more likely than two independent draws would predict, and 1–0 and 0–1 slightly less likely. This implementation doesn’t include that term yet. More on that in “What’s Next” below.
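The lambda formula and the independence assumption can be sketched in a few lines. The value of `AVERAGE_GOALS` and the mirrored exponent for the away side are assumptions; the post gives the formula but not those details.

```python
import math

AVERAGE_GOALS = 1.35  # assumption: per-team goals per match, not from the post


def expected_goals(elo_diff: float, average_goals: float = AVERAGE_GOALS) -> float:
    """lambda = average_goals * 10^(elo_diff / 800), as in the text."""
    return average_goals * 10 ** (elo_diff / 800.0)


def poisson_pmf(k: int, lam: float) -> float:
    return math.exp(-lam) * lam ** k / math.factorial(k)


def scoreline_matrix(elo_diff: float, max_goals: int = 8):
    """matrix[h][a] = P(home scores h) * P(away scores a), independent draws."""
    lam_home = expected_goals(elo_diff)
    lam_away = expected_goals(-elo_diff)  # assumption: mirrored Elo gap
    return [
        [poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
         for a in range(max_goals + 1)]
        for h in range(max_goals + 1)
    ]


def outcome_probs(matrix):
    """Collapse the scoreline grid into (win, draw, loss) for the home side."""
    n = len(matrix)
    win = sum(matrix[h][a] for h in range(n) for a in range(n) if h > a)
    draw = sum(matrix[h][h] for h in range(n))
    loss = sum(matrix[h][a] for h in range(n) for a in range(n) if h < a)
    total = win + draw + loss  # renormalise the truncated tail
    return win / total, draw / total, loss / total
```

The same grid is what a scoreline heatmap would display, and the product form `P(h) * P(a)` is exactly the independence that a Dixon-Coles term would relax.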

Logistic Regression

A calibrated linear baseline. StandardScaler plus multinomial logistic regression, trained on roughly 900 World-Cup-only matches going back to 1930, validated with 5-fold cross-validation.
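A sketch of that setup in scikit-learn, run on synthetic data so it is self-contained. The feature matrix, labels, and the choice of log loss as the cross-validation metric are assumptions for illustration, not the repo's training code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for ~900 fixtures with six features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(900, 6))
# Synthetic labels (0 = loss, 1 = draw, 2 = win) driven by the first feature.
latent = X[:, 0] + rng.normal(scale=1.0, size=900)
y = np.digitize(latent, bins=[-0.5, 0.5])

# Scaling lives inside the pipeline so each CV fold fits its own scaler.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation; scores are negative log loss (closer to 0 is better).
cv_scores = cross_val_score(model, X, y, cv=5, scoring="neg_log_loss")

model.fit(X, y)
probs = model.predict_proba(X[:1])  # one row: P(loss), P(draw), P(win)
```

Putting the scaler inside the pipeline matters for validation: fitting it on the full dataset first would leak information from the held-out fold into training.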

Random Forest (new)

200 trees, max depth 10, trained on the full recent-era dataset — about 25,000 matches since 2010. Same six features as the logistic regression, but a much larger and more recent training set.

XGBoost (new)

Same training data and features as the random forest. 200 boosting rounds, max depth 4, learning rate 0.05.

Monte Carlo simulation

10,000 full-tournament runs for title probabilities, 5,000 for group-position breakdowns. Each simulation plays all 72 group-stage matches by sampling from the Elo-to-Poisson distribution, ranks groups by points then goal difference then goals for (FIFA’s head-to-head tiebreaker isn’t implemented — also on the future-work list), takes the best eight third-placed teams forward under the real combination rules, and walks the actual bracket to a champion. Drawn knockout matches go to a penalty-shootout model: 50/50 with a small adjustment proportional to the Elo gap.
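A stripped-down sketch of the group-stage part of that loop, standard library only. The constants `AVERAGE_GOALS` and `SHOOTOUT_SLOPE`, the clamp on the shootout probability, and the neutral-venue assumption are all placeholders chosen for illustration.

```python
import itertools
import math
import random

AVERAGE_GOALS = 1.35     # assumption: per-team goals per match
SHOOTOUT_SLOPE = 0.0002  # assumption: win-probability shift per Elo point


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Knuth's multiplication method; adequate for football-sized lambdas."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def simulate_group(elos: dict, rng: random.Random) -> list:
    """Play the round robin once; rank by points, goal difference, goals for."""
    table = {team: [0, 0, 0] for team in elos}
    for a, b in itertools.combinations(elos, 2):
        diff = elos[a] - elos[b]
        goals_a = sample_poisson(AVERAGE_GOALS * 10 ** (diff / 800), rng)
        goals_b = sample_poisson(AVERAGE_GOALS * 10 ** (-diff / 800), rng)
        for team, gf, ga in ((a, goals_a, goals_b), (b, goals_b, goals_a)):
            table[team][0] += 3 if gf > ga else 1 if gf == ga else 0
            table[team][1] += gf - ga
            table[team][2] += gf
    # Remaining ties fall back to insertion order in this sketch.
    return sorted(table, key=lambda t: tuple(table[t]), reverse=True)


def shootout_win_prob(elo_diff: float) -> float:
    """50/50 with a small Elo-proportional nudge, clamped to [0.4, 0.6]."""
    return min(0.6, max(0.4, 0.5 + SHOOTOUT_SLOPE * elo_diff))


def first_place_share(elos: dict, runs: int = 5000, seed: int = 0) -> dict:
    """Fraction of simulations in which each team tops the group."""
    rng = random.Random(seed)
    wins = dict.fromkeys(elos, 0)
    for _ in range(runs):
        wins[simulate_group(elos, rng)[0]] += 1
    return {team: count / runs for team, count in wins.items()}
```

A full tournament run would repeat this for all twelve groups, pick the best eight third-placed teams, and then walk the knockout bracket, calling something like `shootout_win_prob` whenever a knockout match ends level.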

Does Any of This Work? The Backtest

This is the part that didn’t exist three weeks ago, and it’s the part I think matters most.

All six approaches are now backtested against the actual 2018 and 2022 World Cups, point-in-time. For the 2018 backtest, every model — including the random forest and XGBoost — is retrained using only data available before Russia’s opening match against Saudi Arabia. Same idea for 2022, cut off before the Qatar-Ecuador opener.

This matters because the “live” models for 2026 are trained on the full historical dataset, which already includes 2018 and 2022. Scoring a model on tournaments it was trained on isn’t a backtest, it’s checking its homework against the answer key. Point-in-time retraining is the only way the comparison means anything.
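The point-in-time rule boils down to a strict date filter applied before any model is fit. A minimal sketch, assuming each match is a dict with a `date` field (the real pipeline's data structure may differ); the cutoff dates are the opening days of the two tournaments.

```python
from datetime import date

# Opening match dates: Russia v Saudi Arabia (2018), Qatar v Ecuador (2022).
CUTOFFS = {
    2018: date(2018, 6, 14),
    2022: date(2022, 11, 20),
}


def training_slice(matches, tournament_year):
    """Keep only matches played strictly before the tournament opener."""
    cutoff = CUTOFFS[tournament_year]
    return [m for m in matches if m["date"] < cutoff]


# Hypothetical rows in the assumed format, using real fixtures and dates.
matches = [
    {"date": date(2014, 7, 13), "home": "Germany", "away": "Argentina"},
    {"date": date(2018, 6, 14), "home": "Russia", "away": "Saudi Arabia"},
    {"date": date(2018, 7, 15), "home": "France", "away": "Croatia"},
    {"date": date(2022, 12, 18), "home": "Argentina", "away": "France"},
]

train_2018 = training_slice(matches, 2018)
train_2022 = training_slice(matches, 2022)
```

The comparison is strict (`<`, not `<=`) so that the opening match itself never lands in the training data for the tournament it belongs to.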

Two metrics per model: accuracy (did its top pick match the actual result?) and multi-class Brier score (how well-calibrated were its probabilities — 0 to 2, lower is better).
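Both metrics are short enough to write out. This sketch assumes each prediction is a list of three probabilities in a fixed order (for example win, draw, loss) and each outcome is the index of the result that actually happened.

```python
def brier_score(probs, outcome_index):
    """Multi-class Brier score for one match: sum of squared errors
    against the one-hot actual result. Ranges from 0 (perfect) to 2."""
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


def mean_brier(predictions, outcomes):
    """Average Brier score over a set of matches."""
    scores = [brier_score(p, o) for p, o in zip(predictions, outcomes)]
    return sum(scores) / len(scores)


def accuracy(predictions, outcomes):
    """Share of matches where the highest-probability pick was correct."""
    hits = sum(
        max(range(len(p)), key=p.__getitem__) == o
        for p, o in zip(predictions, outcomes)
    )
    return hits / len(outcomes)
```

A fully confident wrong pick scores 2, a fully confident right pick scores 0, and a uniform one-third guess scores 2/3 whatever happens, which is a useful reference point when reading the backtest numbers.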

I’m deliberately not putting a single aggregate number here. The dashboard’s new Model Backtest tab shows every 2018 and 2022 match with each model’s pick laid next to the real result, and that match-by-match view is more honest than a headline accuracy figure — it shows you where each model agrees with the others, where they split, and where all six missed the same upset. If you want to know whether XGBoost actually beats Elo, that’s the tab to open.

The Dashboard, Now Six Tabs

  • Win Probabilities — title odds for all 48 teams, plus the new FIFA-ranking comparison panel
  • Group Predictions — P(1st/2nd/3rd/4th) per group from 5,000 simulations
  • Match Predictor — pick two teams, get win/draw/loss, expected goals, and a scoreline heatmap
  • Live Tracker — log real results as the tournament plays out; Brier score updates as you go
  • Bracket — projected knockout bracket with a cross-check against the ML models
  • Model Backtest (new) — the 2018/2022 validation, match by match

What’s Next

Near-term, on the existing codebase:

  • The Dixon-Coles tau correction for the Poisson model. Right now the two scorelines are independent Poisson draws, which slightly underweights low-scoring draws like 0–0 and 1–1 relative to real football. Adding the tau term is the highest-value methodology upgrade on this list.
  • Real FIFA tiebreakers (head-to-head record, disciplinary points) in group and third-place ranking, replacing the current simplified ranking on points, then goal difference, then goals for.
  • Reconciling the two Elo systems — the hand-calibrated WC2026 ratings in data/teams.py and the ones derived from the full match-history pipeline currently live side by side without being unified.
  • Using FIFA ranking deltas as an actual model feature instead of just a display panel.
  • A larger backtest sample: 2018 and 2022 together give two tournaments’ worth of group-stage and knockout matches (64 each), which is enough to be informative but not enough to draw strong conclusions from.
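For the first item on that list, the Dixon-Coles tau term is a multiplier applied to the four lowest-scoring cells of the independent-Poisson grid. The sketch below follows the form given in the 1997 paper; the `rho` value used in the example is an illustrative assumption, since in practice it is estimated from data.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    return math.exp(-lam) * lam ** k / math.factorial(k)


def tau(home_goals: int, away_goals: int,
        lam_home: float, lam_away: float, rho: float) -> float:
    """Dixon-Coles adjustment factor; 1.0 outside the four low-score cells."""
    if home_goals == 0 and away_goals == 0:
        return 1.0 - lam_home * lam_away * rho
    if home_goals == 0 and away_goals == 1:
        return 1.0 + lam_home * rho
    if home_goals == 1 and away_goals == 0:
        return 1.0 + lam_away * rho
    if home_goals == 1 and away_goals == 1:
        return 1.0 - rho
    return 1.0


def dixon_coles_prob(home_goals: int, away_goals: int,
                     lam_home: float, lam_away: float, rho: float) -> float:
    """Independent Poisson probability rescaled by tau."""
    return (tau(home_goals, away_goals, lam_home, lam_away, rho)
            * poisson_pmf(home_goals, lam_home)
            * poisson_pmf(away_goals, lam_away))


# Illustrative values only: rho = -0.1 is an assumption, not a fitted estimate.
p00_independent = poisson_pmf(0, 1.4) * poisson_pmf(0, 1.1)
p00_adjusted = dixon_coles_prob(0, 0, 1.4, 1.1, rho=-0.1)
```

With a negative `rho`, the factor raises 0-0 and 1-1 and lowers 1-0 and 0-1, and the four adjustments cancel so the grid still sums to the same total. Setting `rho = 0` recovers the current independent model, which makes the change easy to switch on and off in a backtest.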

Further out — Module 2 and Module 3:

  • Module 2: Player2Vec. Squad-level embeddings built from player co-occurrence in lineups, so the model can represent “this squad” rather than just “this team’s Elo number” — useful for catching cases where a side is missing a key starter relative to its rating. Planned for after the tournament, once there’s lineup data to work with.
  • Module 3: an ILP-based fantasy squad optimizer, built on top of whatever Module 2 produces. Targeted for before Euro 2028, contingent on Module 2 actually being useful.

The Repo

github.com/nithinnarla/goal-analytics — MIT licensed, plain Python: numpy, pandas, scikit-learn, xgboost, streamlit. PRs welcome.

We’re one day into the tournament. The Live Tracker is logging results as they come in — next update will have something to say about how the model is actually doing.

Know Your Author

Nithin Narla is a Data Engineer.

He likes to build data pipelines, visualize data and create insightful stories. He is passionate about data visualization, machine learning, and building insightful data-driven solutions. He enjoys sharing his knowledge and learning experiences through writing on Medium. You can connect with him and follow his journey in the world of Data Science and AI.

Thank You!


Goal Analytics: A Fully Built Data Pipeline, Six Models, and a Real Backtest was originally published in Towards AI on Medium.
