[R] Snapchat’s Recommendation System Had a Scaling Problem. They Solved It with Graph Theory (and GiGL).

Storing a graph with 100 billion edges requires 800 GB of memory. Just for the 64-bit integer IDs. Before a single feature is loaded.
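
A quick sanity check of that number, assuming one 8-byte integer ID per edge:

```python
# 100 billion edges, one 64-bit (8-byte) ID each
num_edges = 100_000_000_000
bytes_per_id = 8
print(num_edges * bytes_per_id / 1e9, "GB")  # 800.0 GB
```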

That is the reality of industrial-scale Graph Neural Networks. And it is exactly why most GNN research never reaches production.

Snapchat built a framework called GiGL (Gigantic Graph Learning) that runs GNNs on graphs with 900 million nodes and 16.8 billion edges. End-to-end, in under 12 hours, every day.

The gap between research and production is not the model. It is the plumbing.

PyTorch Geometric (PyG) is the most popular GNN library in academia. It has excellent layer implementations, an active community, and clean APIs.

Modern PyG (2.0+) is no longer limited to single-machine training. It offers NeighborLoader and ClusterLoader for mini-batch training on subgraphs, FeatureStore and GraphStore abstractions for out-of-core data (e.g., via RocksDB or Kuzu), and distributed training support via PyTorch DDP. These are real capabilities. The ogbn-papers100M benchmark (111M nodes, 1.6B edges) has been trained using PyG with disk-backed remote backends.
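
For reference, mini-batch neighbour sampling in PyG looks roughly like this. A minimal sketch on a small benchmark graph; the fan-out values and batch size are arbitrary choices, not recommendations.

```python
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import NeighborLoader

dataset = Planetoid(root="/tmp/Cora", name="Cora")
data = dataset[0]

loader = NeighborLoader(
    data,
    num_neighbors=[15, 10],       # sample 15 first-hop and 10 second-hop neighbours
    batch_size=1024,
    input_nodes=data.train_mask,  # seed nodes for each mini-batch
)

for batch in loader:
    # each `batch` is a small sampled subgraph; the seed nodes come first
    print(batch.num_nodes, batch.num_edges)
    break
```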

The gap is not in modelling primitives. It is in everything around them.

Snapchat’s friend graph has 900 million nodes and 16.8 billion edges, with 249 node features and 19 edge features. Running GNNs at this scale daily requires orchestrated, distributed data preprocessing from relational databases, billion-scale subgraph sampling as a managed Spark job, globally consistent train/val/test splits, fault-tolerant multi-node training, parallel inference across hundreds of workers, and automated pipeline scheduling. PyG provides none of this infrastructure. Nor should it. That is not its job.

GiGL does not replace PyG. It wraps it. You define your GAT or GraphSAGE model in standard PyG syntax; GiGL handles everything else.
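
A sketch of what "standard PyG syntax" means here. The two-layer GAT below is plain PyG with placeholder dimensions; nothing about it is GiGL-specific.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class GAT(torch.nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int, heads: int = 4):
        super().__init__()
        self.conv1 = GATConv(in_dim, hidden_dim, heads=heads)
        self.conv2 = GATConv(hidden_dim * heads, out_dim, heads=1)

    def forward(self, x, edge_index):
        x = F.elu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)
```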

The core idea: treat subgraph sampling as a massive ETL job (Apache Spark on Scala), not a real-time graph traversal. Pre-compute every node’s k-hop neighbourhood and write it to cloud storage. Training then becomes standard data-parallel ML, with no shared graph state and no distributed graph engine in the loop.

Snapchat calls this approach “tabularization”. They claim that it reduced costs by 80% compared to their previous Apache Beam implementation.
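
A rough sketch of that join-based expansion, written in PySpark for readability. GiGL’s production sampler is Spark on Scala, and it also down-samples neighbours per hop, which this sketch omits; the column names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("khop-sketch").getOrCreate()

# Edge list with node IDs already enumerated to contiguous integers.
edges = spark.createDataFrame([(0, 1), (1, 2), (2, 3), (0, 2)], ["src", "dst"])

# 1-hop neighbours of every node.
hop1 = edges.select(F.col("src").alias("root"), F.col("dst").alias("hop1"))

# 2-hop neighbours via a self-join on the edge list.
hop2 = (
    hop1.join(edges, hop1.hop1 == edges.src)
        .select("root", "hop1", F.col("dst").alias("hop2"))
)

# Persist the pre-computed neighbourhoods; training later reads plain tables.
hop2.write.mode("overwrite").parquet("/tmp/two_hop_neighbourhoods")
```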

The GiGL architecture

GiGL is a pipeline, not a library. Six components execute sequentially, each with independent horizontal scaling:

  1. Config Populator: resolves template configs into frozen configs with deterministic asset URIs. This makes every downstream component idempotent and retryable.
  2. Data Preprocessor: TensorFlow Transform on Apache Beam (Cloud Dataflow). Reads raw relational data from BigQuery, enumerates node IDs to contiguous integers, and applies distributed feature transforms (normalisation, encoding, imputation). Outputs TFRecords.
  3. Subgraph Sampler: Apache Spark on Scala (Dataproc). Generates k-hop localised subgraphs for each node via repeated joins on edge lists. For link prediction, it also samples anchor, positive, and negative node subgraphs. Two backends: Pure-ETL for homogeneous graphs and NebulaGraph for heterogeneous graphs.
  4. Split Generator: Spark on Scala. Assigns samples to train/val/test with transductive, inductive, or custom strategies. It masks validation/test edges from training to prevent leakage.
  5. Trainer: PyTorch DDP on Vertex AI or Kubernetes. Collates subgraph samples into batch subgraphs (see the sketch after this list) and feeds them into user-defined PyG training loops. Supports early stopping, TensorBoard logging, and custom loss functions.
  6. Inferencer: Apache Beam on Cloud Dataflow. Embarrassingly parallel CPU inference across all nodes. Writes embeddings to BigQuery. Maps enumerated node IDs back to the original identifiers.
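
A hedged sketch of the collation step referenced in component 5, using PyG’s Batch primitive. The loader function, feature sizes, and the `root` field are stand-ins, not GiGL’s actual serialisation format.

```python
import torch
from torch_geometric.data import Data, Batch

def load_precomputed_subgraph(node_id: int) -> Data:
    """Stand-in for deserialising one pre-sampled k-hop subgraph from storage."""
    num_nodes = 20
    return Data(
        x=torch.randn(num_nodes, 249),                    # node features
        edge_index=torch.randint(0, num_nodes, (2, 60)),  # local edge list
        root=torch.tensor([0]),                           # anchor node's local index
    )

# One training step consumes many independent rooted subgraphs merged into one graph.
samples = [load_precomputed_subgraph(i) for i in range(512)]
batch = Batch.from_data_list(samples)
# batch.x stacks all node features; batch.batch maps every node back to its sample
```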

Orchestration runs on Kubeflow Pipelines or Vertex AI. The frozen config design lets you rerun the Trainer 50 times for hyperparameter tuning without rerunning the Subgraph Sampler. That saves hours of computation per iteration.
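
A hypothetical illustration of why that reuse is safe (this is the idea, not GiGL’s actual API): if asset URIs are derived deterministically from the config that produced them, rerunning with an unchanged sampler config resolves to the same cached outputs. The function and bucket below are invented for illustration.

```python
import hashlib
import json

def frozen_asset_uri(stage: str, config: dict, bucket: str = "gs://example-gigl-assets") -> str:
    """Deterministic output location: same config in, same URI out."""
    digest = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:12]
    return f"{bucket}/{stage}/{digest}/"

sampler_cfg = {"num_hops": 2, "fanout": [15, 10], "graph": "engagement"}
print(frozen_asset_uri("subgraph_sampler", sampler_cfg))
# Changing only Trainer hyperparameters leaves this URI untouched, so 50 tuning
# runs all read the same pre-computed subgraph samples.
```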

What Snapchat actually learned from its 35 production launches

The paper (see sources below) is transparent about what worked, what failed, and by how much. Three patterns stand out.

Pattern 1: Graph quality beats model complexity.

Snapchat’s first GNN used GraphSAGE on the friendship graph. Solid +10% lift in new friends made.

Then they switched the graph definition from “who is friends with whom” to “who recently interacted with whom” (the engagement graph). They used the same model but built a new graph. The result was an additional 8.9% improvement and a significant cost reduction because the engagement graph is sparser.

One feature normalisation step on the content recommendation graph improved MRR from 0.39 to 0.54. A 38% relative improvement from a single preprocessing decision.

The lesson: before you touch the model architecture, fix the graph and the features.

Pattern 2: Attention-based GNNs dominate on social graphs.

Snapchat systematically tested all PyG convolution layers available at the time. GAT consistently outperformed mean and sum aggregation. Their hypothesis: social networks follow scale-free degree distributions, so not all neighbours contribute equally, and attention learns to weight strong-engagement relationships over weak ones.

The upgrade from GraphSAGE to GAT delivered a +6.5% improvement in core friend recommendation metrics.

Pattern 3: How you query matters as much as what you embed.

Snapchat initially used each user’s own GNN embedding as the ANN query for friend retrieval. It is a standard approach.

Then they tried querying with the embeddings of a user’s existing friends instead. They call this “Stochastic EBR”. It broadened the candidate search space and captured richer social signals.

The result? +10.2% and +13.9% on core business metrics. It became the default retrieval scheme for friend recommendation at Snapchat.

No model change, no retraining. Just a different query strategy over the same embeddings.
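
One plausible reading of that query swap, sketched with brute-force cosine search standing in for the production ANN index. The embeddings, IDs, and the "pick one friend at random" rule are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, dim = 10_000, 64
emb = rng.normal(size=(num_users, dim))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # unit-normalise for cosine search

def top_k(query: np.ndarray, k: int = 10) -> np.ndarray:
    return np.argsort(-(emb @ query))[:k]

user_id = 42
friend_ids = np.array([7, 1001, 3456])              # the user's existing friends

baseline = top_k(emb[user_id])                      # standard: query with own embedding
stochastic = top_k(emb[rng.choice(friend_ids)])     # query with a sampled friend's embedding
```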

The recommendation system

Every recommendation system with relational data is a graph problem in disguise. Users, items, interactions, context. Nodes and edges.

Snapchat demonstrates this across three domains:

  1. Friend recommendation: user-user engagement graph. GNN embeddings feed the largest retrieval funnel via ANN search, and also serve as dense features in the ranking model.
  2. Content recommendation (Spotlight, Discover): user-video bipartite graph. Video-to-video co-engagement graph sparsified by Jaccard thresholding (see the sketch after this list). GNN embeddings power video-to-video and user-to-video EBR. Launch impact: +1.54% total time spent on Spotlight.
  3. Ads recommendation: product co-engagement graph with text/image embeddings and metadata as node features. With only 10% of the training data volume used by the control shallow-embedding model, GiGL’s 2-layer GAT achieved precision parity while improving recall by 27.6%.
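
The Jaccard sparsification mentioned in the content recommendation item, as a toy sketch; the threshold and co-engagement data are illustrative.

```python
from itertools import combinations

# Which users engaged with which video (toy data).
viewers = {
    "video_a": {1, 2, 3, 4},
    "video_b": {3, 4, 5},
    "video_c": {9, 10},
}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

THRESHOLD = 0.2  # keep only strongly co-engaged video pairs as edges
edges = [
    (u, v, jaccard(viewers[u], viewers[v]))
    for u, v in combinations(viewers, 2)
    if jaccard(viewers[u], viewers[v]) >= THRESHOLD
]
print(edges)  # [('video_a', 'video_b', 0.4)]
```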

The recurring pattern: GNN embeddings add the most value in the retrieval stage (embedding-based dense retrieval) and as auxiliary features in rankers. Topology information improves even precision-focused models that were not designed to use graph structure.

When GiGL makes sense and when it does not

GiGL and PyG operate at different abstraction layers. PyG is a modelling library, while GiGL is a production pipeline that uses PyG inside the Trainer.

Use GiGL when your graph has billions of edges, when you need daily batch inference, and when you are on GCP. The framework assumes Dataflow, Dataproc, Vertex AI, BigQuery, and GCS.

Use standalone PyG when you need fast iteration, full control over the training loop, or when PyG’s built-in scalability features (NeighborLoader, remote backends, distributed training) meet your infrastructure and scaling requirements. For graphs up to a few billion edges with the right hardware and out-of-core backends, standalone PyG can take you further than it could a few years ago.

Use AWS GraphStorm when you need SageMaker-native deployment, built-in BERT+GNN co-training for text-rich graphs, or zero-code CLI pipelines.

The uncomfortable truth about GNNs at scale

Most of the value Snapchat derived from GNNs came from decisions unrelated to novel architectures: better graph definitions, feature normalisation, loss function selection, and retrieval query strategies.

The framework’s job is to make those experiments fast and cheap at billion scale. GiGL does that by turning graph sampling into an ETL problem and training into standard data-parallel ML.

Snapchat completed 35+ production launches in two years across three business domains, with measurable lift in every metric.

Sources:

GiGL: Large-Scale Graph Neural Networks at Snapchat (Snap Inc., 2025)

submitted by /u/mmark92712