Why Recommendation Systems Are Structurally Different from Deep Learning [2/2]

Author(s): NP_123

Originally published on Towards AI.

How DLRM Trades Expressiveness for Structure at Scale

This article is Part 2 of a two-part series on the structural and engineering trade-offs behind modern recommendation models such as DLRM.

Part 1: https://medium.com/@np123greatest/why-recommendation-systems-are-structurally-different-from-deep-learning-1-2-62e9130acc6e
The reference paper: https://arxiv.org/pdf/1906.00091

📌 TL;DR

Once embeddings are in place, the real modeling question becomes how features should interact. This article explains why leaving all interactions to deep networks is often inefficient, and how Factorization Machines introduce structured second-order interactions. It then shows why DLRM succeeds by making deliberate engineering trade-offs between expressiveness, scalability, and controllability, rather than by being deeper or more complex.

4. The Core Idea of Factorization Machines: Making Second-Order Interactions Explicit and Controllable

This chapter answers a key question left open in the previous chapter: in the embedding space, how should relationships between features be modeled?

The answer given by Factorization Machines (FM) is: do not leave feature interactions entirely to a black-box network to "figure out", but instead explicitly and structurally model second-order relationships between features.

4.1 From Matrix Factorization to CTR: Why Are More General Interaction Forms Needed?

In classical recommendation systems, matrix factorization provides an extremely elegant answer. For the user–item rating prediction problem, the prediction takes the form

$$\hat{r}_{ui} = \langle \mathbf{v}_u, \mathbf{v}_i \rangle$$

The meaning of this formulation is very clear:

- Each user and each item is mapped to an embedding
- The predicted value comes from the inner product of the two embeddings
- The inner product characterizes the "matching degree" in the latent factor space

This works extremely well in binary user–item interaction scenarios.

The Problem in CTR Scenarios: More Than One Pair of Interactions

In CTR prediction, however, the problem becomes more complex. Features are no longer limited to user_id and item_id; they also include:

- Context (hour, device, location)
- Ad attributes (campaign, creative)
- Various statistical features

In other words, CTR is a multi-entity, multi-feature matching problem.

If we try to generalize matrix factorization to CTR, a natural thought arises: should we model all pairwise interactions between features? For example, by giving every feature pair its own weight:

$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{N} w_i x_i + \sum_{i=1}^{N} \sum_{j=i+1}^{N} w_{ij}\, x_i x_j$$

But this immediately leads to an engineering disaster:

- Let the number of features be N; the number of second-order interactions becomes O(N²)
- Each interaction pair requires an independent parameter

As a result, both parameter count and computation cost explode quadratically. This means the question is not whether second-order interactions are needed, but rather: how can second-order interactions be modeled without introducing quadratic parameter growth?

4.2 The Key Insight of Factorization Machines

Faced with the quadratic parameter and computation costs brought by second-order feature interactions in CTR scenarios, FM does not abandon interaction modeling. Instead, it redefines what "interaction" means.

The core insight of FM can be summarized in one sentence: explicitly model second-order interactions, while avoiding quadratic growth in parameters.

Under this idea, FM parameterizes second-order interactions as follows:

$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{N} w_i x_i + \sum_{i=1}^{N} \sum_{j=i+1}^{N} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j$$
where:

- $x_i$: the value of the i-th feature (one-hot / multi-hot / numerical)
- $\mathbf{v}_i$: the embedding corresponding to the i-th feature
- $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$: the inner product between embeddings, representing their matching degree in the latent factor space

From Formula to Structure: How Is FM Actually Computed?

Structurally, Factorization Machines can be decomposed into three clear stages, as shown in the figure; a minimal code sketch of the interaction stage appears at the end of this chapter.

Figure: Core computation structure of Factorization Machines. All sparse features are first mapped to embedding vectors, then explicit second-order interaction features are constructed via pairwise inner products. The interaction results are concatenated with dense features and fed into an MLP for final CTR prediction. Source: Image by the author.

1. Embedding Representation Stage

Each sparse feature is mapped to a low-dimensional vector (e.g., 64 dimensions) and no longer directly participates in high-dimensional computation.

2. Explicit Second-Order Interaction Stage (Interaction Layer)

All embedding vectors corresponding to sparse features are paired and combined via dot products. (In later models such as DLRM, dense features are usually first mapped into the same embedding space before participating in interactions.) Each feature pair produces a scalar interaction feature $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$. This layer introduces no new interaction parameters, yet provides the model with a clear second-order interaction structure.

3. Prediction Stage (Top MLP)

All second-order interaction features are concatenated with dense features and fed into an MLP, which models higher-order nonlinear combinations and outputs the click probability.

Why Is This a "Structured" Second-Order Interaction?

Compared with brute-force enumeration of all second-order parameters, FM has several advantages:

1. Parameter sharing: all interactions share the same embedding space, so the parameter count is O(N·d) rather than O(N²).
2. Controllable computation: the number of second-order interactions is N(N-1)/2, and each interaction involves only a single dot product.
3. Preservation of latent factor semantics: inner products naturally capture similarity / matching degree, consistent with the modeling assumptions of matrix factorization.

The essence of FM is not "adding more features", but rather using stronger structural constraints to compress and express relationships between features.

4.3 Summary: FM Introduces "Controllable Relational Structure" into the Embedding Space

Reviewing the derivation in this chapter, we can see that Factorization Machines are not introduced to increase complexity, but to add a long-missing piece of structural capability to embedding representations. After embeddings have solved "how to represent features", FM further answers: can relationships between features be explicitly modeled without introducing quadratic parameter and computation costs?

By re-parameterizing each feature-pair interaction as an inner product between the corresponding embedding vectors, FM achieves balance along several dimensions:

- Explicit modeling of second-order feature relationships
- Linearly controllable parameter scale
- Preservation of latent factor similarity semantics

From a modeling perspective, FM does not provide a "complete model", but rather a structured interaction template. It specifies which relationships should be computed and in what form, but it does not take responsibility for complex nonlinear aggregation.
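To make the three stages concrete, here is a minimal PyTorch sketch of the parameter-free interaction stage described above. It is an illustrative sketch rather than a reference implementation: the function name, batch size, feature count (26), embedding dimension (64), and the parameter-count figures at the end are all assumptions chosen for the example, not values from the paper.

```python
import torch

def pairwise_dot_interactions(emb: torch.Tensor) -> torch.Tensor:
    """Explicit second-order interaction layer in the FM style.

    emb: (batch, N, d), one d-dimensional embedding per feature.
    Returns (batch, N*(N-1)//2): one scalar <v_i, v_j> per feature pair.
    Note that this layer adds no trainable parameters of its own.
    """
    n = emb.size(1)
    # All pairwise dot products at once: (batch, N, N) Gram matrix
    gram = torch.bmm(emb, emb.transpose(1, 2))
    # Keep the strict upper triangle so each unordered pair appears once
    i, j = torch.triu_indices(n, n, offset=1)
    return gram[:, i, j]

# Toy check with assumed sizes: batch=32, N=26 features, d=64
emb = torch.randn(32, 26, 64)
print(pairwise_dot_interactions(emb).shape)  # torch.Size([32, 325]) = 26*25/2 pairs

# Parameter-count comparison from Section 4.2 (illustrative numbers):
N, d = 1_000_000, 64
print(N * d)             # FM-style shared embeddings: O(N*d) = 64,000,000
print(N * (N - 1) // 2)  # one independent weight per pair: O(N²) ≈ 5e11
```

The interaction stage only rearranges existing embeddings; all learnable capacity stays in the embedding tables and the top MLP, which is exactly the trade-off this chapter describes.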
At this point, we have obtained all the key components required for recommendation system modeling:

- Embeddings to represent discrete entities
- Explicit, structured modeling of feature relationships
- MLPs to handle higher-order nonlinear combinations

The next chapter will show how these components are systematically combined in DLRM to form an efficient and scalable industrial-grade recommendation […]
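As a bridge to that discussion, the sketch below shows one minimal way these three components could be wired together in a DLRM-style model. It is a simplified sketch under assumptions, not the implementation from the reference paper: the class name MiniDLRM, the embedding dimension, the MLP widths, and the example vocabulary sizes are all illustrative.

```python
import torch
import torch.nn as nn

class MiniDLRM(nn.Module):
    """Minimal sketch: embeddings + explicit pairwise interactions + top MLP."""

    def __init__(self, vocab_sizes, num_dense, d=16):
        super().__init__()
        # One embedding table per sparse (categorical) feature
        self.tables = nn.ModuleList([nn.Embedding(v, d) for v in vocab_sizes])
        # Bottom MLP maps dense features into the same d-dimensional space
        self.bottom = nn.Sequential(nn.Linear(num_dense, d), nn.ReLU())
        n = len(vocab_sizes) + 1          # sparse embeddings + one dense vector
        num_pairs = n * (n - 1) // 2      # one scalar <v_i, v_j> per pair
        # Top MLP consumes interactions concatenated with the dense representation
        self.top = nn.Sequential(nn.Linear(num_pairs + d, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, sparse_ids, dense):
        # sparse_ids: (batch, num_sparse) integer ids; dense: (batch, num_dense) floats
        embs = [tab(sparse_ids[:, k]) for k, tab in enumerate(self.tables)]
        z = self.bottom(dense)
        stacked = torch.stack(embs + [z], dim=1)            # (batch, n, d)
        gram = torch.bmm(stacked, stacked.transpose(1, 2))   # all pairwise dot products
        i, j = torch.triu_indices(stacked.size(1), stacked.size(1), offset=1)
        interactions = gram[:, i, j]                          # (batch, num_pairs)
        logit = self.top(torch.cat([interactions, z], dim=1))
        return torch.sigmoid(logit).squeeze(1)                # click probability
```

For example, `MiniDLRM(vocab_sizes=[1000, 500, 200], num_dense=13)` handles three categorical features and 13 dense features; the quadratic cost appears only in the N(N-1)/2 interaction scalars, not in the parameter count.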
