Crafting the Eyes for Thinking Machines: Rewiring the Retina - The Anatomy of ViTStruct

“There is no joy in a dinner where the soup, the main course, and the dessert are blended into a single, beige slurry. The richness of the experience lies in the separation — savoring the distinct texture of each component in its proper moment. Yet, this ‘slurry’ is exactly how standard Vision Transformers treat an image.”
In Part 1, I spent hours obsessively prepping my ingredients. I took Visual Genome — a dataset rich with distinct flavors like Objects, Relationships, and Coordinates — and I engineered a RAM-resident pipeline to deliver them fresh to the hardware.
If you haven’t read the previous article, read it now to get the full context of this series: https://medium.com/@anagha.srivasa/crafting-the-eyes-for-thinking-machines-the-white-box-vlm-e7d624bd14bd
But now I face a critical problem.
If I feed these carefully separated ingredients into a standard Vision Transformer (ViT), I am essentially throwing that gourmet meal into a blender. A standard ViT takes the “steak” (the object) and the “tablecloth” (the background), chops them into 576 square patches, and mixes them into a flat, indistinguishable soup of numbers.
The “Bag of Patches” Flaw
To understand why this happens, we have to look at how a ViT actually “sees.”
When you feed an image into a standard encoder (like CLIP or a vanilla ViT), it doesn’t see a “Dog.” It sees a grid. It cuts the image into 16×16-pixel patches, flattens them into a sequence, and processes them all together.
In this process, the concept of “entity” is lost. The pixel at the corner of a bounding box is treated exactly the same as a pixel in the blurry background. The self-attention mechanism swirls them all together, allowing the background to bleed into the object and the object to bleed into the noise.
This works fine for general classification (“Is there a dog in this image?”). But for reasoning (“Is the dog standing on the grass or in the grass?”), this blending is disastrous. The model has to statistically guess where the dog ends and the grass begins. It is no longer looking at the Object; it is looking at a statistical approximation of the object mixed with its surroundings.
The Solution: Unplugging the Blender
To build a ‘White Box,’ I need to stop the blending.

I need an architecture that respects the boundaries of reality. I need a model that processes the Object Stream when it needs to name a noun, and the Scene Stream when it needs to describe the weather. I need to physically split the vision pipeline so that the “steak” stays on the plate and the “wine” stays in the glass.
In this article, I will build the ViTStructEncoder. I am going to physically rewire the attention mechanism to produce four distinct streams of reality, forcing the model to savor each part of the image separately.
We have the ingredients. Now, let’s engineer the separation.
The Encoder — Physical Stream Splitting
The heart of the White Box is the ViTStructEncoder.
At first glance, it looks like a standard Vision Transformer. It takes an image, splits it into patches, and runs it through layers of attention.
But the difference lies in the output. A standard ViT returns a single tensor: [Batch, Sequence_Length, Dim]. It says, “Here is the image, good luck finding the dog.”
My encoder returns a Dictionary of Streams. It explicitly separates the visual information into four distinct channels before it ever reaches the language model.
1. The Inputs
The encoder takes two inputs:
- The Image: [B, 3, 384, 384] (The pixels)
- The Bounding Boxes: [B, N, 4] (The geometry from Visual Genome)
We start by running the image through a standard Transformer Encoder. This gives us the raw materials: a sequence of 576 patch embeddings (for a 384×384 image with patch size 16) plus one global CLS token.
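To make the shapes concrete, here is a minimal sketch of the encoder’s inputs and raw token sequence. The name vit_backbone is a stand-in I am assuming for whatever ViT trunk produces the tokens; it is not the actual module name:
# Illustrative shapes only; vit_backbone is an assumed stand-in for the ViT trunk.
import torch
images = torch.randn(2, 3, 384, 384)   # [B, 3, 384, 384] pixels
boxes = torch.rand(2, 10, 4) * 384     # [B, N, 4] Visual Genome boxes in pixel coordinates
tokens = vit_backbone(images)          # [B, 577, D]: 1 CLS token + 24*24 = 576 patch tokens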
2. The Innovation: Geometric Stream Splitting
Instead of returning this raw sequence, I perform Physical Stream Splitting. I use the bounding box coordinates to mathematically slice the feature map into meaningful entities.
Here is the logic for the four streams:
Stream A: The full Stream (Raw Patches)
This is the safety net. It is the raw sequence of all 576 patches. We keep this because “structure” isn’t everything; sometimes the model just needs to see the texture of the grass or the gradient of the sky.
- Shape: [B, 577, D] (CLS + Patches)
Stream B: The scene Stream (Context)
This is the standard CLS token. In a ViT, this token learns to summarize the “global vibe” of the image. It answers questions like “Is this indoors or outdoors?”
- Shape: [B, 1, D]
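In code, these first two streams are just slices of the token sequence. A minimal sketch, carrying over the tensor names assumed in the snippet above:
# Stream A and Stream B are direct views of the encoder tokens.
full_feats = tokens               # Stream A: [B, 577, D] (CLS + patches)
scene_feats = tokens[:, :1, :]    # Stream B: [B, 1, D] (the CLS token)
patch_feats = tokens[:, 1:, :]    # [B, 576, D], used below for object and background pooling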
Stream C: The objects Stream (Region Pooling)
This is the core innovation. How do we turn a bounding box [x1, y1, x2, y2] into a single vector?
I use a technique I call Geometric Masking.
- Grid Mapping: I precompute the (x, y) center of every patch in the 24×24 grid.
- Broadcasting: For every object, I create a binary mask: Is the center of Patch i inside Bounding Box j?
- Weighted Pooling: I take the average of all patch embeddings that fall inside the box.
This creates a dedicated vector for “The Dog” that is mathematically composed only of the pixels belonging to the dog.
# cx, cy: precomputed patch centers [1, 1, 576]; box corners x_min, x_max, y_min, y_max: [B, N_obj, 1]
# Calculate mask: which patches fall inside each bounding box?
in_box_mask = (
    (cx >= x_min) & (cx <= x_max) &
    (cy >= y_min) & (cy <= y_max)
).float()                                                        # [B, N_obj, 576]
# Normalize the mask so each object averages over its own in-box patches
weights = in_box_mask / in_box_mask.sum(dim=-1, keepdim=True).clamp(min=1e-6)
# Weighted pooling: average features of valid patches
obj_feats = torch.einsum("bon,bnd->bod", weights, patch_feats)   # [B, N_obj, D]
- Shape: [B, N_obj, D]
Stream D: The bg Stream (Background)
Standard models struggle to separate “Object” from “Background.” I force the separation.
I calculate a Background Mask which is simply the inverse of all object masks. If a patch belongs to any object, it is excluded. The remaining patches — the walls, the sky, the empty floor — are pooled into a single “Background” token.
- Shape: [B, 1, D]
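A minimal sketch of that pooling, reusing in_box_mask and patch_feats from the object stream above (the union-and-clamp details are my assumptions):
# Pool every patch that no object mask claims into a single background token.
any_object = in_box_mask.amax(dim=1)        # [B, 576]: 1 where any box covers the patch
bg_mask = 1.0 - any_object                  # [B, 576]: 1 only on background patches
bg_weights = bg_mask / bg_mask.sum(dim=-1, keepdim=True).clamp(min=1e-6)
bg_feats = torch.einsum("bn,bnd->bd", bg_weights, patch_feats).unsqueeze(1)   # [B, 1, D]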
3. The Result: A Dictionary of Reality
Instead of a messy tensor, the output of my encoder is a structured dictionary. This is the data structure that will flow through the rest of the network:
visual_memory = {
    "full": full_feats,       # The raw pixels (Texture)
    "scene": scene_feats,     # The global context (Vibe)
    "objects": obj_feats,     # The entities (Nouns)
    "bg": bg_feats            # The surroundings (Setting)
}
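For orientation, a quick shape check on one batch (the numbers below assume a batch of 2, ten boxes per image, and D = 768; they are illustrative, not the real config):
# Illustrative shape check
for key, feats in visual_memory.items():
    print(key, tuple(feats.shape))
# full    (2, 577, 768)
# scene   (2, 1, 768)
# objects (2, 10, 768)
# bg      (2, 1, 768)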
By explicitly separating these signals, I have removed the ambiguity. The model no longer has to guess if a vector represents the dog or the grass. I have handed it two separate vectors and said, “This is the Dog. That is the Grass.”
Now, the question is: How do we teach the Language Model to listen to the right stream at the right time? That requires a new kind of attention.
The Attention Mechanism — Structured Cross-Attention
We have successfully split the image into a dictionary of streams: objects, bg, scene, and full. Now, we need a way for the Language Model to consume them.
If we simply concatenated these vectors back together (cat([objects, bg, scene])), we would undo all our hard work. The self-attention layers would just blend them into a soup again.
To prevent this, I built a custom module called StructuredCrossAttention. Instead of one massive attention pass, it performs Parallel Attention followed by Dynamic Gating.
1. Dictionary-Keyed Projections (Parallel Processing)
In a standard Cross-Attention layer, there is one set of projection matrices (W_Q, W_K, W_V) that processes the entire image.
In my architecture, I use Dictionary-Keyed Projections. I define a specific set of weights for each stream type. The model learns to process “Object” vectors differently than it processes “Background” vectors.
The Code:
# Dictionary-keyed projections
self.k_proj = nn.ModuleDict({
    k: nn.Linear(embed_dim, embed_dim) for k in source_keys
})
self.v_proj = nn.ModuleDict({
    k: nn.Linear(embed_dim, embed_dim) for k in source_keys
})
This effectively runs three separate cross-attention operations in parallel:
- Query (Text) ↔ Key/Value (Objects)
- Query (Text) ↔ Key/Value (Background)
- Query (Text) ↔ Key/Value (Scene)
This ensures mathematical isolation. The gradient updates for “Dog” don’t pollute the weights for “Grass.”
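A minimal sketch of those parallel passes inside the module’s forward; the shared q_proj and the use of F.scaled_dot_product_attention are my assumptions about the surrounding code, not the exact implementation:
# One attention pass per stream; results stay separate for the gating step below.
import torch.nn.functional as F  # normally imported at the top of the file
outputs = []
q = self.q_proj(text_hidden)                  # [B, T, D] shared text query (assumed)
for key in ["objects", "bg", "scene"]:
    k = self.k_proj[key](visual_memory[key])  # stream-specific key projection
    v = self.v_proj[key](visual_memory[key])  # stream-specific value projection
    outputs.append(F.scaled_dot_product_attention(q, k, v))   # [B, T, D]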
2. The “Softmax” Gating Logic (The Conductor)
Once we have the outputs from these three streams, we have to combine them into a single vector for the next text token. We can’t just average them (that’s the blender again). We need to choose the relevant stream.
I implemented a learnable gating mechanism using a parameter called self.source_gates.
# Dynamic Gating
# `outputs` holds the per-stream attention results; `valid_indices` selects the
# gate entries for the streams actually present in this batch.
# 1. Stack outputs from all streams [Num_Streams, Batch, Seq, Dim]
stacked = torch.stack(outputs, dim=0)
# 2. Compute softmax gates from learnable parameters
gates = F.softmax(self.source_gates[valid_indices], dim=0).view(-1, 1, 1, 1)
# 3. Weighted Sum
combined = (gates * stacked).sum(dim=0)
The Intuition:
This acts like a conductor in an orchestra. The source_gates parameter learns a static bias, but because it interacts with the dynamic attention scores, the model can shift focus:
- If the model is trying to generate the word “holding,” the attention score for the Object Stream will be high, and the gate will prioritize that signal.
- If the model is generating “indoors,” the Scene Stream takes over.
This completes the heart of the “White Box”:
- Encoder: Physically separates the data.
- Projections: Processes the data in isolation.
- Gating: Recombines the data only at the final moment, based on relevance.
We have built the eyes (Encoder) and the optic nerve (Attention). All that remains is to verify whether the brain actually uses them. Next, we integrate this into the CustomDecoderLayer and define our measure of success.
The Integration — The Dual-Path Decoder
We have built a powerful engine (the ViTStructEncoder) and a sophisticated transmission (the StructuredCrossAttention). Now, we have to install them into the chassis: the Transformer Decoder.
But here, I faced a dilemma.
If I completely replace the standard attention mechanism with my new “Structured” one, I am taking a massive gamble. What if the model needs to see a raw, unstructured patch to understand a texture? What if “The Dog” vector I created is perfect for the noun “dog,” but terrible for the adjective “furry”?
To solve this, I designed the CustomDecoderLayer (Block 4) as a Dual-Path System.
1. The Two Paths
Every single layer of my decoder runs two parallel cross-attention operations (a minimal sketch follows this list):
- Path A: The Safety Net (full_cross_attn). This is standard, vanilla Multi-Head Attention. It looks at the full stream (the raw sequence of 576 patches). It ensures that no matter how fancy my structured architecture gets, the model never loses access to the raw pixel reality.
- Input: visual_memory['full']
- Path B: The Specialist (struct_cross_attn). This is our innovation. It looks at the objects, bg, and scene streams, applying the dictionary-keyed projections and softmax gating we designed. It provides the reasoning logic.
- Input: visual_memory['objects'], visual_memory['bg'], visual_memory['scene']
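Here is the sketch of how the two paths might sit side by side in the layer’s forward pass. I am assuming full_cross_attn is an nn.MultiheadAttention with batch_first=True and that struct_cross_attn takes the text states plus a dictionary of streams; both are assumptions, not the exact implementation:
# Both paths attend to the visual memory; the sigmoid gate below blends their outputs.
full_out, _ = self.full_cross_attn(
    query=text_hidden,                 # [B, T, D] decoder hidden states
    key=visual_memory["full"],         # [B, 577, D] raw patch tokens
    value=visual_memory["full"],
)
struct_out = self.struct_cross_attn(
    text_hidden,
    {k: visual_memory[k] for k in ("objects", "bg", "scene")},
)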
2. The Global Sigmoid Gate (The Lie Detector)
How do we combine them? Do we just add them up? Do we average them?
No. We let the model decide.
I introduced a single learnable scalar parameter: self.gate_param.
# 1. Compute the gate (0.0 to 1.0)
gate = torch.sigmoid(self.gate_param)
# 2. Blend the paths
# If gate is high, trust Structure. If low, trust Raw Pixels.
cross_out = gate * struct_out + (1.0 - gate) * full_out
This parameter is more than just a weight; it is my Scientific Proof.
- If, after training, gate converges to 0.0, it means my “White Box” hypothesis was wrong. The model found the structured streams useless and preferred the raw patches.
- If gate grows to 0.5 or higher, it means the model is actively choosing to use the separated object/scene representations to generate text.
It is a built-in lie detector for my architecture.
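Reading that detector after training takes only a few lines. A sketch, assuming the decoder exposes its layers as model.decoder.layers and each layer stores its scalar as gate_param (both names are assumptions):
# Inspect how much each decoder layer trusts the structured path.
for i, layer in enumerate(model.decoder.layers):
    gate = torch.sigmoid(layer.gate_param).item()   # 0.0 = raw pixels, 1.0 = structure
    print(f"layer {i:02d}: structured-path weight = {gate:.3f}")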
The Ghost in the Machine
We have done it. We have successfully engineered a White Box foundation.
- The Encoder physically splits the image into distinct streams (Objects, Background, Scene), preventing the “blender” effect of standard ViTs.
- The Attention Mechanism processes these streams in parallel with specialized weights.
- The Decoder dynamically gates between raw pixels and structured entities, allowing us to verify if the model is actually “thinking” in structures.
The Cliffhanger
We have the body. We have the brain. But we are missing the discipline.
Here is the dirty secret of custom architectures: They are fragile. Because we are splitting the image into vectors that we define (Object vs. Scene), there is nothing stopping the model from cheating. Without careful constraints, the “Object” vector and the “Scene” vector might collapse into identical representations. The model might just copy the same data into both streams to minimize loss, effectively destroying our “White Box.”
We need a way to force the model to keep these streams mathematically distinct. We need to force Orthogonality.
In the next article, “The Stability Battle,” I will show you how I engineered the VLMCaptionLoss — a custom loss function that penalizes the model whenever its internal thoughts get too similar.
See you in my next article!