Using Embeddings for Identity Resolution in a Composable CDP
A warehouse-native architecture combining vector embeddings and LLM arbitration for high-fidelity customer profiles

In the modern retail landscape, the “Golden Record” is often a hollow promise. Traditional Customer Data Platforms (CDPs) rely on deterministic matching — rigid SQL rules like WHERE a.email = b.email. But what happens when the email is missing, the name is misspelled, or the user interacts across fragmented silos with nothing but a behavioral footprint?
The result is “Zombie Profiles”: fragmented identities that drain marketing budgets and hallucinate customer lifetime value (CLV).
To solve this, we must move beyond fuzzy matching and into the Identity Nexus: a semantic approach to profile unification using Vector Embeddings and Agentic LLM Reasoning.
In an era where AI systems drive personalization and automation, identity ambiguity propagates directly into model bias and activation errors.
1. The Death of Deterministic Rules
Deterministic matching is binary. It’s either a 1 (match) or a 0 (new profile). In a high-velocity retail environment, this rigidity creates massive data debt.
Consider these three interactions:
- Jon Doe buys a “Luxury Diver Watch” in a Taipei boutique.
- John Doe searches for “Rolex Submariner Comparison” on the web store.
- J. Doe asks a WhatsApp bot about “Service centers for automatic movements.”
A traditional SQL join fails here. Even Levenshtein distance (fuzzy string matching) struggles with the contextual gap between a “Boutique Purchase” and a “Bot Inquiry.” We don’t just need to match names; we need to match intent.
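To make the gap concrete, here is a minimal sketch (standard library only, using difflib as a stand-in for fuzzy string matching) showing why string similarity alone is not enough: the names align almost perfectly, yet the two interaction contexts — which describe the same watch enthusiast — share almost nothing at the character level.

from difflib import SequenceMatcher

# Names: near-identical strings, high similarity (~0.93)
print(SequenceMatcher(None, "Jon Doe", "John Doe").ratio())

# Contexts: same customer, same product universe, but string
# distance sees almost nothing in common
print(SequenceMatcher(None,
                      "Luxury Diver Watch boutique purchase",
                      "Service centers for automatic movements").ratio())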
2. Technical Core: From Strings to Semantic Vectors
The breakthrough lies in representing a customer not as a row of strings, but as a Semantic Anchor in a multi-dimensional vector space.
The Multi-Modal Embedding Loop
We use a lightweight, sovereignty-focused model to encode both identity attributes and behavioral context into a 384-dimensional vector.
from sentence_transformers import SentenceTransformer
import numpy as np

# Mercer's Choice: local, fast, and high-fidelity
# (all-MiniLM-L6-v2 is one lightweight local model that yields 384-dim vectors)
model = SentenceTransformer('all-MiniLM-L6-v2')

def create_semantic_anchor(profile):
    """
    Combines PII with behavioral intent to create a unique identity vector.
    """
    identity_context = (
        f"Name: {profile.get('name', 'Unknown')} | "
        f"Recent Intent: {profile.get('intent', 'N/A')} | "
        f"Geo-Context: {profile.get('location', 'Global')}"
    )
    return model.encode([identity_context])[0]
By converting identities into vectors, we can calculate the Cosine Similarity between profiles. These vectors can be stored directly in warehouse-native vector columns (such as pgvector, Snowflake Cortex, or BigQuery vector search) to maintain the Composable CDP’s “storage-near-logic” advantage.
A score of 1.0 is a perfect semantic match; 0.0 is total noise.
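As a minimal sketch, reusing the create_semantic_anchor function above, the similarity computation is a few lines of NumPy (the explicit normalization makes it correct whether or not the encoder unit-normalizes its output):

import numpy as np

def cosine_similarity(vec_a, vec_b):
    """1.0 = identical direction (perfect semantic match), 0.0 = orthogonal (total noise)."""
    return float(np.dot(vec_a, vec_b) /
                 (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

anchor_a = create_semantic_anchor(
    {"name": "Jon Doe", "intent": "Luxury Diver Watch purchase", "location": "Taipei"})
anchor_b = create_semantic_anchor(
    {"name": "John Doe", "intent": "Rolex Submariner comparison", "location": "Taipei"})
score = cosine_similarity(anchor_a, anchor_b)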
3. The Agentic Arbitrator: Solving the “Gray Zone”
The real world is messy. Most matches fall into the “Gray Zone” (Similarity between 0.70 and 0.85). Pure threshold tuning fails here because cosine similarity captures semantic proximity, but not behavioral plausibility. Two profiles may be linguistically similar yet geographically incompatible, or temporally contradictory. This is where reasoning — not just mathematical distance — becomes essential.
In the Identity Nexus, we deploy an Agentic Arbitrator — an LLM-driven forensic auditor that reasons through the evidence.
Implementation: The Reasoning Loop
When a similarity score hits the Gray Zone, the system triggers a “Reasoning Request” to an LLM (e.g., Gemini 1.5 Pro or GPT-4o).
def agentic_arbitration(profile_a, profile_b, similarity_score):
    if 0.70 <= similarity_score < 0.85:
        prompt = f"""
        As a Forensic Data Auditor, analyze these two profiles:
        Profile A: {profile_a}
        Profile B: {profile_b}
        Similarity Score: {similarity_score}
        Is it behaviorally plausible that these are the same individual?
        Consider the geographic proximity and the niche alignment of their interests.
        Provide a confidence score and a 'Match' or 'No Match' decision.
        """
        # Call the LLM for Gray Zone reasoning
        return call_llm_reasoning(prompt)
    return "Auto-Match" if similarity_score >= 0.85 else "Distinct"
This turns a cold mathematical score into a contextual decision, drastically reducing the load on manual data-cleaning teams.
Importantly, the LLM does not mutate records directly — it produces a governed recommendation layer subject to explicit stitching policies.
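One way to make that recommendation layer concrete — the names here are illustrative, not a prescribed schema — is a typed record plus an explicit policy gate:

from dataclasses import dataclass

@dataclass
class StitchRecommendation:
    profile_a_id: str
    profile_b_id: str
    similarity: float
    llm_confidence: float
    decision: str       # 'Match' or 'No Match'
    rationale: str      # the LLM's reasoning trace

def apply_stitch_policy(rec, min_confidence=0.9):
    """Governed gate: auto-stitch only high-confidence matches;
    everything else is queued for human review."""
    return rec.decision == "Match" and rec.llm_confidence >= min_confidence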
Operational Considerations for Production
Building an AI-driven resolver requires balancing accuracy with overhead:
- Latency Management: Trigger LLM reasoning only for the Gray Zone to optimize costs and response times.
- Vector Caching: Store frequently accessed identity anchors in-memory to accelerate high-velocity retail lookups.
- Audit Logging: Maintain a trace of the LLM’s reasoning for every “Stitched” decision to ensure GDPR/CCPA compliance (see the sketch after this list).
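For the audit-logging point, a minimal sketch is an append-only JSONL trail, reusing the StitchRecommendation record from the earlier sketch (field names are illustrative):

import json
from datetime import datetime, timezone

def log_stitch_decision(rec, reasoning_text, log_path="stitch_audit.jsonl"):
    """Append an immutable trace of each stitching decision for compliance review."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "profile_a": rec.profile_a_id,
        "profile_b": rec.profile_b_id,
        "similarity": rec.similarity,
        "decision": rec.decision,
        "llm_reasoning": reasoning_text,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")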
Evaluation & Threshold Calibration
A production-grade resolver should calibrate similarity thresholds using labeled match / non-match pairs to balance precision and recall. Raising the auto-match threshold improves identity purity but increases Gray Zone arbitration volume. By tracking false-stitch rates and arbitration frequency, the Identity Nexus becomes a measurable AI subsystem rather than a heuristic rule engine.
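As a hedged sketch of that calibration loop — assuming scikit-learn and a small hand-labeled pair set; the numbers below are illustrative only — one can sweep thresholds and pick the lowest one meeting a target precision:

import numpy as np
from sklearn.metrics import precision_recall_curve

similarities = np.array([0.92, 0.88, 0.81, 0.76, 0.73, 0.64])  # scored pairs
labels       = np.array([1,    1,    1,    0,    1,    0])     # human-verified truth

precision, recall, thresholds = precision_recall_curve(labels, similarities)

target_precision = 0.98  # tolerance for false stitches
viable = [t for p, t in zip(precision[:-1], thresholds) if p >= target_precision]
auto_match_threshold = min(viable) if viable else 0.85  # fall back to the default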
4. Why Composable CDP?
This AI-driven resolution shouldn’t happen in a “Black Box” SaaS CDP. It belongs in your Composable CDP — your Snowflake, BigQuery, or PostgreSQL warehouse.
By decoupling the AI Brain (the resolver) from the Data Body (the warehouse), you maintain:
- Data Sovereignty: Your customer vectors never leave your controlled environment.
- Auditability: You can trace exactly why the Agent decided to stitch two profiles together.
- Velocity: You can re-run your “Stitching Logic” as models improve without migrating data.
5. Conclusion: The Future of the Golden Record
The “Golden Record” is no longer a static row in a database; it is a dynamic, evolving Identity Nexus. By leveraging embeddings and agentic reasoning, we turn data noise into high-fidelity architectural assets.
The objective is not personalization theater, but identity precision at scale.