Graph Clustering for Entity Resolution: Why Union-Find Breaks at Web Scale

digitado ⋅ 12 de June de 2026

Introduction:

When engineers first encounter entity resolution, the problem looks deceptively simple:

compare records,
score likely matches,
connect the ones above a threshold,
cluster the graph.

That mental model works for a whiteboard interview.

It does not survive production.

Because once you move from a few thousand clean records to hundreds of millions of noisy ones, entity resolution stops being a neat graph exercise and becomes something much messier:

a probabilistic system,
a data quality system,
a correction workflow,
and, eventually, a trust problem.

At that point, almost everyone reaches for the same algorithm:

Union-Find.

And that is usually where the trouble starts.

Union-Find is elegant. It is fast. It is one of the best data structures ever taught in an algorithms class.

It is also a bad fit for large-scale entity resolution.

Not because it is slow. n Because it is too dumb.

It merges aggressively, remembers forever, and knows nothing about the real-world constraints that make entity resolution hard: conflicting attributes, ambiguous evidence, bad upstream data, human corrections, and the fact that one wrong merge can poison every downstream product built on top of your canonical entity graph.

This article is about where Union-Find breaks, why it breaks, and what to use instead.

I will walk through the failure modes, then show a production-friendly alternative: weighted graph clustering with safeguards, abstention, and incremental correction support.

If you are building entity resolution for CRM deduplication, company intelligence, identity graphs, or enrichment pipelines, this is the difference between a demo that looks smart and a system that survives contact with reality.

Entity Resolution Is Not a Matching Problem. It Is a Clustering Problem.

At small scale, people talk about entity resolution as if it were mostly about pairwise matching.

It is not.

Pairwise matching is only the front door. The real problem begins after the scores are computed.

Once you have candidate matches, you are no longer asking:

Do these two records look similar?

You are asking:

What graph structure do these similarities imply, and how much damage will I do if I trust them too much?

That is a very different question.

At scale, entity resolution is a graph problem:

Nodes are raw records n A vendor row, a crawled company page, a manually curated profile, a CRM object
Edges are match hypotheses n Usually generated by an ML model, rules engine, or hybrid scorer
Clusters are canonical entities n The thing your downstream systems will treat as “truth”

That last step is where most of the pain lives.

Because downstream systems do not consume pairwise scores. They consume clusters.

And clusters are where small local mistakes become global ones.

Here is a minimal Python setup:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    id: str
    name: str
    domain: Optional[str]
    country: Optional[str]
    employee_count: Optional[int]

@dataclass
class MatchEdge:
    record_a: str
    record_b: str
    confidence: float

A model might emit edges like this:

edges = [
    MatchEdge("r1", "r2", 0.97),
    MatchEdge("r2", "r3", 0.91),
    MatchEdge("r3", "r4", 0.54),
    MatchEdge("r4", "r5", 0.88),
]

Now comes the deceptively dangerous question:

How do you turn those edges into entities?

Most engineers answer: “Easy. Union-Find.”

That answer is understandable. It is also where production trouble begins.

Why Union-Find Feels Right

Union-Find is one of those algorithms that feels almost unfairly good.

It is compact. n It is fast. n It gives you connected components with near-constant-time unions and finds.

And if your problem is truly:

Given a set of edges, compute connected components efficiently,

then Union-Find is exactly the right tool.

Here is the standard implementation:

class UnionFind:
    def __init__(self, nodes: list[str]):
        self.parent = {n: n for n in nodes}
        self.rank = {n: 0 for n in nodes}

    def find(self, x: str) -> str:
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, x: str, y: str):
        px, py = self.find(x), self.find(y)
        if px == py:
            return
        if self.rank[px] < self.rank[py]:
            px, py = py, px
        self.parent[py] = px
        if self.rank[px] == self.rank[py]:
            self.rank[px] += 1

def cluster_with_union_find(records: list[Record],
                            edges: list[MatchEdge],
                            threshold: float = 0.8) -> dict[str, str]:
    uf = UnionFind([r.id for r in records])
    for edge in edges:
        if edge.confidence >= threshold:
            uf.union(edge.record_a, edge.record_b)
    return {r.id: uf.find(r.id) for r in records}

For toy datasets, this looks great.

That is exactly why so many teams start here.

The problem is not that Union-Find fails immediately. n The problem is that it fails quietly.

At first, it seems to work. Then your graph gets noisier. Then your sources get messier. Then corrections start arriving. Then one day you notice a cluster with 4,000 records and realize your “simple clustering layer” has become the most dangerous part of the pipeline.

Let’s walk through the failure modes.

Failure Mode #1: Transitivity Turns Weak Evidence Into Strong Damage

This is the classic trap.

Suppose you have three records with similar names across different countries:

edges_transitive = [
    MatchEdge("ndc_us", "ndc_uk", 0.89),
    MatchEdge("ndc_uk", "ndc_sg", 0.85),
    MatchEdge("ndc_us", "ndc_sg", 0.82),
]

Union-Find sees three edges above threshold and happily merges all three into one cluster.

But what if these are not one company?

What if they are separate legal entities that happen to share a naming pattern?

That is not an edge-scoring problem anymore. That is a transitivity problem.

Union-Find assumes that if:

A should merge with B
and B should merge with C

then A, B, and C belong together.

In entity resolution, that assumption is often too strong.

Because similarity is not the same thing as identity.

And identity is not always transitive under noisy evidence.

This is how bad clusters form in the real world:

common names
subsidiaries
regional branches
franchise structures
stale vendor records
partial crawls
ambiguous abbreviations

One weak bridge is enough to connect two neighborhoods of the graph that should never have touched.

Once that happens, the cluster starts absorbing more edges. Then more. Then more.

Congratulations: you now have a mega-cluster.

And mega-clusters are not just wrong. They are contagious. They corrupt every downstream system that trusts your canonical entity layer.

Failure Mode #2: A Single Threshold Pretends All Evidence Means the Same Thing

This is another seductive simplification.

Most first-pass systems do something like this:

if edge.confidence >= 0.8:
    merge()

That looks clean. It is also naive.

Because a score of 0.85 does not mean the same thing across different evidence types.

A 0.85 based on:

same domain
same address
same phone
similar name

is not remotely equivalent to a 0.85 based on:

name overlap only

Yet a global threshold treats them as interchangeable.

They are not.

In production, you almost always need signal-aware thresholds.

from dataclasses import dataclass

@dataclass
class MatchEdge:
    record_a: str
    record_b: str
    confidence: float
    signal_types: list[str]

THRESHOLDS = {
    frozenset(["name", "domain", "address"]): 0.75,
    frozenset(["name", "domain"]): 0.82,
    frozenset(["name"]): 0.93,
}

This is not a minor tuning detail. It is a structural requirement.

Because entity resolution is not just about score magnitude. It is about what kind of evidence produced the score.

If your clustering layer ignores that, it will over-trust weak evidence and under-utilize strong evidence.

Either way, quality suffers.

Failure Mode #3: Union-Find Has No Memory of Why a Merge Happened

This is where operational reality catches up with algorithmic elegance.

In production, merges get challenged.

A researcher flags a bad merge. n A vendor feed corrects a domain. n A new crawl reveals a mismatch. n A customer escalation exposes a false positive.

Now what?

With Union-Find, once two nodes are merged, that decision is effectively permanent.

uf.union("company_a", "company_b")

There is no native split operation.

No rollback. n No local undo. n No “remove this edge and recompute only what changed.”

Your options are ugly:

tolerate the bad merge,
bolt on ad hoc exceptions,
or rebuild clustering from scratch.

At small scale, that is annoying.

At large scale, it is operational debt with interest.

A production entity graph must support corrections as a first-class behavior, not as an afterthought.

If your clustering algorithm cannot gracefully absorb reversals, it is not production-ready.

Failure Mode #4: Union-Find Cannot Tell a Healthy Cluster From a Sick One

This is the part people usually discover too late.

Real entities have natural limits.

A company may legitimately accumulate:

multiple source records
multiple historical snapshots
multiple crawled pages
multiple researcher edits
multiple location variants

But there are still bounds.

If a cluster suddenly contains 8,000 records, that is not “great recall.” n That is a warning siren.

Union-Find has no opinion about cluster health.

It does not know:

whether a cluster is too large,
whether it is held together by weak bridges,
whether density is suspicious,
whether low-confidence edges are doing all the work.

It only knows connectivity.

And connectivity alone is not enough.

In entity resolution, you need the ability to say:

Yes, these nodes are technically connected, but I do not trust this component as a single entity.

That is a production mindset. Union-Find does not have one.

The Better Model: Weighted Graph Clustering With Constraints

Once you accept that entity resolution is not just connectivity, the architecture changes.

The production-friendly model looks more like this:

build a graph of candidate matches
filter edges using signal-aware thresholds
enforce attribute safeguards before merging
compute connected components on the surviving graph
detect suspiciously large clusters
fragment them by trimming weak edges
support edge removals and local recomputation
explicitly abstain on ambiguous cases

That is still graph clustering. But it is graph clustering with judgment.

Here is a practical implementation using NetworkX.

A More Honest Entity Resolution Graph

import networkx as nx

class EntityResolutionGraph:
    def __init__(self):
        self.graph = nx.Graph()
        self.record_attributes = {}

    def add_record(self, record: Record):
        self.graph.add_node(record.id)
        self.record_attributes[record.id] = record

    def _passes_safeguards(self, id_a: str, id_b: str) -> bool:
        a = self.record_attributes.get(id_a)
        b = self.record_attributes.get(id_b)
        if not a or not b:
            return True

        if a.country and b.country and a.country != b.country:
            return False

        if a.domain and b.domain and a.domain != b.domain:
            return False

        if a.employee_count and b.employee_count:
            ratio = max(a.employee_count, b.employee_count) / max(min(a.employee_count, b.employee_count), 1)
            if ratio > 10.0:
                return False

        return True

    def _get_threshold(self, signal_types: list[str]) -> float:
        key = frozenset(signal_types)
        return THRESHOLDS.get(key, 0.90)

    def add_edge(self, edge: MatchEdge):
        threshold = self._get_threshold(edge.signal_types)
        if edge.confidence < threshold:
            return
        if not self._passes_safeguards(edge.record_a, edge.record_b):
            return

        self.graph.add_edge(
            edge.record_a,
            edge.record_b,
            confidence=edge.confidence,
            signal_types=edge.signal_types
        )

    def remove_edge(self, id_a: str, id_b: str):
        if self.graph.has_edge(id_a, id_b):
            self.graph.remove_edge(id_a, id_b)

    def get_clusters(self, max_cluster_size: int = 500) -> dict[str, str]:
        clusters = {}
        for component in nx.connected_components(self.graph):
            if len(component) > max_cluster_size:
                clusters.update(self._fragment_large_cluster(component))
            else:
                canonical = min(component)
                for node in component:
                    clusters[node] = canonical
        return clusters

    def _fragment_large_cluster(self, component: set[str]) -> dict[str, str]:
        subgraph = self.graph.subgraph(component).copy()

        edges_by_confidence = sorted(
            subgraph.edges(data=True),
            key=lambda e: e[2].get("confidence", 0)
        )

        for u, v, _ in edges_by_confidence:
            subgraph.remove_edge(u, v)
            components = list(nx.connected_components(subgraph))
            if all(len(c) <= 500 for c in components):
                break

        result = {}
        for sub_component in nx.connected_components(subgraph):
            canonical = min(sub_component)
            for node in sub_component:
                result[node] = canonical
        return result

This is not “fancier Union-Find.”

It is a different philosophy.

Instead of asking:

Which edges are above threshold?

you are asking:

Which edges deserve to change the shape of the entity graph?

That is the right question.

The Most Important Production Idea: Abstention

If there is one idea that separates toy ER systems from production ER systems, it is this:

You do not have to merge every plausible edge.

In fact, you should not.

The best production systems usually operate with three zones:

discard
review
merge

That looks like this:

THRESHOLDS = {
    frozenset(["name", "domain", "address", "phone"]): 0.70,
    frozenset(["name", "domain", "address"]): 0.78,
    frozenset(["name", "domain"]): 0.85,
    frozenset(["name", "country"]): 0.90,
    frozenset(["name"]): 0.95,
}

ABSTENTION_FLOOR = 0.50

def send_to_review_queue(edge: MatchEdge):
    print(f"Review needed for edge: {edge}")

def process_edge(edge: MatchEdge, graph: EntityResolutionGraph):
    if edge.confidence < ABSTENTION_FLOOR:
        return

    threshold = graph._get_threshold(edge.signal_types)

    if edge.confidence < threshold:
        send_to_review_queue(edge)
        return

    graph.add_edge(edge)

This is not just a workflow convenience. It is a quality strategy.

Because in entity resolution, the cost of a wrong merge is usually much higher than the cost of a delayed merge.

A non-merge can be revisited later when more evidence arrives.

A bad merge contaminates the canonical layer immediately.

That is why abstention is powerful. It gives your system permission to be uncertain without being reckless.

And that is exactly what production systems need.

Incremental Updates Are Not Optional

Once you support abstention and corrections, the next requirement becomes obvious:

you need local recomputation.

Here is a lightweight incremental extension:

class IncrementalEntityGraph(EntityResolutionGraph):
    def __init__(self):
        super().__init__()
        self._cluster_cache: dict[str, str] = {}
        self._dirty_nodes: set[str] = set()

    def add_edge(self, edge: MatchEdge):
        super().add_edge(edge)
        self._dirty_nodes.update([edge.record_a, edge.record_b])

    def remove_edge(self, id_a: str, id_b: str):
        super().remove_edge(id_a, id_b)
        self._dirty_nodes.update([id_a, id_b])

    def get_clusters(self, max_cluster_size: int = 500) -> dict[str, str]:
        if not self._dirty_nodes:
            return self._cluster_cache

        affected_components = set()
        for node in self._dirty_nodes:
            if node in self.graph:
                component = frozenset(nx.node_connected_component(self.graph, node))
                affected_components.add(component)

        for component in affected_components:
            if len(component) > max_cluster_size:
                updates = self._fragment_large_cluster(set(component))
            else:
                canonical = min(component)
                updates = {node: canonical for node in component}
            self._cluster_cache.update(updates)

        self._dirty_nodes.clear()
        return self._cluster_cache

This is the kind of behavior real systems need:

new evidence arrives,
bad evidence is removed,
only affected subgraphs are recomputed,
the rest of the graph stays untouched.

That is how you keep correction workflows fast without paying for a full rebuild every time someone fixes one bad edge.

What Changes at Distributed Scale

NetworkX is excellent for prototyping, debugging, and medium-sized workloads.

It is not the final answer for web-scale graphs.

Once you move into very large graphs, you typically shift to distributed connected-components computation using Spark-based graph tooling.

A simplified GraphFrames example:

from graphframes import GraphFrame
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("EntityResolution").getOrCreate()

edges_df = (
    raw_edges_df
    .filter(F.col("confidence") >= F.col("dynamic_threshold"))
    .filter(F.col("country_match") == True)
    .filter(F.col("domain_match") == True)
    .select(
        F.col("record_a").alias("src"),
        F.col("record_b").alias("dst"),
        "confidence"
    )
)

vertices_df = records_df.select(F.col("id"))

gf = GraphFrame(vertices_df, edges_df)

spark.sparkContext.setCheckpointDir("/tmp/graph_checkpoints")

result = gf.connectedComponents()
clusters_df = result.select("id", "component")

The important point is not the framework choice.

The important point is that even at distributed scale, the winning pattern is still the same:

filter aggressively,
enforce safeguards early,
cluster conservatively,
and preserve the ability to correct mistakes.

Scale does not make bad clustering assumptions better. It only makes them more expensive.

Union-Find vs. Production Entity Resolution

Here is the blunt version:

Union-Find is still a beautiful algorithm.

It is just solving a cleaner problem than the one production entity resolution actually gives you.

That distinction matters.

A lot.

Final Thought: The Real Job Is Not Clustering. It Is Protecting Trust.

This is the part that matters most.

Entity resolution is not just about grouping similar records. n It is about deciding what your entire data platform is allowed to believe.

Once a cluster becomes canonical, downstream systems stop asking questions.

That means every bad merge becomes institutionalized:

in profiles,
in enrichment APIs,
in analytics,
in search,
in customer-facing products.

So the real job of an entity resolution system is not merely to cluster.

It is to cluster without destroying trust.

That is why simplistic connectivity logic fails.

And that is why production systems need safeguards, abstention, reversibility, and skepticism built into the graph itself.

Union-Find gives you speed.

Production entity resolution needs judgment.

And judgment wins.

Like 0

Liked Liked