Exactly-Once in Spark Structured Streaming: What That Actually Means

Every time I see ‘exactly-once semantics’ in a product README, I add a small mental asterisk. Not because the claim is false (it’s often technically accurate), but because it means something much more specific than most people assume when they first encounter it.

I spent the better part of a sprint in our banking platform trying to diagnose why we had duplicate payment records in our Delta Lake tables despite using Spark Structured Streaming with checkpointing enabled. We had ‘exactly-once.’ We still had duplicates. Here’s why, and what exactly-once actually guarantees.

The guarantee, precisely stated

Spark Structured Streaming’s exactly-once guarantee is this: each input record will be processed and produce an output exactly once, assuming the source is replayable and the sink supports idempotent writes or transactions.

That last clause is doing a lot of work. ‘The sink supports idempotent writes or transactions.’ If your sink doesn’t meet that condition, Spark’s internal exactly-once mechanism, which is real and works, cannot give you an end-to-end exactly-once guarantee. It can only guarantee that Spark processed each record once internally. What happens when the result hits the sink is outside the guarantee.

What was actually happening in our pipeline

Our pipeline was reading payment events from Kafka and writing to a Delta Lake table in ADLS Gen2. We had checkpointing enabled, which tracks offsets and ensures Spark doesn’t reprocess Kafka messages after a restart.

The problem was in how we were writing. We were using a streaming foreachBatch writer with custom merge logic. We used MERGE INTO on the Delta table to upsert records. The MERGE statement was correct in isolation, but we had a subtle bug: on retry after a task failure, Spark would re-execute the foreachBatch function for the same micro-batch. Our MERGE logic was idempotent for the final state, but it was being called twice for the same batch ID, and due to a transaction isolation issue in our staging table, some records were being written twice before the MERGE deduplicated them.

The fix was simple once we understood the problem: check the batch ID at the start of foreachBatch and skip processing if that batch ID had already been committed. Delta Lake’s transaction log made this easy to verify. But finding the bug took three days because we kept looking at Kafka offset management; we assumed ‘exactly-once’ meant the problem couldn’t be in our write path.

The pattern in practice: at the top of your foreachBatch function, query the Delta table’s transaction log for the current batch ID. If it’s already been committed, return early. Otherwise, proceed with your MERGE or write logic. In pseudocode: def process_batch(df, batch_id): if batch_id_exists_in_target(batch_id): return; else: execute_merge(df, batch_id). The batch_id_exists_in_target check is a simple read against a committed_batches column or a separate tracking table. This adds a few seconds of overhead per batch but eliminates the duplicate-on-retry class of bugs entirely.
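To make the pattern concrete, here is a minimal runnable sketch. An in-memory set stands in for the Delta transaction log, and the names (process_batch, batch_id_exists_in_target, execute_merge) are illustrative, not a real Delta API:

```python
# Hypothetical sketch: an in-memory set stands in for the Delta
# transaction log that a real batch_id_exists_in_target would query.
committed_batches = set()

def batch_id_exists_in_target(batch_id):
    # In production this would read a committed_batches tracking table
    # or the Delta transaction log instead of process memory.
    return batch_id in committed_batches

def execute_merge(rows, batch_id, target):
    # Upsert rows keyed by id, then record the batch as committed.
    for row in rows:
        target[row["id"]] = row
    committed_batches.add(batch_id)

def process_batch(rows, batch_id, target):
    if batch_id_exists_in_target(batch_id):
        return  # batch already committed; a retry becomes a no-op
    execute_merge(rows, batch_id, target)

# Simulate Spark retrying the same micro-batch after a task failure.
table = {}
payments = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
process_batch(payments, batch_id=7, target=table)
process_batch(payments, batch_id=7, target=table)  # retry: skipped
```

After the simulated retry, the target still holds exactly two records, which is the whole point of guarding on the batch ID rather than on the data itself.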

The three layers you actually need to think about

Exactly-once in streaming is better thought of as a property of the full pipeline, not of any single component. There are three layers to get right.

Layer one is source replay. Your source needs to be replayable so that after a failure, Spark can re-read records from a committed offset. Kafka does this well. HTTP webhooks don’t. If your source isn’t replayable, you’re at-most-once by design and no amount of framework magic changes that.

Layer two is Spark’s internal processing guarantee. This is what checkpointing handles. Spark tracks which offsets have been committed and which micro-batches have completed. After a restart, it resumes from the last committed checkpoint. This layer works well and you mostly don’t need to think about it if you’ve configured checkpointing correctly.

Layer three is sink idempotency. This is the one that bites people. If your sink write is not idempotent (idempotent meaning that running it twice with the same data produces the same result), then a task retry or driver restart can produce duplicates. Delta Lake’s MERGE INTO is idempotent if written correctly. Appending with no deduplication is not.
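The difference between the two write styles is easy to demonstrate outside Spark. A toy sketch (all names illustrative): the same retried batch goes through a bare append and through a keyed upsert, the shape of a MERGE on a key:

```python
# Toy demonstration, not Spark: one micro-batch written twice, as
# happens when Spark retries a batch after a failure.
batch = [{"id": "p-1", "amount": 100}, {"id": "p-2", "amount": 250}]

append_sink = []   # append with no deduplication
upsert_sink = {}   # keyed by record id, like MERGE INTO on a key

def write_append(rows):
    # Not idempotent: each call adds the rows again.
    append_sink.extend(rows)

def write_upsert(rows):
    # Idempotent: repeated calls converge to the same final state.
    for row in rows:
        upsert_sink[row["id"]] = row

# Simulate a task retry: each writer sees the same batch twice.
for _ in range(2):
    write_append(batch)
    write_upsert(batch)
```

The append sink ends up with four records for two payments; the upsert sink ends up with two. The retry, not the data, is what separates them.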

A note on Delta Lake specifically

Delta Lake’s streaming write mode (writeStream) with checkpointing enabled does give you end-to-end exactly-once for append operations because it uses a transaction log with atomic commits and idempotent batch IDs. If you’re using writeStream.format("delta").start(), you’re in good shape without any extra work.

Where it gets complicated is when you’re using foreachBatch for custom write logic: merges, conditional inserts, multi-table writes. In that case, you’re responsible for making your write logic idempotent with respect to the batch ID, because Spark can and will retry batch execution under failure conditions. To be explicit about the distinction: writeStream with Delta handles idempotency for you because the Delta transaction protocol tracks batch IDs internally and will skip a batch that has already been committed. With foreachBatch, you are stepping outside that automatic tracking and taking on the responsibility yourself. Most production pipelines that do anything beyond simple appends end up using foreachBatch, which is why this distinction matters more than the documentation might suggest.
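For reference, the foreachBatch wiring looks roughly like this. This is a hedged sketch, not our production code: it assumes a running Spark session with Delta configured, and the paths, topic, broker address, and payment_id key are placeholders. Note that the batch-ID guard discussed earlier is deliberately not shown here; without it, you are relying on the MERGE alone being idempotent under retry:

```python
# Sketch only: requires a live Spark session ("spark") with the
# delta-spark package installed; all paths and names are placeholders.
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # You own idempotency here: Spark may call this function twice
    # with the same batch_id after a task or driver failure.
    target = DeltaTable.forPath(spark, "/delta/payments")
    (target.alias("t")
           .merge(batch_df.alias("s"), "t.payment_id = s.payment_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "payments")
      .load()
      .writeStream
      .foreachBatch(upsert_batch)
      .option("checkpointLocation", "/checkpoints/payments")  # durable storage
      .start())
```

The checkpointLocation must be durable storage shared across restarts; pointing it at local disk silently downgrades the guarantee.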

The practical checklist

Before trusting an exactly-once claim in your pipeline, answer these: Is your source replayable with committed offsets? Is checkpointing enabled and writing to durable storage (not local disk)? If you’re using foreachBatch, is your write logic idempotent if called twice with the same batch ID? If you’re doing MERGE operations, have you verified they behave correctly under concurrent retries?

If the answer to all four is yes, you probably have genuine end-to-end exactly-once. If any answer is no, you have exactly-once at the Spark processing layer and something weaker at the ends. That gap is where duplicates live.

If you’re also dealing with pipeline framework design and how to build these guarantees into a reusable layer, I cover that topic separately in my article on config-driven pipeline frameworks.
