A Block Editor Is Not Just a Text Field
A block editor is a document editor
The textbook Compose text-input pattern is clean on a whiteboard: keep the document in immutable state, feed the current value into a remembered TextFieldState, and push edits back through callbacks. For ordinary input, that model is fine. It breaks the moment Enter stops meaning “insert a newline” and starts meaning “split this block, preserve formatting, and keep the document valid for serialization.”
The document is an ordered list of structural units (paragraphs, headings, quotes, todos, list items, dividers), closer to Notion than to a single rich-text field. Each block has its own identity, lifecycle, and serialization rules. Some gestures are still local text edits; many are document operations expressed through a text UI. Enter may split a block. Pressing Backspace at the start of a block may mean “merge two blocks and reconcile their text, spans, and caret.” Converting a paragraph into a list item changes structure, not just appearance. A block editor is a document editor that happens to use text fields at the edge.

Reference implementation: Cascade Editor — a working block editor built on the architecture described in this article.
Immutable snapshot state and long-lived runtime editing objects are both necessary, but they cannot both lead at the same time. The architecture only stabilizes once each state domain has one owner. The rest of this article is about where those boundaries sit.
The clean architecture that fails
Start with the architecture that tries to avoid choosing an owner. It looks disciplined: keep block text in reducer state, remember a TextFieldState in the UI, and use an effect to “sync” the field whenever the reducer changes.
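import androidx.compose.foundation.text.BasicTextField
import androidx.compose.foundation.text.input.delete
import androidx.compose.foundation.text.input.rememberTextFieldState
import androidx.compose.runtime.Composable
import androidx.compose.runtime.LaunchedEffect
import androidx.compose.runtime.getValue
import androidx.compose.runtime.mutableStateOf
import androidx.compose.runtime.remember
import androidx.compose.runtime.setValue
import androidx.compose.runtime.snapshotFlow
import kotlinx.coroutines.flow.collectLatest

data class BlockState(
    val text: String = "abcdef",
)

// Intentionally broken: two owners for the same text.
@Composable
fun BrokenBlockEditor() {
    var block by remember { mutableStateOf(BlockState()) }
    val fieldState = rememberTextFieldState(initialText = block.text)
    // Snapshot -> buffer: rewrites the whole live buffer whenever block.text changes.
    LaunchedEffect(block.text) {
        fieldState.edit {
            delete(0, length)
            append(block.text)
        }
    }
    // Buffer -> snapshot: echoes every edit back into immutable state,
    // which re-triggers the effect above.
    LaunchedEffect(Unit) {
        snapshotFlow { fieldState.text.toString() }
            .collectLatest { latest ->
                block = block.copy(text = latest)
            }
    }
    BasicTextField(state = fieldState)
}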
Put the caret between c and d in abcdef, type X, and the text becomes abcXdef with the caret jumping to the end. The reducer did receive the correct string. During typing, though, the authoritative object is no longer block.text. It is the live editing buffer inside fieldState, including selection and composition state.
The sequence looks like this:
- the user types into fieldState
- the live buffer becomes abcXdef, with the caret after X
- snapshotFlow publishes that text back into immutable state
- the reducer updates block.text
- LaunchedEffect(block.text) rewrites the whole buffer from snapshot state, and the caret lands at the end
The same architecture gets worse on structural edits. Each block owns its own remembered TextFieldState, and backspace at the start of block B should merge B into the previous block A.
fun mergeIntoPrevious(blocks: List<BlockState>, index: Int): List<BlockState> {
    val previous = blocks[index - 1]
    val current = blocks[index]
    return blocks.toMutableList().apply {
        this[index - 1] = previous.copy(text = previous.text + current.text)
        removeAt(index)
    }
}
The reducer result is still correct. The broken part is that the runtime editing objects no longer line up with the document structure they are supposed to represent. At that point, selection restoration stops being an intrinsic property of the editing model and turns into bookkeeping: capture offsets, compute a new position, wait for state to sync, then push the caret back in.
This is the shape of an ordering bug: individually valid operations running in the wrong phase of the pipeline. The effect has no way to tell a fresh reducer update apart from the user’s own typing echoed back, so it replays either way.
Effects still belong around an editor for focus requests, cleanup, scrolling, popup positioning, or observing committed edits. The anti-pattern is using effect-driven text mirroring as the primary boundary between immutable document state and the live editing buffer.
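To make the structural failure concrete, here is a plain-Kotlin model with no Compose types involved. It assumes runtime holders are simple mutable buffers and compares two ways of reusing them after a merge: by list position (roughly what remembered state does when blocks are composed without stable keys) and by block id. `Block`, `Holder`, `pairByPosition`, `pairById`, and `staleBlocks` are hypothetical names for this sketch.

```kotlin
// Plain-Kotlin model of the merge mismatch; no Compose types involved.
data class Block(val id: String, val text: String)

// Hypothetical stand-in for a remembered runtime buffer.
class Holder(var text: String)

// Same reducer shape as above: correct on its own terms.
fun mergeIntoPrevious(blocks: List<Block>, index: Int): List<Block> {
    val previous = blocks[index - 1]
    val current = blocks[index]
    return blocks.toMutableList().apply {
        this[index - 1] = previous.copy(text = previous.text + current.text)
        removeAt(index)
    }
}

// Reuse holders by list position, as remembered state does without stable keys.
fun pairByPosition(blocks: List<Block>, holders: List<Holder>): List<Pair<Block, Holder?>> =
    blocks.mapIndexed { i, block -> block to holders.getOrNull(i) }

// Reuse holders by block identity.
fun pairById(blocks: List<Block>, holders: Map<String, Holder>): List<Pair<Block, Holder?>> =
    blocks.map { block -> block to holders[block.id] }

// Ids of blocks whose reused holder disagrees with the snapshot text.
fun staleBlocks(pairs: List<Pair<Block, Holder?>>): List<String> =
    pairs.filter { (block, holder) -> holder != null && holder.text != block.text }
        .map { (block, _) -> block.id }
```

Position-keyed reuse misaligns every block after the merge point. Id-keyed reuse limits the damage to the surviving block, but that block's buffer is still stale until something reconciles it, which is exactly the bookkeeping the previous paragraphs warn about.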
One owner per state domain
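There are three owners, each responsible for one or more state domains: snapshot state owns document structure, the runtime text buffer (TextFieldState) owns live text, and runtime span state owns live formatting and formatting intent.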

Two cases are easy to miss. Selection is split across two domains: block-level selection, meaning which blocks are highlighted in multi-select, belongs to the snapshot, while text selection inside a focused block belongs to the live TextFieldState, where the IME, composition state, and selection machinery already live. Pending styles are runtime intent, not document state. “The next typed character should be bold” is not the same fact as “characters 3 through 8 are bold.” The first is editing intent attached to a caret; the second is durable content.
The non-obvious move is refusing to force those facts into one owner: text selection belongs to the IME-facing buffer, while pending styles belong to runtime span state as the editor’s promise about the next insertion.
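The difference between intent and content shows up as soon as the next character is typed. A minimal sketch, assuming a hypothetical `StyledSpan` type in visible coordinates and a pending-style set held outside the document: when text is inserted at the caret, each pending style becomes a durable span over exactly the inserted range. The caller is assumed to have already shifted existing spans for the insertion.

```kotlin
// Hypothetical style and span types for this sketch; visible coordinates, end exclusive.
enum class Style { BOLD, ITALIC, CODE }

data class StyledSpan(val start: Int, val end: Int, val style: Style)

/**
 * Converts caret-attached intent ([pending]) into durable content: one span per
 * pending style covering the just-inserted range. [spans] are assumed to be
 * already adjusted for the insertion.
 */
fun applyPendingStyles(
    spans: List<StyledSpan>,
    pending: Set<Style>,
    insertStart: Int,
    insertedLength: Int,
): List<StyledSpan> {
    if (pending.isEmpty() || insertedLength <= 0) return spans
    val created = pending.map { style ->
        StyledSpan(insertStart, insertStart + insertedLength, style)
    }
    return (spans + created).sortedWith(compareBy({ it.start }, { it.end }))
}
```

Whether the pending set is cleared after the first insertion is a policy decision; this sketch leaves the set untouched and lets the caller decide.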
The implementation rule: runtime holders are created once per block identity, then reused until the editor intentionally resets them.
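import androidx.compose.foundation.text.input.TextFieldState
import androidx.compose.runtime.Stable
import androidx.compose.runtime.getValue
import androidx.compose.runtime.mutableIntStateOf
import androidx.compose.runtime.setValue
import androidx.compose.ui.text.TextRange

// BlockId and ZWSP are project types/constants defined elsewhere.
@Stable
public class BlockTextStates {
    private val states = mutableMapOf<BlockId, TextFieldState>()

    /**
     * Monotonically increasing counter, incremented on [clear].
     * Used as a `remember` key so composables re-fetch from the map
     * after a bulk reset.
     */
    internal var generation: Int by mutableIntStateOf(0)
        private set

    public fun getOrCreate(
        blockId: BlockId,
        initialText: String,
        initialCursorPosition: Int = 0
    ): TextFieldState {
        return states.getOrPut(blockId) {
            // The runtime buffer starts with the ZWSP sentinel at index 0.
            TextFieldState(initialText = "$ZWSP$initialText").also { state ->
                val safePosition = initialCursorPosition.coerceIn(0, initialText.length)
                state.edit {
                    // +1 converts the visible cursor position into buffer coordinates.
                    selection = TextRange(safePosition + 1)
                }
            }
        }
    }

    public fun clear() {
        states.clear()
        // Also resets commit-tracking state (covered in Section 6).
        generation++
    }
}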
The ZWSP sentinel lives only in the runtime buffer; Section 7, “The ZWSP tax and visible coordinates,” explains why.
Call sites key on block.id and generation:
val textContent = block.content as? BlockContent.Text ?: return
val textStates = LocalBlockTextStates.current
val textFieldState = remember(block.id, textStates.generation) {
textStates.getOrCreate(block.id, textContent.text)
}
val spanStates = LocalBlockSpanStates.current
val spanState = remember(block.id, spanStates.generation) {
spanStates.getOrCreate(block.id, textContent.spans, textContent.text.length)
}
block.id is per-block identity: change the id, reload the holder. generation is the bulk-reset escape hatch: clear() increments it, invalidating every remember keyed by it, so document load or history replay wipes runtime state in one call. Between resets, the same holders survive typing, structural edits, and recomposition. Snapshot content seeds them when a block first appears; it does not keep reclaiming ownership while the user edits.
Split proves the ownership model
Split touches every state domain at once — live text, live spans, pending styles, structure, focus, history — so it exposes ownership bugs immediately. Each layer leads for one part of the operation, and the order is not interchangeable.
Enter performs a structural edit
onEnter() wraps the whole operation in a structural history transaction. In this implementation, that transaction first closes any open typing batch, then captures before/after checkpoints and pushes a forced StructuralEntry, which keeps the typing and the split in separate undo steps.
override fun onEnter(blockId: BlockId, cursorPosition: Int) = runStructuralMutation {
    // 1. Generate the id up front so runtime state and the reducer target the same block.
    val newBlockId = BlockId.generate()

    // 2. Authoritative text comes from the runtime buffer, not from snapshot state.
    val currentText = textStates.getVisibleText(blockId) ?: return@runStructuralMutation
    val beforeText = currentText.take(cursorPosition)
    val afterText = currentText.drop(cursorPosition)

    // 3. Resolve continuation styles BEFORE split(): it clears pending on both blocks.
    val continuationStyles = resolveContinuationStyles(
        blockId = blockId,
        cursorPosition = cursorPosition,
        textLength = currentText.length,
    )

    // 4. Split runtime spans first: clips crossing spans, rebases target to 0-based,
    //    and clears pending styles on source and target.
    spanStates.split(sourceBlockId = blockId, newBlockId = newBlockId, position = cursorPosition)

    // 5. Reattach continuation intent to the new block AFTER split() cleared it.
    if (continuationStyles.isNotEmpty()) {
        spanStates.setPendingStyles(newBlockId, continuationStyles)
    }

    // 6. Truncate the source runtime buffer to match the pre-split prefix.
    textStates.setText(blockId, beforeText, cursorPosition = beforeText.length)

    // 7. Hand the reducer a runtime-derived payload. Structural work only beyond this point.
    dispatch(
        SplitBlock(
            blockId = blockId,
            atPosition = cursorPosition,
            newBlockId = newBlockId,
            newBlockText = afterText,
            newBlockSpans = spanStates.getSpans(newBlockId),
            sourceBlockText = beforeText,
            sourceBlockSpans = spanStates.getSpans(blockId),
        )
    )
}
The ordering matters. If the split were computed from snapshot text instead of the live buffer, the reducer could still be perfectly correct while the editor split the wrong content.
Two things are load-bearing here. newBlockId is generated immediately so every downstream operation targets the same block identity. And currentText comes from BlockTextStates, not from snapshot state, because during active editing the runtime buffer is the authoritative text source.
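The structural history transaction that onEnter() runs inside can be modeled without any editor types. A minimal sketch, assuming a hypothetical `HistorySketch` in which typing coalesces into one open batch and every structural mutation closes that batch before pushing its own forced entry. That ordering is what keeps typing and the split as separate undo steps.

```kotlin
/** One undo step: the state to restore on undo and the state it produced. */
data class HistoryEntry<S>(val before: S, val after: S, val structural: Boolean)

/** Hypothetical undo history: coalesced typing batches plus forced structural entries. */
class HistorySketch<S> {
    private val entries = ArrayDeque<HistoryEntry<S>>()
    private var typingBatchOpen = false

    /** Records a typing change, extending the open batch if there is one. */
    fun recordTyping(before: S, after: S) {
        val last = entries.lastOrNull()
        if (typingBatchOpen && last != null && !last.structural) {
            // Same batch: keep the batch's original "before", move its "after".
            entries[entries.lastIndex] = last.copy(after = after)
        } else {
            entries.addLast(HistoryEntry(before, after, structural = false))
            typingBatchOpen = true
        }
    }

    /** Closes any open typing batch, runs [mutation], then pushes a forced structural entry. */
    fun <R> runStructural(current: () -> S, mutation: () -> R): R {
        typingBatchOpen = false
        val before = current()
        val result = mutation()
        entries.addLast(HistoryEntry(before, current(), structural = true))
        return result
    }

    /** Pops the newest entry and returns the state to restore, or null if empty. */
    fun undo(): S? {
        typingBatchOpen = false
        return entries.removeLastOrNull()?.before
    }

    val size: Int get() = entries.size
}
```

In the usage below, two keystrokes collapse into one typing step and the split lands in its own step, so the first undo restores the pre-split text rather than discarding the typing.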
Resolve continuation styles before the split clears them
BlockSpanStates.split() clears pending styles on both source and target, so any decision about what formatting continues into the new block has to be made before it runs. The rules are deliberately narrow:
- Explicit pending styles (user toggled Bold before Enter) win.
- No pending styles + collapsed cursor at end of block: the new block inherits the styles active at the last character.
- Mid-block splits get no positional continuation — styled content already moves through the span split.
- Ranged selections get none either.
That decision has to happen in runtime state, before structure changes, because this is the only moment when the editor still knows both the user’s current formatting intent and the pre-split context it came from.
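The four rules are small enough to state as one pure function. This is a sketch, not the editor's resolveContinuationStyles(): it assumes the caller passes the block's pending styles, its runtime spans in visible coordinates, and the visible selection bounds, and it checks the rules in the order listed. `Style`, `StyledSpan`, and `resolveContinuation` are hypothetical.

```kotlin
// Hypothetical types for this sketch; spans use visible coordinates, end exclusive.
enum class Style { BOLD, ITALIC, CODE }

data class StyledSpan(val start: Int, val end: Int, val style: Style)

fun resolveContinuation(
    pending: Set<Style>,
    spans: List<StyledSpan>,
    selectionStart: Int,
    selectionEnd: Int,
    textLength: Int,
): Set<Style> {
    // Explicit pending styles (the user toggled a style before Enter) win.
    if (pending.isNotEmpty()) return pending
    // Ranged selections get no continuation.
    if (selectionStart != selectionEnd) return emptySet()
    // Mid-block splits (and empty blocks) get no positional continuation;
    // styled content already moves through the span split.
    if (textLength == 0 || selectionEnd != textLength) return emptySet()
    // Collapsed caret at the end: inherit the styles active on the last character.
    val last = textLength - 1
    return spans.filter { last >= it.start && last < it.end }.map { it.style }.toSet()
}
```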
Spans and text cross the boundary at different moments
Steps 4–6 perform the runtime mutation; step 7 hands off to the reducer. The important asymmetry is how the new block’s state crosses that boundary.
The new block’s spans are transferred runtime-first, at step 4. Its text state is never allocated in onEnter() at all; the reducer inserts a snapshot block carrying afterText, focuses it, and only when that block renders does TextBlockField call textStates.getOrCreate(block.id, ...).
This is the ownership model under stress: formatting continuity has to be repaired while the source runtime context still exists, but text state for a not-yet-rendered block has nothing live to contradict the snapshot. Both paths agree on newBlockId, so when the new block finally renders, the runtime text holder and the already-split runtime spans attach to the same document node instead of drifting into separate targets.
The reducer canonicalizes; it does not recompute
The reducer still owns structure: insert the new block, propagate list type, renumber, move focus. For content, its job changes.
public data class SplitBlock(
    val blockId: BlockId,
    val atPosition: Int,
    // Runtime-derived payload. All optional for backward compatibility.
    val newBlockId: BlockId? = null,
    val newBlockText: String? = null,
    val newBlockSpans: List<TextSpan>? = null,
    val sourceBlockText: String? = null,
    val sourceBlockSpans: List<TextSpan>? = null,
) : EditorAction {
    override fun reduce(state: EditorState): EditorState {
        // …resolve block + clamp position…
        // Prefer runtime-resolved text; snapshot split is the fallback.
        val beforeText = sourceBlockText ?: textContent.text.take(clampedPosition)
        val afterText = newBlockText ?: textContent.text.drop(clampedPosition)
        val (beforeSnap, afterSnap) = SpanAlgorithms.splitAt(textContent.spans, clampedPosition)
        val beforeSpans = SpanAlgorithms.normalize(sourceBlockSpans ?: beforeSnap, beforeText.length)
        val afterSpans = SpanAlgorithms.normalize(newBlockSpans ?: afterSnap, afterText.length)
        // …insert new block, propagate list type, move focus…
    }
}
The ?: pattern shows the contract. The reducer prefers runtime-resolved text and spans when provided; otherwise it splits the snapshot itself. Either way, the result goes through SpanAlgorithms.normalize() against the final text lengths. Snapshot state stays canonical: the reducer canonicalizes the runtime result instead of recomputing the split on its own.
Split is where the ownership model proves itself: current text, formatting continuity, and structure each have a clear owner, and the handoff order is explicit enough that no layer has to guess which version of the document is real. The editor never has to repair a split after the fact.
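SpanAlgorithms itself is not shown here. As a rough sketch of the two guarantees the reducer leans on, assuming hypothetical `splitSpansAt` and `normalizeSpans` helpers over visible-coordinate spans: the split clips any span that crosses the split point and rebases the tail to start at 0, and normalization clamps spans to the final text length, drops empty ones, and (as an assumed policy) merges overlapping or touching spans of the same style.

```kotlin
// Hypothetical types for this sketch; visible coordinates, end exclusive.
enum class Style { BOLD, ITALIC }

data class StyledSpan(val start: Int, val end: Int, val style: Style)

/** Clips spans that cross [position]; the tail half is rebased to start at 0. */
fun splitSpansAt(spans: List<StyledSpan>, position: Int): Pair<List<StyledSpan>, List<StyledSpan>> {
    val head = spans.filter { it.start < position }
        .map { it.copy(end = minOf(it.end, position)) }
    val tail = spans.filter { it.end > position }
        .map { StyledSpan(maxOf(it.start, position) - position, it.end - position, it.style) }
    return head to tail
}

/** Clamps to [0, textLength], drops empty spans, and merges touching spans of one style. */
fun normalizeSpans(spans: List<StyledSpan>, textLength: Int): List<StyledSpan> {
    val clamped = spans
        .map { it.copy(start = it.start.coerceIn(0, textLength), end = it.end.coerceIn(0, textLength)) }
        .filter { it.start < it.end }
    return clamped.groupBy { it.style }
        .flatMap { (_, sameStyle) ->
            val merged = mutableListOf<StyledSpan>()
            for (span in sameStyle.sortedBy { it.start }) {
                val last = merged.lastOrNull()
                if (last != null && span.start <= last.end) {
                    merged[merged.lastIndex] = last.copy(end = maxOf(last.end, span.end))
                } else {
                    merged += span
                }
            }
            merged
        }
        .sortedWith(compareBy({ it.start }, { it.style }))
}
```

Because both halves go through normalization against their own final lengths, a runtime payload and a snapshot fallback end up in the same canonical shape.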
Merge inverts split, with different failure modes
Split and merge are inverse document operations, but they fail for different reasons. Split starts from one focused block that already has live runtime state. Merge starts at the boundary between two blocks, and one of those blocks may still exist only in snapshot state because it has never been rendered in the current composition.
The runtime buffer provides the signal. When BackspaceAwareTextField observes that the leading sentinel would disappear while the visible cursor is at 0, it restores the sentinel and calls onBackspaceAtStart(). That callback is the editor’s structural signal for merge: not “delete text,” but “merge this block into the previous one.”
Once that callback fires, the first job is to ensure that both merge participants have runtime holders. The current block is focused, but the previous block may be off-screen and may never have allocated a TextFieldState or span holder in this composition. So onBackspaceAtStart() seeds both sides with getOrCreate() before doing anything destructive. Split creates a new identity from one live source; merge may have to materialize missing live state for an already existing target.
textStates.getOrCreate(blockId, sourceContent?.text.orEmpty())
textStates.getOrCreate(previousBlock.id, targetContent.text)
spanStates?.getOrCreate(blockId, sourceContent?.spans.orEmpty(), sourceContent?.text?.length ?: 0)
spanStates?.getOrCreate(previousBlock.id, targetContent.spans, targetContent.text.length)
val targetTextLength = textStates.mergeInto(
    sourceId = blockId,
    targetId = previousBlock.id,
)
spanStates?.mergeInto(
    sourceId = blockId,
    targetId = previousBlock.id,
    targetTextLength = targetTextLength,
)
textStates.mergeInto() appends the source text into the target runtime buffer, places the caret at the merge point, and returns the target’s pre-merge visible length. That value gives spanStates.mergeInto() the anchor it needs to shift source spans exactly once before combining the two span lists. The span layer also clears pending styles on both sides, so formatting intent does not leak across a block boundary that no longer exists.
Only after the runtime merge completes does snapshot state catch up. The callback reads the merged text and spans from the surviving runtime holders, dispatches UpdateBlockContent() for the target, and only then dispatches DeleteBlock() for the source and moves focus. UpdateBlockText() would be wrong here because it resets spans.
Split shows that runtime state must lead when one block becomes two. Merge shows the harder asymmetry: before two blocks can become one, runtime state may need to be reconstructed for both.
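The anchor handoff between the two mergeInto() calls is easy to get wrong, so here is a plain-Kotlin model of it. `TextHolder`, `SpanHolder`, `mergeText`, and `mergeSpans` are hypothetical stand-ins working in visible coordinates; the point is that the text merge returns the target's pre-merge length and the span merge shifts source spans by exactly that amount, once.

```kotlin
// Hypothetical types for this sketch; visible coordinates, end exclusive.
enum class Style { BOLD, ITALIC }

data class StyledSpan(val start: Int, val end: Int, val style: Style)

/** Hypothetical runtime text holder (no sentinel in this model). */
class TextHolder(var text: String, var caret: Int = text.length)

/** Hypothetical runtime span holder with caret-attached pending styles. */
class SpanHolder(var spans: List<StyledSpan>, var pending: Set<Style> = emptySet())

/** Appends source text to target, puts the caret at the seam, returns the target's pre-merge length. */
fun mergeText(source: TextHolder, target: TextHolder): Int {
    val targetLength = target.text.length
    target.text += source.text
    target.caret = targetLength
    return targetLength
}

/** Shifts source spans once by [targetTextLength], combines them, and clears pending intent on both sides. */
fun mergeSpans(source: SpanHolder, target: SpanHolder, targetTextLength: Int) {
    val shifted = source.spans.map {
        it.copy(start = it.start + targetTextLength, end = it.end + targetTextLength)
    }
    target.spans = target.spans + shifted
    target.pending = emptySet()
    source.pending = emptySet()
}
```

Calling mergeText() first is what makes the anchor trustworthy: the length is read before the append, so the span shift cannot accidentally include the source's own characters.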
Observe committed text, not input intent
Split and merge establish who owns structural state. The next boundary is subtler: where text-side effects are allowed to react. One tempting design is to maintain spans inside the input pipeline: a key arrives, you infer intent, and you mutate formatting state immediately. That feels responsive, but it couples span logic to keystrokes, IME behavior, and callback timing.
This editor takes the opposite approach. TextBlockField first lets TextFieldState apply the change, then observes the resulting state snapshot — visible text plus visible selection — from one snapshotFlow.
LaunchedEffect(textFieldState, spanTextObserver, textHistoryTracker, stateHolder) {
    var lastObservedVisibleText = textFieldState.visibleText()
    // Raw snapshot identity can change even when the visible string does not.
    var lastObservedTextSnapshot = textFieldState.text
    snapshotFlow {
        // Text and selection are observed together so selection-only updates and
        // text commits are classified from the same post-commit boundary.
        Pair(textFieldState.text, textFieldState.visibleSelection())
    }.collect { (currentTextSnapshot, selection) ->
        val textSnapshotChanged =
            currentTextSnapshot !== lastObservedTextSnapshot ||
                currentTextSnapshot.length != lastObservedTextSnapshot.length
        val currentVisibleText = if (textSnapshotChanged) {
            // Visible text is derived only after a new committed snapshot arrives.
            visibleTextFromSnapshot(currentTextSnapshot)
        } else {
            lastObservedVisibleText
        }
        if (textSnapshotChanged) {
            val isProgrammatic = textStates.hasPendingProgrammaticCommit(block.id)
            if (currentVisibleText != lastObservedVisibleText) {
                // Post-commit observers run only when committed visible text changed.
                spanTextObserver.onCommittedVisibleText(currentVisibleText)
                lastObservedVisibleText = currentVisibleText
            } else if (isProgrammatic) {
                // A programmatic commit can replace the raw snapshot without changing
                // visible text (for example, a split at the end of a block). Consume it
                // here so the next real keystroke is not misclassified as programmatic.
                textStates.consumeProgrammaticCommit(block.id)
            } else {
                noteSelectionState(selection)
            }
            lastObservedTextSnapshot = currentTextSnapshot
        } else {
            noteSelectionState(selection)
        }
    }
}
That collector is the post-state-update boundary for text side effects. Everything downstream sees the same TextFieldState snapshot, in the same order, after the edit has been applied to the buffer. It is not an IME-commit signal; it is the point where external editor logic reacts to the text and selection currently held by TextFieldState.
The span observer is where that boundary earns its keep. SpanMaintenanceTextObserver does not ask what key was pressed. It classifies the committed result against the right baseline.
internal fun onCommittedVisibleText(currentVisibleText: String) {
    val expectedProgrammaticText = textStates.consumeProgrammaticCommit(blockId)
    if (expectedProgrammaticText != null) {
        if (currentVisibleText == expectedProgrammaticText) {
            previousVisibleText = currentVisibleText
            return
        }
        previousVisibleText = expectedProgrammaticText
    }
    val edit = computeEdit(previousVisibleText, currentVisibleText) ?: return
    previousVisibleText = currentVisibleText
    spanStates.adjustForUserEdit(
        blockId = blockId,
        editStart = edit.start,
        deletedLength = edit.deletedLength,
        insertedLength = edit.insertedLength,
    )
    applyPendingStyles(edit.start, edit.insertedLength, currentVisibleText.length)
}
pendingProgrammaticCommits is the piece that makes this safe. Structural operations such as split, merge, slash replacement, history replay, or setText() are allowed to mutate the runtime buffer directly. When that happens, span maintenance must not “fix” the same edit a second time. So the observer first consumes the expected programmatic result. If the committed text matches it exactly, the callback-side transfer already handled the span work and the observer returns early. If the committed text has moved past that baseline, the observer rebases previousVisibleText and computes only the real user delta that followed.
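The consume-then-rebase logic reduces to a small state machine. A minimal sketch, assuming hypothetical `ProgrammaticCommits` and `CommitBaseline` classes keyed by a string block id; `classify()` mirrors onCommittedVisibleText() above and returns the (baseline, current) pair that span maintenance should diff, or null when the programmatic path already did the work.

```kotlin
/**
 * Hypothetical tracker: structural code registers the visible text it is about
 * to commit, and the observer consumes that expectation exactly once.
 */
class ProgrammaticCommits {
    private val expected = mutableMapOf<String, String>()

    fun register(blockId: String, expectedVisibleText: String) {
        expected[blockId] = expectedVisibleText
    }

    fun hasPending(blockId: String): Boolean = blockId in expected

    /** Returns and forgets the expected text; null means the commit came from the user. */
    fun consume(blockId: String): String? = expected.remove(blockId)
}

/** Tracks the visible-text baseline that user edits are diffed against. */
class CommitBaseline(private var previousVisibleText: String) {
    fun classify(commits: ProgrammaticCommits, blockId: String, current: String): Pair<String, String>? {
        val expectedText = commits.consume(blockId)
        if (expectedText != null) {
            if (current == expectedText) {
                // Fully programmatic: the structural callback already moved the spans.
                previousVisibleText = current
                return null
            }
            // The buffer moved past the programmatic result; diff only the user's part.
            previousVisibleText = expectedText
        }
        val baseline = previousVisibleText
        previousVisibleText = current
        return baseline to current
    }
}
```

Without the registered expectation in the last case of the usage below, the observer would diff from the stale baseline and misread the programmatic rewrite as a user edit.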
There is one nasty edge case that makes this boundary necessary. A programmatic edit can produce a new raw text snapshot without changing visible text at all. A split at the end of a block is enough: the source block may still pass through setText() even though its visible content stays the same. That stale programmatic commit still has to be consumed. Otherwise the next real keystroke is misclassified as programmatic, and span maintenance skips a genuine user edit.
Do not mutate external editor state from input intent. Let the buffer commit, observe committed visible text once, then classify the result against the correct baseline.
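computeEdit() and adjustForUserEdit() are not shown. One plausible shape, sketched under stated assumptions: the edit is recovered from the longest common prefix and suffix of the two visible strings, and span boundaries are mapped through it so that a pure insertion exactly at a span boundary stays outside the span (pending styles, not position, then decide whether it is styled). `TextEdit`, `mapOffset`, and `adjustSpans` are hypothetical names.

```kotlin
// Hypothetical types for this sketch; visible coordinates, end exclusive.
enum class Style { BOLD, ITALIC }

data class StyledSpan(val start: Int, val end: Int, val style: Style)

/** Replacement of [deletedLength] chars at [start] by [insertedLength] new chars. */
data class TextEdit(val start: Int, val deletedLength: Int, val insertedLength: Int)

/** Recovers a single edit from the common prefix and suffix; null if nothing changed. */
fun computeEdit(oldText: String, newText: String): TextEdit? {
    if (oldText == newText) return null
    val shorter = minOf(oldText.length, newText.length)
    var prefix = 0
    while (prefix < shorter && oldText[prefix] == newText[prefix]) prefix++
    var suffix = 0
    // The suffix may not overlap the prefix in either string.
    while (suffix < shorter - prefix &&
        oldText[oldText.length - 1 - suffix] == newText[newText.length - 1 - suffix]
    ) suffix++
    return TextEdit(prefix, oldText.length - prefix - suffix, newText.length - prefix - suffix)
}

/** Maps one span boundary through [edit]; [isEnd] keeps end boundaries left of insertions. */
fun mapOffset(offset: Int, edit: TextEdit, isEnd: Boolean): Int {
    val deletedEnd = edit.start + edit.deletedLength
    return when {
        offset < edit.start -> offset
        offset == edit.start && isEnd -> offset
        offset >= deletedEnd -> offset + edit.insertedLength - edit.deletedLength
        else -> edit.start // inside the deleted range: collapse onto the edit point
    }
}

/** Shifts, grows, or shrinks spans for a user edit and drops spans that became empty. */
fun adjustSpans(spans: List<StyledSpan>, edit: TextEdit): List<StyledSpan> =
    spans.map {
        it.copy(start = mapOffset(it.start, edit, isEnd = false), end = mapOffset(it.end, edit, isEnd = true))
    }.filter { it.start < it.end }
```

For repeated characters the prefix/suffix diff is ambiguous (inserting a into aa is reported at the end), which only matters when the edit touches a span boundary.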
The ZWSP tax and visible coordinates
To detect backspace at visible position 0, every runtime TextFieldState begins with a leading zero-width space sentinel. If an edit removes that first character, BackspaceAwareTextField treats it as “backspace at start,” invokes the structural callback, and reinserts the prefix.
val sentinelGuard = remember {
    InputTransformation {
        if (!asCharSequence().startsWith(ZWSP)) {
            onBackspaceAtStart()
            insert(0, ZWSP)
        }
    }
}
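Stripped of Compose, the guard's decision is a check on the proposed buffer. A plain-string sketch, assuming a hypothetical `guardSentinel` that receives the buffer after an edit and reports whether the sentinel was removed. Like the transformation above, it treats any edit that removes index 0 as backspace at start, including deleting a selection that begins there; a production guard may need to inspect the selection as well.

```kotlin
// The zero-width space used as the buffer sentinel.
val ZWSP = "\u200B"

/** Buffer after the guard ran, plus whether the edit counted as backspace at start. */
data class GuardResult(val buffer: String, val backspaceAtStart: Boolean)

/** If the edit removed the leading sentinel, signal a structural backspace and restore it. */
fun guardSentinel(proposed: String): GuardResult =
    if (proposed.startsWith(ZWSP)) {
        GuardResult(proposed, backspaceAtStart = false)
    } else {
        GuardResult(ZWSP + proposed, backspaceAtStart = true)
    }
```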
The cost of that trick is two coordinate spaces. Buffer coordinates include the sentinel at index 0; visible coordinates do not. Every buffer index is therefore offset by +1 relative to what the user sees.
Raw buffer:    [ZWSP][H][e][l][l][o]
Buffer index:     0   1  2  3  4  5
                      │
                      └── +1 shift at the boundary
                      │
Visible text:        [H][e][l][l][o]
Visible index:        0  1  2  3  4
Helpers make that boundary explicit: visibleText() strips the prefix, visibleCursorPosition() subtracts one, and visibleSelection() subtracts one from both bounds.
public fun TextFieldState.visibleSelection(): TextRange = TextRange(
    start = (selection.start - 1).coerceAtLeast(0),
    end = (selection.end - 1).coerceAtLeast(0),
)
BlockTextStates carries the same rule into getSelection() and setSelection(), so history replay, formatting, and slash-command logic never see the sentinel.
This is the ZWSP tax. Any feature that touches TextFieldState directly has to know which coordinate space it is in, and mistakes here are off-by-one bugs with visible consequences: the caret lands one position off, or a span highlights the wrong range. The containment strategy is to pay that price only at the runtime boundary. Snapshot state stores plain visible text and visible-coordinate spans; render-time code like SpanMapper shifts ranges by +1 only when it talks to the buffer. The sentinel is a runtime-buffer concern and never reaches snapshot state.
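Keeping the tax at the boundary is easier when every conversion lives in one place. A plain-Kotlin sketch over raw strings (no Compose types), assuming hypothetical helpers: `toBufferOffset` adds the sentinel offset, `toVisibleOffset` removes it, and spans cross only in the buffer direction, the way SpanMapper is described as doing at render time.

```kotlin
// The zero-width space used as the buffer sentinel.
val ZWSP = "\u200B"

// Hypothetical types for this sketch; end exclusive.
enum class Style { BOLD, ITALIC }

data class StyledSpan(val start: Int, val end: Int, val style: Style)

/** Visible text as stored in snapshot state; the sentinel is stripped. */
fun visibleText(buffer: String): String = buffer.removePrefix(ZWSP)

/** Buffer index 0 is the sentinel, so visible offsets shift right by one. */
fun toBufferOffset(visible: Int): Int = visible + 1

/** Inverse of [toBufferOffset]; a caret parked before the sentinel clamps to 0. */
fun toVisibleOffset(buffer: Int): Int = (buffer - 1).coerceAtLeast(0)

/** Render-time mapping of snapshot spans into buffer coordinates. */
fun toBufferSpans(spans: List<StyledSpan>): List<StyledSpan> =
    spans.map { it.copy(start = toBufferOffset(it.start), end = toBufferOffset(it.end)) }
```

Routing every crossing through these few functions turns a scattered off-by-one risk into a single place to test.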
Use this only when the problem earns it
Use this architecture only when the editor actually has these constraints: blocks split, merge, reorder, and serialize independently, while live text, selection, and formatting still need to behave like native editing. If the problem is simpler, the solution should be simpler too. A single-buffer editor is cheaper to build, easier to reason about, and easier to keep correct.
“Single source of truth” is too blunt for advanced editing. It becomes useful again only when scoped to a domain: structure in snapshot state, live text in the runtime buffer, formatting intent in runtime span state. Stability comes from explicit ownership and explicit handoff rules, not from forcing every kind of state through one model.
If you build something like this, the rules are short:
- generate and reuse one newBlockId for the whole split path
- register every programmatic text mutation as a pending commit
- clear or recompute pending styles deliberately on split and merge
- use UpdateBlockContent when text and spans must stay aligned
That is the price of correctness when the problem stops being ordinary text input.
