Why Smarter Models Make Worse Keyboards

Neural models get the hype. But that lightning-fast word suggestion as you type? That’s not a neural model. It’s a lookup table. And it’s beating transformers on the metric that matters.


I’ve spent a lot of time thinking about prediction under pressure.

When I was working on Indian-language typing for early Kindle devices, we had a problem that no amount of ML enthusiasm could solve: the hardware simply didn’t care about our ambitions. We had roughly 512 MB of RAM, zero connectivity, and a hard wall of 10 ms per keystroke. Miss that deadline, and the keyboard felt broken — even if it was technically correct.

Neural models weren’t an option. So we built something that worked: a character-level n-gram system that turned prediction into pure lookup. No inference. No floating point math at runtime. Just precomputed transitions and a ranking table.

का -> [म, र, ह]

It was fast. It was deterministic. And it taught me something I still use today: constraints aren’t obstacles to good architecture. They are the architecture.

That system was a lookup table. Not a model. Not an embedding. A lookup table — and it was faster and more reliable than anything neural we could have run on that hardware. I’ve since watched teams reach for heavier models to improve keyboard quality and make the product measurably worse. Smarter model, worse keyboard. It happens more often than anyone admits.


The Problem With “Just Use a Transformer”

Here’s what nobody tells you when they pitch LLM-powered keyboards: transformers are slow in ways that actually matter for typing.

Not slow in the abstract. Slow in the you-typed-a-character-and-nothing-happened-for-150ms sense. That kind of slow destroys the feel of a keyboard, and users notice it even if they can’t articulate why.

The numbers tell the story clearly:

| Layer | Latency | Works Offline? | Memory |
|---|---|---|---|
| N-gram (Tier 1) | < 10 ms | Yes | A few MB per language |
| On-device neural (Tier 2) | 30–150 ms | Yes | 20–80M parameters |
| Cloud LLM (Tier 3) | ~500 ms | No | External API |


A cloud model firing on every keystroke would crater your battery, burn your data plan, and still feel sluggish compared to a well-built lookup table. This isn’t a temporary limitation waiting for better hardware. It’s physics.


What We Actually Built (Then and Now)

The Kindle keyboard system had three core ideas that I’ve watched survive into modern production systems largely intact.

Prediction as lookup, not computation. We precomputed every valid character transition at build time. At runtime, the system fetched — not calculated — the next candidates. Latency became a property of memory bandwidth, not model complexity.
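To make the build-time versus runtime split concrete, here is a minimal sketch of the idea. It is not the original Kindle code: the function names, the two-character context, and the toy corpus are all assumptions for illustration, and it works on Unicode code points rather than full graphemes.

```python
from collections import Counter, defaultdict


def build_transition_table(corpus, context_len=2, top_k=3):
    """Build-time step: count which character follows each context and
    keep only the top_k followers, already sorted by frequency."""
    counts = defaultdict(Counter)
    for word in corpus:
        for i in range(context_len, len(word)):
            counts[word[i - context_len:i]][word[i]] += 1
    return {
        ctx: [ch for ch, _ in followers.most_common(top_k)]
        for ctx, followers in counts.items()
    }


def predict_next(table, typed, context_len=2):
    """Runtime step: one dictionary fetch, no arithmetic."""
    return table.get(typed[-context_len:], [])


# Toy corpus, invented so the counts are easy to follow.
corpus = ["काम", "काम", "काम", "कार", "कार", "काह"]
table = build_transition_table(corpus)
print(predict_next(table, "का"))  # ['म', 'र', 'ह']
```

All the counting and sorting happens in `build_transition_table`. By the time a key is pressed, the only work left is the `table.get` call.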

Rank, don’t score. We threw away floating-point probabilities and replaced them with quantized rank orderings. Users don’t need to know that “म” has a 34.7% transition probability from “का”. They need to know it should appear first. Preserving ordering let us compress aggressively without any perceptible quality loss.
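A small sketch of what "rank, don't score" can look like in practice. The helper name and the probability values are hypothetical (only the 34.7% figure comes from the example above), and a real system would likely pack the result into a compact binary format rather than a Python dict.

```python
def to_rank_table(prob_table, top_k=3):
    """Keep only the order of the top_k followers; drop the probabilities."""
    rank_table = {}
    for context, probs in prob_table.items():
        # Sort by probability, highest first; break ties by character.
        ordered = sorted(probs, key=lambda ch: (-probs[ch], ch))
        rank_table[context] = ordered[:top_k]
    return rank_table


# 0.347 is the figure from the text; the other values are made up.
probs = {"का": {"र": 0.281, "म": 0.347, "न": 0.050, "ह": 0.120}}
print(to_rank_table(probs))  # {'का': ['म', 'र', 'ह']}
```

The rank is implicit in list position, so each stored candidate costs one character and no float.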

Local beats global, every time. A small on-device prefix-frequency map that learned from your typing consistently outperformed expanding the global model. If you write a lot about astrophysics, your keyboard should know that. No cloud call required.
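One way to picture the personal prefix-frequency map is the sketch below. The class name, the merge policy (personal completions first, then global ones), and the sample words are assumptions for illustration, not a description of any shipped keyboard.

```python
from collections import Counter, defaultdict


class PersonalFrequencyMap:
    """On-device map from a typed prefix to words the user has completed."""

    def __init__(self):
        self._by_prefix = defaultdict(Counter)

    def learn(self, word):
        # Record the committed word under each of its proper prefixes.
        for end in range(1, len(word)):
            self._by_prefix[word[:end]][word] += 1

    def suggest(self, prefix, global_candidates, top_k=3):
        # Personal completions come first, most frequent first,
        # followed by global candidates the user has not already covered.
        seen = self._by_prefix.get(prefix, Counter())
        personal = [word for word, _ in seen.most_common()]
        merged = personal + [w for w in global_candidates if w not in personal]
        return merged[:top_k]


user = PersonalFrequencyMap()
for word in ["astrophysics", "astrophysics", "astronomy"]:
    user.learn(word)

print(user.suggest("astr", ["astral", "astronaut", "astronomy"]))
# ['astrophysics', 'astronomy', 'astral']
```

Nothing here leaves the device: the map is a plain in-memory structure that grows with what the user actually types.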

These weren’t workarounds. They were the right abstractions for the problem.


Modern Keyboards Are Cache Hierarchies in Disguise

If you’ve worked on systems with L1/L2/L3 caches, the architecture of a modern AI keyboard will feel immediately familiar.

Tier 1 (L1) — N-grams: Fast, local, deterministic. Generates candidates on every keystroke. Never misses. Never waits.

Tier 2 (L2) — On-device neural: A small quantized transformer (often 30–80M parameters, compressed for mobile NPUs) does grammar-aware re-ranking and short-context refinement. Slower, but offline and private.

Tier 3 (L3) — Cloud: A full LLM for long-form generation, complex intent, and things that genuinely need deep context. High latency, high power — used sparingly.

The layers don’t compete. They compose. Tier 1 runs on every keystroke. Tier 2 runs when context helps. Tier 3 runs when you explicitly ask for something complex — drafting an email, finishing a thought.
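The composition of Tier 1 and Tier 2 can be sketched as a deadline: the lookup always answers, and the re-ranker's output is used only if it arrives in time. Everything below is an illustrative assumption, including the function names, the thread-pool approach, and the toy tiers. A real keyboard would more likely show the Tier 1 list immediately and update it asynchronously.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=2)  # background workers for Tier 2


def suggest(context, tier1_lookup, tier2_rerank, tier2_budget_s):
    """Tier 1 always answers; Tier 2 only counts if it beats the budget."""
    candidates = tier1_lookup(context)
    if len(candidates) < 2:
        return candidates  # nothing to re-rank
    future = _pool.submit(tier2_rerank, context, candidates)
    try:
        return future.result(timeout=tier2_budget_s)
    except FutureTimeout:
        return candidates  # re-ranker was too slow: keep the Tier 1 order


# Stand-ins for the real tiers, invented for this example.
def toy_lookup(context):
    return ["the", "then", "there"]


def fast_rerank(context, candidates):
    return sorted(candidates, key=len, reverse=True)


def slow_rerank(context, candidates):
    time.sleep(0.5)  # simulates a model that blows the budget
    return sorted(candidates, key=len, reverse=True)


print(suggest("go th", toy_lookup, fast_rerank, tier2_budget_s=1.0))
# ['there', 'then', 'the']
print(suggest("go th", toy_lookup, slow_rerank, tier2_budget_s=0.05))
# ['the', 'then', 'there']
```

The second call shows the point of the hierarchy: when the slower tier misses its window, the user still gets the Tier 1 answer instead of a stalled keyboard.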


The Part That Actually Matters: The Decision Layer

This is where “smarter” actively backfires. A more capable model that crosses the latency threshold doesn’t produce a better keyboard; it produces a broken one. The quality improvement is invisible because the interaction feel is gone. Users don’t think “the suggestions are slightly less accurate today.” They think “this keyboard is broken.” Capability without speed registers as failure.

A simplified version of the routing logic looks like this:

```python
def route_to_tier(input_context, connectivity, latency_budget):
    """Pick a prediction tier. latency_budget is in milliseconds."""
    words = input_context.split()

    if len(words) <= 1:
        return "tier1_ngram"  # Empty or single word: pure lookup

    # Complex intent goes to the cloud, but only if a connection exists
    if connectivity != "offline" and (has_imperative(input_context) or len(words) > 10):
        return "tier3_cloud"

    if connectivity == "offline" and latency_budget < 50:
        return "tier2_neural"  # Offline, tight budget: local neural only

    return "tier1_tier2"  # Default: fast + contextual


def has_imperative(text):
    """Return True if the first word is a command-style verb."""
    imperatives = {"write", "draft", "create", "compose", "reply"}
    words = text.lower().split()
    # Compare whole words so that "writer" does not match "write"
    return bool(words) and words[0] in imperatives
```

In production, this gets more sophisticated — lightweight classifiers, latency budgets, real-time connectivity signals. But the core logic holds: the system’s intelligence lives in orchestration, not in any single model.

Picking the wrong tier costs you either quality or responsiveness. Getting the routing right is what separates a keyboard that feels magical from one that feels like a demo.


Why Indic Languages Make This Even Harder

Working with Hindi, Tamil, or Kannada isn’t just a localization problem — it’s a tokenization problem that exposes the limits of LLM-centric approaches.

Standard subword tokenizers (BPE, SentencePiece) are typically trained on English-heavy corpora. The result: common English words map to roughly one token each, while Indic words fragment badly.

  • international → 1–2 tokens under common BPE vocabularies
  • किताबें (Hindi for “books”) → often splits into कि + ताब + ें or more

More tokens means more inference steps, more latency, and — if you’re on a cloud API — more cost. For a keyboard firing on every keystroke, that math gets ugly fast.
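Part of the reason for the fragmentation is visible before any tokenizer runs. The sketch below only compares code points and UTF-8 bytes; it does not compute real token counts, which depend on the specific vocabulary. The helper name is made up for this example.

```python
def size_profile(word):
    """Report how large a word is in code points and in UTF-8 bytes."""
    return {
        "code_points": len(word),
        "utf8_bytes": len(word.encode("utf-8")),
    }


print(size_profile("international"))
# {'code_points': 13, 'utf8_bytes': 13}
print(size_profile("किताबें"))
# {'code_points': 7, 'utf8_bytes': 21}
```

Each Devanagari code point takes three bytes in UTF-8, so a byte-level vocabulary starts from 21 units for a seven-code-point word and has to earn every merge from training data that is often scarce for Indic scripts.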

N-gram systems sidestep this entirely. They operate at the grapheme level, which is exactly the right granularity for character-by-character typing. They provide stable, language-aware coverage that anchors the prediction pipeline before neural refinement takes over.
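To show what grapheme-level units look like, here is a deliberately simplified segmentation sketch. It attaches combining marks to the preceding base character using the standard library's `unicodedata` module. Full grapheme segmentation follows the Unicode text segmentation rules (UAX #29), which this helper does not implement; in particular, it does not join conjuncts formed with a virama.

```python
import unicodedata


def simple_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplification: any character whose Unicode category starts with 'M'
    (a mark) is glued to the previous cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


print(simple_clusters("किताबें"))  # ['कि', 'ता', 'बें']
```

These three units match what a person typing Hindi would perceive as the characters of the word, which is the granularity a per-keystroke lookup table wants to index on.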

This isn’t a gap that better tokenizers will fully close. The latency profile of the lookup tier is structurally different from any inference-based approach.


Privacy Fell Out of the Architecture, Not From a Policy Doc

One thing I didn’t anticipate when we built offline-first: privacy as a side effect.

When your core prediction runs on-device against a local model, your keystrokes don’t need to go anywhere. The global model provides language structure; your personal frequency map encodes your actual vocabulary. The combination is more useful than either alone — and none of it requires a server call for routine prediction.

In a world where keyboard apps have historically had uncomfortable levels of data access, building personalization that’s genuinely local is a meaningful property.


Will This Architecture Last?

Honest answer: I don’t know, and anyone who tells you otherwise is guessing.

Hardware NPUs are improving fast. On-device models are getting smaller and faster. Tokenizers will get better at non-English scripts. Any of these could shift the tradeoffs.

But I’d bet on the architectural principle outlasting the specific implementation: in latency-critical paths, you want to minimize runtime inference. Whether that means n-grams, a heavily quantized tiny model, or something we haven’t built yet, the shape of the problem stays the same.

The lesson from constrained hardware in 2015 applies just as cleanly to an NPU-equipped phone in 2025: fast wins. And fast is a design choice, not a hardware gift.


The Takeaway

If you’re building any system where a user is waiting on every interaction — typing, drawing, gaming, real-time audio — the hybrid architecture pattern is worth internalizing:

  • Deterministic, precomputed layer first. It’s always faster than inference.
  • Contextual refinement second. Small, local, quantized.
  • Deep inference on demand. Expensive, powerful, used sparingly.
  • Router in the middle. This is where your actual engineering judgment lives.

The model zoo gets the attention. The system design does the work. And the keyboards that feel smartest aren’t running the smartest models — they’re running the right model in the right layer at the right moment.



