Geoembeddings: Why the Geospatial Industry is Moving Beyond Pixel Matching

For decades, geospatial analysis has relied on pixel-based methods. Techniques like template matching, spectral indices, and hand-crafted features work well for narrow, well-defined problems. But as satellite and aerial imagery volumes have grown and applications have diversified, a key limitation has become clear: pixels encode measurements (brightness, wavelength), not meaning. They don’t directly represent concepts like “farmland” or “flood-damaged infrastructure.”
Geoembeddings address this gap by shifting from raw pixel comparison to learned semantic representations. Instead of building separate pipelines for each task, you compute a shared embedding for each image and reuse it across retrieval, classification, clustering, and change detection. Combined with efficient vector search, this enables large-scale analysis that would be impractical with traditional approaches.
This article explains what geoembeddings are, where they offer advantages over classical methods, where they fall short, and how tools like Faiss and Zarr make them usable at scale.
What Are Geoembeddings?
A geoembedding is a dense vector representation of a geographic scene, produced by a neural network. Typically, it’s a vector of 256 to 2,048 floating-point values.
The key property is that semantically similar scenes tend to map to nearby points in this high-dimensional space. For example, two images of similar land cover that are captured at different times, resolutions, or lighting conditions may still produce similar embeddings, while very different scenes map farther apart.
These embeddings are learned representations. Instead of preserving raw pixel values, the model compresses the image while retaining patterns relevant to structure, texture, and semantic content. Ideally, it becomes less sensitive to nuisance variation (lighting, minor rotation, sensor noise), though the degree of invariance depends heavily on training data and model design.
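To make the "nearby points" idea concrete, here is a minimal sketch of comparing two embeddings with cosine similarity; embed() and load_tile() are hypothetical stand-ins for whatever model and I/O you use:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 for identical directions, near 0 for unrelated scenes.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical usage -- embed() is whatever model produces your vectors:
# summer = embed(load_tile("field_june.tif"))
# winter = embed(load_tile("field_january.tif"))
# cosine_similarity(summer, winter)  # ideally high despite the seasonal change
```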
Embedding models can be trained in different ways:
- Supervised vision models (e.g., ResNet variants)
- Self-supervised models (e.g., DINO-style approaches)
- Vision-language models (e.g., CLIP-like systems)
Each training approach shapes what the embedding captures, but most aim for the same outcome: distances in embedding space roughly reflect semantic similarity.
How Traditional Methods Work (and Why They Struggle)
Understanding traditional approaches helps clarify where embeddings help and where they don’t.
Pixel-level matching (e.g., template matching, cross-correlation) directly compares intensities. These methods are simple and fast but fragile: small rotations, lighting changes, or seasonal differences can significantly degrade performance.
Hand-crafted features (e.g., SIFT, ORB, HOG) extract local patterns like edges and corners. They are more robust than raw pixel matching, but in overhead imagery, where scenes are repetitive and viewpoints vary widely, they often struggle to distinguish similar regions (e.g., different agricultural fields).
Spectral indices (e.g., NDVI, NDWI) are effective for specific tasks like vegetation or water detection, but they are narrow in scope and require particular spectral bands. They don’t generalize well to broader semantic tasks like scene retrieval or infrastructure analysis.
Supervised CNN pipelines can perform well when trained and tested on similar data. However, geospatial imagery varies significantly across sensors, regions, and seasons. Models trained in one setting often degrade when applied elsewhere, requiring additional labeled data and retraining.
Across these methods, a common limitation is reliance on low-level signals. They capture measurable properties of pixels, but only indirectly represent higher-level concepts and often in ways that don’t generalize across conditions.
Advantages of Geoembeddings
Geoembeddings shift the focus from raw measurements to learned representations. When trained appropriately, this can enable several practical advantages.
Semantic Invariance
Embedding models can learn partial invariance to changes like lighting, season, and resolution. For example, similar land cover types may cluster together even when captured under different conditions.
This is not guaranteed, and performance depends on the diversity of the training data, but when it holds, it reduces the need for separate pipelines across sensors or time periods.
Generalization
A single embedding can support multiple downstream tasks:
- nearest-neighbor retrieval
- clustering
- lightweight classification (e.g., k-NN)
- change detection (via distance comparisons)
This reduces the need to design and maintain separate models for each use case, though task-specific models may still outperform embeddings in specialized scenarios.
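As a rough illustration, the same embedding matrix can back both retrieval and a lightweight classifier. This sketch uses scikit-learn, with random vectors standing in for real embeddings and hypothetical labels:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

emb = np.random.rand(5000, 384).astype("float32")  # stand-in for real embeddings
labels = np.random.randint(0, 4, size=1000)        # labels for the first 1000 tiles

# Task 1: retrieval -- the 10 nearest neighbors of tile 42.
retriever = NearestNeighbors(n_neighbors=10, metric="cosine").fit(emb)
_, neighbor_ids = retriever.kneighbors(emb[42:43])

# Task 2: lightweight classification with k-NN on the same vectors.
clf = KNeighborsClassifier(n_neighbors=5, metric="cosine").fit(emb[:1000], labels)
pred = clf.predict(emb[1000:1010])
```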
Reduced Labeling Requirements (in Some Cases)
Pretrained embeddings can sometimes be adapted to new tasks with relatively small labeled datasets. For example, a small set of labeled examples can be used for nearest-neighbor classification or similarity search.
However, this depends on how well the embedding space aligns with the target task. In some domains, substantial labeling or fine-tuning is still required.
Scalability
Once embeddings are precomputed, similarity search becomes a vector search problem. Approximate nearest-neighbor (ANN) methods allow millions of vectors to be queried in milliseconds, enabling interactive exploration of large archives.
This is often significantly faster than running a full model inference pipeline at query time.
Multimodal Capability
Vision-language models can map images and text into a shared embedding space. This enables text-based search over imagery (e.g., “flooded urban area”), which is difficult to achieve with traditional feature-based methods.
Performance varies widely depending on training data and domain alignment.
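As a sketch of how this works with an off-the-shelf vision-language model (here a generic CLIP checkpoint from Hugging Face transformers, not a geospatially tuned model, and a hypothetical tile file), scoring one image against a text query might look like:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(text=["flooded urban area"],
                   images=Image.open("tile.png"),  # hypothetical image tile
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# image_embeds and text_embeds share one space; a dot product scores the match.
img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
score = (img @ txt.T).item()
```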
Ensemble Composability
Different models capture different features (e.g., texture vs. structure vs. semantics). In some cases, combining embeddings (e.g., concatenation) can improve performance by leveraging complementary strengths, though it increases storage and compute costs.
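One common recipe, sketched below, is to L2-normalize each model's embeddings before concatenating so that neither model dominates the combined distance:

```python
import numpy as np

def combine(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    # Normalize each model's embeddings so neither dominates the combined
    # distance, then concatenate; dimensionality (and storage) grows accordingly.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return np.concatenate([a, b], axis=1)
```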
Use Cases
These properties translate into several practical workflows:
- Scene retrieval: Find images similar to a query example across large archives.
- Change detection: Compare embeddings over time to identify regions with significant semantic change (a minimal sketch follows below).
- Few-shot classification: Use a small labeled set with nearest-neighbor methods.
- Disaster response: Retrieve historical examples of similar damage patterns for rapid triage.
- Unsupervised mapping: Cluster embeddings to discover land-use patterns.
- Cross-sensor analysis: When trained appropriately, embeddings can help align data from different sensors.
In practice, performance varies by task and data domain, and embeddings are often one component in a larger system.
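The change-detection sketch referenced above: given embeddings of the same tiles at two dates (hypothetical row-aligned arrays e_t0 and e_t1), per-tile cosine distance gives a simple change score:

```python
import numpy as np

def change_scores(e_t0: np.ndarray, e_t1: np.ndarray) -> np.ndarray:
    # e_t0, e_t1: (n_tiles, dim) embeddings of the same tiles at two dates.
    a = e_t0 / np.linalg.norm(e_t0, axis=1, keepdims=True)
    b = e_t1 / np.linalg.norm(e_t1, axis=1, keepdims=True)
    return 1.0 - np.sum(a * b, axis=1)  # per-tile cosine distance

# Tiles with the highest scores are candidates for semantic change:
# flagged = np.argsort(change_scores(e_t0, e_t1))[::-1][:100]
```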
Limitations and Tradeoffs
Geoembeddings are useful, but they introduce new constraints.
Compute Cost Upfront
Generating embeddings for large datasets requires significant compute resources. This is typically a one-time cost, but it can be substantial.
Model Bias
Embeddings reflect their training data. Performance can degrade sharply on out-of-distribution inputs (e.g., new sensors, resolutions, or geographies).
Interpretability
Embedding vectors are not directly interpretable. When results are unexpected, debugging can be difficult compared to more transparent methods.
Static Representation
Embeddings must be recomputed when imagery changes. This can be costly for frequently updated datasets.
Storage Overhead
Large embedding collections require nontrivial storage (e.g., tens of GB for tens of millions of samples), along with associated metadata.
Dimensionality Sensitivity
Similarity depends on the choice of metric (e.g., cosine vs. Euclidean), normalization, and model training. Poor choices can degrade results.
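A common safeguard, sketched below, is to L2-normalize embeddings up front; after normalization, inner product equals cosine similarity, and Euclidean distance ranks neighbors the same way cosine does:

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    # Row-wise L2 normalization: afterwards, inner product == cosine similarity,
    # and L2 distance is a monotone function of cosine distance.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)
```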
Faiss: Scalable Similarity Search
Once you have embeddings, the next problem is clear: how do you search through millions of them fast?
A brute-force approach doesn't scale. To find the 10 nearest neighbors to a query embedding among 10 million 384-dimensional vectors, you'd compute distances from the query to all 10 million vectors (10M × 384 multiply-accumulate operations), sort them, and extract the top 10. That's roughly 3.8 billion operations per query. On a CPU, that takes seconds; on a GPU with a good implementation, well under a second. But serve a thousand concurrent users and even GPU throughput becomes impractical.
This is where Faiss comes in. Faiss is Meta Research’s library for efficient approximate nearest-neighbor search. It trades a small amount of recall (you might miss the true nearest neighbor) for massive speedups, making interactive search practical.
How Faiss Works (Conceptually)
Faiss provides multiple index types. Three dominate in practice:
Flat Index (IndexFlatL2, IndexFlatIP)
This is the baseline: it stores all vectors exactly and uses brute-force distance computation. It has 100% recall (you always find the true nearest neighbors), but it is only practical for tens of thousands to a few hundred thousand vectors. Compute time grows linearly with the size of the index.
Use Flat for: evaluation, small archives, and validating whether a fancier index is misconfigured.
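A minimal Flat-index sketch, with random vectors standing in for real embeddings:

```python
import numpy as np
import faiss

d = 384
emb = np.random.rand(100_000, d).astype("float32")  # stand-in embeddings

index = faiss.IndexFlatL2(d)  # exact, brute-force L2 search
index.add(emb)

distances, ids = index.search(emb[:1], 10)  # query with any (n, d) float32 array
```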
IVF (Inverted File Index)
IVF partitions the embedding space into regions (Voronoi cells) and stores vectors clustered into their region. At query time, you only search the nearest clusters (controlled by a hyperparameter nprobe), not the entire index.
Suppose you partition 10 million vectors into 1,000 clusters. With nprobe=1, a query searches only the single nearest cluster, roughly 10,000 vectors instead of 10 million, a ~1,000x reduction in work. You'll miss some true nearest neighbors (lower recall), but the speedup is usually worth it.
Recall is tunable: increase nprobe, and you search more clusters, improving recall at the cost of speed.
Use IVF for: medium archives (1M-100M vectors), when you're willing to trade recall for speed, and when you have memory constraints.
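A minimal IVF sketch; nlist and nprobe are the knobs to tune:

```python
import numpy as np
import faiss

d, nlist = 384, 1000
emb = np.random.rand(1_000_000, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)                 # assigns vectors to Voronoi cells
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(emb)                                 # k-means learns the cell centroids
index.add(emb)

index.nprobe = 8                                 # how many clusters each query scans
distances, ids = index.search(emb[:1], 10)
```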
HNSW (Hierarchical Navigable Small World)
HNSW builds a graph-based index at construction time. The graph is a multi-layer proximity structure, like a simplified skip list. At query time, you navigate the graph by “hopping” to progressively closer neighbors, reaching the nearest neighbors in logarithmic time relative to the size of the index.
This is more sophisticated than IVF. Construction is slow, but queries are very fast, and recall is often better than IVF at comparable query speeds.
Use HNSW for: large archives (10M-100M+ vectors), when recall matters as much as speed, and when query speed is the bottleneck.
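A minimal HNSW sketch; efConstruction and efSearch are the main build-time and query-time knobs:

```python
import numpy as np
import faiss

d, M = 384, 32  # M: graph neighbors per node
emb = np.random.rand(1_000_000, d).astype("float32")

index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efConstruction = 200  # build-time effort: slower build, better graph
index.add(emb)                   # no training step, but construction is slow

index.hnsw.efSearch = 64         # query-time speed/recall knob
distances, ids = index.search(emb[:1], 10)
```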
Choosing an Index
For geoembeddings in production:
- <1 million vectors: Flat in memory, or IVF if memory is tight
- 1–100 million vectors: IVF sharded by region or time, or HNSW if you can afford the memory
- >100 million vectors: Distributed Faiss (using Ray) or a specialized vector database (Qdrant, Weaviate, Pinecone)
Zarr: Cloud-Native Storage for Embeddings
You have 10 million embeddings, each of 384 floats. That's about 15 GB of raw vector data. You could dump it to a NumPy .npy file, but then you can't partially read it: loading embeddings.npy pulls the entire 15 GB into RAM, which isn't feasible on most machines.
You could use HDF5, a hierarchical data format that supports chunking and partial reads. But HDF5 has a single-writer lock (only one process can write at a time), doesn’t play well with cloud object storage (S3, GCS, Azure Blob), and is moving into maintenance mode.
Enter Zarr, a modern chunked array format designed from the ground up for cloud-native workflows.
What Zarr Is
Zarr is an N-dimensional array storage format with pluggable compression and chunking. Unlike HDF5 (a single binary file), Zarr stores chunks as separate objects in a directory structure, which makes it compatible with cloud object storage: you can store a Zarr array directly on S3 and read arbitrary chunks without downloading the entire array.
Zarr supports:
- Arbitrary chunk shapes: store embeddings in chunks of shape (10000, 384), so each chunk is ~15 MB, portable and easy to parallelize (a write/read sketch follows this list)
- Multiple compressors: Blosc, Zstd, and others, balancing compression ratio and speed
- Lazy loading: read only the chunks you need
- Parallel writes: multiple processes can append chunks without locking
- Metadata as JSON: chunk layout, compression settings, and custom metadata are stored as .json files, human-readable and inspectable
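The write/read sketch referenced above, assuming the zarr v2 API (v3 renames some of these arguments) and a hypothetical model name in the metadata:

```python
import numpy as np
import zarr
from numcodecs import Blosc

emb = np.random.rand(100_000, 384).astype("float32")  # stand-in embeddings

# Write: one chunk per 10,000 rows, Zstd compression via Blosc.
z = zarr.open("embeddings.zarr", mode="w", shape=emb.shape,
              chunks=(10_000, 384), dtype="float32",
              compressor=Blosc(cname="zstd", clevel=3))
z[:] = emb
z.attrs["model"] = "hypothetical-encoder-v1"  # custom metadata lands in .zattrs

# Lazy read: only the chunks covering these rows are fetched and decompressed.
subset = zarr.open("embeddings.zarr", mode="r")[20_000:30_000]
```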
Why Not Alternatives?
NumPy .npy / .npz: Single monolithic file. You can’t partially load it. For 15 GB, that’s a non-starter.
HDF5: Single-writer locking makes concurrent writes impossible. No native S3/GCS support (you can hack it with community plugins, but it’s fragile). Not designed for object storage paradigms.
Parquet: Columnar format, excellent for mixed-type tabular data with metadata, but awkward for dense float arrays. Adds unnecessary overhead.
Zarr: Designed for this exact use case. Chunks map cleanly to object-storage objects. Metadata is JSON. Lazy reads work seamlessly. Concurrent writes scale.
Schema for Geoembeddings
A typical Zarr schema for geoembeddings:
embeddings.zarr/
  .zarray          # metadata: shape, chunks, dtype, compressor
  .zattrs          # custom metadata: units, model name, creation date
  0/0              # chunk [0:10000, 0:384]
  1/0              # chunk [10000:20000, 0:384]
  2/0              # chunk [20000:30000, 0:384]
  …
metadata.jsonl     # one line per tile: spatial bounds, CRS, timestamp, sensor
Each chunk is independently compressed, lazily loadable, and can be written in parallel. Metadata is stored alongside — critical for production geospatial workflows where you need to filter by spatial region before searching embeddings.
Industry Storage Patterns
In practice, how do geospatial organizations store and serve embeddings at scale?
Small Scale (<1 Million Vectors)
Store Zarr on a local SSD (or a small NAS). Compute the Faiss Flat index and load it into memory. This is feasible: 1M × 384 float32 = 1.5 GB, fits on any laptop. Query times are sub-second. No advanced ops needed.
Medium Scale (1–100 Million Vectors)
Store Zarr on object storage (S3, GCS, Azure Blob). Build an IVF index sharded by region or time, e.g., separate indexes for each state or quarter. This keeps individual indexes small enough to fit in memory (100M vectors ÷ 50 shards = 2M vectors per shard = 3 GB per index). Users issue spatial queries: “search my region” hits only one shard’s index, avoiding global searches.
Alternatively, build a single HNSW index if memory allows.
Large Scale (100+ Million Vectors)
Distribute the Faiss index across a cluster using a framework like Ray. Or use a dedicated vector database: Qdrant, Weaviate, or Pinecone (cloud-hosted, managed). These systems handle distribution, replication, and failover transparently.
The Critical Pattern: Spatial Pre-Filter + ANN Search
No matter the scale, production geospatial systems follow this pattern:
- Spatial pre-filter: User requests embeddings matching a geographic bounding box (bbox). Use a spatial index (PostGIS, STRtree, R-tree) to retrieve only tiles within that bbox.
- ANN search: Run Faiss search only on the filtered set, not the entire index.
This is essential: searching all 100 million global embeddings for “urban areas similar to this query tile” is slow and wasteful. Searching only the 50,000 urban tiles in a specific country is orders of magnitude faster.
Without spatial metadata coupled to embeddings, this pattern is impossible. This is why metadata is non-negotiable: bbox, CRS (coordinate reference system), acquisition timestamp, sensor type. Store it as a sidecar JSONL file, a separate Zarr array, or embedded in an Xarray dataset.
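A simplified version of the pattern, using a plain NumPy bounding-box mask as the spatial pre-filter (a real system would use PostGIS or an R-tree) and an exact Faiss search over the filtered subset:

```python
import numpy as np
import faiss

def search_in_bbox(emb, bboxes, query_vec, bbox, k=10):
    # emb: (n, d) float32 embeddings; bboxes: (n, 4) per-tile
    # (min_x, min_y, max_x, max_y) in a shared CRS, row-aligned with emb.
    min_x, min_y, max_x, max_y = bbox
    mask = ((bboxes[:, 0] >= min_x) & (bboxes[:, 1] >= min_y) &
            (bboxes[:, 2] <= max_x) & (bboxes[:, 3] <= max_y))
    candidate_ids = np.flatnonzero(mask)

    # Exact search over the much smaller filtered set; switch to IVF/HNSW
    # only if the filtered sets stay large.
    sub = faiss.IndexFlatL2(emb.shape[1])
    sub.add(emb[candidate_ids])
    distances, local_ids = sub.search(query_vec.reshape(1, -1).astype("float32"), k)
    return candidate_ids[local_ids[0]], distances[0]
```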
Geoembeddings themselves are straightforward: dense vectors output by a neural network. The magic is in the ecosystem that lets you deploy them at scale: Faiss for indexing, Zarr for storage, and spatial metadata for filtering.
Together, these enable a different kind of geospatial workflow. Instead of “I have a task; which algorithm should I use?”, it becomes “I have embeddings; what questions can I ask?” The representation is decoupled from the application. The same embeddings serve retrieval, classification, clustering, and change detection. The same storage and index layer scales from a laptop to a cloud platform.
This shift is already happening. Foundation models pretrained on global satellite archives (Prithvi, SatMAE, GeoFormer) are emerging. Archives of pre-embedded satellite imagery are becoming public goods. The move beyond pixel matching isn't speculative; it's the industry's trajectory.
If you want to explore it locally and play around with embedding-based search, try my project GeoEmbed: https://github.com/amrithc/GeoEmbed