Overview

Geohashing is a spatial indexing technique widely used in location-based systems such as Uber, Lyft, Yelp, and Google Maps. It addresses the challenge of efficiently querying two-dimensional geospatial data (latitude and longitude) by transforming it into a one-dimensional string representation that can be indexed efficiently by traditional databases.

The key insight behind Geohashing is the preservation of spatial locality: geographically close points are likely to share similar string prefixes. This property enables efficient proximity searches, range queries, and scalable data partitioning.


Core Concept: The World as a Hierarchical Grid

At its foundation, a Geohash represents the Earth as a hierarchical grid.

  • The entire world map is treated as a single rectangle:
    • Latitude range: [-90, 90]
    • Longitude range: [-180, 180]
  • This rectangle is subdivided into 32 equal-sized sub-rectangles (an 8 × 4 grid: 8 longitude divisions by 4 latitude divisions).
  • Each sub-rectangle is assigned a unique character from a Base32 alphabet (0–9, b–z, excluding a, i, l, o).
  • Any sub-rectangle can be recursively subdivided further to achieve higher spatial precision.

A Geohash string such as tdr1v therefore represents a rectangular geographic region, not a single point.


Spatial Locality and Prefix Matching

The most important property of Geohashing is spatial locality.

  • Prefix similarity implies proximity: Locations that share longer Geohash prefixes are geographically closer.
  • Length determines precision:
    • Short Geohashes (e.g., td) represent very large regions.
    • Longer Geohashes (e.g., tdr1v) represent increasingly smaller regions, such as neighborhoods or city blocks.

This allows proximity queries to be converted into efficient prefix-based string searches, such as finding all drivers whose Geohash begins with a given prefix.
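
As a minimal illustration of prefix search, assuming a simple in-memory index (the driver IDs and Geohashes below are hypothetical), a sorted list of (Geohash, driver) pairs answers prefix queries with binary search:

  import bisect

  # Sorted (geohash, driver_id) pairs; the sample data is hypothetical.
  index = sorted([
      ("tdr1vj", "driver:17"),
      ("tdr1vn", "driver:42"),
      ("tdr4k2", "driver:99"),
  ])

  def drivers_with_prefix(prefix):
      # Every Geohash starting with `prefix` sorts inside the half-open
      # range [prefix, prefix + "~"), because "~" orders after all
      # Base32 characters.
      lo = bisect.bisect_left(index, (prefix,))
      hi = bisect.bisect_left(index, (prefix + "~",))
      return [driver for _, driver in index[lo:hi]]

  print(drivers_with_prefix("tdr1v"))  # ['driver:17', 'driver:42']

Production systems achieve the same effect with a database index on the Geohash column (e.g., a B-tree range scan) rather than an in-process list.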


The Geohash Encoding Algorithm

Objective: Convert a two-dimensional coordinate pair (latitude, longitude) into a compact alphanumeric string.

1. Binary Space Partitioning (Quantization)

Input

  • Latitude ∈ [-90, 90]
  • Longitude ∈ [-180, 180]

Process

  • Each coordinate range is recursively halved.
  • For each split:
    • If the coordinate lies in the upper or right half, append 1.
    • If it lies in the lower or left half, append 0.

Output

  • Two independent bit streams: one for latitude and one for longitude.
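
A minimal sketch of this quantization step (the function name and bit counts are illustrative; 13 longitude bits plus 12 latitude bits yield the 25 bits needed for a 5-character Geohash):

  def coordinate_to_bits(value, lo, hi, num_bits):
      # Repeatedly halve [lo, hi]; emit 1 when `value` falls in the
      # upper (or right) half, 0 when it falls in the lower (or left).
      bits = []
      for _ in range(num_bits):
          mid = (lo + hi) / 2
          if value >= mid:
              bits.append(1)
              lo = mid
          else:
              bits.append(0)
              hi = mid
      return bits

  lat_bits = coordinate_to_bits(37.7749, -90.0, 90.0, 12)      # latitude stream
  lon_bits = coordinate_to_bits(-122.4194, -180.0, 180.0, 13)  # longitude stream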

2. Bit Interleaving (Z-Order Curve)

To map the two-dimensional coordinate space into a one-dimensional representation:

  • Even-indexed bits (0, 2, 4, …) are taken from the longitude bit stream.
  • Odd-indexed bits (1, 3, 5, …) are taken from the latitude bit stream.

This interleaving traces a Z-order (Morton) curve through the coordinate plane, producing a single unified bit stream that preserves spatial locality.
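
A sketch of the interleaving, assuming the two bit streams from the previous step (for a 5-character Geohash the longitude stream carries 13 bits and the latitude stream 12, so one trailing longitude bit remains unpaired):

  def interleave(lon_bits, lat_bits):
      # Alternate bits starting with longitude: even indices carry
      # longitude, odd indices carry latitude.
      merged = []
      for lon_bit, lat_bit in zip(lon_bits, lat_bits):
          merged.extend((lon_bit, lat_bit))
      if len(lon_bits) > len(lat_bits):
          merged.append(lon_bits[-1])  # trailing longitude bit, if any
      return merged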


3. Base32 Encoding

  • The unified bit stream is divided into chunks of 5 bits.
  • Each chunk is interpreted as an integer (0–31).
  • Each value is mapped to a character in the Geohash Base32 alphabet.

Result

  • A compact Geohash string (e.g., 9q8yy).
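
The three steps combine into a complete encoder. The sketch below folds quantization and interleaving into a single loop, and reproduces the example above for San Francisco (37.7749, -122.4194):

  BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet

  def geohash_encode(lat, lon, precision=5):
      lat_lo, lat_hi = -90.0, 90.0
      lon_lo, lon_hi = -180.0, 180.0
      chars, bits, bit_count, use_lon = [], 0, 0, True
      while len(chars) < precision:
          if use_lon:                      # even-indexed bit: longitude
              mid = (lon_lo + lon_hi) / 2
              bit = int(lon >= mid)
              lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
          else:                            # odd-indexed bit: latitude
              mid = (lat_lo + lat_hi) / 2
              bit = int(lat >= mid)
              lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
          bits = (bits << 1) | bit
          bit_count += 1
          use_lon = not use_lon
          if bit_count == 5:               # one Base32 character per 5 bits
              chars.append(BASE32[bits])
              bits, bit_count = 0, 0
      return "".join(chars)

  print(geohash_encode(37.7749, -122.4194))  # '9q8yy'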

The Boundary Problem

Although Geohashing preserves spatial locality, the converse does not hold: geographic proximity does not guarantee prefix similarity.

Two geographically close points may fall in adjacent Geohash cells and therefore share only a short prefix, or none at all. This commonly occurs near cell boundaries, such as on opposite sides of a street.


To avoid missing nearby entities due to boundary effects, proximity queries include:

  • The current Geohash cell.
  • Its 8 immediate neighboring cells.

By querying all nine cells (center + neighbors) in parallel, systems ensure complete coverage of nearby locations.
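
A sketch of the neighbor computation, reusing BASE32 and geohash_encode from the encoding sketch above. It decodes the center cell, then re-encodes points offset by one cell width or height in each direction (wrap-around at the poles and the antimeridian is ignored for brevity):

  def geohash_decode(gh):
      # Invert the encoding: walk the bits to recover the cell bounds,
      # returning (center_lat, center_lon, lat_err, lon_err).
      lat_lo, lat_hi = -90.0, 90.0
      lon_lo, lon_hi = -180.0, 180.0
      use_lon = True
      for ch in gh:
          for shift in range(4, -1, -1):
              bit = (BASE32.index(ch) >> shift) & 1
              if use_lon:
                  mid = (lon_lo + lon_hi) / 2
                  lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
              else:
                  mid = (lat_lo + lat_hi) / 2
                  lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
              use_lon = not use_lon
      return ((lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2,
              (lat_hi - lat_lo) / 2, (lon_hi - lon_lo) / 2)

  def nine_cells(gh):
      # Center cell plus its 8 neighbors: offset the center point by
      # one full cell width/height in each direction and re-encode.
      lat, lon, lat_err, lon_err = geohash_decode(gh)
      return {geohash_encode(lat + dy * 2 * lat_err,
                             lon + dx * 2 * lon_err, len(gh))
              for dy in (-1, 0, 1) for dx in (-1, 0, 1)}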


Scale and Storage Characteristics

Large-scale location-based systems exhibit the following characteristics:

  • High write throughput: Drivers report location updates every 3–5 seconds.
  • High read throughput: Riders frequently query nearby drivers.
  • Ephemeral data: Only the most recent location is required for real-time matching.

As a result, systems separate concerns into:

  • Live Location Data (real-time matching)
  • Trip History Data (billing, analytics, auditing)

Redis as the Live Location Store

Redis is commonly used for live location storage due to its low latency and in-memory performance.

Redis Internals

  • GEOADD key longitude latitude member
    • Internally converts the coordinates into a 52-bit Geohash-style integer.
    • Stores that integer as the member’s score in a Sorted Set (ZSET).
  • GEORADIUS (superseded by GEOSEARCH in Redis 6.2, with the same underlying mechanism)
    • Performs efficient range queries over the sorted set to locate nearby points.

This makes Redis an ideal choice for real-time geospatial indexing.
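
A brief sketch using the redis-py client (the key and member names are hypothetical):

  import redis

  r = redis.Redis()

  # GEOADD takes longitude first, then latitude; redis-py >= 4 accepts
  # a flat (lon, lat, member) sequence.
  r.geoadd("drivers:live", (-122.4194, 37.7749, "driver:42"))

  # Find all members within 5 km of the query point.
  nearby = r.georadius("drivers:live", -122.4194, 37.7749, 5, unit="km")
  print(nearby)  # [b'driver:42']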


Asynchronous Persistence Pattern

To achieve low-latency updates while maintaining durability:

  1. Driver sends a location update.
  2. Server immediately updates Redis.
  3. Server publishes the update to a message queue or event stream.
  4. Server responds success to the driver.
  5. Background workers persist the data to a durable store (e.g., Cassandra or SQL).

This fire-and-forget pattern decouples user-facing latency from long-term storage.
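
A sketch of this pattern, assuming the redis-py and kafka-python clients (the key, topic, and payload shape are hypothetical):

  import json

  import redis
  from kafka import KafkaProducer

  r = redis.Redis()
  producer = KafkaProducer(
      bootstrap_servers="localhost:9092",
      value_serializer=lambda v: json.dumps(v).encode("utf-8"),
  )

  def handle_location_update(driver_id, lat, lon):
      # Steps 1-2: update the live index synchronously (in-memory, fast).
      r.geoadd("drivers:live", (lon, lat, driver_id))
      # Step 3: emit an event for background persistence; don't wait on it.
      producer.send("driver-location-updates",
                    {"driver": driver_id, "lat": lat, "lon": lon})
      # Step 4: acknowledge immediately. Step 5 runs in a separate worker
      # that consumes the topic and writes to the durable store.
      return {"status": "ok"}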


Scaling with Sharding

A single Redis instance cannot handle global-scale traffic. Data must be partitioned across multiple nodes.

Common Sharding Strategies

Shard by Driver ID

  • Writes are evenly distributed.
  • Reads require querying all shards (scatter-gather).
  • Very stable under write-heavy workloads.

Shard by Location (Geohash)

  • Reads are fast and localized.
  • Writes risk overload in high-density regions.
  • Vulnerable to traffic hotspots.
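
The routing difference can be made concrete with a small sketch (the shard count, prefix length, and use of MD5 are illustrative; Python’s built-in hash() is salted per process and is unsuitable for stable routing):

  import hashlib

  NUM_SHARDS = 16

  def stable_hash(key):
      # Process-independent hash for routing decisions.
      return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

  def shard_by_driver(driver_id):
      # Writes spread uniformly; a proximity read must scatter-gather
      # across all NUM_SHARDS shards and merge the results.
      return stable_hash(driver_id) % NUM_SHARDS

  def shard_by_location(geohash, prefix_len=4):
      # A proximity read touches one shard (plus neighbors), but every
      # driver inside a hot prefix lands on the same shard.
      return stable_hash(geohash[:prefix_len]) % NUM_SHARDS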

The Hotspot Problem

When sharding by location, dense areas (e.g., city centers, stadiums) can overload a single shard.

Read Hotspots

  • Mitigated using read replicas.
  • Read traffic is distributed across multiple nodes.

Write Hotspots

  • More difficult to mitigate.
  • In primary–replica architectures, all writes must go to a single primary node.

Reducing Write Hotspots with Higher Precision

Increasing Geohash precision reduces the size of each shard’s geographic coverage:

  • Low precision: Large regions, many drivers, high write contention.
  • High precision: Smaller regions, fewer drivers per shard, better write distribution.

For example:

  • dr5r → one large bucket
  • dr5r1, dr5r2, … → many smaller buckets
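
A toy illustration of this re-bucketing (the key names follow the example above; a real migration would also rebalance existing data across the new buckets):

  BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

  def split_bucket(shard_key):
      # One shard key at precision n becomes 32 keys at precision n + 1,
      # spreading the same geographic area over 32x more buckets.
      return [shard_key + c for c in BASE32]

  print(len(split_bucket("dr5r")))  # 32 buckets: dr5r0, dr5r1, ..., dr5rz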

Final Trade-Off Analysis

  Strategy             Writes             Reads                   Risk
  Shard by Driver ID   Evenly balanced    Slow (scatter-gather)   Low
  Shard by Location    Hotspot-prone      Fast and localized      High

Prioritizing system stability favors sharding by Driver ID, while prioritizing query performance favors sharding by location.


Hybrid Architecture with CQRS

Large-scale systems typically adopt a hybrid model.

Write Path (Source of Truth)

  • Database sharded by Driver ID.
  • Optimized for high write throughput and stability.

Read Path (Search Index)

  • Redis cluster sharded by Geohash.
  • Optimized for fast proximity queries.

Updates propagate asynchronously from the write store to the read index via an event stream (e.g., Kafka). This introduces eventual consistency, which is acceptable for real-time location visualization.
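
A sketch of the read-path projector (kafka-python and redis-py; the topic and key names match the hypothetical producer shown earlier):

  import json

  import redis
  from kafka import KafkaConsumer

  r = redis.Redis()
  consumer = KafkaConsumer(
      "driver-location-updates",
      bootstrap_servers="localhost:9092",
      value_deserializer=lambda v: json.loads(v.decode("utf-8")),
  )

  # Continuously project write-path events into the Geohash-friendly
  # read index; the index lags the source of truth by design.
  for event in consumer:
      update = event.value
      r.geoadd("drivers:live", (update["lon"], update["lat"], update["driver"]))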


Conclusion

Geohashing is a foundational technique for building scalable, low-latency, location-based systems. When combined with appropriate sharding strategies, asynchronous processing, and CQRS, it enables systems to efficiently support millions of real-time users while maintaining stability and performance at global scale.