Sharding of data is done to achieve Scalability. Sharding means that each piece of data belongs to exactly one shard only. Sharding and replication are usually combined. A node may store more than one shard.


Sharding by key range

One way of sharding is to assign nodes a continuous range of keys. All the reads/writes pertaining to a key range are transferred to a single node. This sharding technique is used by BigTable, HBase.

Within each partition, we can keep keys in sorted order. This helps in range scan queries. The keys can be treated as a concatenated index in order to fetch serval related records in one query.

For example, getting a range of data from the sensor can be done by designing the key as (year-month-day-hour-minute-second)



Hotspots

  • This can still create hotspots. Suppose a group of sensors is writing values to the database. If the key is designed according to the time then, only one partition will receive all the writes and other partitions will be idle.
  • To overcome the problem the key can be prefixed with sensor id and that sensor id can be distributed across the partitions

Sharding by Hash of Key

A hash function emits a random output even if the inputs are related. This makes is it perfect for keys distribution. The key is passed to a hash function. The resultant hash then determines the partition to which it should be assigned

Due to randomization, the speed-up of range queries is lost

Cassandra combines the Hash and key range. It has a compound primary key, consisting of several columns. The first part is used as input to the hash function, the other part of the key is used to sort the data in the partition.

This can be utilized in scenarios like social networking sites. The key can consist of (user_id, update_timestamp). The data is distributed among partitions using the user_id. Within the partition, the data is sorted according to the update_timestamp.