Zookeeper
Zookeeper(ZK) is a coordination service. One of the basic form of coordianation is Configuration. ZK stores the configuration of a system and makes it highly available by replication.
Coordination
- Group Membership
- Leader election
- Dynamic Configuration
- Status Monitoring
- Queuing
Overview
- It does not implement lock like structures as Chubby but uses
wait free
data objects organized heirarichally as in file system. - It is suited for read heavy systems. Reads are quite fast but write take some time to persist.
- It ensures
FIFO client ordering and Linearizable writes
butreads are eventually consistant
- Under the hood ZK uses
Zab (Zookeeper Atomic Broadcast)
service - Clients cache the current ZK leader, to prevent them probing ZK everytime
- Clients establish session when they connect to ZK service and obtain a session handle through which they issue requests
Data Model
ZK stores data in file format called znodes. They are heirarchal namespaces (like file system). Each znode has data and children. There are two types of znodes:
-
Ephemeral: znode deleted when creater fails or explicitly deleted.
A typical use case for ephemeral nodes is when using ZooKeeper for discovery of hosts in the distributed system. Each server can then publish its IP address in an ephemeral node, and should a server loose connectivity with ZooKeeper and fail to reconnect within the session timeout, then its information is deleted.
-
Sequence: append a monotonically increasing counter
Zookeeper APIs
Create, delete, exists znode
- create(path, data, flags): Creates a znode with path name path and stores the data[] in it, and returns the name of new znode.flags indicate the type of node: regular, ephermeral, sequential flag.
- delete(path, version): Delete the znode if it is at the expected version
- exists(path, watch): Returns true is path exists else returns false. It enables client to keep a watch
get, set data znode
- getData(path, watch): returns the data and meta-data, such as version info associated with znode. Enables watch if path exists.
- setData(path, data, version): writes data[] to path if the znode with version exists
others
- getChildren(path, watch): returns the set of names of the children of a znode.
- syncpath(path): wait for all the updates pending at the start of the operation to propogate to the server that the client is connected to.
Writes
All the writes go through the leader. Every data object has a version number. When a write is successfull, the version number is increamented by 1.
Let’s write a program that increaments a counter in file whenever an incident occurs. The incident can occur at any client and should result in increament of counter.
while true:
x,v = getData("file")
if (setData("file", x+1, v)):
break
}
Here setData(“file”,x+1, v) will only set file to x+1 if the version number matches v.
Locks
ZK can provide distributed locks.
- The clients who want to aquire the lock attempt to create an ephemeral node, which gets deleted when the lock is released.
- The client is successful in aquring the lock if it is able to create the node.
- If the it fails, it means that some node has already created the node and acquired the lock. Then client then can put a watch on the node and wait for the notification of the node getting deleted. It can then retry again to acquire the lock.
acquire():
while true:
if create("lf", EPHEMERAL)
success
if exists("lf", watch = true)
wait for notification
release():
delete("lf")
When a client releases the lock, every other client will try to aquire the lock. If there a lot of clients waiting for the lock, it will trigger a huge traffic of create requests. This is called Herd effect
. The complexity becomes \(O(n^2)\).
To overcome this problem, sequence in which clients requests can be used to grant the lock.
While true:
n = create(l + '/lock-', EPHEMERAL|SEQUENTIAL)
C = getChildren(l, false)
if n is lowest znode in C, exit
p = znode in C ordered just before n
if exists(p, true) wait for watch event
goto 2
Unlock()
delete(n)
configuration
- workers get configuration
- getData(“../config/settings”, true)
- Administrator changes setting
- setData(“../config/settings”, newConf, -1)
- Workers notified of change and get new settings
- getData(“../config/settings”, true)
leader election
An easy way of doing leader election with ZooKeeper is to let every server publish its information in a zNode that is both sequential and ephemeral
. Then, whichever server has the lowest sequential zNode is the leader. If the leader or any other server for that matter, goes offline, its session dies and its ephemeral node is removed, and all other servers can observe who is the new leader.