
 The cluster MD is a shared-device RAID for a cluster.


 1. On-disk format

 A separate write-intent bitmap is used for each cluster node.
 The bitmaps record all writes that may have been started on that node,
 and may not yet have finished. The on-disk layout is:

 0                    4k                     8k                    12k
 -------------------------------------------------------------------
 | idle                | md super            | bm super [0] + bits |
 | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
 | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
 | bm bits [3, contd]  |                     |                     |
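
 Read literally, the diagram gives every node two 4k blocks starting at
 offset 8k, the first of which holds the bitmap superblock. The arithmetic
 can be sketched as below. The function name and the fixed two-block size
 are assumptions taken from the drawing; a real array may size each bitmap
 differently.

```python
BLOCK = 4096                      # the 4k unit used in the diagram
MD_SUPER_OFFSET = 1 * BLOCK       # block 0 is idle, block 1 holds the md super
FIRST_BITMAP_OFFSET = 2 * BLOCK   # bm super[0] starts at 8k
BLOCKS_PER_NODE = 2               # assumption: "super + bits" plus one "contd" block


def bitmap_super_offset(slot):
    """Byte offset of the bitmap superblock for a zero-based bitmap slot."""
    if slot < 0:
        raise ValueError("bitmap slots start at zero")
    return FIRST_BITMAP_OFFSET + slot * BLOCKS_PER_NODE * BLOCK


for slot in range(4):
    print(f"bm super[{slot}] at {bitmap_super_offset(slot) // 1024}k")
```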

 During "normal" functioning we assume the filesystem ensures that only one
 node writes to any given block at a time, so a write request will
  - set the appropriate bit (if not already set)
  - commit the write to all mirrors
  - schedule the bit to be cleared after a timeout.
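
 The three steps above can be modelled in a few lines. This is a toy
 sketch, not the md implementation: the class, its method names, the
 one-bit-per-chunk granularity and the use of bytearrays as mirrors are all
 assumptions made for illustration, and a write is assumed to stay inside
 one chunk.

```python
class NodeBitmap:
    """Toy write-intent bitmap for one node, one bit per chunk."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.dirty = set()           # chunk numbers whose bit is set
        self.pending_clear = set()   # bits scheduled to be cleared later

    def write(self, mirrors, offset, data):
        chunk = offset // self.chunk_size
        # 1. set the appropriate bit (a no-op if it is already set)
        self.dirty.add(chunk)
        self.pending_clear.discard(chunk)
        # 2. commit the write to all mirrors
        for mirror in mirrors:
            mirror[offset:offset + len(data)] = data
        # 3. schedule the bit to be cleared after a timeout
        self.pending_clear.add(chunk)

    def timeout_expired(self):
        """Clear every bit whose timeout has run out."""
        self.dirty -= self.pending_clear
        self.pending_clear.clear()


mirrors = [bytearray(16), bytearray(16)]
bitmap = NodeBitmap(chunk_size=8)
bitmap.write(mirrors, 9, b"ab")
print(sorted(bitmap.dirty))   # the bit stays set until the timeout fires
bitmap.timeout_expired()
print(sorted(bitmap.dirty))
```

 If the node dies between steps 1 and 3, the bit is still set on disk, which
 is what lets another node find the regions that may be out of sync.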

 Reads are just handled normally.  It is up to the filesystem to
 ensure one node doesn't read from a location where another node (or the same
 node) is writing.


 2. DLM Locks for management

 There are two locks for managing the device:

 2.1 Bitmap lock resource (bm_lockres)

  The bm_lockres protects individual node bitmaps. The locks are named in
  the form bitmap001 for node 1, bitmap002 for node 2 and so on. When a node
  joins the cluster, it acquires its lock in PW mode and holds it for as
  long as the node is part of the cluster. The lock resource
  number is based on the slot number returned by the DLM subsystem. Since
  DLM starts node count from one and bitmap slots start from zero, one is
  subtracted from the DLM slot number to arrive at the bitmap slot number.
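
 The slot conversion and the naming pattern can be sketched as follows.
 Both helper names are hypothetical, and the text does not say which of the
 two numberings is formatted into the lock name, so the second helper simply
 takes the number as a parameter.

```python
def bitmap_slot(dlm_slot):
    """Convert a one-based DLM slot number to a zero-based bitmap slot."""
    if dlm_slot < 1:
        raise ValueError("DLM slot numbers start at one")
    return dlm_slot - 1


def bitmap_lock_name(number):
    """Format a lock resource name in the bitmapNNN form described above."""
    return f"bitmap{number:03d}"


print(bitmap_slot(1))
print(bitmap_lock_name(1), bitmap_lock_name(2))
```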

 3. Communication

 Each node has to communicate with other nodes when starting or ending
 resync, and when updating the metadata superblock.

 3.1 Message Types

  There are 3 types of messages which are passed:

  3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been
    updated, and the node must re-read the md superblock. This is performed
    synchronously.

  3.1.2 RESYNC: informs other nodes that a resync is initiated or ended
    so that each node may suspend or resume the region.

 3.2 Communication mechanism

  The DLM LVB is used to communicate between nodes of the cluster. There
  are three resources used for the purpose:

   3.2.1 Token: The resource which protects the entire communication
    system. The node having the token resource is allowed to
    communicate.

   3.2.2 Message: The lock resource which carries the data to
    communicate.

   3.2.3 Ack: The resource whose acquisition means the message has been
    acknowledged by all nodes in the cluster. The BAST of the resource
    is used to inform the receiving node that a node wants to communicate.

 The algorithm is:

  1. receive status

    sender                         receiver                   receiver
    ACK:CR                          ACK:CR                     ACK:CR

  2. sender get EX of TOKEN
     sender get EX of MESSAGE
     sender                        receiver                 receiver
     TOKEN:EX                       ACK:CR                   ACK:CR
     MESSAGE:EX
     ACK:CR

     Sender checks that it still needs to send a message. Messages received
     or other events that happened while waiting for the TOKEN may have made
     this message inappropriate or redundant.

  3. sender write LVB.
     sender down-convert MESSAGE from EX to CW
     sender try to get EX of ACK
     [ wait until all receivers have *processed* the MESSAGE ]

                                      [ triggered by bast of ACK ]
                                      receiver get CR of MESSAGE
                                      receiver read LVB
                                      receiver processes the message
                                      [ wait finish ]
                                      receiver release ACK

    sender                         receiver                   receiver
    TOKEN:EX                       MESSAGE:CR                 MESSAGE:CR
    MESSAGE:CW
    ACK:EX

  4. triggered by grant of EX on ACK (indicating all receivers have processed
     message)
     sender down-convert ACK from EX to CR
     sender release MESSAGE
     sender release TOKEN
                                receiver upconvert to PR of MESSAGE
                                receiver get CR of ACK
                                receiver release MESSAGE

    sender                      receiver                   receiver
    ACK:CR                       ACK:CR                     ACK:CR
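
 The ordering in steps 1 to 4 follows from which lock modes may be held
 together. The sketch below encodes the standard DLM mode compatibility
 matrix and checks the key transitions. The helper name can_grant() is
 hypothetical, and the comments give one reading of the protocol rather than
 a statement of how the md code is written.

```python
# COMPAT[held] is the set of modes another node may hold on the same
# resource at the same time (standard DLM compatibility matrix).
COMPAT = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


def can_grant(requested, held_by_others):
    """True if `requested` is compatible with every mode other nodes hold."""
    return all(requested in COMPAT[held] for held in held_by_others)


# Step 1: both receivers hold ACK in CR, so the sender cannot get EX on ACK.
print(can_grant("EX", ["CR", "CR"]))
# Step 3: the sender holds MESSAGE in CW, and receivers can still take CR.
print(can_grant("CR", ["CW"]))
# Step 3: once every receiver has released ACK, the sender's EX is granted.
print(can_grant("EX", []))
# Step 4: a receiver's PR on MESSAGE waits while the sender still holds CW,
print(can_grant("PR", ["CW", "CR"]))
# and is granted once the sender has released MESSAGE.
print(can_grant("PR", ["CR"]))
```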


 4. Handling Failures

 4.1 Node Failure
  When a node fails, the DLM informs the cluster of the failed node's slot.
  A node receiving this notification starts a cluster recovery thread. The
  cluster recovery thread:
 	- acquires the bitmap<number> lock of the failed node
 	- opens the bitmap
 	- reads the bitmap of the failed node
 	- copies the set bits to the local node's bitmap
 	- cleans the bitmap of the failed node
 	- releases bitmap<number> lock of the failed node
 	- initiates resync of the bitmap on the current node
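
  The bitmap handling in this list amounts to merging one set of bits into
  another. A minimal sketch, assuming bitmaps are modelled as sets of dirty
  chunk numbers keyed by bitmap slot; the function name is hypothetical and
  the DLM locking around it is left out.

```python
def recover_failed_node(bitmaps, local_slot, failed_slot):
    """Fold the failed node's set bits into the local node's bitmap.

    `bitmaps` maps bitmap slot -> set of dirty chunk numbers. Returns the
    chunks the local node now has to resync.
    """
    failed_bits = set(bitmaps[failed_slot])   # read the failed node's bitmap
    bitmaps[local_slot] |= failed_bits        # copy the set bits locally
    bitmaps[failed_slot].clear()              # clean the failed node's bitmap
    return sorted(bitmaps[local_slot])        # regions handed to the resync


bitmaps = {0: {1}, 1: {4, 7}}
print(recover_failed_node(bitmaps, local_slot=0, failed_slot=1))
```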

  The resync process is the regular md resync. However, in a clustered
  environment, the node performing a resync needs to tell the other nodes
  which areas are suspended. Before a resync starts, the node sends out
  RESYNC_START with the (lo,hi) range of the area which needs to be
  suspended. Each node maintains a suspend_list, which contains the list
  of ranges which are currently suspended. On receiving RESYNC_START, a
  node adds the range to its suspend_list. Similarly, when the node
  performing the resync finishes, it sends RESYNC_FINISHED to the other
  nodes, and they remove the corresponding entry from the suspend_list.

  A helper function, should_suspend(), can be used to check whether a
  particular I/O range should be suspended.
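
  The suspend_list bookkeeping can be sketched as below. Only the names
  suspend_list and should_suspend() come from the text; the class, the other
  method names, the half-open [lo, hi) interpretation of the range and the
  assumption that RESYNC_FINISHED identifies the range to drop are all
  illustrative guesses.

```python
class SuspendList:
    """Ranges being resynced by other nodes; I/O inside them must wait."""

    def __init__(self):
        self.ranges = []   # list of (lo, hi) tuples

    def on_resync_start(self, lo, hi):
        self.ranges.append((lo, hi))

    def on_resync_finished(self, lo, hi):
        self.ranges.remove((lo, hi))

    def should_suspend(self, lo, hi):
        """True if [lo, hi) overlaps any currently suspended range."""
        return any(lo < s_hi and s_lo < hi for s_lo, s_hi in self.ranges)


suspend_list = SuspendList()
suspend_list.on_resync_start(100, 200)
print(suspend_list.should_suspend(150, 160))
suspend_list.on_resync_finished(100, 200)
print(suspend_list.should_suspend(150, 160))
```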

 4.2 Device Failure
  Device failures are handled and communicated with the metadata update
  routine.

 5. Adding a new Device
 For adding a new device, it is necessary that all nodes "see" the new device
 to be added. For this, the following algorithm is used:

     1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
        ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
     2. Node 1 sends NEWDISK with uuid and slot number
     3. Other nodes issue kobject_uevent_env with uuid and slot number
        (Steps 4,5 could be a udev rule)
     4. In userspace, the node searches for the disk, perhaps
        using blkid -t SUB_UUID=""
     5. Other nodes issue either of the following depending on whether the disk
        was found:
        ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
                 disc.number set to slot number)
        ioctl(CLUSTERED_DISK_NACK)
     6. Other nodes drop lock on no-new-devs (CR) if device is found
     7. Node 1 attempts EX lock on no-new-devs
     8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
        as SpareLocal
     9. If node 1 does not get the no-new-devs lock, it fails the operation
        and sends METADATA_UPDATED
     10. Other nodes learn whether the disk was added from the
         METADATA_UPDATED that follows.
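
 Steps 6 to 9 reduce to a single check: node 1 can take EX on no-new-devs
 only when no other node still holds CR on it. The sketch below models just
 that decision. The function name is hypothetical, and it assumes that a
 node which sends CLUSTERED_DISK_NACK keeps its CR lock, which the steps
 imply but do not state outright.

```python
def add_disk_outcome(found_by_node):
    """Decide whether node 1 can complete the add.

    `found_by_node` maps each other node's name to True if it located the
    new disk (and so dropped its CR lock on no-new-devs), or False if it
    did not find the disk and kept the lock.
    """
    still_holding_cr = [n for n, found in found_by_node.items() if not found]
    # EX is only grantable once no other node holds CR on the resource.
    got_ex = not still_holding_cr
    return "added" if got_ex else "failed"


print(add_disk_outcome({"node2": True, "node3": True}))
print(add_disk_outcome({"node2": True, "node3": False}))
```

 In both outcomes node 1 sends METADATA_UPDATED, which is how the other
 nodes learn the result.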