
Dual Primary: Think Twice

Martin Loschwitz <martin.loschwitz@linbit.com> Edited by Florian Haas


Copyright 2010, 2011 LINBIT HA-Solutions GmbH

Trademark notice
DRBD and LINBIT are trademarks or registered trademarks of LINBIT in Austria, the United States, and other countries. Other names mentioned in this document may be trademarks or registered trademarks of their respective owners.

License information
The text and illustrations in this document are licensed under a Creative Commons Attribution-NonCommercialNoDerivs 3.0 Unported license ("CC BY-NC-ND"). A summary of CC BY-NC-ND is available at http://creativecommons.org/licenses/by-nc-nd/3.0/. The full license text is available at http://creativecommons.org/licenses/by-nc-nd/3.0/legalcode. In accordance with CC BY-NC-ND, if you distribute this document, you must provide the URL for the original version.

1. Active/Active vs. Dual-Primary
2. What a filesystem does
3. Requirements for Dual-Primary DRBD
4. Cluster filesystems to the rescue
5. Design-related issues
6. Fencing mechanisms
7. Dual-Primary and long-distance links
8. Valid Dual-Primary scenarios
9. Conclusion
10. Feedback

This document explains the pros and cons of DRBD Dual-Primary configurations. It illustrates why, quite often, Dual-Primary configurations do more harm than good, and suggests thinking twice before creating this type of DRBD setup.

1. Active/Active vs. Dual-Primary
First of all, let's get down to some basics and establish a convention for describing specific types of setups.

Active/Active simply means that in a two-node cluster, both cluster nodes run certain applications. This does not necessarily imply identical applications, and certainly does not imply concurrent data access. Think of an HA cluster hosting a test database and a production database: if both cluster nodes are up and running, one database runs on one node and the other database on the other. If one node is down, only the production database will be up, on the remaining node. The test database will be down and remain down until the other cluster node comes back online. Here, different DRBD resources can be in the Primary role on different nodes, but no single DRBD resource needs to be Primary on both nodes at the same time. In fact, it usually will not be.

Dual-Primary, in contrast, means that the same DRBD resource holds the Primary role on both cluster nodes at the same time, so that it becomes possible to access the resource in read/write mode on both nodes simultaneously.

The official LINBIT naming scheme is as follows: when a DRBD resource is in the Primary role on two nodes at the same time, we refer to it as Dual-Primary. When a two-node cluster is set up to run different applications on different nodes as long as both nodes are available, this is an Active/Active setup in LINBIT speak.

2. What a filesystem does
Filesystems exist for exactly one reason: they organize storage devices. A raw storage device follows no specific structure; adding one is what a filesystem does. It makes it possible to quickly find information written to the device in the past. When you open a file in your favourite editor, it is the filesystem's task to know which specific area of your storage device to access in order to open exactly the file you want to see. And when you write changes to the disk, the filesystem makes sure the newly created information is integrated properly into the filesystem's structure.

Imagine the following situation: two applications on your system try to access the same region of your storage device at the same time. Both write requests hit the filesystem. The filesystem will process one request while queueing the other, then process the other request. At all times it knows which piece of information it wrote to which part of the disk. The filesystem remains consistent.

Keep in mind that it is not the filesystem's task to actually validate the content of a specific file it writes to the disk. If one application on your system overwrites the contents of a file created by another application, making that other application go haywire, then from the filesystem's point of view everything is still fine, as long as it knows which piece of information it wrote where and when. Administrators, on the other hand, are obviously free to disagree.

The opposite of a consistent filesystem is a corrupt filesystem. Filesystems may be damaged by a number of factors, including hardware problems, general stability problems, or even inept administrators. Recovering a corrupt filesystem is a tricky task and often does not work as expected. That is why administrators need to guard against filesystem corruption in advance (and while we are at it: the best protection is, and will always be, keeping good backups).
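The serialization described above can be sketched with a toy model (plain Python, not actual filesystem code; the names `device` and `fs_write`, and the lock standing in for the filesystem's internal coordination, are all illustrative):

```python
import threading

# Toy model of a block device: a fixed-size byte array.
device = bytearray(16)

# The "filesystem" serializes access to the device with a lock,
# so two concurrent writes to the same region never interleave.
fs_lock = threading.Lock()

def fs_write(offset, data):
    """Write data at offset; the lock queues concurrent requests."""
    with fs_lock:
        device[offset:offset + len(data)] = data

# Two applications write to the same region at the same time.
t1 = threading.Thread(target=fs_write, args=(0, b"AAAAAAAA"))
t2 = threading.Thread(target=fs_write, args=(0, b"BBBBBBBB"))
t1.start(); t2.start()
t1.join(); t2.join()

# One write wins, but the region is never a mix of both:
assert bytes(device[:8]) in (b"AAAAAAAA", b"BBBBBBBB")
```

Whichever request is processed second overwrites the first, but because the requests were queued rather than interleaved, the on-disk state always matches what the filesystem believes it wrote.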

3. Requirements for Dual-Primary DRBD
When a DRBD resource is in Dual-Primary mode, the DRBD kernel driver allows writes to that resource on both nodes sharing it. By design, a DRBD resource is supposed to have the same contents on both nodes of a cluster: changes from node A are replicated over to node B and the other way around.

With a standard filesystem (Ext3/4, XFS, and so on), the following situation could arise. Imagine a DRBD resource with an Ext4 filesystem on it. DRBD is in Dual-Primary mode, and the Ext4 filesystem is mounted on both cluster nodes. Application A writes something to the filesystem residing on the DRBD resource on node A, which then gets written to the physical storage device. At the very same time, application B writes something to the filesystem on node B, which gets written to exactly the same region of the storage device of node B. DRBD replicates the changes from node A to node B and the other way around, changing the contents of the physical storage devices. However, as DRBD resides underneath the mounted Ext4 filesystems, the filesystem on node A does not notice the changes coming from node B (and vice versa). This is called a concurrent write. From this point on, the actual content of the storage device differs from what the filesystem thinks it should be. The filesystem is corrupt.

Because of this, normal filesystems simply cannot be used in Dual-Primary setups, not even when mounted read-only on one of the two nodes: even then, the filesystem metadata might still be changed. Additionally, Linux assumes that when it mounts a filesystem read-only, absolutely nothing else will change that filesystem. This leads to massive cache coherency problems: imagine a file that was read into the cache on the read-only node at some point in the past

and is re-accessed later, while in the meantime it was changed on the node where the filesystem is mounted read-write. The node with the read-only mount is not going to notice these changes.
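A toy model may help illustrate why concurrent writes corrupt a conventional filesystem. This sketch is not DRBD code; the `Node` class and `replicate` function are purely illustrative stand-ins for a mounted filesystem and DRBD's block-level replication:

```python
# Each node has a block device (the raw bytes "DRBD" keeps in sync)
# and a filesystem view (what the mounted filesystem believes is on
# disk). Replication updates the peer's device, not its view.

class Node:
    def __init__(self):
        self.device = bytearray(8)   # raw storage, replicated below the FS
        self.fs_view = bytearray(8)  # what the local filesystem thinks

    def fs_write(self, offset, data):
        """A write through the local filesystem updates both."""
        self.device[offset:offset + len(data)] = data
        self.fs_view[offset:offset + len(data)] = data

def replicate(dst, offset, data):
    """'DRBD' copies the raw write underneath the peer's filesystem."""
    dst.device[offset:offset + len(data)] = data

a, b = Node(), Node()

# Concurrent writes to the same region on both primaries:
a.fs_write(0, b"AAAA")
b.fs_write(0, b"BBBB")

# Cross-replication changes each device underneath its filesystem.
replicate(b, 0, b"AAAA")
replicate(a, 0, b"BBBB")

# Each filesystem's idea of the disk no longer matches the disk:
assert a.fs_view[:4] != a.device[:4]  # A believes AAAA, disk says BBBB
assert b.fs_view[:4] != b.device[:4]  # B believes BBBB, disk says AAAA
```

The final assertions are exactly the corruption described above: the device contents and the filesystem's bookkeeping have silently diverged on both nodes.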

4. Cluster filesystems to the rescue
Cluster filesystems are an attempt to meet that challenge. You might have heard of them: GFS, GFS2, OCFS2; there are quite a few of them around. What they all have in common are mechanisms to avoid concurrent writes. With a cluster filesystem in place, every server exports its storage devices (a DRBD resource can be such a storage device, too). Every client that wants to use the cluster filesystem has to mount it with the cluster filesystem software itself acting as the appropriate interface, and all write attempts to the cluster filesystem are coordinated by that software. Numerous components needed for this are already part of recent Linux kernels, among them the Distributed Lock Manager (DLM), which provides locking-related functions to the cluster filesystems.
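The coordination a DLM provides can be sketched in miniature. This is an illustrative single-process analogy, not the real DLM API: one Python lock stands in for a cluster-wide lock, and every node's view of the data is updated before the lock is released:

```python
import threading

# Toy stand-in for the Distributed Lock Manager: one cluster-wide
# lock per region. A node must hold the lock before writing, and the
# write reaches every node's filesystem view before it is released.
dlm = threading.Lock()

devices = [bytearray(8), bytearray(8)]   # one block device per node
views   = [bytearray(8), bytearray(8)]   # one filesystem view per node

def clustered_write(offset, data):
    with dlm:                            # cluster-wide coordination
        for dev, view in zip(devices, views):
            dev[offset:offset + len(data)] = data
            view[offset:offset + len(data)] = data  # every node is told

# Both "nodes" write to the same region concurrently:
t1 = threading.Thread(target=clustered_write, args=(0, b"AAAA"))
t2 = threading.Thread(target=clustered_write, args=(0, b"BBBB"))
t1.start(); t2.start()
t1.join(); t2.join()

# Whichever write came last, every node's view matches its device:
for dev, view in zip(devices, views):
    assert view == dev
assert devices[0] == devices[1]
```

Contrast this with the previous sketch: because every write is both serialized and announced to all nodes, the on-disk state and each node's bookkeeping can never diverge.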

5. Design-related issues
Given that cluster filesystems exist, you might wonder why Dual-Primary setups should be a problem at all. But the consistency of a filesystem residing on a storage device is not the only condition that needs to be met for that storage device to be considered usable. There are other challenges that make running a DRBD resource in Dual-Primary mode complicated, many of them related to the design of modern storage solutions.

There is, for instance, the "all or nothing" rule. That rule states that a set of data can only be considered consistent if it is positively known that every single bit in that set of data is what it is supposed to be (in DRBD speak, for example, this would mean a resource is UpToDate). As soon as a device is written to in an uncoordinated manner (caused by hardware errors, a cluster filesystem going berserk, or whatever), the set of data has to count as inconsistent, because we can no longer safely assume that it is consistent. It is then worthless.

In a simple Primary/Secondary setup with DRBD, even if one node crashes, the DRBD on the remaining machine will still hold a consistent filesystem. We can safely assume that because in this sort of setup, nothing could have written to the Secondary DRBD device except the Primary, which went away. With a Dual-Primary resource, we must assume in principle that as soon as the two nodes sharing that DRBD device become disconnected from each other, uncoordinated writes can happen on either of them. Measures need to be taken to make sure that a node in trouble can no longer corrupt the set of data: welcome to fencing.

6. Fencing mechanisms
You might have heard of fencing mechanisms. Fencing is generally a good idea, not only in Dual-Primary setups. In Dual-Primary configurations, however, fencing is vital and absolutely necessary for proper cluster functionality. One of the well-known fencing mechanisms is STONITH (which stands for "Shoot The Other Node In The Head"). Using STONITH means that as soon as the cluster manager detects that a node has problems, it will forcibly reboot that node to make sure no further changes to the data happen on it.

Setting up fencing is not only complicated for Dual-Primary setups (aside from the fact that in most cases it needs special hardware), it also adds considerable complexity to the cluster. Last but not least, there are practical impacts caused by the internal fencing mechanisms used by all cluster filesystems: whenever the cluster manager notices that a node needs to be fenced, the

cluster filesystem will halt all I/O operations on all nodes, leading to possible I/O wait (depending on how long the fencing takes).
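In a DRBD/Pacemaker cluster, fencing is typically wired up in the resource configuration. The following fragment is a sketch in the style of a DRBD 8.4 drbd.conf; the exact section the `fencing` keyword belongs in, and the handler script paths, vary between DRBD and distribution versions, so check your installation:

```
resource r0 {
  net {
    # Do not resume I/O until the peer has been fenced:
    fencing resource-and-stonith;
  }
  handlers {
    # Handler scripts shipped with drbd-utils for Pacemaker
    # clusters; paths may differ on your distribution:
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```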

7. Dual-Primary and long-distance links
As pointed out previously, a Dual-Primary DRBD resource is far more sensitive to connection hiccups and connection breakdowns than a standard resource. For that reason, Dual-Primary setups should only be run in environments where a back-to-back connection ("cross-link") between the two nodes sharing a DRBD resource is available. Running a Dual-Primary setup over a long-distance link is asking for trouble unless the link is dark fibre.

Running DRBD in Dual-Primary mode requires that the affected resource use protocol C, DRBD's synchronous replication protocol. DRBD will return an error message if you try to put a resource into Dual-Primary mode while it uses protocol A or B.
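In configuration terms, a Dual-Primary resource must combine protocol C with the `allow-two-primaries` option. The fragment below is a minimal sketch in DRBD 8.4-style syntax (the resource name is a placeholder; in DRBD 8.3 and earlier, `protocol` is a top-level resource statement instead):

```
resource r0 {
  net {
    protocol C;              # synchronous replication: mandatory here
    allow-two-primaries yes; # permit the Primary role on both nodes
  }
  # ... disk and per-host sections omitted ...
}
```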

8. Valid Dual-Primary scenarios
Building a reasonable Dual-Primary setup is a complex and difficult task, as stated previously. Nevertheless, this document does not mean to question that valid usage scenarios do exist. Here are some scenarios that LINBIT considers worth the effort.

Clustered Samba (CTDB): Clustered Samba is a special flavour of Samba built for cluster setups. It depends on a clustered filesystem.

Oracle: Using Oracle's database with the RAC feature requires a clustered filesystem (OCFS2, unsurprisingly, being the one preferred by Oracle).

Note
Consult an Oracle support representative about support considerations for Oracle RAC on Dual-Primary storage with DRBD and OCFS2.

Live migration for VMs: When using virtual machines (be it KVM or Xen), live migration makes it possible to move a running VM from one host to another without powering it off and on again. During the live migration, both nodes obviously need to be able to access the underlying storage device in read/write mode for a short period of time. That can be achieved by using DRBD in Dual-Primary mode. Please note: you do not need a clustered filesystem in this case; libvirt can take care of the coordination for you.

Parallel NFS: Starting with version 4.1, NFS supports pNFS, which stands for Parallel NFS. pNFS can be run on top of a Dual-Primary DRBD, too.
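For the live-migration case, the sequence of steps looks roughly like this (an illustrative outline, not a copy-and-paste recipe; the resource, VM, and host names are placeholders, and a cluster manager such as Pacemaker can automate the role changes for you):

```
# On the migration target: promote the DRBD resource there, too
# (requires allow-two-primaries in the resource configuration):
drbdadm primary r0

# On the source: migrate the running VM with libvirt/KVM:
virsh migrate --live vm1 qemu+ssh://target-host/system

# On the former source, once migration has completed: demote
# back to Secondary, leaving a single Primary again.
drbdadm secondary r0
```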

9. Conclusion
As pointed out, there are some scenarios where using Dual-Primary DRBD makes sense: CTDB, Oracle RAC, live migration of virtual machines, and pNFS, to name a few examples. If you are about to use Dual-Primary for other setups, it may well be worth thinking about it again: in most cases, the time invested in avoiding a Dual-Primary setup pays off two- and threefold once the solution is in production use. If you still think you need a Dual-Primary setup, give us a call to find out whether LINBIT can help you with whatever you are about to implement.

10. Feedback
Any questions or comments about this document are highly appreciated and much encouraged. Please contact the author(s) directly; contact email addresses are listed on the title page.

For a public discussion of the concepts mentioned in this white paper, you are invited to subscribe and post to the drbd-user mailing list. Please see http://lists.linbit.com/listinfo/drbd-user for details.
