Page 1 of 5
http://symantecpartners.vportal.net/media/symantecpartners/media/_generated/transcripts/t... 8/23/2011
So here's an example of configuring site-based allocation, or site-awareness, on a disk group and everything recursively inside the disk
group. That's done with the vxdg addsite command shown here. Use allsites=off if you want to turn site-awareness off for the disk group,
and vxdg rmsite to remove a site from the disk group's configuration. Notice in the top balloon there, underneath the two vxdg
commands, the two things that happen automatically when you attach a site tag to a disk group: new volumes are
mirrored across the sites, and their read policy is automatically set to siteread.
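As a sketch of those commands, assuming a hypothetical disk group called maindg, a volume called datavol, and sites named siteA and siteB (your names will differ):

```shell
# Register each site with the disk group; this is what triggers the
# automatic behavior in the top balloon (mirroring new volumes across
# sites and setting the siteread policy):
vxdg -g maindg addsite siteA
vxdg -g maindg addsite siteB

# Turn all-sites allocation off for a volume if you need to:
vxvol -g maindg set allsites=off datavol

# Remove a site from the disk group's configuration:
vxdg -g maindg rmsite siteB
```

Exact option placement can vary by Storage Foundation release, so check the vxdg manual page on your host.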
Site consistency in a disk group
So let's talk about setting up site consistency for a disk group. Site consistency takes another step in making sure that your site
volumes are more redundant, have logs attached to them, and can behave predictably with site-awareness in a failure situation. So the
key topics to talk about with site consistency, the differentiators are, new volumes are created as site consistent automatically when
you set the site consistency to the disk group. Existing volumes that don't have site consistency are not impacted in terms of their
data or their plexes. And for redundancy purposes, a copy of the configuration database of a disk group that is set as site consistent is
going to be placed at both sites, which is going to be important in the case of a connectivity failure on the SAN. So the first step is to
set the siteconsistent flag to on for the disk group itself. There's a command for that on your slide. If you want to set consistency on
an individual existing volume, then you do vxsnap prepare. And then you do vxvol set siteconsistent=on to the volume itself. That is
going to automatically create the DCO log version 20 that we talked about in earlier slides. If you want to turn off site consistency to
a volume, you can do that manually by setting the site consistent flag to off. And notice how you can also set the disk group flag to
off.
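The sequence just described might look like this, using the same hypothetical maindg and datavol names (the vxsnap prepare step is what attaches the version 20 DCO log mentioned above):

```shell
# Make the disk group site consistent; new volumes created after
# this are site consistent automatically:
vxdg -g maindg set siteconsistent=on

# For an existing volume, first attach the version 20 DCO log,
# then set the flag on the volume itself:
vxsnap -g maindg prepare datavol
vxvol -g maindg set siteconsistent=on datavol

# Turning site consistency off again, per volume or per disk group:
vxvol -g maindg set siteconsistent=off datavol
vxdg -g maindg set siteconsistent=off
```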
Read policy in a site-consistent setup
We talked about site read in earlier lessons and it was one of the new read policies in the product. And with Storage Foundation 5.0
and later, you have a site read policy that is right in tune with site-awareness and site consistency. And what it will do, is it'll instruct
the host to read data only from the site that you want on the plexes that you want if you are SAN mirrored across sites mainly for
performance and update purposes. And you can change the read policy at any time, site read is just another read policy and a tool in
your toolbox to be able to use to make sure that you are the most consistent and up to date without sacrificing performance. The
command is not going to have any effect if site names have not been set on the host, so you need to make sure that your vxdctl set
site command has been done and the site name matches up to the site that you want to read from.
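Putting those two pieces together, a sketch with the same hypothetical names (remember the site name on the host has to be set first, or the read policy has no effect):

```shell
# Define this host's site name; run the equivalent command with
# site=siteB on the hosts at the other site:
vxdctl set site=siteA

# Set the siteread policy on the volume; like any read policy,
# you can change it back later (for example to select):
vxvol -g maindg rdpol siteread datavol
```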
Site-based storage allocation
So a site-consistent mirrored volume, with plexes at all the sites where you want to be consistent, is created with the command at
the top of the screen. A non-site-consistent mirrored volume is shown as well for comparison purposes. So even though your disk
group may be site consistent, if you must set one or two of your volumes to be non-site-consistent, you can still do that. Just be
aware that different things are going to happen if a hardware component fails in a site-consistent versus a non-consistent volume. So
usually you want to have all volumes either site consistent or non-site-consistent, depending on your needs.
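The two creation commands being compared on the slide might look roughly like this (disk group name, volume names, and sizes are hypothetical):

```shell
# Site-consistent mirrored volume, with a plex at every registered site:
vxassist -g maindg make sitevol 10g allsites=on siteconsistent=on

# Non-site-consistent mirrored volume, shown for comparison:
vxassist -g maindg make plainvol 10g nmirror=2 allsites=off siteconsistent=off
```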
Making changes in a site-consistent environment
Now if you want to make some changes in your site-consistent environment, here are some examples of how to do that. When
you do this, sometimes you inadvertently change the rules, or you may violate rules that you have already set up on other objects or
other volumes in the disk group. You want to make sure that all the rules are in tune with each other and logically work together,
because if they're not, they could cause some unexpected results inside the volume or volume objects. So here's an example of
adding a site, site C, which in this first case near the bottom of your slide doesn't have a mirror created that points to
site C, and so the command would fail unless you already had a mirror set up across to site C. Now notice how you can force the
command if you need to, but what's going to happen is that allsites is going to be disabled, because site C isn't truly site
consistent. So using the force flag is only recommended if you really need it, because you want something to be created at that site
and you're going to set site consistency on that object later on. This is not a recommended procedure if you want to be site
consistent across all of your sites. But nevertheless, like a lot of other Volume Manager operations with a -f force flag associated
with them, it is there if you need to use it.
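A sketch of that example, again with hypothetical names (the second form is the one that disables allsites on volumes that have no plex at site C):

```shell
# Fails if existing site-consistent volumes have no plex at siteC:
vxdg -g maindg addsite siteC

# Forcing it succeeds, but allsites is disabled on the volumes that
# are not truly consistent at siteC; only do this if you intend to
# add mirrors at siteC and re-enable consistency afterwards:
vxdg -g maindg -f addsite siteC
```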
Making an existing disk group site consistent
The other thing you want to make sure of: because this feature for site-awareness and consistency was introduced in Storage
Foundation 5.0, and the disk group version in Storage Foundation 5.0 was 140, you want to make sure your disk groups are all upgraded
to version 140 or later. And of course, to do that you're going to run the vxdg upgrade command on the disk group, which takes care of
everything inside the disk group as well, and that is an uptime, online command. If the disk group isn't upgraded, you're going to get
errors when you try to set the site flags and tags, and then you will know that you have to upgrade it. So once you've upgraded your disk
group, you're going to define the site name on each host that can access that disk group, presumably at both sites. You're going to tag
the disks in the disk group with a site name. And if you have any RAID-5 volumes, which cannot be mirrored in Volume Manager,
you're going to have to either convert those volumes to other types of volumes, such as striped-mirror volumes, or you'll have
to move them to another disk group, which will not be site-aware. You want to ensure that the volumes have plexes at each
site, then register each site with the disk group, and then make sure site consistency is on for the disk group itself
and for each volume in the disk group.
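Those steps might be sketched like this, with the same hypothetical maindg/siteA/siteB names and hypothetical disk names; the exact syntax for tagging disks can vary by release, so treat this as an outline rather than a recipe:

```shell
# 1. Upgrade the disk group to version 140 or later (online operation):
vxdg upgrade maindg

# 2. Define the site name on each host that accesses the disk group
#    (run with site=siteB on the hosts at the other site):
vxdctl set site=siteA

# 3. Tag each disk in the disk group with its site name:
vxdisk -g maindg settag site=siteA disk01 disk02

# 4. Register each site, then turn on site consistency for the
#    disk group (new and flagged volumes follow from there):
vxdg -g maindg addsite siteA
vxdg -g maindg addsite siteB
vxdg -g maindg set siteconsistent=on
```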
Recovering from failures with remote mirrors
At that point, your entire disk group and all of its objects, recursively, are site consistent.
Possible failure scenarios: Storage failure
Now, let's say you have that scenario and then on site B for example, you have a failed disk one day. So this assumes that you have
two sites, you have SAN mirror setup, you have complete disk group site consistency recursively and what happens is, in site B you
lose a disk. In earlier versions of Storage Foundation before site consistency was around, what used to happen by default is hot
relocation on site B would take over and it would move the failed disk out of the disk group and find a spare disk or another disk to
replace that disk inside the same disk group and it would automatically try to recover the volume. Now, that will still happen, but only
if your disk group is not site consistent. If you've done all the previous steps in the slides I just showed to make your disk group site
consistent, something different happens. What happens is, site A's vxconfigd daemon will detach site B from the configuration. It'll
detach the entire site, even though only one disk has failed and because all the volumes at site B were mirrored, there's really no
impact to that data at site B. But site A takes a very conservative approach to this and decides that this failure could possibly affect
the entire site and it kind of overrides the hot relocation daemon in site B and detaches the whole site from site A's configuration.
Now this does not mean that any of the applications will stop running at site B. If they're running, they can still run, but what
happens is, now site B is no longer site consistent in site A's eyes, so you want to be aware of this. This only happens when the disk
group itself is site consistent. Remember that you can also have a situation where your disk group is not site consistent, but some of
the volumes in that disk group are site consistent. This is a case where everything in the disk group, including disk group itself, is site
consistent.
Possible failure scenarios: Primary site failure
So here we have a standby host at the other site, and this scenario is different. This is a real honest to goodness site A failure where
something happens to the site, maybe from an environmental standpoint, and the site is no longer accessible, so no applications are
running now at site A. And now you most likely need the assistance of a clustered application, namely Veritas Cluster Server for
example, to help you failover all your services and service groups containing the disk groups and volumes to site B. So what happens
is the disk groups are imported with missing disks and the failed site is detached from site B's configuration. And now site B has
control of all the applications and processes running on those objects.
Possible failure scenarios: Loss of SAN connectivity
Here's a third scenario, which is different than the previous two. This scenario assumes that you've lost storage connectivity between
site A and site B, so neither site has actually failed, nor have any volumes, disks or disk groups failed at either site. This is simply a
breakage of connectivity, very similar to a heartbeat outage in a clustered environment where neither server in that clustered
environment is actually down. So the same type of thing happens between the sites, each site thinks the other is down when it's not
really down. So if this happens, each site detaches the other site from its own configuration and assumes that it's
the only site running in the whole environment right now. In this particular case, there's a possibility of getting something called
serial split brain. Now this serial split brain is a condition where the disk group configuration database copies at each site may be out
of date or they may contain a different number or different type of objects because there's been some changes since then.
Nonetheless, they are not necessarily identical at each site. Before you bring the sites back to site-aware consistency, when
you reattach the connectivity or fix the connectivity problems, you're going to have to make sure that the disk group configuration
database images are the same at each site. That means that site A has a consistent view in its database of site B, just as site
B has of itself, and vice versa.
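For diagnosing this situation, Volume Manager includes a vxsplitlines utility; a minimal sketch, assuming the same hypothetical disk group name:

```shell
# Display the conflicting configuration database copies after a
# serial split brain, showing which disks hold which version, so
# you can decide which copy of the configuration to keep when you
# re-import the disk group:
vxsplitlines -g maindg
```

The utility's output indicates how to import the disk group with the configuration copy you trust; check its manual page for the exact import options on your release.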
Recovering from a detached site
So that's part of the reason that site-awareness and consistency can help you, because if you remember from earlier, when we made our
disk group site consistent, we presumably set it up and set all the flags correctly. What that did was create copies of each site's
configuration database at the other site. So each site already has, or should have, an up-to-date copy of the other site's configuration.
And assuming nothing else happens to that database, then you should be okay. So now you have to recover from the detached-site
situation, and here are some steps on the slide that describe how to do that. You fix the hardware failure as quickly as you can, and
verify the site is back up with its associated storage, with everything recovered if it needs to be recovered and up if it needs to be up. If
a disk had failed, then the vxattachd daemon should have reattached the failed site and/or the failed disk and automatically
recovered the volumes, along with the other steps listed on the slide. This procedure is only necessary if you've had a breakage in the
SAN connections and serial split brain was a risk; it's meant to fix, or get around, the serial split brain problem. The update of the
configuration database will be done automatically when the detached site is recovered at the other site.
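If the vxattachd daemon hasn't already reattached things for you, the manual recovery might look like this sketch, with the same hypothetical names:

```shell
# After fixing the hardware or connectivity problem, reattach the
# detached site to the disk group's configuration:
vxdg -g maindg reattachsite siteB

# Then resynchronize the plexes and recover the volumes:
vxrecover -g maindg
```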
Verifying a site-aware environment
Now, nobody wants a disaster to happen at their site or at least I hope nobody wants a disaster to happen at their site.
Testing the site-aware configuration
For this purpose we offer an additional feature that comes with the site-awareness and consistency, which is called manual fire drill.
Manual fire drill is a way to test or simulate a site outage. It's typically used in a company's business continuity and disaster recovery
(BC/DR) plans, which should usually be executed at least annually, if not more often. And so this is something that you can add or implement in that
plan where you actually simulate one of these site failures or even a volume or disk failure. Now if you look at your slide, you'll see in
the middle of the slide it says, not recommended for live production systems. Let me put this into context for you. What this means is
we don't recommend doing this at a time of high I/O or high user usage on the applications that are using these volumes involved in
the test. The reason being that it could cause a real dip in performance: a lot of activity happens under the covers
here, with objects being shifted, flags being read, tags being set, and so on, and that would cause a lot of
contention on the hardware that's being used by the applications and the users. So this is recommended only when needed and only
at an opportune time maybe as part of your BCDR plan. But I think it's important to know that it's available to use because if you
ever do have a real disaster at your site, you don't necessarily know what's going to happen after the disaster or you don't know
where you're going to be or what you're going to have to do to restore yourself after the disaster. This fire drill offers a way to test
that and give you a very specific set of results because all this stuff is logged in the usual log places and you can go back to those
logs and save those in the folder or file or database where you do all your BCDR tests. So what happens is, the command that you
see here is going to perform a point in time detach of all site consistent mirrors. It's going to simulate a mirror failure in those
volumes that are used in the site consistency. And then what you have to do after it, is you have to recover the detached site, just as
the previous slides described, as if you had a real site failure. So there will be some manual administration involved in this procedure,
but I think it's a good procedure to know about and to use in these types of plans.
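A sketch of such a fire drill, using the same hypothetical disk group and site names; exact syntax may differ by release, so verify against your vxdg manual page before running it in a BC/DR exercise:

```shell
# Simulate a site failure: point-in-time detach of all of siteB's
# site-consistent mirrors:
vxdg -g maindg detachsite siteB

# ... validate your applications and record the results for your
#     BC/DR logs ...

# Recover exactly as you would after a real site failure:
vxdg -g maindg reattachsite siteB
vxrecover -g maindg
```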
Lesson summary
And that concludes Lesson 7. Thank you very much.