Page 1 of 5
http://symantecpartners.vportal.net/media/symantecpartners/media/_generated/transcripts/t... 8/23/2011
So here's an example of configuring site-based allocation, or site-awareness, on a disk group and everything recursively inside the disk
group. That's done with the vxdg addsite command shown here. Use allsites=off if you want to turn site-awareness off for the disk group,
and vxdg rmsite to remove a site from the disk group's configuration. Notice in the top balloon there, underneath the two vxdg
commands, the two things that happen automatically when you attach a site tag to a disk group: new volumes are
mirrored across the sites, and their read policy is automatically set to siteread.
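As a sketch of those commands, assuming a hypothetical disk group called maindg, a volume called datavol, and sites named siteA and siteB (your names will differ):

```shell
# Register each site with the disk group; this is what triggers the
# automatic behavior in the top balloon (mirroring new volumes across
# sites and setting the siteread policy):
vxdg -g maindg addsite siteA
vxdg -g maindg addsite siteB

# Turn all-sites allocation off for a volume if you need to:
vxvol -g maindg set allsites=off datavol

# Remove a site from the disk group's configuration:
vxdg -g maindg rmsite siteB
```

Exact option placement can vary by Storage Foundation release, so check the vxdg manual page on your host.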
Site consistency in a disk group
So let's talk about setting up site consistency for a disk group. Site consistency takes another step in making sure that your site
volumes are more redundant, have logs attached to them, and can behave predictably with site-awareness in a failure situation. So the
key topics to talk about with site consistency, the differentiators are, new volumes are created as site consistent automatically when
you set the site consistency to the disk group. Existing volumes that don't have site consistency are not impacted in terms of their
data or their plexes. And for redundancy purposes, a copy of the configuration database of a disk group that is set as site consistent is
going to be placed at both sites, which is going to be important in the case of a connectivity failure on the SAN. So the first step is to
set the siteconsistent flag to on for the disk group itself. There's a command for that on your slide. If you want to set consistency on
an individual existing volume, then you do vxsnap prepare. And then you do vxvol set siteconsistent=on to the volume itself. That is
going to automatically create the DCO log version 20 that we talked about in earlier slides. If you want to turn off site consistency to
a volume, you can do that manually by setting the site consistent flag to off. And notice how you can also set the disk group flag to
off.
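The sequence just described might look like this, using the same hypothetical maindg and datavol names (the vxsnap prepare step is what attaches the version 20 DCO log mentioned above):

```shell
# Make the disk group site consistent; new volumes created after
# this are site consistent automatically:
vxdg -g maindg set siteconsistent=on

# For an existing volume, first attach the version 20 DCO log,
# then set the flag on the volume itself:
vxsnap -g maindg prepare datavol
vxvol -g maindg set siteconsistent=on datavol

# Turning site consistency off again, per volume or per disk group:
vxvol -g maindg set siteconsistent=off datavol
vxdg -g maindg set siteconsistent=off
```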
Read policy in a site-consistent setup
We talked about site read in earlier lessons and it was one of the new read policies in the product. And with Storage Foundation 5.0
and later, you have a site read policy that is right in tune with site-awareness and site consistency. And what it will do, is it'll instruct
the host to read data only from the site that you want on the plexes that you want if you are SAN mirrored across sites mainly for
performance and update purposes. And you can change the read policy at any time, site read is just another read policy and a tool in
your toolbox to be able to use to make sure that you are the most consistent and up to date without sacrificing performance. The
command is not going to have any effect if site names have not been set on the host, so you need to make sure that your vxdctl set
site command has been done and the site name matches up to the site that you want to read from.
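Putting those two pieces together, a sketch with the same hypothetical names (remember the site name on the host has to be set first, or the read policy has no effect):

```shell
# Define this host's site name; run the equivalent command with
# site=siteB on the hosts at the other site:
vxdctl set site=siteA

# Set the siteread policy on the volume; like any read policy,
# you can change it back later (for example to select):
vxvol -g maindg rdpol siteread datavol
```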
Site-based storage allocation
So a site-consistent mirrored volume, with plexes at all the sites where you want to be consistent, is created with the command at
the top of the screen. A non-site-consistent mirrored volume is shown as well for comparison purposes. So even though your disk
group may be site consistent, if you must set one or two of your volumes to be non-site-consistent, you can still do that. Just be
aware that different things are going to happen if a hardware component fails in a site-consistent versus a non-consistent volume. So
usually you want to have all volumes either site consistent or non-site-consistent, depending on your needs.
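The two creation commands being compared on the slide might look roughly like this (disk group name, volume names, and sizes are hypothetical):

```shell
# Site-consistent mirrored volume, with a plex at every registered site:
vxassist -g maindg make sitevol 10g allsites=on siteconsistent=on

# Non-site-consistent mirrored volume, shown for comparison:
vxassist -g maindg make plainvol 10g nmirror=2 allsites=off siteconsistent=off
```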
Making changes in a site-consistent environment
Now if you want to make some changes in your site-consistent environment, here are some examples of how to do that. When
you do this, sometimes you inadvertently change the rules, or you may violate rules that you have already set up on other objects or
other volumes in the disk group. You want to make sure that all the rules are in tune with each other and logically work together,
because if they're not, they could cause some unexpected results inside the volume or volume objects. So here's an example of
adding a site, site C, which in this first case near the bottom of your slide doesn't have a mirror created that points to
site C, and so the command would fail unless you already had a mirror set up across to site C. Now notice how you can force the
command if you need to, but what's going to happen is that allsites is going to be disabled, because site C isn't truly site
consistent. So using the force flag is only recommended if you really need it, because you want something to be created at that site
and you're going to set site consistency on that object later on. This is not a recommended procedure if you want to be site
consistent across all of your sites. But nevertheless, like a lot of other Volume Manager operations with a -f force flag associated
with them, it is there if you need to use it.
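A sketch of that example, again with hypothetical names (the second form is the one that disables allsites on volumes that have no plex at site C):

```shell
# Fails if existing site-consistent volumes have no plex at siteC:
vxdg -g maindg addsite siteC

# Forcing it succeeds, but allsites is disabled on the volumes that
# are not truly consistent at siteC; only do this if you intend to
# add mirrors at siteC and re-enable consistency afterwards:
vxdg -g maindg -f addsite siteC
```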
Making an existing disk group site consistent
The other thing you want to make sure of: because this feature for site-awareness and consistency was introduced in Storage
Foundation 5.0, and the disk group version in Storage Foundation 5.0 was 140, you want to make sure your disk groups are all upgraded
to version 140 or later. And of course, to do that you're going to run the vxdg upgrade command on the disk group, which takes care of
everything inside the disk group as well, and that is an uptime, online command. If the disk group isn't upgraded, you're going to get
errors when you try to set the site flags and tags, and then you will know that you have to upgrade it. So once you've upgraded your disk
group, you're going to define the site name on each host that can access that disk group, presumably at both sites. You're going to tag
the disks in the disk group with a site name. And if you have any RAID-5 volumes, which cannot be mirrored in Volume Manager,
you're going to have to either convert those volumes to other types of volumes, such as striped-mirror volumes, or you'll have
to move them to another disk group, which will not be site-aware. You want to ensure that the volumes have plexes at each
site, then register each site with the disk group, and then make sure site consistency is on for the disk group itself
and for each volume in the disk group.
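Those steps might be sketched like this, with the same hypothetical maindg/siteA/siteB names and hypothetical disk names; the exact syntax for tagging disks can vary by release, so treat this as an outline rather than a recipe:

```shell
# 1. Upgrade the disk group to version 140 or later (online operation):
vxdg upgrade maindg

# 2. Define the site name on each host that accesses the disk group
#    (run with site=siteB on the hosts at the other site):
vxdctl set site=siteA

# 3. Tag each disk in the disk group with its site name:
vxdisk -g maindg settag site=siteA disk01 disk02

# 4. Register each site, then turn on site consistency for the
#    disk group (new and flagged volumes follow from there):
vxdg -g maindg addsite siteA
vxdg -g maindg addsite siteB
vxdg -g maindg set siteconsistent=on
```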
Recovering from failures with remote mirrors
At that point, your entire disk group and all of its objects, recursively, are site consistent.
Possible failure scenarios: Storage failure
Now, let's say you have that scenario and then on site B for example, you have a failed disk one day. So this assumes that you have
two sites, you have SAN mirror setup, you have complete disk group site consistency recursively and what happens is, in site B you
lose a disk. In earlier versions of Storage Foundation before site consistency was around, what used to happen by default is hot
relocation on site B would take over and it would move the failed disk out of the disk group and find a spare disk or another disk to
replace that disk inside the same disk group and it would automatically try to recover the volume. Now, that will still happen, but only
if your disk group is not site consistent. If you've done all the previous steps in the slides I just showed to make your disk group site
consistent, something different happens. What happens is, site A's vxconfigd daemon will detach site B from the configuration. It'll
detach the entire site, even though only one disk has failed and because all the volumes at site B were mirrored, there's really no
impact to that data at site B. But site A takes a very conservative approach to this and decides that this failure could possibly affect
the entire site and it kind of overrides the hot relocation daemon in site B and detaches the whole site from site A's configuration.
Now this does not mean that any of the applications will stop running at site B. If they're running, they can still run, but what
happens is, now site B is no longer site consistent in site A's eyes, so you want to be aware of this. This only happens when the disk
group itself is site consistent. Remember that you can also have a situation where your disk group is not site consistent, but some of
the volumes in that disk group are site consistent. This is a case where everything in the disk group, including disk group itself, is site
consistent.
Possible failure scenarios: Primary site failure
So here we have a standby host at the other site, and this scenario is different. This is a real honest to goodness site A failure where
something happens to the site, maybe from an environmental standpoint, and the site is no longer accessible, so no applications are
running now at site A. And now you most likely need the assistance of a clustered application, namely Veritas Cluster Server for
example, to help you failover all your services and service groups containing the disk groups and volumes to site B. So what happens
is the disk groups are imported with missing disks and the failed site is detached from site B's configuration. And now site B has
control of all the applications and processes running on those objects.
Possible failure scenarios: Loss of SAN connectivity
Here's a third scenario, which is different than the previous two. This scenario assumes that you've lost storage connectivity between
site A and site B, so neither site has actually failed, nor have any volumes, disks or disk groups failed at either site. This is simply a
breakage of connectivity, very similar to a heartbeat outage in a clustered environment where neither server in that clustered
environment is actually down. So the same type of thing happens between the sites, each site thinks the other is down when it's not
really down. So if this happens, each site detaches the other site from its own configuration and assumes that it's
the only site running in the whole environment right now. In this particular case, there's a possibility of getting something called
serial split brain. Now this serial split brain is a condition where the disk group configuration database copies at each site may be out
of date or they may contain a different number or different type of objects because there's been some changes since then.
Nonetheless, they are not necessarily identical at each site. Before you bring the sites back to site-aware consistency, when
you reattach the connectivity or fix the connectivity problems, you're going to have to make sure that the disk group configuration
database images are the same at each site. That means that site A has a consistent view in its database of site B, just as site
B has of itself, and vice versa.
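For diagnosing this situation, Volume Manager includes a vxsplitlines utility; a minimal sketch, assuming the same hypothetical disk group name:

```shell
# Display the conflicting configuration database copies after a
# serial split brain, showing which disks hold which version, so
# you can decide which copy of the configuration to keep when you
# re-import the disk group:
vxsplitlines -g maindg
```

The utility's output indicates how to import the disk group with the configuration copy you trust; check its manual page for the exact import options on your release.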
Recovering from a detached site
So that's part of the reason that site-awareness and consistency can help you, because if you remember from earlier, when we made our
disk group site consistent, we presumably set it up and set all the flags correctly. What that did was create copies of each site's
configuration database at the other site. So each site already has, or should have, an up-to-date copy of the other site's configuration.
And assuming nothing else happens to that database, then you should be okay. So now you have to recover from the detached-site
situation, and here are some steps on the slide that describe how to do that. You fix the hardware failure as quickly as you can, and
verify the site is back up with its associated storage, with everything recovered if it needs to be recovered and up if it needs to be up. If
a disk had failed, then the vxattachd daemon should have reattached the failed site and/or the failed disk and automatically
recovered the volumes, along with the other steps listed on the slide. This procedure is only necessary if you've had a breakage in the
SAN connections and serial split brain was a risk; it's meant to fix, or get around, the serial split brain problem. The update of the
configuration database will be done automatically when the detached site is recovered at the other site.
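If the vxattachd daemon hasn't already reattached things for you, the manual recovery might look like this sketch, with the same hypothetical names:

```shell
# After fixing the hardware or connectivity problem, reattach the
# detached site to the disk group's configuration:
vxdg -g maindg reattachsite siteB

# Then resynchronize the plexes and recover the volumes:
vxrecover -g maindg
```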
Verifying a site-aware environment
Now, nobody wants a disaster to happen at their site or at least I hope nobody wants a disaster to happen at their site.
Testing the site-aware configuration
For this purpose we offer an additional feature that comes with the site-awareness and consistency, which is called manual fire drill.
Manual fire drill is a way to test or simulate a site outage. It's typically used in a company's business continuity and disaster recovery
(BC/DR) plans, which should usually be executed at least annually, if not more often. And so this is something that you can add or implement in that
plan where you actually simulate one of these site failures or even a volume or disk failure. Now if you look at your slide, you'll see in
the middle of the slide it says, not recommended for live production systems. Let me put this into context for you. What this means is
we don't recommend doing this at a time of high I/O or high user usage on the applications that are using these volumes involved in
the test. The reason being that it could cause a real dip in performance: a lot of activity happens under the covers
here, with objects being shifted, flags being read, tags being set, and so on, and that would cause a lot of
contention on the hardware that's being used by the applications and the users. So this is recommended only when needed and only
at an opportune time maybe as part of your BCDR plan. But I think it's important to know that it's available to use because if you
ever do have a real disaster at your site, you don't necessarily know what's going to happen after the disaster or you don't know
where you're going to be or what you're going to have to do to restore yourself after the disaster. This fire drill offers a way to test
that and give you a very specific set of results because all this stuff is logged in the usual log places and you can go back to those
logs and save those in the folder or file or database where you do all your BCDR tests. So what happens is, the command that you
see here is going to perform a point in time detach of all site consistent mirrors. It's going to simulate a mirror failure in those
volumes that are used in the site consistency. And then what you have to do after it, is you have to recover the detached site, just as
the previous slides described, as if you had a real site failure. So there will be some manual administration involved in this procedure,
but I think it's a good procedure to know about and to use in these types of plans.
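A sketch of such a fire drill, using the same hypothetical disk group and site names; exact syntax may differ by release, so verify against your vxdg manual page before running it in a BC/DR exercise:

```shell
# Simulate a site failure: point-in-time detach of all of siteB's
# site-consistent mirrors:
vxdg -g maindg detachsite siteB

# ... validate your applications and record the results for your
#     BC/DR logs ...

# Recover exactly as you would after a real site failure:
vxdg -g maindg reattachsite siteB
vxrecover -g maindg
```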
Lesson summary
And that concludes Lesson 7. Thank you very much.