NetApp Filer troubleshooting blog
My Struggle with NetApp
NetApp-related technical how-tos, the storage world, and other technical discussions

How to restore data from aggregate snapshot
Today one of our users found himself in wet pants when he noticed that his robocopy job had overwritten a folder rather than appending new data to it. Panicked, he ran to me looking for a tape or snapshot backup of his original data, which unfortunately wasn't there, as he had previously confirmed that the data didn't need any kind of protection. At this point I had only one place left from which to recover the data: aggregate-level snapshots. I looked at the aggregate snapshots and saw they went back to a time when his data was still in place. Knowing that data deleted from a volume is still locked in the aggregate's snapshots, I felt good about having reserved some space for aggregate-level snapshots, something no one had ever advocated. The next step was to recover the data, but the problem was that if I reverted the aggregate using "snap restore -A", all the volumes in that aggregate would be reverted, which would be a bigger problem. So I had to go a different way: use the aggregate copy function to copy the aggregate's snapshot to an
empty aggregate and then restore the data from there. Here's the cookbook.

Pre-checks:
- The volume you lost data from is a flexible volume.
- Identify an empty aggregate to use as the destination (it can be on another controller).
- Make sure the destination aggregate is equal to or larger than the source aggregate.
- /etc/hosts.equiv has an entry for the filer you want to copy data to, and /etc/hosts has its IP address. When copying on the same controller, the loopback address (127.0.0.1) should be added to /etc/hosts and the local filer name to hosts.equiv.
- Note the name of the aggregate snapshot you want to copy.

Example: Let's say the volume we lost data from is "vol1", the aggregate holding it is "aggr_source", the aggregate snapshot containing the lost data is "hourly.1", and the empty aggregate we will copy to is "aggr_destination".

Execution:
- Restrict the destination aggregate: "aggr restrict aggr_destination"
- Start the aggregate data copy: "aggr copy start -s hourly.1 aggr_source aggr_destination"
- Once the copy is complete, bring the aggregate online: "aggr online aggr_destination"
- If you copied on the same controller, the system will rename volume "vol1" in "aggr_destination" to "vol1(1)".
- Now export the volume or LUN, and all your lost data is available.

So here's the answer to another popular question: why do I need to reserve space for aggregate-level snapshots? Do you have the answer now?
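The execution steps above can be sketched as a small helper that assembles the ONTAP command sequence. This is only an illustration: the function builds command strings (spelled as in the post); it does not talk to a filer.

```python
def aggr_restore_commands(src_aggr, dst_aggr, snapshot):
    """Build the ONTAP command sequence for recovering data from an
    aggregate-level snapshot via 'aggr copy', as described above."""
    return [
        f"aggr restrict {dst_aggr}",                              # destination must be restricted
        f"aggr copy start -s {snapshot} {src_aggr} {dst_aggr}",   # copy from the named snapshot
        f"aggr online {dst_aggr}",                                # bring the copy online
    ]

for cmd in aggr_restore_commands("aggr_source", "aggr_destination", "hourly.1"):
    print(cmd)
```

Running it prints the three commands from the Execution list, in order, ready to paste into the console.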
Often, for troubleshooting, we want to do an nslookup for a host or an NIS/LDAP lookup for a user or group. If you have a Unix system handy you can do it from there, but what if you suspect the results are not the same as what your filer gets? If you are troubleshooting a CIFS issue you are in luck with the 'cifs lookup' command; if you are dealing with a DNS or NFS issue, however, you are out of luck unless you go into advanced mode. Yes, inside advanced mode you get access to a lot of other commands, including one very nifty one, 'getXXbyYY', which is incredibly useful but hidden from admins for some strange reason. I really am not sure why NetApp thinks this shouldn't be available to end users: every time I troubleshoot I feel the need for it, and in no way do I see it making any sort of change on the filer. Anyway, here's the command. Although its help suggests "man na_getXXbyYY" for additional info, I couldn't locate that man page on my systems, so I use:

test1*> getXXbyYY help
usage: getXXbyYY <sub-command>
Where sub-command is one of:
  gethostbyname_r gethostbyaddr_r netgrp getspwbyname_r getpwbyname_r
  getpwbyuid_r getgrbyname getgrbygid getgrlist
For more information, try 'man na_getXXbyYY'

Please remember this command is not available in admin mode, and the search order depends on your /etc/nsswitch.conf entries, so before you conclude that it isn't working as expected, check these two things first. Though the sub-commands are self-explanatory, I have added a small description of each.
gethostbyname_r - Resolves a host name to an IP address using the configured DNS server (same as nslookup)
gethostbyaddr_r - Resolves an IP address to a host name using the configured DNS server (same as a reverse lookup)
netgrp - Checks netgroup membership for a given host from LDAP/files/NIS
getspwbyname_r - Displays user information using the shadow file
getpwbyname_r - Displays user information, including the encrypted password, from LDAP/files/NIS
getpwbyuid_r - Same as above, but you provide a UID rather than a user name
getgrbyname - Displays group name and GID from LDAP/files/NIS
getgrbygid - Same as above, but you provide a GID rather than a group name
getgrlist - Shows a given user's GIDs from LDAP/files/NIS

Examples:

test1*> getXXbyYY gethostbyname_r landinghost1
name: landinghost1
aliases:
addresses: 10.21.242.7

test1*> getXXbyYY gethostbyaddr_r 10.21.242.7
name: landinghost1
aliases:
addresses: 10.21.242.7

test1*> getXXbyYY netgrp support-group testhost1
client testhost1 is in netgroup support-group

test1*> getXXbyYY getpwbyname_r root
pw_name = root
pw_passwd = _J9..gsxiYTAHEtV3Qnk
pw_uid = 0, pw_gid = 1
pw_gecos =
pw_dir = /
pw_shell =

test1*> getXXbyYY getpwbyuid_r 0
pw_name = root
pw_passwd = _J9..gsxiYTAHEtV3Qnk
pw_uid = 0, pw_gid = 1
pw_gecos =
pw_dir = /
pw_shell =

test1*> getXXbyYY getgrbyname was
name = was
gid = 10826

test1*> getXXbyYY getgrbygid 10826
name = was
gid = 10826

test1*> getXXbyYY getgrlist wasadmin
pw_name = wasadmin
Groups: 10826
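On a Unix box, the same classes of lookups are available through standard interfaces; the sketch below shows rough Python equivalents of a few sub-commands. This is illustrative only: the filer's getXXbyYY consults the filer's own /etc/nsswitch.conf, not your workstation's resolver.

```python
import grp
import pwd
import socket

# Rough host-side equivalents of some getXXbyYY sub-commands.
def gethostbyname(name):          # ~ getXXbyYY gethostbyname_r
    return socket.gethostbyname(name)

def getpwbyuid(uid):              # ~ getXXbyYY getpwbyuid_r
    return pwd.getpwuid(uid)

def getgrbygid(gid):              # ~ getXXbyYY getgrbygid
    return grp.getgrgid(gid)

print(gethostbyname("localhost"))   # usually 127.0.0.1
print(getpwbyuid(0).pw_name)        # usually 'root'
```

Comparing these results with the filer's output quickly tells you whether the filer and your admin host are seeing different name-service data.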
we have a little-known command, 'availtime', right inside ONTAP which does exactly this, and it seems to have been created specifically with our bosses in mind.

HOST02*> availtime full
Service statistics as of Sat Aug 28 18:07:33 BST 2010
System  (UP). First recorded 68824252 secs ago on Mon Jun 23 04:16:41 BST 2008
        Planned   downs 31, downtime 6781737 secs, longest 6771328, Tue Sep  9 15:07:33 BST 2008
        Uptime counting unplanned downtime: 100.00%; counting total downtime: 90.14%
NFS     (UP). First recorded 68824242 secs ago on Mon Jun 23 04:16:51 BST 2008
        Planned   downs 43, downtime 6849318 secs, longest 6839978, Wed Sep 10 10:11:43 BST 2008
        Uptime counting unplanned downtime: 100.00%; counting total downtime: 90.04%
CIFS    (UP). First recorded 61969859 secs ago on Wed Sep 10 12:16:34 BST 2008
        Planned   downs 35, downtime 17166 secs, longest 7351, Thu Jul 30 13:52:25 BST 2009
        Uptime counting unplanned downtime: 100.00%; counting total downtime: 99.97%
HTTP    (UP). First recorded 47876362 secs ago on Fri Feb 20 14:08:11 GMT 2009
        Planned   downs 8, downtime 235 secs, longest 53, Wed Jan 20 14:10:18 GMT 2010
        Unplanned downs 16, downtime 4915 secs, longest 3800, Mon Jul 27 16:01:02 BST 2009
        Uptime counting unplanned downtime: 99.98%; counting total downtime: 99.98%
FCP     (DOWN). First recorded 68817797 secs ago on Mon Jun 23 06:04:16 BST 2008
        Planned   downs 17, downtime 44988443 secs, longest 38209631, Sat Aug 28 18:07:33 BST 2010
        Unplanned downs 6, downtime 78 secs, longest 21, Fri Feb 20 15:24:44 GMT 2009
        Uptime counting unplanned downtime: 99.99%; counting total downtime: 34.62%
iSCSI   (DOWN). First recorded 61970687 secs ago on Wed Sep 10 12:02:46 BST 2008
        Planned   downs 21, downtime 38211244 secs, longest 36389556, Sat Aug 28 18:07:33 BST 2010
        Uptime counting unplanned downtime: 100.00%; counting total downtime: 38.33%
I am not sure why NetApp keeps this command in advanced mode, but once you know about it, I bet you will not refrain from going into advanced mode to see how much unscheduled downtime you have had since the last reset. A shorter version is plain 'availtime'; it shows the same information as 'availtime full' but truncates the output, denoting Planned with P and Unplanned with U, which is very handy if you want to parse it in a script.

HOST04*> availtime
Service statistics as of Sat Aug 28 18:07:33 BST 2010
System  (UP). First recorded (20667804) on Wed Sep 23 09:35:49 GMT 2009
        P  5, 496, 139, Fri Dec 11 15:58:19 GMT 2009
        U  1, 1605, 1605, Wed Mar 31 17:01:41 GMT 2010
CIFS    (UP). First recorded (20666589) on Wed Sep 23 09:56:04 GMT 2009
        P  7, 825, 646, Thu Jan 21 19:08:03 GMT 2010
        U  1, 77, 77, Wed Mar 31 16:34:54 GMT 2010
HTTP    (UP). First recorded (20664731) on Wed Sep 23 10:27:02 GMT 2009
        P  3, 51, 22, Thu Jan 21 19:17:25 GMT 2010
        U  4, 203, 96, Thu Jan 21 19:08:03 GMT 2010
FCP     (UP). First recorded (20477735) on Fri Sep 25 14:23:38 GMT 2009
        P  3, 126, 92, Thu Jan 21 19:07:57 GMT 2010
        U  4, 108, 76, Wed Mar 31 16:34:53 GMT 2010

To reset the counters, use the 'reset' switch; it will zero out everything. Make sure you have recorded the statistics before you reset, because once the counters are reset you will no longer be able to get uptime details going back to when the system was built. You may want to do this only after you acquire a new system, have finished all its configuration, and it is about to start serving user requests.
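The percentages availtime prints are easy to sanity-check: uptime is just elapsed time minus downtime, over elapsed time. A back-of-the-envelope helper (not ONTAP's exact rounding):

```python
def uptime_pct(elapsed_secs, downtime_secs):
    """Percentage of elapsed time the service was up."""
    return 100.0 * (elapsed_secs - downtime_secs) / elapsed_secs

# System service above: 68824252 s elapsed, 6781737 s planned downtime
print(round(uptime_pct(68824252, 6781737), 2))  # 90.15 (availtime itself shows 90.14; it truncates)
```

The same arithmetic against the CIFS figures (17166 s of downtime over 61969859 s) reproduces the 99.97% line.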
creating a shortcut doesn't work. So here's the way to correct the problem. Download the plugin from the NOW ToolChest. Extract the zip and edit the file "package.xml", changing the string "dfmeff.exe" to "dfmeff.bat". Next, create a new batch file named "dfmeff.bat" with the contents below:

@echo off
D:\DFM\scriptplugins\dfmeff\dfmeff.exe
xcopy D:\DFM\web\*.* "C:\Program Files\NetApp\DataFabric\DFM\web" /Q/I/Y/R

Obviously you have to change the paths to match your installation, but once you have created the batch file and referenced it in the xml file you are good to go: just zip it again using any zip software and use the new zip file as the plugin source for installation in DFM. Update: Just noticed a video showing the plugin's features on the NetApp community site: http://communities.netapp.com/videos/1209
As soon as someone asks this question we all say "use ndmpcopy", but what if you don't have any network adapters configured; will ndmpcopy work? No. ndmpcopy is very useful when you want to copy a file or a whole volume, but one thing very few people know is that it doesn't work without the loopback adapter configured, because ndmpcopy passes all its data through the lo adapter; it depends not only on lo's availability but on its speed too. So how do you copy data when lo is not available? The answer is simple: use dd, the old-fashioned Unix command that does a lot of things. Not only can it copy a file by its full path name, you can even address data by disk number and block number, and the best part is that the syntax is simple: "if" for the source and "of" for the destination. It can be used not just for copying files around the system; you can also use it for testing I/O and for copying files out of snapshots, and it works regardless of permissions. A little note: if you are afraid of going into advanced or diagnostic mode, better stick with rdfile and wrfile, because this command is not available in admin mode; you have to enter advanced mode to use it. Here's the syntax:

dd [ [if= file ] | [ din= disknum bin= blocknum ] ] [ [of= file ] | [ dout= disknum bout= blocknum ] ] count= number_of_blocks

Another note: if you are using count, make sure it is a multiple of 4, because a WAFL block is 4 KB. Example:

sim1> priv set advanced
sim1*> dd if=/vol/vol0/.snapshot/hourly.2/etc/snapmirror.conf of=/vol/vol0/etc/snapmirror.conf1
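The multiple-of-4 rule is easy to get right with a tiny helper. This is a sketch built on the assumption (implied by the rule above) that count is expressed in 1 KB units, so whole 4 KB WAFL blocks need counts that are multiples of 4:

```python
def dd_count(size_kb, wafl_block_kb=4):
    """Round a transfer size up to whole 4 KB WAFL blocks and return a
    dd 'count' in 1 KB units (hence always a multiple of 4)."""
    whole_blocks = -(-size_kb // wafl_block_kb)   # ceiling division
    return whole_blocks * wafl_block_kb

print(dd_count(10))   # 10 KB spans 3 WAFL blocks -> count=12
```

So for a 10 KB file you would pass count=12, never count=10.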
are also there, used by ONTAP for its internal work, but that's all buried deep under its IP-protection policy, so I couldn't get any information on it. Now, how do you get information about this from your system? If you look closely you will see that 'vol options' includes an option 'no_i2p' (default off, and only in 7.1 and later) to enable or disable the i2p feature. If you go into advanced mode you can see a few more related commands: 'inodepath', which shows the i2p information stored in a given inode; 'wafl scan status', which shows running i2p scans, which can be aborted with 'wafl scan abort scan_id'; and 'wafl scan speed new_speed' to change the scan speed after listing the current speed with 'wafl scan speed'. Having this information pushed me to think: most of the volumes on my systems are NFS-only and don't need any virus scan, and we don't use dump, fpolicy or any other dependent feature, so why not turn i2p off and get some extra juice out of the system? But speaking with the chaps, it turned out that turning it off wouldn't be a good idea, though they too were not sure what would break; and since it has very little performance impact, better to leave it untouched. And yes, it's true that the impact is small: in general we don't do so much metadata modification that the i2p workload would hurt the system. However, when you upgrade from an earlier release to the 7.1-or-later family, the system gets very busy creating i2p information for every file, directory, link and so on, and may run at high utilization for quite some time. At that point you may wish to use these commands to pull your system back to a normal state quickly: let the scan run at low speed, or one volume at a time, or stop it completely if you want to revert the system to a pre-7.1 release.
I think if I get some time I would like to do some extensive testing and see what comes out; in the meantime, if anyone else knows more, please share your knowledge.
also) it takes an additional 30 seconds for the surviving node to declare its partner dead and start the takeover process, which brings the total to a whopping 120 seconds. Now let's look at another scenario: an NDU software update. NDU means Non-Disruptive Update; that is, during a software update I can fail over and fail back between partner nodes, simulating a reboot to put the new code into effect, without any downtime. But as per NetApp KB-22909, failover and failback can each take as much as 180 seconds. How can 180 seconds of downtime on each controller be called non-disruptive? That is the worst case; what I have seen with my systems so far is that they take less than a minute for failover and failback: 35 seconds on my V6080 systems and 22 seconds on a V3170 filer, rather than 90 seconds (observed with ping), and both are loaded with multiple vifs, CIFS shares, NFS exports, qtree quotas and SnapMirror. Your mileage may vary, as it depends on system configuration, but that's not bad for a NAS-only environment. To prove this, a few weeks back we ran some tests on our V3170 systems to check the VMs of a new project, as the team wanted to see the effect when a system goes offline due to a hardware failure or any other reason, and all 300-odd VMs kept running without any glitches. During the test we ran a script on all 300-odd Linux VMs that used dd to write and delete a 100 MB file every 2 seconds on /root and /tmp; a few were modified to write a 500 MB file or a 1 MB file. While all the VMs were running the script, we performed failover/failback as well as a hard reboot, which left I/O suspended for a whopping 3 minutes; surprisingly, none of the VMs hit a kernel panic, went read-only, or stopped writing, although during the filer reboot they were pathetically slow or frozen. Now you must be wondering why none of the VMs crashed, when 180 seconds of no response from disk will bring any OS to its knees. So what was special?
Well, here's the magic: if you look into the VM best practices and search the NetApp site for VM disk timeout settings, you will find they recommend raising the guest disk timeout to 190 seconds so the VM can survive any kind of controller reboot or failover, and that is what lets them call it non-disruptive. So next time someone says that with an active/active cluster you don't have any downtime, don't forget to ask how they handle a system crash or an upgrade; and if you want to deploy VMs over NetApp heads, don't forget to change that parameter, otherwise even a 22-second I/O pause will make a big impact on your VM environment.
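The arithmetic behind that recommendation fits in a tiny check. A sketch using the 190-second timeout cited above and the pause durations observed in these tests:

```python
def guest_survives(io_pause_secs, disk_timeout_secs=190):
    """True if the guest's disk timeout outlasts the I/O pause."""
    return io_pause_secs < disk_timeout_secs

# Pauses observed/cited in the post, checked against the default Linux
# SCSI timeout (60 s) and the recommended 190 s setting.
for event, pause in [("V3170 failover", 22), ("V6080 failover", 35),
                     ("worst-case NDU failover", 180)]:
    for timeout in (60, 190):
        verdict = "survives" if guest_survives(pause, timeout) else "risks I/O errors"
        print(f"{event} ({pause}s) with {timeout}s timeout: {verdict}")
```

With the stock 60-second timeout a worst-case 180-second failover is fatal; at 190 seconds every event above stays under the limit, which is exactly why the best-practice number is 190.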
Defragmentation in NetApp
Usually we face this problem with our PCs: we defrag our volumes, clear temp files and what not, and most of the time that solves the problem, though not fully. In NetApp we don't have to deal with a fragmented registry or temp files, but due to the nature of the WAFL file system a volume gets fragmented very quickly, soon after you start overwriting, deleting and adding data. So what do you do then? The answer is very simple: use the 'reallocate' command. Yes, this is NetApp's defrag tool, built right into the ONTAP OS. First you have to turn reallocation on with the 'reallocate on' command; the same command with the 'off' switch turns it off. It can be run not only on volumes; in fact you can run it on a file, a LUN, or an aggregate itself. However, I should warn you that optimizing a LUN may give no performance benefit, or may even make things worse, as ONTAP has no clue what is inside the LUN or how its file system is laid out. If you want to run reallocation only once, use the -f or -o switch; if you want ONTAP to keep track of your file system and optimize the data when it feels necessary, control it with the -i switch or schedule it with the 'reallocate schedule' command. To check the current optimization level of a volume, use 'reallocate measure -o', or if you feel adventurous use 'wafl scan measure_layout' from advanced mode, though I don't suggest using the wafl set of commands in general use; but yes, sometimes you want to do something different. This command is pretty straightforward and harmless (apart from extra load on CPU and disk), so you can play with it, but you should always consider the -p switch for volumes with snapshots and/or SnapMirror, to keep snapshot sizes small.
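The switch logic described above can be encoded in a small helper. A sketch only: it assembles a command string following the post's guidance (the -p rule for snapshot volumes is the post's advice, not an official policy), and does not run anything:

```python
def reallocate_cmd(path, once=False, interval_hours=None, has_snapshots=False):
    """Assemble a 'reallocate start' command per the guidance above."""
    cmd = ["reallocate", "start"]
    if once:
        cmd.append("-f")                     # one-time forced reallocation
    elif interval_hours is not None:
        cmd += ["-i", str(interval_hours)]   # recurring interval scan
    if has_snapshots:
        cmd.append("-p")                     # keep snapshot growth small
    cmd.append(path)
    return " ".join(cmd)

print(reallocate_cmd("/vol/vol1", once=True, has_snapshots=True))
```

For a snapshotted volume this yields "reallocate start -f -p /vol/vol1", a one-off optimization that keeps snapshot deltas small.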
How to get the list of domain users added to a filer without fiddling with SIDs
There were numerous times when I wanted to see an AD user's permissions on the filer, but just locating that user on the system took me a lot of time. Why? Because ONTAP shows domain users added to the system in SID format rather than by name, which is very annoying: when it dumps the SIDs on screen, you have to use the 'cifs lookup' command to hunt through that bunch of SIDs for the user you are looking for. So here's a handy little Unix script to list all the AD users added to a filer by user name rather than SID. I have already set up passwordless login to the filer, so I haven't added the username and password fields; if you haven't done that, add your login credentials after the name of the filer in the command below.

rsh useradmin domainuser list -g "Administrators" | sed 's/^S/rsh cifs lookup S/'

This command displays the AD users added to the Administrators group; if you want to see users from any other group, replace the word Administrators with that group's name.
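The sed step simply turns each SID line into a 'cifs lookup' invocation. Here is the same text transformation in Python (a sketch of the string processing only; it does not contact a filer, and the filer name is a placeholder):

```python
def sid_lines_to_lookups(output, filer="testfiler"):
    """Turn 'useradmin domainuser list' output (one SID per line) into
    the 'cifs lookup' commands the sed one-liner would produce."""
    return [f"rsh {filer} cifs lookup {line.strip()}"
            for line in output.splitlines()
            if line.strip().startswith("S-")]   # keep only SID lines

sample = "S-1-5-21-100-200-300-500\nS-1-5-21-100-200-300-512\n"
for cmd in sid_lines_to_lookups(sample):
    print(cmd)
```

Feeding the generated commands back to a shell (as the sed pipeline does) resolves each SID to a readable user name.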
without any interference from the backup team. Everyone is happy, so you are happy, but all of a sudden on a Friday evening you get a call from the VP of Marketing, crying on the phone that he has lost all the data on his network drive: Windows shows a recovery time of 2 hours, but he wants his 1 GB PST accessible now, as he is on VPN with a client and needs to pull some old mail from it. That's nothing abnormal: he had lots of data, and to recover it Windows has to read everything from the snapshot and write it back to the network drive, which obviously takes time. Now what would you say? Will you tell him to navigate to his PST and recover just that file first (which shouldn't take much time on a fast connection) and then recover the rest of the data, or will it be "OK, I have recovered all your data" while still on the phone, making you the hero? I must say I would take the opportunity to become the hero with a minute or less of work, but before we do, a few things to note.

For volume SnapRestore:
- The volume must be online and must not be a mirror.
- When reverting the root volume, the filer will be rebooted.
- Non-root volumes do not require a reboot; however, when reverting a non-root volume, all ongoing access to the volume must be terminated, just as when a volume is taken offline.

For single-file SnapRestore:
- The volume used for restoring the file must be online and must not be a mirror.
- If restore_as_path is specified, the path must be a full path to a filename and must be in the same volume as the one used for the restore.
- Files other than normal files and LUNs are not restored; this includes directories (and their contents) and files with NT streams.
- If there is not enough space in the volume, the single-file SnapRestore will not start.
- If the file already exists in the active file system, it will be overwritten with the version in the snapshot.

To restore data there are two ways: first, system admins using the 'snap restore'
command, invoked by SMO, SMVI, FilerView, or the system console; and second, end users restoring by copying files from the .snapshot or ~snapshot directory, or by using the revert (Previous Versions) function in XP or newer systems. Restoring data through the snap restore command is very quick (seconds), even for TBs of data. The syntax is:

snap restore -t vol|file [-s <snapshot name>] [-r <restore_as_path>] <path>

If you don't want to restore the data to a different place, remove the "-r" argument and the filer will replace the current file with the version in the snapshot; and if you don't provide a snapshot name, the system will show you all available snapshots and prompt you to select the one to restore from. Here's the simplest form of the command, as an example of recovering a file:

testfiler> snap restore -t file /vol/testvol/RootQtree/test.pst
WARNING! This will restore a file from a snapshot into the active filesystem. If the file already exists in the active filesystem, it will be overwritten with the contents from the snapshot.
Are you sure you want to do this? yes
The following snapshots are available for volume testvol:
date          name
------------  --------
Nov 17 13:00  hourly.0
Nov 17 11:00  hourly.1
Nov 17 09:00  hourly.2
...           hourly.4
Nov 16 17:00  nightly.1
...           nightly.4
...           weekly.2
Nov 11 00:00  weekly.3
Which snapshot in volume testvol would you like to revert the file from? nightly.5
You have selected file /vol/testvol/RootQtree/test.pst, snapshot nightly.5
Proceed with restore? yes
testfiler>
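The choice between full-volume and single-file restore maps directly onto the command flags. A small builder (a sketch that only formats the command line; it does not execute anything) makes the shape explicit:

```python
def snap_restore_cmd(path, snapshot=None, restore_as=None, kind="file"):
    """Build a 'snap restore' command line. kind is 'file' or 'vol';
    omitting snapshot makes ONTAP prompt interactively, as above."""
    parts = ["snap", "restore", "-t", kind]
    if snapshot:
        parts += ["-s", snapshot]
    if restore_as:
        parts += ["-r", restore_as]   # restore to a different path
    parts.append(path)
    return " ".join(parts)

print(snap_restore_cmd("/vol/testvol/RootQtree/test.pst", snapshot="nightly.5"))
```

The printed line matches the non-interactive form used later in these posts; dropping the snapshot argument reproduces the interactive prompt shown above.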
schedule, and here you specify how many weekly, daily, and hourly snapshots you want to retain, as well as at what times the hourly snapshots are taken. In the given example, volume testvol keeps 4 weekly, 7 daily, and 7 hourly snapshots, with the hourly snapshots taken at 9, 11, 13, 15, 17, 19 and 21 hours system local time. Please make sure that 'nosnap' is set to off in the volume options.

How to take snapshots manually? Run the command below, where volume name is the volume you want to snapshot and snapshot name is the name you want to identify the snapshot by:

snap create <volume name> <snapshot name>

How to list snapshots? You can check the snapshots associated with any volume with:

snap list <volume name>

After issuing the command you will get output similar to this:

testfiler> snap list testvol
Volume testvol
working...
  %/used       %/total      date          name
----------  ----------  ------------  --------
 36% (36%)    0% ( 0%)  Dec 02 16:00  hourly.0
 50% (30%)    0% ( 0%)  Dec 02 12:00  hourly.1
 61% (36%)    0% ( 0%)  Dec 02 08:00  hourly.2
 62% ( 5%)    0% ( 0%)  Dec 02 00:01  nightly.0
 69% (36%)    0% ( 0%)  Dec 01 20:00  hourly.3
 73% (36%)    0% ( 0%)  Dec 01 16:00  hourly.4
 77% (36%)    0% ( 0%)  Dec 01 00:01  nightly.1

What if you are running low on snap reserve? Sometimes, due to an excessive rate of change in the data, the snapshot reserve fills up quickly and snapshots spill over into the data area of the volume. To remediate this you have to either extend the volume or delete old snapshots. To resize the volume use the 'vol size' command; to delete old snapshots use the 'snap delete' command, which I will cover in the next section. Before deleting, if you want to check how much free space you can gain from a snapshot, use:

snap reclaimable <volume name> <snapshot name> [<snapshot name> ...]

Running this command gives output like the example below, and you can add multiple snapshot names one after another if deleting a single snapshot does not free the required space.
Please note that you should select snapshots for deletion only in oldest-to-newest order; otherwise, blocks freed by deleting a middle snapshot will still be locked by the snapshots that follow it.

testfiler> snap reclaimable testvol nightly.1 hourly.4
Processing (Press Ctrl-C to exit) ............
snap reclaimable: Approximately 9572 Kbytes would be freed.

How to delete a snapshot? Use snap delete with the volume name and snapshot name:

snap delete <volume name> <snapshot name>

Running this command prints information similar to:

testfiler> snap delete testvol hourly.5
Wed Dec 2 16:58:29 GMT [testfiler: wafl.snap.delete:info]: Snapshot copy hourly.5 on volume testvol NetApp was deleted by the Data ONTAP function snapcmd_delete. The unique ID for this Snapshot copy is (67, 3876).

How to know the actual rate of change? Sometimes a particular volume keeps running out of snap reserve space, as snapshots fill it up well before the old snaps expire and are deleted by the autodelete function (if you have configured it), and you will want to resize the snap reserve accurately to
avoid any issues. To check the actual rate of change, in KB per hour, calculated across all snapshots or between two snapshots on a given volume, you can use the snap delta command:

snap delta <volume name> [<1st snapshot name> <2nd snapshot name>]

testfiler> snap delta testvol
Volume testvol
working...
From Snapshot   To                   KB changed   Time       Rate (KB/hour)
--------------  -------------------  -----------  ---------  --------------
hourly.0        Active File System         30044  0d 00:28        63176.635
hourly.1        hourly.0                     552  0d 02:00          276.000
hourly.2        hourly.1                     468  0d 03:00          155.956
...
and deleting snapshots, but what good is all of that if you don't know how to restore the data from the snapshots you have worked so hard to keep? So, in the next post I will address how to restore data from a snapshot through the snap restore command.
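The Rate column in snap delta output is simply KB changed divided by elapsed hours, so the rows can be sanity-checked in a couple of lines (a sketch using the figures from the example above):

```python
def delta_rate(kb_changed, days, hours, minutes):
    """Rate in KB/hour for a snap delta row (time shown as e.g. '0d 02:00')."""
    elapsed_hours = days * 24 + hours + minutes / 60
    return kb_changed / elapsed_hours

print(round(delta_rate(552, 0, 2, 0), 3))  # hourly.1 -> hourly.0: 276.0 KB/hour
```

Summing the per-interval rates over a typical day tells you how big the snap reserve needs to be to absorb the change between scheduled deletions.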
Snapshots in NetApp
Volumes and data: The volume used for the test was a flexible volume named "buffer_aggr12"; the data was the "My Documents" folder from my laptop, kept in sync with a CIFS share on buffer_aggr12 using Microsoft's sync tools. Snapshot configuration: Scheduled snapshots were configured at 9, 11, 13, 15, 17, 19, 21 hours, with a retention of 4 weekly, 7 daily and 7 hourly snapshots and a 20% space reserve for snapshots. The coolest part of snapshots is the flexibility: as an administrator, once you have configured them you no longer have to look after them, as snapshots are taken on the defined schedule, and if you have configured 'snap autodelete' it will also purge expired snapshots per your retention period. So effectively you never have to worry about managing hundreds of old snapshots lying in the volume eating up space (except when the rate of change of the data overshoots and snapshots start spilling into the data area). As an
end user, you experience backups a click away, because snapshots integrate well with the shadow copy services of Windows 2000, XP and Vista, and you can recover data whenever you need. Here's the snapshot configuration for my test volume "buffer_aggr12":

AMSNAS02> snap sched buffer_aggr12
Volume buffer_aggr12: 4 7 7@9,11,13,15,17,19,21
AMSNAS02> snap reserve buffer_aggr12
Volume buffer_aggr12: current snapshot reserve is 20% or 157286400 k-bytes.

As I had been running this test for months, there were enough snaps for me to play with, and you can see below that these snapshots go back to the 20th of July, a 4-week-old snapshot, and I can recover from it any time with just a right-click.

How to recover files or folders from a snapshot: There are two ways to recover data from snapshots. As an end user, you can recover your data from Windows Explorer by right-clicking in an empty space while you are in the share in which you lost your data. Here's an example:
a) This is a screenshot of my share folder; as you can see, my PST file is corrupted and shows 0 KB.
b) To recover it, right-click on any empty area and go to Properties > Previous Versions; it shows all the snapshots taken for this folder, as in the screenshot below.
c) At this point I can either revert the whole folder to a previous state or copy it to another location to recover a deleted file, but here my aim is to revert a corrupted file rather than recover a deleted one. So I right-click on the file itself and open the Previous Versions tab of the properties dialogue. It shows the changes captured by snapshots at different times, so I just select the date I want to revert to and click Restore.
d) It then starts replacing the corrupted file with the copy held by the snapshot.
It's taking a long time because the file in question is over 1 GB and I am on a WAN link, so it's slow; but there is another way, recovering directly from the filer console, which takes seconds but unfortunately isn't available to end users.
e) Here are my before and after screenshots.

As an administrator, you can recover a file, folder or whole volume within seconds, because when you do it from the filer console the system doesn't have to copy the old file from the snapshot to a temp location, delete the old file and then fix the recovered file's metadata; it simply changes the block pointers internally, so it's blazing fast. Here's an example.

a) In this test I will again use the same corrupted PST file, but this time we will recover it from the console. First log in to the filer and do a snap list to see which snapshots are available.

AMSNAS02> snap restore buffer_aggr12
Volume buffer_aggr12
working...
  %/used       %/total      date          name
----------  ----------  ------------  --------
  0% ( 0%)    0% ( 0%)  Aug 14 17:00  hourly.0
  0% ( 0%)    0% ( 0%)  Aug 14 15:00  hourly.1
 40% ( 0%)    0% ( 0%)  Aug 14 11:00  hourly.3
...
 41% ( 0%)    0% ( 0%)  Aug 13 00:00  nightly.1
 57% ( 0%)    0% ( 0%)  Aug 11 00:00  nightly.3
...
b) Say you want to restore from nightly.5: you give the command below and it recovers the file in just a second.

AMSNAS02> snap restore -t file -s nightly.5 /vol/buffer_aggr12/RootQtree/test.pst
WARNING! This will restore a file from a snapshot into the active filesystem. If the file already exists in the active filesystem, it will be overwritten with the contents from the snapshot.
Are you sure you want to do this? yes
You have selected file /vol/buffer_aggr12/RootQtree/test.pst, snapshot nightly.5
Proceed with restore? yes
AMSNAS02>

c) Here's the screenshot of my folder, confirming the file is back in its previous state. As you can see, it is quite easy to use and very useful too, but to have snapshots you need some extra space reserved in the volume, especially if your data changes very frequently: more changes mean more space needed to store the changed blocks. The situation gets more complicated if you are trying to snapshot a VM, Exchange or database volume, because before the snapshot is taken the application has to put itself into hot-backup mode so that a consistent copy can be made. Most applications have this functionality, but you have to use a script or a SnapManager so that when the application is prepared it can tell the filer to take the snapshot, and once the snapshot is taken the filer can tell the application to resume its normal activity.
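The schedule string shown earlier ("4 7 7@9,11,13,15,17,19,21") packs the whole retention policy into one line. A small parser (a sketch of the format as described in these posts: weekly count, daily count, then hourly count with the hours after an @) makes it readable:

```python
def parse_snap_sched(sched):
    """Parse an ONTAP snap sched string like '4 7 7@9,11,13,15,17,19,21'
    into (weekly, daily, hourly, hours at which hourlies run)."""
    weekly, daily, hourly_part = sched.split()
    if "@" in hourly_part:
        hourly, at = hourly_part.split("@")
        hours = [int(h) for h in at.split(",")]
    else:
        hourly, hours = hourly_part, []
    return int(weekly), int(daily), int(hourly), hours

print(parse_snap_sched("4 7 7@9,11,13,15,17,19,21"))
```

This returns (4, 7, 7, [9, 11, 13, 15, 17, 19, 21]): the 4-weekly/7-daily/7-hourly retention used for the buffer_aggr12 test volume.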
net for this and, fortunately enough, I found a way on the NOW site to get this working. It was recorded under the Bugs section with Bug ID #80611, which reads: "There is an unsupported undocumented feature of the /etc/snapmirror.allow file, such that if it is filled as follows: hostA:vol1 hostA:vol29 hostB:/vol/vol0/q42 hostC and 'options snapmirror.access legacy' is issued, then the desired access policy will be implemented. Again note that this is unsupported and undocumented so use at your own risk." So although NetApp says there is a way to do it, they also say it may break other functionality or may not work as expected. Having found this I sent the details to my friend, but unfortunately he didn't want to try it on his production systems, and he has no test systems available. So if any of you want to try it, or have tried it before, please share your experience in the comments.
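Laid out as it would actually appear on disk, the unsupported configuration from the bug report looks like this (the host and volume names are the examples from the bug text, one destination per line):

```
# /etc/snapmirror.allow  (unsupported, per Bug ID 80611)
hostA:vol1
hostA:vol29
hostB:/vol/vol0/q42
hostC
```

followed by, on the filer console:

```
options snapmirror.access legacy
```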
directory and uses the vcsagent user, which does not have access to run the version command. Everything was looking good, so I started running tests by moving a resource from one node to another, but to my surprise they were failing to make changes on the filer, and the filer audit logs showed the nodes were still using root for ssh. Until I ran the tests I had assumed the agent simply relied on the OS for the ssh username, since NetApp had not set any username attribute in the agent and I had not configured in the OS which account to use; so when the agent executed "ssh testfiler1", the OS would connect as root (the cluster node's local logged-in user). My failed tests made me believe the username was hardcoded in the agent script, so I started looking and soon found this line in NetApp_VCS.pm:

$cmd = "$main::ssh -n root\@$host '$remote_cmd'";

After this finding it was not hard to figure out what was going wrong and what I had to do. I just removed "root\@" from the script and it started working, because the agent now uses the config file from the .ssh directory and vcsagent as the username. Alternatively I could have replaced "root" with "vcsagent" directly in the script, which is simpler and avoids maintaining a config file, but I felt my approach was better. Unfortunately, to this day there is no alternative to modifying the script, as neither NetApp nor Veritas could help us beyond the statement "we will raise a product enhancement request".

Update: You also need to give the user the "security-priv-advanced" capability, so the role should look like this:

testfiler01> useradmin role list exportfs
Name: exportfs
Info: To manage NFS exports from CLI
Allowed Capabilities: cli-exportfs*,cli-lock*,cli-priv*,cli-sm_mon*,security-priv-advanced
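For reference, a role and user like the one above could be created on the filer along these lines. This is a sketch: the role name and capabilities mirror the listing above, while the group name exportfs_admins is a hypothetical example.

```
testfiler01> useradmin role add exportfs -a cli-exportfs*,cli-lock*,cli-priv*,cli-sm_mon*,security-priv-advanced
testfiler01> useradmin group add exportfs_admins -r exportfs
testfiler01> useradmin user add vcsagent -g exportfs_admins
```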
with lablincl1n1, lablincl1n2, lablincl1n3, lablincl1n4 (the names of all the cluster nodes), and updated ClearNFSLocks to 2 and UseSSH to 1. I left the rest of the options untouched as their defaults were fine (FilerPingTimeout=240, RebootOption=empty, HostingFilerName=empty, RouteViaAddress=empty), along with MultiNIC and the /etc/hosts file, because NIC teaming was done at the OS level and I felt too lazy to maintain lots of IP addresses in the hosts file; as a matter of fact I knew our BIND servers were robust enough.

Note: Don't get confused by the HostingFilerName field; you only need it if you are using a vfiler. If you are exporting an NFS volume from a vfiler, put the vfiler name in the FilerName field and the physical filer name (on which the vfiler is created) in HostingFilerName.

The next step was configuring SSH, which was pretty easy: just use "ssh-keygen -t dsa" to generate the public and private keys for root on all your nodes, and copy the public keys into the "authorized_keys" file in the folder /etc/sshd/root/.ssh on your filer. With that, the configuration was complete and everything was working as expected within about 4 hours of effort.

At this point everything was done except one very important thing: security. Following the agent's admin guide I had added the dsa keys to root's authorized_keys file, so anyone with root access on any of the 4 cluster nodes would also have root access on my filer, which I wasn't comfortable with. I started looking through the agent's attributes for a way to configure a different account name for the agent, but to my surprise nothing was there, and none of the documents covered it, so I went my own way and solved it with some extra effort. As this post is getting quite big, I will cover configuring a different user name in the VCS agent in the next post.
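The key-generation step can be sketched as below. Note this is a generic OpenSSH sketch: the agent guide calls for "-t dsa", but recent OpenSSH releases have dropped DSA key generation, so an RSA key is shown here; the destination path on the filer is the one from the post.

```shell
#!/bin/sh
# Generate a passphrase-less keypair for a cluster node (run on each node).
KEYDIR=$(mktemp -d)
ssh-keygen -q -t rsa -N "" -f "$KEYDIR/id_rsa"

# The public key line is what gets appended to the filer's
# /etc/sshd/root/.ssh/authorized_keys file:
PUBKEY=$(cat "$KEYDIR/id_rsa.pub")
echo "$PUBKEY"
```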
VCS to bring resources on line, monitor their status, and take them off line as needed. Key features for version 5.0 of the agent are given below.

- Supports VCS 4.1 and 5.0 *
- Supports exportfs persistency
- Supports IPMultiNIC and MultiNICA
- Supports Data ONTAP 7.1.x or later
- Supports fine-granularity NFS lock clearing (requires Data ONTAP 7.1.1 or later)
- Supports communication with the storage system through SSH, in addition to RSH
- Multithreading (NumThreads > 1) is supported (requires IPMultiNIC with MultiNICA)
- Supports automatic fencing of exports to ro access for the other nodes in the cluster as a resource moves from one node to another
- Supports failover of a single resource group when multiple resource groups of the same type are active on the same cluster node

Kernel requirement: Linux kernel 2.6.9-34.EL, 2.6.9-34.ELsmp for RHEL; 2.6.5-7.287.3smp for SUSE
* VCS 4.1 is not supported for SUSE Linux
# With Solaris 10, local zones are also supported in addition to global zones.

In the next part I will post how to implement it, which will also need some modification to the script.

References: NetApp NFS Client for VCS on RHEL; NetApp NFS Client for VCS on Solaris; NetApp NFS Client for VCS on SUSE Linux
my organization's security policy; more importantly, I can get rid of pending space allocation requests.
broken disk ID, which you can get from "aggr status -r". Now go to advanced mode with "priv set advanced" and run "disk unfail <disk_id>". At this stage your filer will throw some 3-4 errors on the console, in syslog or as SNMP traps, depending on how you have it configured, but this was the final step and the disk should now be good, which you can confirm with "disk show" for detailed status or with the "sysconfig -r" command. Give it a few seconds to recognize the changed status of the disk if the change doesn't show up at first.

Scenario 2: Two disks have failed in the same raid group and I don't have any spare disk in my system.

In this case you are really in big trouble, because you always need at least one spare disk available in your system; NetApp recommends a 1:28 ratio, i.e. one spare for every 28 disks. With a dual disk failure you have a very high chance of losing your data if another disk goes while you are rebuilding onto a spare or while you are waiting for new disks to arrive. So always keep a minimum of 2 spare disks in your system; one disk is also fine and the system will not complain, but if you leave the system with only one spare the maintenance center will not work and the system will not scan disks for potential failure. Now, in the situation above where you have a dual disk failure with no spares available, the best bet is to ring NetApp to replace the failed disks ASAP; or, if you feel you are losing your patience, select a disk of the same type from another healthy system, do a "disk fail", remove it, and swap it with a failed disk in the other system. If the disk shows a partial/failed volume after being added to the other filer, make sure the volume reported as partial/failed belongs to the newly inserted disk by using "vol status -v" and "vol status -r"; if so, just destroy the volume with "vol destroy" and then zero out the disk with "disk zero spares".
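Put together, the un-fail sequence above looks like this on the console (disk ID 2a.17 is only an example):

```
filer> aggr status -r          # note the broken disk ID, e.g. 2a.17
filer> priv set advanced
filer*> disk unfail 2a.17      # expect a handful of console/syslog warnings here
filer*> priv set
filer> sysconfig -r            # confirm the disk is back (allow ~30 seconds)
```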
This exercise should not take more than 15 minutes (except the disk zeroing, which depends on your disk type and capacity), and you will then have a single disk failure in two systems, each of which can survive another disk failure. But what if you don't do that and keep running your system with a dual disk failure? Your system will shut itself down after 24 hours; yes, it will shut itself down without any failover, to get your attention. There is a registry setting to control how long your system keeps running after a disk failure, but I think 24 hours is a good value, and you shouldn't increase or decrease it unless you don't care about the data sitting there and the people accessing it.

Scenario 3: My drive failed but there is no disk with an amber light.

A number of times this happens because the disk electricals have failed and the system can no longer recognize the disk as part of it. In this situation you first have to identify the disk name. There are a couple of methods to find which disk has failed:
a) "sysconfig -r" — look in the broken disk list
b) Check the autosupport message for the failed disk ID
c) "fcadmin device_map" — look for a disk with "XXX" or a "BYP" message
d) Look in /etc/messages for a failed or bypassed disk warning, which gives the disk ID

Once you have identified the failed disk ID, run "disk fail <disk_id>" and check whether the amber light comes on; if not, use "blink_on <disk_id>" in advanced mode to turn on the disk LED, or if that fails, turn on the adjacent disks' lights with the same blink_on command so you can identify the disk correctly. Alternatively, you can use the led_on command instead of blink_on to turn on the LEDs of the disks adjacent to the defective one rather than its red LED. If you use the auto-assign function the system will assign the replacement disk to the spare pool automatically; otherwise use the "disk assign <disk_id>" command to assign the disk to the system.

Scenario 4: The disk LED remains orange after replacing the failed disk.

This happens because you were in a hurry and didn't give the system enough time to recognize the change. When the failed disk is removed from its slot, the disk LED remains lit until Enclosure Services notices and corrects it, which generally takes around 30 seconds after removing the failed disk. As you have already done it, use the led_off command from advanced mode; if that doesn't work because the system believes the LED is off when it is actually on, simply turn the LED on and then back off again using "led_on <disk_id>" followed by "led_off <disk_id>".

Scenario 5: Disk reconstruction failed.

A number of issues can make RAID reconstruction fail on a new disk, including enclosure access errors, a file system disk not responding or missing, or a spare disk not responding or missing; however, the most common reason for this failure is outdated firmware on the newly inserted disk. Check whether the new disk has the same firmware as the other disks; if not, update its firmware first, and reconstruction should then finish successfully.

Scenario 6: Disk reconstruction stuck at 0% or failed to start.

This might be an error, or due to a limitation in ONTAP: no more than 2 reconstructions should be running at the same time.
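The identify-and-fail sequence above, written out as console commands (disk ID 2a.17 is again just an example, and 2a.18 a hypothetical neighbour):

```
filer> disk fail 2a.17        # mark the suspect disk failed
filer> priv set advanced
filer*> blink_on 2a.17        # flash the disk's LED to locate it in the shelf
filer*> led_on 2a.18          # or light an adjacent disk's LED instead
filer*> led_off 2a.18
filer*> priv set
filer> disk assign 2a.17      # only needed if disk auto-assignment is off
```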
An error you might see at times occurs because the RAID was in a degraded state and the system went through an unclean shutdown, so the parity is marked inconsistent and needs to be recomputed after boot. However, since parity recomputation requires all data disks to be present in the raid group, and we already have a failed disk in the RG, the aggregate will be marked WAFL_inconsistent. You can confirm this condition with the "aggr status -r" command. If this is the case, you have to run wafliron, giving the command "aggr wafliron start <aggr_name>" while in advanced mode. Make sure you contact NetApp before starting wafliron, as it will unmount all the volumes hosted in the aggregate until the first phase of checks is completed. The time wafliron takes to complete the first phase depends on many variables (size of the volumes/aggregate/raid groups, number of files/snapshots/LUNs, and more), so you can't predict how long it will take; it might be 1 hour or it might be 4-5 hours. So if you are running wafliron, contact NetApp first.
In this post I have tried to cover mount options and other Solaris settings for higher NFS throughput. It is geared towards the 64-bit version, although these settings apply to 32-bit as well; a few extra settings come into play on 32-bit, like super caching. I compiled this list long ago and it is still very handy when I get complaints about low performance. For further details, see the references section.

Mount options:
rw,bg,hard,nointr,rsize=32768,wsize=32768,vers=3,proto=tcp

Kernel tuning (on Solaris 10, the following kernel parameters should be set to the shown value, or higher):

Parameter                  Replaced by (resource control)   Recommended minimum value
noexec_user_stack          NA                               1
semsys:seminfo_semmni      project.max-sem-ids              100
semsys:seminfo_semmns      NA                               1024
semsys:seminfo_semmsl      project.max-sem-nsems            256
semsys:seminfo_semvmx      NA                               32767
shmsys:shminfo_shmmax      project.max-shm-memory           4294967296
shmsys:shminfo_shmmni      project.max-shm-ids              100

Solaris file descriptors:
rlim_fd_cur — "soft" limit on the number of file descriptors (and sockets) a single process can have open
rlim_fd_max — "hard" limit on the number of file descriptors (and sockets) a single process can have open
Setting these values to 1024 is strongly recommended to avoid database crashes resulting from Solaris resource deprivation.

Network settings:

Parameter                   Value   Details
/dev/tcp tcp_recv_hiwat     65536   Increases the TCP receive buffer (high watermark)
/dev/tcp tcp_xmit_hiwat     65536   Increases the TCP transmit buffer (high watermark)
/dev/ge adv_pauseTX         1       Enables transmit flow control
/dev/ge adv_pauseRX         1       Enables receive flow control
/dev/ge adv_1000fdx_cap     1       Forces full duplex for GbE ports

sq_max_size — sets the maximum number of messages allowed for each IP queue (STREAMS synchronized queue). Increasing this value improves network performance.
A safe value for this parameter is 25 for each 64MB of physical memory in a Solaris system, up to a maximum of 100. The parameter can be tuned by starting at 25 and incrementing by 10 until network performance reaches a peak.

nstrpush — determines the maximum number of modules that can be pushed onto a stream; should be set to 9.

References: NetApp technical reports tr-3633, tr-3496, tr-3322; NetApp Knowledge Base Article 7518
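The sq_max_size sizing rule above (25 per 64MB of RAM, capped at 100) works out as in this small sketch; the 512MB memory size is just an example value.

```shell
#!/bin/sh
# Derive a starting sq_max_size from physical memory:
# 25 per 64MB of RAM, capped at 100.
mem_mb=512                      # example: 512MB of physical memory
sq=$(( mem_mb / 64 * 25 ))      # 512/64 = 8 blocks of 64MB -> 8*25 = 200
[ "$sq" -gt 100 ] && sq=100     # cap at the documented maximum
echo "set sq_max_size = $sq"    # the line that would go in /etc/system
```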
controller's disk shelves. Further details of active/active cluster best practices can be found in TR-3450.

Active/Passive (Stretch MetroCluster) Configuration

This is the diagram of an active/active MetroCluster; the same design applies to an active/passive MetroCluster, except that one node of the cluster holds only a mirror of the primary system's data. In this configuration the primary and secondary systems can be up to 500m apart (up to 100km with Fabric MetroCluster), and all primary system data is mirrored to the secondary system with SyncMirror; in the event of a primary system failure, all connections are automatically switched over to the remote copy. This provides an additional level of failure protection, such as against whole disk shelf failure or multiple failures at the same time; however, it requires another copy of the same data and exactly the same hardware configuration for the secondary node. Please note that a cluster interconnect (CI) on the NVRAM card is required for a cluster configuration; however, the 3170 offers a new architecture that incorporates a dual-controller design with the cluster interconnect on the backplane. For this reason, the FCVI card that is normally used for CI in a Fabric MetroCluster configuration must also be used for a 31xx Stretch configuration. Further details of MetroCluster design and implementation can be found in TR-3548.

Minimizing downtime with a cluster

Although a cluster configuration saves you from unwanted downtime, a small disruption can be sensed on the network while takeover/giveback is happening; this is approximately 90 seconds or less in most environments, and it keeps the NAS network alive with a few "not responding" errors on clients. A few related points are given below:

CIFS: takeover leads to a loss of session for the clients, and possible loss of data. However, clients will reconnect the session by themselves if the system comes back before the timeout window.
NFS hard mounts: clients will continue to attempt reconnection indefinitely, so a controller reboot does not affect clients unless the application issuing the request times out waiting for NFS responses. Consequently, it may be appropriate to compensate by extending the application timeout window.

NFS soft mounts: client processes continue reconnection attempts until the timeout limit is reached. While soft mounts may reduce the possibility of client instability during failover, they expose applications to the potential for silent data corruption, so they are only advised in cases where client responsiveness is more important than data integrity. If TCP soft mounts are not possible, reduce the risk of UDP soft mounts by specifying long retransmission timeout values and a relatively large number of retries in the mount options (e.g., timeo=30, retrans=10).

FTP, NDMP, HTTP, backups, restores: state is lost and the operation must be retried by the client.

Applications (for example, Oracle, Exchange): application-specific. Generally, if timeout-based, application parameters can be tuned to increase timeout intervals to exceed the Data ONTAP reboot time as a means of avoiding application disruption.
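As illustrative mount lines (the server name and paths are hypothetical; the hard-mount options are the ones recommended in the Solaris tuning section of this lens):

```
# Hard mount (recommended):
mount -o rw,bg,hard,nointr,rsize=32768,wsize=32768,vers=3,proto=tcp filer1:/vol/vol1 /mnt/vol1

# Soft mount, if unavoidable, with a stretched retry window:
mount -o soft,timeo=30,retrans=10 filer1:/vol/vol1 /mnt/vol1
```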
net.ipv4.tcp_timestamps = 0 — disables the RTTM feature introduced in RFC 1323; helps reduce CPU overhead and prevents adding a 10-byte overhead to the TCP header
net.ipv4.tcp_sack = 0 — disables selective ACK; helps reduce CPU overhead

References: NetApp whitepapers tr-3700, tr-3183, tr-3369; NetApp Knowledge Base Article 7518
by mohit96
Copyright 2012, Squidoo, LLC and respective copyright owners