NetApp storage systems: technical troubleshooting and tips


My Struggle with NetApp
NetApp-related technical how-tos, the storage world, and other technical discussions.

How to restore data from aggregate snapshot
Today one of our users found himself in wet pants when he noticed that his robocopy job had overwritten a folder rather than appending new data to it. In a panic he ran to me looking for a tape or snapshot backup of his original data, which unfortunately wasn't there, as he had previously confirmed that the data didn't need any kind of protection. At this point I had only one place left from which I could recover the data: aggregate-level snapshots. I looked at the aggregate's snapshots and saw they went back to the time when his data was still in place. Knowing that data deleted from a volume is still locked in the aggregate's snapshots, I felt good about having reserved some space for aggregate-level snapshots, which no one ever advocated. The next step was to recover the data, but the problem was that if I reverted the aggregate using 'snap restore -A', then all the volumes in that aggregate would be reverted, which would be a bigger problem. So I had to go a different way: use the aggregate copy function to copy the aggregate's snapshot to an

empty aggregate and then restore the data from there. Here's the cookbook for this.

Pre-checks:
- The volume you lost data from is a flexible volume.
- Identify an empty aggregate that can be used as the destination (it can be on another controller).
- Make sure the destination aggregate is equal to or larger than the source aggregate.
- /etc/hosts.equiv has an entry for the filer you want to copy data to, and /etc/hosts has its IP address. If you are copying on the same controller, the loopback address (127.0.0.1) should be added to /etc/hosts and the local filer name should be in hosts.equiv.
- Know the name of the aggregate snapshot you want to copy.

Example: let's say the volume we lost data from is 'vol1', the aggregate holding this volume is 'aggr_source', the aggregate snapshot that still has the lost data is 'hourly.1', and the empty aggregate we will copy to is 'aggr_destination'.

Execution (see the console sketch below):
- Restrict the destination aggregate using 'aggr restrict aggr_destination'.
- Start the aggregate data copy using 'aggr copy start -s hourly.1 aggr_source aggr_destination'.
- Once the copy is complete, bring the aggregate online using 'aggr online aggr_destination'.
- If you did the copy on the same controller, the system will rename the volume 'vol1' in 'aggr_destination' to 'vol1(1)'.
- Now export the volume or LUN and all your lost data is available.

So here's the answer to another popular question: why do I need to reserve space for aggregate-level snapshots? Do you have the answer now?
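To put the cookbook together in one place, here is a rough console sketch using the example names above (aggr_source, aggr_destination, hourly.1); the 'aggr copy status' check is an extra step I like to use for watching progress, and the exact output will differ on your system.

filer> aggr restrict aggr_destination
filer> aggr copy start -s hourly.1 aggr_source aggr_destination
filer> aggr copy status        (wait here until the copy operation completes)
filer> aggr online aggr_destination
filer> vol status              (on the same controller the copied volume shows up as vol1(1))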

Most destructive command in Ontap


There are some commands that shake me when I run them or even when I am close to them, but I never thought I could bring my filer so close to death just by mistyping a command. Yes, indeed I did it, by typing 'gbd' rather than 'dbg'. The two are so close to each other that my buttery fingers didn't realize I had mistyped, and by the time I did realize, it was too late. Sigh! A little background on this 'gbd' command. It lives in diag mode, as does 'dbg'; however, whereas 'dbg' captures filer debug info to the console or a file, 'gbd' sends a kill signal to all the processors, which stops all work on the filer and everything just hangs. The only way to recover your filer is to hard reboot it, either through the RLM or by physically hitting the power button. I don't understand why the brilliant NetApp engineers made it so simple; why couldn't they use a command like 'use_this_to_kill_your_system' or something like that, which I swear no one would ever type. Anyway, I did it, and I admit I should have checked before hitting return, which I didn't. But guess what, I was lucky enough not to do it on a prod system, and this test/dev system had only a bunch of NFS clients connected to it, which made the hang more or less invisible to the client systems due to the nature of the NFS protocol. Which other command do you think shouldn't be this easy to type?

How to do host/user/group or netgroup lookups from the filer

Often we want to do an nslookup for a host, or a NIS/LDAP lookup for a user or group, for troubleshooting purposes. If you have a unix system handy you can do it from there, but what if you suspect the results are not the same as what your filer is getting? If you are troubleshooting a CIFS issue you are in luck with the 'cifs lookup' command; however, if you are dealing with a DNS or NFS issue you are out of luck, unless you go into advanced mode. Yes, once inside advanced mode you get access to a lot of other commands, including one very nifty command, 'getXXbyYY', which is incredibly useful but hidden from the view of admins for some strange reason. Really, I am not sure why NetApp thinks this shouldn't be available to the end user; every time I do troubleshooting I feel the need for it, and in no way do I see it making any sort of change on the filer. Anyway, here's the command. Although it says to use "man na_getXXbyYY" for additional info, I couldn't locate that on my systems, so I use the built-in help:

test1*> getXXbyYY help
usage: getXXbyYY <sub-command>
Where sub-command is one of
  gethostbyname_r
  gethostbyaddr_r
  netgrp
  getspwbyname_r
  getpwbyname_r
  getpwbyuid_r
  getgrbyname
  getgrbygid
  getgrlist
For more information, try 'man na_getXXbyYY'

Please remember this command is not available in admin mode, and the search order depends on your /etc/nsswitch.conf entries, so before you start thinking it isn't working as expected, check these two things first. Although the subcommands are largely self-explanatory, I have added a small description for each of them.

gethostbyname_r - Resolves a host name to an IP address using the configured DNS server, same as nslookup
gethostbyaddr_r - Resolves an IP address back to a host name using the configured DNS server, same as a reverse lookup
netgrp - Checks netgroup membership for a given host from LDAP/files/NIS
getspwbyname_r - Displays user information from the shadow file
getpwbyname_r - Displays user information, including the encrypted password, from LDAP/files/NIS
getpwbyuid_r - Same as above, but you provide a uid rather than a user name
getgrbyname - Displays a group name and gid from LDAP/files/NIS
getgrbygid - Same as above, but you provide a gid rather than a group name
getgrlist - Shows a given user's gids from LDAP/files/NIS

Examples:

test1*> getXXbyYY gethostbyname_r landinghost1
name: landinghost1
aliases:
addresses: 10.21.242.7

test1*> getXXbyYY gethostbyaddr_r 10.21.242.7
name: landinghost1
aliases:
addresses: 10.21.242.7

test1*> getXXbyYY netgrp support-group testhost1
client testhost1 is in netgroup support-group

test1*> getXXbyYY getpwbyname_r root
pw_name = root
pw_passwd = _J9..gsxiYTAHEtV3Qnk
pw_uid = 0, pw_gid = 1
pw_gecos =
pw_dir = /
pw_shell =

test1*> getXXbyYY getpwbyuid_r 0
pw_name = root
pw_passwd = _J9..gsxiYTAHEtV3Qnk
pw_uid = 0, pw_gid = 1
pw_gecos =
pw_dir = /
pw_shell =

test1*> getXXbyYY getgrbyname was
name = was
gid = 10826

test1*> getXXbyYY getgrbygid 10826
name = was
gid = 10826

test1*> getXXbyYY getgrlist wasadmin
pw_name = wasadmin
Groups: 10826

Execute commands from a file on Ontap


Occasionally we want to quickly or periodically run a set of pre-defined commands on filers, for example when we are making a small change to the network configuration and want to minimize the network downtime, or when we are creating a volume and know that snap reserve, snap schedule, dedupe, autosize or something else needs to be changed after every volume creation. If you are executing commands from a unix terminal, a better way is to keep all the commands in a text file and do something like:

bash> for i in vol1
> do
>   snap sched $i 0
>   snap reserve $i 0
>   blah $i
>   blah $i
> done

and everything would be done. But imagine how you would do this through the console. Unfortunately Ontap doesn't support any kind of scripting, not even a for loop, so we have to run every command one by one, either by typing it on the console or by copy-pasting from a text file on our desktop. However, there is a better way: use notepad to create the set of commands, in the correct order, exactly as you would execute them on the console, copy the file to the filer, and use the 'source' command to execute every line from it. I know it's not such a brilliant idea, as you still have to copy and paste everything into a file on the filer, but it's a wee bit better than executing each and every command on the console. Think about having to re-run /etc/rc: either you use 'rdfile /etc/rc' to print everything on the console and then copy each line and execute it, or you just run 'source /etc/rc' and let it run all the commands for you. You can also use 'source -v /etc/rc' to print the commands on the console without executing them, just to get an idea whether there are any junk characters or unwanted commands inside the file. As a precaution you'd better be sure that all the commands are valid and correct, because if a command fails, source doesn't stop there; it just moves on to the next command in the list. Use it, and I am sure you will like it the next time you are making changes on a filer that need ten different commands to be executed.
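As a rough sketch of the workflow (the file name /etc/vol_setup and the volume name vol1 are made up for illustration): you can create the command file directly on the filer with 'wrfile', preview it with 'source -v', then execute it. In my experience wrfile keeps what you typed when you end input with Ctrl-C, but you may prefer to drop the file onto the root volume over CIFS or NFS instead.

filer> wrfile /etc/vol_setup
snap sched vol1 0
snap reserve vol1 0
vol autosize vol1 on
^C                                 (interrupt ends input; the file is saved)
filer> source -v /etc/vol_setup    (dry run: print the commands without executing them)
filer> source /etc/vol_setup       (execute every line in the file)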

How to check unplanned downtime detail for a NetApp filer


Every now and then someone asks us what the uptime of a system is, and we just type 'uptime' on the system console to get the detail instantly. This is a really handy command for knowing when the system was last rebooted and how many operations per protocol it has served since then. Wouldn't our life be a little easier if managers were satisfied with that? Alas! That doesn't happen; they ask for all the details since we acquired the system, or since the 1st of January, and then we go back to the excel sheet or ppt we created as part of the monthly report to pull the data. How about getting the same information from the system with just one command, wouldn't that be cool? Fortunately enough

we have the little-known command 'availtime' right inside Ontap, which does exactly that and seems to have been created specifically with our bosses in mind.

HOST02*> availtime full
Service statistics as of Sat Aug 28 18:07:33 BST 2010
System  (UP). First recorded 68824252 secs ago on Mon Jun 23 04:16:41 BST 2008
        Planned   downs 31, downtime 6781737 secs, longest 6771328, Tue Sep  9 15:07:33 BST 2008
        Uptime counting unplanned downtime: 100.00%; counting total downtime: 90.14%
NFS     (UP). First recorded 68824242 secs ago on Mon Jun 23 04:16:51 BST 2008
        Planned   downs 43, downtime 6849318 secs, longest 6839978, Wed Sep 10 10:11:43 BST 2008
        Uptime counting unplanned downtime: 100.00%; counting total downtime: 90.04%
CIFS    (UP). First recorded 61969859 secs ago on Wed Sep 10 12:16:34 BST 2008
        Planned   downs 35, downtime 17166 secs, longest 7351, Thu Jul 30 13:52:25 BST 2009
        Uptime counting unplanned downtime: 100.00%; counting total downtime: 99.97%
HTTP    (UP). First recorded 47876362 secs ago on Fri Feb 20 14:08:11 GMT 2009
        Planned   downs 8, downtime 235 secs, longest 53, Wed Jan 20 14:10:18 GMT 2010
        Unplanned downs 16, downtime 4915 secs, longest 3800, Mon Jul 27 16:01:02 BST 2009
        Uptime counting unplanned downtime: 99.98%; counting total downtime: 99.98%
FCP     (DOWN). First recorded 68817797 secs ago on Mon Jun 23 06:04:16 BST 2008
        Planned   downs 17, downtime 44988443 secs, longest 38209631, Sat Aug 28 18:07:33 BST 2010
        Unplanned downs 6, downtime 78 secs, longest 21, Fri Feb 20 15:24:44 GMT 2009
        Uptime counting unplanned downtime: 99.99%; counting total downtime: 34.62%
iSCSI   (DOWN). First recorded 61970687 secs ago on Wed Sep 10 12:02:46 BST 2008
        Planned   downs 21, downtime 38211244 secs, longest 36389556, Sat Aug 28 18:07:33 BST 2010
        Uptime counting unplanned downtime: 100.00%; counting total downtime: 38.33%

I am not sure why NetApp has kept this command in advanced mode, but once you know it, I bet next time you will not stop yourself from going into advanced mode to see how much unscheduled downtime you have had since the last reset. A shorter version is plain 'availtime'; it shows the same information as 'availtime full' but truncates the output, denoting Planned with P and Unplanned with U, which is very handy if you want to parse it in a script.

HOST04*> availtime
Service statistics as of Sat Aug 28 18:07:33 BST 2010
System  (UP). First recorded (20667804) on Wed Sep 23 09:35:49 GMT 2009
        P  5, 496, 139, Fri Dec 11 15:58:19 GMT 2009
        U  1, 1605, 1605, Wed Mar 31 17:01:41 GMT 2010
CIFS    (UP). First recorded (20666589) on Wed Sep 23 09:56:04 GMT 2009
        P  7, 825, 646, Thu Jan 21 19:08:03 GMT 2010
        U  1, 77, 77, Wed Mar 31 16:34:54 GMT 2010
HTTP    (UP). First recorded (20664731) on Wed Sep 23 10:27:02 GMT 2009
        P  3, 51, 22, Thu Jan 21 19:17:25 GMT 2010
        U  4, 203, 96, Thu Jan 21 19:08:03 GMT 2010
FCP     (UP). First recorded (20477735) on Fri Sep 25 14:23:38 GMT 2009
        P  3, 126, 92, Thu Jan 21 19:07:57 GMT 2010
        U  4, 108, 76, Wed Mar 31 16:34:53 GMT 2010

To zero out all the counters, use the 'reset' switch. Make sure you have recorded the statistics before you reset them, because once the counters are reset you will no longer be able to get uptime details going back to when the system was built; so you may want to do this only after you acquire a new system, have done all the configuration, and it is ready to start serving user requests.

Operations Manager Efficiency Plugin


After the release of DFM 3.8.1, NetApp released a nice little plugin for DFM called the 'Operations Manager Storage Efficiency Dashboard Plugin'. Quite a long name, but it's good: it cleverly uses the DFM database to pull storage utilization and presents the information in a nice flash-based webpage. It's useful when you have to show higher management the current storage utilization and the savings that came from NetApp thin provisioning, dedupe, flexclone and other features, and it goes very well with NetApp's storage efficiency mantra. The best part is that after you install the plugin you don't have to do anything else, and you can access it from anywhere on the network without installing any software. However, there isn't a simple way to reach the page even when you are right inside the OM webpage, as there is no link pointing to the dashboard, so you have to remember the location, or, for people like me, bookmark it in your browser.

The most common problem with it comes from a lack of foresight in how the plugin was written. Here's what I mean. Usually we install the DFM server on c:\ and move all the perfdata, DB, script folders and other bits and pieces to a different drive for easy backup or, in the case of a cluster, for the clustering setup, and here the script falls apart. The script expects to be sitting in its default location with the web folder right next to it, and acts accordingly, whereas in the real situation the web folder is on c:\ and the script is on some other volume. There isn't any way to rectify the behaviour of the script or the web server: the apache running on DFM can't be configured to use any folder other than the one inside the installation directory (AFAIK), and no switch is provided in the script to tell it the location of the real web folder where it needs to copy its content. So, in a nutshell, even though the script executes and copies all the files required for showing the dashboard, it's useless unless you figure out for yourself what's going wrong and why the page is not showing in your browser. Overcoming this limitation is easy enough for folks in a Unix environment, as creating an alias to the original web folder makes everything work fine, but for windows folks like me creating a shortcut doesn't work. So here's the way to correct the problem. Download the plugin from the NOW toolchest. Extract the zip and edit the file 'package.xml', changing the string 'dfmeff.exe' to 'dfmeff.bat'. Next, create a new batch file called 'dfmeff.bat' with the content below:

@echo off
D:\DFM\scriptplugins\dfmeff\dfmeff.exe
xcopy D:\DFM\web\*.* "C:\Program Files\NetApp\DataFabric\DFM\web" /Q /I /Y /R

Obviously you have to change the paths as per your installation, but once you have created the batch file and added its reference in the xml file you are good to go. Just zip it again using any zip software and use the new zip file as the plugin source for installation in DFM.

Update: Just noticed a video showing the features of the plugin on the netapp community site: http://communities.netapp.com/videos/1209

Which is faster, NDMPcopy or vol copy?


After my last post I got a few mails asking: between ndmpcopy and vol copy, which one is faster? If I only count speed, then vol copy, because it copies blocks directly from disk without going through the file system; however, I think it is best suited to migrating a whole volume.

Pros of vol copy:
- CPU usage can be throttled
- Source volume snapshots can be copied
- Up to 4 copy operations can run simultaneously
- Once started it goes to the background and you can use the console for other purposes

Cons of vol copy:
- The destination can't be the root volume
- The destination volume must be offline
- All data in the destination volume will be overwritten
- The destination volume size must be bigger than or equal to the source
- A single file or directory cannot be specified for the copy operation
- Both volumes must be of the same type, traditional or flexible
- If data is copied between two filers, each filer must have the other filer's entry in its /etc/hosts.equiv file and its own loopback address in /etc/hosts

However, for copying data between two filers for a test or any other purpose, ndmpcopy is more suitable because it gives you additional control and fewer restrictions, which is very useful.

Pros of ndmpcopy:
- Little or no CPU overhead
- Incremental copy is supported
- No limitation on volume size and type
- No need to take the destination volume offline
- A single file or directory can also be specified
- No file fragmentation on the destination volume, as all data is copied sequentially from the source, so the data layout is improved
- No configuration is required between the two filers; a username and password are used for authentication

Cons of ndmpcopy:
- Snapshots can't be copied from the source
- The console is not available while the copy operation is running, so no multiple ndmpcopy operations
- If lots of small files have to be copied, the copy operation will be slower

So as you have seen, both work well, but one can't replace the other, and each has its use for a different purpose. Both command forms are sketched below.
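As a rough illustration, assuming two filers named filer1 and filer2, a source volume vol_src, a destination volume vol_dst and an existing snapshot nightly.0 (all names made up), the two approaches look roughly like this; check the na_vol and na_ndmpcopy man pages on your release before relying on the exact switches.

filer1> vol restrict vol_dst                         (vol copy needs the destination restricted)
filer1> vol copy start -s nightly.0 vol_src vol_dst
filer1> vol copy status                              (watch the block-level copy progress)
filer1> vol online vol_dst

filer1> ndmpcopy -sa root:password -da root:password filer1:/vol/vol_src/projects filer2:/vol/vol_dst/projects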

How to copy files in Ontap

As soon as someone asks this question we all say 'use ndmpcopy', but what if you don't have any network adapters configured, will ndmpcopy work? No. ndmpcopy is very useful if you want to copy a file or a whole volume, but one thing very few people know is that it doesn't work if the loopback adapter is not configured, because ndmpcopy passes all the data through the lo adapter; so it depends not only on lo's availability but also on its speed. So how do you copy data if lo is not available? The answer is simple: use dd, just an old-fashioned unix command that does a lot of things. Not only can it copy a file given its full pathname, you can even work with a block number and disk number, and the best part is that the syntax is simple: 'if' for from and 'of' for to. It can be used not only for copying files around the system; in fact you can use it for testing I/O and for copying files out of a snapshot, and it works regardless of permissions. A little note: if you are wary of going into advanced or diagnostic mode, better stick with rdfile and wrfile, because dd is not available in admin mode, so you have to go into advanced mode to use it. Here's the syntax of the command:

dd [ [if= file ] | [ din= disknum bin= blocknum ] ] [ [of= file ] | [ dout= disknum bout= blocknum ] ] count= number_of_blocks

Another note: if you are using count, make sure you use a multiple of 4, because the WAFL block size is 4k.

Example:

sim1> priv set advanced
sim1*> dd if=/vol/vol0/.snapshot/hourly.2/etc/snapmirror.conf of=/vol/vol0/etc/snapmirror.conf1

I2P (inode to pathname) in Ontap


A few days ago I became curious to know what I2P is and what would happen if you turned it off. So I started hunting the NetApp site and Google for information, but I couldn't find much. As usual, once the vendor site and Google are exhausted you start asking your social network, and then, bingo! I was able to get some information.

So the first question: what the heck is I2P? It's a feature in Ontap 7.1 and later that maps each inode number and file name to a relative path in order to speed up certain operations. As we all know, every file, directory, soft link, hard link or other metadata file has an inode associated with it, so each inode goes through this process: Ontap adds 8 bits to the metadata of each file or directory, whereas every hard link takes a 12-byte penalty, and this happens every time you create, delete or rename a file, directory or link.

Ok, but why do I need this? As far as its usage goes, there are some well-known applications such as fpolicy, virus scan and file auditing which need to know the full path of the requested file, and some Ontap-specific features, for example in a mixed volume it informs NFS clients of any changes made by CIFS clients. One important place where you will see the difference is the dump command, as having the full path available for each file makes it much faster and more efficient. There are also some grey areas where Ontap uses it for its internal work, but that's all buried deep under their IP protection policy, so I couldn't get any information on it.

Now, how do you get information about this from your system? If you look closely at vol options you will see an option 'no_i2p' (default off, and only in 7.1 and later) to enable or disable the i2p feature. If you go into advanced mode you can see a few more related commands: 'inodepath' shows the i2p information stored in a given inode, and the 'wafl scan status' command shows running i2p scans, which can be aborted with 'wafl scan abort scan_id'; you can also change the scan speed with 'wafl scan speed new_speed' after listing the current speed with 'wafl scan speed'.

Having this information pushed me to think: most of the volumes on my systems are NFS-only, they don't need any virus scan, and we don't use dump, fpolicy or any of the other features, so why not turn it off and get some extra juice out of the system? But speaking with some chaps, it turned out that turning it off wouldn't be a good idea, although they were also not sure what would break if you did; and as it has very little performance impact, it's better left untouched. And yes, it's true that the impact is very small, because in general we don't do so much metadata modification that the i2p workload could hurt the system. However, when you upgrade a system from an earlier release to the 7.1 or later family, it gets very busy creating i2p information for every file, directory and link, and may run at high utilization for quite some time; at that point you may wish to use these commands to quickly pull your system back to a normal state, let the scan run at a slower speed or one volume at a time, or stop it completely if you want to revert the system to a pre-7.1 release. If I get some time I would like to do some extensive testing and see what comes out, but if anyone else knows, please share your knowledge.
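For reference, a minimal console sketch of the commands mentioned above, assuming a volume called testvol and a scan id of 101 (both made up); the wafl scan commands live in advanced mode, so use them carefully.

filer> vol options testvol            (look for the no_i2p option; off means i2p is enabled)
filer> priv set advanced
filer*> wafl scan status              (list running scans, including i2p creation scans)
filer*> wafl scan speed               (show the current scan speed)
filer*> wafl scan speed 100           (throttle the scan to a lower speed)
filer*> wafl scan abort 101           (abort a specific scan by its id)
filer*> priv set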

How long does a standard active/active cluster take to fail over


Usually in the NetApp sales pitch they say there is no downtime with an Active/Active cluster, but I beg to differ: it's a good solution, but not best of breed. Let's see why. An Active/Active configuration involves two controllers connected to the same disk shelves, which keep talking to each other through their NVRAM module connections, so any time one system goes down the other takes over the identity of its partner. That sounds good, doesn't it? Yes, except for a few glitches: when a system goes down unexpectedly or experiences a failure and the partner starts the takeover process, it can take more than 90 seconds, which may be fine in a NAS environment, but for FC and iSCSI it is more than enough time for a host to declare the lun dead and fail your application. Now, 90 seconds is the time the surviving node takes to assume the identity of its partner; but if you don't have an RLM card (which also gives hardware-level assistance to the cluster) it takes an additional 30 seconds for the surviving node to declare its partner dead and start the takeover process, which brings it to a whopping 120 seconds.

Now let's look at the other scenario: an NDU software update. What I understand by NDU is Non Disruptive Update, meaning that if I am doing a software update I can fail over and fail back the partner nodes and simulate a reboot to put the new code into effect, without any downtime. But as per NetApp KB-22909, failover and failback can each take as much as 180 seconds. So how can 180 seconds of downtime on each controller be called non-disruptive? That is the worst-case scenario; what I have seen with my systems so far is that they take less than a minute to fail over and fail back: 35 seconds on my V6080 systems and 22 seconds on the V3170 filer, rather than 90 seconds (observed with ping), and both of them are loaded with multiple vifs, CIFS shares, NFS exports, qtree quotas and snapmirror. Your mileage may vary, as it depends on the system configuration, but that's not bad for a NAS-only environment.

To prove this, a few weeks back we ran some tests on our V3170 systems to check the VMs of a new project, as they wanted to see the effect when a system goes offline due to a hardware failure or any other reason, and all 300-odd VMs kept running fine without any glitches. During the test we ran a script on all 300-odd Linux VMs which used dd to write and delete a 100MB file every 2 seconds on /root and /tmp; a few of them were modified to write a 500MB file or a 1MB file. While all of the VMs were running the script, we did a failover/failback as well as a hard reboot, which left I/O suspended for a whopping 3 minutes; surprisingly, none of the VMs had a kernel panic or a read-only file system or stopped writing, although during the filer reboot they were pathetically slow or frozen. Now you must be wondering why none of the VMs crashed, since 180 seconds of no response from disk will bring any OS to its knees, so what was special? Well, here's the magic: if you look into the VM best practices and search the NetApp site for the VM disk timeout settings change, you will find they recommend raising the disk timeout to 190 seconds so the guest can survive any kind of controller reboot or failover, and that is what lets them call it non-disruptive. So next time someone says that with an active/active cluster you don't have any downtime, don't forget to ask him how he handles a system crash or upgrade activity; and if you want to deploy VMs in your environment on NetApp heads, don't forget to change that parameter, otherwise even a 22-second I/O pause will make a big impact on your VM environment.
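If you want to check or set that timeout by hand on a Linux guest, the value lives in sysfs; a minimal sketch (the device name sdb is only an example, and the NetApp/VMware guest utilities normally set this for you):

# check the current SCSI disk timeout on the guest (often 30 or 60 seconds)
cat /sys/block/sdb/device/timeout

# raise it to 190 seconds so I/O survives a controller failover or reboot
echo 190 > /sys/block/sdb/device/timeout

# to make it persistent, add a udev rule or let the guest tools set it at boot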

Defragmentation in NetApp
Usually we face this problem with our PC: we defrag our volumes, clear temp files and what not, and most of the time that solves the problem; maybe not fully, but it gets better. In NetApp we don't have to deal with a fragmented registry or temp files, but due to the nature of the WAFL file system a volume gets fragmented very soon, as soon as you start overwriting, deleting and adding data to it. So what do you do then? The answer is very simple: use the 'reallocate' command. Yes, this is NetApp's defrag tool, built right into the Ontap OS. First you have to turn reallocation on with the 'reallocate on' command; the same command turns it off with the off switch. It can be used not only on volumes; in fact you can run it on a file, a lun or an aggregate itself. However, I should warn you that optimization of a lun may not give you any performance benefit, and may even make things worse, as Ontap has no clue what is inside the lun or what its file system layout is. If you want to run the reallocation only once, use the -f or -o switch; if you want Ontap to keep track of your FS and optimize the data when it feels it necessary, control it with the -i switch or schedule it with the 'reallocate schedule' command. To check the current optimization level of a volume, use 'reallocate measure -o <path>', or if you are feeling adventurous use 'wafl scan measure_layout <volume>' from advanced mode, although I don't suggest using the wafl set of commands in general use; but yes, sometimes you want to do something different. This command is pretty straightforward and does no harm (except extra load on CPU and disk), so you can play with it, but you should always consider using the -p switch for volumes that have snapshots and/or snapmirror enabled, to keep the snapshot size small.
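A minimal sketch of the workflow on a hypothetical volume /vol/testvol; the switches vary a little between Ontap releases, so check the reallocate help on your system before copying these verbatim.

filer> reallocate on                              (enable reallocation globally on the controller)
filer> reallocate measure -o /vol/testvol         (one-off measurement of the current layout)
filer> reallocate start -f -p /vol/testvol        (force a one-time full pass; -p is snapshot-friendly)
filer> reallocate status /vol/testvol             (check progress of the running job)
filer> reallocate schedule -s "0 23 * 6" /vol/testvol    (or schedule it, e.g. Saturdays at 23:00)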

How to get the list of domain users added to filer without fiddling with SID
There were numerous times when I wanted to see an AD user's permissions on a filer, but just locating that user on the system took me a lot of time. Why? Because Ontap shows domain users added to the system in SID format rather than by their names, which is very annoying: when it dumps the SIDs on screen, we then have to use the 'cifs lookup' command to hunt for the user we are looking for in that bunch of SIDs. So here's a handy little unix one-liner to list all the AD users added to a filer in username format rather than as SIDs. I have already set up password-less login to the filer, so I haven't added the username and password fields; if you haven't done that, add your login credentials after the name of the filer in the command below.

rsh useradmin domainuser list -g "Administrators" | sed 's/^S/rsh cifs lookup S/'

This command displays the AD users added to the Administrators group; if you want to see users from any other group, replace the word Administrators with that group name.
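To make the idea concrete, here is a hedged sketch with a made-up filer name filer01 and the output piped to sh so that each generated 'cifs lookup' actually runs; it assumes password-less rsh is already configured, and the sed pattern may need adjusting to the exact list output on your system.

rsh filer01 useradmin domainuser list -g "Administrators" \
  | sed 's/^S/rsh filer01 cifs lookup S/' | sh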

Restoring data from snapshot through snaprestore in NetApp


Ok, so now you have allocated the correct snap reserve space, configured snap schedules and snap autodelete, users have access to their snapshots, and they recover their data without any interference from the backup team. Everyone is happy, so you are happy. But all of a sudden, on a Friday evening, you get a call from the VP of marketing crying on the phone that he has lost all the data on his network drive; windows shows a recovery time of 2 hrs, but he wants his 1Gb pst to be accessible now, as he is on VPN with a client and needs to pull some old mails from it. Well, that's nothing abnormal: he had lots of data, and to recover it windows has to read everything from the snapshot and then write it back to the network drive, which obviously takes time. Now what would you say? Will you tell him to navigate to his pst and recover just that file (which shouldn't take much time on a fast connection) and then recover the rest of the data afterwards, or will you say "ok, I have recovered all your data" while still on the phone, and become the hero? I must say I would use the opportunity to become the hero with a minute or less of work, but before we do that, a few things to note.

For volume snaprestore:
- The volume must be online and must not be a mirror.
- When reverting the root volume, the filer will be rebooted.
- Non-root volumes do not require a reboot; however, when reverting a non-root volume, all ongoing access to the volume must be terminated, just as is done when a volume is brought offline.

For single-file snaprestore:
- The volume used for restoring the file must be online and must not be a mirror.
- If restore_as_path is specified, the path must be a full path to a filename, and must be in the same volume as the volume used for the restore.
- Files other than normal files and LUNs are not restored. This includes directories (and their contents), and files with NT streams.
- If there is not enough space in the volume, the single-file snaprestore will not start.
- If the file already exists in the active file system, it will be overwritten with the version in the snapshot.

To restore data there are two ways: first, system admins using the 'snap restore' command invoked by SMO, SMVI, FilerView or the system console; second, end users, who can restore by copying the file from the .snapshot or ~snapshot directory, or by using the revert function in XP or newer systems. Restoring data through the snap restore command, however, is very quick (seconds), even for TBs of data. The syntax for snap restore is:

snap restore -t vol|file -s <snapshot_name> -r <restore_as_path> <volume_name | file_path>

If you don't want to restore the data to a different place, remove the '-r <restore_as_path>' argument and the filer will replace the current file with the version in the snapshot; and if you don't provide a snapshot name, the system will show you all available snapshots and prompt you to select the one you want to restore from. Here's the simplest form of the command, as an example of recovering a file:

testfiler> snap restore -t file /vol/testvol/RootQtree/test.pst

WARNING! This will restore a file from a snapshot into the active
filesystem.  If the file already exists in the active filesystem,
it will be overwritten with the contents from the snapshot.
Are you sure you want to do this? yes
The following snapshots are available for volume testvol:

      date            name
      ------------    --------
      Nov 17 13:00    hourly.0
      Nov 17 11:00    hourly.1
      Nov 17 09:00    hourly.2
      [... remaining hourly, nightly and weekly snapshots omitted ...]

Which snapshot in volume testvol would you like to revert the file from? nightly.5

You have selected file /vol/testvol/RootQtree/test.pst, snapshot nightly.5
Proceed with restore? yes
testfiler>
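For completeness, a volume-level revert looks like the sketch below (the volume name testvol and snapshot nightly.0 are just examples); remember from the notes above that all ongoing access to the volume must be terminated first, and that reverting the root volume reboots the filer.

testfiler> snap restore -t vol -s nightly.0 testvol
(the filer prints a warning and asks for confirmation before reverting the whole volume)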

Snapshot configuration in NetApp


Ok, first of all let me admit that my last post sounded more like a sales pitch than something technical, though I am not a NetApp employee, nor is anyone paying me to blog. I must also agree that whatever I showed there is pretty much available from other vendors too, so it was more about general awareness of the technology than about a particular vendor. In this post I will talk about snapshot configuration and related functions in NetApp, so let's start.

What is a snapshot copy? A Snapshot copy is a frozen, read-only image of a volume or an aggregate that captures the state of the file system at a point in time; each volume can hold a maximum of 255 Snapshot copies at one time. Snapshots can be taken by the system on a pre-defined schedule, by Protection Manager policies, SMO, SMVI or FilerView, or by manually running a command at the system console or through custom scripts.

How to disable client access to snapshot copies? To hide the .snapshot directory from clients, use the 'vol options <volume_name> nosnapdir on' command.

Notes: Please DO NOT use any of the snap family of commands without a volume name, as it may drive the CPU to its peak on systems with lots of volumes and a large number of snapshots, and it can hang the system, which may result in a system panic. Use '-A' and replace the volume name with an aggregate name if you want to run these commands against an aggregate.

How to configure snapshots through the system console? It's always recommended that when you provision a volume you look at the snapshot reserve and schedule, because by default 20% of a new volume's space is reserved for snapshots, which most of the time you need to change for efficient use of space and snapshots. Always ask the requester what the rate of change is, how many snapshots he wants access to, and when he wants snapshots to be taken, because if you take a snapshot of some oracle data while the database is not in hot-backup mode, it's just utter waste, and the same goes for VMs. Once you have those details, do a little calculation and then use these commands to configure:

snap reserve <volume_name> <percent>
Example: 'snap reserve testvol 10' allocates 10% of the space of volume 'testvol' for snapshots.

snap sched <volume_name> <weekly> <daily> <hourly>[@<hours>]
Example: 'snap sched testvol 4 7 7@9,11,13,15,17,19,21' defines the automatic snapshot schedule; you specify how many weekly, daily and hourly snapshots to retain, as well as at what times the hourly snapshots are taken. In this example volume testvol keeps 4 weekly, 7 daily and 7 hourly snapshots, with the hourly snapshots taken at 9, 11, 13, 15, 17, 19 and 21 hours, system local time. Please make sure that 'nosnap' is set to off in the volume options.

How to take snapshots manually? To take a snapshot manually, run:

snap create <volume_name> <snapshot_name>

Here volume_name is the name of the volume you want to take a snapshot of and snapshot_name is the name you want to identify the snapshot with.

How to list snapshots? You can check the snapshots associated with any volume with 'snap list <volume_name>'. After issuing the command you will get output similar to this:

testfiler> snap list testvol
Volume testvol
working...

  %/used       %/total  date          name
----------  ----------  ------------  --------
 36% (36%)    0% ( 0%)  Dec 02 16:00  hourly.0
 50% (30%)    0% ( 0%)  Dec 02 12:00  hourly.1
 61% (36%)    0% ( 0%)  Dec 02 08:00  hourly.2
 62% ( 5%)    0% ( 0%)  Dec 02 00:01  nightly.0
 69% (36%)    0% ( 0%)  Dec 01 20:00  hourly.3
 73% (36%)    0% ( 0%)  Dec 01 16:00  hourly.4
 77% (36%)    0% ( 0%)  Dec 01 00:01  nightly.1

What if you are running low on snap reserve? Sometimes, due to an excessive rate of change in the data, the snapshot reserve fills up very quickly and snapshots spill over into the data area of the volume. To remediate this you have to either extend the volume or delete old snapshots. To resize the volume use the 'vol size' command, and to delete old snapshots use the 'snap delete' command, which I cover in the next section. Before deleting, if you want to check how much free space you would gain from removing a snapshot, use:

snap reclaimable <volume_name> <snapshot_name> [<snapshot_name> ...]

Running this command gives output like the one below, and you can add further snapshot names one after another if deleting a single snapshot does not free enough space. Please note that you should select snapshots for deletion from oldest to newest; otherwise blocks freed by deleting a middle snapshot will still be locked by the snapshots that follow it.

testfiler> snap reclaimable testvol nightly.1 hourly.4
Processing (Press Ctrl-C to exit) ............
snap reclaimable: Approximately 9572 Kbytes would be freed.

How to delete a snapshot? To delete a snapshot, use snap delete with the volume name and snapshot name:

snap delete <volume_name> <snapshot_name>

Running this command prints something like:

testfiler> snap delete testvol hourly.5
Wed Dec 2 16:58:29 GMT [testfiler: wafl.snap.delete:info]: Snapshot copy hourly.5 on volume testvol was deleted by the Data ONTAP function snapcmd_delete. The unique ID for this Snapshot copy is (67, 3876).

How do you know the actual rate of change? Sometimes a particular volume keeps running out of snap reserve space because snapshots fill it up long before the old snaps expire and get deleted by the autodelete function (if you have configured it), and you will want to resize the snap reserve accurately to avoid any issues. To check the actual rate of change in KB per hour, calculated across all the snapshots or between two snapshots on a given volume, use the snap delta command:

snap delta <volume_name> [<1st snapshot name> <2nd snapshot name>]

testfiler> snap delta testvol
Volume testvol
working...

From Snapshot    To                    KB changed   Time        Rate (KB/hour)
---------------  --------------------  -----------  ----------  --------------
hourly.0         Active File System    [... per-snapshot rows omitted ...]

Summary...
From Snapshot    To                    KB changed   Time        Rate (KB/hour)
---------------  --------------------  -----------  ----------  --------------
weekly.3         Active File System    1209016      21d 13:29   2336.320

That was all about configuring, creating and deleting snapshots, but what good is it if you don't know how to restore the data from the snapshots you have gone to so much trouble to keep? So in the next post I will address how to restore data from a snapshot with the snap restore command.
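Since snap autodelete is mentioned above but never shown, here is a minimal sketch of enabling it on the hypothetical volume testvol; the option names are from my memory of 7-mode and may differ slightly on your release, so check the snap autodelete help first.

filer> snap autodelete testvol show                      (display the current autodelete settings)
filer> snap autodelete testvol trigger snap_reserve      (start deleting when the snap reserve is nearly full)
filer> snap autodelete testvol target_free_space 20      (keep deleting until 20% free space is reached)
filer> snap autodelete testvol on                        (enable automatic snapshot deletion)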

Snapshots in NetApp
Volumes and data: the volume used for the test was a flexible volume named 'buffer_aggr12'; the data was the 'My Documents' folder from my laptop, kept in sync with a cifs share on buffer_aggr12 using the sync tools from Microsoft.

Snapshot configuration: scheduled snapshots were configured at 9, 11, 13, 15, 17, 19 and 21 hours, with a retention of 4 weekly, 7 daily and 7 hourly snapshots and a 20% space reserve for snapshots.

The coolest part of snapshots is the flexibility: as an administrator, once you have configured them you no longer have to look after them, as the system takes snapshots on the defined schedule, and if you have configured 'snap autodelete' it will also purge expired snapshots as per your retention period. Effectively you never have to worry about managing hundreds of old snapshots lying in the volume and eating up space (except when the rate of change of the data overshoots and snapshots start spilling into the data area). As an

end user, you get backups a click away, because snapshots integrate well with the shadow copy service of windows 2000, XP or Vista, and you can recover files whenever you need to. Here's the snapshot configuration of my test volume 'buffer_aggr12':

AMSNAS02> snap sched buffer_aggr12
Volume buffer_aggr12: 4 7 7@9,11,13,15,17,19,21
AMSNAS02> snap reserve buffer_aggr12
Volume buffer_aggr12: current snapshot reserve is 20% or 157286400 k-bytes.

As I had been running this test for months, there were enough snaps for me to play with, and as you can see below these snapshots go all the way back to the 20th of July, a 4-week-old snapshot that I can recover from at any time with just a right click.

How to recover files or folders from a snapshot: there are two ways to recover data from snapshots.

As an end user you can recover your data from windows explorer by right-clicking in an empty space while you are in the share in which you lost your data. Here's an example of this.
a) This is my share folder; as you can see, my pst file is corrupted and showing 0 kb.
b) To recover it, right-click on any empty area and go to Properties > Previous Versions; it shows all the snapshots taken for this folder, as shown in the screenshot.
c) At this point I can either revert the whole folder to a previous state or copy it to another location to recover a deleted file, but here my aim is to revert a corrupted file rather than recover a deleted one. So I right-click on the file itself and go to the Previous Versions tab of the properties dialogue box. It shows the changes captured by snapshots at different times, so I just select the date I want to revert to and click Restore.
d) It starts replacing the corrupted file with the copy preserved by the snapshot. It takes a long time because the file in question is >1Gb and I am on a WAN link, so it's slow; there is another way to do it, recovering directly from the filer console, which finishes in seconds but unfortunately is not available to end users.
e) And here are my before and after screenshots.

As an administrator you can recover a file, folder or whole volume within seconds, because when it is done from the filer console the system doesn't have to copy the old file from the snapshot to a temp location, delete the old file and then fix up the recovered file's metadata; instead it just changes the block pointers internally, so it's blazing fast. Here's an example of this.
a) In this test I will again use the same corrupted pst file, but this time we will recover it from the console. First log in to the filer and do a snap list to see which snapshots are available:

AMSNAS02> snap list buffer_aggr12
Volume buffer_aggr12
working...

  %/used       %/total  date          name
----------  ----------  ------------  --------
[... 18 snapshots listed, from hourly.0 taken Aug 14 17:00
     back to weekly.3 taken Jul 20 00:00 ...]

b) Now, to recover the file, you give the command below and it recovers it in just a second:

AMSNAS02> snap restore -t file -s nightly.5 /vol/buffer_aggr12/RootQtree/test.pst

WARNING! This will restore a file from a snapshot into the active
filesystem.  If the file already exists in the active filesystem,
it will be overwritten with the contents from the snapshot.
Are you sure you want to do this? yes
You have selected file /vol/buffer_aggr12/RootQtree/test.pst, snapshot nightly.5
Proceed with restore? yes
AMSNAS02>

c) Here's the screenshot of my folder confirming the file is back in its previous state.

As you see, it is quite easy to use and very useful too, but to have snapshots you need some extra space reserved in the volume, especially if your data changes very frequently, as more changes mean more space is needed to store the changed blocks. The situation gets more complicated if you are trying to take a snapshot of a VM, Exchange or database volume, because before the snapshot is taken the application has to put itself into hot-backup mode so that a consistent copy can be made. Most applications have this functionality, but you have to use a script or a snapmanager, so that when the application is prepared it can tell the filer to take the snapshot, and once the snapshot is taken the filer can tell the application to resume its normal activity.

Restrict snapmirror access by host and volume on NetApp


Recently a fellow NetApp admin friend of mine asked me a fairly general question: "How do you restrict which data can be copied off through snapmirror?" Like any other normal NetApp guy, my answer was the same old vanilla one: "Go to the snapmirror.allow file and put the host names in it if you have set snapmirror.access to legacy, or put the host names directly in host=host1,host2 format in the snapmirror.access option." But he wanted a more granular level of permission, so my next answer was: "You can also enable snapmirror.checkip.enable, so a system merely reporting the same hostname will not be able to access the data." Even then he wasn't happy, and asked whether there is any way to restrict snapmirror access per volume. At this point I said, "No, NetApp doesn't provide that level of granular access." The topic stopped there, but the question stayed in my mind and kept haunting me: why isn't there such a way?

Fast forward. Last week, when I had some extra time on my hands, I started searching the net for this, and fortunately I found a way on the NOW site to make it work. It is recorded under the Bugs section as Bug ID #80611, which reads:

"There is an unsupported undocumented feature of the /etc/snapmirror.allow file, such that if it is filled as follows:

hostA:vol1
hostA:vol29
hostB:/vol/vol0/q42
hostC

and 'options snapmirror.access legacy' is issued, then the desired access policy will be implemented. Again note that this is unsupported and undocumented so use at your own risk."

Yes, NetApp says there is a way to do it, but they also say it may sometimes break other functionality or not work as expected. Having found this I sent the details to my friend, but unfortunately he didn't want to try it on his production systems, and no test systems are available to him. So if any of you want to try it, or have tried it before, please put your experience in the comments field.
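If you do want to experiment with this on a test box, a rough sketch of putting the per-volume entries in place from the console might look like the following (hostA, hostB and the volume names are taken from the bug text; 'wrfile -a' appends a line, and since this is an unsupported feature the behaviour may vary by release):

filer> wrfile -a /etc/snapmirror.allow hostA:vol1
filer> wrfile -a /etc/snapmirror.allow hostB:/vol/vol0/q42
filer> rdfile /etc/snapmirror.allow           (verify the entries)
filer> options snapmirror.access legacy       (tell snapmirror to consult snapmirror.allow)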

NetApp NFS agent for VCS - Part 3


In the first post I wrote about why I need this agent installed and what its features are, and in the last post I described how to configure it on the cluster nodes, but that post was incomplete because it was getting very long and I had to stop; so here is the remaining, and very important, part of that configuration.

How do you configure a different account name in the NetApp NFS agent for VCS? Hunting around in the agent's configuration guide from Veritas and NetApp didn't reveal anything, and even their KB search was no help. So I was left to find my own way and explore, which I started by creating a new customized account on the filer just for this purpose. Here are the actual commands I used, from the customized role through to the account:

useradmin role add exportfs -c "To manage NFS exports from CLI" -a cli-exportfs*,cli-lock*,cli-priv*,cli-sm_mon*
useradmin group add cli-exportfs-group -r exportfs -c "Group to manage NFS exportfs from CLI"
useradmin user add vcsagent -g cli-exportfs-group -c "To manage NFS exports from NetApp VCS Agent"

And here's the account after creation:

testfiler1> useradmin user list vcsagent
Name: vcsagent
Info: To manage NFS exports from NetApp VCS Agent
Rid: 131090
Groups: cli-exportfs-group
Full Name:
Allowed Capabilities: cli-exportfs*,cli-lock*,cli-priv*,cli-sm_mon*
Password min/max age in days: 0/4294967295
Status: enabled

The next thing was to give the cluster nodes only this limited access via the vcsagent user and revoke their root access, which was nothing more than removing their dsa keys from /etc/sshd/root/.ssh/authorized_keys and adding them to /etc/sshd/vcsagent/.ssh/authorized_keys. After completing that I headed back to the host and created a new file named 'config' in root's .ssh directory with the content below:

Host testfiler1
    User vcsagent
    Port 22
    HostName testfiler1.lab.com

As a test I issued 'ssh testfiler1 version' on the node terminal and got an access denied error, which was perfectly fine, because now when I do 'ssh testfiler1' the system looks into the config file in the .ssh directory and uses the vcsagent user, which does not have access to run the version command. Everything looked good, so I started running tests by moving the resource from one node to another, but to my surprise they failed to make changes on the filer, and the filer audit logs showed that they were still using root for ssh to the filer. Until I ran the tests I had been thinking that the agent simply relies on the OS for the ssh username, since NetApp hasn't provided any username attribute in the agent, and since I hadn't told the OS which account to use, when the agent executed 'ssh testfiler1 <command>' the OS directed the ssh connection to connect as root (the cluster node's local logged-in user). After the failed tests, though, I started to believe the username was hardcoded in the agent script, so I went looking in the script and soon found the following line in the file NetApp_VCS.pm:

$cmd = "$main::ssh -n root\@$host '$remote_cmd'";

After that finding, it didn't take much brainwork to figure out what was going wrong and what I had to do. I just removed the word 'root' from the script and it started working, because now it uses the config file from the .ssh directory and connects as vcsagent. Alternatively I could have replaced the word root with vcsagent directly in the script, to keep it simple and stay away from maintaining a config file, but I felt this approach was much better. Unfortunately, to this day there is no alternative to modifying the script, as neither NetApp nor Veritas were able to help us beyond the statement "we will raise a product enhancement request".

Update: You also need to give the user the "security-priv-advanced" capability, so the role should look like this:

testfiler01> useradmin role list exportfs
Name: exportfs
Info: To manage NFS exports from CLI
Allowed Capabilities: cli-exportfs*,cli-lock*,cli-priv*,cli-sm_mon*,security-priv-advanced

NetApp NFS agent for VCS - Part 2


In the last post I explained why I need this agent installed and what its features are; in this post I will describe how I implemented the agent on a 4-node RHEL 5.2 VCS cluster in our test environment. As this post is centred on the NetApp NFS agent for VCS configuration, I will not talk about how to install and configure VCS on RHEL.

First I created an NFS volume on our filer testfiler1 and exported it, giving rw access to all 4 nodes of the cluster (lablincl1n1, lablincl1n2, lablincl1n3, lablincl1n4); to keep it simple I used sec=sys rather than Kerberos or anything else. The next step was to download the agent from the NOW site and install it on all the cluster nodes; this was pretty straightforward and well documented in the admin guide, so no hurdles, and it went well. Once the agent installation and volume creation were done, I started configuring the NFS share in the agent through the GUI. I set FilerName to testfiler1 (the name of the NetApp filer exporting the NFS share), the MountPoint attribute to the local mount point, MountOptions to the Oracle-specific mount options, FilerPathName to the nfs volume name, NodeNICs to lablincl1n1, lablincl1n2, lablincl1n3, lablincl1n4 (the names of all the cluster nodes), and updated ClearNFSLocks to 2 and UseSSH to 1. I left all the other options untouched, as they were fine with their default values, like FilerPingTimeout=240, RebootOption=empty, HostingFilerName=empty, RouteViaAddress=empty, along with MultiNIC and the /etc/hosts file, because NIC teaming was done at the OS level and I felt too lazy to maintain lots of IP addresses in the hosts file; as a matter of fact I knew our BIND servers are robust enough.

Note: Please don't get confused by the HostingFilerName field, as you need it only if you are using vfilers. If you are exporting the NFS volume from a vfiler, put the vfiler name in the FilerName field and the physical filer name (on which the vfiler is created) in HostingFilerName.

The next step was configuring SSH, which was pretty easy: just use the 'ssh-keygen -t dsa' command to generate root's public and private keys on all your nodes and copy their public keys into the 'authorized_keys' file in the folder /etc/sshd/root/.ssh on your filer (sketched below). With that, the configuration was complete and everything worked as expected within about 4 hours of effort. At this point everything was done except one very important thing, i.e. security: following the agent's admin guide I had added the dsa keys to root's authorized_keys file, so anyone with root access on any of the 4 cluster nodes would also have root access on my filer, which I wasn't comfortable with. So I started looking through the agent's attributes for a way to configure a different account name for the agent to use, but to my surprise there was nothing, and none of the documents mentioned it either; so I went my own way to solve it, and it worked well after some extra effort. As this post is getting quite long, I will cover configuring a different user name in the VCS agent in the next post.
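A rough sketch of that key distribution step, assuming the filer's root volume is temporarily NFS-mounted on the node at /mnt/filer_root (the mount point is made up; you could equally paste the key in with wrfile):

# on each cluster node: generate a dsa key pair for root (no passphrase)
ssh-keygen -t dsa -f /root/.ssh/id_dsa -N ""

# append the node's public key to the filer's authorized_keys for root
mount testfiler1:/vol/vol0 /mnt/filer_root
cat /root/.ssh/id_dsa.pub >> /mnt/filer_root/etc/sshd/root/.ssh/authorized_keys
umount /mnt/filer_root

# quick test from the node
ssh testfiler1 version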

NetApp NFS agent for VCS


Recently we found ourselves in a space crunch again, but this time it was on our DMX systems spinning 15k FC disks. We started looking at space allocation and soon found a lot of low-IOPS Oracle databases using this space; adding up their allocations came to 460TB in total. Wasn't that enough space to give us a few more months before placing new orders? Oh yes. So we decided to move them onto NetApp boxes using 7.2k 1TB SATA disks, and not over FC or iSCSI but over NFS, as I knew NetApp provides a VCS agent to work with its NFS exports and offers some cool features. Though I had never used it, I was confident it would work, so I started implementing it in our test environment.

Here are the details of its features. The NetApp NFS client agent for VCS on Red Hat Linux/Solaris/SUSE Linux monitors the mount points on NetApp storage systems. In this environment, the clustered nodes (or a single node) use the NFS protocol to access the shared volume on NetApp storage systems, and the agent carries out the commands from VCS to bring resources online, monitor their status, and take them offline as needed. Key features for version 5.0 of the agent:

- Supports VCS 4.1 and 5.0*
- Supports exportfs persistency
- Supports IPMultiNIC and MultiNICA
- Supports Data ONTAP 7.1.x or later
- Supports fine-granularity NFS lock clearing (requires Data ONTAP 7.1.1 or later)
- Supports communication with the storage system through SSH, in addition to RSH
- Multithreading (NumThreads > 1) is supported (requires IPMultiNIC with MultiNICA)
- Supports automatic fencing of the export to ro access for the other nodes in the cluster as a resource moves from one node to another
- Supports failover of a single resource group when multiple resource groups of the same type are active on the same cluster node

Kernel requirement: Linux kernel 2.6.9-34.EL, 2.6.9-34.ELsmp for RHEL; 2.6.5-7.287.3smp for SUSE.
* VCS 4.1 is not supported for SUSE Linux. With Solaris 10, local zones are also supported in addition to global zones.

In the next part I will post how to implement it, which will also need some modification to the script.

References: NetApp NFS Client for VCS on RHEL, NetApp NFS Client for VCS on Solaris, NetApp NFS Client for VCS on SUSE Linux

SSH broken if you disable Telnet in ONTAP 7.3.1


And here's another bug we hit last month. While setting up our new filers I disabled telnet on the systems along with lots of other tweaking, but later, when I tried to connect to a system with SSH, it refused. Thinking I might have turned off some other deep registry feature, I went through the entire registry but couldn't find anything suspicious. So I turned on SSH verbose logging, tried re-running the SSH setup with different key sizes and what not, but no joy. Finally I tried enabling telnet and voila, it worked. By that time it was around 7 pm, so I called it a day and left the office scratching my head.

Next morning I again started looking for something obvious I might be missing, but no, I couldn't find anything even on the NOW site, so I opened a case with NetApp. Even the NetApp engineer could not understand why the system was behaving like this, but finally, late in the evening, he came back to me with BURT # 344484, which was fixed in 7.3.1.1P2. Now there was a bigger problem, as I wasn't quite ready to run my systems on a patched release, so I decided to leave telnet enabled and wait for 7.3.2 to arrive. Since then I have been getting bugged by the IT security team: I was trying to get these systems connected to the network so I could start allocating some space and get rid of the space-low warnings, but they wouldn't allow it because telnet was enabled. Finally, last week, when I noticed 7.3.2RC1 and 8.0RC1 availability on the NOW site, I breathed a sigh of relief, as I believe 7.3.2 GA should now be available within a month and I can finally have my systems meeting my organization's security policy and, more importantly, get rid of the pending space allocation requests.
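For reference, a minimal sketch of the options involved as they look on a 7-mode console (option names from memory, so treat them as an assumption and verify on your release); on 7.3.1 the workaround is simply the last line:

testfiler01> secureadmin setup ssh          # generate host keys and enable the SSH server
testfiler01> options ssh.enable on
testfiler01> options telnet.enable off      # what I originally wanted
testfiler01> options telnet.enable on       # workaround: keep telnet enabled until the BURT fix (7.3.1.1P2 / 7.3.2)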

NetApp command line shortcuts


Just a few key sequences I use frequently while on the console:

- CTRL+W = deletes the word before the cursor
- CTRL+R = rewrites (redraws) the entire line you have entered
- CTRL+U = deletes the whole line
- CTRL+A = go to the start of the line
- CTRL+E = go to the end of the line
- CTRL+K = delete all text after the cursor

A few more sequences exist, but I feel the arrow keys work better than pressing these:

- CTRL+F = right arrow
- CTRL+B = left arrow
- CTRL+P = up arrow
- CTRL+N = down arrow
- CTRL+I = Tab key

Am I missing anything else?

Failed disk replacement in NetApp


Disk failures are very common in a storage environment, and as storage administrators we come across this situation very often; how often depends on how many disks your storage systems have: the more disks you manage, the more often you see it. I have written this post with RAID-DP and FC-AL disks in mind, because RAID-DP is always better than RAID4 and we don't use SCSI loops. By design, RAID-DP protects against double disk failure in a single RAID group, meaning you will not lose data even if 2 disks in a single RG fail at the same time or one after another.

Like any other storage system, ONTAP uses a disk from the spare pool to rebuild the data from the surviving disks as soon as it encounters a failed disk, and sends an AutoSupport message to NetApp for parts replacement. Once the AutoSupport is received, NetApp initiates the RMA process and the part gets delivered to the address listed for that system in NetApp's records. Once the disk arrives you either change it yourself or ask a NetApp engineer to come onsite and change it; either way, as soon as you replace the disk your system finds the new working disk and adds it to the spare pool. Now wasn't that pretty simple and straightforward? Oh yes, because we are using software-based disk ownership and disk auto-assignment is turned on. Much like your baby catching a cold, calling up the GP himself and getting it cured rather than asking you to take care of him. But what if there are more complications? Below is what else can get in the way.

Scenario 1: I have replaced my drive and the light shows green or amber, but "sysconfig -r" still shows the drive as broken. Sometimes we face this problem because the system was not able to label the disk properly, or the replacement disk itself is not good. The first thing to try is labelling the disk correctly; if that doesn't work, try another (known good) disk, and if that too doesn't work, just contact NetApp and follow their guidelines. To re-label the disk from "BROKEN" to "SPARE", first note down the broken disk ID, which you can get from "aggr status -r"; then go to advanced mode with "priv set advanced" and run "disk unfail <disk_id>". At this stage your filer will throw some 3-4 errors on the console, syslog or SNMP traps, depending on how you have configured it, but this was the final step and the disk should now be good, which you can confirm with "disk show" for detailed status or the "sysconfig -r" command. Give it a few seconds to recognize the changed status of the disk if the change doesn't show at first.

Scenario 2: Two disks have failed in the same RAID group and I don't have any spare disk in my system. In this case you are really in big trouble, because you should always have at least one spare disk available in your system; NetApp recommends a 1:28 ratio, i.e. one spare for every 28 disks. With a dual disk failure you have a very high chance of losing your data if another disk goes while you are rebuilding data onto a spare or waiting for new disks to arrive. So always keep a minimum of 2 spare disks in your system; one is also fine and the system will not complain, but if you leave the system with only one spare, the maintenance centre will not work and the system will not scan any disk for potential failure. Going back to the situation of a dual disk failure with no spares available: the best bet is to ring NetApp to replace the failed disks ASAP, or, if you are losing your patience, select a disk of the same type from another healthy system, do a disk fail, remove it and use it to replace the failed disk. After adding the disk to the other filer, if it shows a partial/failed volume, make sure the volume reported as partial/failed belongs to the newly inserted disk using the "vol status -v" and "vol status -r" commands; if so, destroy that volume with "vol destroy" and then zero out the disk with "disk zero spares". This exercise will not take more than 15 minutes (except disk zeroing, which depends on your disk type and capacity) and you will end up with a single disk failure in 2 systems, each of which can survive another disk failure. But what if you don't do this and keep running your system with a dual disk failure? Your system will shut itself down after 24 hours; yes, it will shut down by itself, without any failover, to get your attention. There is a registry setting to control how long your system runs after a disk failure, but I think 24 hours is a good value and you shouldn't increase or decrease it unless you don't care about the data sitting there and anyone accessing it.

Scenario 3: My drive failed but there is no disk with an amber light. A number of times this happens because the disk electricals have failed and the system can no longer recognize the disk as part of it. In this situation you first have to identify the disk name. There are a couple of ways to find which disk has failed: a) "sysconfig -r", look in the broken disk list; b) check the AutoSupport message for the failed disk ID; c) "fcadmin device_map", look for a disk with "xxx" or a "BYP" message; d) in /etc/messages, look for a failed or bypassed disk warning, which gives the disk ID. Once you have identified the failed disk ID, run "disk fail <disk_id>" and check whether you see the amber light; if not, use "blink_on <disk_id>" in advanced mode to turn on the disk LED, or if that fails, turn on the adjacent disk's light with the same blink_on command so you can identify the disk correctly. Alternatively, you can use the led_on command instead of blink_on to turn on the LEDs of the disks adjacent to the defective one rather than its red LED. If you use the auto-assign function the system will assign the replacement disk to the spare pool automatically; otherwise use the "disk assign <disk_id>" command to assign the disk to the system.

Scenario 4: The disk LED remains orange after replacing the failed disk. This happens because you were in a hurry and didn't give the system enough time to recognize the change. When the failed disk is removed from the slot, the disk LED remains lit until the enclosure services notice and correct it, which generally takes around 30 seconds after removal. As you have already replaced the disk, use the led_off command from advanced mode, or, if that doesn't work because the system believes the LED is off when it is actually on, simply turn the LED on and then back off again using "led_on <disk_id>" followed by "led_off <disk_id>".

Scenario 5: Disk reconstruction failed. A number of issues can make RAID reconstruction fail on the new disk, including an enclosure access error, a file system disk not responding/missing, a spare disk not responding/missing, or something else; however, the most common reason is outdated firmware on the newly inserted disk. Check whether the new disk has the same firmware as the other disks; if not, update the firmware on the new disk first, and reconstruction should then finish successfully.

Scenario 6: Disk reconstruction stuck at 0% or failed to start. This might be an error, or due to a limitation in ONTAP, i.e. no more than 2 reconstructions should run at the same time. An error you might see at times occurs when the RAID group was in a degraded state and the system went through an unclean shutdown, so parity is marked inconsistent and needs to be recomputed after boot. However, as parity recomputation requires all data disks to be present in the RAID group, and we already have a failed disk in the RG, the aggregate will be marked WAFL_inconsistent. You can confirm this condition with the "aggr status -r" command. If this is the case you have to run wafliron, using the command "aggr wafliron start <aggr_name>" while in advanced mode. Make sure you contact NetApp before starting wafliron, as it will unmount all the volumes hosted in the aggregate until the first phase of checks is completed. The time wafliron takes to complete the first phase depends on lots of variables, such as the size of the volumes/aggregate/RG and the number of files/snapshots/LUNs, so you can't predict how long it will take; it might be 1 hour or 4-5 hours. So if you are running wafliron, contact NetApp first.
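As a quick illustration of Scenario 1, here is a minimal console sketch of re-labelling a broken disk as a spare; the disk ID 0a.23 is purely hypothetical:

testfiler01> aggr status -r            # note the broken disk ID (say 0a.23)
testfiler01> priv set advanced
testfiler01*> disk unfail 0a.23        # expect a few warnings on console/syslog/SNMP traps
testfiler01*> sysconfig -r             # give it a few seconds; the disk should now show as a spare
testfiler01*> priv set admin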

NetApp NFS mount for Sun Solaris 10 (64 bit)

In this post I have tried to cover mount options and other Solaris settings for higher NFS throughput. It is geared towards the 64-bit version, although these settings also apply to 32-bit; a few extra settings come into play on 32-bit, like super caching, as far as I remember, because I compiled this list long ago and it is still very handy when I get complaints about low performance. For further details, see the references section.

Mount options: rw,bg,hard,nointr,rsize=32768,wsize=32768,vers=3,proto=tcp

Kernel tuning (on Solaris 10 the following kernel parameters should be set to the shown value, or higher):

Parameter | Replaced by (Resource Control) | Recommended Minimum Value
noexec_user_stack | NA | 1
semsys:seminfo_semmni | project.max-sem-ids | 100
semsys:seminfo_semmns | NA | 1024
semsys:seminfo_semmsl | project.max-sem-nsems | 256
semsys:seminfo_semvmx | NA | 32767
shmsys:shminfo_shmmax | project.max-shm-memory | 4294967296
shmsys:shminfo_shmmni | project.max-shm-ids | 100

Solaris file descriptors:
rlim_fd_cur - "soft" limit on the number of file descriptors (and sockets) a single process can have open
rlim_fd_max - "hard" limit on the number of file descriptors (and sockets) a single process can have open
Setting these values to 1024 is strongly recommended to avoid database crashes resulting from Solaris resource deprivation.

Network settings:

Parameter | Value | Details
/dev/tcp tcp_recv_hiwat | 65535 | Increases the TCP receive buffer (receive high watermark)
/dev/tcp tcp_xmit_hiwat | 65535 | Increases the TCP transmit buffer (transmit high watermark)
/dev/ge adv_pauseTX | 1 | Enables transmit flow control
/dev/ge adv_pauseRX | 1 | Enables receive flow control
/dev/ge adv_1000fdx_cap | 1 | Forces full duplex for GbE ports

sq_max_size - sets the maximum number of messages allowed on each IP queue (STREAMS synchronized queue). Increasing this value improves network performance; a safe value is 25 per 64MB of physical memory, up to a maximum of 100. It can be tuned by starting at 25 and incrementing by 10 until network performance peaks.
nstrpush - determines the maximum number of modules that can be pushed onto a stream; should be set to 9.

References: NetApp Technical Reports TR-3633, TR-3496, TR-3322; NetApp Knowledge Base article 7518
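A minimal sketch of how these settings might be applied on the client, assuming a filer named testfiler1 exporting /vol/oradata and a local mount point /u02/oradata (all placeholder names); the mount options and tuning values are the ones listed above:

# /etc/system additions (reboot required); resource-control equivalents go via projmod
set noexec_user_stack=1
set semsys:seminfo_semmns=1024
set semsys:seminfo_semvmx=32767
set rlim_fd_cur=1024
set rlim_fd_max=1024

# runtime network tuning
ndd -set /dev/tcp tcp_recv_hiwat 65535
ndd -set /dev/tcp tcp_xmit_hiwat 65535

# /etc/vfstab entry (one line), then mount
testfiler1:/vol/oradata - /u02/oradata nfs - yes rw,bg,hard,nointr,rsize=32768,wsize=32768,vers=3,proto=tcp
mount /u02/oradata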

NetApp Active/Active vs. Active/Passive (Stretch MetroCluster) solution


Active/Active controller configuration
In this configuration both systems are connected to each other's disks and have a heartbeat connection through the NVRAM card. When one controller fails, the other controller takes over the load of the failed controller and keeps operations going, since it has a connection to the failed controller's disk shelves. Further details of Active/Active cluster best practices can be found in TR-3450.

Active/Passive (Stretch MetroCluster) configuration
The same layout as an active/active MetroCluster applies to an active/passive MetroCluster, except that one node in the cluster holds only a mirror of the primary system's data. In this configuration the primary and secondary systems can be up to 500m apart (up to 100km with Fabric MetroCluster), and all primary system data is mirrored to the secondary system with SyncMirror; in the event of a primary system failure, all connections automatically switch over to the remote copy. This provides an additional level of failure protection, such as against a whole disk shelf failure or multiple failures at the same time; however, it requires another copy of the same data and the exact same hardware configuration on the secondary node. Please note that a cluster interconnect (CI) on the NVRAM card is required for a cluster configuration; the 3170, however, offers a new architecture that incorporates a dual-controller design with the cluster interconnect on the backplane. For this reason, the FCVI card that is normally used for CI in a Fabric MetroCluster configuration must also be used in a 31xx Stretch configuration. Further details of MetroCluster design and implementation can be found in TR-3548.

Minimizing downtime with a cluster
Although a cluster configuration saves you from unwanted downtime, a small disruption can be noticed on the network while takeover/giveback is happening, which is approximately 90 seconds or less in most environments; the NAS network stays alive with a few "not responding" errors on clients. A few points related to this are given below:

CIFS: leads to a loss of session to the clients, and possible loss of data. However, clients will reconnect the session by themselves if the system comes up before the timeout window.
NFS hard mounts: clients will continue to attempt reconnection indefinitely, therefore a controller reboot does not affect clients unless the application issuing the request times out waiting for NFS responses. Consequently, it may be appropriate to compensate by extending the application timeout window.
NFS soft mounts: client processes continue reconnection attempts until the timeout limit is reached. While soft mounts may reduce the possibility of client instability during failover, they expose applications to the potential for silent data corruption, so they are advised only where client responsiveness is more important than data integrity. If TCP soft mounts are not possible, reduce the risk of UDP soft mounts by specifying a long retransmission timeout value and a relatively large number of retries in the mount options (e.g., timeo=30, retrans=10).
FTP, NDMP, HTTP, backups, restores: state is lost and the operation must be retried by the client.
Applications (for example, Oracle, Exchange): application-specific. Generally, if timeout-based, application parameters can be tuned to increase timeout intervals to exceed the Data ONTAP reboot time as a means of avoiding application disruption.
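If you want to measure the takeover/giveback disruption in your own environment, a simple approach is to time a manual failover from the console; a minimal sketch, assuming a pair named testfiler01/testfiler02 (placeholder names):

testfiler01> cf status          # confirm clustering is enabled and the partner is up
testfiler01> cf takeover        # testfiler01 takes over the partner's resources; watch your NAS clients
testfiler01> cf giveback        # return the resources once the partner has rebooted and is waiting for giveback

Timing how long CIFS sessions drop and NFS hard mounts hang during this window gives you the real number for your clients rather than the approximate 90 seconds mentioned above.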

NetApp NFS mount for Red Hat Linux 5.2


Just another post from my mails, where I have collected some best practices for mounting NFS shares in RHEL.

Automounter
An automounter can cause a lot of network chatter, so it is best to disable the automounter on your client and set up static mounts before taking a network trace. Automounters depend on the availability of several network infrastructure services; if any of these services is unreliable or performs poorly, it can adversely affect the performance and availability of your NFS clients. When diagnosing an NFS client problem, triple-check your automounter configuration first. It is often wise to disable the automounter before drilling into client problem diagnosis.

Linux kernel tuning for kNFS
sunrpc.tcp_slot_table_entries = 128
Increasing this parameter from the default of 16 to the maximum of 128 increases the number of in-flight Remote Procedure Calls (I/Os). Be sure to edit /etc/init.d/netfs to call /sbin/sysctl -p in the first line of the script so that sunrpc.tcp_slot_table_entries is set before NFS mounts any file systems. If NFS mounts the file systems before this parameter is set, the default value of 16 will be in force.

Mount options
rw,bg,hard,intr,rsize=32768,wsize=32768,vers=3,proto=tcp,timeo=600,retrans=2

Kernel tuning
Most modern Linux distributions contain a file called /etc/sysctl.conf where you can add changes like these so they are applied after every system reboot. Add these lines to /etc/sysctl.conf on your client systems (see the sketch after this post):

net.core.rmem_default = 262144 (default TCP receive window/buffer size; improves network performance for IP-based protocols)
net.core.rmem_max = 16777216 (maximum TCP receive window/buffer size)
net.core.wmem_default = 262144 (default TCP send window/buffer size)
net.core.wmem_max = 16777216 (maximum TCP send window/buffer size)
net.ipv4.tcp_rmem = 4096 262144 16777216 (autotuning for the TCP receive window; default and max values are overridden by rmem_default/rmem_max)
net.ipv4.tcp_wmem = 4096 262144 16777216 (autotuning for the TCP send window; default and max values are overridden by wmem_default/wmem_max)
net.ipv4.tcp_window_scaling = 1 (TCP window scaling, allows a TCP window size greater than 65536; enabled by default, make sure it doesn't get disabled)
net.ipv4.tcp_syncookies = 0 (disables generation of SYN (crypto) cookies; helps to reduce CPU overhead)
net.ipv4.tcp_timestamps = 0 (disables the RTTM feature introduced in RFC 1323; helps to reduce CPU overhead and avoids adding a 10-byte overhead to the TCP header)
net.ipv4.tcp_sack = 0 (disables selective ACK; helps to reduce CPU overhead)

References: NetApp technical reports TR-3700, TR-3183, TR-3369; NetApp Knowledge Base article 7518
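Putting the above together, a minimal sketch for an RHEL 5.2 client; testfiler1, /vol/oradata and /u02/oradata are placeholder names, while the sysctl values and mount options are exactly the ones listed above:

# /etc/sysctl.conf (apply with /sbin/sysctl -p)
sunrpc.tcp_slot_table_entries = 128
net.core.rmem_default = 262144
net.core.rmem_max = 16777216
net.core.wmem_default = 262144
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 262144 16777216
net.ipv4.tcp_wmem = 4096 262144 16777216

# /etc/fstab entry (one line), then mount
testfiler1:/vol/oradata /u02/oradata nfs rw,bg,hard,intr,rsize=32768,wsize=32768,vers=3,proto=tcp,timeo=600,retrans=2 0 0
mount /u02/oradata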
