
MODULE 1: KERNEL

Exercise 1: Recovering from a boot loop


Time Estimate: 20 minutes
Step
1.

Action
Log in to the clustershell and execute the following command
cluster1::> cluster show
Node                  Health  Eligibility
--------------------- ------- -----------
cluster1-01           true    true
cluster1-02           false   true
cluster1-03           true    true
cluster1-04           true    true
4 entries were displayed.

2.

Note that the health of node clusterX-02 is false.


Try to log in to the nodeshell of clusterX-02 to find out the problem.
If you are unable to access the nodeshell of clusterX-02, try to access it through its console.
What do you see?

3.

How do you fix this?

MODULE 2: M-HOST
Exercise 1: Fun with mgwd and mroot
Time Estimate: 20 minutes
Step
1.

Action
On a node that does not hold epsilon, log in to your cluster as admin via the
console and go into the systemshell.
::> set diag
::*> systemshell local

2.

Execute the following:


% ps -A|grep mgwd
  913  ??  Ss       0:11.76 mgwd -z
 2794  p1  DL+      0:00.00 grep mgwd

The above listing shows that the process ID of the running instance of mgwd on this
node is 913.
Kill mgwd as follows:
% sudo kill <PID of mgwd as obtained above>

3.

You see the following output. Why?


server closed connection unexpectedly: No such file or
directory
login:
Log in as admin again, as shown below:
server closed connection unexpectedly: No such file or
directory
login: admin
Password:
What happens?

4.

You are now in clustershell. Drop to systemshell as follows:


::> set diag

::*> systemshell local


In systemshell execute the following:
% cd /etc
% sudo ./netapp_mroot_unmount
% exit
logout
When would we expect the node to use/need this script?

5.

Now you are back in clustershell. Execute the following:


cluster1::> set diag

Warning: These diagnostic commands are for use by NetApp


personnel only.
Do you want to continue? {y|n}: y

cluster1::*> cluster show


Node                 Health  Eligibility  Epsilon
-------------------- ------- ------------ ------------
cluster1-01          true    true         true
cluster1-02          true    true         false
cluster1-03          true    true         false
cluster1-04          true    true         false
4 entries were displayed.


cluster1::*> vol modify -vserver studentX -volume studentX_nfs -size 45M
  (volume modify)

Error: command failed: Failed to queue job 'Modify studentX_nfs'. IO error in
       local job store

cluster1::*> cluster show


Node                 Health  Eligibility  Epsilon
-------------------- ------- ------------ ------------
cluster1-01          true    false        true
cluster1-02          false   true         false
cluster1-03          false   true         false
cluster1-04          false   true         false
4 entries were displayed.

Do we see a difference in cluster show? If so, why? What's broken?

6.

To fix this without rebooting and without manually re-mounting /mroot, restart mgwd (see the sketch below).
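A minimal sketch of one way to do this, reusing the kill-and-respawn pattern shown in Exercise 5, step 7, of this module (spmd restarts mgwd, and the new mgwd instance re-mounts /mroot and /clus):

::> set diag
::*> systemshell local
% ps -A | grep mgwd          (note the PID of mgwd)
% sudo kill <PID of mgwd>    (spmd respawns mgwd, which re-mounts /mroot)
% exit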

7.

In which phase of the boot process could we see this behavior occur?

Exercise 2: Configuration backup and recovery


Time Estimate: 40 minutes

Action
1.

Run the following commands:


::> set advanced
::*> man system configuration backup create
::*> man system configuration recovery node
::*> man system configuration recovery cluster
::*> system configuration backup show -node <nodename>
What do each of the commands show?

2.

Where in systemshell can you find the files listed above?
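As a hint, the backup archives live under a fixed path that also appears later in this exercise; a quick way to look (a sketch, not the full answer):

% ls -l /mroot/etc/backups/config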

3.

Create a new system configuration backup of the node and the cluster as follows:
cluster1::*> system configuration backup create -node cluster1-01
  -backup-type node -backup-name cluster1-01.node
[Job 164] Job is queued: Local backup job.
::*> job private show
::*> job private show -id [Job id given as output of the backup create command above]
::*> job private show -id [id as above] -fields uuid
::*> job store show -id [uuid obtained from the command above]

cluster1::*> system configuration backup create -node cluster1-01
  -backup-type cluster -backup-name cluster1-01.cluster
[Job 495] Job is queued: Cluster Backup OnDemand Job.
::> job show

4.

The following KB shows how to scp the backup files you created, as well as one of
the system-created backups off to the Linux client:
https://kb.netapp.com/support/index?page=content&id=1012580
Use the following to install p7zip on your Linux client and use it to unzip the backup
files.
# yum install p7zip

This is the recommended practice on live nodes; however, for vsims scp does not
work.
So in the current lab setup, drop to the systemshell and cd to
/mroot/etc/backups/config
Unzip the system created backup file by doing the following:

% 7za e [system created backup file name]

What is in this file?

cd into one of the folders created by the unzip. There will be another 7z file. Extract
it:
% 7za e [file name]
What's in this file?
Extract the file:
% 7za e [file name]
What's inside it?

Compare it to what is in /mroot/etc of one of the cluster nodes. What are some of the
differences?

5.

cd into cluster_config in the backup. What is different from


/mroot/etc/cluster_config on the node?

6.

cd into cluster_replicated_records at the root of the folder you originally extracted


the backup to and issue an ls command.
What do you see?

7.

Unzip the node and cluster backups you created. What do you notice about the
contents of these files?

Exercise 3: Moving mroot to a new aggregate


Time Estimate: 30 minutes

Step
1.

Action
Move a node's root volume to a new aggregate.
Work with your lab partners and do this on only one node.
For live nodes, the following KB contains the steps to do this:
https://kb.netapp.com/support/index?page=content&id=1013350&actp=LIST
However, for vsims the root volume that is created by default is only 20MB and
too small to hold the cluster configuration information.
Hence, follow the steps given below:

2.

Run the following command to create a new 3-disk aggregate on the desired node:
cluster1::> aggr create -aggregate new_root -diskcount 3 -nodes local
[Job 276] Job succeeded: DONE
cluster1::> aggr show -nodes local
Aggregate     Size Available Used% State   #Vols  Nodes        RAID Status
--------- -------- --------- ----- ------- ------ ------------ -----------
aggr0_cluster1_02_0
             900MB   15.45MB   98% online       1 cluster1-02  raid_dp,
                                                               normal
student2     900MB   467.4MB   48% online       8 cluster1-02  raid_dp,
                                                               normal
2 entries were displayed.

3.

Ensure that the node does not own an epsilon. If it does, run the following command
to move it to another node in the cluster:
cluster1::> set diag

Warning: These diagnostic commands are for use by NetApp


personnel only.
Do you want to continue? {y|n}: y

cluster1::*> cluster show


Node                 Health  Eligibility  Epsilon
-------------------- ------- ------------ ------------
cluster1-01          true    true         false
cluster1-02          true    true         true
cluster1-03          true    true         false
cluster1-04          true    true         false
4 entries were displayed.

Run the following command to move the epsilon and modify it to 'false' on the
owning node:
::*> cluster modify -node cluster1-02 -epsilon false

Then, run the following command to modify it to 'true' on the desired node:
::*> cluster modify -node cluster1-01 -epsilon true

::*> cluster show


Node                 Health  Eligibility  Epsilon
-------------------- ------- ------------ ------------
cluster1-01          true    true         true
cluster1-02          true    true         false
cluster1-03          true    true         false
cluster1-04          true    true         false
4 entries were displayed.

4.

Run the following command to set the cluster eligibility on the node to 'false':
::*> cluster modify -node cluster1-02 -eligibility false

Note: This action must be performed from a node other than the one being marked ineligible.

5.

Run the following command to reboot the node into maintenance mode
cluster1::*> reboot local
(system node reboot)

Warning: Are you sure you want to reboot the node? {y|n}: y

login:
Waiting for PIDS: 718.
Waiting for PIDS: 695.
Terminated
.
Uptime: 2h12m14s
System rebooting...
\
Hit [Enter] to boot immediately, or any other key for command
prompt.
Booting...
x86_64/freebsd/image1/kernel data=0x7ded08+0x1376c0
syms=[0x8+0x3b7f0+0x8+0x274a
8]
x86_64/freebsd/image1/platform.ko size 0x213b78 at 0xa7a000
NetApp Data ONTAP 8.1.1X34 Cluster-Mode
Copyright (C) 1992-2012 NetApp.
All rights reserved.
md1.uzip: 26368 x 16384 blocks
md2.uzip: 3584 x 16384 blocks
*******************************
* Press Ctrl-C for Boot Menu. *
*******************************
^CBoot Menu will be available.
Generating host.conf.

Please choose one of the following:

(1) Normal Boot.


(2) Boot without /etc/rc.
(3) Change password.
(4) Clean configuration and initialize all disks.
(5) Maintenance mode boot.
(6) Update flash from backup config.
(7) Install new software first.
(8) Reboot node.
Selection (1-8)? 5
.
WARNING: Giving up waiting for mroot

Tue Sep 11 11:23:27 UTC 2012


*> Sep 11 11:23:28 [cluster1-02:kern.syslog.msg:info]: root
logged in from SP NONE

*>

6.

Run the following command to set the options for the new aggregate to become the
new root:
Note: It might be required to set the aggr options to CFO instead of SFO:
*> aggr options new_root root
aggr options: This operation is not allowed on aggregates with sfo HA
Policy

*> aggr options new_root ha_policy cfo


Setting ha_policy to cfo will substantially increase the client
outage during giveback for cluster volumes on aggregate new_root.
Are you sure you want to proceed? y
*> aggr options new_root root
Aggregate 'new_root' will become root at the next boot.
*>

7.

Run the following command to reboot the node:


*> halt
Sep 11 11:27:49 [cluster1-02:kern.cli.cmd:debug]: Command line
input: the command is 'halt'. The full command line is 'halt'.

.
Uptime: 6m26s

The operating system has halted.


Please press any key to reboot.

System halting...
\
Hit [Enter] to boot immediately, or any other key for command
prompt.
Booting in 1 second...

8.

Once the node is booted, a new root volume named AUTOROOT will be created. In
addition, the node will not be in quorum yet. This is because the new root volume
will not be aware of the cluster.
login: admin
Password:
***********************
**  SYSTEM MESSAGES  **
***********************

A new root volume was detected. This node is not fully
operational. Contact support personnel for the root volume
recovery procedures.

cluster1-02::>

9.

Increase the size of AUTOROOT on the node by doing the following


Log in to the systemshell of a node that is in quorum and execute the following D-blade ZAPIs to:
a) Get the UUID of volume AUTOROOT on the node where the root volume was
changed
b) Increase the size of the same AUTOROOT volume by 500m
c) Check that the size was successfully changed
% zsmcli -H <cluster ip address of the node where new root
volume was created> d-volume-list-info-iter-start desired-attrs=name,uuid
<results status="passed">
  <next-tag>cookie=0,desired_attrs=name,uuid</next-tag>
</results>
% zsmcli -H <cluster ip address of the node where new root
volume was created> d-volume-list-info-iter-next maximum-records=10
tag='cookie=0,desired_attrs=name,uuid'
<results status="passed">
  <volume-attrs>
    <d-volume-info>
      <name>vol0</name>
      <uuid>014df353-bbc1-11e1-bb4c-123478563412</uuid>
    </d-volume-info>
    <d-volume-info>
      <name>student2_root</name>
      <uuid>044f53fa-e784-11e1-ab6e-123478563412</uuid>
    </d-volume-info>
    <d-volume-info>
      <name>student2_LS_root</name>
      <uuid>0ea7ae4c-e790-11e1-ab6e-123478563412</uuid>
    </d-volume-info>
    <d-volume-info>
      <name>AUTOROOT</name>
      <uuid>30d8f742-fc04-11e1-bbf5-123478563412</uuid>
    </d-volume-info>
    <d-volume-info>
      <name>student2_cifs</name>
      <uuid>b8868843-e788-11e1-ab6e-123478563412</uuid>
    </d-volume-info>
    <d-volume-info>
      <name>student2_cifs_child</name>
      <uuid>c07f13ce-e788-11e1-ab6e-123478563412</uuid>
    </d-volume-info>
    <d-volume-info>
      <name>student2_nfs</name>
      <uuid>c861f83b-e788-11e1-ab6e-123478563412</uuid>
    </d-volume-info>
% zsmcli -H 192.168.71.33 d-volume-set-info desired-attrs=size
id=30d8f742-fc04-11e1-bbf5-123478563412 volume-attrs='[d-volume-info=[size=+500m]]'
<results status="passed"/>
% zsmcli -H 192.168.71.33 d-volume-list-info id=30d8f742-fc04-11e1-bbf5-123478563412 desired-attrs=size
<results status="passed">
  <volume-attrs>
    <d-volume-info>
      <size>525m</size>
    </d-volume-info>
  </volume-attrs>
</results>

10.

Clear the root recovery flags if required by doing the following:


Log in to the systemshell of the node where the new root volume was created and

check if the bootarg.init.boot_recovery bit is set

% sudo kenv bootarg.init.boot_recovery


If a value is returned (that is, the output is not "kenv: unable to get
bootarg.init.boot_recovery"), clear the bit:

% sudo sysctl kern.bootargs=--bootarg.init.boot_recovery
kern.bootargs:  ->

Check that the bit is cleared


% sudo kenv bootarg.init.boot_recovery
kenv: unable to get bootarg.init.boot_recovery
%

11.

From a healthy node, with all nodes booted, run the following command:
::*> system configuration recovery cluster rejoin -node <the node
where new root volume was created>

Warning: This command will rejoin node "cluster1-02" into the


local cluster, potentially overwriting critical cluster
configuration files. This command should only be used
to recover from a disaster. Do not perform any other
recovery
operations while this operation is in progress. This
command will cause node "cluster1-02" to reboot.
Do you want to continue? {y|n}: y
Node "cluster1-02" is rebooting. After it reboots, verify that
it joined the new cluster.

12.

After a boot, check the cluster to ensure that the node is back and eligible:
cluster1::> cluster show
Node                  Health  Eligibility
--------------------- ------- -----------
cluster1-01           true    true
cluster1-02           true    true
cluster1-03           true    true
cluster1-04           true    true
4 entries were displayed.

13.

If the cluster is still not in quorum, run the following command:


::*> system configuration recovery cluster sync -node <node where new root
volume was created>


Warning: This command will synchronize node "cluster1-02" with the
cluster configuration, potentially overwriting critical cluster
configuration files on the node. This feature should only be used to
recover from a disaster. Do not perform any other recovery
operations while this operation is in progress. This command will
cause all the cluster applications on node "node4" to restart,
interrupting administrative CLI and Web interface on that node.
Do you want to continue? {y|n}: y
All cluster applications on node "cluster1-02" will be restarted.
Verify that the cluster applications go online.

14.

After the node is in quorum, run the following command to add the new root vol to
VLDB. This is necessary because it is a 7-Mode volume and will not be
displayed until it is added:
cluster1::> set diag
cluster1::*> vol show -vserver cluster1-02
  (volume show)
Vserver   Volume       Aggregate    State      Type       Size  Available Used%
--------- ------------ ------------ ---------- ---- ---------- --------- -----
cluster1-02
          vol0         aggr0_cluster1_02_0
                                    online     RW      851.5MB   283.3MB   66%

cluster1::*> vol add-other-volumes -node cluster1-02
  (volume add-other-volumes)

cluster1::*> vol show -vserver cluster1-02
  (volume show)
Vserver   Volume       Aggregate    State      Type       Size  Available Used%
--------- ------------ ------------ ---------- ---- ---------- --------- -----
cluster1-02
          AUTOROOT     new_root     online     RW        525MB   379.2MB   27%
cluster1-02
          vol0         aggr0_cluster1_02_0
                                    online     RW      851.5MB   283.3MB   66%
2 entries were displayed.

15.

Run the following command to remove the old root volume from VLDB
cluster1::*> vol remove-other-volume -vserver cluster1-02 -volume vol0
  (volume remove-other-volume)

cluster1::*> vol show -vserver cluster1-02
  (volume show)
Vserver   Volume       Aggregate    State      Type       Size  Available Used%
--------- ------------ ------------ ---------- ---- ---------- --------- -----
cluster1-02
          AUTOROOT     new_root     online     RW        525MB   379.2MB   27%

16.

Destroy the old root vol by running the following command from the node shell of the
node where the new root volume has been created
cluster1::*> node run local
Type 'exit' or 'Ctrl-D' to return to the CLI
cluster1-02> vol status vol0
         Volume State           Status                Options
           vol0 online          raid_dp, flex         nvfail=on
                                64-bit
                Volume UUID: 014df353-bbc1-11e1-bb4c-123478563412
        Containing aggregate: 'aggr0_cluster1_02_0'
cluster1-02> vol offline vol0
Volume 'vol0' is now offline.
cluster1-02> vol destroy vol0
Are you sure you want to destroy volume 'vol0'? y
Volume 'vol0' destroyed.
And the old root aggr can be destroyed if desired:
From cluster shell:
cluster1::*> aggr show -node <node where new root vol was created>
Aggregate     Size Available Used% State   #Vols  Nodes        RAID Status
--------- -------- --------- ----- ------- ------ ------------ -----------
aggr0_cluster1_02_0
             900MB   899.7MB    0% online       0 cluster1-02  raid_dp,
                                                               normal
new_root     900MB   371.9MB   59% online       1 cluster1-02  raid_dp,
                                                               normal
student2     900MB   467.2MB   48% online       8 cluster1-02  raid_dp,
                                                               normal
3 entries were displayed.

cluster1::*> aggr delete -aggregate <old root aggregate name>

Warning: Are you sure you want to destroy aggregate
         "aggr0_cluster1_02_0"? {y|n}: y
[Job 277] Job succeeded: DONE

17.

Use the following KB to rename the root volume (AUTOROOT) to vol0:


https://kb.netapp.com/support/index?page=content&id=2015985

18.

What sort of things regarding the root vol did you observe during this?

Exercise 4: Locate and Repair Aggregate Issues


Time Estimate: 15 minutes

Action
1.

Login to clustershell of clusterX and execute the following:


::> aggr show -aggregate VLDBX (team member 1 use X=1 and team
member 2 use X = 2)
There are no entries matching your query.
One aggregate is showing as missing from the cluster shell:

Execute the following:


::> aggr show -aggregate WAFLX -instance
      Aggregate: WAFLX
           Size:
      Used Size:
Used Percentage:
 Available Size:
          State: unknown
          Nodes: cluster1-02
Another aggregate is showing as unknown:

Fix the issue.

2.

Issue the following command. Do you see anything wrong?


::*> debug vreport show aggregate

3.

What nodes do the aggregates belong to? How do you know?

4.

Use the debug vreport fix command to resolve the problem.
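A hedged sketch of the command form (argument names recalled from the 8.x diag CLI; confirm them with "debug vreport fix ?" before running):

::*> debug vreport show
::*> debug vreport fix -type aggregate -object <aggregate name as reported by vreport>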

5.

List some of the reasons why customers could have this problem.

6.

Was any data lost? If so, which aggregate?

Exercise 5: Replication failures


Time Estimate: 20 minutes
Action
1.

Note: Participants working with cluster2 should replace student1 with student3
and student2 with student4 in all the steps of this exercise.

Log in to the systemshell of clusterX-02 (make sure it does not hold epsilon).


Unmount mroot and clus and prevent mgwd from being monitored by spmctl, as
follows:
% sudo umount -f /mroot
% sudo umount -f /clus
% spmctl -d -h mgwd

2.

Login to ngsh on clusterX-02 and execute the following:


cluster1::*> volume create -vserver student1 -volume test -aggregate
Info: Node cluster1-01 that hosts aggregate aggr0 is offline
      Node cluster1-03 that hosts aggregate aggr0_cluster1_03_0 is offline
      Node cluster1-04 that hosts aggregate aggr0_cluster1_04_0 is offline
      Node cluster1-01 that hosts aggregate student1 is offline
    aggr0                aggr0_cluster1_03_0   aggr0_cluster1_04_0
    new_root             student1              student2

cluster1::*> volume create -vserver student1 -volume test -aggregate student2


Error: command failed: Replication service is offline
cluster1::*> net int create -vserver student1 -lif test -role
data -home-node cluster1-02 -home-port e0c -address
10.10.10.10 -netmask 255.255.255.0 -status-admin up


(network interface create)
Info: An error occurred while creating the interface, but a new routing group
      d10.10.10.0/24 was created and left in place
Error: command failed: Local unit offline

cluster1::*> vserver create -vserver test -rootvolume test -aggregate student1
-ns-switch file -rootvolume-security-style unix
Info: Node cluster1-01 that hosts aggregate student1 is offline
Error: create_imp: create txn failed
       command failed: Local unit offline

3.

Login to ngsh on clusterX-01 and execute the following:


cluster1::> volume create test -vserver student2 -aggregate
Info: Node cluster1-02 that hosts aggregate new_root is offline
      Node cluster1-02 that hosts aggregate student2 is offline
    aggr0                aggr0_cluster1_03_0   aggr0_cluster1_04_0
    new_root             student1              student2

cluster1::> volume create test -vserver student2 -aggregate student2 -size 20MB
Info: Node cluster1-02 that hosts aggregate student2 is offline
Error: command failed: Failed to create the volume because cannot determine the
       state of aggregate student2.

cluster1::> volume create test -vserver student2 -aggregate student1 -size 20MB
[Job 368] Job succeeded: Successful
Note: when a volume is created on an aggregate not hosted on clusterX-02,
the volume create succeeds.

cluster1::> net int create -vserver student1 -lif data2 -role
data -data-protocol nfs,cifs,fcache -home-node cluster1-02
-home-port e0c -address 10.10.10.10 -netmask 255.255.255.0
  (network interface create)
Info: create_imp: Failed to create virtual interface
Error: command failed: Routing group d10.10.10.0/24 not found

cluster1::> net int create -vserver student1 -lif data2 -role
data -data-protocol nfs,cifs,fcache -home-node cluster1-01
-home-port e0c -address 10.10.10.10 -netmask 255.255.255.0
  (network interface create)
Note: when an interface is created on a port not hosted on clusterX-02, the
interface create succeeds.

cluster1::*> vserver create -vserver test -rootvolume test -aggregate student2
-ns-switch file -rootvolume-security-style unix
Info: Node cluster1-02 that hosts aggregate student2 is offline
Error: create_imp: create txn failed
       command failed: Local unit offline

cluster1::*> vserver create -vserver test -rootvolume test -aggregate student1
-ns-switch file -rootvolume-security-style unix
[Job 435] Job succeeded: Successful
Note: when a vserver is created and its root volume is created on an aggregate
that is not hosted on clusterX-02, the vserver create succeeds.

4.

Log in to systemshell of clusterX-02.


Execute the following:
cluster1-02% mount
/dev/md0 on / (ufs, local, read-only)
devfs on /dev (devfs, local)
/dev/ad0s2 on /cfcard (msdosfs, local)
/dev/md1.uzip on / (ufs, local, read-only, union)
/dev/md2.uzip on /platform (ufs, local, read-only)
/dev/ad3 on /sim (ufs, local, noclusterr, noclusterw)
/dev/ad1s1 on /var (ufs, local, synchronous)
procfs on /proc (procfs, local)

/dev/md3 on /tmp (ufs, local, soft-updates)


/mroot/etc/cluster_config/vserver on /mroot/vserver_fs (vserverfs, union)
Note that /mroot and /clus are not mounted

5.

From the systemshell of clusterX-02 run the following commands:

% rdb_dump

What do you see?

% tail -100 /mroot/etc/mlog/mgwd.log | more
What do you see?
Log in to the systemshell of clusterX-01 and run the following command:
% tail -100 /mroot/etc/mlog/mgwd.log | more
What do you see?

6.

From systemshell of clusterX-02 run:


%spmctl
What do you see?

7.

What happened?

Fixing these issues:


a) Re-add mgwd to spmctl with:
% ps aux | grep mgwd
root 779 0.0 17.6 303448 133136 ?? Ss 1:53PM 0:44.12 mgwd -z
diag 3619 0.0 0.2 12016 1204 p2 S+ 4:39PM 0:00.00 grep mgwd
% spmctl -a -h mgwd -p 779
b) Then restart mgwd, which will re-mount /mroot and /clus:
% sudo kill <PID>

Exercise 6: Troubleshooting Autosupport


Time Estimate: 20 minutes

Action
1.

From clustershell of each node send a test autosupport as follows: (y takes the
values 1,2,3,4)
::*> system autosupport invoke -node clusterX-0y -type test
You will see an error such as:
Error: command failed: RPC: Remote system error - Connection refused

2.

Let's find out why.


Connection refused means that we couldn't talk to the application for some reason.
In this case, notifyd is the application.
When we look at systemshell for the process, it's not there:
cluster1-01% ps aux | grep notifyd
diag 5442 0.0 0.2 12016 1160 p0 S+ 9:20PM 0:00.00 grep notifyd

3.

spmctl manages notifyd


We can check to see why spmctl didn't start notifyd back up:
cluster-1-01% cat spmd.log | grep -i notify
0000002e.00001228 0002ba73 Tue Aug 09 2011 21:26:31 +00:00
[kern_spmd:info:739]
0x800702d30: INFO: spmd::ProcessController:
sendShutdownSignal:process_controller.cc:186 sending SIGTERM to
5498:
0000002e.00001229 0002ba73 Tue Aug 09 2011 21:26:31 +00:00
[kern_spmd:info:739]
0x8007023d0: INFO: spmd::ProcessWatcher: _run:process_watcher.cc:152
kevent
returned: 1
0000002e.0000122a 0002ba73 Tue Aug 09 2011 21:26:31 +00:00
[kern_spmd:info:739]
0x8007023d0: INFO: spmd::ProcessControlManager:
dumpExitConditions:process_control_manager.cc:732 process
(notifyd:5498) exited on
signal 15
0000002e.0000122b 0002ba7d Tue Aug 09 2011 21:26:32 +00:00
[kern_spmd:info:739]
0x8007023d0: INFO: spmd::ProcessWatcher: _run:process_watcher.cc:148

wait for
incoming events.
And then we check spmctl to see if it's still monitoring notifyd:
cluster-1-01% spmctl | grep notify
In this case, it looks like notifyd got removed from spmctl and we need to re-add it:
cluster-1-01% spmctl -e -h notifyd
cluster-1-01% spmctl | grep notify
Exec=/sbin/notifyd -n;Handle=56548532-c334-4633-8cd877ef97682d3d;Pid=15678;State=Running
cluster-1-01% ps aux | grep notify
root 15678 0.0 6.7 112244 50568 ?? Ss 4:06PM 0:02.42 /sbin/notifyd
diag 15792 0.0 0.2 12016 1144 p2 S+ 4:06PM 0:00.00 grep notify

4.

Try to send a test autosupport.


::*> system autosupport invoke -node clusterX-0y -type test

What happens?

MODULE 3: SCON
Exercise 1: Vifmgr and MGWD interaction
Time Estimate: 30 minutes
Step
1.

Action
Try to create an interface:
clusterX::*> net int create -vserver studentY -lif test -role
data -data-protocol nfs,cifs,fcache -home-node clusterX-02 -home-port

You see the following error:
Warning: Unable to list entries for vifmgr on node clusterX-02. RPC: Remote
         system error - Connection refused

    Home Port
    {<netport>|<ifgrp>}

2.

Ping the interfaces of clusterX-02, the node whose ports seem inaccessible:


clusterX::*> cluster ping-cluster

-node clusterX-02

What do you see?

3.

Perform data access:


Attempt cifs access to \\student2\student2(cluster1) or \\student4\student4(cluster2)
from the windows machine
What happens?

4.

Execute the following:


clusterX::*> net int show
What do you see?

5.

Run net port show:

clusterX::*> net port show
What do you see?

6.

Check the system logs:
clusterX::*> debug log files modify -incl-files vifmgr,mgwd
clusterX::*> debug log show -node clusterX-02 -timestamp "Mon Oct 10*"
What do you see?

7.

Log in to systemshell on clusterX-02 and run ps to see if vifmgr is running:


clusterX-02% ps -A |grep vifmgr

8.

Run rdb_dump from the systemshell of clusterX-02:

clusterX-02% rdb_dump
What do you see?

9.

Run the following from the systemshell of clusterX-02:
clusterX-02% spmctl | grep vifmgr
What do you see?

10.

In cluster shell execute cluster ring show


clusterX::*> cluster ring show

11.

What is the issue?

How do you fix it?
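A sketch of the likely repair, following the same spmctl pattern used for notifyd in Module 2, Exercise 6 (verify the process handle name on your system before re-adding it):

clusterX-02% spmctl | grep vifmgr
clusterX-02% spmctl -e -h vifmgr
clusterX-02% ps -A | grep vifmgr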

Exercise 2: Duplicate lif IDs


Time Estimate: 30 minutes

Step

Action
From the clustershell create a new network interface as follows: Y ∈ {1,2,3,4}

1.

clusterX::*> net int create -vserver studentY -lif data1 -role data
-data-protocol nfs,cifs,fcache -home-node clusterX-0Y -home-port e0c
-address 192.168.81.21Y -netmask 255.255.255.0 -status-admin up
(network interface create)

Info: create_imp: Failed to create virtual interface


Error: command failed: Duplicate lif id

2.

Execute the following:


clusterX::*> net int show
What do you see?

3.

View the mgwd log file on the node where you are issuing the net int create
command and determine the lif ID that is being reported as duplicate.

4.
Execute the following:
clusterX::*>debug smdb table vifmgr_virtual_interface show
-node clusterX-0* -lif-id [lifid/vifid determined from step
3]
What do you see?

5.

Execute the following:


clusterX::*> debug smdb table vifmgr_virtual_interface delete
-node clusterX-0Y -lif-id <the duplicate id>
clusterX::*> debug smdb table vifmgr_virtual_interface show
-node clusterX-0Y -lif-id <the duplicate id>
There are no entries matching your query.

6.

Create new lif:


clusterX::*> net int create -vserver studentY -lif testY -role data
-data-protocol nfs,cifs,fcache -home-node clusterX-0Y -home-port e0c
-address 192.168.81.21Y -netmask 255.255.255.0 -status-admin up
(network interface create)

MODULE 4: NFS
Exercise 1: Mount issues
Time Estimate: 20 minutes
Step
1.

Action
From the Linux Host execute the following:
# mkdir /cmodeY
# mount studentY:/studentY_nfs /cmodeY
You see the following:
mount: mount to NFS server 'studentY' failed: RPC Error:
Program not registered.

2.

Find out the node being mounted:


From the Linux Host execute the following to find the IP address being accessed:
#ping studentY
PING studentY (192.168.81.115) 56(84) bytes of data.
64 bytes from studentY (192.168.81.115): icmp_seq=1 ttl=255
time=1.09 ms
From the clustershell use the following to find out the current node and port on which
the above IP address is hosted
clusterX::*> net int show -vserver studentY -address 192.168.81.115 -fields curr-node,curr-port
  (network interface show)
vserver  lif            curr-node   curr-port
-------- -------------- ----------- ---------
studentY studentY_data1 clusterX-01 e0d

3.

Execute the following to start a packet trace from the nodeshell of the node that was
being mounted and attempt the mount once more
clusterX::*> run -node clusterX-01
Type 'exit' or 'Ctrl-D' to return to the CLI
clusterX-01> pktt start e0d
e0d: started packet trace
From the Linux Host attempt the mount once more as shown below:

# mount student1:/student1_nfs /cmode1


Back in the nodeshell of the node that was mounted dump and stop the packet trace
clusterX-01> pktt dump e0d
clusterX-01> pktt stop e0d
e0d: Tracing stopped and packet trace buffers released.
From the systemshell of the node where the packet trace was captured view the
packet trace using tcpdump
clusterX-01> exit
logout

clusterX::*> systemshell -node clusterX-01


clusterX-01% cd /mroot
clusterX-01% ls
e0d_20120925_131928.trc    etc    home    trend    vserver_fs
clusterX-01% tcpdump -r e0d_20120925_131928.trc

What do you see? Why?

4.

How do you fix the issue?
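A hedged starting point (not the full answer): check from the clustershell whether the NFS service is actually configured and running on the vserver, for example:

clusterX::> vserver nfs status -vserver studentY
clusterX::> vserver nfs show -vserver studentY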

5.

After fixing the issue check that the mount is successful.


Note: If the mount succeeds, please unmount. This step is very important, or the
rest of the exercises will be impacted.

Exercise 2: Mount and access issues


Time Estimate: 30 minutes

Step
1.

Action
From the Linux Host attempt to mount volume studentX_nfs.

# mount studentX:/studentX_nfs /cmode


mount: studentX:/studentX_nfs failed, reason given by server:
Permission denied

2.

From the clustershell execute the following to find the export policy associated with the
volume studentX_nfs:
cluster1::*> vol show -vserver studentX -volume studentX_nfs -instance
Next, use export-policy rule show to find the properties of the export policy
associated with the volume studentX_nfs.
Why did you get an access denied error?
How will you fix the issue? (A sketch follows below.)
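A sketch of the kind of fix, using the same command that appears in step 6 of this exercise (substitute the policy name and rule index you found above):

clusterX::> export-policy rule show -vserver studentX -policyname <policy> -ruleindex <index>
clusterX::> export-policy rule modify -vserver studentX -policyname <policy> -ruleindex <index> -rorule any -rwrule any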

3.

Now once again attempt to mount studentX_nfs from the Linux Host
# mount studentX:/studentX_nfs /cmode
mount: studentX:/studentX_nfs failed, reason given by server:
No such file or directory
What issue is occurring here?

4.

Now once again attempt to mount studentX_nfs from the Linux Host
# mount studentX:/studentX_nfs /cmode
Is the mount successful?
If yes, cd into the mount point
#cd /cmode
-bash: cd: /cmode: Permission denied
How do you resolve this?
Note: Depending on how you resolved the issue with the export-policy in step
2, you may not see any error here. In that case, move on to step 5.
If you unmount and remount, does it still work?

5.
Try to write a file into the mount:
[root@nfshost cmode]# touch f1

What does ls -la show?

[root@nfshost cmode]# ls -la
total 16
drwx------   2 admin admin 4096 Sep 25 08:06 .
drwxr-xr-x  26 root  root  4096 Sep 25 06:03 ..
-rw-r--r--   1 admin admin    0 Sep 25 08:06 f1
drwxrwxrwx  12 root  root  4096 Sep 25 08:05 .snapshot

What do you see the file permissions as?


Why are the permissions and owner set the way they are?

6.

From clustershell Execute:


clusterX::> export-policy rule modify -vserver studentY -policyname studentY -ruleindex 1 -rorule any -rwrule any
(vserver export-policy rule modify)

Exercise 3: Stale file handle


Time Estimate: 30 minutes

Step
1.

Action
From the Linux Host execute:
# cd /nfsX
-bash: cd: /nfsX: Stale NFS file handle

2.

Unmount the volume from the client and try to re-mount. What happens?

3.

From the Linux Host:


# ping studentX
PING studentX (192.168.81.115) 56(84) bytes of data.
The IP above (192.168.81.115) is the IP of the vserver being mounted.
Find the node in the cluster that is currently hosting this IP
From your clustershell
::*> net int show -address 192.168.81.115 -fields curr-node
  (network interface show)
vserver  lif            curr-node
-------- -------------- -----------
studentX studentX_data1 clusterY-0X


The node shown above (clusterY-0X) is the node that is currently hosting the IP.
Log in to the systemshell of this node and view the vldb logs
cluster1::*> systemshell -node clusterY-0X
cluster1-01% tail /mroot/etc/mlog/vldb.log
What do you see?

4.

Look for volumes with the MSID in the error shown in the vldb log as follows:
From clustershell execute the following to find the aggregate where the volume
being mounted(nfs_studentX) lives and on which node that aggregate lives:
cluster1::*> vol show -vserver studentX -volume nfs_studentX -fields aggregate
  (volume show)
vserver  volume       aggregate
-------- ------------ ---------
studentX nfs_studentX studentX

cluster1::*> aggr show -aggregate studentX -fields nodes
aggregate nodes
--------- -----------
studentX  clusterY-0X

Go to the nodeshell of the node (shown above) that hosts the volume and its
aggregate, use the showfh command, and convert the MSID from hex.
::>run node clusterY-0X
>priv set diag
*>showfh /vol/nfs_studentX
flags=0x00 snapid=0 fileid=0x000040 gen=0x5849a79f
fsid=0x16cd2501 dsid=0x0000000000041e msid=0x00000080000420

0x00000080000420 converted to decimal is 2147484704


Exit from the nodeshell back to the clustershell and execute debug vreport show in diag
mode:
cluster1-01*> exit
logout

cluster1::*> debug vreport show


What do you see?

5.

What is the issue here?

6.

How would you fix this?

MODULE 5: CIFS
Instructions to Students:
As mentioned in the lab handout the valid windows users in the domain
Learn.NetApp.local are:
a) Administrator
b) Student1
c) Student2

Exercise 1: Using diag secd


Time Estimate: 20 minutes

Step
1.

Action
Find the node where the IP(s) for vserver studentX is hosted
From the RDP machine do the following to start a command window
Start->Run->cmd
In the command window type
ping studentX
From the clustershell find the node on which the IP is hosted (Refer to NFS Exercise
3)
Login to the console of that node and execute the steps of this exercise

2.

Type the following:


::> diag secd
What do you see and why?

3.

Note: for all the steps of this exercise clusterY-0X should be the name of the
local node
Type the following to verify the name mapping of the Windows user student1:

::diag secd*> name-mapping show -node local -vserver
studentX -direction win-unix -name student1

4.

From the RDP machine do the following to access a cifs share


Start -> Run -> \\studentX
Type the following to query for the Windows SID of your windows user name
cluster1::diag secd*> authentication show-creds -node local
-vserver studentX -win-name <username that you have used
to RDP to the windows machine>
DC Return Code: 0
Windows User: Administrator Domain: LEARN Privs: a7
Primary Grp: S-1-5-21-3281022357-2736815186-1577070138-513
Domain: S-1-5-21-3281022357-2736815186-1577070138
Rids: 500, 572, 519, 518, 512, 520, 513
Domain: S-1-5-32 Rids: 545, 544
Domain: S-1-1 Rids: 0
Domain: S-1-5 Rids: 11, 2
Unix ID: 65534, GID: 65534
Flags: 1
Domain ID: 0
Other GIDs:
cluster1::diag secd*> authentication translate -node local -vserver student1 -win-name <username that you have used to RDP
to the windows machine>
S-1-5-21-3281022357-2736815186-1577070138-500

5.

Type the following to test a Windows login for your user windows name in diag secd
cluster1::diag secd*> authentication login-cifs -node
local -vserver studentX -user <username that you have
used to RDP to the windows machine>

Enter the password: <your windows password i.e Netapp123>


Windows User: Administrator Domain: LEARN Privs: a7
Primary Grp: S-1-5-21-3281022357-2736815186-1577070138-513
Domain: S-1-5-21-3281022357-2736815186-1577070138
Rids: 500, 513, 520, 512, 518, 519, 572
Domain: S-1-1 Rids: 0
Domain: S-1-5 Rids: 11, 2
Domain: S-1-5-32 Rids: 544

Unix ID: 65534, GID: 65534


Flags: 1
Domain ID: 0
Other GIDs:
Authentication Succeeded.

6.

Type the following to view active CIFS connections in secd


cluster1::diag secd*> connections show -node clusterY-0X -vserver studentX
[ Cache: NetLogon/learn.netapp.local ]
Queue> Waiting: 0, Max Waiting: 1, Wait Timeouts: 0, Avg
Wait: 0.00ms
Performance> Hits: 0, Misses: 1, Failures: 0, Avg
Retrieval: 24505.00ms

(No connections active or currently cached)

[ Cache: LSA/learn.netapp.local ]
Queue> Waiting: 0, Max Waiting: 1, Wait Timeouts: 0, Avg
Wait: 0.00ms
Performance> Hits: 1, Misses: 4, Failures: 0, Avg
Retrieval: 6795.40ms

(No connections active or currently cached)

[ Cache: LDAP (Active Directory)/learn.netapp.local ]


Queue> Waiting: 0, Max Waiting: 1, Wait Timeouts: 0, Avg
Wait: 0.00ms
Performance> Hits: 1, Misses: 3, Failures: 1, Avg
Retrieval: 2832.75ms

(No connections active or currently cached)


Type the following to clear active CIFS connections in secd
cluster1::diag secd*> connections clear -node clusterY-0X -vserver studentX

Test connections on vserver student1 marked for removal.


NetLogon connections on vserver student1 marked for
removal.
LSA connections on vserver student1 marked for removal.
LDAP (Active Directory) connections on vserver student1
marked for removal.
LDAP (NIS & Name Mapping) connections on vserver student1
marked for removal.
NIS connections on vserver student1 marked for removal.

7.

Type the following to view the server discovery information


cluster1::diag secd*> server-discovery show-host -node
clusterY-0X

Host Name: win2k8-01


Cifs Domain:
AD Domain:
IP Address: 192.168.81.10

Host Name: win2k8-01


Cifs Domain:
AD Domain:
IP Address: 192.168.81.253
Type the following to achieve the same result as ONTAP 7G's cifs resetdc:
cluster1::diag secd*> server-discovery reset -node
clusterY-0X -vserver studentX
Discovery Reset succeeded for Vserver: student1
To verify, type the following:
cluster1::diag secd*> server-discovery show-host -node
clusterY-0X
Type the following to achieve the same result as ONTAP 7G's cifs testdc:

cluster1::diag secd*> server-discovery test -node clusterY-0X -vserver studentX


Discovery Global succeeded for Vserver: studentX

8.

Type the following to view current logging level in secd


cluster1::diag secd*> log show -node clusterY-0X
Log Options
----------------------------------
Log level:                   Debug
Function enter/exit logging: OFF

Type the following to set and view the current logging level in secd
cluster1::diag secd*> log set -node clusterY-0X -level err
Setting log level to "Error"

cluster1::diag secd*> log show -node clusterY-0X


Log Options
----------------------------------
Log level:                   Error
Function enter/exit logging: OFF

9.

Type the following to enable tracing in secd to capture the logging level specified
cluster1::diag secd*> trace show -node local
Trace Spec
--------------------------------------Trace spec has not been set.
cluster1::diag secd*> trace set -node cluster1-01 -traceall yes
Trace spec set successfully for trace-all.

cluster1::diag secd*> trace show -node cluster1-01


Trace Spec
---------------------------------------
TraceAll: Tracing all RPCs

10.

Type the following to check the secd configuration for comparison with the ngsh settings:
cluster1::diag secd*> config query -node local -source-name
    cifs-server             machine-account        kerberos-realm
    nis-domain              vserver                vserverid-to-name
    unix-group-membership   local-unix-user        local-unix-group
    kerberos-keyblock       ldap-config            ldap-client-config
    ldap-client-schema      name-mapping           nfs-kerberos
    cifs-server-security    dns                    virtual-interface
    routing-group-routes    cifs-server-options    cifs-preferred-dc
    secd-cache-config

cluster1::diag secd*> configuration query -node local -source-name machine-account


vserver: 5
cur_pwd:
0100962681ce82e2d6da20df35ce86964fea2c495d9609d395a51994
31d3d4531144f845fcfd675e15143fe76932ced271ddcf57c9d8fe59
a63b0bc68f717077fc88ca28aa0fdbba4b8d8509bb25ebe2
new_pwd:
installdate: 1345202770
sid: S-1-5-21-3281022357-2736815186-1577070138-1609

vserver: 6
cur_pwd:
01433517c8acbbf66c2e287b4bee56f5d8b707cfb69710737bfb2061
6ebe61fc31163acde2b5a827f3c2d395b89fef15f28a8f514c147906
580cbaa30b4a1361444f76036d2c590222ce1a0feaa56779
new_pwd:
installdate: 1345202787
sid: S-1-5-21-3281022357-2736815186-1577070138-1610

11.

Type the following to clear the cache(s) one at a time


cluster1::diag secd*> cache clear -node clusterY-0X -vserver studentX -cache-name
    ad-to-netbios-domain      ems-delivery              netbios-to-ad-domain
    ldap-groupid-to-name      ldap-groupname-to-id      ldap-userid-to-creds
    ldap-username-to-creds    log-duplicate             name-to-sid
    sid-to-name               nis-groupid-to-name       nis-groupname-to-id
    nis-userid-to-creds       nis-username-to-creds     nis-group-membership
    netgroup                  lif-bad-route-to-target   schannel-key

cluster1::diag secd*> cache clear -node clusterY-0X -vserver studentX -cache-name ad-to-netbios-domain
Type the following to clear all caches together
cluster1::diag secd*> restart -node clusterY-0X

You are attempting to restart a process in charge of


security services. Do not
restart this process unless the system has generated a
"secd.config.updateFail"
event or you have been instructed to restart this process
by support personnel.

This command can take up to 2 minutes to complete.

Are you sure you want to proceed? {y|n}: y

Restart successful! Security services are operating correctly.

12.

From the RDP machine close the cifs share \\studentX opened in windows explorer

Exercise 2: Authentication issues


Time Estimate: 30 minutes
Step
1.

Action
From the RDP machine access the cifs share \\studentX
Start->Run->\\studentX
What error message do you see?

2.

Refer to step 1 of exercise 1 and


Find the node where the IP(s) for vserver studentX is hosted
Login to the console of that node and execute the steps of this exercise
From clustershell of the node , run the following commands:
::> set diag
::*> diag secd authentication translate -node local -vserver
studentX -win-name <your windows username>
::*> diag secd authentication sid-to-uid -node local -vserver
studentX -sid <sid from previous command>
::*> diag secd authentication show-creds -node local -vserver
studentX -win-name <username>

Does the user seem to be functioning properly? If not, what error do you get?

3.

Run the following command:


::> event log show

What message do you see?

4.

Run the following command:


::> diag secd name-mapping show -node local -vserver
student1 -direction win-unix -name <your windows username>
::> vserver name-mapping show -vserver studentX -direction
win-unix -position *
::> cifs options show -vserver studentX

5.

Which log in systemshell can we look at to see errors for this problem?

6.

What issues did you find?

7.

cluster1::*> unix-user create -vserver studentX -user pcuser -id 65534 -primary-gid 65534
  (vserver services unix-user create)

cluster1::*> cifs options modify -vserver studentX -default-unix-user pcuser

8.

The Windows Explorer window that opens when you navigate to Start->Run->\\studentX shows 2 shares:
a) studentX
b) studentX_child
Try to access the shares
What happens?
Do the following:

Enable debug logging for secd on the node that owns your data lifs

cluster1::*> diag secd log set -node local -level debug


Setting log level to "Debug"
cluster1::*> trace set -node local -trace-all yes
(diag secd trace set)
Trace spec set successfully for trace-all.

Close the CIFS session on the Windows host and run net use /d * from
cmd to clear cached sessions and retry the connection

Enter systemshell and cd to /mroot/etc/mlog

Type tail -f secd.log


What do you see?

9.

Given the results of the previous tests, what could the issue be here?

10.

From ngsh (clustershell) run:


cluster1::> vserver show -vserver studentX -fields rootvolume
vserver  rootvolume
-------- -------------
studentX studentX_root


The value in the rootvolume column is the root volume of the vserver you are accessing.
cluster1::> vserver cifs share show -vserver studentX -share-name studentX

                      Vserver: studentX
                        Share: studentX
     CIFS Server NetBIOS Name: STUDENTX
                         Path: /studentX_cifs
             Share Properties: oplocks
                               browsable
                               changenotify
           Symlink Properties:
      File Mode Creation Mask:
 Directory Mode Creation Mask:
                Share Comment:
                    Share ACL: Everyone / Full Control
File Attribute Cache Lifetime:

cluster1::*> vserver cifs share show -vserver studentX -share-name studentX_child

                      Vserver: studentX
                        Share: studentX_child
     CIFS Server NetBIOS Name: STUDENTX
                         Path: /studentX_cifs_child
             Share Properties: oplocks
                               browsable
                               changenotify
           Symlink Properties:
      File Mode Creation Mask:
 Directory Mode Creation Mask:
                Share Comment:
                    Share ACL: Everyone / Full Control
File Attribute Cache Lifetime:

From the above commands, obtain the names of the volumes being accessed via the
shares.

11.

Now that you know the volumes you are trying to access, use fsecurity show to view
the permissions on them.
cluster1::*> vol show -vserver studentX -volume studentX_cifs -instance

Find the node that hosts the aggregate where studentX_cifs lives.
From the nodeshell of that node run:
cluster1-01> fsecurity show /vol/studentX_cifs
What do you see?

cluster1::*> vol show -vserver studentX -volume studentX_cifs_child -instance

Find the node that hosts the aggregate where studentX_cifs_child lives.
From the nodeshell of that node run:
cluster1-01> fsecurity show /vol/studentX_cifs_child
What do you see?

Find the node that hosts the aggregate where studentX_root lives.
From the nodeshell of that node run:
cluster1-01> fsecurity show /vol/studentX_root
What do you see?

12.

From ngsh run:


cluster1::*> volume modify -vserver studentX -volume
studentX_root -unix-permissions 755
Queued private job: 167
Are you able to access both the shares now?

13.

From ngsh run:


cluster1::*> volume modify -vserver studentX -volume
studentX_cifs -security-style ntfs
Queued private job: 168
Does this resolve the issue?

Exercise 3: Authorization issues


Time Estimate: 20 minutes

Step
1.

Action
From a client go Start -> Run -> \\studentX\studentX
What do you see?

2.

Try to view the permissions on the share. What do you see?

3.

From the nodeshell of the node where the volume and its aggregate is hosted run:
cluster1-01> fsecurity show /vol/student1_cifs
[/vol/student1_cifs - Directory (inum 64)]
Security style: NTFS
Effective style: NTFS

DOS attributes: 0x0010 (----D---)

Unix security:
uid: 0
gid: 0
mode: 0777 (rwxrwxrwx)

NTFS security descriptor:


Owner: S-1-5-32-544
Group: S-1-5-32-544
DACL:
Allow - S-1-5-21-3281022357-2736815186-1577070138-500 0x001f01ff (Full Control)

4.

From the above command, obtain the sid of the owner of the volume.
From ngsh run:

cluster1::*> diag secd authentication translate -node local -vserver studentX -sid S-1-5-32-544
What do you see?

5.

How do you resolve this issue?

Exercise 4: Export Policies


Time Estimate: 20 minutes
Step
1.

Action
Try to access \\studentX\studentX
What do you see?

2.

What error do you see?

3.

What does the event log show? What about the secd log? (Exercise 2, steps 3 and 8)

4.

From the nodeshell of the node that hosts the volume and its aggregate run:
fsecurity show /vol/studentX_cifs
Do the permissions show that access should be allowed?

5.

From clustershell obtain the name of the export-policy associated with the volume as
follows:
cluster1::> volume show -vserver studentX -volume studentX_cifs -fields policy
Now view details of the export-policy obtained in the previous command:
cluster1::> export-policy rule show -vserver studentX -policyname <policy name obtained from the above command>
cluster1::> export-policy rule show -vserver studentX -policyname <policy name obtained from the above command> -ruleindex <rule index applicable>
What do you see?
How do you fix the issue?
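If the rule turns out to exclude the CIFS protocol, one possible fix is sketched below (the -protocol field belongs to the export-policy rule command family; confirm the current value with the show command first):

cluster1::> export-policy rule show -vserver studentX -policyname <policy name> -ruleindex <rule index> -instance
cluster1::> export-policy rule modify -vserver studentX -policyname <policy name> -ruleindex <rule index> -protocol any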

MODULE 6: SCALABLE SAN


Exercise 1: Enable SAN features and create a LUN and connect via ISCSI
Time Estimate: 20 minutes

Step
1.

Action
Review your SAN configuration on the cluster:
- Licenses
- SAN protocol services
- Interfaces

2.

Create a lun in your studentX_san volume.

3.

Create an igroup and add the ISCSI IQN of your host to the group.

4.

Configure the ISCSI initiator

5.

Map the lun and access from lab host. Format the lun and write data to it.
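A hedged sketch of steps 2, 3, and 5 from the clustershell (the LUN path, igroup name, and size are placeholders; the initiator IQN comes from the iSCSI Initiator control panel on the Windows host, which also handles the target discovery and login for step 4):

cluster1::> lun create -vserver studentX -path /vol/studentX_san/lun1 -size 100MB -ostype windows_2008
cluster1::> lun igroup create -vserver studentX -igroup studentX_ig -protocol iscsi -ostype windows -initiator <host IQN>
cluster1::> lun map -vserver studentX -path /vol/studentX_san/lun1 -igroup studentX_ig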

6.

From clustershell
cluster1::*> iscsi show
What do you see?
cluster1::*> debug seqid show
What do you see?

7.

1. Locate the UUIDs of your iSCSI LIFs


::> debug smdb table vifmgr_virtual_interface show -lifname <iscsi_lif>

2. Display the statistics for these LIFs


cluster1::statistics*> show -node cluster1-01 -object
iscsi_lif -counter iscsi_read_ops -instance <UUID obtained
from the above command>

EXERCISE 2
TASK 1: TROUBLESHOOT QUORUM ISSUES

In this task, you experience quorum failure on a node of the cluster.

STEP ACTION

1.

Team member 1: log in to the console of clusterY-01 as admin.

From here on, this node will be referred to as Node 1.

2.

Team member 2: log in to the console of clusterY-02 as admin.

From here on, this node will be referred to as Node 2.

3.

Team member 1 on the Node 1 console ngsh


::> set diag

4.

Team member 2 on the Node 2 console ngsh


::> set diag

5.

Team member 2 on the Node 2 ngsh , verify cluster status


::*> cluster show

6.

Team member 2 on the Node 2 ngsh, view the current LIFs:


::*> net int show

7.

Team member 2 on the Node 2 ngsh, view the current cluster kernel status:
::*> cluster kernel-service show -instance

8.

Team member 2 on the Node 2 ngsh, bring down the cluster network LIFs:
::*> net int modify -vserver clusterY-02 -lif clus1,clus2 -status-admin down

STEP ACTION

9.

Team member 2 on the Node 2 ngsh, view the current cluster kernel status:
::*> cluster kernel-service show -instance

10.

Team member 1 on the Node 1 ngsh, view the current cluster kernel status:
::*> cluster kernel-service show -instance

11.

On the Node 2 PuTTY interface, enable the cluster network LIFs on the interface:
::*> net int modify -vserver cluster1-02 -lif clus1,clus2 -status-admin up

12.

Team member 2 on the Node 2 ngsh, view the current cluster kernel status:
::*> cluster kernel-service show -instance
What do you see?

13.

Team member 1 on the Node 1 ngsh, view the current cluster kernel status:
::*> cluster kernel-service show -instance
What do you see?

14.

cluster1::*> debug smdb table bcomd_info show


What do you see?

STEP ACTION

15.

Team member 1 on the Node 1 ngsh, view the current bcomd information:
cluster1::*> debug smdb table bcomd_info show
What do you see?

16.

Team member 2: reboot Node 2 to have it start participating in SAN quorum again:
::*> reboot -node clusterY-02

17.

Team member 2 console log in on Node2 as admin

18.

Team member 2 on Node2, verify cluster health:

::> cluster show

19.

Team member 2 on Node2


::> set diag

20.

Verify that both nodes have a cluster kernel status of in quorum (INQ):
::*> cluster kernel-service show -instance
::*> debug smdb table bcomd_info show

TASK 2: TROUBLESHOOT LOGICAL INTERFACE ISSUES

In this task, you bring down the LIFs that are associated with a LUN.
STEP ACTION

1.

Console log in as admin on clusterY-0X and view the current LIFs:

::*> net int show

2.

On your own, disable the LIFs that are associated with studentX_iscsi and determine how this action
impacts connectivity to your LUN on the Windows host.
END OF EXERCISE

Exercise 3: Diag level SAN debugging


Time Estimate: 25 minutes

Step

Action

1.

What are two ways we can see where the nvfail option is set on a volume?

2.

How would we clear an nvfail state if we saw it?

3.

How would we show virtual disk object information for a lun?

4.

How do you manually dump a rastrace?

MODULE 7: SNAPMIRROR
Exercise 1: Setting up Intercluster SnapMirror
Time Estimate: 20 minutes

Step
1.

Action
From clustershell of cluster1 run:
cluster1::> snapmirror create -source-path
cluster1://student1/student1_snapmirror -destination-path
cluster2://student3/student3_dest -type DP -tries 8 -throttle unlimited

Error: command failed: Volume "cluster2://student3/student3_dest" not found.
       (Failed to contact peer cluster with address 192.168.81.193. No
       intercluster LIFs are configured on this node.)

2.

From the clustershell of cluster1 run:

::> set diag
cluster1::*> cluster peer address stable show
What do you see?
cluster1::*> net int show -role intercluster
What do you see?
cluster1::*> cluster peer show -instance
What do you see?
cluster1::*> cluster peer health show -instance
What do you see?

3.

Run the following command:

::*> cluster peer ping -type data


What do you see?

4.

Run the following command:


::*> cluster peer ping -type icmp
What do you see now? What addresses, if any, seem to be having issues?

5.

Run the following command:


::> job history show -event-type failed
What jobs are failing?
To examine why they are failing:
cluster1::*> event log show -node cluster1-01 -messagename
cpeer*
Why are the jobs failing?

6.

Try to modify the cluster peer. What happens?


cluster1::*> cluster peer modify -cluster cluster2 -peer-addrs 192.168.81.193,192.168.81.194 -timeout 60

7.

How did you resolve the issue?
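For reference, a sketch of the kind of fix that matches the error in step 1 (missing intercluster LIFs); the port and address are placeholders, and the exact parameter set can vary by release:

cluster1::> net int create -vserver cluster1-01 -lif ic1 -role intercluster -home-node cluster1-01 -home-port e0d -address <free lab address> -netmask 255.255.255.0

Repeat for the other intercluster-facing nodes as needed, then re-check with cluster peer show -instance and cluster peer ping.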

Exercise 2: Intercluster DP mirrors


Time Estimate: 30 minutes

Step
1.

Action
From clustershell of cluster1 run:
cluster1::*> snapmirror create -source-path
cluster1://student1/student1_snapmirror -destination-path
cluster2://student3/student3_dest -type DP -tries 8 throttle unlimited
What error do you see? What might you be doing wrong?

2.

From the clustershell of cluster2 run:

cluster2::> snapmirror create -source-path
cluster1://student1/student1_snapmirror -destination-path
cluster2://student3/student3_dest -type DP -tries 8 -throttle
unlimited

What do you see? Why?

3.

After correcting the issue, run the following command in clustershell of cluster2:
cluster2::> snapmirror create -source-path
cluster1://student1/student1_snapmirror -destination-path
cluster2://student3/student3_dest -type DP -tries 8 -throttle unlimited

Does the command complete?

How do you verify that the snapmirror exists?

::> snapmirror show
What do you see? Is the snapmirror functioning?
How do you get the mirror working if it's not?

4.

After the snapmirror is confirmed as functional, check to see how long it has been
since the last update (snapmirror lag).
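One way to check (lag-time is a standard field of snapmirror show):

cluster2::> snapmirror show -destination-path cluster2://student3/student3_dest -fields state,lag-time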

Exercise 3: LS Mirrors
Time Estimate: 20 minutes

Step
1.

Action
Create two LS mirrors that point to your studentX_snapmirror volume.
clusterY::*> volume create -vserver studentX -volume
studentX_LS_snapmirror -aggregate studentX -size 100MB -state online -type DP
[Job 265] Job succeeded: Successful

clusterY::*> volume create -vserver studentX -volume
studentX_LS_snapmirror2 -aggregate studentX -size 100MB -state online -type DP
[Job 266] Job succeeded: Successful

clusterY::*> snapmirror create -source-path
clusterY://studentX/studentX_snapmirror -destination-path
clusterY://studentX/studentX_LS_snapmirror2 -type LS
[Job 273] Job is queued: snapmirror create the relationship with destination clu
[Job 273] Job succeeded: SnapMirror: done

clusterY::*> snapmirror create -source-path
clusterY://studentX/studentX_snapmirror -destination-path
clusterY://studentX/studentX_LS_snapmirror -type LS
[Job 275] Job is queued: snapmirror create the relationship with destination clu
[Job 275] Job succeeded: SnapMirror: done

What steps did you have to consider? Check the MSIDs and DSIDs for the source
and destination volumes. What do you notice?
clusterY::*> volume show -vserver studentX -fields msid,dsid

2.

Attempt to initialize one of the mirrors using the snapmirror initialize command.
cluster1::*> snapmirror initialize -destination-path
cluster1://student1/student1_LS_snapmirror
[Job 276] Job is queued: snapmirror initialize of destination
cluster1://student1/student1_LS_snapmirror.

cluster1::*> snapmirror initialize -destination-path


cluster1://student1/student1_LS_snapmirror2
[Job 277] Job is queued: snapmirror initialize of destination
cluster1://student1/student1_LS_snapmirror2.

cluster1::*> job show


What happens? How would you view the status of the job? If it didn't work, how
would you fix it? Why didn't it work?

cluster1::*> job history show -id 276


What do you see?
How do you fix it?
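If the per-destination initialize jobs fail, a hedged hint: LS mirrors are normally initialized as a set from the source volume rather than one destination at a time, for example:

cluster1::*> snapmirror initialize-ls-set -source-path cluster1://student1/student1_snapmirror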

3.

After initializing the LS mirrors, try to update the mirrors using snapmirror update.
clusterY::*> snapmirror update -destination-path
clusterY://studentX/studentX_LS_snapmirror
[Job 279] Job is queued: snapmirror update of destination
clusterY://studentX/studentX_LS_snapmirror.
clusterY::*> job show

What happens? How do you view the status of the job?


What is the issue?

4.

Run the following command:


::> vol show -vserver studentX -fields junction-path

What do you see?

Unmount the volume from the cluster shell:
::> vol unmount -vserver studentX -volume studentX_snapmirror

Run the following:
::> vol show -vserver studentX -fields junction-path
What do you see now?

Then remount the volume to a new junction path /studentX_snapmirror:
::> vol mount -vserver studentX -volume studentX_snapmirror -junction-path /studentX_snapmirror
Now what do you see?

5.

clusterY::*> snapmirror update-ls-set -source-path
clusterY://studentX/studentX_snapmirror
clusterY::*> snapmirror update-ls-set -source-path
clusterY://studentX/studentX_root

clusterY::*> volume modify -vserver studentX -volume
studentX_snapmirror -unix-permissions 000

clusterY::*> volume show -vserver studentX -fields unix-permissions


What do you see?

Mount the volume from your Linux host using -o nfsvers=3:


[root@nfshost DATAPROTECTION]# mount -o nfsvers=3
student1:/student1_snapmirror /cmode
[root@nfshost DATAPROTECTION]# cd /cmode
[root@nfshost cmode]# ls
[root@nfshost cmode]# cd
[root@nfshost ~]# ls -latr /cmode
Now execute:
[root@nfshost ~]# umount /cmode
From clustershell run:
clusterY::*> snapmirror update-ls-set -source-path
clusterY://studentX/studentX_snapmirror
From Linux Host run:
[root@nfshost ~]# mount -o nfsvers=3
student1:/student1_snapmirror /cmode
[root@nfshost ~]# ls -latd /cmode
What do you see?
Modify the volume back to 777 on the cluster (using vol modify)
clusterY::*> volume modify -vserver studentX -volume
studentX_snapmirror -unix-permissions 777
Queued private job: 162
Check permissions on the unix host again.
[root@nfshost ~]# ls -latd /cmode
ls: /cmode: Permission denied
[root@nfshost ~]# cd /cmode
What do you see?
Are you able to cd into the mount now?

Update the LS mirror set.


clusterY::*> snapmirror update-ls-set -source-path

clusterY://studentX/studentX_snapmirror
What do you see in ls on the host? Why?
Modify the source volume to 000
clusterY::*> volume modify -vserver studentX -volume
studentX_snapmirror -unix-permissions 000
Queued private job: 163

What do you see in ls on the host? Why?
