The following tools are the best ones available to figure out what
exactly NFS is doing.
2.1: share and exportfs
share (on Solaris) and exportfs (on SunOS) are good tools to use to
see exactly how a NFS server is exporting its filesystems. Simply log
on to the NFS server and run the command that is appropriate for the
OS.
SunOS:
# exportfs
/usr -root=koeller
/mnt
/tmp
The above shows that /mnt and /tmp are exported normally. Since we see
neither rw nor ro among the options, the default is being used, which
is rw to the world. In addition, /usr gives root permissions to the
machine koeller.
Solaris:
# share
-               /var         rw=engineering              ""
-               /usr/sbin    rw=lab-manta.corp.sun.com   ""
-               /usr/local   rw                          ""
2.2: showmount
The showmount command, run against an NFS server, lists the partitions
the server exports and who can mount them. For example:
# showmount -e crimson
export list for crimson:
/tmp (everyone)
Note that showmount only displays the partition and who can mount it;
no other options are shown. In the example above, there are no
restrictions on who can mount crimson's partitions, so showmount
lists (everyone).
2.3: nfsstat
The nfsstat command displays statistics on the RPC and NFS calls
being sent via NFS. It can show the stats of an NFS client or of an
NFS server. When we run 'nfsstat -c' on a client, we see the
client-side statistics:
# nfsstat -c

Client rpc:
calls      badcalls   retrans    badxids    timeouts   waits      newcreds
45176      1          45         3          45         0          0
badverfs   timers     toobig     nomem      cantsend   bufulocks
0          80         0          0          0          0

Client nfs:
calls      badcalls   clgets     cltoomany
44866      1          44866      0

root           lookup         readlink       read
0 0%           15225 33%      55 0%          13880 31%
remove         rename         link           symlink
914 2%         6 0%           306 0%         0 0%
statfs
68 0%
The rpc stats at the top are probably the most useful. High 'retrans'
and 'timeout' values can indicate performance or network issues. The
client nfs section can show you what types of NFS calls are taking up
the most time. This can be useful if you're trying to figure out what
is hogging your NFS. For the most part, the nfsstat command is most
useful when you are doing network and performance tuning. Sections 7.4
and 7.5 list books that are helpful in making more sense of the
nfsstat statistics.
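As a quick sanity check, a retransmission rate can be computed from the
'calls' and 'retrans' counters above. This sketch uses the sample
numbers from the output; the 5% threshold is a common rule of thumb,
not a Sun-documented value:

```shell
# Sample counters from the nfsstat -c output above.
calls=45176
retrans=45

# sh only does integer math, so let awk compute the percentage.
rate=`awk "BEGIN { printf \"%.2f\", ($retrans / $calls) * 100 }"`
echo "retrans rate: ${rate}%"

# Flag anything over 5% of calls as worth investigating.
if awk "BEGIN { exit !($retrans / $calls > 0.05) }"; then
    echo "high retrans rate - check the network and the server"
else
    echo "retrans rate looks healthy"
fi
```

Here 45 retransmissions out of 45176 calls is roughly 0.1%, well
within the healthy range.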
2.4: rpcinfo
You can test that you have a good, solid NFS connection to your NFS
server via the rpcinfo command. As explained in the man page, this
program provides information on various rpc daemons, such as nfsd,
rpc.mountd, rpc.statd and rpc.lockd. Its biggest use is to determine
that one of these daemons is responding on the NFS server. The
following examples all show the indicated daemons correctly
responding. If instead you get complaints about a service 'not
responding,' there might be a problem.
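As a sketch (the hostname 'servername' is a placeholder), typical
checks of each daemon look like the following; the output lines follow
the same pattern shown in section 4.3 below:

```
# rpcinfo -u servername nfs
program 100003 version 2 ready and waiting
# rpcinfo -u servername mountd
program 100005 version 1 ready and waiting
program 100005 version 2 ready and waiting
# rpcinfo -u servername status
program 100024 version 1 ready and waiting
# rpcinfo -u servername nlockmgr
program 100021 version 1 ready and waiting
```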
2.5: snoop
The snoop command (on Solaris) captures and decodes network traffic,
including NFS requests and replies. For example:
NFS R GETATTR OK
NFS C READDIR FH=4141 Cookie=2600
NFS R READDIR OK 1 entries (No more)
These were the results when an 'ls' was run on 'psi' in a directory
that was mounted from 'rainbow-16'. The lines labelled 'C' are NFS
requests, while the lines labelled 'R' are NFS replies. Through snoop
you can easily see NFS requests not being responded to (you would get
lots of 'C' lines without matching 'R' replies) and also certain
errors (timeouts and retransmits in particular). The man page for
snoop gives some indication of how to make more in-depth use of the
tool. In general, it should only be used for very complex issues,
where NFS is behaving very oddly, and even then you must know NFS well
to recognize unexpected behavior. See the next section for
more tips on snoop.
2.6
Please see section 4.9: Common rpc.lockd & rpc.statd Error Messages
for information regarding specific lockd and statd problems.
Generally you can pick out problem clients by snooping and/or
putting lockd into debug mode. Sections 2.5 and 2.6 cover snoop.
How to put the Solaris 2.3 and 2.4 lockd into debug mode:
Edit the line in the /etc/init.d/nfs.client script that starts
up lockd so that it starts with -d3, and redirect its stdout to a
log file.
The format of a NFS entry in /etc/fstab is:

server:/partition    /localpart    nfs    [options]    0 0

The options field is optional and can be left out if none are needed.
To make /usr/local mount automatically, you would add the following to
your /etc/fstab:

bigserver:/usr/local    /usr/local    nfs    rw    0 0

To mount it read-only instead, use ro in the options field:

bigserver:/usr/local    /usr/local    nfs    ro    0 0

To mount a partition with secure NFS, add the secure option:

server:/secret/top    /secret/top    nfs    rw,secure    0 0
Root can add credentials for hosts with the following command:
# newkey -h machinename
The passwd supplied to newkey in this case should be the same as the
machine's root passwd.
It is important to note that rpc.yppasswd must be running on your NIS
server for these commands to work. In addition, push out the
publickey maps afterwards to make sure that the most up-to-date
credential information is available.
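On the NIS master, pushing the maps is typically done with the NIS
Makefile (this assumes the standard /var/yp location):

```
# cd /var/yp
# make publickey
```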
Once this is all done, secure NFS should work on your NIS network,
with two caveats: First, keyserv must be running on your client
machines. If this is not the case, adjust your rc files, so that it
automatically starts up. Second, if a user does not supply a passwd
when logging in (due to a .rhosts or /etc/hosts.equiv entry, for
example), or if his secure key is different from his passwd, then he
will need to execute the command 'keylogin' before he can access the
secure NFS partition.
clienta is at 192.1.1.1 and the Server uses DNS for hostname lookups.
The NFS request to do the mount arrives from IP address 192.1.1.1
The NFS server looks up the IP address of 192.1.1.1 to get the hostname
associated with that IP address.
The gethostbyaddr MUST return "clienta". If it does not, the NFS
request will fail with "access denied". To check, telnet from the NFS
client to the NFS server and run "who am i". The hostname in
parentheses is the name that should be in the netgroup:
hackley    pts/13    Jan 24 09:21    (mercedes)
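To compare that name against the netgroup entry, the parenthesized
hostname can be pulled out of the 'who am i' line. A small sketch
using the sample line above:

```shell
# Sample 'who am i' output line from above.
line='hackley    pts/13    Jan 24 09:21    (mercedes)'

# The client hostname is the field in parentheses.
client=`echo "$line" | sed 's/.*(\(.*\))/\1/'`
echo "NFS server sees this client as: $client"
```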
Q: How can I find the file on the server that corresponds to the
filehandle printed in an NFS error message?
A:
#!/bin/sh
#
# fhfind: takes the expanded filehandle string from an
# NFS write error or stale filehandle message and maps
# it to a pathname on the server.
#
# The device id in the filehandle is used to locate the
# filesystem mountpoint. This is then used as the starting
# point for a find for the file with the inode number
# extracted from the filehandle.
#
# If the filesystem is big - the find can take a long time.
# Since there's no way to terminate the find upon finding
# the file, you might need to kill fhfind after it prints
# the path.
#
if [ $# -ne 8 ]; then
        echo
        echo "Usage: fhfind <filehandle> e.g."
        echo
        echo " fhfind 1540002 2 a0000 4df07 48df4455 a0000 2 25d1121d"
        exit 1
fi

# Filesystem ID
FSID1=$1
FSID2=$2

# FID for the file
FFID1=$3
FFID2=`echo $4 | tr 'a-f' 'A-F'`        # uppercase hex for bc
FFID3=$5

# Use the device id to find the filesystem's mountpoint in /etc/mnttab.
MNTPNT=`grep "dev=${FSID1}" /etc/mnttab | awk '{ print $2 }'`
if [ -z "$MNTPNT" ]; then
        echo "Cannot find a mounted filesystem with device id ${FSID1}"
        exit 1
fi

# The inode number is the second FID field, in hex.
INUM=`echo "ibase=16; ${FFID2}" | bc`

echo
echo "Now searching $MNTPNT for inode number $INUM"
echo

find $MNTPNT -mount -inum $INUM -print 2>/dev/null
4.2: Problems Mounting Filesystems on a Client
Q: Why do I get "permission denied" or "access denied" when I try to
mount a remote filesystem?
A1: Your remote NFS server is not exporting or sharing its file systems.
You can verify this by running the showmount command as follows:
# showmount -e servername
That will provide you with a list of all the file systems that are
being exported. If a file system is not being exported, you should
consult section 3.1 or 3.2, as applicable.
A2: Your remote NFS server is exporting file systems, but only to a
limited number of client machines, which does not include you. To
verify this, again use the command showmount:
# showmount -e psi
export list for psi:
/var                engineering
/usr/sbin           lab-manta.corp.sun.com
/usr/local          (everyone)
In this example, /usr/local is being exported to everyone, /var is
being exported to the engineering group, and /usr/sbin is only being
exported to the machine lab-manta.corp.sun.com. So, I might get the
denial message if I tried to mount /var from a machine not in the
engineering netgroup or if I tried to mount /usr/sbin from anything
but lab-manta.corp.sun.com.
A3: Your machine is given explicit permission to mount the partition,
but the server does not list your correct machine name. In the example
above, psi is exporting to "lab-manta.corp.sun.com", but the machine
might actually identify itself as "lab-manta" without the suffix. Or,
alternatively, a machine might be exporting to "machine-le0" while the
mount request actually comes from "machine-le1". You can test this by
first running "showmount -e" and then physically logging in to the
server, from the client that cannot mount, and then typing "who". This
will show you if the two names do not match. For example, I am on
lab-manta, trying to mount /usr/sbin from psi:
lab-manta# mount psi:/usr/sbin /test
mount: access denied for psi:/usr/sbin
I use showmount -e to verify that I am being exported to:
lab-manta# showmount -e psi
export list for psi:
/usr/sbin lab-manta.corp.sun.com
I then login to psi, from lab-manta, and execute who:
lab-manta% rsh psi
...
psi# who
root       pts/6        Sep  8 14:02    (lab-manta)
Q: Why do I get "Device busy" when I try to mount or unmount a
filesystem?
A: You get this message because some process is using the underlying
mount point. For example, if you had a shell whose pwd was /mnt and
you tried to mount something onto /mnt, e.g. mount
server:/export/test /mnt, you would see this error.
To work around this, find the process using the directory and either
kill it or move its pwd someplace else. The "fuser" command is
extremely handy for this:
mercedes[hackley]: cd /mnt
mercedes[hackley]: fuser -u /mnt
/mnt:     4368c(hackley)   368c(hackley)
In this case you see process # 368 and 4368 are using the /mnt mount point.
PID 368 is the shell and PID 4368 was the fuser command.
As root, you can forcibly kill every process using a mount point
with fuser -k /mnt.
Please note that fuser is not infallible and cannot identify kernel
threads using a mount point (as sometimes happens with the automounter).
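When scripting around fuser, the bare pids can be stripped out of its
report before passing them to kill. A sketch against the sample /mnt
report above:

```shell
# Sample fuser -u report from above.
out='/mnt:     4368c(hackley)   368c(hackley)'

# Keep only the numeric pids ('c' marks a process whose cwd is on the mount).
pids=`echo "$out" | tr ' ' '\n' | sed -n 's/^\([0-9][0-9]*\)c(.*)$/\1/p' | sort -n`
pids=`echo $pids`    # collapse newlines to a single line
echo "pids using /mnt: $pids"

# They could then be killed with:  kill $pids   (or simply fuser -k /mnt)
```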
4.3: Common NFS Client Errors Including NFS Server Not Responding
If a file system has been successfully mounted, you can encounter the
following errors when accessing it.
Q: Why do I get the following error message:
Stale NFS file handle
A1: This means that a file or directory that your client has open has
been removed or replaced on the server. It happens most often when a
dramatic change is made to the file system on the server, for example
if it was moved to a new disk or totally erased and restored. The
client should be rebooted to clear Stale NFS file handles.
A2: If you prefer not to reboot the machine, you can instead create a
new mount point on the client and remount the filesystem there,
leaving the mount with the Stale NFS file handle behind.
Q: Why do I get the following error message:
NFS Server <server> not responding
NFS Server ok
Note that this error only occurs when using HARD mounts; the
troubleshooting below applies to both HARD and SOFT mounts.
A1: If this problem is happening intermittently, while some NFS
traffic is occurring, though slowly, you have run into the performance
limitations of either your current network setup or your current NFS
server. This issue is beyond the scope of what SunService can support.
Consult sections 7.4 & 7.5 for some excellent references that can help you
tune NFS performance. Section 9.0 can point you to where you can get
additional support on this issue from Sun.
A2: If the problem lasts for an extended period of time, during which
no NFS traffic at all is going through, it is possible that your NFS
server is no longer available.
You can verify that the server is still responding by running the commands:
# ping server
and
# ping -s server 8000 10
(this will send 10 8k ICMP Echo request packets to the server)
If your machine is not available by ping, you will want to check the
server machine's health, your network connections and your routing.
If the ping works, check to see that the NFS server's nfsd and
mountd are responding with the "rpcinfo" command:
# rpcinfo -u server nfs
program 100003 version 2 ready and waiting
# rpcinfo -u server mountd
program 100005 version 1 ready and waiting
program 100005 version 2 ready and waiting
If there is no response, go to the NFS server and find out why
nfsd and/or mountd are not responding over the network. From
the server, run the same commands. If they work OK from the
server, the network is the culprit. If they do NOT work,
check to see if the daemons are running. If not, restart them and
repeat this process. If either nfsd or mountd IS running but
does not respond, kill it, restart it, and retest.
A3: Some older bugs might have caused this symptom. Make sure that you
have the most up-to-date Core NFS patches on the NFS server.
These are listed in Section 5.0 below. In addition, if you are running
quad ethernet cards on Solaris, install the special quad
ethernet patches listed in Section 5.4.
A4: Try cutting down the NFS read and write size with the NFS mount
options: rsize=1024,wsize=1024. This will eliminate problems with
packet fragmentation across WANs, routers, hubs, and switches in a
multivendor environment, until the root cause can be pinpointed.
THIS IS THE MOST COMMON RESOLUTION TO THIS PROBLEM.
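As a sketch (the server name and paths are placeholders), the
workaround mount would look like:

```
# mount -o rsize=1024,wsize=1024 server:/export/home /home
```

The same rsize=1024,wsize=1024 options can also be placed in the
options field of the corresponding /etc/fstab (or Solaris /etc/vfstab)
entry to make the change permanent.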
A5: If the NFS server is Solaris 2.3 or 2.4, 'nfsreadmap' occasionally
caused the "NFS server not responding" message on Sun and non-Sun
NFS clients. You can resolve this by adding the following entry to
your /etc/system file on the NFS server:
set nfs:nfsreadmap=0
and then rebooting the machine. The nfsreadmap function was removed
in 2.5 because it really didn't work.
A6: If you are using FDDI on Solaris, you must enable fragmentation
with the command:
ndd -set /dev/ip ip_path_mtu_discovery 0
Add this to /etc/init.d/inetinit, after the other ndd command on line 18.
A7: Another possible cause is if the NFS SERVER is Ultrix, old AIX,
Stratus, or an older SGI, and you ONLY get this error on Solaris 2.4
and 2.5 clients, while the 2.3 and 4.X clients are OK.
The NFS Version 2 and 3 protocols allow the NFS READDIR request to be
1048 bytes in length. Some older implementations incorrectly assumed
the request had a max length of 1024. To work around this, either mount
those problem servers with rsize=1024,wsize=1024 or add the following
to the NFS client's /etc/system file and reboot:
set nfs:nfs_shrinkreaddir=1
A8: Oftentimes NFS SERVER NOT RESPONDING is an indication of another problem
on the NFS server, particularly on the disk subsystem. If you have a
SPARCStorage Array, you must verify that you have the most recent
firmware and patches due to the volatility of that product.
Another general method that can be tried is to look at the output
from iostat -xtc 5 and check the svc_t field. If this value goes
over 50.0 (50 msec) for a disk that is being used to serve NFS
requests, you might have found your bottleneck. Consult the references
in Section 7 of this PSD for other possible NFS Server tuning hints.
NOTE: NFS Server performance tuning services are only available
on a Time and Materials basis.
Q: Why can't I write to a NFS mounted file system as root?
A: Due to security concerns, the root user is given "nobody"
permissions when it tries to read from or write to a NFS file system.
This means that root has less access than any other user: it will only
be able to read things with world read permissions, and will only be
able to write to things with world write permissions.
If you would like your machine to have normal root permissions to a
filesystem, the filesystem must be exported with the option
"root=clientmachine".
An alternative is to export the filesystem with the "anon=0" option.
This will allow everyone to mount the partition with full root
permissions.
Sections 3.1 and 3.2 show how to include options when exporting
filesystems.
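As a sketch (the filesystem and client names are placeholders), the
export would look like this on each OS:

```
Solaris:   share -F nfs -o rw,root=clientmachine /export/home
SunOS:     add to /etc/exports:
           /export/home -root=clientmachine
           then run: exportfs -a
```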
Q1: Why do 'ls'es of NFS mounted directories sometimes get mangled on
my SunOS machine?
Q2: Why do I get errors when looking at a NFS file on my SunOS
machine?
A: By default, SunOS does not have UDP checksums enabled. This can
cause problems if NFS is being done over an extended distance,
especially if it is going across multiple routers. If you are seeing
very strange errors on NFS or are getting corruption of directories
when you view them, try turning UDP checksums on.
You can do so by editing the kernel file /usr/sys/netinet/in_proto.c,
changing the following:
int udp_cksum = 0;      /* ... checksums */
to:
int udp_cksum = 1;      /* ... checksums */
You will then need to rebuild the kernel and reboot for the change
to take effect.
/test:    1985c    1997c
The above example shows that pids 1985 and 1997 are accessing the
/test partition. Either kill the processes or run fuser -k /test to
have fuser do this for you.
NOTES: This functionality is not available under SunOS. It does
not always identify an automounted process on Solaris.
nfs   2049/udp                (the nfs entry in /etc/services)
... dgram ... rpc.mountd      (an rpc.mountd entry in /etc/inetd.conf)
You can resolve this problem by commenting out the mountd line in the
/etc/inetd.conf file and then killing and restarting your inetd.
4.9: Common rpc.lockd & rpc.statd Error Messages
These errors typically appear when a machine that once held NFS locks
is no longer providing NFS services, or has changed its hostname. The
fix is to bring the system down to single user mode and clear out the
status monitor files:
SunOS:
rm /etc/sm/* /etc/sm.bak/*
Solaris:
rm /var/statmon/sm/* /var/statmon/sm.bak/*
Afterwards, execute reboot to bring your machine back up.
Alternatively, if you cannot put the system into single user mode:
- Kill the statd and lockd processes
- Clear out the "sm" and "sm.bak" directories
- Restart statd and lockd, in that order
Q: How can I fix these errors?
We also see
nlm1_call: RPC: Program not registered
create_client: no name for inet address 0x90EE4A14.
A: There are THREE items to check in order.
1.
2.
3.
Patch levels:
Solaris 2.4:
101945-34 or better kernel jumbo patch
101977-04 or better lockd jumbo patch
102216-05 or better klm kernel locking patch (See note below)
Note:
Patch 102216-05 contains a fix for a bug that can cause this error message:
1164679 KLM doesn't initialize rsys & rpid correctly
Solaris 2.3:
101318-75 or better kernel jumbo patch
Q:
It is needed whenever