Written by Geoff Wild
Thursday, 07 June 2007 22:57
NOTE: The following article was written by: James Tonguet
[swapinfo output omitted: the extracted table (Mb AVAIL/USED/FREE, PCT USED, START/LIMIT, Mb RESERVE, PRI, NAME for /dev/vg00/lvol1) could not be reconstructed]
Some overhead exists in managing the dynamic buffer cache, such as the dynamic allocation of the buffers and managing the buffer cache address map or buffer cache virtual bitmap. Also, a dynamic buffer cache expands very rapidly, but contracts very slowly and only when memory pressure exists.
It is possible to bypass either static or dynamic buffer caches; in some instances this allows for faster disk I/O. This can be accomplished with the Online JFS mount options mincache=direct and convosync=direct. Other options would be raw I/O, asynchronous writes to raw logical volumes, discovered_direct_io, and ioctl. These topics are covered later in the text.
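As a minimal sketch (the volume and mount point names are hypothetical), a file system could be mounted with these Online JFS options as follows:
example:
mount -F vxfs -o mincache=direct,convosync=direct /dev/vg01/lvol4 /data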
Tuning recommendations:
For databases, favor the global area (SGA) over the buffer cache.
For most systems, a buffer cache of 200-400 MB is appropriate.
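The dynamic buffer cache is bounded by the dbc_min_pct and dbc_max_pct kernel parameters, expressed as a percentage of physical memory. As a sketch (assuming the default dynamic cache is in use), a 4 Gb system could be capped near 400 MB as follows; on 11.x the change takes effect after a kernel build and reboot:
example:
kmtune -s dbc_max_pct=10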
Current patches relating to the buffer cache:
10.20
PHKL_28866 (Critical, Reboot) s800 10.20 VM read-ahead panic, buffer cache, paging
PHKL_26767 (Critical, Reboot) s800 10.20 Buffer cache deadlock; write gets VX_ERETRY
11.0
PHKL_18543 (Critical, Reboot) s700_800 11.00 PM/VM/UFS/async/scsi/io/DMAPI/JFS/perf patch
11.11
PHKL_27808 s700_800 11.11 Filesystem buffer cache performance fix
If there is 4 Gb of total memory, the cumulative size of data, stack and text is 1984 Mb. This represents quadrants 1 & 2 minus the Uarea in quadrant 2.
If there is less than 4 Gb of total memory, the quadrant size is 1/4 of total memory. For 64 bit systems, while the address space in each quadrant is 4 Tb, the size of the memory map is equal to the total memory of the system, and a quadrant is 1/4 of this value. When sizing memory parameters for 64 bit it is important to keep this in mind. The quadrant boundary rules still apply.
It is important to remember that the Uarea receives its memory allocation in quadrant 2 first, then the stack; the remainder of available space is available for data. For HP-UX 11.X, data can also occupy the free space in quadrant 1 that is not used by text. A single process cannot cross a quadrant boundary.
The last configurable area of memory to check is shared memory.
Any application running within the 32 bit architecture will have a limit of 1.75 Gb total shared memory for EXEC_MAGIC and 2.75 Gb using SHMEM_MAGIC.
Note: This is only true when the total memory on the system equals at least 4 Gb.
Individual processes cannot cross quadrant boundaries, so the logical 32 bit limit for maxtsiz, maxdsiz and shmmax is 1 Gb. For 64 bit, the quadrant size determines the logical limit.
Note: If a system is utilizing SHMEM_MAGIC, the additional 1 Gb of shared object space comes from quadrant 2; this means that the text, data, stack and Uarea must all come from quadrant 1.
This means maxtsiz, maxdsiz, maxssiz + Uarea can total no more than 1 Gb.
If these parameters are undersized, the system will error:
maxdsiz will return "out of memory"
maxssiz will return "stack growth failure"
maxtsiz will return "/usr/lib/dld.sl: Call to mmap() failed - TEXT"
As of HP-UX 11, the kernel stack (maxssiz) will receive its memory allocation before data (maxdsiz) or text (maxtsiz).
For 64 bit systems, the quadrant size is determined by dividing the total memory by 4.
It is important to determine whether the application is running 32 bit or 64 bit when troubleshooting 64 bit systems.
This can be done with the file command:
example:
file /stand/vmunix
/stand/vmunix: ELF-64 executable object file - PA-RISC 2.0 (LP64)
PA-RISC versions under 2.0 are 32 bit.
For an overview of shared memory on 32 bit systems, refer to the Application Note RCMEMKBAN00000027, Understanding Shared Memory on PA RISC Systems.
The kernel parameter shmmax determines the maximum size of a shared memory segment. Unless patched, SAM will not allow this to be configured greater than 1 quadrant (1 Gb), even on 64 bit systems. If a larger shmmax value is needed for 64 bit systems, it has to be done using a manual kernel build.
The current patches to address this problem are:
11.00: PHKL_24487
11.11: PHKL_24032
Please refer to the patch database found at http://itrc.hp.com for the latest revisions of these.
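A minimal sketch of that manual build on 11.x, using the standard kmtune, mk_kernel and kmupdate tools (the 8 Gb value shown is only an example):
example:
kmtune -s shmmax=0x200000000
mk_kernel -o /stand/build/vmunix_test
kmupdate /stand/build/vmunix_test
shutdown -ry 0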
In a 64 bit system, 32 bit applications will only address the 32 bit shared memory region; 64 bit applications will only address the 64 bit regions.
To determine shared memory allocation, use ipcs; this utility reports the status of interprocess communication facilities. Run the following command:
ipcs -mob
You will see output similar to this:
IPC status from /dev/kmem as of Tue Apr 17 09:29:33 2001
T      ID     KEY          MODE         OWNER    GROUP   NATTCH    SEGSZ
Shared Memory:
m       0     0x411c0359   --rw-rw-rw-  root     root         0      348
m       1     0x4e0c0002   --rw-rw-rw-  root     root         1    61760
m       2     0x412006c9   --rw-rw-rw-  root     root         1     8192
m       3     0x301c3445   --rw-rw-rw-  root     root         3  1048576
m    4004     0x0c6629c9   --rw-r-----  root     root         2  7235252
m       5     0x06347849   --rw-rw-rw-  root     root         1    77384
m     206     0x4918190d   --rw-r--rw-  root     root         0    22908
m    6607     0x431c52bc   --rw-rw-rw-  daemon   daemon       1  5767168
The two fields of most interest are NATTCH and SEGSZ.
NATTCH - The number of processes attached to the associated shared memory segment. Look for those that are 0; they indicate processes that have not released their shared memory segment.
If there are multiple segments showing an NATTCH of zero, especially if they are owned by a database, this can be an indication that the segments are not being efficiently released. This is due to the program not calling detachreg. These segments can be removed using ipcrm -m shmid.
Note: Even though there is no process attached to the segment, the data structure is still intact. The shared memory segment and the data structure associated with it are destroyed by executing this command.
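As a sketch, assuming the column layout shown above (NATTCH is the seventh field of ipcs -mob output), candidate segments can be listed and then removed individually; verify each ID before removing it:
example:
ipcs -mob | awk '$1 == "m" && $7 == 0 {print $2}'
ipcrm -m <shmid>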
SEGSZ - The size of the associated shared memory segment in bytes. The total of SEGSZ for a 32 bit system using EXEC_MAGIC cannot exceed 1879048192 bytes (1.75 Gb), or 2952790016 bytes (2.75 Gb) for SHMEM_MAGIC.
If more than 1.75 Gb of total shared object space (shared memory) is required for 32 bit environments, memory windows can be implemented. This configuration will allow discrete 1 Gb windows to be opened, up to a limit of the total amount of memory.
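A minimal sketch of a memory windows setup (the window name and id are hypothetical; check the exact flags against setmemwindow(1M)):
example:
# /etc/services.window entry: <name> <window id>
database1 20
setmemwindow -i 20 /opt/app/bin/start_db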
CPU load
Once we have determined that the memory resources are adequate, we need to address the processors. We need to determine how many processors there are, what speed they run at, and what load they are under during a variety of system loads.
To find out processor speed, run:
example:
echo itick_per_usec/D | adb -k /stand/vmunix /dev/mem
itick_per_usec:
itick_per_usec: 360
This will be the speed in MHz.
To find out how many processors are in use, run:
example:
echo runningprocs/D | adb -k /stand/vmunix /dev/mem
runningprocs:
runningprocs: 2
This can also be done by using sar -Mu
To find out CPU load on a multi-processor system, run:
example:
sar -Mu 5 100 (this will produce 100 data points 5 seconds apart)
The output will look similar to:
11:20:05      cpu    %usr    %sys    %wio   %idle
11:20:10        0       0       1       0      99
                1      83      17       0       0
           system      42       9       0      49
Typically the %usr value will be higher than %sys. If the system is making many read/write transactions this may not be true, as these are system calls.
Out of memory errors can occur when excessive CPU time is given to system versus user processes. These can also be caused when maxdsiz is undersized. As a rule, we should expect to see %usr at 80% or less, and %sys at 50% or less.
Values higher than these can indicate a CPU bottleneck.
The %wio should ideally be 0%; values less than 15% are acceptable. The %idle being low over short periods of time is not a major concern. This is the percentage of time that the CPU is not running processes. However, low %idle over a sustained period could be an indication of a CPU bottleneck.
If the %wio is greater than 15% and %idle is low, consider the size of the run queue (runq-sz). Ideally we would like to see values less than 4. If the runq-sz is high and the %wio is 0, then there is no bottleneck. This is usually a case of many small processes running that do not overload the processors.
If the system is a single processor system under heavy load, the CPU bottleneck may be unavoidable.
If the CPU load appears high but the system is not heavily loaded, check the value of the kernel parameter timeslice. By default it is 10; if a Tuned Parameter Set was applied to the kernel, it will change timeslice to 1. This will cause the CPU to context switch every 10 ms instead of every 100 ms. In most instances this will have a negative effect on CPU efficiency.
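The current value can be verified with kmtune, or with the same adb technique used for itick_per_usec above:
example:
kmtune -q timeslice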
To find out what the run queue load is, run:
sar -q 5 100
example:
           runq-sz %runocc swpq-sz %swpocc
10:06:36       0.0       0     0.0       0
10:06:41       1.5      40     0.0       0
10:06:46       3.0      20     0.0       0
10:06:51       1.0      20     0.0       0
Average        1.8      16     0.0       0
The output of sar -v will show the usage/kernel value for each area.
example:
08:05:08  text-sz  ov   proc-sz   ov   inod-sz    ov   file-sz      ov
08:05:10      N/A   0  272/6420    0  3427/7668    0  5458/12139     0
11.0 32 bit   444
11.0 64 bit   680
11i 32 bit    475
11i 64 bit    688
(values as extracted; the table's caption and units did not survive)
On 10.20 the inode table and dnlc (directory name lookup cache) are combined. The tunable parameter for the dnlc, ncsize, was introduced in patch PHKL_18335.
On 11.00 the dnlc is configurable using the ncsize and vx_ncsize kernel parameters.
By default, ncsize = (ninode + vx_ncsize) + (8 * dnlc_hash_locks). The parameter vx_ncsize defines the memory space reserved for the VxFS directory path-name cache (in bytes). The default value for vx_ncsize is 1024; dnlc_hash_locks defaults to 512.
As of JFS 3.5, vx_ncsize became obsolete.
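For example, with ninode set to 8192 (an assumed value), the default works out to ncsize = (8192 + 1024) + (8 * 512) = 13312.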
The JFS Inode Cache
A VxFS file system obtains the value of vx_ninode from the system configuration file used for making the kernel (/stand/system, for example). This value is used to determine the number of entries in the VxFS inode table. By default, vx_ninode initializes at zero; the file system then computes a value based on the system memory size (see Inode Table Size).
To change the computed value of vx_ninode, you can hard code the value in SAM.
For example: set vx_ninode=16000.
The number of inodes in the inode table is calculated according to the following table. The first column is the amount of system memory; the remaining columns are the number of inodes for each JFS revision. If the available memory is a value between two entries, the value of vx_ninode is interpolated.
The memory requirements for JFS are dependent on the revision of JFS and system memory.
Maximum VxFS inodes in the cache based on system memory:
System Memory (MB)   JFS 3.1   JFS 3.3-3.5
256                    18666         16000
512                    37333         32000
1024                   74666         64000
2048                  149333        128000
8192                  149333        256000
32768                 149333        512000
131072                149333       1024000
To determine the number of VxFS inodes allocated (these are not reported by sar), run:
example:
echo vxfs_ninode/D | adb -k /stand/vmunix /dev/mem
vxfs_ninode:
vxfs_ninode:    64000
The JFS daemon (vxfsd) scans the free list; if inodes are on the free list for a given length of time, the inode is freed back to the kernel memory allocator. The amount of time this takes, and the amount freed, varies by revision.
[Tables omitted: maximum time in seconds before an inode is freed, and memory cost in bytes per JFS inode (inode/vnode/locks), by revision: JFS 3.1 (11.0 32 bit), JFS 3.3 (11.0 32 bit), JFS 3.3 (11.11 32 bit), JFS 3.5 (11.11); the values did not survive extraction]
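Per-device disk load is reported by sar -d; the invocation below follows the same form as the sar -u and sar -q examples above:
example:
sar -d 5 100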
A sample of the per-device statistics (two devices; the device and time columns did not survive extraction):
%busy   0.80   0.60
avque   0.50   0.50
r+w/s      1      1
The fields reported are:
avque
Average number of requests outstanding for the device
r+w/s
Number of data transfers per second (reads and writes) from and to the device
blks/s
Number of bytes transferred (in 512-byte units) from and to the device
avwait
Average time (in milliseconds) that transfer requests waited idly on queue for the device
avserv
Average time (in milliseconds) to service each transfer request (includes seek, rotational latency, and data transfer times) for the device
When average wait (avwait) is greater than average service time (avserv), it indicates the disk can't keep up with the load during that sample. When the average queue length exceeds the norm of 0.50, it is an indication of jobs stacking up. These conditions are considered to be a bottleneck. It is prudent to keep in mind how long these conditions last. If the queue flushes, or the avwait clears in a reasonable time (i.e. 5 seconds), it is not a cause for concern.
Keep in mind that the more jobs in a queue, the greater the effect on wait on I/O, even if they are small. Large jobs, those greater than 1000 blks/s, will also affect throughput.
Also consider the type of disks being used. Modern disk arrays are capable of handling very large amounts of data in very short processing times, handling loads of 5000 blks/s or greater in under 10 ms. Older standard disks may show far less capability.
The avwait is similar to the %wio returned by sar -u for the CPU.
If a bottleneck is identified, run:
strings /etc/lvmtab
to identify the volume group associated with the disks.
lvdisplay -v /dev/vgXX/lvolX (where XX and X represent the volume group and lvol names)
This will tell you which disks are associated with the logical volume.
bdf
to see if this volume group's file systems are full (> 85%)
cat /etc/fstab
to determine the file system type associated with the lvol/mount point
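A worked pass over these steps might look like this (the volume group, lvol and mount point names are hypothetical):
example:
strings /etc/lvmtab
lvdisplay -v /dev/vg01/lvol3 | more
bdf /data
grep lvol3 /etc/fstab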
How to improve disk I/O?
1. Reduce the volume of data on the disk to less than 90%.
2. Stripe the data across disks to improve I/O speed.
3. If you are using Online JFS, run fsadm -e to defragment the extents.
4. If you are using HFS file systems, implement asynchronous writes by setting the kernel parameter fs_async to 1, or consider converting to VxFS.
5. Reduce the size of the buffer cache (if %wcache is less than 90).
6. Consider changing the VxFS mount options to mincache=direct and nolog; these are available with Online JFS.
7. If you are using raw logical volumes, consider implementing asynchronous I/O.
The difference between async I/O and synchronous I/O is that async does not wait for confirmation of the write before moving on to the next task. This does increase the speed of the disk performance at the expense of robustness.
Synchronous I/O waits for acknowledgement of the write (or fail) before continuing on. The write can have physically taken place or could be in the buffer cache, but in either case acknowledgement has been sent. In the case of async, no waiting.
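Asynchronous I/O to raw logical volumes uses the asyncdsk pseudo-driver through a /dev/async device file. Creating it typically looks like the sketch below (major number 101 is the asyncdsk character driver on 11.x; the 0x0 minor is a placeholder, per the note that follows):
example:
/usr/sbin/mknod /dev/async c 101 0x0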
Note: Contact your database vendor or product vendor to determine the correct minor number for your application.
Change the ownership to the appropriate group and owner:
chown oracle:dba /dev/async
Change the permissions:
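A commonly used mode is 0660, though the correct value depends on your application (the mode here is an assumption, not a requirement from the source):
chmod 660 /dev/async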
The default number of available ports for asynchronous disks is 50; this is tuned with the kernel parameter max_async_ports. If more than 50 disks are being used, this parameter needs to be increased.
PATCHES
There are a number of OS performance issues that are resolved by current patches.