Establishing a System Resource Usage Baseline Profile

August 10, 2001

Report By:
Larry Higa
Lawrence Higa Consulting, Inc.

Overview
Elements of a System Usage Profile
Primary Elements of System Usage
1. CPU (ResusageSpma Data) - Node Level
2. Disk I/O (ResusageSpma Data) - Node Level
3. Available Free Memory & Paging/Swapping (ResusageSpma Data) - Node Level
4. Number of Concurrent Active Sessions
Secondary Elements of System Usage
1. CPU (ResusageSvpr Data) - Vproc Level
2. Disk I/O (ResusageSvpr Data) - Vproc Level
3. Bynet I/O (ResusageSpma Data) - Node Level
4. Host I/O (ResusageShst Data) - Vproc Level
Sample Charts of Typical and Problem Situations
1. Parallel Efficiency Average / Max Node CPU Chart
2. OS as PCT of CPU vs. AVG CPU Busy Chart (OS % CPU Problem)
3. OS vs. DBS CPU Busy Chart (Different view of OS % CPU Problem)
4. Poor Parallel Efficiency Average / Max Node CPU Chart
5. Avg CPU Busy vs. CPU Idle Waiting for I/O Chart
6. I/O Wait vs. Disk I/Os Chart
7. CPU Idle Waiting for I/O vs. Buddy Backup Bynet I/O Chart
8. Average & Min Free Memory Available vs. Total Page/Swap IOs
9. Concurrent Sessions Chart

Overview
The purpose of establishing a system resource usage profile is to obtain a picture in graphic and numerical
format of the usage of a system to help isolate/identify performance problems that may be due to
application changes, new software releases, hardware upgrades, etc. Having a long-term pattern of usage
also enables one to see trends and helps with capacity planning. The pattern or profile of usage can
be seen as a cycle: daily, weekly, monthly, etc., corresponding to the customer's business or workload
cycle.
From a performance monitoring / debugging perspective, one is looking for changes in the pattern.
Usually, one is looking for a marked increase in a particular resource. Oftentimes, the system may be at
100% CPU capacity and the users' applications are running fine with no complaints. Then something
happens and the users start complaining about response time. The system is at 100% CPU busy, but this is
no different from before. The change could be an increase in the number of concurrent queries in the
system, or it could be an increase in the volume of disk I/O or in Bynet broadcast messages. In some cases, a
longer term of several months may be necessary to see a significant change in the pattern. Once a change
in pattern is correlated with a performance problem or degradation, one can eliminate possible causes of the
problem and narrow the search for the basic causes.
A baseline profile should be established when the system is at a semi-steady state. This means data is
loaded, updated on a regular basis (daily, weekly), and accessed by users or by a production application.
The baseline period could be all hours of a day or just the on-line daytime hours when users are running
their queries. For some users, the critical period may be the nighttime load of the data; for others, it could
be a monthly load and report generation over a short 2-5 day period. There are also different levels of data
summarization for the purpose of establishing a baseline profile. At one level, one can maintain the
detailed logging period data. Other levels could be totals by hour or totals by day.
For some users, a profile can be established for a known set of benchmark queries. With benchmark
queries, the common basis for determining whether the performance of a new software release is acceptable
is whether the response time of the queries is nearly the same or better, or considerably slower. In
situations where the system is running slower, the baseline profile will provide a contrast in the system
resource usage between the different instances of running the benchmark.

Elements of a System Usage Profile


The primary sources of data for a system usage profile are the DBC system tables and views:

DBC.ResusageSpma
DBC.ResusageSvpr
DBC.ResusageSvpr2
DBC.ResusageShst
DBC.Diskspace
DBC.AccessLog
DBC.Ampusage

In general, the bulk of the system usage profile comes from the Resusage tables, in which data is recorded
on a periodic basis, usually every 10 minutes. Established Resusage macros are used to extract the data,
which is then automatically charted with an Excel program. For data from other tables, separate procedures
need to be established in order to collect data periodically.
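As an illustration of the kind of periodic summarization involved, the sketch below assumes the Resusage macro output has been exported to a CSV file; the file name and column names are hypothetical placeholders, and the roll-up to hourly totals mirrors the summarization levels mentioned in the Overview.

import pandas as pd

# Hypothetical extract produced by the Resusage macros: one row per node per
# 10-minute logging period, with columns Period, NodeID, AvgCpuBusy, DiskReads.
detail = pd.read_csv("spma_extract.csv", parse_dates=["Period"])

hourly = (
    detail.set_index("Period")
          .groupby("NodeID")
          .resample("1h")
          .agg({"AvgCpuBusy": "mean",   # utilization averages over the hour
                "DiskReads": "sum"})    # I/O counts total over the hour
          .reset_index()
)
hourly.to_csv("spma_hourly.csv", index=False)   # input for the Excel charting step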

The data in a profile can be grouped as primary and secondary elements. The primary elements are the
more important ones that will generally give a first level indication of a performance problem. For
example, the system may be at 100% CPU utilization (a CPU bottleneck) where normal is 80% or lower, or
the system may be at maximum disk I/O capacity compared to a normal of 50%. Another common situation
involves the number of concurrent queries (tasks) in the system: normal may have been 10 to 15 concurrent
queries, and a performance problem occurs when there are 50 to 60 concurrent queries.
Secondary elements are useful for a more detailed analysis. In such situations, several factors may be
necessary to get a proper interpretation of what is affecting performance.
When establishing a baseline profile, one must first gather and chart the data. Then, one needs to describe
what one sees in the charts. The following sections contain a description of the individual elements that
make up a profile and what one generally can see / interpret from the different elements.

Primary Elements of System Usage


Three primary elements give the best picture of a system baseline profile:

CPU busy
Disk I/O activity
Concurrent active sessions

Data columns to look at and key interpretations are described below.

1. CPU (ResusageSpma Data) - Node Level


UNIX categorizes CPU busy information into 4 categories:
busy executing user code
busy executing operating system code
idle waiting for I/O
idle
The total of CPU busy executing user code and operating system code indicates how much of the CPU
capacity is being used. UNIX detects the CPU busy state and passes this data on to the Teradata
RDBMS where the data is recorded in the Resusage tables. This same data is passed to the UNIX sar
(System Activity Reporter) which records the data in a flat file. While the source of the data is the
same, the numbers can have a slight variation due to different logging time periods.
The important performance information that can be extracted about the system is:
how busy the system is and if there is more available capacity
if there is an imbalance of work across the nodes (skewing)
if there is an application problem causing the system to do inefficient repetitive processing
Average CPU Busy
Represents the average CPU utilization of all CPUs in all nodes. The current norm is 4 CPUs per
node.1 Within a node, the CPU busy number is normalized by dividing the sum of the utilization of all
the CPUs by the number of CPUs per node, so 100% means every CPU in the node is fully busy.
This is the most important column in telling how much of the system is being used (from a CPU point
of view). When this number is 100%, the system is running at maximum capacity.
1 For the 5100, the norm is 8 CPUs per node.

Maximum CPU Busy


Represents the normalized CPU utilization of the busiest node in the system. For a parallel processing
system, the overall relative efficiency of the system is calculated by dividing the average CPU busy by
the maximum CPU busy. When the maximum CPU busy varies greatly from the average CPU busy
for the same log period, there is a skew in the processing of the system. Skewing is a problem
when it persists for multiple logging periods. Usually, this is an application issue rather than a system
software or hardware issue. The imbalance in the workload of the system could be due to a number of
different conditions. One, UNIX applications could be running on a single node. Two, partitioning of
table data based on the primary index could be skewed, with some nodes having significantly more data
than other nodes. Three, processing of SQL join conditions could cause a skewed redistribution of the
data to a single AMP. Data going to a single AMP in turn means running on a single node.
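As a minimal sketch of the parallel efficiency calculation just described, the following assumes a per-period extract with hypothetical column names (Period, AvgCpuBusy, MaxCpuBusy) and flags skew that persists across consecutive logging periods; the 60% cutoff is only an illustrative choice, not a documented threshold.

import pandas as pd

# Hypothetical per-period extract: Period, AvgCpuBusy, MaxCpuBusy (node level).
periods = pd.read_csv("spma_node_summary.csv", parse_dates=["Period"])

# Parallel efficiency: average node CPU busy divided by the busiest node's CPU busy.
periods["ParallelEff"] = 100.0 * periods["AvgCpuBusy"] / periods["MaxCpuBusy"]

# Flag skew that persists for at least two consecutive logging periods.
skewed = periods["ParallelEff"] < 60.0
persistent = skewed & skewed.shift(fill_value=False)
print(periods.loc[persistent, ["Period", "AvgCpuBusy", "MaxCpuBusy", "ParallelEff"]])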

OS as Pct of CPU
Percent of CPU busy time that the system was executing operating system work as opposed to database
work. The formula for calculating this column is CPU time for the operating system divided by the
sum of the CPU time for both the user and the operating system. This column does not represent the
percent of absolute time that the system was spending in the operating system.
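Expressed as code, this is simply the OS share of total busy time. A minimal sketch, using hypothetical per-period CPU-second values:

# OS as Pct of CPU = OS CPU time / (user CPU time + OS CPU time), a share of
# busy time rather than of elapsed time.
def os_pct_of_cpu(user_cpu_seconds: float, os_cpu_seconds: float) -> float:
    busy = user_cpu_seconds + os_cpu_seconds
    if busy == 0.0:
        return 0.0                      # idle logging period: nothing to apportion
    return 100.0 * os_cpu_seconds / busy

# Example: 540 CPU-seconds of DBS work and 60 of OS work in one logging period.
print(os_pct_of_cpu(540.0, 60.0))       # -> 10.0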
A lower value in this column means the CPUs are spending more time executing DBMS code.
Conditions where this column goes below 20% include large product joins, duplicate row checks, collect
statistics, SQL statements doing aggregation on many data columns, and SQL statements with many
numeric expressions or SQL functions such as INDEX, SUBSTR, etc. Oftentimes, when this
number stays below 20% for lengthy periods of time and the maximum CPU busy is around
100%, it is an indication of a duplicate row check problem or a very large product join.
Duplicate row check problems can be resolved by changing the physical data model of the table. The
change could be modifying the primary index, adding a unique secondary index or changing the table
from a SET table to a MULTISET table. (Duplicate row checks can happen inadvertently when the user
neglects to define a primary index; in that case, by default, the first column of the table becomes the
primary index. In the extreme case where the data in this column has only a single distinct value, the
result is on the order of n²/2 duplicate row checks, where n is the number of rows in the table.) For
large product joins, this usually can be corrected by collecting statistics on the join columns of the
tables involved in the joins.
I/O Wait %
This is the most misunderstood column of all the Resusage columns. I/O Wait % is the percent of time
the system is waiting for completion of disk or Bynet I/O, and there are no other tasks available to run
in the system. It does not necessarily mean the system is at I/O throughput capacity.
The more common cause of the I/O Wait is that the database software (user queries) has requested a disk
read and there is a lack of concurrent tasks running in the system. For example, if there is a single job
running in the system, when its task finishes processing a data block and requests another data block,
the system may have to do a physical I/O to get the data block. At this point, the system will initiate the
physical I/O and schedule another task for execution. As long as there is another task to run, the CPU
will not be idle even though there is a pending I/O completion. When there are no other tasks to run,
the system will record that a CPU is idle and waiting for an I/O completion.
Similarly, some disk writes can occur asynchronously, in which case the writing task continues without
waiting for I/O completion. When disk writes are for table data modifications that were not sent to
another node for buddy backup, the task must wait for completion of the physical disk I/O.
I/O Waits can occur when there is a lot of Bynet buddy backup activity. The task in the sending node
must wait for acknowledgement from the receiving node that it received the buddy backup data before
the task in the sending node can continue its processing. Because the receiving node can be busy for a
number of different reasons, there is no Bynet threshold number to indicate that the buddy backup
traffic over the Bynet is the bottleneck. In general, one can get an indication if this is the bottleneck by
charting the I/O Wait column with the buddy backup column (in the ResNet macros).
The most common situation is that there simply are not enough concurrent tasks running in the system.
A true disk I/O bottleneck occurs when the disk subsystem is transferring data at its maximum
throughput capacity. Depending on the I/O hardware and configuration, I/O bottlenecks generally do
not occur until the nodes are doing over 1400 I/Os per second, or transferring at least 70-80 MB/sec.
There are no known cases of a Bynet bottleneck. The speed of the Bynet is significantly faster than the
rate at which the nodes can provide data for transfer over the Bynet.
On TNT systems, I/O Waits are not detected and zero is recorded for this column.
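One way to apply these rules of thumb is to compare the I/O wait percentage against the measured disk throughput for the same logging period. The sketch below uses only the rough guideline figures quoted above (1400 I/Os per second, 70-80 MB/sec); the function name and the 10% significance cutoff are illustrative assumptions.

# Rough classification of an I/O wait, using the guideline figures from the text
# (over 1400 I/Os per second, or roughly 70-80 MB/sec per node, before a true
# disk throughput bottleneck is likely). Thresholds here are heuristics only.
def classify_io_wait(io_wait_pct: float, ios_per_sec: float, mb_per_sec: float) -> str:
    if io_wait_pct < 10.0:
        return "I/O wait is not significant"
    if ios_per_sec > 1400.0 or mb_per_sec > 70.0:
        return "possible disk throughput bottleneck"
    return "likely too few concurrent tasks (or buddy backup waits)"

print(classify_io_wait(io_wait_pct=35.0, ios_per_sec=600.0, mb_per_sec=25.0))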
CPU profile summary. The key interpretations of the CPU profiles are:
CPU is at maximum CPU busy capacity (100%) for most (much/high proportion) of the critical
period (day shift, end of month processing, all the time, etc.).
Another interpretation could be: the system averages 60% (an arbitrary percentage less than 100%)
and is almost never at 100% busy, i.e., there is meaningful CPU capacity available to do more
work.

There is significant node CPU skewing for a number of periods during the critical time periods.
Or, there is occasional skewing, but only for brief periods of time. None are large enough to
cause a significant performance problem.

The OS as percent of CPU is below 20% for over 2 continuous hours. This generally implies an
application problem that should be investigated.
The OS as percent of CPU varies from 10% to 80% without any special pattern. This value is
below 20% only for 10-20 minutes at a time and happens only occasionally during the critical
time periods. This could be due to the applications doing a lot of aggregation, collect statistics, or
numerous arithmetic expressions in the queries.

2. Disk I/O (ResusageSpma Data) - Node Level


Disk I/O is recorded for both number of reads and writes, and for the number of bytes transferred. For
disk reads, both logical and physical reads are recorded in the ResusageSpma table. For disk writes,
only the physical writes are recorded in the ResusageSpma table. In the ResusageSvpr table, logical
and physical writes are recorded. Also recorded in the ResusageSvpr table is a breakdown of the disk
I/O by type of I/O: table data, cylinder index (CI) for table data, spool, spool CI, transient journal (TJ)
and permanent journal (PJ).
Position Reads (Logical and Physical)
Number of disk positioning reads. A position read occurs for the first data block of a cylinder, for
cylinder indexes, and for random data block accesses.
Pre-Reads (Logical and Physical)
Number of disk pre-reads. Pre-reads, also commonly referred to as pre-fetches, occur only for full
table scans. This could be for table data or for spools. When a query does a full table scan, the first
block of the cylinder is accessed by a positioning read and all other accesses to the cylinder are done
with pre-reads. When there are no pre-reads, there are no full table scan queries.

Data Base Reads (Logical and Physical)


Sum of the disk position reads and pre-reads. This column is output as an easier means for seeing the
total number of disk reads.
Disk Read Kbytes (Logical and Physical)
Number of Kbytes read for both the position reads and pre-reads.
Database Writes (Physical only)
Number of database disk writes. Writes occur for table data, spool data, cylinder indices, transient
journal and permanent journal.
Disk Write Kbytes (Physical only)
Number of Kbytes written.
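For charting, it is often easier to turn these logging-period totals into per-second rates and an average transfer size. A minimal sketch, assuming 10-minute (600-second) logging periods; the inputs correspond to the position read, pre-read and Kbyte columns described above, and the sample values are illustrative.

# Derive per-second read rates and average KB per I/O from logging-period totals.
LOG_PERIOD_SECONDS = 600.0      # 10-minute logging period

def disk_read_rates(position_reads: int, pre_reads: int, read_kbytes: float):
    total_reads = position_reads + pre_reads            # "Data Base Reads"
    reads_per_sec = total_reads / LOG_PERIOD_SECONDS
    kb_per_io = read_kbytes / total_reads if total_reads else 0.0
    return reads_per_sec, kb_per_io

# Example: 120,000 position reads and 480,000 pre-reads moving 4,800,000 KB.
print(disk_read_rates(120_000, 480_000, 4_800_000.0))    # -> (1000.0, 8.0)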

3. Available Free Memory & Paging/Swapping (ResusageSpma Data) - Node Level


At system start up, memory is logically divided into FSG Cache for the Teradata file system to manage
and available free memory for UNIX to manage. FSG cache is used for table data, spools, TJs, PJs,
buddy backup, etc. Basically, FSG cache is used to manage the database data for queries and data
modification. Free memory is managed by UNIX for AMP and PE code and data and Bynet buffers,
including row redistribution and duplication. For non-TPA (trusted parallel application) work, i.e., a
UNIX job and not a Teradata RDBMS task, memory is also allocated from free memory. When the
amount of available free memory goes below a certain threshold, UNIX initiates paging out of code or
data from the UNIX-managed portion of memory. The key value for available free memory is 40 MB
per node. Customers have experienced UNIX panics when the amount of free memory goes below the
40 MB threshold. In essence, the Teradata RDBMS puts such a heavy and quick demand on memory
that it can exhaust the free memory before UNIX can free up enough memory by paging out segments.
Guarding against UNIX panics can be handled in two different ways. One is to set the tunable
parameter, FSG Cache Percent, to a lower number. This essentially reduces the amount of memory
dedicated to FSG cache and makes it available to UNIX to manage. The drawback with this
approach is that it tends to leave too much free memory that is never used. The second way is to set the
UNIX memory tunable parameters, LOTSFREE, DESFREE and MINFREE, to higher values so that
UNIX starts its paging earlier. Having the paging safeguard allows one to tune the FSG Cache Percent
parameter to a higher value so that less memory is taken away from FSG Cache to give to UNIX.
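A simple check of the logged minimum free memory per node makes the 40 MB risk visible. The sketch below assumes a per-period extract with hypothetical column names (Period, NodeID, MinFreeMemMB); the 50% safety margin is an illustrative choice.

import pandas as pd

FREE_MEM_THRESHOLD_MB = 40.0    # per-node threshold discussed above

# Hypothetical extract: Period, NodeID, MinFreeMemMB (minimum during the period).
mem = pd.read_csv("spma_memory.csv", parse_dates=["Period"])

# Flag periods that came within a 50% margin of the 40 MB threshold.
at_risk = mem[mem["MinFreeMemMB"] < FREE_MEM_THRESHOLD_MB * 1.5]
print(at_risk.sort_values("MinFreeMemMB")[["Period", "NodeID", "MinFreeMemMB"]])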

4. Number of Concurrent Active Sessions


Query response time is dependent on the number of active concurrent queries running in the system.
The common situation is many users are logged on to Teradata, but not all are running queries at the
same time. Teradata Manager provides a method for logging sessions to a log file that can be
processed later. The Performance Monitor Application Program Interface (PM/API) also provides a
means for capturing session data and charting the active sessions in real-time. Establishing this data as
part of the profile allows one to correlate an increase in response time to an increase in the number of
concurrent active AMP sessions.
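If the session counts and the Resusage CPU data are merged on time, the correlation described above can be checked numerically. The file and column names below are hypothetical placeholders for whatever the session logging procedure produces.

import pandas as pd

# Hypothetical logs: active session counts (from Teradata Manager or PM/API
# logging) and a system-wide CPU summary per logging period.
sessions = pd.read_csv("active_sessions.csv", parse_dates=["Timestamp"])  # Timestamp, ActiveSessions
cpu = pd.read_csv("spma_cpu.csv", parse_dates=["Period"])                 # Period, AvgCpuBusy

# Align each session sample with the most recent CPU logging period.
merged = pd.merge_asof(sessions.sort_values("Timestamp"),
                       cpu.sort_values("Period"),
                       left_on="Timestamp", right_on="Period",
                       direction="backward")
print(merged[["ActiveSessions", "AvgCpuBusy"]].corr())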

Secondary Elements of System Usage

The secondary elements of system usage provide a profile of the system usage, but are not critical for
immediate problem detection. They provide a background for comparison when problems occur to help
identify the kind of changes in system usage that occur at the time of the problem. The secondary elements
include:
CPU busy at the vproc level
Disk I/O by type: table data, table data CI, spool, transient journal, permanent journal
Bynet I/O by type: buddy backup (complete and partial segments), point to point, broadcast
Host I/O

1. CPU (ResusageSvpr Data) - Vproc Level


CPU time is recorded for each AMP and PE vproc in the system. In addition, every node also has a node
vproc which generally handles the physical I/O to disk and to the Bynet. Meaningful Resusage profiles to
look at are:
AMP Vproc Skewing
    Hot AMP problem
Comparison of Vproc Level CPU Use vs. Node Level CPU Use
    Indication of non-TPA work on the system, which could be the cause of node skewing
PE and Node Vproc Skewing
    Imbalance of PE or node level processing
Combined AMP, PE and Node Use
    Shows the proportion of AMP, PE and node vproc use. An unusual case is when the PEs have a
    relatively higher percentage than normal

2. Disk I/O (ResusageSvpr Data) - Vproc Level


Disk I/O at the vproc level is broken down by type of I/O. Some applications will do a lot more spool I/O
than table data I/O due to complex, multiple table joins. Other applications may build only small spools
because data is aggregated and only a small answer set is built. At different times of the day, one may see
update processing going on due to the existence of TJ I/Os. Some users' workloads are such that table data is
supposed to be updated only at night, in which case the day shift should only read table data and build
intermediate spools. If a large number of TJ I/Os take place during the day shift (more than the system
overhead), this could indicate a job running at the wrong time and a reason why query response time is
slower than normal.
A comparison between logical and physical I/Os gives an indication of how well memory is used to cache
data. Typically, spools are highly cached.
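The logical-versus-physical comparison reduces to a cache hit ratio. A minimal sketch, with illustrative input values:

# Cache hit ratio: the share of logical reads that did not require a physical I/O.
def cache_hit_ratio(logical_ios: int, physical_ios: int) -> float:
    if logical_ios == 0:
        return 0.0
    return 100.0 * (logical_ios - physical_ios) / logical_ios

# Example: 1,000,000 logical reads of which 150,000 went to disk.
print(cache_hit_ratio(1_000_000, 150_000))   # -> 85.0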
The disk I/O data to look at includes:

Logical and physical numbers of disk reads and writes, and Mbytes transferred, for:
    Table data blocks and table data CIs
    Spool blocks and spool CIs
    TJ (Transient Journal) and PJ (Permanent Journal)

For disk management data associated with running out of free disk cylinders, the data to look at are:

Mini-cylinder Packs
Cylinder Defragmentations

3. Bynet I/O (ResusageSpma Data) - Node Level


Bynet I/O covers inter-nodal communication. From a query viewpoint, after a statement is parsed and steps
are created, the Dispatcher sends steps to the AMPs for execution. For all-AMP operations, a broadcast
message is sent to all AMPs for step execution. Step completion is indicated by point to point messages.
Generally, query step messages are few in number compared to the actual processing of data. For join
processing, data is often redistributed (Bynet point to point messages) or duplicated (Bynet broadcast). For
updates via BTEQ or Tpump, updated data blocks, CIs for updated table blocks, TJs and PJs are sent to a
buddy node as part of the buddy backup process. When data blocks that were sent to a buddy node are
written to disk, a flush message is sent to the buddy node to tell it to discard the backed-up data block.
The buddy backup is especially useful when there are many updates to the same data block. This can occur
when using volume updates via Tpump, or when primary index updates are made through BTEQ or a user's
pre-processor program. The many updates to the same block can be detected in the data by looking at the
buddy backup complete and partial segments. When the number of buddy backup partial segments is
nearly zero or is relatively small compared to the complete segments, the user has the option of turning off
the use of the buddy backup mechanism for table data by setting the DBS tunable parameter,
WrtDBsToDisk, to TRUE. If the important factor to optimize is throughput, then setting the parameter to
TRUE is helpful. If response time for individual transactions is more important, then setting the value to
FALSE is the better option.
Bynet messages are:

Point to Point Messages
    Number of I/Os & Mbytes transferred; also KB per I/O
Broadcast Messages
    Number of I/Os & Mbytes transferred; also KB per I/O
Redistribution
    This is a derived value based on the buddy backup and the point to point messages. For
    each buddy backup message (which is a point to point message), there is a corresponding
    acknowledgement (ACK) message (also a point to point message). Thus, the estimated
    number of redistribution messages is the number of point to point messages minus 2
    times the number of buddy backup messages. Because messages can be piggy-backed,
    i.e., more than one ACK can be sent in a point to point message, the number of
    redistribution messages can only be regarded as a calculated estimate (see the sketch
    after this list).
Buddy Backup Complete and Partial Segments
    Buddy backup messages are included in the count of point to point messages
    Number of I/Os & Mbytes transferred; also KB per I/O
    Table data and CIs for table data
    TJs and PJs
Buddy Backup Flushes
    Buddy backup flushes are included in the count of point to point messages
    Number of I/Os & Mbytes transferred; also KB per I/O
    Table data and CIs for table data
    TJs and PJs
Bynet Merges
    Number of rows returned for a SQL SELECT statement. This does not include data sent
    back for FastExport, nor for archiving of data.
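As a minimal sketch of the redistribution estimate from the list above (the function name and the sample counts are illustrative):

# Estimated redistribution messages: each buddy backup message and its ACK are
# both point-to-point messages, so subtracting twice the buddy backup count
# approximates redistribution traffic. ACK piggy-backing makes this an estimate.
def estimated_redistribution(point_to_point_msgs: int, buddy_backup_msgs: int) -> int:
    return max(0, point_to_point_msgs - 2 * buddy_backup_msgs)

# Example: 500,000 point-to-point messages and 180,000 buddy backup messages.
print(estimated_redistribution(500_000, 180_000))   # -> 140000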

In a steady state, table data buddy backup flushes should be approximately the same as for complete
segment backups. This column is generally not important unless there is a specific performance problem
that cannot be understood or explained. Then any discrepancy between the flushes and the complete
segments can point to an internal problem with the system.

4. Host I/O (ResusageShst Data) - Vproc Level


The Host I/O data presents a picture of when data is loaded and how much data is loaded via either a
mainframe connection or LAN connection. It also shows when a large volume of data is sent back to a
host, usually for archiving of data.

Data Read From and Written To a Host (Mainframe channel connect or LAN connect)
Number of I/Os & Mbytes transferred; Also KB per I/O

Sample Charts of Typical and Problem Situations


The chart figures appear at the end of this document. While many of the samples are oriented toward problem
identification, they can still be considered part of a baseline profile.

1. Parallel Efficiency Average / Max Node CPU Chart

The chart shows the system at maximum (100%) utilization from 23:00 on 10/13 to 10:00 the next
morning. The parallel efficiency at this time is also at 100%, an excellent situation. However,
this does not say anything about efficient use of the system. One needs to check the next chart,
which shows OS as Pct of CPU.
On 10/13, from 0:00 to 6:30, there is significant skewing even though the maximum utilization is
less than 40%. This may or may not be a problem.

2. OS as PCT of CPU vs. AVG CPU Busy Chart (OS % CPU Problem)

During the peak CPU utilization, the OS as % of CPU is extremely low for a lengthy period of
time. Below 20% for 11 hours indicates there is an application problem, usually a large number of
duplicate row checks or a very large product join. (This turned out to be a duplicate row check
problem where an Insert into a table was executing and the choice of primary index was a poor
one.)

3. OS vs. DBS CPU Busy Chart (Different view of OS % CPU Problem)

During the peak period, the sum of OS % busy and DBS % busy adds up to 100%. The chart
shows the DBS was executing about 90% busy and the OS was executing only about 10% busy.
The lengthy duration of the DBS executing at such a high % indicates an application problem,
usually a large number of duplicate row checks or a very large product join. (This turned out to be
a duplicate row check problem where an Insert into a table was executing and the choice of
primary index was a poor one.)

4. Poor Parallel Efficiency Average / Max Node CPU Chart

On 8/25, from midnight through 7:00, there was extreme skewing where one node was running at
100% and the other nodes averaged as low as 20%. This definitely needs to be investigated.

On 8/26, for 2 hours (21:00 and 22:00) and on 8/27 from 1:00 through 7:00, the valley-shaped area
shows the max CPU busy a little over 25% with an average around 6%. This is an indication of a
single node (out of 4) with a single CPU running at 100% busy while all other CPUs in that node
and in the other nodes run virtually at 0%. (With one CPU in a node at 100% and the 3
other CPUs at 0%, the overall node CPU average would be 25%. With the one node at 25% and
the other nodes at 0, the overall average for all nodes would be 6.25%.) Again, there is a problem
that needs to be investigated.

5. Avg CPU Busy vs. CPU Idle Waiting for I/O Chart

The upper area of the chart indicates the CPU was idle waiting for disk or Bynet I/O completion. The I/O
wait could be due to a real disk I/O bottleneck or simply to not having enough jobs in the system.
When there are not enough jobs in the system to keep the CPUs busy, the I/O wait could be due to
disk I/O or Bynet buddy backup I/O.

6. I/O Wait vs. Disk I/Os Chart

The chart shows high spikes of I/O wait without the corresponding spikes in disk I/Os. (A better
chart would have been to show Mbytes transferred per second.) By and large, the I/O wait is not
due to an I/O throughput bottleneck.

7. CPU Idle Waiting for I/O vs. Buddy Backup Bynet I/O Chart

The chart shows a high correlation between the occurrence of the I/O wait and the Buddy Backup
traffic. This indicates the user's workload was doing table updates throughout most of the time
period. However, this does not mean the Buddy Backup was at a throughput bottleneck. The I/O
wait looks more like too few jobs in the system. If the data is available, the number of concurrent
active sessions should also be looked at for this time period.

8. Average & Min Free Memory Available vs. Total Page/Swap IOs

For most of the time, available free memory is over 150 Mbytes. On occasion it drops so
low that it causes a high number of page/swap I/Os. This looks like a case where the FSG Cache
percent should be raised so as not to leave so much free memory unused. Also, the UNIX memory
parameters, LOTSFREE, DESFREE and MINFREE, should be raised to reduce the risk of UNIX
panics.

9. Concurrent Sessions Chart

The Concurrent Sessions Chart shows the average and max node CPU busy at 10-minute logging
periods and the number of concurrent active sessions at a sub-minute frequency. The
beginning of the chart shows the number of concurrent active sessions at 10 (scale on the right-hand
side of the chart), fluctuating to about 20 for an hour or so, then picking up to 40,
dropping down, and going back to 40. At approximately 17:22, the concurrent load on the system
dropped from about 40 down to 12 while the CPU still remained at 100% busy. This helps to
explain why response time is longer at different times of the day even though the CPU is 100%
busy throughout most of the time period.

[Chart 1: Parallel Efficiency - Avg & Max Node CPU Busy % (Max CPU Busy and Avg CPU Busy, 99/10/13 - 99/10/14)]

[Chart 2: OS as PCT of CPU vs. AVG CPU Busy (Avg CPU Busy and OS % CPU, 99/10/13 - 99/10/14)]

[Chart 3: OS vs. DBS CPU Busy (OS % busy and DBS % busy, 99/10/13 - 99/10/14)]

[Chart 4: Poor Parallel Efficiency - Avg & Max Node CPU Busy % (Max CPU Busy and Avg CPU Busy, 8/24/1998 - 8/27/1998)]

[Chart 5: Avg CPU Busy vs. CPU Idle Waiting for I/O (Avg CPU Busy and I/O Wait %, 98/08/03 - 98/08/07)]

[Chart 6: I/O Wait vs. Disk I/Os (I/O Wait % and Total Disk I/Os per second, 98/08/03 - 98/08/07)]

[Chart 7: CPU Idle Waiting for I/O vs. Buddy Backup Bynet I/O (I/O Wait % and Total Buddy Backup Reads, 98/08/03 - 98/08/07)]

[Chart 8: Average & Min Free Memory Available vs. Total Page/Swap IOs (Avg MB Mem Free, Min MB Mem Free and Total Page/Swap IOs, 7/16/2001 - 7/22/2001)]

[Chart 9: Concurrent Sessions - Active Sessions vs. CPU Usage (Avg and Max node CPU busy with active session counts, 14:23 - 18:18)]