ABSTRACT
A holistic method for diagnosing performance problems in a complex information system is presented. The COE Performance Method (CPM) relies on proven techniques and offers a simple, end-to-end approach to isolating performance bottlenecks based on evidence of their actual causes. There are many excellent Oracle references that treat single technology components in greater depth than this paper; the purpose of this document is to provide a complete method of end-to-end performance analysis for an entire application of perhaps many synergistic components. While this approach is shown in the context of a networked enterprise database application, the CPM described here can easily be applied to any computing environment. An explicit goal of the COE Performance Method is to improve attainment of Service Level Agreement (SLA) targets and to quickly diagnose variances from those SLAs.
While the mathematical description suggests a level of complexity that might discourage the non-mathematician, it is not necessary to have a mathematics background to develop a reasoned understanding of the principles involved. The fundamental equation we need to understand is this:
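    response time = service time + wait time

In other words, the response time experienced for any operation is simply the sum of the time spent actually doing the work (service) and the time spent waiting for the chance to do it (wait).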
Service time deserves some consideration. In the case of a database application, a session's process might be found to spend too much service time, in the form of CPU time, processing extra data blocks because of the lack of a proper index on a particular table. That is, the server performs a full table scan instead of an index range scan, retrieving many more data blocks than would otherwise be necessary. While this additional work might initially be regarded as service time (indeed, each block retrieval operation will consist of some CPU processing time), the operation will involve even more I/O wait time, as the user's process must wait on a disk read request for each additional block. So, while the full table scan certainly incurs additional CPU service time, the symptom of poor performance will most obviously be exhibited as excessive wait time (disk I/O) rather than service (CPU) time.
Consider another example from daily life: the junk food lunch. We drop by our favorite hamburger restaurant for a quick bite and are faced with three lines of people waiting to order food from three employees acting as servers. Which line do we choose? Almost automatically, we choose the shortest line available. After several minutes, we notice someone who arrived after us is being served before us. It dawns on us that the person serving our line might still be in training; it takes that person about twice as long to fill an order as the more experienced workers. So, we intuitively understand that service time (the time it takes to actually take and fill an order) is a vital component of response time. Response time in this case is the time it takes to get our food in hand, starting from the moment we step into line in the restaurant.
Another example of the importance of wait time as a primary measure of poor performance is CPU time consumed by excess SQL parsing operations. A well-designed application will not only make use of sharable SQL and avoid hard parses, but will also avoid soft parses by keeping frequently used cursors open for immediate execution without reparsing at all, neither hard nor soft. A poorly designed application will certainly exhibit a high percentage of parse-time CPU, but will probably also incur a disproportionate amount of time waiting for latches, most notably the library cache latch. As such, even a highly CPU-consumptive process is likely to cause measurable, disproportionate waits. So, while service time must be monitored, performance problems are more likely to be quickly spotted by focusing on wait time. CPM as presented here takes a holistic approach to performance analysis and encourages the analyst to concentrate on service time or wait time as appropriate for the situation at hand. If the real problem is related to service time rather than wait time, CPM will indicate it and its cause can be corrected. Although the earlier automobile traffic example is easy to understand, the importance of wait time is all too easy to forget when dealing with the abstractions of computer software. That example, however, can highlight how a database server might have a buffer cache hit ratio of ninety-nine percent and at the same time exhibit abysmal response time, or how a large parallel query might take too long to complete while CPU consumption mysteriously drops to near-idle levels. When the CPU is not working, it is waiting.
Within the Oracle database server there exist a number of statistical values available for report, called wait events, indicating the presence or absence of internal bottlenecks. Measuring changes in the performance of an Oracle database involves viewing these wait events by value of time waited and comparing these wait times to the same measure from a different time period. Other stacks involved in the end-to-end application view typically have tools to provide similar information. We will discuss some of those tools in more detail later. Let's now forge on to the practical details of diagnosing performance issues.
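As a minimal sketch of how these values can be viewed (assuming SQL*Plus and a 9i-style local SYSDBA connection; adapt the connect string to your site), the cumulative wait events can be listed directly from the V$SYSTEM_EVENT view:

    # List Oracle wait events by cumulative time waited. The view is
    # cumulative since instance startup, so compare snapshots from two
    # periods rather than reading a single report in isolation.
    sqlplus -s "/ as sysdba" <<'EOF'
    SELECT event, total_waits, time_waited
      FROM v$system_event
     ORDER BY time_waited DESC;
    EOF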
A valid baseline must represent actual system performance taken during one or more periods of busy activity; a baseline of data gathered while the system is idle is of little use. The baseline will need to be maintained as the system evolves with respect to workload, functionality and configuration. If you add new application features, upgrade the database version, or add or replace CPUs or other hardware, the environment has changed and therefore performance may have changed. If the baseline is not reestablished, any understanding of a future performance complaint by the user community will be compromised and blurred: one will not be able to know whether a performance change is due to a configuration issue or to a bug introduced with a new application feature. The baseline is established for this system in this environment and enables a comparative analysis to be made to diagnose a specific problem.

The issue of the performance complaint itself is worthy of some note. One of the problems inherent in managing complex systems is the uncertainty of the performance metric. Performance is largely a matter of perception. A user may decide one day that a two-second response for the execution of a particular form is acceptable, but unacceptable the next day, depending on issues like how hurried or relaxed the user feels on a particular day. This suggests the information used for the reference baseline needs to be coordinated with the metrics used for the SLA. Even though performance complaints may still be lodged, at least the system or database administrator has either a defense to offer or a starting point from which to diagnose the issue.
Figure 1
PROBLEM STATEMENT
A clear and unambiguous definition of both good and bad behavior is essential. The problem statement is more than half of the battle for a solution and defines success for us. Moreover, the discipline of stating the problem clearly and concisely often uncovers possible solutions. There is an undeniable siren song tempting us to gloss over this step, but the temptation must be resisted so that misunderstandings and inefficiencies are avoided. If you think you are solving one problem and the customer or user has a different expectation, valuable time will be wasted addressing misguided issues. An example of a weak problem definition would be, "Online queries are slow and need to be much faster," while a good problem statement might be, "The Customer Service Name Lookup screen normally returns query results in 3-4 seconds, but has been taking more than 20 seconds since the open of business this morning." Define the problem specifically and concisely, establish the measure of success with the customer and make certain you have agreement. The accordant goal must, of course, be reasonable and realistic. The definition needs to be quantifiable in terms conforming to the SLA metrics. The weak problem statement example above is harmfully vague: how would we know when we have succeeded in finding a solution? In our good example, if the SLA requires specific response times for the application function in question, we at least have a target for success and therefore a greater probability of success.

Sometimes a clear problem statement is elusive. When things go wrong, often during critical business hours, tempers flare and communication lines break down. Sometimes the issue is obvious, while at other times we wonder if we are simply imagining a problem that does not exist. When in doubt, ask yourself the simple question, "What makes you think there is a problem here?" and then demand of yourself a very specific answer based on symptomatic behavior. As Winston Churchill said, "Never overlook the obvious." It may well be that the cause of the problem is already understood or suspected. A clear description of what the problem is and isn't will go a long way toward quickly resolving both obvious and obscure problems.

Take the time to clearly define the nature of the performance symptom and the time and circumstances of its appearance or disappearance, and to establish a valid test. Say what is known about the problem, and describe what is not known. A previously developed test case is ideal; if one does not exist in advance, now is the time to create one. A test case can be as simple as the execution of a procedure through SQL*Plus and then also through the web server, with a measurement of response times. The result of the test needs to be compared to the baseline, so the importance of a valid and current baseline is apparent. If a baseline was not established in advance, get one now so that you at least have the current bad performance captured and have something against which to measure the impact of changes. Not all changes are good.
Simple monitoring tools can be built from standard utilities, using a scripting language such as perl to analyze the text output. The tool can then phone home when exceptions are encountered or predefined thresholds are exceeded. Having an integrated monitoring environment will facilitate rapid and accurate stack identification during a performance crisis. While elaborate third party tools are available for such an infrastructure, off-the-shelf and freeware tools are often entirely adequate, although any tools you choose will have to be integrated into your environment. For example, each UNIX platform in the enterprise might have a scheduled process to gather sar and netstat statistics at regular intervals. If Statspack snapshots are also collected at similar times, it is a simple matter to analyze reports from those tools for a period of concern and compare the available data to reports from, say, exactly one week or one month earlier. If the application workload is similar for both periods, but the performance problem did not exist in the earlier period, we have a fast way to compare bad performance data to baseline data. If the problem is with the underlying UNIX platform or the network, it should be apparent immediately. Even without the baseline, a trained technician will recognize symptoms of constraint: a high percentage of CPU wait time or process swapping activity, for example. See Figure 2 for an example of vmstat output.

If no obvious starting point presents itself, we recommend you start with the database server itself. One obvious reason is that the database administrator understands that stack best. Another advantage is that the Oracle server gathers and provides information offering clues to problems across other stacks. For example, network problems often show up as a specific Oracle wait event, "SQL*Net more data to client". Knowing the response time through the database stack will allow you to determine whether most of the overall response time is spent in the database or not. This in turn will direct your attention to the database itself or to another stack.
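Returning to the scheduled sar and netstat gathering mentioned above, a minimal collection setup might be nothing more than a pair of crontab entries; the paths, intervals and sample counts below are illustrative assumptions, not prescriptions:

    # Sample CPU and network statistics every 15 minutes into dated logs,
    # so a problem period can be compared to the same window a week earlier.
    0,15,30,45 * * * * /usr/sbin/sar -u 60 14 >> /var/perf/sar-`date +\%Y\%m\%d`.log
    0,15,30,45 * * * * /usr/bin/netstat -i >> /var/perf/net-`date +\%Y\%m\%d`.log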
$ vmstat 5 5
 procs       memory             page                           disk         faults       cpu
 r b w     swap     free   re   mf pi po fr de sr  s0 s1 s2 s3   in   sy   cs  us sy id
 0 1 0     . . .
 1 1 0  31807664  443376   10 1037  . . .
 1 4 0  31798856  443008    8   60  . . .
 0 1 0  31807744  441872    7   62  . . .
 0 0 0  31808072  441376    5   72  . . .
This is a vmstat sample taken from a 32-processor Sun system for five intervals of five seconds each. The first line of vmstat reports averages since boot, so we ignore it. A quick glance under the procs section tells us there is some process run queue wait time (r is either 0 or 1 in this example) and some resource waiting (b > 0 for most interval samples). This is generally considered good, non-bottlenecked performance, although the b value indicates a process blocked by an I/O wait, so disk may need balancing if that b value grows. Run queues are averaged across all CPUs on Solaris. Memory paging and swapping are not the same: paging, even with these seemingly large numbers, is quite normal. The sr column tells you how often the page scanner daemon is looking for memory pages to reclaim, shown in pages scanned per second. Consistently high numbers here (> 200) are a good indication of a real (not virtual) memory shortage. The fields displayed are:

procs   Report the number of processes in each of the three following states:
        r   in run queue
        b   blocked for resources (I/O, paging, and so forth)
        w   runnable but swapped
memory  Report on usage of virtual and real memory.
        swap  amount of swap space currently available (Kbytes)
        free  size of the free list (Kbytes)
page    Report information about page faults and paging activity, in units per second.
        re  page reclaims
        mf  minor faults
        pi  kilobytes paged in
        po  kilobytes paged out
        fr  kilobytes freed
        de  anticipated short-term memory shortfall (Kbytes)
        sr  pages scanned by clock algorithm
disk    Report the number of disk operations per second, per disk unit shown.
faults  Report the trap/interrupt rates (per second).
        in  (non clock) device interrupts
        sy  system calls
        cs  CPU context switches
cpu     Give a breakdown of percentage usage of CPU time. On MP systems, this is an average across all processors.
        us  user time
        sy  system time
        id  idle time
Figure 2
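Building on the sr guidance above, a small filter can turn the raw numbers into an alert. This is only a sketch, and it assumes sr is the twelfth column, as it is in the Solaris layout shown in Figure 2:

    # Flag sustained page scanning (sr > 200 pages/sec) over a one-minute
    # vmstat run; NR > 3 skips the two header lines and the first sample.
    vmstat 5 13 | awk 'NR > 3 && $12 > 200 { hits++ }
        END { if (hits >= 3) print "WARNING: sustained page scanning" }'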
TIMING IS EVERYTHING
An important consideration, whether evaluating third party tools or rolling your own, is to gather and analyze data in a meaningful manner. For the most part, we are dealing with statistical samples when we monitor hardware and software resources, so sampling techniques must be sensible with respect to sample size and interval. The vmstat report shown in Figure 2 was taken at five-second intervals. While short intervals show performance spikes quite well, they also tend to exaggerate variances in values and therefore contain statistical noise. A better method is to take concurrent short and long samples, so that both averages and variances can be analyzed for a meaningful picture of performance.
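A minimal way to implement this with vmstat alone is to run a short-interval and a long-interval sample concurrently over the same window (the intervals and counts here are illustrative):

    # One hour of concurrent samples: the 5-second stream exposes spikes
    # and variance, the 60-second stream gives the smoothed average.
    vmstat 5 720 > /tmp/vmstat-5s.log &
    vmstat 60 60 > /tmp/vmstat-60s.log &
    wait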
$ iostat -xtc
                        extended device statistics                        tty         cpu
device    r/s   w/s   kr/s   kw/s  wait  actv  svc_t  %w  %b   tin  tout  us sy wt id
sd0       1.7   6.1   34.5   46.9   0.0   0.2   26.1   0   4     0    48  18  6  5 70
sd1       0.1   0.0    1.1    1.7   0.0   0.0    7.6   0   0
sd2       2.0  35.9   24.0  416.1   0.0   0.2    4.9   0  18
sd3       1.2  35.9    9.5  416.1   0.0   0.2    4.9   0  18
sd4       0.3   0.4   15.8   13.5   0.0   0.0   17.7   0   1
sd5       8.8  14.1   28.4   17.1   0.0   0.1   10.2   0   6
sd15      1.5   7.8   97.5   11.8   0.0   0.2   14.0   0   5
sd16      2.3   6.7  140.1   46.7   0.0   0.3   14.6   0  12
. . .

This is an abbreviated iostat report from the same 32-processor system shown in Figure 2. The svc_t column is actually the response time for the disk device, however misleading the name. When looking for input/output bottlenecks on disks, a rule of thumb is to look for response time greater than 30 milliseconds for any single device. A well-buffered and well-managed disk system can show response times under 10 milliseconds. Here are the field names and their meanings:

device  name of the disk
r/s     reads per second
w/s     writes per second
kr/s    kilobytes read per second
kw/s    kilobytes written per second
wait    average number of transactions waiting for service (queue length)
actv    average number of transactions actively being serviced
svc_t   average service time, in milliseconds
%w      percent of time there are transactions waiting for service (queued)
%b      percent of time the disk is busy (transactions in progress)
Figure 3
A sudden burst of activity might cause a single disk drive to be so busy as to cause process queuing, yet may not be of any real concern unless it becomes chronic. On the other hand, long iostat samples will average disk service time and tend to hide frequent spikes, possibly masking a real problem. See Figure 4 for an example of a CPU resource measurement illustrating how large variances in reported data can be misleading. If you look at the data for too short an interval, you might conclude CPU idle time is nearly seventy percent or nearly as low as twenty percent. If you are trying to analyze a performance anomaly during a period of high or low CPU usage, such a narrow slice of data can be quite helpful. On the other hand, taken as an indication of the norm, such a microscopic view could be completely misleading.

The first priority at this early juncture is to eliminate obvious problems that can skew performance data and blur the analysis. We are concerned with quickly ascertaining the overall health of the components of each technology stack to make sure we know where the possible problem both is and isn't. We do this by looking for exceptions to what we know to be normal behavior.
CPU Idle times extracted from a sar report. The jagged line represents samples taken at fifteen-minute intervals. The trend line is shown to illustrate the degree to which variances among individual samples can be distracting and misleading. You need both average and variance information to get a true picture of what is happening at the hardware and operating system levels. The interval marked Low is entirely different from the interval marked High. A narrow peek at a performance variation can be useful for analyzing bottlenecks, but can be misleading if taken as an indication of the norm.
Figure 4
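As a sketch of the arithmetic behind Figure 4, both statistics can be pulled straight from a sar log. This assumes the Solaris layout, where data lines begin with a timestamp and %idle is the last column; the data file name is illustrative:

    # Mean and variance of CPU idle time from a binary sar data file.
    sar -u -f /var/adm/sa/sa15 | awk '
        /^[0-9]/ && $NF ~ /^[0-9]+$/ { n++; x = $NF; s += x; ss += x * x }
        END { m = s / n; printf("n=%d mean=%.1f variance=%.1f\n", n, m, ss / n - m * m) }'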
For example, perhaps we received a report that an Oracle server had severe latch free wait events during a period of bad performance. If we respond directly to that symptom without adequate high-level analysis of the overall platform/database technology stack, we might overlook heavy process queuing at the operating system level. That is, the Oracle database might appear to be the problem, when the real issue is a lack of capacity. Reports from vmstat or iostat would indicate chronic process run queues, so we would know that the Oracle database itself is probably not the culprit, at least not the primary culprit. Once the resource limit is addressed, by tuning the application, rescheduling processes or adding more or faster processors, we can proceed once again with the stack analysis and identify server constraints in their proper context.
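A quick way to check for the chronic run queues described above is sar -q, which reports the average run-queue length (runq-sz) and the percentage of time the queue was occupied (%runocc); the interval and count are illustrative:

    # Fifteen one-minute samples of run-queue behavior; consistently high
    # runq-sz with %runocc near 100 suggests a CPU capacity problem.
    sar -q 60 15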
tracert mail12

Tracing route to mail12.us.snerdley.com [148.22.88.200]
over a maximum of 30 hops:

  1   <10 ms   <10 ms    10 ms  whq17jones-rtr-755f1-0-a.us.snerdley.com [148.22.216.1]
  2   <10 ms   <10 ms   <10 ms  whq4op3-rtr-714-f0-0.us.snerdley.com [148.22.252.23]
  3   220 ms   210 ms   231 ms  mail12.us.snerdley.com [148.22.88.200]
Trace complete.
Sample tracert output used to identify potential network problems. Coupled with ping, a number of common issues can be quickly identified. Ping each device shown in the tracert, with the don't-fragment bit set and a large packet size, to isolate individual segment performance. Although tracert shows timing information, it is for very small packets and may not isolate bottlenecks, so ping is used in conjunction with tracert.
Figure 5
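For example, to probe the second hop above with large, unfragmentable packets, the Windows ping that pairs with this tracert might be used as follows (1472 bytes of payload fills a 1500-byte Ethernet MTU once the IP and ICMP headers are added):

    rem -f sets the don't-fragment bit; -l sets the payload size in bytes.
    ping -f -l 1472 148.22.252.23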
A MEASURE OF DIPLOMACY
Besides tools to cover the technology spectrum under your domain, you will also need occasional cooperation from other experts. One of the more common problems of the contemporary enterprise is a direct outgrowth of the integration of disparate technologies: communication barriers. Often, the administrators of the database, the hardware platform and the network belong to entirely different management structures. While a performance methodology such as this cannot address political turf, cooperation is necessary to quickly diagnose potentially complex problems.
ACKNOWLEDGEMENTS
The Center of Expertise Performance Methodology has been a collaborative work of many individuals. Current and former members of COE, including Jim Viscusi, Ray Dutcher, Kevin Reardon and others, provided much of the early research. Cary Millsap offered the theoretical foundation for this effort.
BIBLIOGRAPHY
Practical Queueing Analysis, Mike Tanner, McGraw-Hill Book Company (out of print in the United States, but a classic worth finding; available at Amazon's United Kingdom site)
The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, Raj Jain, John Wiley & Sons
Capacity Planning for Web Performance, Daniel A. Menasce, Virgilio A. F. Almeida, Prentice Hall
Oracle8i Designing and Tuning for Performance, Release 2 (8.1.6), Oracle Corporation, part A76992-01
Oracle9i Database Performance Methods, Oracle Corporation, part A87504-02
Oracle9i Database Performance Guide and Reference, Oracle Corporation, part A87503-02
Sun Performance and Tuning: Java and the Internet, Adrian Cockcroft, Richard Pettit, Sun Microsystems Press, a Prentice Hall title
Oracle Performance Tuning 101, Gaja Krishna Vaidyanatha, Kirtikumar Deshpande, John A. Kostelac, Jr., Oracle Press, Osborne/McGraw-Hill
Oracle Applications Performance Tuning Handbook, Andy Tremayne, Oracle Press, Osborne/McGraw-Hill
Yet Another Performance Profiling Method (YAPP), Anjo Kolk, http://metalink.oracle.com