WTF Is Your Linux Server Doing Tonight

Oracle Cloud. Software. Hardware. Training. Consulting. Mythics Complete.
WTF is my server doing?
September 21st, 2015
Introduction to Erik Benner
Erik Benner
Enterprise Architect
ebenner@mythics.com
@erik_benner
TalesFromTheDatacenter.com
Mythics.com/blog
• Published Author
• RAC Attack Ninja
• Linux since 1992
• Solaris since 1996
• DB 12c BETA user
• Prelaunch ODA “comet”
• First Version of Oracle…7 in 1994
• ZFS since “Thumper”
• OEM 12c since Product Launch
• OAUG EM for Apps SIG co-chair
• OEM12c CAB Member
• IOUG Solaris SIG Leader
What would you say if you were asked:
How busy is that system?
A: I have no idea…
A: 10%
A: Why do you want to know?
A: I’m sorry, you don’t understand your question….
What is system performance?
• In a broader sense, system performance
refers to how well the computer resources
accomplish the work they are designed to
do. The performance of any computer
system may be defined by two criteria:
– Response time
– Throughput
Performance Tuning
• Performance tuning is a process of
observing the operations of a system and
making adjustments to different
components based on those observations.
Key factors
• Hardware
• Operating system
• Application software
• Users
• Changes over time
Managing system performance
• Monitoring usage of system resources
• Selecting tools to measure system performance
• Diagnosing problems from the results of
measurement
• Tuning the operating system and application
parameters
• Upgrading the hardware resources of the
system
• Planning for the optimal performance
Bottlenecks
• A resource is a bottleneck if the size of a
request exceeds the available resource. In
other words, a bottleneck is a limitation of
system performance due to the
inadequacy of a hardware or software
component, or of the system’s
organization.
Two ways to solve a bottleneck
• Increasing the size of the available resource
• Decreasing the size of the request
Guidelines in tuning a real system
• Never tune at random
• Tune one area at a time
• Change only one parameter at a time
• Always use at least two tools
• Experience is the best tool
• Know when to say stop
Quick Tips #1 - Disk
• The system will usually have a disk bottleneck
• Track how busy the busiest disk is
• Look for unbalanced, busy or slow disks with iostat
• Watch out for sd_max_throttle limiting throughput when set too low
• Watch out for the RAID cache being flooded on writes, which causes a sudden,
very large increase in write service time
• Options: timestamp, look for busy controllers, ignore idle disks:
[root@hol sa]# iostat -xnz T dm-0
Linux 3.8.13-98.1.2.el6uek.x86_64 (hol) 09/22/2015 _x86_64_ (1 CPU)
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
dm-0 0.00 0.00 15.24 3.06 230.64 24.46 13.94 0.09 4.68 0.21 0.38
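The "busiest disk" check can be sketched in a few lines of Python. This is only an illustration, assuming the extended-statistics column layout shown in the sample output above (newer sysstat releases name some columns differently). The function names `parse_iostat` and `busiest_disk` are made up for this example.

```python
# Sample text copied from the iostat output on this slide.
SAMPLE = """\
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
dm-0 0.00 0.00 15.24 3.06 230.64 24.46 13.94 0.09 4.68 0.21 0.38
"""


def parse_iostat(text):
    """Return {device: {column: value}} from `iostat -x` style text."""
    columns = None
    disks = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Device:":
            columns = fields[1:]          # the header row names the columns
        elif columns and len(fields) == len(columns) + 1:
            disks[fields[0]] = dict(zip(columns, map(float, fields[1:])))
    return disks


def busiest_disk(disks):
    """Return (device, %util) for the disk with the highest utilization."""
    name = max(disks, key=lambda d: disks[d]["%util"])
    return name, disks[name]["%util"]


if __name__ == "__main__":
    print(busiest_disk(parse_iostat(SAMPLE)))
```

With several disks in the report, the same two calls pick out the one worth looking at first.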
Quick Tips #2 - Network
• If you ever see a slow machine that also appears to be idle, suspect a
network lookup problem, i.e. the system is waiting for some other system
to respond.
• Poor Network Filesystem response times may be hard to see
– Use iostat -xn 30 on a Solaris client
– wsvc_t is the time spent in the client waiting to send a request
– asvc_t is the time spent in the server responding
– %b will show 100% whenever any requests are being processed. It does NOT
mean that the network server is maxed out, as an NFS server is a complex
system that can serve many requests at once.
• Name server delays are also hard to detect
– Overloaded LDAP or NIS servers can cause problems
– DNS configuration errors or server problems often cause 30s delays as the
request times out
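One rough way to spot name service delays is to time the lookups directly. The sketch below uses Python's standard `socket.getaddrinfo`. The helper names `time_lookup` and `slow_lookups` and the 1 second threshold are assumptions chosen for illustration, not values from the slides.

```python
import socket
import time


def time_lookup(host, resolver=socket.getaddrinfo):
    """Return (seconds, ok) for one name lookup of `host`."""
    start = time.monotonic()
    try:
        resolver(host, None)
        ok = True
    except OSError:          # socket.gaierror is a subclass of OSError
        ok = False
    return time.monotonic() - start, ok


def slow_lookups(hosts, threshold=1.0, resolver=socket.getaddrinfo):
    """Return (host, seconds, ok) for lookups that failed or exceeded `threshold`."""
    slow = []
    for host in hosts:
        elapsed, ok = time_lookup(host, resolver)
        if not ok or elapsed > threshold:
            slow.append((host, round(elapsed, 3), ok))
    return slow
```

Calling `slow_lookups(["db01", "ldap01"])` with hostnames from your own environment (these two are placeholders) should return an empty list on a healthy system. A lookup that takes tens of seconds points at the timeout behaviour described above.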
Quick Tips #3 - Memory
• Avoid the common vmstat misconceptions
– The first line is average since boot, so ignore it
• Linux, Other Unix and earlier Solaris Releases
– Ignore “free” memory
– Use high page scanner “sr” activity as your RAM shortage indicator
• Solaris 8 and Later Releases
– Use “free” memory to see how much is left for code to use
– Use non-zero page scanner “sr” activity as your RAM shortage indicator
• Don’t panic when you see page-ins and page-outs in vmstat
– Normal filesystem activity uses paging
[root@hol /]# vmstat 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 5644684 62776 64328 0 0 127 23 79 574 0 1 98 0 0
0 0 0 5644668 62776 64328 0 0 0 0 8 9 0 0 100 0 0
0 0 0 5644676 62776 64328 0 0 0 0 7 6 0 0 100 0 0
0 0 0 5644676 62776 64328 0 0 0 0 7 7 0 0 100 0 0
^C
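The "ignore the first line" rule is easy to apply when post-processing vmstat output. The following Python sketch assumes the Linux column layout shown above. `parse_vmstat` is a hypothetical helper name.

```python
# First three samples copied from the vmstat output on this slide.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 5644684 62776 64328 0 0 127 23 79 574 0 1 98 0 0
0 0 0 5644668 62776 64328 0 0 0 0 8 9 0 0 100 0 0
0 0 0 5644676 62776 64328 0 0 0 0 7 6 0 0 100 0 0
"""


def parse_vmstat(text):
    """Return a list of {column: int} samples, dropping the since-boot line."""
    columns = None
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "procs":
            continue                       # blank line or group banner
        if fields[0] == "r":
            columns = fields               # column header (vmstat reprints it)
        elif columns and len(fields) == len(columns):
            samples.append(dict(zip(columns, map(int, fields))))
    return samples[1:]                     # first sample is the average since boot


if __name__ == "__main__":
    for sample in parse_vmstat(SAMPLE):
        print(sample["free"], sample["si"], sample["so"])
```

In the sample, the since-boot line reports `bi` of 127 while every real interval reports 0, which is why averaging it in with the others is misleading.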
Quick Tips #4 - CPU
• Look for a long run queue (vmstat procs r) - and add CPUs
– To speed up with a zero run queue you need faster CPUs, not more of them
• Check for CPU system time dominating user time
– Most systems should have lots more Usr than Sys, as they are running
application code
– But... dedicated NFS servers should be 100% Sys
– And... dedicated web servers have high Sys as well
– So... assume that lots of network service drives Sys time
• Watch out for processes that hog the CPU
– Big problem on user desktop systems - look for looping web browsers
– Web search engines may get queries that loop
– Use resource management or limit cputime (ulimit -t) in startup scripts to
terminate web queries
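The first two checks above can be written down as a small decision helper. This Python sketch is illustrative only. The function name `cpu_hints` is invented, and treating "long" as "more runnable processes than CPUs" is an assumption, since the slide does not give a number.

```python
def cpu_hints(run_queue, cpus, usr_pct, sys_pct):
    """Turn vmstat-style CPU numbers into the slide's rule-of-thumb hints."""
    hints = []
    if run_queue > cpus:
        # Assumed threshold: more runnable processes than CPUs.
        hints.append("long run queue: adding CPUs may help")
    elif run_queue == 0:
        hints.append("zero run queue: only faster CPUs would speed things up")
    if sys_pct > usr_pct:
        hints.append("Sys dominates Usr: expected for NFS/web servers, "
                     "otherwise investigate")
    return hints


if __name__ == "__main__":
    # r=6 on a 4-CPU box, 20% user and 55% system time
    for hint in cpu_hints(6, 4, 20, 55):
        print(hint)
```

The hints are starting points for investigation, in line with the guideline to always confirm with at least two tools.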
Quick Tips #5 - I/O Wait
• Look for processes blocked waiting for disk I/O (vmstat procs b)
– This is what causes CPU time to be counted as wait not idle
– Nothing else ever causes CPU wait time!
• CPU wait time is a subset of idle time, consumes no resources
– CPU wait time is not calculated properly on multiprocessor machines on
older Solaris releases; it is greatly inflated!
– CPU wait time is no longer calculated in Solaris 10; it is reported as zero
– Bottom line: don’t worry about CPU wait time, it’s a broken metric
• Look at individual process wait time using microstates
– prstat -m or SE toolkit process monitoring
• Look at I/O wait time using iostat asvc_t
Quick Tips #6 - iostat
• For Solaris, remember “expenses”: iostat -xPncez 30
• Add -M for Megabytes, and -T d for timestamped logging
• Use 30 second interval to avoid spikes in load. Watch asvc_t which
is the response time for Solaris
• Look for regular disks over 5% busy that have response times of
more than 10ms as a problem.
• If you have cached hardware RAID, look for response times of more
than 5ms as a problem.
• Ignore large response times on idle disks that have filesystems. It’s
not a problem; the cause is the fsflush process
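The thresholds above translate directly into a small check. This Python sketch is one reading of the rule of thumb. `disk_problem` is a hypothetical name, and applying the 5% busy floor to cached RAID as well as regular disks is an assumption, since the slide does not say either way.

```python
def disk_problem(busy_pct, response_ms, cached_raid=False):
    """Apply the slide's rule of thumb to one disk sample.

    busy_pct    -- percent busy (%b on Solaris)
    response_ms -- response time in milliseconds (asvc_t on Solaris)
    """
    if busy_pct <= 5.0:
        return False                 # mostly idle disk: ignore large response times
    limit_ms = 5.0 if cached_raid else 10.0
    return response_ms > limit_ms


if __name__ == "__main__":
    print(disk_problem(40.0, 18.0))                   # busy and slow
    print(disk_problem(1.0, 45.0))                    # idle disk
    print(disk_problem(40.0, 7.0, cached_raid=True))  # slow for cached RAID
```

Feeding it values sampled at the 30 second interval suggested above avoids reacting to short spikes.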
Quick Tips #7 – logs!
• /var/log for Linux
• /var/adm for Solaris
• messages file: contains OS errors and many application errors
• secure file: su and sudo output
• last command: who logged in
Quick Tips #8 – System Activity Reporter
• The Linux kernel maintains internal counters that keep track of all
requests, their completion times, I/O block counts and so on. From this
information, sar calculates rates and ratios of these requests to help
find bottleneck areas.
• The main thing to understand about sar is that it reports all
activity over a period of time. So make sure that sar is enabled all
the time, not just during your lunch breaks and vacations.
[root@hol sa]# sar
Linux 3.8.13-98.1.2.el6uek.x86_64 (hol) 09/22/2015 _x86_64_ (1 CPU)
10:37:53 AM LINUX RESTART
10:40:01 AM CPU %user %nice %system %iowait %steal %idle
10:50:01 AM all 0.03 0.00 0.05 0.03 0.00 99.89
11:00:01 AM all 0.07 0.00 0.49 0.22 0.00 99.22
11:10:01 AM all 0.02 0.00 0.02 0.01 0.00 99.96
11:20:01 AM all 0.01 0.04 0.05 0.02 0.00 99.88
Average: all 0.03 0.01 0.15 0.07 0.00 99.74
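Because sar keeps history, the interesting question is usually "which interval was the busiest?". The Python sketch below answers that for the CPU report. It assumes the 12-hour timestamp format shown above (time plus AM/PM in two fields), and the names `parse_sar_cpu` and `busiest_interval` are invented for this example.

```python
# CPU report rows copied from the sar output on this slide.
SAMPLE = """\
10:40:01 AM CPU %user %nice %system %iowait %steal %idle
10:50:01 AM all 0.03 0.00 0.05 0.03 0.00 99.89
11:00:01 AM all 0.07 0.00 0.49 0.22 0.00 99.22
11:10:01 AM all 0.02 0.00 0.02 0.01 0.00 99.96
11:20:01 AM all 0.01 0.04 0.05 0.02 0.00 99.88
Average: all 0.03 0.01 0.15 0.07 0.00 99.74
"""


def parse_sar_cpu(text):
    """Return [(timestamp, {column: value})] for the per-interval rows."""
    columns = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] == "Average:":
            continue                         # blank, too short, or summary row
        if fields[2] == "CPU":
            columns = fields[3:]             # %user, %nice, ...
        elif columns and len(fields) == len(columns) + 3:
            stamp = " ".join(fields[:2])     # e.g. "11:00:01 AM"
            rows.append((stamp, dict(zip(columns, map(float, fields[3:])))))
    return rows


def busiest_interval(rows):
    """Return the row with the lowest %idle."""
    return min(rows, key=lambda row: row[1]["%idle"])


if __name__ == "__main__":
    stamp, stats = busiest_interval(parse_sar_cpu(SAMPLE))
    print(stamp, stats["%idle"])
```

On the sample data this selects the 11:00:01 AM interval, the only one where %system and %iowait rise noticeably.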
Quick Tips #8 – sar
Common sar options
-f      past file, aka Mr. Peabody
-u      CPU, all processors
-P ALL  per processor
-r      memory usage
-S      swap usage
-d      disk
-b      I/O
-q      run queue
-n      network
Quick Tips #8 – sar
DEMO!
Recipe to fix a slow system
• Essential Background Information
– What is the business function of the system?
– Who and where are the users?
– Who says there is a problem, and what is slow?
– What changed recently and what is on the way?
• What is the system configuration?
– CPU/RAM/Disk/Net/OS/Patches, what application software is in use?
• What are the busy processes on the system doing?
– use top, prstat, pea.se or /usr/ucb/ps uax | head
• Report CPU and disk utilization levels, iostat -xPncezM -T d 30
– What is making the disks busy?
• What is the network name service configuration?
– How much network activity is there? Use netstat -i 30 or nx.se 30
• Is there enough memory?
– Check free memory and the scan rate with vmstat 30
Variable Clock Rate CPUs
• Laptops and other low-power devices do this all the time
– Watch CPU usage of a video application and toggle mains/battery power….
• Server CPU Power Optimization - AMD PowerNow!™
– AMD Opteron server CPU detects overall utilization and reduces clock rate
– Actual speeds vary, but for example could reduce from 2.6GHz to 1.2GHz
– Changes are not understood or reported by operating system metrics
– Speed changes can occur every few milliseconds (thermal shock issues)
– Dual core speed varies per socket, Quad core varies per core
– Quad core can dynamically stop entire cores to save power
• Possible scenario:
– You estimate 20% utilization at 2.6GHz
– You see 45% reported in practice (at 1.2GHz)
– Load doubles, reported utilization drops to 40% (at 2.6GHz)
– Actual mapping of utilization to clock rate is unknown at this point
• Note: Older and "low power" Opterons used in blades fix clock rate
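The arithmetic behind the scenario can be checked with a short Python sketch. It assumes that the work a CPU completes scales linearly with its clock rate, which is a simplification. `utilization_at` is a hypothetical helper name.

```python
def utilization_at(reported_pct, current_ghz, target_ghz):
    """Rescale a utilization figure from one clock rate to another.

    Assumes work done per second is proportional to clock rate.
    """
    return reported_pct * current_ghz / target_ghz


if __name__ == "__main__":
    # 20% estimated at 2.6GHz, but the CPU has clocked down to 1.2GHz
    print(round(utilization_at(20.0, 2.6, 1.2), 1))
    # Load doubles and the CPU returns to full speed
    print(round(utilization_at(2 * 20.0, 2.6, 2.6), 1))
```

Under the linear assumption the clocked-down figure comes out at roughly 43%, in the same range as the 45% in the scenario, while the doubled load at full clock rate is 40%. That is how reported utilization can fall while real load rises.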