
Notes for Basic Troubleshooting class

Overview: Troubleshooting is figuring out what the problem is, using the tools and information you
have available. I break it down into the following steps:
• Gather initial problem description
• Confirm that description
• Define area of possibilities
• Confirm that area
• REPEAT (keep dividing the possible solutions into a smaller set)
    • Gather additional information
    • Confirm additional information
    • Refine area of possibilities
    • Confirm that area
• UNTIL either the problem is identified and corrected, or progress stops
• EXPLAIN (to the customer if need be) how the problem you have found can cause all of the
  symptoms you have been presented with
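
The REPEAT/UNTIL part of the steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not a tool: the function name narrow_down, the area names, and the three sample checks are all made up for the example. Each check is one confirmed fact, and it keeps only the areas that are still possible given that fact.

```python
def narrow_down(candidates, checks):
    """Shrink the set of possible problem areas one confirmed fact at a time.

    candidates: names of possible problem areas (hypothetical labels).
    checks: list of (description, is_still_possible) pairs; the function in
            each pair returns True for areas that the fact does not rule out.
    """
    remaining = set(candidates)
    log = []
    for description, is_still_possible in checks:
        refined = {area for area in remaining if is_still_possible(area)}
        if not refined:
            # A fact that rules out everything is suspect: keep the current
            # set and go back to confirm the fact instead of trusting it.
            log.append(f"{description}: rules out everything, re-confirm this fact")
            continue
        remaining = refined
        log.append(f"{description}: {sorted(remaining)}")
        if len(remaining) == 1:
            break  # UNTIL: only one area left to investigate
    return remaining, log


# Made-up example: four broad areas and three confirmed facts.
areas = {"networking", "storage", "faults", "guest_os"}
checks = [
    ("host answers ping", lambda area: area != "networking"),
    ("several VMs affected", lambda area: area != "guest_os"),
    ("datastore is readable", lambda area: area != "storage"),
]
remaining, log = narrow_down(areas, checks)
print(remaining)
for entry in log:
    print(entry)
```

The log list doubles as the start of the EXPLAIN step: it records which fact removed which area, in the order you confirmed them.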

Metaphor #1

Highways lead in all directions, each a possible course of action


How did you get here? (Installation? Upgrade? Production?)
Gather initial problem description - eliminate some possibilities, choose a possible direction. Keep
a sharp eye out for "Off Ramps".
Off ramps are facts that short-circuit the process and strongly suggest a final answer. A few
are:
• MCE listed in a log
• Smell of smoke
• Full hard disk (did the disk cause the problem, or did the problem fill the disk?)
• Exact IKB/KB match (confirm, and check the source)
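
Two of these off ramps (an MCE in a log, a full disk) are easy to check mechanically. Here is a minimal Python sketch of that check. The search pattern, the 95% threshold, and the sample log lines are assumptions for illustration; real log wording varies by product and version, so adjust the pattern to what your logs actually say.

```python
import re

# Assumed pattern: the literal token "MCE" or the phrase "machine check".
MCE_PATTERN = re.compile(r"\bMCE\b|machine check", re.IGNORECASE)


def find_off_ramps(log_lines, disk_used_fraction, full_threshold=0.95):
    """Return a list of off-ramp hints found in the log and disk usage.

    disk_used_fraction: used space divided by total space (0.0 to 1.0).
    full_threshold: assumed cut-off for calling a disk "full".
    """
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if MCE_PATTERN.search(line):
            hits.append(f"possible MCE at log line {number}: {line.strip()}")
    if disk_used_fraction >= full_threshold:
        hits.append(
            f"disk is {disk_used_fraction:.0%} full: "
            "cause of the problem, or filled by it?"
        )
    return hits


# Made-up log lines, only to show the call.
sample_log = [
    "boot ok",
    "cpu0: Machine Check Exception detected",
    "done",
]
for hit in find_off_ramps(sample_log, disk_used_fraction=0.99):
    print(hit)
```

On a live system the usage fraction can be taken from shutil.disk_usage(path), as used divided by total. A hit is a hint, not an answer: you still confirm it, as with the KB match.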

Occam's razor - This is often paraphrased as "All other things being equal, the simplest solution
is the best." In other words, when multiple competing theories are equal in other respects, the
principle recommends selecting the theory that introduces the fewest assumptions and postulates
the fewest entities. (thanks Wikipedia)

Example – If the NIC fails, we have no reason to suspect anything else on the network (DNS,
iSCSI server, etc.). Keep your guesses simple; find the “lowest common denominator”.

Gather initial problem description – have the customer show you “where it hurts”. Use WebEx and
see the actual problem. Don’t assume the customer is using VMware-specific terms correctly.
Repeat what you heard back to the customer using different words. You can’t hit the target (fix
the problem) if you don’t know what it is. Then, document the problem in Siebel. Understanding
the problem is a major part of solving the issue.

Confirm the problem – Recreate the problem on either the customer's system or in the lab. If this
is not practical, then get the logs and cut-n-paste the parts you need into a separate text file.
Assume that someone will argue with you, and get your evidence together. Find things that back
up your suspected fault, or if that is premature, that at least focus the problem into one of the
major categories.
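
The cut-n-paste step can be done with a few lines of Python when the logs are large. The sketch below pulls every line that contains a keyword, plus a little context around it, and keeps the original line numbers so anyone who argues with you can find the same spot. The function name extract_evidence and the sample lines are made up for the example.

```python
def extract_evidence(lines, keyword, context=2):
    """Return the lines containing keyword, with surrounding context.

    Each returned line is prefixed with its 1-based line number in the
    original log. The match is case-insensitive.
    """
    keyword = keyword.lower()
    keep = set()
    for index, line in enumerate(lines):
        if keyword in line.lower():
            start = max(0, index - context)
            stop = min(len(lines), index + context + 1)
            keep.update(range(start, stop))
    return [f"{i + 1}: {lines[i].rstrip()}" for i in sorted(keep)]


# Made-up log content, only to show the call.
log_lines = ["a", "b", "ERROR disk", "c", "d", "e"]
evidence = extract_evidence(log_lines, "error", context=1)
print("\n".join(evidence))
```

Writing "\n".join(evidence) to a separate text file gives you the evidence file described above, without editing the original log.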

Define area of possibilities – Most of the time the issue will fall into one of these categories.
Remember, there is a LOT of overlap here:
• Networking – communications between hosts, VirtualCenter, iSCSI, the rest of the
network, HA isolation, SSH failures, firewall rules
• Storage – disks, iSCSI, SAN, local storage, HBAs, LUNs
• Faults – Error messages that don’t fit cleanly in the above categories, PSODs, things
locking up, freezing
• Guest O/S – if it happens inside of the VM, it’s probably an OS issue
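
The overlap between categories is easy to see if you write the list down as data. The Python sketch below is a rough first pass only: the keyword sets are taken loosely from the list above and are an assumption, not an official classification. Notice that iSCSI sits in both Networking and Storage, so one problem description can land in more than one area.

```python
import re

# Assumed keyword sets, loosely based on the categories above.
AREA_KEYWORDS = {
    "Networking": {"network", "virtualcenter", "iscsi", "ha", "ssh", "firewall"},
    "Storage": {"disk", "disks", "iscsi", "san", "hba", "lun"},
    "Faults": {"psod", "freeze", "freezing", "lockup"},
    "Guest O/S": {"guest"},
}


def possible_areas(description):
    """Return every area whose keywords appear as whole words in the text."""
    words = set(re.findall(r"[a-z0-9]+", description.lower()))
    return [area for area, keywords in AREA_KEYWORDS.items() if words & keywords]


print(possible_areas("iSCSI LUN not visible after firewall change"))
print(possible_areas("host shows a PSOD"))
```

An empty result does not mean there is no problem; it means the description uses terms the keyword sets do not know, which is a good reason to ask the customer more questions.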

Confirm this theory – Take your best shot at matching this with Google, Bugzilla or Knova. Use
error messages as part of your search terms. If you can nail it exactly, GREAT! If not, you’ll learn
about similar issues and will get a better idea of the next group of possible areas.

Repeat the above steps using the additional information you have gathered. If you have no idea
where the problem might be, try using a “subtractive” process and eliminate one or more
areas. Don’t forget to factor in what the customer was doing when the problem arose. Most
issues have happened before and should have been documented, but perhaps using different
terms than the ones you are using. If all of the common possibilities have been eliminated, then start
looking at the uncommon possibilities. Or better, have someone else look at the information you
have gathered and make suggestions. But don’t walk them down the same path you took; you’ll
learn little new that way. Let them ask questions. You should have gathered the answers already.
If not, then you may have found a new path to follow.

Eventually you will have a “strong suspicion” of where the problem is. Reverse your thinking for a
bit and ask yourself “What would I expect to see if XXX happened?” What other symptoms,
errors, or log entries would you expect to find? Go see if they are present.
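
This reverse check is just a comparison of two lists: what you would expect to see if your suspicion is right, and what you have actually observed. A minimal Python sketch, with made-up symptom names and a hypothetical function name check_hypothesis:

```python
def check_hypothesis(expected, observed):
    """Compare expected symptoms of a suspected fault with what was observed."""
    expected = set(expected)
    observed = set(observed)
    return {
        # Expected and seen: supports the suspicion.
        "confirmed": sorted(expected & observed),
        # Expected but not seen yet: go look for these.
        "missing": sorted(expected - observed),
        # Seen but not explained by the suspicion: it may be incomplete or wrong.
        "unexplained": sorted(observed - expected),
    }


# Made-up example: suspicion is a failed NIC.
result = check_hypothesis(
    expected={"link down in log", "ping fails"},
    observed={"ping fails", "disk full"},
)
for key, items in result.items():
    print(key, items)
```

Anything left under "unexplained" matters for the EXPLAIN step: the fault you name should account for all of the symptoms you were presented with.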

Don’t kill yourself – Know when to stop working on an issue. If you really have followed the above
process and have reached a dead end, share the problem with a peer, or ask for a Research
Assist.
