Troubleshooting is a systematic approach to solving a problem.
The goal is to determine why something does not work as expected and how to
resolve the problem.
The first step in the troubleshooting process is to describe the problem
completely. Without a problem description, neither you nor IBM® can know where
to start to find the cause of the problem. This step includes asking yourself
basic questions, such as:
- What are the symptoms of the problem?
- Where does the problem occur?
- When does the problem occur?
- Under which conditions does the problem occur?
- Can the problem be reproduced?
The answers to these questions typically lead to a good description of
the problem, and that is the best way to start down the path of problem resolution.
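The five basic questions can be kept together as a single record, so that it is easy to see which ones remain unanswered. The following is a minimal sketch only; the `ProblemDescription` class, its field names, and the sample answers are hypothetical and are not part of any product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProblemDescription:
    """Hypothetical record holding answers to the five basic questions."""
    symptoms: str = ""      # What are the symptoms of the problem?
    location: str = ""      # Where does the problem occur?
    timing: str = ""        # When does the problem occur?
    conditions: str = ""    # Under which conditions does the problem occur?
    reproducible: str = ""  # Can the problem be reproduced?

    def unanswered(self) -> List[str]:
        """Return the names of the questions that still have no answer."""
        return [name for name, value in vars(self).items() if not value.strip()]


# Sample answers are invented for illustration.
report = ProblemDescription(
    symptoms="Queries hang after login",
    timing="Weekdays around 09:00",
)
print(report.unanswered())
```

An empty list from `unanswered()` suggests that the description covers every basic question, although it says nothing about how complete each answer is.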
What are the symptoms of the problem?
When starting to describe a problem, the most obvious question is "What is
the problem?" This might seem like a straightforward question; however,
you can break it down into several more-focused questions that create a more
descriptive picture of the problem. These questions can include:
- Who, or what, is reporting the problem?
- What are the error codes and messages?
- How does the system fail? For example, is it a loop, hang, crash, performance
degradation, or incorrect result?
- What is the business impact of the problem?
Where does the problem occur?
Determining where the problem originates is not always easy, but it is one of the most
important steps in resolving a problem. Many layers of technology can exist
between the reporting and failing components. Networks, disks, and drivers
are only a few components to be considered when you are investigating problems.
The following questions can help you to focus on where the problem occurs in order
to isolate the problem layer:
- Is the problem specific to one platform or operating system, or is it
common across multiple platforms or operating systems?
- Are the current environment and configuration supported?
- Is the application running locally on the database server or on a remote
server?
- Is a gateway involved?
- Does the database reside on individual disks, or on a RAID disk array?
Remember that even if one layer reports the problem, the problem
does not necessarily originate in that layer. Part of identifying
where a problem originates is understanding the environment in which it exists.
Take some time to completely describe the problem environment, including the
operating system, its version, all corresponding software and versions, and
hardware information. Confirm that you are running within an environment that
is a supported configuration; many problems can be traced back to incompatible
levels of software that are not intended to run together or have not been
fully tested together.
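Part of the environment description can be collected automatically. The sketch below uses only the Python standard library `platform` module and records the local operating system and runtime; the `describe_environment` function name and the dictionary keys are illustrative assumptions. It does not capture product versions, hardware details, or remote systems, which still need to be recorded separately.

```python
import platform


def describe_environment() -> dict:
    """Collect basic facts about the local operating system and runtime."""
    return {
        "operating_system": platform.system(),   # for example "Linux" or "Windows"
        "os_release": platform.release(),
        "os_version": platform.version(),
        "machine": platform.machine(),           # hardware architecture
        "python_version": platform.python_version(),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Saving this output with the problem description makes it easier to compare the environment against the list of supported configurations.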
When does the problem occur?
Develop a detailed timeline of events leading up to a failure, especially for those
cases that are one-time occurrences. You can most easily do this by working
backward: Start at the time an error was reported (as precisely as possible,
even down to the millisecond), and work backward through the available logs
and information. Typically, you need to look only as far as the first suspicious
event that you find in a diagnostic log; however, this is not always easy
to do and takes practice. Knowing when to stop looking is especially difficult
when multiple layers of technology are involved, and when each has its own
diagnostic information.
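Working backward from the reported time can be sketched in a few lines. The example assumes that log entries are already parsed into `(timestamp, level, message)` tuples and that "suspicious" simply means a level of WARNING or ERROR; both are simplifying assumptions, and the log entries shown are invented for illustration.

```python
from datetime import datetime

# Hypothetical, already-parsed log entries: (timestamp, level, message).
LOG = [
    (datetime(2024, 1, 15, 8, 58, 12, 250000), "INFO", "Connection pool started"),
    (datetime(2024, 1, 15, 8, 59, 40, 100000), "WARNING", "Disk latency above threshold"),
    (datetime(2024, 1, 15, 9, 0, 1, 730000), "INFO", "Client connected"),
    (datetime(2024, 1, 15, 9, 0, 5, 415000), "ERROR", "Query timed out"),
    (datetime(2024, 1, 15, 9, 1, 0, 0), "INFO", "Client disconnected"),
]

# Assumption: these levels are what counts as a suspicious event.
SUSPICIOUS_LEVELS = {"WARNING", "ERROR"}


def first_suspicious_before(entries, reported_at):
    """Scan backward from reported_at; return the first suspicious entry found, or None."""
    for timestamp, level, message in sorted(entries, key=lambda e: e[0], reverse=True):
        if timestamp >= reported_at:
            continue  # skip the reported error itself and anything later
        if level in SUSPICIOUS_LEVELS:
            return timestamp, level, message
    return None


# Time at which the error was reported, down to the millisecond.
reported_at = datetime(2024, 1, 15, 9, 0, 5, 415000)
print(first_suspicious_before(LOG, reported_at))
```

In practice, each layer of technology has its own log format, so entries from several sources usually need to be parsed and merged by timestamp before a scan like this is useful.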
To develop a detailed timeline of events, try to answer these questions:
- Does the problem happen only at a certain time of day or night?
- How often does the problem happen?
- What sequence of events leads up to the time that the problem is reported?
- Does the problem happen after an environment change, such as upgrading
or installing software or hardware?
- Product-specific questions.
Answering questions like these can give you
a frame of reference in which to investigate the problem.
Under which conditions does the problem occur?
Knowing what other systems and applications are running at the time that a problem
occurs is an important part of troubleshooting. These and other questions
about your environment can help you to identify the root cause of the problem:
- Does the problem always occur when the same task is being performed?
- Does a certain sequence of events need to occur for the problem to surface?
- Do any other applications fail at the same time?
- Product-specific questions.
Answering these types of questions can help you explain the environment
in which the problem occurs and correlate any dependencies. Remember that
problems that occur around the same time are not necessarily related.
Can the problem be reproduced?
From a troubleshooting standpoint, the "ideal" problem is one that can
be reproduced. Typically with problems that can be reproduced, you have a
larger set of tools or procedures at your disposal to help you investigate.
Consequently, problems that you can reproduce are often easier to debug and
solve. However, problems that you can reproduce can have a disadvantage: if
the problem has a significant business impact, you do not want it to recur.
If possible, re-create the problem in a test or development environment, which
typically offers you more flexibility and control during your investigation.
- Can the problem be re-created on a test machine?
- Are multiple users or applications encountering the same type of problem?
- Can the problem be re-created by running a single command, a set of commands,
a particular application, or a stand-alone application?
- Product-specific questions.
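When a problem can be re-created by running a single command, a small harness can show how reliably it reproduces. The following is a minimal sketch, intended for a test or development environment only. The `reproduction_rate` function is hypothetical, and the failing command here is a stand-in Python one-liner rather than a real product command.

```python
import subprocess
import sys


def reproduction_rate(command, attempts=5, timeout=30):
    """Run command repeatedly and return the fraction of runs that failed."""
    failures = 0
    for _ in range(attempts):
        try:
            result = subprocess.run(command, capture_output=True, timeout=timeout)
            failed = result.returncode != 0
        except subprocess.TimeoutExpired:
            failed = True  # treat a hang as a failure
        failures += failed
    return failures / attempts


# Stand-in for the failing command: a one-liner that always exits with status 1.
always_fails = [sys.executable, "-c", "import sys; sys.exit(1)"]
print(reproduction_rate(always_fails, attempts=3))
```

A rate of 1.0 suggests a problem that reproduces every time, whereas a rate between 0 and 1 points to an intermittent problem that might depend on timing or on other conditions in the environment.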