One of the difficulties in the technology world is troubleshooting. With the variety of networks, custom and packaged software, PCs, servers, and standards in use, it is hard to figure out where a problem really lies. Sometimes it is difficult even to know where to start.
This article was originally written by Mark to address client/server issues, but with his permission I have updated it and generalized it to technology as a whole. The issue remains the same; only the technology and tools change.
Solving any technical problem is similar to following the scientific method: it is a process of developing a hypothesis, developing tests for that hypothesis, evaluating the results of the tests, and improving the hypothesis. This methodology will work whether the problem is an application ‘bug’, a network or system problem, or a combination of technical problems.
How many times have you seen a problem, thought that you understood it, ‘solved it’, only to find that the problem continued? In most cases, solving the actual problem once it is well understood is not difficult. What is difficult is identifying the cause of the problem. Really understanding the problem is the most important part of finding a solution. Once the problem has been identified and the cause is understood, simple trial and error can usually resolve it.
The first prerequisite to solving a problem is a separate test environment, which duplicates as closely as possible the production environment. If the problem cannot be reproduced in your test environment, then the test and production environments are not similar enough. Tests should not be run against the production environment. This is a difficult rule to live by, particularly for network or system management problems. However, it is better to start with this rule and have someone in a management capacity explain why you should risk your production environment further by running a series of tests to figure out what went wrong. Since many problems are related to data, actual live data should always be copied to the test environment.
The second prerequisite is adequate tools for testing. This may include application debugging tools, network analyzers, and access to database logs and tables. I once worked on a problem that we were convinced was a network problem until we used a LAN analyzer. It turned out that the application was trapping a database error message without displaying or logging it. Without the analyzer, we would have been hunting in the wrong area forever!
The primary steps in resolving a technical problem are:
- Identifying and prioritizing the problem;
- Developing a reproducible version of the problem;
- Creating tests that are able to identify the cause of the problem;
- Evaluating test results and defining the true cause of the problem;
- Developing a technical solution to the problem.
Each of these steps is discussed below.
Identifying and Prioritizing the Problem
The first step is to describe the problem clearly. The person assigned to find a solution must know all the variables that contributed to the problem. Possible variables such as time of day, functions in use, specific data in use, system configuration, et cetera, must be identified. The priority and severity of the problem must be defined relative to the other problems on the go. There is no point in working on an interesting problem that does not affect your end users instead of a boring but mission-critical one.
A key question in this stage is “What has changed since the last time this ran successfully?” This often provides an important clue in understanding the problem.
Once the problem is identified, it must be assigned to an individual for resolution. Unless a problem has a specific owner, the chances of resolution are low. The owner must be able to call on other skills required to assist in solving the problem, but the owner should remain with the problem throughout.
Often at this point, you may find that the problem is properly the responsibility of someone else. It can be extremely satisfying to call up a vendor or consultant and blame them for your problem, particularly if they can say, “Try this and it will work.”
Duplicate the Problem
To resolve a complex problem, you must be able to duplicate it. Based on the initial description, the tester must define the specific steps that recreate the problem at will.
If the problem is to be referred to a vendor, they usually require that you are able to duplicate it consistently before they will accept the problem. Otherwise, it is marked down as user error, or PEBKAC (Problem Exists Between Keyboard And Chair).
Vendors will often return the problem and say, “We cannot reproduce the problem.” If the problem is continuing to occur, you need to collect more information about the conditions that cause it before re-submitting it to, or escalating it with, the vendor. One of the difficulties in solving technical problems is that there are so many vendors in the typical configuration that pinning down the problem to one software or hardware vendor is extremely difficult.
Creating and Running Tests
Reproducing the problem limits its possible causes. Either the reproduction itself has identified the specific cause, or a series of tests is required to identify it.
Initial tests should focus on identifying the subsystem in which the problem occurs. For example, if reproduction of the problem requires five steps, at which step does the specific problem occur? Each test or series of tests should have a specific goal. At the beginning, the goal is to gather general information. However, the goal should quickly narrow to questions such as “Is this component responsible for the problem?” or “Does this combination of inputs produce the correct result?”
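As a rough illustration of finding the step at which a problem first appears, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `first_failing_step` helper, the three-step reproduction (rather than five), and the deliberate bug in `parse` are all hypothetical. The idea is simply to attach a check to each step and stop at the first check that fails.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Each step is (name, action, check): the action mutates shared state,
# and the check says whether the state still looks correct afterwards.
Step = Tuple[str, Callable[[Dict], None], Callable[[Dict], bool]]


def first_failing_step(steps: List[Step], state: Dict) -> Optional[str]:
    """Run the steps in order and return the name of the first one whose check fails."""
    for name, action, check in steps:
        action(state)
        if not check(state):
            return name
    return None  # every check passed


# Hypothetical reproduction with a deliberate bug in "parse":
# it silently drops values it cannot convert.
def read_input(state: Dict) -> None:
    state["raw"] = ["3", "5", "x"]


def parse(state: Dict) -> None:
    state["values"] = [int(v) for v in state["raw"] if v.isdigit()]


def total(state: Dict) -> None:
    state["total"] = sum(state["values"])


STEPS: List[Step] = [
    ("read_input", read_input, lambda s: len(s["raw"]) == 3),
    ("parse", parse, lambda s: len(s["values"]) == len(s["raw"])),
    ("total", total, lambda s: s["total"] >= 0),
]

culprit = first_failing_step(STEPS, {})
print(culprit)  # prints "parse"
```

In a real system the checks would be whatever evidence is available at each step, such as a log entry, a row count, or a response code.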
Tests should focus on:
- What, if anything, has changed since the last successful run?
- What is different between this case and a similar case that runs correctly?
- Changing one component at a time to isolate the problem (stepwise refinement).
- Keeping track of tests executed and results gathered.
I cannot overstate the importance of keeping track of the test results. It is incredibly frustrating to have to re-run a test, not to get additional information, but because you cannot recall the results. Keeping track of results does not just mean having them “somewhere on your desk.” It means actually grouping the results into information. That might mean stapling printouts together, writing notes down on each test, keeping a spreadsheet with results, or writing down results on a whiteboard. The important thing is that you can get at the results and review them.
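For those who prefer a file to a whiteboard, the following is one possible way to record tests in Python. The `RunRecord` fields and the sample entries are invented for illustration. Each record names the single component changed for that test, so the log doubles as a record of the one-change-at-a-time approach.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class RunRecord:
    test_id: int
    changed: str        # the single component changed for this test
    expected: str
    observed: str
    problem_seen: bool  # did the problem still occur?


def to_csv(records: List[RunRecord]) -> str:
    """Render the records as CSV text that can be saved or pasted into a spreadsheet."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(RunRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()


def suspects(records: List[RunRecord]) -> List[str]:
    """Components whose change made the problem disappear."""
    return [r.changed for r in records if not r.problem_seen]


# Hypothetical entries: a baseline run, then one change per test.
records = [
    RunRecord(1, "none (baseline)", "totals match", "totals differ", True),
    RunRecord(2, "network switch", "totals match", "totals differ", True),
    RunRecord(3, "database driver", "totals match", "totals match", False),
]

print(to_csv(records))
print(suspects(records))
```

A change that makes the problem disappear is only a suspect, not a proven cause, so it still deserves a follow-up test.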
If necessary, the database, data files, or other variables must be reset to the initial conditions between each test. This must be done in any case where the problem is data dependent or causes data corruption. To make the reset quick, keep a copy of the initial data available from which the environment can be rebuilt.
Testing techniques during this phase may vary, but include running the application in ‘debug mode’, adding trace statements into the source, and auditing log files and data changes at each step. It may also be necessary to remove components of the application or the environment by replacing them with ‘stubs’ or other simulators, where you can control the responses to the application under test exactly.
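When the test data lives in files, the reset can be as simple as restoring a directory from a saved copy. The sketch below assumes exactly that. The `reset_environment` helper, the directory names, and the `orders.csv` file are hypothetical, and a real database would need its own restore procedure instead.

```python
import shutil
import tempfile
from pathlib import Path


def reset_environment(snapshot: Path, workdir: Path) -> None:
    """Discard the working data and rebuild it from the saved snapshot."""
    if workdir.exists():
        shutil.rmtree(workdir)
    shutil.copytree(snapshot, workdir)


# Demonstration in a temporary directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "snapshot"
    workdir = Path(tmp) / "work"
    snapshot.mkdir()
    (snapshot / "orders.csv").write_text("id,qty\n1,10\n")

    reset_environment(snapshot, workdir)               # initial conditions
    (workdir / "orders.csv").write_text("corrupted")   # a test damages the data
    reset_environment(snapshot, workdir)               # back to initial conditions

    restored = (workdir / "orders.csv").read_text()

print(restored)
```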
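To show what a stub might look like, here is a small Python sketch. `StubDatabase` and `fetch_customer_name` are hypothetical names, and the application code is written to swallow database errors on purpose, much like the trapped error message described earlier. Because the stub's responses are scripted, the test can force an error at will and watch how the application reacts.

```python
from typing import Any, List, Optional, Sequence, Tuple


class StubDatabase:
    """Stand-in for a real database client that returns scripted responses."""

    def __init__(self, responses: Sequence[Any]) -> None:
        self._responses = list(responses)
        self.queries: List[Tuple[str, tuple]] = []  # record of every call made

    def execute(self, query: str, params: tuple = ()) -> Any:
        self.queries.append((query, params))
        response = self._responses.pop(0)
        if isinstance(response, Exception):
            raise response  # simulate a database error on demand
        return response


def fetch_customer_name(db: Any, customer_id: int) -> Optional[str]:
    """Hypothetical application code under test; it hides database errors."""
    try:
        rows = db.execute("SELECT name FROM customers WHERE id = ?", (customer_id,))
        return rows[0][0]
    except Exception:
        return None  # the error is trapped without being displayed or logged


healthy = StubDatabase([[("Ada",)]])
failing = StubDatabase([RuntimeError("simulated deadlock")])

print(fetch_customer_name(healthy, 1))
print(fetch_customer_name(failing, 1))
```

The second call returns nothing at all, even though the stub was asked exactly one query. That combination is the kind of evidence that points away from the network and toward the application's error handling.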
Evaluate Test Results
Following a cycle of tests, the results must be analyzed. Either the specific cause is isolated or additional tests are identified. The additional tests might be further refinement, different test data, or additional isolation of components. If the previous test did not cause the problem to happen, why not? If it did happen, why? What was different between this case and one that worked differently?
This step and the previous step are repeated in a cycle until the exact cause is identified.
Eventually, the cause of the problem is isolated. Then, a solution is developed.
Develop a Technical Solution
Often, at this point, you understand the problem well enough that the solution is obvious. However, if you don’t have a solution, you can now do traditional analysis, and resolve the problem by defining a program to ‘bridge the gap’ between the function that exists and the function that you need. This may mean several attempts to resolve the problem, testing the solution, and trying other options to resolve the problem.
Regardless of how simple or complex the solution is, it is important to retest before going back into production. When dealing with an application bug, it is particularly important to do some level of regression testing. (Regression testing is testing to ensure that the change made hasn’t affected the parts of the application that were working correctly before.) If the solution is a trivial fix, and if you understand the environment fully, I recommend two tests be run. The first is to test the specific change, and the second to test the ‘normal’ execution path before going into production. For all other situations, I recommend as much additional regression testing as you can afford. Automated testing tools can be a big help here.
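A minimal sketch of that testing pattern in Python follows. The `parse_quantity` function and its fix (accepting thousands separators such as "1,200") are invented for the example. The first test covers the specific change, the second covers the normal path, and the third is a small regression check on behaviour that already worked.

```python
def parse_quantity(text: str) -> int:
    """Hypothetical fixed function: the fix was to accept thousands separators."""
    return int(text.replace(",", ""))


def test_specific_change() -> None:
    # The input that used to fail before the fix.
    assert parse_quantity("1,200") == 1200


def test_normal_path() -> None:
    # The everyday case must still work.
    assert parse_quantity("12") == 12


def test_existing_behaviour() -> None:
    # Regression checks: behaviour that was correct before the fix.
    assert parse_quantity("-3") == -3
    try:
        parse_quantity("abc")
    except ValueError:
        pass  # invalid input is still rejected
    else:
        raise AssertionError("invalid input should raise ValueError")


def run_all() -> list:
    """Run every test and return the names of those that passed."""
    tests = [test_specific_change, test_normal_path, test_existing_behaviour]
    for test in tests:
        test()  # an AssertionError here stops the run
    return [test.__name__ for test in tests]


print(run_all())
```

A test runner such as pytest would discover the `test_` functions on its own; `run_all` is only there to keep the sketch self-contained.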
After the fix has been put into production, rerun the above tests in the production environment before releasing the system to users. If it doesn’t work as expected, fall back to the old setup. Nothing is worse for your users than replacing one problem with two others. Much better the devil you know.
If the problem has been solved by a vendor or other party, it remains your responsibility to re-test the application before putting the solution into production.
Once the problem is resolved, the changes to the application or environment must be documented. Otherwise, you’ve just made the next problem a little bit bigger.
To summarize, to solve a problem, you need a test environment, a person responsible for solving the problem, and a way to reproduce the problem at will. Run a series of tests to understand the problem, and keep track of your results. At the end of a given test, decide whether you can see a solution or whether additional tests are needed. Once you understand the cause of the real problem, finding a solution is usually not difficult. Before returning to production, test, test, test.
These simple steps can solve most problems. However, never underestimate the value of a little luck.
Originally written by Mark Dymond, IBM
I’m not sure I completely agree with the thrust of this post. I know you indicated that it was originally written from a client/server perspective and that’s probably why it doesn’t sit well with me from a “technology in general” perspective.
The premise is about doing things in test and while I don’t disagree with that, especially for software development, it isn’t useful for troubleshooting things that are infrastructure related. Again, I’m all for, and insist on testing things before putting them into production, but that is a different thing from coming in on some Monday morning and finding that for some reason, for example, none of your users can seem to see any of the file servers on your LAN. Nothing in a test environment would help you to troubleshoot that.
The thrust of this post is about troubleshooting things in application development and not necessarily troubleshooting IT in general.
Key things for troubleshooting in an IT environment, across all aspects of that environment such as corporate applications, infrastructure, end-user desktops, etc., would be things like:
a) have a good change management process and change management log. Your IT environment is a tightly inter-connected set of resources. Changes in one area could impact systems in another area. When a problem arises, being able to go back and see what changes have been made is always a good place to start.
b) good monitoring tools will often point to unseen problems that, left unchecked, can grow into bigger ones and leave you struggling to find the root cause of the issue.
c) focused single point of contact support. When a problem arises in your IT environment, someone needs to quarterback the troubleshooting process, otherwise you can’t control the things being tried to remedy the problem. A top-notch support system for logging trouble tickets and being able to comb a knowledge base of past remedies is essential.
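As a rough illustration of point a), here is how a change management log might be queried in Python when something breaks over a weekend. The `Change` record, the `changes_since` helper, and every entry and date in the sample log are made up for the example. A real log would live in whatever change management tool you use.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Set


@dataclass
class Change:
    when: datetime
    system: str
    description: str


def changes_since(
    log: List[Change], last_good: datetime, systems: Optional[Set[str]] = None
) -> List[Change]:
    """Return changes made after last_good, newest first, optionally limited to some systems."""
    hits = [
        c for c in log
        if c.when > last_good and (systems is None or c.system in systems)
    ]
    return sorted(hits, key=lambda c: c.when, reverse=True)


# Hypothetical log entries.
log = [
    Change(datetime(2024, 1, 2, 14, 0), "desktop", "rolled out new image"),
    Change(datetime(2024, 1, 5, 9, 0), "file server", "applied OS patch"),
    Change(datetime(2024, 1, 6, 22, 30), "network", "updated switch firmware"),
    Change(datetime(2024, 1, 7, 1, 15), "file server", "changed share permissions"),
]

# Everything worked on Friday evening; what changed before Monday morning?
last_good = datetime(2024, 1, 5, 18, 0)
recent = changes_since(log, last_good)
for change in recent:
    print(change.when, change.system, change.description)
```

Listing the newest changes first is a design choice: the most recent change is often, though not always, the best first suspect.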