The Feynman approach: Debugging an overheating oven

A Feynman approach to debugging
Lisa Simone 11/10/2004 3:43 PM EST

As a child, Richard Feynman could fix radios by thinking. You can take the same approach to debugging embedded systems.

As a child, Richard Feynman, scientist, author and Nobel laureate, was asked to fix a radio that emitted a terrible cacophony when it was first turned on. As he paced back and forth without touching the radio, trying to figure out what could cause such a loud noise, it dawned on him that the amplifier tubes were warming up at different rates before a radio signal was available. To avoid amplifying noise, he simply rearranged the tubes and the radio started up perfectly with no noise. The radio's owner was amazed that "he fixes radios by thinking!"

Debugging embedded systems effectively requires an analytical approach that generally isn't taught once a student has learned a new programming language. These skills are often more valuable than straight programming skills, especially when projects contain a significant amount of legacy code or the original author is no longer available for consultation. Acquiring these analytical skills comes from experience: learning to think logically about the symptoms and recognizing a bug that has bitten you previously. In this article, you'll be challenged to solve a real problem using some simple guidelines.

The scenario

You're an embedded systems debugging contractor, and you receive an e-mail describing your next assignment. You're to investigate a problem at the Industrial Enclosures Company (IEC), and the e-mail contains a brief overview with few concrete details:

"On Tuesday afternoon, a small oven on IEC's manufacturing line malfunctioned. Rather than heating components to a predefined temperature, it got too hot and damaged the components. The manufacturing line was stopped for the remainder of the day."

The e-mail further relays that the oven started to work correctly this morning, but management cannot accept any further downtime on the manufacturing line without compromising a delivery deadline to a customer. The production manager will be calling you with more details, and tomorrow morning you're expected to report to IEC.

The action plan

Whether you're familiar with the code or not, the steps to isolate and identify the bug are largely the same. The process is iterative: exploring different sources of information, generating (more) questions, summarizing what you know, and brainstorming possible root causes. Several common sources of information are:

- The problem report and background information on the system in question
- Whether the problem report is correct: how was this information collected?
- The person who reported the problem and other knowledgeable parties
- Observing the system (running correctly or reproducing the problem if possible)
- Understanding what is "normal" and what the critical system operational parameters are
- Understanding the algorithm or method used
- The software listing
- Debugging the system (running correctly or reproducing the problem if possible)

We'll use these sources as logical steps to guide the debugging process.

Background and report

Later on Wednesday afternoon you receive a call from Sophie, who introduces herself as the production manager of IEC. She gives you a brief overview of the company. "Industrial Enclosures manufactures custom plastic enclosures for different types of industrial equipment. We produce more than 10,000 enclosures of several different types each year. The manufacturing line that failed yesterday is used to assemble enclosures with air-tight compartments."

Sophie explains that a special material is used to construct these compartments. "The manufacturer of this component material has specified its temperature characteristics; I emailed the specifications to you. The operating range is between 117F and 165F, and the recommended nominal operating temperature is 126F. The manufacturer also warns that the structural properties of the material are compromised when it's heated over 189F." Sophie's email contains the manufacturer's specifications, as shown in Table 1.

Table 1: Temperature characteristics for component material

Temperature    Component limits
117F           Minimum temperature
126F           Nominal temperature
165F           Maximum temperature
189F           Material loses structural integrity
Sophie continues, "We use a small oven specially designed to heat the components to 126F so that they make a proper seal with the underlying structures. However, the oven didn't shut off when it reached 126F and it heated the components to the point that they were unusable. The oven has a microcontroller to control the temperature, but the engineer who designed the oven and the software is no longer with us. I'm counting on you to identify the cause of this failure." Sophie instructs you to meet with BJ when you arrive the next day. Your job is to find out what went wrong and to correct the problem before it happens again.

Before you go further, it's a good idea to understand the normal operation of the failing system and think about ways it could fail. This brainstorming technique can provide useful questions for your interviews. What is the system supposed to do? How might it have been designed? What hardware or software could be involved?

A basic system requires a heating element, a temperature sensor, and an algorithm to turn the heater on and off. If the measured temperature is below the operating point (which appears to be 126F), the heater is turned on. When the temperature reaches 126F the heater is turned off. If the temperature is above 126F, the heater does not turn on. The actual algorithm may be more complicated than this, but these are the basic required elements. With these system basics in mind, some obvious points of failure are listed in Table 2.

Table 2: Some common failure points for temperature-control systems

Element                          Possible failure
Heating element                  Element failed
                                 Element displaced
Temperature sensor               Sensor failed
                                 Sensor displaced
Temperature control algorithm    Set to wrong temperature control point/calibration
                                 Temperature-sensing hardware failed
Heater ON/OFF control            Heating element driver (solid state relay, for example) failed
                                 Heating logic signal incorrect
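The basic on/off behaviour described before Table 2 can be sketched in a few lines of C. This is only an illustration of the idea, not the oven's firmware: the names heater_should_be_on and SETPOINT_F are hypothetical, and working in degrees Fahrenheit rather than raw sensor readings is an assumption made for readability.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical set point: the nominal temperature from Table 1. */
#define SETPOINT_F 126

/* Compile-time check that the set point lies inside the operating
 * range quoted in Table 1 (117F to 165F). */
static_assert(SETPOINT_F >= 117 && SETPOINT_F <= 165,
              "set point must lie inside the material's operating range");

/*
 * Minimal on/off control decision.
 * Returns true only while the measured temperature is below the set
 * point; at or above the set point the heater stays off.
 */
bool heater_should_be_on(int measured_temp_f)
{
    return measured_temp_f < SETPOINT_F;
}
```

Called with 81 (room temperature) the function asks for heat; called with 126 or 147 it does not. Every row of Table 2 is a way for a real system to depart from this simple decision.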
Interview witnesses

Sophie told you all she knew about the problem, but without more information it's difficult to determine what really happened. BJ is your next source of information as the person operating the machine. On Thursday morning you introduce yourself to BJ on the manufacturing floor and then get to work by asking for all he knows about the machine and the failure.

BJ explains, "The oven normally takes a couple of minutes to heat the material to the proper temperature, and when the component is hot enough, the oven automatically turns off, and the component moves out of the heating area. I noticed that heating started to take a lot longer, almost five minutes. When I looked inside, the oven was very hot so I hit the emergency shutoff. When I picked up the component with tongs, the tongs made imprints in the material. It's not supposed to do that. It looks like the oven didn't turn off when it was supposed to. I have been running this machine for about nine months so I know what it's supposed to do." BJ hands you one of the overheated components from Tuesday. It looks slightly deformed when compared to a properly heated component.

After BJ has provided all the information he can think of, don't walk away! Keep asking questions to tease out more information. Is the problem reproducible? Does it happen all the time?

"Well, for about an hour on Tuesday afternoon the oven continued to fail, and we had to shut the line down for the day. Wednesday morning it started working again and worked without failure the entire day."

Listen to BJ's words and refine your questions. "BJ, you said that the oven 'continued to fail.' Does this mean it overheated the component every single time or just every now and then?"

BJ replies that after the first failure, the next two were also overheated. Since the components are not cheap, the line was shut down.

"BJ, did you do anything differently the next morning that might have fixed the problem?"

BJ replies that he just started the machine for a trial run and that it worked fine the first time, so they restarted production. It hasn't failed since. Continue to ask questions about the machine, including the failure points you identified earlier.
Even though the machine appears to be running correctly now, don't discount these common sources of failures.

"Is it possible the heating element failed or moved inside the oven?"
"No, the heater is solidly mounted and pretty rugged."

"What about the temperature sensor? They're generally more fragile. Could it have moved too far away and measured a lower temperature? That could cause the heater to overheat."
"I replaced the permanently mounted temperature sensor and the oven still overheated."

"Has the oven ever overheated before when it was first turned on?"
"Well, once when we had a problem calibrating the temperature sensor the oven did overheat that time. I checked the oven temperature this morning with a different temperature probe, and it's within one degree of normal."

"Is there a temperature set-point knob or any other adjustments on the machine that could have been moved? Any recent maintenance on the machine?"
"No maintenance lately. Here's the temperature set reading; it's set to 126F."

"BJ, I'm not sure what hardware controls the heater, but it's probably something like a solid-state relay. Could that be bad?"
"Well, it's possible, but it appears to be working now. If the problem happens again we can check it."

"Was any new software installed on the machine?"
"No."

"Did someone else operate the machine?"
"No."

Observe the system

Since the manufacturing line is now working, BJ offers to give you an example of normal operation. As he has described, the next component is inserted into the oven. After a short time the oven shuts off and the component is ejected from the oven. BJ picks up the heated component and inserts it into a larger assembly. As he works, you begin to review all that you have learned about the problem.

Analyze the symptoms

At this point, classify the symptoms. Experience can help you predict the root cause of a problem by correctly characterizing its symptoms. While space limits a full exploration into cause and effect, consider these categories and try to classify the symptoms you've documented.

- One-time repeatable events: These symptoms occur once, but have a pattern to their occurrence. They might occur only at power-up or the first time through a function or feature. Or, a function may work correctly the first time but fail all subsequent times.
- Periodic events: These symptoms occur several times in a somewhat repeatable manner.
- Sporadic events: These may happen once in a hundred tries, or so randomly that it's hard to relate the occurrence to the software.

How would you classify the oven symptoms? The oven failures are best described as sporadic events because the oven stopped working and then inexplicably resumed normal operation the next morning and hasn't failed since. Sporadic events are harder to find and fix because we design systems to behave in a repeatable manner. When they don't, some of our assumptions are generally incorrect. Some root causes of sporadic events are:

- Violation of boundary conditions in the software
- Unexpected input or output conditions (software, hardware, or material)
- Unhandled error conditions or faulty logic
- Logic based on time-of-day
- Memory corruption
- Performance issues
- Intermittent electrical or mechanical connections; impending component failure
With these types of root causes in mind, can you think of anything special about Tuesday's failure that might have caused the problem?

- The failure occurred in the afternoon; does the system have a real-time clock?
- Was a different material used for the components?
- If the software hit some kind of error condition that affected the heating algorithm, did turning the machine off and on again solve the problem?

As a result of this brainstorming, you have a few more questions for BJ. He tells you he power-cycled the machine and it didn't fix the problem. The components are all made of the same material.

Target your software search

We're not quite ready to jump into the code listings yet. Try to resist that urge just a bit longer! This up-front analysis can actually reduce the time you spend randomly digging through software because it enables us to perform a more directed and methodical search. Remember our initial brainstorming about how the system may have been designed? Think about the system elements again, focusing on less obvious causes of the failure.

- The A/D converter subsystem failed, causing inaccurate temperature readings
- The logic to determine if the oven has reached 126F is faulty
- The output control signal that turns the heater off failed
- Temperature or control variables were not initialized, or were possibly corrupted or overwritten
Searching the software for words like "temperature" and "oven," you find several references that appear relevant. These code fragments are shown online at ftp://ftp.embedded.com/pub/2004/12simone.

- periodic_timer()
- read_actual_temperature_A2D() and read_reference_temperature_A2D()
- calculate_new_oven_ON_time()
- oven_ON_time_control_routine()

You also learn that the processor is an 8-bit microcontroller, the program is written in C with some assembly code, it has a simple floating-point math function for small numbers, and it has one interrupt-service routine. We'll assume that system performance and resources aren't overloaded in any way.

Look at each of the routines and try to decipher what they do, assuming that the comments, function names, and variables are not misleading. The periodic_timer() function is a good place to start since it'll tell us what happens and how often it happens. A routine that controls the oven on-time is called every 10ms, and new temperature values are read every second and used to adjust the oven control.

Next, look at oven_ON_time_control_routine() to understand the actual control of the oven heater. A variable with the name of oven_pulse_width_counter is incremented each time this function is called, and it's reset to 0 after it reaches 100. The oven is turned on when the counter is less than ovenPW and is turned off all other times. Do you recognize that this control is pulse width modulation? The heating cycle is always 1s (100 function calls x 10ms/function call), and the longer the oven heater is on during this one second, the hotter the oven can get and the more rapidly it can heat up. The variable ovenPW is the duty cycle of this signal and also represents the percentage of time that the oven is on during each one-second interval. A sample picture of this signal is shown in Figure 1 where ovenPW is equal to 33 and the oven is on 33% of the time.
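The pulse-width scheme just described can be sketched as follows. The variable and routine names come from the article, but the bodies are reconstructed from the prose rather than copied from the published listing, so treat the details (the order of compare and increment, Heater_output_pin as a plain variable instead of a port bit) as assumptions. The helper count_on_ticks_in_one_cycle() is hypothetical and exists only so the duty cycle can be checked on a desktop machine.

```c
#include <assert.h>

/* Names follow the article; the bodies are reconstructed from its prose. */
unsigned char ovenPW = 33;                   /* duty cycle in percent, 0..100 */
unsigned char oven_pulse_width_counter = 0;  /* position within the 1 s cycle */
unsigned char Heater_output_pin = 0;         /* stand-in for the real port bit */

#define TICKS_PER_CYCLE 100                  /* 100 calls x 10 ms = 1 s */

/* Called every 10 ms by the periodic timer. */
void oven_ON_time_control_routine(void)
{
    /* The heater is on for the first ovenPW ticks of each cycle. */
    Heater_output_pin = (oven_pulse_width_counter < ovenPW) ? 1 : 0;

    oven_pulse_width_counter++;
    if (oven_pulse_width_counter == TICKS_PER_CYCLE) {
        oven_pulse_width_counter = 0;        /* start the next 1 s cycle */
    }
}

/* Hypothetical host-side helper: run one full cycle and count ON ticks. */
unsigned int count_on_ticks_in_one_cycle(void)
{
    unsigned int on_ticks = 0;

    assert(oven_pulse_width_counter == 0);   /* must start at a cycle boundary */
    for (int tick = 0; tick < TICKS_PER_CYCLE; tick++) {
        oven_ON_time_control_routine();
        on_ticks += Heater_output_pin;
    }
    return on_ticks;
}
```

With ovenPW set to 33, count_on_ticks_in_one_cycle() returns 33, which is 330ms of heating in every second. Setting ovenPW to 0 or 100 gives a heater that is always off or always on.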
Figure 1: Pulse width modulation of the oven on signal

Something should bother you about this function. Can oven_pulse_width_counter ever reach a value of 101? If it did, would the oven turn off properly? Checking counters this way is dangerous because any other function in the program could set this variable beyond 100. If that occurred, the heater wouldn't turn on again until oven_pulse_width_counter had been incremented all the way to 255 and then rolled over to 0. (How would you change the function to ensure that this never happened? A simple solution is to change the logic to check for values above as well as equal to the limit.)

How is the oven on-time calculated? The function calculate_new_oven_ON_time() contains a math calculation and some boundary checking. Combining the equations and substituting #defines, the ovenPW becomes:

ovenPW = 17 + 2.7 x (delta_temperature)

First, a delta-temperature value is computed using a reference temperature that might represent a calibration point or a nominal temperature, but we don't know yet. The delta value is then used to calculate ovenPW using a linear equation with a slope of 2.7 and an intercept of 17. After the ovenPW value is computed, it's limited to the range of 0 to 100. This boundary checking confirms our suspicion that the ovenPW value is a percentage; the oven on-times will range between 0% and 100% of each second.

The software should raise additional questions that could be useful to your investigation:

- The ovenPW increases if the actual temperature is greater than the reference temperature; is this logic backwards? Since we don't know how the temperature circuit was designed, we can't answer that question now, but keep it in mind for later.
- Does the intercept value represent the required ovenPW for the nominal temperature of 126F?

You might plot the equation to help visualize how the system works, as shown in Figure 2. The intercept occurs where delta_temperature is equal to 0.
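Written out with the limiting step included, the relationship described above is roughly the following. The symbols T_actual and T_ref are shorthand introduced here for the two readings, and the sign convention (actual minus reference) is taken from the observation that ovenPW rises when the actual reading exceeds the reference.

```latex
\Delta = T_{\text{actual}} - T_{\text{ref}},
\qquad
\text{ovenPW} = \min\!\bigl(100,\ \max\bigl(0,\ 17 + 2.7\,\Delta\bigr)\bigr)
```

For example, a delta of 0 gives 17%, a delta of 10 gives 17 + 27 = 44%, and a delta of 40 gives 125, which is limited to 100%.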
Figure 2: Linear equation relating delta temperature and ovenPW

Summarize targeted search

After reviewing your notes over lunch, you conclude that the oven is controlled by a digital (on/off) signal, and that the temperature is controlled using pulse-width modulation. The on-time is computed from the actual temperature and a reference temperature, and the nominal on pulse width is 17%, which most likely corresponds to the nominal temperature of 126F. You've also identified more questions.

Debug and observe

With your new understanding of the oven-heating algorithm, it's time to monitor the system in operation and test some of the hypotheses. Several techniques can give you visibility into the code as the system is running. Let's assume you have a debugger or monitor and can access the value of software variables in real time.

Choose to monitor variables that will allow you to answer your outstanding questions. Logical choices would be ovenPW, actual_temp_A2D_counts, and reference_temp_A2D_counts. These will allow you to verify proper operation of the A/D subsystem and proper calculation of the oven pulse width. You could also include Heater_output_pin to verify that the digital heater output signal is correct.

It's now well after lunchtime so you head back out to the factory floor to run some tests. Sophie has provided a laptop computer with a real-time monitor for the oven microcontroller. After you set up the debugging environment, BJ starts the line and the first component enters the oven. BJ has placed a temperature sensor on the component in the oven so you can record the actual temperature at the surface of the component. As the oven turns on, you begin recording data as shown in Table 3 and plotted in Figure 3.

Figure 3: Measured A/D values and resultant ovenPW

Table 3: Data collection trial 1

Measured values                                                                     Calculated values
actual_temp_    reference_temp_    Heater_         Independently measured          delta_            calculated
A2D_counts      A2D_counts         output_pin      temperature                     temperature       ovenPW
[A/D counts]    [A/D counts]       [1=ON, 0=OFF]   [F]                             [A/D counts]      [%]
85              70                 1               81                              15                57
83              70                 1               87                              13                52
81              70                 1               93                              11                46
79              70                 1               99                              9                 41
77              70                 1               105                             7                 35
75              70                 1               111                             5                 30
73              70                 1               117                             3                 25
71              70                 1               123                             1                 19
69              70                 0               129                             -1                17
69              70                 0               130                             -1                17
Immediately you learn several things from these data:

- The variable reference_temp_A2D_counts is always equal to 70. This corresponds to an ovenPW of 17 and a temperature of 126F, confirming that this value represents the nominal operating temperature of the material in raw A/D counts. This value is the temperature set point.
- You can also confirm that the raw A/D values do decrease for increasing temperature values, so that mystery is solved.
- The ovenPW starts out at a duty cycle of 57% (the oven is on for 570ms of each second). As the component heats up, the ovenPW falls to the nominal value of 17% and the heater shuts off.
- The first value of ovenPW is 57% and this corresponds to 81F. BJ agrees that the ambient temperature in the building is about 81F, so this boundary condition is consistent.
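The relationship between raw counts and temperature implied by Table 3 can be captured in a small helper. The factor of 3F per count is inferred from the table (each drop of 2 counts goes with a rise of 6F) and is not taken from a sensor datasheet; the function name, the constants, and the 8-bit range check are likewise assumptions made for illustration.

```c
#include <assert.h>

#define REFERENCE_COUNTS   70   /* set point in raw A/D counts (Table 3) */
#define REFERENCE_TEMP_F  126   /* nominal temperature (Table 1) */
#define DEG_F_PER_COUNT     3   /* inferred: 2 counts span 6F in Table 3 */

/*
 * Convert a raw A/D reading to degrees Fahrenheit.
 * Counts fall as temperature rises, so readings below the reference
 * map to temperatures above 126F.
 */
int counts_to_fahrenheit(int a2d_counts)
{
    assert(a2d_counts >= 0 && a2d_counts <= 255);  /* assuming an 8-bit reading */
    return REFERENCE_TEMP_F + DEG_F_PER_COUNT * (REFERENCE_COUNTS - a2d_counts);
}
```

counts_to_fahrenheit(85) returns 81, the ambient reading in the first row of Table 3, and counts_to_fahrenheit(70) returns the 126F set point.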
You can tentatively eliminate several possible root causes, such as A/D converter subsystem failure, wrong heater output signal, and bad up-to-temperature logic.

Debug by thinking

This experiment has confirmed your understanding of the system but you feel no closer to an answer than you did this morning. Review again the possible root causes for sporadic events. We haven't yet explored boundary conditions very well, or unexpected input conditions. Is it really possible for the oven to be turned on 100% of the time? Working the ovenPW math backwards, you determine that an initial temperature of 33F (101 A/D counts) corresponds to an ovenPW of 100%. So it is possible for the heater to turn on fully if the material starts around freezing, and maybe the oven somehow got stuck at 100%. You're skeptical but ask BJ anyway.

"Hey BJ, was the material really cold, below freezing, the afternoon the oven failed? If so, the oven will turn on 100%."

BJ chuckles, "Not likely on a summer afternoon! Sometimes we store the components in a storage area in the back and that area has no air conditioning, so they weren't frozen."

But being a persistent troubleshooter, you decide to slip a few components in the break room refrigerator to test out your theory later.

Reproduce the problem

Friday morning you grab the now-frozen components from the freezer and head back to the factory floor. "BJ, I know you said the components weren't frozen, but I just want to see what the oven does if we start with frozen components. It should turn the oven on full blast and then slowly turn it down as the temperature of the component starts to increase."

BJ is game and fires up the line with your first frozen component. As you expect, the oven is on at 100%. Solving the ovenPW equation for actual temperature, you find that the component was measured at 102 A/D counts or 30F. After running three frozen components, you verify that the ovenPW calculations are correct and conclude that cold components are not the culprits.

At this point in the scenario, you have enough information to solve this mystery. What do you think?
To better understand the type of ovenPW values you might see, you extend your data table to include ovenPW values from 0 to 100% and the corresponding temperature and A/D values as shown in Table 4.

Table 4: Valid temperature ranges and corresponding ovenPW values

Measured values                                                    Calculated values
actual_temp_    reference_temp_    Independently measured         delta_            calculated
A2D_counts      A2D_counts         temperature                    temperature       ovenPW
[A/D counts]    [A/D counts]       [F]                            [A/D counts]      [%]
111             70                 3                              41                100
109             70                 9                              39                100
107             70                 15                             37                100
105             70                 21                             35                100
103             70                 27                             33                100
101             70                 33                             31                100
99              70                 39                             29                95
97              70                 45                             27                89
95              70                 51                             25                84
93              70                 57                             23                79
91              70                 63                             21                73
89              70                 69                             19                68
87              70                 75                             17                62
85              70                 81                             15                57
83              70                 87                             13                52
81              70                 93                             11                46
79              70                 99                             9                 41
77              70                 105                            7                 35
75              70                 111                            5                 30
73              70                 117                            3                 25
71              70                 123                            1                 19
70              70                 126                            0                 17
68              70                 132                            -2                11
66              70                 138                            -4                6
64              70                 144                            -6                0
62              70                 150                            -8                0
60              70                 156                            -10               0
58              70                 162                            -12               0
57              70                 165                            -13               0
56              70                 168                            -14               0
54              70                 174                            -16               0
52              70                 180                            -18               0
50              70                 186                            -20               0
49              70                 189                            -21               0
47              70                 195                            -23               0
46              70                 198                            -24               0
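The two ends of Table 4, where ovenPW saturates at 100% and where it reaches 0%, follow from inverting the equation. The worked steps below assume the calibration implied by the tables (70 counts at 126F, 3F per count) and that the fractional part of the result is dropped, which is how the tabulated values appear to be formed; the symbol Delta stands for delta_temperature.

```latex
\begin{aligned}
\text{100\% boundary:}\quad & 17 + 2.7\,\Delta \ge 100
  \;\Rightarrow\; \Delta \ge \tfrac{83}{2.7} \approx 30.7
  \;\Rightarrow\; \Delta = 31 \\
& \text{counts} = 70 + 31 = 101, \qquad T = 126 - 3 \times 31 = 33\,^{\circ}\mathrm{F} \\[4pt]
\text{0\% boundary:}\quad & 17 + 2.7\,\Delta < 1
  \;\Rightarrow\; \Delta < -\tfrac{16}{2.7} \approx -5.9
  \;\Rightarrow\; \Delta = -6 \\
& \text{counts} = 70 - 6 = 64, \qquad T = 126 + 3 \times 6 = 144\,^{\circ}\mathrm{F}
\end{aligned}
```

So the heater should run flat out only for components at about 33F or colder, and should stay completely off for anything at about 144F or hotter.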
As you finish your lunch and print out a copy of the data, the phone rings. It's BJ. "Come down right away; the oven has failed again. It overheated a component and I've just turned it off. Hopefully you can figure out what is wrong with it now." You grab your printout and go.

When you reach the factory floor, you see a small crowd gathered around the machine inspecting the overheated component. It's too hot to touch and noticeably distorted; BJ moves it out of the way to cool. Time to get to work. Now you have the opportunity to collect data on an actual failure.

BJ restarts the line. When the next component arrives in the oven, the oven turns on. The debugger shows that the ovenPW immediately jumps to 100%! BJ watches the oven for about two minutes before he leans in to press the emergency stop button. "This is exactly what happened 10 minutes ago, and this is exactly what happened on Tuesday. Out of the blue the oven just goes crazy. What do you think could be wrong?"

You sigh, consult your temperature table, and scan the entries for an ovenPW of 100%. BJ leans over your shoulder to read the spreadsheet, "And those components sure aren't below freezing; we just got that new pallet of parts from the storage area, and like I told you, it doesn't have air conditioning."

"Fine, let's look at the data from this run." The data shows that the starting temperature of the components is 63 A/D counts, or 147F! "BJ, it says that this component came into the oven at 147F. It's already in the valid temperature range, but the heater is going full blast. My table here shows that the oven should be completely off in this situation."

BJ checks the temperature of the next component in the queue and whistles, "Yup, this one is almost 150 degrees too. The readings are right."

"Why are these components so hot? Were the components that failed on Tuesday afternoon this hot?" Look for similarities between today and Tuesday. You remember that BJ told you that the storage area had no air conditioning, but he didn't say how hot the room was.

BJ thinks, "Well, we had just received a new pallet of components out of the storage area after lunch, and they were the first to fail. The storage area is next to the boiler so it gets pretty hot in that area, but it's not hot enough to damage the parts. We usually pull the pallets out before lunch so we can start the line right away after the break is over, but sometimes we get behind like today and we don't get them pulled out until after lunch. We might have done the same on Tuesday, but do you think that matters? The temperature of the components is still within spec."

BJ is right; even though the components are hot, they're still within spec. However, the oven should not have turned on at all. You have identified an important clue linking today's failure to Tuesday's failure. The components came from a new pallet and were already hot enough to be used on the manufacturing line without the heating step.

You check the entire log of recorded data as shown in Table 5. All the values look valid except for the ovenPW value. Why was the ovenPW set at 100% when it should have been set to 0%?

Table 5: Data collection trial 2

Measured values                                                                     Calculated values
actual_temp_    reference_temp_    Heater_         Independently measured          delta_            calculated
A2D_counts      A2D_counts         output_pin      temperature                     temperature       ovenPW
[A/D counts]    [A/D counts]       [1=ON, 0=OFF]   [F]                             [A/D counts]      [%]
63              70                 1               147                             -7                100
61              70                 1               153                             -9                100
60              70                 1               156                             -10               100
59              70                 1               161                             -11               100
58              70                 1               162                             -12               100
Solution train of thought

Many developers have been bitten by this bug before. You may have anticipated the ending or you may be shaking your head and ready to jump to the solution. Resist the urge to peek, and review what you've learned about this problem.

- The problem has happened twice when the component is already hot, around 150F
- When the problem happens, the oven is immediately turned on 100%
- At 147F, actual_temp_A2D_counts = 63 and reference_temp_A2D_counts = 70
- The ovenPW is calculated as: ovenPW = 17 + 2.7 x delta_temperature

Nail the bug, then verify it

Since the ovenPW equation is suspect, check your numbers to see if they're calculated the same way as the embedded code calculates them. At 147F, you calculated ovenPW as:

ovenPW = 17 + 2.7 x (63 - 70) = -2, and -2 limited to [0, 100] is 0%

Unfortunately, the embedded code arrives at a different answer:

ovenPW = 17 + 2.7 x (63 - 70) = 253, and 253 limited to [0, 100] is 100%

Do you know why this is so? Each of the variables in the equation has been declared appropriately, but the intermediate calculation for ovenPW was not considered. When the negative result of the ovenPW calculation is stored in an unsigned char, the value is interpreted as 253. This is limited to 100 and the oven turns on 100%.

Just to make sure, you should verify your strong suspicion. You explain to BJ that if the component temperature is above 144F, the bug will happen, so you'd like to test several components around that starting temperature. BJ is happy to oblige with a propane torch and he "warms" several scrap components to your specifications. You test each in the oven and record the initial ovenPW value in Table 6. As you suspected, for starting temperatures greater than about 144F the value of ovenPW rolls under 0 and is limited to 100. Bug nailed.

Table 6: Verifying a suspicion

Actual temperature [F]    Displayed ovenPW    Test result
141                       3                   OK
144                       0                   OK
147                       100                 Rollunder!
153                       100                 Rollunder!
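The rollunder can be reproduced on a desktop compiler with a few lines of C. This is a hypothetical reconstruction of the faulty pattern, not the oven's source: the function name buggy_ovenPW and the use of an int intermediate are assumptions chosen so that every conversion is well defined in standard C. The exact wrapped value depends on how the firmware rounds and orders its arithmetic, which the article does not show; this sketch produces 255 at 147F where the article's log shows 253, but either value is above 100 and is therefore limited to 100%.

```c
#include <assert.h>
#include <limits.h>

static_assert(UCHAR_MAX == 255, "sketch assumes an 8-bit unsigned char");

/* Hypothetical reconstruction of the faulty pattern, not the original code. */
unsigned char buggy_ovenPW(unsigned char actual_counts,
                           unsigned char reference_counts)
{
    /* Both operands are promoted to int, so the difference may be negative. */
    int delta_temperature = actual_counts - reference_counts;

    /* Linear equation from the article; the fraction is dropped. */
    int intermediate = (int)(17 + 2.7 * delta_temperature);

    /* The damage: a negative int stored in an unsigned char wraps
     * modulo 256, so -1 becomes 255. */
    unsigned char ovenPW = (unsigned char)intermediate;

    /* A test such as (ovenPW < 0) can never be true for an unsigned
     * type, so only the upper limit has any effect. */
    if (ovenPW > 100) {
        ovenPW = 100;
    }
    return ovenPW;
}
```

Fed the readings that correspond to the four rows of Table 6 (65, 64, 63, and 61 counts against a reference of 70), the function returns 3, 0, 100, and 100.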
Propose a fix

Now that you've identified the problem, proposing a solution is straightforward. The ovenPW variable could be declared as a signed char instead of an unsigned char to prevent the casting problem, or the [0 to 100] limiting step can be performed before storing the resultant value into the ovenPW variable.

This entire problem turned out to be a simple and common programming mistake that might seem unworthy of the detailed scenario that you just read. We were all taught rules for variable declarations, but many times the results of these mistakes are difficult to trace back to the root cause.

Review your process

How well did this analytical iterative process help identify the root cause? Each time more information was learned, we raised questions and adjusted the possible root causes. Recall that the original problem report read, "the oven stopped shutting off when it reached 126F." This was an inaccurate description of the problem. Despite the misleading description, we identified the root cause through iterative brainstorming several times. Three of our ideas were actually true:

- time-of-day bug: the root cause wasn't related to a real-time clock but to the start of the afternoon shift
- unexpected input condition: failure occurred with a new pallet of material
- violation of boundary condition: the original programmer assumed intermediate calculations for ovenPW would never be negative

Looking back at our process we can see that we relied too heavily on the original problem report and didn't explore the brainstormed root causes more thoroughly. We didn't identify the root cause until the problem occurred again, and we overlooked bugs common for the processor type and coding language used. Often 8-bit micros have small memories without floating point libraries, and math operations must be carefully coded using integer math. If we'd stopped to think about the characteristics of the embedded system, we might have solved the problem without actually observing it.

This case was designed to help you test your analytical and problem-solving skills more than to remind you to declare your variables properly. Although simple mistakes like this are sometimes costly to find and fix, the bug in this scenario was a real bug that was identified and repaired before the product was shipped.

Lisa Simone is a senior research scientist in biomedical engineering at Kessler Medical Rehabilitation Research & Education Corp, and an assistant professor of physical medicine and rehabilitation at the University of Medicine and Dentistry of New Jersey-New Jersey Medical School. She received a PhD in biomedical engineering from Rutgers University and has designed embedded systems since 1990. You can reach her at lisa.simone.wt03@wharton.upenn.edu.

Reader Response

I really liked the article and the way it was presented. Well done Lisa. I passed the code through my normal Lint system and it caught the problem and pointed out a couple of others too. This is what it reported:

1. The comparison for ovenPW less than 0 at the end is unnecessary; since ovenPW is an unsigned number, it can't be less than 0. This is a clue to the error.
2. The loss of precision in the subtraction as the two unsigned numbers (reference_temp_A2D_counts and actual_temp_A2D_counts) are subtracted and the result placed in the signed number, delta_temperature. (Another clue.)
3. The loss of precision as the ovenPW is calculated from the delta temperature. This is where the problem manifested itself.

An unintended lesson of this: pay attention to what Lint is telling you.

Bob Bailey
Sr. Software Engineer

Thank you Lisa. Your article was a fascinating reading experience. I can only conclude: "Two sensories are always better than one."

Art Devine

Dr. Simone's paper is very thought-provoking, and is refreshingly easy to read. Her approach to solving a problem extends far beyond ovens, of course, and should give us all pause to reflect how we could better do these things. Hats off to her!!!

Thomas Mann

This article is definitely worthwhile to promote as a vivid C programming case or as a troubleshooting case. It is enlightening for the reader to think things through from a reverse engineering point of view.

Larry Luo

Out of sheer curiosity, I tried the attached source code with some C compilation tools I have. Some compilers (e.g. TI Code Composer C/C++ compiler for TMS320C54x) issued a warning like "pointless comparison of unsigned integer with zero" on the instruction "if (ovenPW < 0) ...", giving the programmer a chance to notice the bug described in this article. Some other ones (e.g. MS Visual C++ 6.0) didn't report any problems even with the strongest warning level set.

Kazimierz Koziarz
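The second fix proposed in the article, limiting the value before it is stored, might look like the sketch below. It is one possible form under stated assumptions, not the repair that was actually shipped: the function name fixed_ovenPW and the int intermediate are hypothetical. Because the lower-limit test now operates on a signed int, it is no longer the pointless comparison that the readers' tools flagged.

```c
#include <assert.h>

/* One possible form of the fix: limit first, store afterwards. */
unsigned char fixed_ovenPW(unsigned char actual_counts,
                           unsigned char reference_counts)
{
    /* Promoted to int, so a hot component yields a negative delta. */
    int delta_temperature = actual_counts - reference_counts;
    int intermediate = (int)(17 + 2.7 * delta_temperature);

    /* Both limits are meaningful because intermediate is signed. */
    if (intermediate < 0) {
        intermediate = 0;
    }
    if (intermediate > 100) {
        intermediate = 100;
    }

    /* The value is now guaranteed to fit, so the narrowing cannot wrap. */
    assert(intermediate >= 0 && intermediate <= 100);
    return (unsigned char)intermediate;
}
```

With this version a component arriving at 147F (63 counts) gets 0% and a frozen one (101 counts) still gets 100%. The other proposed fix, a signed char, has little headroom on an 8-bit char: the largest unlimited value in Table 4 is 17 + 2.7 x 41, about 127, which just fits, while a delta of 42 or more would not.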
