
White Paper

Debugging Machine Check Exceptions on Embedded IA Platforms

Ashley Montgomery
Platform Application Engineer
Intel Corporation

Tian Tian
Platform Application Engineer
Intel Corporation

July 2010

324077-001

Executive Summary
Embedded systems need to be able to detect, recover from, and report
errors. This is a critical feature not only during debugging but also for
quality control after product manufacturing has begun. The importance of
advanced error handling capabilities is often magnified for embedded
systems because many are deployed in large numbers, dispersed widely,
and running mission-critical applications. Further, embedded systems
present a unique challenge due to their diverse form factors, vastly
different feature sets, and special usage models.

IA processors include Machine Check Architecture, which has built-in
capability to detect, report, and attempt recovery from the system errors
the CPU observes. As IA becomes increasingly popular in the embedded
space, the value and significance of machine check exceptions grow.
Embedded products often run critical applications non-stop for extended
periods of time, where unexpected system resets may have significant
impact. Many times, machine check exceptions are the only available clue
that the customer has during system failures, and they provide a starting
point for debugging.

Diagnosing the cause of machine check exceptions can be challenging and
time consuming. They are often difficult to reproduce in a timely manner.
There are also many potential suspects involved, such as platform design
issues, the CPU operating out of specification, overloading of power
supplies, software applications, and BIOS.

Machine check architecture has been available in IA processors since the
Pentium days to report and record errors in the system as observed by
the CPU. When the CPU detects critical machine check exceptions and the
errors are not correctable, the CPU will reset the system to prevent error
situations from getting worse. The MCE registers capture some of the
error information as seen by the CPU at the point of failure, which can be
important in getting to the root cause of the error.

This application note provides recommendations on how to debug machine
check exceptions on embedded IA platforms. It also goes over the Machine
Check Architecture and uses the Intel® Core™ Duo processor and Intel®
Core™2 Duo processor as examples. However, the information and
methodology are generic to all newer IA processors.

This document presents a step-by-step approach to debugging machine
check exceptions, understanding their causes, and reaching timely
resolution of the errors on embedded IA platforms.

The Intel® Embedded Design Center provides qualified developers with
web-based access to technical resources. Access Intel Confidential design
materials, step-by-step guidance, application reference solutions,
training, Intel’s tool loaner program, and connect with an e-help desk
and the embedded community. Design Fast. Design Smart. Get started
today. www.intel.com/embedded/edc.


Contents
Machine Check Architecture
Machine Check Exceptions in Embedded Applications
    Potential Causes for Machine Check Exceptions
    Impact of MCEs to embedded customers
        Examples of MCEs in Embedded Applications
Machine Check Architecture elements
    MCA Capability Register
        IA32_MCG_CTL MSR
    MCA Global Status Register
    MCE Bank Registers
        MCE Bank Status Registers
        MCE Bank Control Registers
        MCE Bank Address Registers
        MCE Bank MISC Registers
    MCE Coding Tables
    Multi-core implications
Debug Process Flow
    Confirm platform is operating within specification
    Gather MCE error code displayed by OS
    Identify error frequency
    Confirm whether the MCE is reproducible on the same platform
    Collect as much information as possible about configuration
    Decode and understand MCE message
    Make sure latest MCU and BIOS updates are in place
    Research CPU Spec Update for relevant CPU errata
    Try to reproduce on Customer Reference Board
Debug Checklist
    Related documents
Summary


Machine Check Architecture


IA Machine Check Architecture is an evolving technology that adds new
features and enhancements with each new processor generation. The common
types of errors that are detectable by the CPU include ECC errors, cache
errors, system bus errors, and parity errors. As processors become more
integrated with new additions of memory and fast I/O, the types of errors
the MCA can cover become more diverse and the feature sets get even broader.

By using this architecture, the CPU can be configured to generate machine
check exceptions (MCEs). Some MCEs are correctable, which means that the
hardware can recover from the error and correct it without being reset.
In earlier generations of IA processors, correctable MCEs do not generate
any interrupts. Beginning with the 45nm Intel 64 processor with CPUID
signature 06H_1AH, the processor is able to report pertinent information
related to corrected machine check errors and send a programmable
interrupt so that software can respond to them. This is known as the
corrected machine check error interrupt (CMCI). This option is available
in case users want to take action on correctable MCEs.

Some MCEs are uncorrectable and the system will need to reset to recover
itself. In this situation the CPU has concluded that the system is no longer in
a safe or reliable operating mode, or the cost of trying to recover from the
error (either by hardware or software) is prohibitive.

The machine check architecture consists of a set of model-specific registers
(MSRs) that set up the machine checking as well as additional banks of MSRs
used for the recording of detected hardware errors.

Machine check architecture communicates critical hardware errors to the
software and, where possible, helps the system recover from catastrophic
failures. This architecture provides error handling features that
contribute to high processor reliability, reliable error containment and
recording, serviceability, and error correction without program
interruption.

Machine Check Exceptions in Embedded Applications
Embedded IA products bring a new set of challenges to the process of
analyzing and debugging machine check exceptions. These products are often
deployed in large volumes in the field, and it may be hard to retrieve a
faulty unit in a timely fashion in order to debug it. The parts may be
operating in all kinds of environments, including extreme temperatures or
high altitudes, which adds to the complexity of troubleshooting and the
task of eliminating suspects.

Embedded designs tend to have diverse characteristics compared to
desktop/laptop designs. They often run non-standard applications, have
unique form factors, and have longer life cycles. Each product often comes
with its own unique set of components (for example, a battery-less design,
or a design that lacks components such as graphics or video), features,
customized BIOS, and OS (an embedded real-time OS such as VxWorks, or an
in-house solution). As a result, there may not be a standard reference
platform to compare with. Each machine check exception situation may need
to be treated quite differently throughout the debug process.

Potential Causes for Machine Check Exceptions


MCEs are difficult to debug mostly because of the large number of potential
causes. Some of the potential factors are:

• Violations of board design guidelines. For example, routing traces over
power and ground planes may cause unwanted noise, and inadequate
signal spacing may cause signal integrity issues.
• Operating the processor out of specification. Examples include over-
clocking of the CPU and front-side bus speeds. The behavior of the system
cannot be predicted when the processor operates out of specification.
• Environmental factors, such as alpha particles or cosmic ray hits, and
extremely hot or cold temperatures.
• Improperly fitted heat sinks or fans and incorrect hardware installation.
• Missing microcode updates that could contain fixes for known
processor errata.
• BIOS setup issues or OS issues that may cause the MCE handling scheme to
behave differently.
• Faulty components, such as add-in cards and DIMMs, which can also cause
system errors that may eventually lead to an MCE.

Impact of MCEs to embedded customers


The impact of MCEs can be quite diverse, but it is especially significant
when critical applications are running that require extreme accuracy and
reliability and are time sensitive. Some examples of embedded market
segments where MCEs would be of high impact are industrial controls,
financial markets, medical products, aviation, and defense, to name a few.
Many of these segments utilize real-time operating systems for their
applications, which may not have the same flexibility to recover from
critical hardware errors as systems in the mobile and desktop markets.
Several of the embedded applications are also required to operate non-stop
for 7-10 years with extremely low error rates.

Examples of MCEs in Embedded Applications


1. Product X is a customized, small form-factor motherboard. However, the
design routes FSB data lines too close to and through ground plane voids
near the board’s processor heat sink mounting holes. This FSB layout
causes FSB data parity errors and results in an MCE event.
2. Product Y receives an MCE event when running unique application
software. The customer uses a custom PCIe NIC due to design constraints.
During signal integrity analysis by this customer, the PCIe eye diagram is
observed to be outside specification. The MCE event is root caused to be
related to the custom PCIe NIC card.
3. Product Z is experiencing sporadic MCE events on different systems in the
field during the period of a year that are not reproducible. The customer is
using a customized RTOS and their own BIOS that is unable to be
updated as the systems have been deployed and are currently in use by
end customers. This MCE event would be very difficult to debug without a
reproducible failing system.

Machine Check Architecture elements


This section summarizes the key elements provided by IA Machine Check
Architecture. For further details on the MCA and MCE registers, refer to Ref
[1].

MCA Capability Register


IA32_MCG_CAP MSR is a read-only register that provides information on the
MCA of the processor.

Table 1. IA32_MCG_CAP register

Some of the bits may not be available in some older generation IA
processors. For detailed definitions, refer to Ref [1]. These register fields can
tell the OS and MCE handler what capabilities this particular processor has in
terms of the MCA.
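
As an illustration of how a handler or debug tool might interpret this
register, the following minimal Python sketch splits a raw IA32_MCG_CAP
value into fields. The bit positions are those documented for IA32_MCG_CAP
in Ref [1] and should be checked against the manual revision for the target
processor; the function name and the sample value are hypothetical and are
not taken from this paper.

```python
# Hedged sketch: field positions follow the IA32_MCG_CAP layout in Ref [1].
# Older processors may not implement every flag shown here.

def decode_mcg_cap(value):
    """Split a raw 64-bit IA32_MCG_CAP value into its capability fields."""
    return {
        "bank_count": value & 0xFF,               # bits 7:0, number of error-reporting banks
        "mcg_ctl_p": bool((value >> 8) & 1),      # bit 8, IA32_MCG_CTL is present
        "mcg_ext_p": bool((value >> 9) & 1),      # bit 9, extended state MSRs are present
        "mcg_cmci_p": bool((value >> 10) & 1),    # bit 10, CMCI is supported
        "mcg_tes_p": bool((value >> 11) & 1),     # bit 11, threshold-based error status
        "mcg_ext_cnt": (value >> 16) & 0xFF,      # bits 23:16, number of extended state MSRs
        "mcg_ser_p": bool((value >> 24) & 1),     # bit 24, software error recovery
    }


if __name__ == "__main__":
    # Illustrative value only: 6 banks with IA32_MCG_CTL present.
    for name, field in decode_mcg_cap(0x0106).items():
        print(f"{name}: {field}")
```

A tool would normally read the bank count from this register first and use
it to bound any loop over the MCE bank registers.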

IA32_MCG_CTL MSR
It is important to determine if the machine check features are enabled in
order for MCEs to be captured. The IA32_MCG_CTL MSR controls the reporting
of machine check exceptions. It is present if the capability flag
MCG_CTL_P is set in the IA32_MCG_CAP MSR. If present, writing all 1s to
this register enables MCE features and writing all 0s disables MCE
features. Refer to Ref [1] for more information.

MCA Global Status Register


This register provides information on the current state of the processor
after an MCE has occurred. It indicates whether the instruction pointer
pushed on the stack is directly related to the MCE, and whether CPU
operation can restart from the instruction that was pushed on the stack when
the MCE was generated.

Table 2 IA32_MCG_STATUS
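
The flags described above can be tested as in the following hedged Python
sketch. The flag names and bit positions (RIPV, EIPV, MCIP) are those given
for IA32_MCG_STATUS in Ref [1]; the helper names are hypothetical.

```python
# Hedged sketch: low-order flags of IA32_MCG_STATUS as documented in Ref [1].

def decode_mcg_status(value):
    """Return the three architectural flags of IA32_MCG_STATUS."""
    return {
        "RIPV": bool(value & 0x1),  # bit 0: execution can restart at the pushed instruction pointer
        "EIPV": bool(value & 0x2),  # bit 1: the pushed instruction pointer is related to the error
        "MCIP": bool(value & 0x4),  # bit 2: a machine check is in progress
    }


def can_restart(value):
    """True when the saved instruction pointer is valid for restarting execution."""
    return decode_mcg_status(value)["RIPV"]


if __name__ == "__main__":
    sample = 0x5  # illustrative value: MCIP and RIPV set, EIPV clear
    print(decode_mcg_status(sample), can_restart(sample))
```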

MCE Bank Registers


Finer-grained control and reporting of MCEs is provided by the MCE Bank
Registers. Each error-reporting register bank can contain IA32_MCi_CTL,
IA32_MCi_STATUS, IA32_MCi_ADDR, and IA32_MCi_MISC MSRs. Each bank
usually has a special focus area in terms of the types of errors it covers. The
number of banks depends on the specific processor family.
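
Because the banks are laid out as repeating groups of four MSRs, the
address of any bank register can be computed rather than looked up. The
sketch below assumes the architectural layout described in Ref [1], in
which IA32_MC0_CTL is at MSR address 400H and each bank occupies four
consecutive addresses; some older processor families have model-specific
exceptions, so treat this as a starting point. The helper name is
hypothetical.

```python
# Hedged sketch: architectural MSR addresses for MCE bank registers (Ref [1]).

IA32_MC0_CTL = 0x400
BANK_REGISTERS = ("CTL", "STATUS", "ADDR", "MISC")  # order within each bank


def bank_msr_address(bank, register):
    """Return the MSR address of IA32_MC<bank>_<register>."""
    if bank < 0:
        raise ValueError("bank index must be non-negative")
    return IA32_MC0_CTL + 4 * bank + BANK_REGISTERS.index(register)


if __name__ == "__main__":
    for reg in BANK_REGISTERS:
        print(f"IA32_MC1_{reg}: {bank_msr_address(1, reg):#x}")
```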

MCE Bank Status Registers


Each IA32_MCi_STATUS MSR contains information related to a machine check
exception if its VAL (valid) flag is set (see Table 3). Software is responsible
for clearing IA32_MCi_STATUS MSRs by explicitly writing 0s to them; writing
1s to them causes a general-protection exception.

Table 3 IA32_MCi_STATUS register

A more detailed description of the MCE Status Registers can be found in the
Machine-Check MSRs section in Ref [1].
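
As a rough illustration of the register layout, the following Python sketch
extracts the main flags and code fields from a raw IA32_MCi_STATUS value.
The bit positions are those given in Ref [1]; the sample value is the
MC1_STATUS value used in the decoding example later in this paper, and the
function name is hypothetical.

```python
# Hedged sketch: top-level fields of IA32_MCi_STATUS as documented in Ref [1].

STATUS_FLAGS = {
    "VAL": 63,    # register contains valid error information
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected
    "EN": 60,     # error reporting was enabled in IA32_MCi_CTL
    "MISCV": 59,  # IA32_MCi_MISC holds valid data
    "ADDRV": 58,  # IA32_MCi_ADDR holds a valid address
    "PCC": 57,    # processor context may be corrupted
}


def decode_mci_status(value):
    """Return the flags plus the model-specific and MCA error code fields."""
    fields = {name: bool((value >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    fields["model_specific_code"] = (value >> 16) & 0xFFFF  # bits 31:16
    fields["mca_code"] = value & 0xFFFF                     # bits 15:0
    return fields


if __name__ == "__main__":
    for name, field in decode_mci_status(0xF200000000020151).items():
        print(f"{name}: {field}")
```

Checking VAL first avoids interpreting stale contents, and MISCV and ADDRV
indicate whether the MISC and ADDR registers of the same bank are worth
reading.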

MCE Bank Control Registers


IA32_MCi_CTL MSRs control error reporting for errors produced by a
particular hardware unit. Each of the 64 flags represents a potential error. If
the bit is implemented on the processor, setting the bit enables the reporting.

Table 4 IA32_MCi_CTL registers

MCE Bank Address Registers


The IA32_MCi_ADDR MSR contains the address of the code or data memory
location that produced the machine check exception if the ADDRV flag in the
IA32_MCi_STATUS register is set.

MCE Bank MISC Registers


The IA32_MCi_MISC MSR contains additional information describing the
machine check exception if the MISCV flag in the IA32_MCi_STATUS register
is set. For detailed register definition information when MISCV is valid, refer
to Ref [1] and [2].

MCE Coding Tables


To determine the type of error being reported the machine check exception
handler must read from the MCA error code field [15:0] of the
IA32_MCi_STATUS register. There are two types of MCA error codes: simple
error codes and compound error codes.

Table 5 shows the simple error codes. These codes indicate global error
information.

Table 5 IA32_MCi_Status [15:0] Simple Error Code Encoding

Notes:
1. BINIT# assertion will cause a machine check exception if the processor (or any processor on
the same external bus) has BINIT# observation enabled during power-on configuration
(hardware strapping) and if machine check exceptions are enabled (by setting CR4.MCE = 1).

2. At least one X must equal one. Internal unclassified errors have not been classified.
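
A lookup for the simple error codes might be sketched as follows. The code
values are recalled from the simple error code table in Ref [1] and should
be verified against Table 5 for the processor in question; only entries
consistent with the notes above are included, and the function name is
hypothetical. Any value that is not recognized is left for the compound
decoding described next.

```python
# Hedged sketch: simple MCA error codes (bits 15:0 of IA32_MCi_STATUS).
# Values are assumptions recalled from Ref [1]; verify before relying on them.

SIMPLE_CODES = {
    0x0000: "No Error",
    0x0001: "Unclassified",
    0x0002: "Microcode ROM Parity Error",
    0x0003: "External Error",  # for example, BINIT# assertion (note 1)
    0x0004: "FRC Error",
}


def simple_error_name(code):
    """Return the simple error name, or None if the code is not a simple code."""
    code &= 0xFFFF
    if code in SIMPLE_CODES:
        return SIMPLE_CODES[code]
    # Internal unclassified: 0000 01xx xxxx xxxx with at least one x set (note 2).
    if (code & 0xFC00) == 0x0400 and (code & 0x03FF) != 0:
        return "Internal Unclassified"
    return None


if __name__ == "__main__":
    for sample in (0x0000, 0x0003, 0x0401, 0x0151):
        print(f"{sample:#06x}: {simple_error_name(sample)}")
```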

Table 6 shows the general form of the compound error codes related to the
TLBs, memory, caches, bus and interconnect logic, and internal timer. These
compound errors also consist of sub-fields that describe the type of access,
level in the cache, and type of request.

Table 6 IA32_MCi_Status [15:0] Compound Error Code Encoding

The “Interpretation” column indicates the name of a compound error, which is
constructed by substituting mnemonics for the sub-field names in the curly
braces.
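
Identifying which compound form a code belongs to is a matter of matching
its fixed bits. The Python sketch below does this with mask/match pairs.
The bit patterns are those recalled from the compound error code table in
Ref [1], with bit 12 (the F bit) ignored, and should be confirmed against
Table 6; the names in the list and the function name are chosen for
illustration.

```python
# Hedged sketch: classify a compound MCA error code by its fixed bits.
# Patterns are assumptions based on Ref [1]; bit 12 (F) is masked out.

COMPOUND_FORMS = [
    # (mask, match, description)              bit pattern
    (0xEFFC, 0x000C, "Generic cache hierarchy"),  # 000F 0000 0000 11LL
    (0xEFF0, 0x0010, "TLB"),                      # 000F 0000 0001 TTLL
    (0xEF80, 0x0080, "Memory controller"),        # 000F 0000 1MMM CCCC
    (0xEF00, 0x0100, "Cache hierarchy"),          # 000F 0001 RRRR TTLL
    (0xFFFF, 0x0400, "Internal timer"),           # 0000 0100 0000 0000
    (0xE800, 0x0800, "Bus and interconnect"),     # 000F 1PPT RRRR IILL
]


def classify_compound(code):
    """Return the compound form of a 16-bit MCA error code, or None."""
    code &= 0xFFFF
    for mask, match, description in COMPOUND_FORMS:
        if (code & mask) == match:
            return description
    return None


if __name__ == "__main__":
    for sample in (0x0151, 0x0011, 0x0800, 0x0001):
        print(f"{sample:#06x}: {classify_compound(sample)}")
```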

Table 7 shows the 2-bit transaction type (TT) sub-field.

Table 7 Encoding for TT (Transaction Type) Sub-Field

Table 8 shows the 2-bit level (LL) sub-field, which indicates the level in
memory hierarchy where the error occurred.

Table 8 Level Encoding for LL (Memory Hierarchy Level) Sub-Field

Table 9 shows the 4-bit request (RRRR) sub-field, which indicates the type of
action associated with the error.

Table 9 Encoding of Request (RRRR) Sub-Field

Refer to Section 15.9 of Ref [1] for the other sub-field decoding tables and for
more information.

Multi-core implications
Most MCE registers are core-specific; that is, each core has its own set of
control, status, and address registers. However, in newer processor families
such as Nehalem, new banks of registers have been added to the architecture
to report package-level error information. For example, in Nehalem
processor families, banks 0, 1, 6, and 7 are per-package and were introduced
to cover QPI, integrated memory, and graphics. Banks 2, 3, 4, and 5 are more
traditional MCE banks that report per-core information for units such as
the data cache, TLB, MLC, and LLC. See Ref [2] for more information.
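
Since most banks are per-core, a debug tool generally has to visit every
logical CPU. The following Python sketch shows one way this could be done
on Linux through the msr driver (/dev/cpu/N/msr), which requires root
privileges and the msr kernel module. It is an assumption that the target
runs Linux; other operating systems need a different reader. The function
names are hypothetical, and the reader is passed in as a parameter so the
scan logic can be exercised without hardware access.

```python
# Hedged sketch: scan every bank on every logical CPU for valid error records.
import os
import struct

IA32_MCG_CAP = 0x179
IA32_MC0_STATUS = 0x401


def read_msr_linux(cpu, address):
    """Read one 64-bit MSR through the Linux msr driver (root required)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)


def scan_banks(cpus, read_msr=read_msr_linux):
    """Return (cpu, bank, status) for each bank whose VAL flag (bit 63) is set."""
    hits = []
    for cpu in cpus:
        bank_count = read_msr(cpu, IA32_MCG_CAP) & 0xFF
        for bank in range(bank_count):
            status = read_msr(cpu, IA32_MC0_STATUS + 4 * bank)
            if (status >> 63) & 1:
                hits.append((cpu, bank, status))
    return hits
```

A per-package bank is visible from every core that shares it, so the same
record may appear more than once in the result; de-duplicating by bank and
status value is left to the caller.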

Debug Process Flow

Confirm platform is operating within specification


Before in-depth debugging, it is important to make sure the MCE is not
caused by something obvious. An important checkpoint is to make sure
the platform is not operating out of specification. For example, if the CPU is
operating at a temperature that is outside of the specified operating range,
the behavior will be unpredictable. Another example is a CPU voltage or
frequency that is outside of the specified range.

Gather MCE error code displayed by OS


OS MCE handlers typically produce an MCE code and print messages to the
screen before the system resets. It is important to capture as much
information as possible by carefully recording the screen message. Turning
on debug message levels to get extra system messages and recording the
screen log messages is also a good idea, because it helps identify what was
happening before the MCE event. The details of turning on debug messages
are OS specific. For example, Linux has different levels of debug message
displays.
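
When screen or serial logs have been captured, the status values can be
pulled out mechanically. The Python sketch below assumes log lines that
contain text in the form "MC1_STATUS: 0x..." as in the decoding example in
this paper; real OS handlers print MCE information in their own formats, so
the pattern is an assumption that must be adapted to the OS in use. The
function name is hypothetical.

```python
# Hedged sketch: extract bank status values from captured log text.
# The "MC<n>_STATUS: 0x<hex>" line format is an assumption, not an OS standard.
import re

STATUS_PATTERN = re.compile(r"MC(\d+)_STATUS:\s*0x([0-9a-fA-F]{1,16})")


def extract_status_values(log_text):
    """Return a list of (bank, status) tuples found in the log text."""
    return [(int(bank), int(value, 16))
            for bank, value in STATUS_PATTERN.findall(log_text)]


if __name__ == "__main__":
    captured = "CPU 0: Machine Check Exception\nMC1_STATUS: 0xf200000000020151\n"
    for bank, status in extract_status_values(captured):
        print(f"bank {bank}: {status:#018x}")
```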

Identify error frequency


Error frequency is an important piece of data, but it may be hard to
establish from a small sample size. The frequency of the MCE may shed some
light on what may be causing it. For example, if the frequency of the MCE is
relatively high and the MCE is easily reproducible, it may indicate issues
with board designs or system applications. If the frequency is extremely
low, the MCE may be related to environmental disturbance.

Confirm whether the MCE is reproducible on the same platform
It is helpful to confirm whether the issue is reproducible on the same
platform. However, in some cases the error may not recur on the same
platform.

Collect as much information as possible about configuration
Capture all the platform/OS/BIOS information for the failing system,
including:

• CPU stepping/SKU info
• MCU version
• Chipset stepping/SKU info
• BIOS (vendor name, whether it has the latest MCU and necessary known
BIOS fixes)
• OS information
• Software applications that are running, the main transactions (I/O,
memory)
• The kind of environment (high altitude, extreme cold/hot)
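
Part of this information can be collected automatically. As one hedged
example, on a Linux-based system the CPU family, model, stepping, and loaded
microcode revision are normally exposed in /proc/cpuinfo; the Python sketch
below parses that text. It assumes an x86 Linux target (other operating
systems expose this information differently, and older kernels may not
report the microcode field), and the function name is hypothetical.

```python
# Hedged sketch: pull CPU identification fields out of /proc/cpuinfo text.

WANTED_KEYS = ("vendor_id", "cpu family", "model", "model name",
               "stepping", "microcode")


def parse_cpuinfo(text):
    """Return the identification fields of the first logical CPU listed."""
    info = {}
    for line in text.splitlines():
        if not line.strip():
            if info:
                break  # a blank line ends the first processor entry
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key in WANTED_KEYS:
            info[key] = value.strip()
    return info


if __name__ == "__main__":
    with open("/proc/cpuinfo") as handle:  # Linux only
        for key, value in parse_cpuinfo(handle.read()).items():
            print(f"{key}: {value}")
```

Storing this record alongside each captured MCE makes it easier to compare
failing units later.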

Decode and understand MCE message


Correctly decoding the MCE message is a necessary and important step. An
example of how to decode an MCE message follows.

For example, if the compound error code reported is:

MC1_STATUS: 0xf200000000020151

This can be decoded by looking at the MCA error code field, bits [15:0],
which is 0x0151 in the above register. Convert this value to binary (0000
0001 0101 0001) and refer to Table 6 to determine the compound error code
form. In this case, the form is (000F 0001 RRRR TTLL), which is a cache
hierarchy error. Next, the sub-fields can be determined: TT=00, LL=01,
RRRR=0101. By using the sub-field Tables 7, 8, and 9 and the corresponding
“interpretation” form from Table 6, ({TT}CACHE{LL}_{RRRR}_ERR), the MCE is
decoded as an L1 instruction fetch error. This error is an uncorrected
error, as can be seen by bit 61 being set in the MC1_STATUS register.
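
The same decoding can be expressed as a short Python sketch. The mnemonic
tables below are those recalled from the TT, LL, and RRRR sub-field tables
in Ref [1] and should be checked against Tables 7, 8, and 9; the function
name is hypothetical, and only the cache hierarchy form is handled.

```python
# Hedged sketch: decode a cache hierarchy error of the form 000F 0001 RRRR TTLL.

TT_NAMES = {0b00: "I", 0b01: "D", 0b10: "G"}                # instruction, data, generic
LL_NAMES = {0b00: "L0", 0b01: "L1", 0b10: "L2", 0b11: "LG"}
RRRR_NAMES = {
    0b0000: "ERR", 0b0001: "RD", 0b0010: "WR", 0b0011: "DRD", 0b0100: "DWR",
    0b0101: "IRD", 0b0110: "PREFETCH", 0b0111: "EVICT", 0b1000: "SNOOP",
}


def decode_cache_error(status):
    """Return (interpretation name, uncorrected flag) for a cache hierarchy error."""
    code = status & 0xFFFF
    if (code & 0xEF00) != 0x0100:
        raise ValueError("not a cache hierarchy error code")
    ll = code & 0x3           # bits 1:0
    tt = (code >> 2) & 0x3    # bits 3:2
    rrrr = (code >> 4) & 0xF  # bits 7:4
    name = "{}CACHE{}_{}_ERR".format(
        TT_NAMES.get(tt, "?"), LL_NAMES[ll], RRRR_NAMES.get(rrrr, "?"))
    uncorrected = bool((status >> 61) & 1)  # UC flag
    return name, uncorrected


if __name__ == "__main__":
    print(decode_cache_error(0xF200000000020151))
```

For the value above, the sub-fields resolve to I, L1, and IRD, which matches
the L1 instruction fetch interpretation given in the text.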

The messages provided by the MCE error code can be used to understand
what may potentially be causing the errors. Refer to the section “Potential
Causes for Machine Check Exceptions” for some common potential causes.

Refer to Appendix E, “Interpreting Machine-Check Error Codes” in Ref [2] for
more information on interpreting the MCA error code, model-specific error
code, and other information error code fields.

Make sure latest MCU and BIOS updates are in place
Ensure the correct MCU and BIOS are in place. Check with BIOS vendors for
relevant BIOS updates. Make sure the latest MCU code is being used as each
MCU may contain bug fixes or enhancements.

Research CPU Spec Update for relevant CPU errata
Research the CPU Specification Update for known CPU errata that match the
failure symptoms. If an erratum recommends certain software practices, these
recommendations should be reviewed, considered, and tested.

Try to reproduce on Customer Reference Board


When issues are highly reproducible on customer platforms, gather a
reference data point by testing it on Intel CRBs. If users can re-create the
same issue on Intel CRBs, this will potentially eliminate a lot of possibilities
and streamline the debug process.

When an MCE can be reproduced on an Intel CRB, there are usually two
possibilities related to the cause. One is a potential silicon issue; the
other is the SW/OS/BIOS that is running. In this situation, it is
recommended to alert the silicon vendor. If the MCE is reproducible on the
CRB and in a common software environment, then it may be easier to engage
in a productive troubleshooting process.

Debug Checklist

Step  Checklist Item                                               Outcome

1     Confirm platform is operating within specification
2     Gather MCE error code
3     Identify error frequency
4     Confirm if MCE is reproducible on the same platform
5     Collect as much information as possible about configuration
      (OS, BIOS version, software applications that are running)
6     Understand the MCE code
7     Make sure latest MCU & BIOS updates are in place
8     Research CPU spec update for any relevant CPU errata
9     Try to reproduce on customer reference board

Related documents

Ref. #  Document Title (Document Number/Location)

[1]     Intel® 64 and IA-32 Architectures Software Developer’s Manual,
        Volume 3A: System Programming Guide (253668)

[2]     Intel® 64 and IA-32 Architectures Software Developer’s Manual,
        Volume 3B: System Programming Guide (253669)

Summary
This application note gives an overview of machine check architecture and its
purpose in detecting and reporting system errors. This architecture provides
an opportunity to capture a group of error situations visible to the CPU at the
point of failure. Newer additions to the MCA also make it possible to wire
interrupts to correctable MCEs, in case users are interested in checking these
events as well.

This document describes the importance of machine check architecture to
embedded products given the valuable information captured in MCE registers,
which are often the only clue to what has happened. Debugging machine
check exceptions on any system is a challenging task due to the number of
suspects that may be involved. Embedded systems add a new dimension to
the difficulty due to their vastly diverse system configurations,
environments, and usage models.

This document provides a quick review of machine check architecture and its
key elements for debugging. It also provides recommendations on how to
debug MCEs in embedded systems and a sample approach to help system
developers debug such issues. As each failure event is rather unique,
every error situation will need to be approached differently. Nevertheless,
this step-by-step guide provides a list of items that may be helpful in the
debug process.


Authors
Ashley Montgomery is a Platform Application Engineer with
Intel’s Embedded and Communications Group.
Tian Tian is a Platform Application Engineer with Intel’s
Embedded and Communications Group.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO
LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL
PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS
AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY
WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO
SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO
FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY
PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE
AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED
FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A
SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time, without notice.
This paper is for informational purposes only. THIS DOCUMENT IS PROVIDED "AS IS" WITH NO
WARRANTIES WHATSOEVER, INCLUDING ANY WARRANTY OF MERCHANTABILITY,
NONINFRINGEMENT, FITNESS FOR ANY PARTICULAR PURPOSE, OR ANY WARRANTY OTHERWISE
ARISING OUT OF ANY PROPOSAL, SPECIFICATION OR SAMPLE. Intel disclaims all liability, including
liability for infringement of any proprietary rights, relating to use of information in this specification.
No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted
herein.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Core Inside, i960, Intel, the Intel
logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, the Intel Inside logo, Intel
NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of
Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel Viiv, Intel vPro, Intel
XScale, InTru, the InTru logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin,
Pentium, Pentium Inside, skoool, the skoool logo, Sound Mark, The Journey Inside, vPro Inside,
VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2010 Intel Corporation. All rights reserved.
