BY FRANK DENNEMAN PUBLISHED JULY 8, 2016 UPDATED JULY 13, 2016
Reviewing the physical layers helps to understand the behavior of the VMkernel CPU scheduler and to select a physical configuration that is optimized for performance. This part covers the Intel Xeon microarchitecture and zooms in on the Uncore, primarily focusing on Uncore frequency management and QPI design decisions.
Terminology
There are a lot of different names used for what is apparently the same thing. Let's review the terminology of the physical CPU and the NUMA architecture. The CPU package is the device you hold in your hand; it contains the CPU die and is installed in the CPU socket on the motherboard. The CPU die contains the CPU cores and the system agent. A core is an independent execution unit that can present two virtual cores to run simultaneous multithreading (SMT). Intel's proprietary SMT implementation is called Hyper-Threading (HT). Both SMT threads share components such as the cache layers and access to the scalable on-die ring interconnect for I/O operations.
Interesting etymology: the word "die" is the singular of dice. Elements such as processing units are produced on a large round silicon wafer. The wafer is cut ("diced") into many pieces. Each of these pieces is called a die.
NUMA Architecture
In the following scenario, the system contains two CPUs, Intel Xeon E5-2630 v4, each containing 10 cores (20 HT threads). The Intel Xeon E5-2630 v4 is based on the Broadwell microarchitecture and contains 4 memory channels, with a maximum of 3 DIMMs per channel. Each channel is populated with a single 16 GB DDR4 DIMM, so 64 GB of memory is available per CPU, with a total of 128 GB in the system. The system reports two NUMA nodes; each NUMA node, sometimes called a NUMA domain, contains 10 cores and 64 GB of memory.
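To make the arithmetic behind those capacities explicit, here is a minimal sketch in Python using the channel and DIMM numbers from the example configuration above (the variable names are purely illustrative):

```python
# Worked example for the dual Intel Xeon E5-2630 v4 configuration described above.
channels_per_cpu = 4      # DDR4 memory channels per CPU
dimms_per_channel = 1     # one DIMM populated per channel (the maximum is 3)
dimm_size_gb = 16         # 16 GB DDR4 DIMM in each populated slot
cpus = 2                  # two CPU packages, i.e. two NUMA nodes

memory_per_numa_node = channels_per_cpu * dimms_per_channel * dimm_size_gb
total_memory = memory_per_numa_node * cpus

print(f"{memory_per_numa_node} GB per NUMA node, {total_memory} GB in total")
# -> 64 GB per NUMA node, 128 GB in total
```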
Consuming NUMA
The CPU can access both its local memory and the memory controlled by the other CPUs in the system. Memory capacity managed by other CPUs is considered remote memory and is accessed through the QPI (Part 1). The allocation of memory to a virtual machine is handled by the CPU and NUMA schedulers of the ESXi kernel. The goal of the NUMA scheduler is to maximize local memory access and to distribute the workload as efficiently as possible. This depends on the virtual machine's CPU and memory configuration and on the physical core count and memory configuration. A more detailed look into the behavior of the ESXi CPU and NUMA scheduler is provided in part 5; how to size and configure your virtual machines is discussed in part 6. This part focuses on the low-level configuration of a modern dual-socket system. ESXtop reports 130961 MB (PMEM /MB) and displays the NUMA nodes with their local memory count.
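For comparison, a NUMA-aware operating system exposes the same kind of per-node view. The sketch below is illustrative only: it reads the standard Linux sysfs layout (for example inside a Linux VM with vNUMA exposed, or on a bare-metal Linux host) and is not an ESXi interface.

```python
# Illustrative only: list the NUMA nodes and their local memory as seen by Linux.
# This parallels the per-node memory counts that esxtop shows on ESXi.
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    meminfo = (node / "meminfo").read_text()
    total_kb = next(int(line.split()[3]) for line in meminfo.splitlines()
                    if "MemTotal" in line)
    cpus = (node / "cpulist").read_text().strip()
    print(f"{node.name}: CPUs {cpus}, {total_kb // 1024} MB local memory")
```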
Uncore
As mentioned in part 1, the Nehalem microarchitecture introduced a flexible architecture that could be optimized for different segments. In order to facilitate scalability, Intel separated the core processing functionality (ALU, FPU, L1 and L2 cache) from the 'Uncore' functionality. A nice way to put it is that the Uncore is a collection of components of a CPU that do not carry out core computational functions but are essential for core performance. This architectural change brought the Northbridge functionality closer to the processing units, reducing latency while increasing speed due to the removal of serial bus controllers.
The Uncore features elements such as the QPI agents, the home agent with the integrated memory controller (IMC), the last-level cache (LLC) with its Cbox controllers, and the scalable on-die ring interconnect that connects them to the cores.
Intel provides a schematic overview of a CPU to illustrate the relationship between the Uncore and the cores; I've recreated this overview to help emphasise certain components. Please note that the following diagram depicts the High Core Count architecture of the Intel Xeon v4 (Broadwell). This is a single CPU package. The cores are spread out in a "chop-able" design, allowing Intel to offer three different core counts: Low, Medium and High. The red line depicts the scalable on-die ring connecting the cores with the rest of the Uncore components. More in-depth information can be found in part 4 of this series.
If a CPU core wants to access data, it has to communicate with the Uncore. The data can be in the last-level cache (LLC), in which case the core interfaces with the Cbox; it might reside in local memory, requiring the home agent and the integrated memory controller (IMC); or it might need to be fetched from a remote NUMA node, in which case the QPI comes into play. Due to the many components located in the Uncore, it plays a significant part in the overall power consumption of the system. With today's focus on power reduction, the Uncore is equipped with frequency scaling functionality (UFS).
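The access paths just described can be summarized in a small conceptual sketch. This is a model of the text, not a real programming interface; the function and its component names (Cbox, home agent, IMC, QPI agent) simply mirror the description above.

```python
# Conceptual model of which Uncore components service a core's data request,
# depending on where the data lives. Names follow the text, not a real API.
def uncore_path(data_location: str) -> list[str]:
    if data_location == "llc":
        return ["Cbox", "last-level cache (LLC)"]
    if data_location == "local_memory":
        return ["home agent", "integrated memory controller (IMC)", "local DRAM"]
    if data_location == "remote_memory":
        return ["QPI agent", "remote home agent", "remote IMC", "remote DRAM"]
    raise ValueError(f"unknown data location: {data_location}")

for location in ("llc", "local_memory", "remote_memory"):
    print(f"{location:>13}: {' -> '.join(uncore_path(location))}")
```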
Haswell (v3) introduced Per Core Power States (PCPS), which allow each core to run at its own frequency. UFS allows the Uncore components to scale their frequency up and down independently of the cores. This allows Turbo Boost 2.0 to turbo the two elements up and down independently, letting cores scale up the frequency of their LLC and ring on-ramp modules without forcing all Uncore elements to turbo boost up and waste power. The feature that regulates boosting of the two elements is called Energy Efficient Turbo. Some vendors provide the ability to manage power consumption with the settings Uncore Frequency Override or Uncore Frequency. These settings are geared towards applying power savings in a more holistic way.
The Uncore provides access to all interfaces and regulates the power states of the cores; therefore, it has to be functional even when there is minimal load on the CPU. To reduce overall CPU power consumption, the power control mechanism attempts to reduce the CPU frequency to a minimum by using C1E states on separate cores. If a C1E state occurs, the frequency of the Uncore is likely to be lowered as well. This could have a negative effect on the I/O throughput and the overall throughput of the CPU. To avoid this from happening, some server vendors provide the BIOS option Uncore Frequency Override. By default this option is set to Disabled, allowing the system to reduce the Uncore frequency to obtain power savings. Selecting Enabled prevents frequency scaling of the Uncore, ensuring high performance. To secure high levels of throughput on the QPI links, select Enabled, but keep in mind that this can have a negative (increased) effect on the power consumption of the system.
Some vendors provide the Uncore Frequency option with the values Dynamic and Maximum. When set to Dynamic, the Uncore frequency matches the frequency of the fastest core. With most server vendors, when the Dynamic option is selected, the Uncore frequency is optimized either to save power or to maximize performance; the bias towards power saving or performance is influenced by the configured power-management policy. When the Uncore Frequency option is set to Maximum, the frequency remains fixed.
Haswell (v3) and Broadwell (v4) offer three QPI clock rates: 3.2 GHz, 4.0 GHz, and 4.8 GHz. Intel does not provide clock rate details; it only lists the transfer rate in GT/s. To simplify the calculation, just multiply the GT/s by two (the link is 16 bits wide, and 16 bits / 8 bits per byte = 2 bytes per transfer). Listed at 9.6 GT/s, a QPI link can transmit up to 19.2 GB/s from one CPU to another CPU. As it is bidirectional, it can receive the same amount from the other side. In total, the two 9.6 GT/s links provide a theoretical peak data bandwidth of 38.4 GB/s in one direction.
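As a quick sanity check on those numbers, the sketch below applies the rule of thumb from the text: a QPI link is 16 bits (2 bytes) wide per direction, so the bandwidth per direction in GB/s is roughly the transfer rate in GT/s multiplied by two.

```python
# QPI bandwidth rule of thumb from the text: 2 bytes per transfer per direction.
def qpi_bandwidth_gb_s(gt_per_s: float, links: int = 1) -> float:
    bytes_per_transfer = 2   # 16-bit link width / 8 bits per byte
    return gt_per_s * bytes_per_transfer * links

print(qpi_bandwidth_gb_s(9.6))           # 19.2 GB/s: one link, one direction
print(qpi_bandwidth_gb_s(9.6, links=2))  # 38.4 GB/s: both links, one direction
```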
Note!
The reason why I'm exploring the nuances of power settings is that high-performance power settings are not always the most optimal choice for today's CPU microarchitectures. Turbo mode allows cores to burst to a higher clock rate if the power budget allows it. The finer details of power management and Turbo mode are beyond the scope of this NUMA deep dive, but will be covered in the upcoming CPU Power Management Deep Dive.