ARTICLE INFO

Keywords:
Power models
Energy monitoring
Software toolkit
Software-defined power meters
Open testbed

ABSTRACT

Software power estimation of CPUs is a central concern for energy efficiency and resource management in data centers. Over the last few years, a dozen ad hoc power models have been proposed to cope with the wide diversity and the growing complexity of modern CPU architectures. However, most of these CPU power models rely on a thorough expertise of the targeted architectures, thus leading to the design of hardware-specific solutions that can hardly be ported beyond their initial settings. In this article, we rather propose a novel toolkit that uses a configurable/interchangeable learning technique to automatically learn the power model of a CPU, independently of the features and the complexity it exhibits. In particular, our learning approach automatically explores the space of hardware performance counters made available by a given CPU to isolate the ones that are best correlated with the power consumption of the host, and then infers a power model from the selected counters. Based on a middleware toolkit devoted to the implementation of software-defined power meters, we implement the proposed approach to generate CPU power models for a wide diversity of CPU architectures (including Intel, ARM, and AMD processors), using a large variety of both CPU- and memory-intensive workloads. We show that the CPU power models generated by our middleware toolkit estimate the power consumption of the whole CPU or of individual processes with an accuracy of 98.5% on average, thus competing with state-of-the-art power models.
⁎ Corresponding author at: Cité Scientifique, 59650 Villeneuve d'Ascq, France.
E-mail addresses: maxime.colmant@inria.fr (M. Colmant), romain.rouvoy@univ-lille.fr (R. Rouvoy), pascal.felber@unine.ch (P. Felber),
lionel.seinturier@univ-lille.fr (L. Seinturier).
https://doi.org/10.1016/j.jss.2018.07.001
Received 2 June 2017; Received in revised form 12 June 2018; Accepted 1 July 2018
Available online 02 July 2018
0164-1212/ © 2018 Elsevier Inc. All rights reserved.
M. Colmant et al. The Journal of Systems & Software 144 (2018) 382–396
machine learning techniques to automatically infer the power model of a target CPU by exploring its consumption space according to a variety of input workloads. In particular, our modular toolkit automatically builds the power model of a CPU by characterizing the available power-aware features, and then exploits this model in production to estimate the power consumption. Thanks to POWERAPI, these power estimations can be delivered in real-time, both at the granularity of the physical node and of the hosted software processes. Our solution is assessed on 4 different CPUs (Intel Xeon, Intel i3, ARM Cortex, AMD Opteron) and exhibits, on average, an estimation error of only 1.5%. Regarding the limitations identified in the previous paragraph, we show that POWERAPI can i) benefit from widely available open-source benchmarks and workloads to learn and compare power models, and ii) build an accurate power model according to the list of available hardware performance counters and power-aware features. We claim that such a toolkit will be able to cope with the continuous evolution of CPU architectures and to cover the “next 700 CPU power models”,¹ thus easing their adoption and wide deployment.

While we assess the capability of POWERAPI to build CPU power models for different CPU architectures, this article also aims at answering the following research questions:

1. Can we build CPU power models that better fit specific domains of applications?
2. Can we use the derived CPU power models to estimate the power consumption of any workload in real-time?
3. Can we use the derived CPU power models to estimate the power consumption of concurrent processes?
4. Can we adjust the CPU power model depending on an execution profile?
5. Does the CPU power model depend on the underlying operating system or performance profiles?

Beyond the CPU power models, experiments, and research questions we address in this article, we believe that our contribution offers an open testbed to foster research on green computing. Our open-source solution can be used to automatically infer power models and to support the design of energy-aware scheduling heuristics in homogeneous systems (Bambagini et al., 2013; Bellosa, 2000; Moghaddami Khalilzad et al., 2013; Rasmussen, 2015; Havet et al., 2017), as well as in heterogeneous data centers (Kurpicz et al., 2014), to serve energy-proportional computing (Barroso and Holzle, 2007; Krioukov et al., 2010; Meisner et al., 2011; Prekas et al., 2015), and to evaluate the effectiveness of optimizations applied on binaries (Schulte et al., 2014). It also targets system administrators and software developers alike in monitoring and better understanding the power consumption of their software assets (Noureddine et al., 2014; 2015; Sterling, 2013). The POWERAPI middleware toolkit is published as open-source software² and we build on freely available benchmark suites to learn the power model of any CPU.³

The remainder of this article is organized as follows. Section 2 discusses the state-of-the-art of CPU power model learning techniques and their limitations, as well as the state-of-the-art of monitoring tools used to produce power estimations at different granularities. Section 3 describes in depth the architecture of our middleware toolkit, POWERAPI, for learning CPU power models and building software-defined power meters. Section 4 proposes a new approach to automatically learn the power model for a given CPU architecture. Section 5 demonstrates the accuracy of various CPU power models that our approach can infer. Section 6 studies the applicability of our approach to model specific processor architectures. Finally, Section 7 concludes this article and sketches some perspectives for further research.

¹ Following the same idea of generality as in the seminal paper of Landin: The Next 700 Programming Languages (Landin, 1966).
² Freely available as open-source software from: http://powerapi.org.
³ Freely available as Docker containers from: https://github.com/Spirals-Team/benchmark-containers.

2. State-of-the-Art

In this section, we describe the previous research works that are closest to our contributions in the domain of power monitoring for server-side infrastructure.

Energy consumption is a crosscutting concern that encompasses various domains, such as mobile systems, which also address the challenge of power profiling and the specification of power models (Zhang et al., 2010; Yoon et al., 2012). However, this article does not aim at covering such power models, as they usually assume direct current supplies and may be influenced by additional hardware components (including screens, GPS, etc.).

Thus, Section 2.1 introduces recent research approaches for learning power models, while Section 2.2 presents different tools that can estimate or measure the power consumption at different granularities.

2.1. CPU power models

The closest approach to hardware-based monitoring is Running Average Power Limit (RAPL), introduced with the Intel “Sandy Bridge” architecture to report on the power consumption of the entire CPU package. Beyond modern Intel architectures, AMD introduced a similar feature, Application Power Management (APM), with their Socket AM3+. However, as this feature is not available on all architectures and is not always as accurate as one could expect (Colmant et al., 2015), alternative CPU power models are generally designed based on a wider diversity of raw metrics. Over the last decade, the design of CPU power models has therefore been regularly considered by the research community (Bellosa, 2000; Kansal et al., 2010; McCullough et al., 2011; Versick et al., 2013; Colmant et al., 2015). The state-of-the-art in this area has investigated both power modeling techniques and system performance metrics to identify power models that are as accurate and/or as generic as possible.

In particular, standard operating system metrics (CPU, memory, disk, or network), directly computed by the kernel, tend to exhibit a large measurement error (Kansal et al., 2010; Versick et al., 2013). Contrary to these high-level usage statistics, Hardware Performance Counters (HWPCs) are low-level metrics that can be monitored from the processor (e.g., number of retired instructions, cache misses, non-halted cycles). However, modern processors offer a variable number of HWPCs, depending on architectures and generations. As shown by Bellosa (2000) and Bircher and John (2007), some HWPCs are highly correlated with the processor power consumption, whereas the authors of Rivoire et al. (2008) conclude that several performance counters are not useful, as they are not directly correlated with dynamic power. Nevertheless, this correlation depends on the processor architecture, and a CPU power model computed using some HWPCs may not be ported to different settings and architectures. Furthermore, the number of HWPCs that can be monitored simultaneously is limited and depends on the underlying architecture (Intel, 2015a), which also limits the ability to port a CPU power model to a different architecture. Therefore, finding an approach to select the relevant HWPCs represents a tedious task, regardless of the CPU architecture.

Power modeling often builds on these raw metrics to apply learning techniques—for example based on sampling (Bertran et al., 2010)—to correlate the metrics with hardware power measurements using various regression models, which are so far mostly linear (McCullough et al., 2011).

Arjona Aroca et al. (2014) propose to model several hardware components (CPU, disk, network). They use the lookbusy tool to generate a CPU load for each available frequency and a fixed number of
active cores. They capture Active Cycles Per Second (ACPS) and raw power measurements while loading the CPU. A polynomial regression is used to propose a power model per combination of frequency and number of active cores. They validate their power models on a single processor (Intel Xeon W3520) by using a map-reduce Hadoop application. During the validation, the authors have not been able to correctly capture the input parameter of their power model—i.e., the overall CPU load—and they use an estimation instead. The resulting “tuned” power model with all components together exhibits an estimation error of 7% compared to the total amount of energy consumed.

Bertran et al. (2010) model the power consumption of an Intel Core 2 Duo by selecting 14 HWPCs based on an a priori knowledge of the underlying architecture. To compute their model, the authors inject both the selected HWPCs and power measurements into a multivariate linear regression. A modified version of perfmon2⁴ is used to collect the raw HWPC values. In particular, the authors developed 97 specific micro-benchmarks to stress each identified component in isolation. These benchmarks are written in C and assembly and cannot be generalized to other architectures. The authors assess their solution with the SPEC CPU 2006 benchmark suite, reporting an estimation error of 5% on a multi-core architecture.

⁴ http://perfmon2.sourceforge.net.

Bircher et al. (2005) propose a power model for an Intel Pentium 4 processor. They provide a first model that uses the number of fetched μ-operations per cycle, reporting an average estimation error of 2.6%. As this model was performing better for benchmarks inducing integer operations, the authors refine their model by using the definition of a floating-point operation. As a consequence, their second power model builds on 2 HWPCs: the μ-operations delivered by the trace cache and the μ-operations delivered by the μ-code ROM. This model is assessed using the SPEC CPU 2000 benchmark suite, which is split into 10 groups. One benchmark is selected per group to train the model and the remaining ones are used to assess the estimation. Overall, the resulting CPU power model reports an average error of 2.5%.

In Contreras and Martonosi (2005), the authors propose a multivariate linear CPU power model for the Intel XScale PXA255 processor. They additionally consider different CPU frequencies on this processor to build a more accurate power model. They carefully select the HWPCs with the best correlation while avoiding redundancy, resulting in the selection of only 5 HWPCs. In their paper, they also consider the power drawn by the main memory using 2 HWPCs already used in the CPU power model. However, given that the processor can only monitor 2 events concurrently, they cannot implement an efficient and usable runtime power estimation. They test their solution on SPEC CPU 2000, Java CDC, and Java CLDC, and they report an average estimation error of 4% compared to the measured average power consumption.

The authors of Dolz et al. (2015) propose an approach to build linear power models for hardware components (CPU, memory, network, disk) by applying a per-component analysis. Their technique uses 4 benchmarks during the training phase and collects various metrics gathered from hardware performance counters, OS statistics, and sensors. They build a correlation matrix from all gathered metrics (including power measurements) and then apply a clustering algorithm on top of it. The power models reported in this state-of-the-art work are manually extracted from these groups. Without considering the power models which directly include power measurements, the best one exhibits an absolute error of 3 W on average, with a maximum absolute error of 70 W.

Economou et al. (2006) model the power consumption of 2 servers (Turion, Itanium) as a multiple linear regression that uses various utilization metrics as input parameters. The authors use the CPU utilization, the off-chip memory access count, the hard-disk I/O rate, and the network I/O rate. The input parameters are learned by using Gamut, which emulates applications with varying levels of CPU, memory, hard disk, and network utilization. In order to retrieve raw power measurements, the authors use board-level modifications and 4 “power planes” (extracted from the paper), which is a rather heavy and constraining mechanism for end-users and represents a major hardware investment. On average, their power models exhibit an estimation error of less than 5% (varying between 0% and 15% in all cases) when using SPEC benchmarks, matrix and stream.

Isci and Martonosi (2003) use an alternative approach to estimate the power consumption of an Intel Pentium 4 processor. They isolate 22 processor subunits with the help of designed micro-benchmarks and live power measurements. For each subunit, they use simple linear heuristics, which can include one or more HWPCs. For the others (trace cache, allocation, rename...), they use a specific piecewise linear approach. They selected 15 different HWPCs to model all subunits, some of them being reconfigured or rotated when needed. In the end, they express the CPU power consumption as the sum of all subunits. They train their model on designed micro-benchmarks, SPEC CPU 2000, and some desktop tools (AbiWord, Mozilla, Gnumeric), and they report an average error of 3 W.

Li and John (2003) rely on per-OS-routine power estimation to best characterize the power drawn by a system. They simulate an 8-way issue,⁵ out-of-order superscalar processor with function unit latency. The authors use 21 applications, including SPEC JVM 98 and SPEC INT 95. During their experiments, they identify Instructions Per Cycle (IPC) to be very relevant to model the power drawn by the OS routines invoked by the benchmarks. The resulting CPU power model exhibits an average error of up to 6% in runtime testing conditions.

⁵ As mentioned in the aforementioned publication.

Rivoire et al. (2008) propose an approach to generate a family of high-level power models by using a common infrastructure. In order to choose the best input metrics for their power models, they compare 5 types of power models that vary in their input metrics and complexity. The first 4 power models are defined in the literature and use basic OS metrics (CPU utilization, disk utilization) (Fan et al., 2007; Heath et al., 2005; Rivoire et al., 2008). They propose a fifth power model that uses HWPCs in addition to CPU and disk utilization. The last proposed power model exhibits a mean absolute error of less than 4% over 4 families of processors (Core 2 Duo, Xeon, Itanium, Turion) when using SPECfp, SPECint, SPECjbb, stream, and Nsort. The authors do not detail the underlying architectures of the testbed CPUs, thus making a fair comparison difficult.

iMeter (Yang et al., 2014) covers not only the CPU, but also memory and I/O. To get a practical model, the authors need to select the proper number of counters. After benchmarking VMs under different loads, they empirically extract 91 out of 400 HWPCs. In a second step, a principal component analysis (PCA) is applied to identify a statistical correlation between the power consumption and the performance counters. With this method, highly correlated values are clustered into a smaller set of principal components that are not correlated anymore. The selection of the principal components depends on their cumulative contribution to the variance of the original counters, which should reach at least 85%. The final model is derived by the usage of support vector regression with 3 manually selected events per principal component (Vapnik et al., 1997), and reports an average error of 4.7%.

In Zamani and Afsahi (2012), the authors study the relationship between HWPCs and power consumption. They use 4 applications from the NAS parallel benchmarks (BT.C, CG.C, LU.C, SP.C) running on 8 threads in a dual quad-core AMD Opteron system. Given the limitation on the events that they can open simultaneously, the authors first show that the measurement variability over different executions is not very significant, enabling different runs for sampling all events. This article proposes a deep analysis for HWPC selection (single or multiple). The authors demonstrate that a collinearity can exist between events and then propose a novel method to find the best combination of HWPCs with
good performance. They use the ARMAX technique to build their power models. They evaluate their solution by producing a model per application and exhibit a mean absolute error in signal between 0.1% and 0.5% for offline analysis.

HaPPy (Zhai et al., 2014) introduces a hyperthread-aware power model that uses only the non-halted cycles event. The authors distinguish between different cases where either a single or both hardware threads of a core are in use. This power model is linear and contains a ratio computed according to their observations. They demonstrate that, when both threads of a core are activated, they share a small part of non-halted cycles. The authors extend the perf⁶ tool to access RAPL. Their model is tested on an Intel “Sandy Bridge” server with private benchmarks provided by Google, which cannot be reused, and 10 benchmarks taken from SPEC CPU 2006. To assess their power estimation, they used the RAPL interface available on this server. Compared to RAPL, they achieve an average estimation error of 7.5%, and a worst-case estimation error of 9.4%. Nevertheless, as demonstrated in Colmant et al. (2015), these estimation errors can be exceeded in scenarios where only single cores of a CPU are monitored.

⁶ https://perf.wiki.kernel.org.

2.2. Software-defined power meters

Software-defined power meters are customizable and adaptable solutions that can deliver raw power consumption or power estimations at various frequencies and granularities, depending on the requirements. Power estimation of running processes is not a trivial task and must tackle several challenges. Several solutions have already been proposed by the state-of-the-art:

pTop (Do et al., 2009) is a process-level power profiling tool for Linux or Windows platforms. pTop uses a background daemon to continuously profile statistics of running processes. pTop keeps traces about component states and temporarily stores the amount of energy consumed over each time interval. This tool displays, similarly to the top output, the total amount of energy consumed—i.e., in Joules—per running process. The authors propose static built-in energy models for CPU, disk, and network components. An API is also provided to get energy information of a given application per selected component.

PowerScope (Flinn and Satyanarayanan, 1999) is a tool capable of tracking the energy usage per application for later analysis and optimizations. Moreover, it maps the energy consumption to procedures within applications for a better understanding of the energy distribution. The tool uses 2 computers for offline analysis: one for sampling system activities (Profiling computer) and another for collecting power measurements from an external digital multimeter (Data Collection computer). Once the profiling phase is completed, the Profiling computer is then used to compute all energy profiles for later use.

PowerTOP (Intel, 2015b) is a Linux tool for finding energy-consuming software on multiple component sources (e.g., CPU, GPU, USB devices, screen). Several modes are available, such as the calibrate mode to test different brightness levels as well as USB devices, or the interactive mode for enabling different energy-saving mechanisms not enabled by default. This software-defined power meter can only report power estimations while running on battery within an Intel laptop, or only usage statistics otherwise.

SPAN (Wang et al., 2011) is designed to provide real-time power phase information of running applications. It also offers external API calls to allow developers to manually synchronize the source code of applications with power dissipation. The authors first design micro-benchmarks for sampling the only HWPC used—i.e., IPC—and they gather the HWPC data and raw power measurements to compute the parameters of their hand-crafted power models. They can next use SPAN for real-time power monitoring and offline source-code analysis.

Different limitations can be extracted from these solutions and the state-of-the-art. Most solutions are monolithic, created for specific needs, and cannot be easily tuned or configured for assembling new kinds of power meters (Do et al., 2009; Flinn and Satyanarayanan, 1999; Intel, 2015b; Wang et al., 2011). Power models used by such power meters cannot be used or adapted to modern architectures because they have been designed for a specific task (Wang et al., 2011) or depend on specific metrics (Do et al., 2009) that cannot trustfully represent recent power-aware features. They also lack modularity (Wang et al., 2011) and may require additional investments (Flinn and Satyanarayanan, 1999).

In order to detect energy-consuming applications and to apply critical energy decisions at runtime, one rather needs a middleware toolkit that is fully modular, configurable, and adaptive to report on power estimations or power measurements at high frequency on heterogeneous systems. To promote the usage of such a solution as a good alternative to physical power meters, the proposed solution has to be freely available to the community.

Synthesis and Contribution. In this article, we clearly differ from the state-of-the-art by providing an open-source, modular, and completely configurable implementation of a power modeling toolkit: POWERAPI. For the purpose of this article, we base our analysis on standard benchmarks (e.g., PARSEC, NAS NPB) to provide an open testbed for building and comparing CPU power models. We believe that our implementation can be easily extended to other domain-specific or synthetic workloads (Polfliet et al., 2011) depending on requirements (cf. Section 6.1). To the best of our knowledge, POWERAPI is the first solution that embeds a configurable and extensible learning approach to identify the power model of a CPU, independently of the features it exhibits. Once learned, the power models are then used by POWERAPI to accurately produce process-level power estimations in production at high frequencies. Unlike the existing approaches already published in the literature, we rather propose an interchangeable and configurable approach that is i) architecture-agnostic and ii) processor-aware.

3. POWERAPI, a middleware toolkit to learn power models

We propose and design POWERAPI, an open-source toolkit⁷ for assembling software-defined power meters upon need. POWERAPI is built on top of Scala and the actor programming model using the Akka library. Akka is an open-source toolkit for building scalable and distributed applications on the JVM, which pushes forward the actor programming model as the best programming model for concurrency. The software components of POWERAPI are implemented as actors, which can process millions of messages per second (Nordwall, 2012), a key property for supporting real-time power estimation. POWERAPI is therefore fully asynchronous and scales on several dimensions—i.e., the number of input sources, the requested monitoring frequencies, and the number of monitoring targets.

⁷ Freely available from: http://powerapi.org.

More specifically, the POWERAPI toolkit identifies 5 types of actor components:

Clock actors are the entry point of our architecture and allow us to meet throughput requirements by emitting ticks at given frequencies for waking up the other components.

Monitor actors handle the power monitoring request for one or several processes. They react to the messages published by a clock actor, configured to emit tick messages at a given frequency. The monitor is also responsible for aggregating the power estimations by applying a function (e.g., SUM, MEAN, MAX) defined for the monitoring when needed.

Sensor actors connect the software-defined power meters to the underlying system in order to collect raw measurements of system activity. Raw measurements can be coarse-grained power consumption reported by third-party power meters and embedded probes (e.g., RAPL), or fine-grained CPU activity statistics as delivered by the process file system (ProcFS). Sensors are triggered according to the
requested monitoring frequency and forward raw measurements to the appropriate formula actor.

Formula actors use the raw measurements received from the sensor to compute a power estimation. A formula implements a specific power model (Kansal et al., 2010; Versick et al., 2013) to convert raw measurements into power estimations. The granularity of the power estimation reported by the formula (machine, core, process) depends on the granularity of the measurements forwarded by the sensors.

Reporter actors finally deliver the power estimation computed by the aggregating function to a Display object. The Display object is responsible for converting the raw power estimation and the related information (e.g., the timestamp, the monitoring id, or the devices) into a suitable format. The delivered report is then exposed, for instance via a web interface or a virtual file system (e.g., based on FUSE),⁸ or can be uploaded into a database (e.g., InfluxDB).⁹

Sensor and Formula actors are tightly coupled, and we group them as a PowerModule that links the input metrics and the power model.

The overall architecture of POWERAPI is depicted in Fig. 1. Several PowerModule components can be assembled together to group power estimations from multiple sources. One can see that POWERAPI is fully modular and can be used to assemble power meters upon need and to fulfill all monitoring requirements. We can also note that POWERAPI is a non-invasive solution and does not require costly investments or specific kernel updates.

Three instances of software-defined power meters are also depicted in the figure. On the left side (Fig. 1a), one can find an instance of POWERAPI especially configured to learn CPU power models. This instance is composed of two PowerModule components: one for retrieving raw, accurate CPU metrics via libpfm, and another for retrieving the power measurements from a physical power meter. The data are then forwarded to several files to be later processed by our learning approach explained below. The resulting power model is then written into a configuration file that can be used later by a new instance of POWERAPI to estimate the power consumption in production. In the middle (Fig. 1b), another instance of POWERAPI is configured to use the aforementioned power model to produce fine-grained power estimations. And finally, on the right side (Fig. 1c), an instance of POWERAPI is configured to retrieve the power measurements from a Bluetooth power meter, PowerSpy, in real-time.

In this article, we use POWERAPI i) to learn the power models presented in Section 4, ii) to validate our learning approach and compare it against the state-of-the-art in Section 5, and iii) to build several software-defined power meters in Section 6.

4. Learning CPU power models

While state-of-the-art CPU power models demonstrate that achieving accurate power estimation is possible (cf. Table 1), most of the past contributions are barely generalizable to processors or applications that were not part of the original study. Therefore, instead of trying to design yet another power model, we rather propose an approach capable of learning the specifics of a processor and building the fitting CPU power model. Hence, our solution intends to cover evolving architectures and aims at delivering a reusable solution that will be able to deal with current and future generations of CPU architectures.

As reported by Kansal et al. (2010), the CPU load does not accurately reflect the diversity of the CPU activities. In particular, to faithfully capture the power model of a CPU, the types of tasks that are executed by the CPU have to be clearly identified. We therefore decide to base our power models on hardware performance counters (HWPCs) to

HWPC events provided by the CPU strongly vary according to the processor type.

More specifically, a CPU can expose several performance monitoring units (PMUs) depending on its architecture and model. For example, 2 PMUs are detected on an Intel Xeon W3520: nehalem and nehalem uncore, each providing 2 types of HWPCs that cover either fixed or generic HWPC events. A fixed HWPC can only be used for one predefined event, usually cycles, bus cycles, or instructions retired, while a generic HWPC can monitor any event. If more events are monitored than there are available counters for a PMU, the kernel applies multiplexing to alter the frequency and to provide fair access to each HWPC event. When multiplexing is triggered, the events cannot be monitored accurately anymore and estimations are returned instead.

As shown in Table 2, the number of generic counters supported and the number of available events vary considerably across architectures and even among models of the same manufacturer.

The approach we propose consists of two phases: an offline learning phase and an online exploitation phase, as depicted in Fig. 2. The learning phase analyses the power consumption and triggered HWPC events of the target CPU, in order to identify the key HWPC events that impact the power consumption. These key events are combined in a power model—i.e., a formula—which is then used during the online exploitation phase to deliver real-time power estimations by feeding a software-defined power meter, built with POWERAPI, that embeds the inferred CPU power model (cf. Fig. 1b).

4.1. Learning phase

During the learning phase, our goal is to automatically classify the HWPC events in order to identify those that best characterize the CPU activity and are tightly correlated with its power consumption. In the following paragraphs, we describe each of the steps illustrated in Fig. 2.

Input workload injection. To explore the diversity of activities of a CPU, we consider a set of representative benchmark applications covering the features provided by a CPU. In particular, to promote the reproducibility of our results, we favor freely available and widely used benchmark suites, such as PARSEC (Bienia and Li, 2009). We publish several of these open-source benchmarks, as well as POWERAPI, as Docker containers that can be easily deployed from DockerHub to host machines in order to quickly learn their power model using our toolkit.¹⁰ However, this choice does not prevent including additional (micro-)benchmark suites or any other sample workloads to study their impact on the inferred power models. During this step, POWERAPI takes care of launching the selected workloads several times—3 times by default—and in isolation, to reduce the potential noise that can be experienced during the learning phase and thus improve the reliability of the monitoring phase.

Acquisition of raw HWPC counters. Technically, the CPU cannot monitor hundreds of HWPC events simultaneously (Intel, 2015a) without triggering an event multiplexing mechanism, which degrades the accuracy of the collected metrics. Thus, we have to split the list of available events into subsets to avoid multiplexing causing inaccuracies in our process. Based on the information gathered from Table 2—numbers are automatically extracted from the architecture—we compute the number of events that can be monitored in parallel. For a given CPU, the number of workload executions w to be considered is therefore defined as:

⎡ Ep ⎤
collect raw, yet accurate, metrics reflecting the types of operations that w=⎢ ∑ Cp ⎥
× W ×i
are truly executed by the CPU. However, the number and the nature of ⎢
⎢ p ∈ PMUs ⎥
⎥ (1)
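As a rough sketch of Eq. (1), the following Python fragment computes the number of workload executions from per-PMU figures; the helper name is hypothetical, and the values are the Xeon W3520 figures reported in Table 2:

```python
from math import ceil

# |E_p| (available events) and |C_p| (generic counters) per PMU,
# as listed in Table 2 for the Intel Xeon W3520.
PMUS = {"nhm": (338, 4), "nhm_unc": (176, 8)}

def workload_executions(pmus, num_workloads, iterations):
    # Eq. (1): w = ceil(sum over PMUs of |E_p| / |C_p|) * |W| * i
    event_subsets = ceil(sum(e / c for e, c in pmus.values()))
    return event_subsets * num_workloads * iterations

# e.g., 10 input workloads, each sampled 3 times
w = workload_executions(PMUS, num_workloads=10, iterations=3)
```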
8 https://github.com/libfuse/libfuse.
9 https://www.influxdata.com/time-series-platform/influxdb.
10 Open source benchmark containers: https://github.com/Spirals-Team/benchmark-containers.
M. Colmant et al. The Journal of Systems & Software 144 (2018) 382–396
Fig. 1. Overview of the POWERAPI’s architecture with 3 instances of software-defined power meter built with it. The leftmost instance (Fig. 1a) shows the component
assembly used in Section 4 to learn CPU power models, while the middle and the rightmost instances (Fig. 1b and c resp.) represent the assemblies made in Section 5
to compare real-time power estimation with measurements.
where E is the set of events made available by the processor for a given PMU, C is the set of generic counters available for a PMU, W is the set of input workloads, and i is the number of sampling iterations to execute. Combining HWPC events and sample applications may quickly lead to the comparison of thousands of candidate metrics. Hence, a filtering step is introduced to guarantee an acceptable duration for the learning phase. Our approach therefore proposes an automated way to focus on the most relevant events. In the first step, each workload is only executed for 30 s (configurable) while collecting values from HWPC events and from a power meter. We then filter out the most relevant HWPC events by applying the Pearson correlation coefficient. We compute the Pearson correlation coefficient $r_{e,p}$ for each workload between the n values reported by each monitored HWPC event e and the collected power consumption p:

$r_{e,p} = \frac{\sum_{i=1}^{n} (e_i - \bar{e})(p_i - \bar{p})}{\sqrt{\sum_{i=1}^{n} (e_i - \bar{e})^2} \sqrt{\sum_{i=1}^{n} (p_i - \bar{p})^2}}$  (2)

Selection of relevant HWPC events. As a next step, we eliminate the HWPC events that have a median correlation coefficient (r̃) below a configurable threshold. To provide a default configuration value, we consider Cohen's classification of coefficients, which states that 0.1 to 0.3 can be interpreted as a weak correlation, 0.3 to 0.5 as a moderate correlation, and greater than 0.5 as a strong correlation (Benesty et al., 2009; Cohen, 1988). Therefore, we consider that any coefficient below 0.5 clearly indicates a lack of correlation between the considered event (e) and power consumption (p). During this step, we can quickly filter out hundreds of uncorrelated—and therefore irrelevant—events, resulting for instance in 253 left out of 514 events on an Intel Xeon W3520. The reduced set of HWPC events is then used to relaunch all the workloads, but this time with the default runtime of the selected benchmarks. By increasing this correlation filter threshold, the user can reduce the number of events to be considered in the next step and speed up the regression phase, at the risk of considering a very limited number of events, depending on the CPU architecture.

At the end of the full execution, we rank the remaining HWPC events for all the workloads based on their newly calculated median correlation with the power consumption, as depicted in Fig. 3. This figure illustrates the distribution of Pearson coefficients for the 30 best events, which varies for each of the workloads (W) taken from the PARSEC benchmark suite on the Intel Xeon W3520 processor. One can clearly distinguish the benchmarks that stimulate all selected HWPC events (e.g., x264, vips) from the ones whose power consumptions match only some specific events (e.g., freqmine, fluidanimate).

Power model inference by regression. We finally apply a regression analysis to derive the CPU power model from the previously selected HWPC events. In particular, we consider that the power model of a host machine, P(host), is defined as:

$P(host) = P_{idle}(host) + P_{active}(host) = \sum_{c \in C} P_{idle}^{c}(host) + \sum_{p \in P(host)} \sum_{c \in C} P_{active}^{c}(p)$  (3)

where Pidle(host) refers to the idle consumption of the host machine—i.e., its power consumption when the node does not host any activity—and Pactive(host) refers to the dynamic consumption due to the activity of each hosted process p. Both of these idle and dynamic consumptions encompass all the hardware components c supported by the host machine C (CPU, motherboard, fans, memory, disk, etc.). As part of this article, we consider both CPU- and memory-intensive workloads, thus modeling the power consumption of these components (CPU, memory) and their side-effects on related components (motherboard, fans, etc.).

While the state-of-the-art has investigated several regression techniques, from linear ones (Dolz et al., 2015; Li and John, 2003) to polynomial ones (Arjona Aroca et al., 2014; Colmant et al., 2015), this article reports on the adoption of the robust ridge regression (Lim et al., 2010; Rousseeuw and Leroy, 1987), which belongs to the family of multivariate linear regressions. This technique has been chosen to eliminate outliers and to limit the effect of collinearity between variables—i.e., HWPC events in our case. Unlike least squares estimates for regression models, the robust regression better tolerates non-normal measurement errors, which may happen when combining power measurements and HWPC events.

The computation of the multiple linear regression should balance the gain in terms of estimation error with the cost of including an additional event into the CPU power model. To design the CPU power model as accurately as possible, we consider a training subset Rn of n benchmarks (∀n < |W|, Rn ⊆ W)—composed from those exhibiting the lowest median Pearson coefficients—as input for our regression process. From Rn, we compute a CPU power model for each combination of HWPC events, by taking into account the limited number of events that can be monitored concurrently. For each training set Rn, and from all the computed power models, we only keep the one with the smallest regression error. Finally, we compare the CPU power model obtained for each Rn and we pick the one that minimizes the absolute error between the regression and the remaining benchmarks, which have not been included in the training set (En = W ∖ Rn).

As an illustration, Fig. 4 depicts the distribution of the average error
Table 2
The numbers of generic counters and available events.

Manuf.  CPU         PMU      #Gen.  #Ev.
Intel   Xeon W3520  nhm      4      338
                    nhm_unc  8      176
        i3 2120     snb      4      336

Table 1
Summary of existing CPU power models (per-model CPU, feature(s), regression technique(s), and reported error(s); surveyed models include Bertran et al., 2010; Bircher et al., 2005; Dolz et al., 2015; Yang et al., 2014).

larger error than one that uses a lower number of events. As an example, on the Intel Xeon W3520, the CPU power model composed of 2 events taken from the PMU nhm (see Table 2) emerges from this analysis [...] 1.60 W.

All the above steps allow POWERAPI to compute a CPU power model [...] The resulting CPU power models are then used as input parameters [...] this article, we use a libpfm11 sensor actor on the host nodes to collect the HWPC events. libpfm can list and access available HWPC counters on

11 http://perfmon2.sourceforge.net.
Fig. 3. Pearson coefficients of the Top-30 correlated events for the PARSEC benchmarks on the Intel Xeon W3520.
12 http://www.alciom.com/en/products/powerspy2-en-gb-2.html.
To assess the effectiveness of the robust ridge regression, we inspect the eigenvalues of the corresponding correlation matrix. Very low values (close to zero, 10^-3) in the resulting matrix denote a collinearity between variables. The selected events have eigenvalues of 1.5 and 0.5, confirming the non-collinearity of the HWPC events included in this CPU power model.

CPU power model accuracy
Our approach automatically isolates the idle consumption of the node, whose relationship to TDP is defined in Rivoire et al. (2007) and Noureddine et al. (2014) as $P \simeq P_{idle} + 0.7 \times TDP$. Regarding the dynamic power consumption of the host machine, Fig. 5 reports an average relative error of 1.35% (1.60 W) for the subset of benchmarks that were not included in the sampling set (bodytrack, swaptions, vips, x264). In particular, this figure reports on the comparison between the power estimation produced with POWERAPI and a PowerSpy Bluetooth power meter while running the PARSEC benchmark suite. The software-defined power meter built with POWERAPI has been configured with a frequency of 1 Hz on an Intel Xeon W3520.

To confirm the accuracy of the generated CPU power model, we propose to compare these results with state-of-the-art power models that we can reproduce. In particular, we selected 2 CPU power

System configuration
This server is configured to run a Linux Ubuntu 14.04 (kernel 3.13) with the vanilla configuration.

CPU power model definition
The resulting CPU power model computed by POWERAPI with a training subset, R3, is composed of 2 HWPC events from the PMU snb (e1 = idq:empty 15, e2 = uops_dispatched:stall_cycles 16) and 1 HWPC event from the PMU snb_unc_cbo0 (e3 = unc_clockticks 17):

$P_{idle} = 30\,W$
$P_{active}^{CPU} = \frac{1.12 \cdot e_1}{10^8} + \frac{4.55 \cdot e_2}{10^9} + \frac{6.89 \cdot e_3}{10^{10}}$

CPU power model accuracy
Although it uses a similar configuration and the same subset of the 4 benchmarks described in Section 5.1, the resulting CPU power model strongly differs from the power model computed for the Xeon. Yet, Fig. 7 reports a relative error of 1.57% (0.71 W), on average, which confirms the accuracy of our CPU power models for Intel architectures, beyond the Intel-specific models previously published (Colmant et al., 2015).

Beyond Intel architectures, POWERAPI can also address alternative

13 l1i:reads counts all instruction fetches, including uncacheable fetches that bypass the L1I.
14 lsd:inactive counts all cycles when the loop stream detector emits no μops to the RAT for scheduling.
15 idq:empty counts cycles the Instruction Decode Queue (IDQ) is empty.
16 uops_dispatched:stall_cycles counts the number of cycles no micro-operations (μops) were dispatched to be executed on this thread.
17 unc_clockticks is not officially documented.
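The collinearity check on the eigenvalues of the correlation matrix can be made concrete: for a two-event model, the correlation matrix is [[1, r], [r, 1]], whose eigenvalues are 1 + r and 1 − r, so the reported values of 1.5 and 0.5 correspond to r = 0.5 between the two selected events. A minimal sketch (not the toolkit's implementation):

```python
def correlation_eigenvalues(r):
    # Eigenvalues of the 2x2 correlation matrix [[1, r], [r, 1]];
    # an eigenvalue close to zero reveals (near-)collinear events.
    return 1 + r, 1 - r

eig_hi, eig_lo = correlation_eigenvalues(0.5)  # 1.5 and 0.5, as reported
```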
390
M. Colmant et al. The Journal of Systems & Software 144 (2018) 382–396
Fig. 6. Comparison of the errors exhibited by the power models learned with the approaches presented in Noureddine et al. (2014) (Jalen) and Colmant et al. (2015) (BitWatts), and the one presented in this article (PowerAPI), for the Xeon W3520 processor (Pidle = 92 W).
Fig. 7. Relative error distribution of the PARSEC benchmarks on the i3 processor (Pidle = 30 W).
Fig. 8. Relative error distribution of the PARSEC benchmarks on the Opteron processor (Pidle = 390 W).
CPU processors provided by AMD and ARM.

5.3. AMD Opteron 8354

System configuration
This server is configured to run a Linux Ubuntu 14.04 (kernel 3.13) with the vanilla configuration.

CPU power model definition
The resulting CPU power model computed by POWERAPI from the training subset, R3, is composed of 3 HWPC events from the PMU fam10h_barcelona (e1 = probe:upstream_non_isoc_writes 18, e2 = instruction_cache_fetches 19, e3 = retired_mmx_and_fp_instructions:all 20):

$P_{idle} = 390\,W$
$P_{active}^{CPU} = -\frac{2.63 \cdot e_1}{10^4} + \frac{8.20 \cdot e_2 + 3.16 \cdot e_3}{10^9}$

CPU power model accuracy
Fig. 8 reports a relative error of 0.20% (0.81 W), on average. While most of the works in the state-of-the-art focus on Intel architectures (cf. Table 1), the accuracy of the CPU power model we generate for the AMD Opteron configuration assesses our capability to cover alternative CPU architectures. Given the particularly high value of the idle consumption (390 W) imposed by the infrastructure setup (cluster configuration), this relative error rises to 8.45% (0.77 W) if we compute it by only considering the dynamic power consumption—i.e., by subtracting Pidle from the raw measurements.

While POWERAPI can produce accurate power estimations for power-consuming servers, it can also be used to estimate the power consumption of energy-efficient systems running atop ARM processors.

5.4. ARM Cortex A15

System configuration
Our last configuration is a Jetson Tegra K1 21 with Linux Ubuntu 14.04 (kernel 3.10). The processor has 4 plus 1 cores on its chip, designed and optimized by NVIDIA. The 4 cores have a standard behavior, while the additional core is designed to be energy efficient. These cores are exclusive—i.e., we cannot use both configurations together. By default, the 4 cores are enabled, and only a manual action can put the processor in low power mode. We first use this default behavior for the purpose of validation.

CPU power model definition
The resulting CPU power model computed by POWERAPI from R4 is composed of 3 HWPC events from the PMU arm_ac15 (e1 = cpu_cycles 22, e2 = inst_spec_exec_integer_inst 23, e3 = bus_cycles 24):

$P_{idle} = 3.5\,W$
$P_{active}^{CPU} = \frac{1.18 \cdot e_1}{10^9} + \frac{1.26 \cdot e_2}{10^{10}} + \frac{1.84 \cdot e_3}{10^{11}}$

CPU power model accuracy
Fig. 9 reports a relative error of 2.70% (0.17 W), on average. This error rate has to be balanced with the low idle consumption of this CPU, compared to previous configurations. Nonetheless, this CPU power

18 probe:upstream_non_isoc_writes is not officially documented.
19 instruction_cache_fetches is not officially documented.
20 retired_mmx_and_fp_instructions:all counts different classes of Floating Point (FP) instructions.
21 https://developer.nvidia.com/jetson-tk1.
22 cpu_cycles is not officially documented.
23 inst_spec_exec_integer_inst counts the integer data processing instructions speculatively executed.
24 bus_cycles is not officially documented.
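The gap between the two error figures reported for the Opteron follows directly from its 390 W idle power: the same absolute error is divided by a much smaller denominator once Pidle is subtracted. A sketch with illustrative values (not the paper's measurements):

```python
def relative_error(estimated, measured):
    return abs(estimated - measured) / measured

P_IDLE = 390.0  # watts, as isolated for the Opteron host

# Illustrative sample: ~0.8 W of absolute error on a ~9 W dynamic draw
measured, estimated = 399.1, 399.9
on_total = relative_error(estimated, measured)                      # ~0.2%
on_dynamic = relative_error(estimated - P_IDLE, measured - P_IDLE)  # ~8.8%
```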
Fig. 9. Relative error distribution of the PARSEC benchmarks on the A15 processor (Pidle = 3.5 W).

model demonstrates that POWERAPI can also generate accurate CPU power models for embedded systems, thus going beyond standard server settings.

5.5. Synthesis

From all the above experiments and observations, we can assess that the generated CPU power models perform well on a variety of representative architectures, including Intel, ARM, and AMD. Our solution does not rely on a specific processor extension (e.g., RAPL) and can use specific workloads during the learning phase to build domain-specific CPU power models. On average, our solution exhibits a relative error of 1.5% (0.8 W), not only clearly outperforming the state-of-the-art, but also offering a reproducible approach for comparing CPU power models.

The closest method to ours, described in Economou et al. (2006), builds a CPU power model for the whole testbed system and then uses it inside their software solution, Mantis, for power estimation. Their power model is composed of 4 metrics—the CPU utilization, the memory access count, and the hard disk and network I/O rates—and exhibits an error range between 0% and 15%. However, the key limitations of their solution are that i) they use a predefined set of metrics, which can clearly differ between architectures, ii) they use a heavy subsystem of power planes to get the power consumption of components for their offline power modeling, and iii) their solution produces power estimation only for components, not at the process level.

Feng et al. (2005) and Ge et al. (2010) describe a solution for computing the power profiles of each component of a subsystem. A component power profile corresponds to its power footprint over a given period of time. Moreover, their solution allows them to get additional insights about the software power consumption, as they propose the same kind of profile at the code level. However, this solution tends to be very intrusive, by connecting to the hardware pins to collect the power consumption of each component, and they do not propose a proper way to estimate the power consumption of these components in production.

We strongly believe that our approach is well suited to explore the space of HWPC events made available by the CPU and to profile with accuracy the power consumption drawn by the CPU.

6. Building software-defined power meters

The learning and model generation approaches we introduced in this paper are combined to build accurate CPU power models. All the described CPU power models are built to represent the overall power consumption of a node. They can also be used to produce accurate process-level power estimations when needed, thanks to the different modes exposed by the hardware counters (cf. Sections 6.2 and 6.3). From our CPU power models, we can easily extract the idle power

Can we build CPU power models that better fit specific domains of applications?

In Section 5, we identified applications from the PARSEC benchmark suite as representative workloads for characterizing the power consumption of our processors. In particular, we focused on delivering CPU power models that can estimate the power consumption of a wide diversity of applications. However, if one knows beforehand that a specific type of workload will run on a node, we can also use our approach to derive domain-specific CPU power models. As an example, we use a set of benchmarks from the well-known NAS parallel benchmark (NPB) suite (Bailey et al., 1991) on the ARM Cortex A15 and derive a new power model specifically for this set of applications using our approach described in Section 4.1. NPB was designed to take advantage of highly parallel supercomputers, and the implemented benchmarks thus represent CPU-intensive workloads.

The resulting CPU power model computed by POWERAPI with the lowest average error is composed of 3 HWPC events from the PMU arm_ac15 (e1 = bus_read_access 25, e2 = cpu_cycles 26, e3 = bus_access 27):

$P_{idle} = 3.5\,W$
$P_{active}^{CPU} = -\frac{1.72 \cdot e_1}{10^8} + \frac{1.52 \cdot e_2 - 5.08 \cdot e_3}{10^9}$

In Fig. 10, we depict the results and compare them with the original model derived in Section 5. We can see that a domain-specific model can improve the original one (PARSEC model), with an average relative error of 4% (corresponding to 0.41 W). In comparison, the PARSEC power model has an average relative error of 20% (2.34 W), which shows the benefits of building domain-specific CPU power models. We are thus able to derive accurate CPU power models with our approach despite the diversity of benchmarks. To the best of our knowledge, our solution is the first to be open-source, configurable, and directly usable to build CPU power models.

6.2. Real-time power monitoring

Can we use the derived CPU power models to estimate the power consumption of any workload in real-time?

To further evaluate the applicability of POWERAPI in a real-world and multi-threaded environment, we run the SPECjbb 2013 benchmark (SPEC, 2013). This benchmark implements a supermarket company that handles distributed warehouses, online purchases, as well as high-level management operations (data mining). The benchmark is implemented in Java and consists of controller components for managing the application and backends that perform the actual work. A run takes approximately 45 minutes; it has varying CPU utilization levels and requires at least 2 GB of memory per backend to finish properly.

We use the Intel i3 2120 processor for this experiment with the CPU power model introduced in Section 5.2. Thanks to the capabilities of hardware performance counters, the power models which are generated at the host level can be directly used at the application—or process—level. Fig. 11 illustrates the per-process power consumption,

25 bus_read_access is semantically meaningful.
26 cpu_cycles is semantically meaningful.
27 bus_access is semantically meaningful.
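Process-level usage of a host-level model can be sketched as follows: the dynamic part of the i3 model of Section 5.2 is evaluated on per-process HWPC deltas, while the idle power remains attributed to the host. The helper names and counter values are illustrative, not the toolkit's API:

```python
def i3_active(e1, e2, e3):
    # Dynamic part of the i3 2120 model reported in Section 5.2
    return 1.12 * e1 / 1e8 + 4.55 * e2 / 1e9 + 6.89 * e3 / 1e10

def per_process_power(active_model, counters_by_pid):
    # counters_by_pid: {pid: (e1, e2, e3)} HWPC deltas over one sample
    # period; the idle power is not distributed among processes.
    return {pid: active_model(*deltas) for pid, deltas in counters_by_pid.items()}

watts = per_process_power(i3_active, {4242: (2e8, 5e8, 1e9), 4243: (1e7, 0, 0)})
```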
Fig. 10. Absolute error distribution of the NPB benchmarks on the A15 processor by using the PARSEC and NPB power models (Pidle = 3.5 W).
Fig. 12. Process-level power estimation delivered by POWERAPI in real-time (4 Hz) (Pidle = 92 W).
Fig. 13. Energy consumption of the host by using the 4-plus-1 power profiles on the A15 processor and cg.B (Pidle = 3.5 W).

(2.7 kJ < 4 kJ). The 1-core profile exhibits a low power consumption, but is penalized by the idle power accumulating over time.

In comparison to existing solutions, these power profiles were derived automatically, without a deep expertise of the underlying architectures, and they can be used directly in our middleware toolkit, POWERAPI. Our solution is then able to detect which mode is enabled and to adapt the CPU power model accordingly at runtime for estimating the power consumption of software assets. Our approach therefore captures all the features enabled on the processor to build adjusted CPU power models.

6.5. System impact on CPU power models

Does the CPU power model depend on the underlying operating system or performance profiles?

We already showed that CPU optimizations can have a non-negligible impact on the power consumption. Additionally, several hardware optimizations are controlled by the operating system, such as frequency scaling or hyper-threading.

We now compare the power consumption of 2 popular open source operating systems: Ubuntu and CentOS. Ubuntu is known as user-friendly, with a huge community of users and the philosophy of supporting a wide variety of systems, from servers to mobile devices. CentOS is derived from the open-source version of Red Hat Enterprise Linux (RHEL) and hence targets productivity systems that require a stable and dependable OS. Some very useful tools are available on this system for optimizing the hardware and software.

A version of Ubuntu 14.04 with a Linux kernel 3.13 and a version of CentOS 7 with a Linux kernel 3.10 were installed on the Intel Xeon W3520. We use the benchmark bt from the NPB suite.

For our experiment, we compare 3 CPU power models. One was already presented in Section 5.1, and we use the default settings without specific behavior (U.def).

The second model represents the default CPU settings of CentOS (C.def). The CPU power model is composed of 4 HWPC events from the PMU nhm (e1 = uops_retired:active_cycles 31, e2 = uops_issued:any 32, e3 = ssex_uops_retired:scalar_single 33, e4 = uops_retired:retire_slots 34):

$P_{idle} = 92\,W$
$P_{active}^{CPU} = \frac{2.02 \cdot e_1}{10^8} + \frac{7.76 \cdot e_2 + 4.43 \cdot e_3 + 2.70 \cdot e_4}{10^9}$

The third model covers the performance optimizations provided by CentOS (C.perf). We use the tuned-adm tool for improving performance in specific use cases and for interacting with the power saving mechanisms. This command comes with different tuning server profiles depending on the use of the underlying system and hardware. We use the latency-performance profile, which allows the operating system to lower the latency of the system and thus to increase the performance of a virtual guest; it is thus well suited to reduce swapping when CentOS is installed in a VM.

The CPU power model computed for CentOS with the latency-performance profile is composed of 4 HWPC events from the PMU nhm (e1 = l1d_prefetch:triggers 35, e2 = uops_decoded_dec0 36, e3 = fp_comp_ops_exe:sse_fp_scalar 37, e4 = l1i:reads 38):

$P_{idle} = 125\,W$
$P_{active}^{CPU} = \frac{8.86 \cdot e_1}{10^8} + \frac{7.93 \cdot e_2 + 6.33 \cdot e_3 + 5.38 \cdot e_4}{10^9}$

One can observe that the idle consumption (Pidle) of this power model is higher than the one isolated in the previous power model. This results directly from the choice of selecting the latency-performance profile to keep all the cores active. The operating system implements this policy by disabling the C-states CPU feature, which impacts the idle consumption of the host. While the operating system cannot benefit from C-states to optimize the energy consumption, the induced overhead is therefore associated to the idle consumption of the host, as it cannot be fairly distributed among processes. The average power consumption values reported by the CPU power models are depicted in Fig. 14.

In particular, we compare the duration and the power consumption of each profile while changing the utilization ratio of a core and increasing the number of allocated cores. With default settings, the choice between Ubuntu and CentOS does not impact the power consumption, and neither of them stands out in terms of execution duration. However, more interesting reports are delivered when the latency-performance profile is enabled. Indeed, when one process stresses a full core, the power difference between the default settings and this profile can be greater than 20 W. This difference is due to the idle power, which represents a non-negligible part of the power drawn by this profile. Actually, the latency-performance profile turns all cores of the processor to the C0 state, which means that the cores are always turned on to minimize the wake-up latency.

Moreover, one can see that the activation of the performance profile does not decrease the execution duration of the benchmark. Hence, we can clearly target Ubuntu or CentOS with default settings to get the best compromise between performance and power consumption.

Based on the above experiment, we can reach the following conclusions:

• the operating system is not the major cause of power variations (by comparing U.def and C.def bars),
• the power impact of disabling C-states is non-negligible (by comparing C.def and C.perf bars),
• reducing the computation time does not necessarily improve the power consumption (by comparing the evolution of lines and bars for 75% and 100% CPU load on 1 core).

31 uops_retired:active_cycles counts the number of cycles with μops retired.
32 uops_issued:any counts the number of cycles when μops are issued by the Register Alias Table (RAT) to the Reservation Station (RS).
33 ssex_uops_retired:scalar_single counts Streaming SIMD Extensions (SSE) scalar single-precision FP μops retired.
34 uops_retired:retire_slots counts the number of retirement slots used when μops are retired.
35 l1d_prefetch:triggers counts the number of prefetch requests triggered by the Finite State Machine (FSM) and pushed into the prefetch FIFO.
36 uops_decoded_dec0 counts μops decoded by decoder 0.
37 fp_comp_ops_exe:sse_fp_scalar counts the number of SSE FP scalar μops executed.
38 l1i:reads counts all instruction fetches.
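The energy comparison of Fig. 13 boils down to integrating power over the run duration: a faster, higher-power profile can still consume less energy than a slow, low-power one whose idle power accumulates. A sketch with hypothetical power and duration values (only the 2.7 kJ vs. 4 kJ orders of magnitude are taken from the text):

```python
def energy_kj(avg_power_w, duration_s):
    # E = P * t, converted to kilojoules
    return avg_power_w * duration_s / 1e3

# Hypothetical profiles: the 4-core run finishes fast; the 1-core
# low-power run drags on while idle power keeps accumulating.
four_cores = energy_kj(avg_power_w=9.0, duration_s=300)  # 2.7 kJ
one_core = energy_kj(avg_power_w=5.0, duration_s=800)    # 4.0 kJ
```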
Dolz, M.F., Kunkel, J., Chasapis, K., Catalán, S., 2015. An analytical methodology to derive power models based on hardware and software metrics. Comput. Sci. Res. Dev.
Economou, D., Rivoire, S., Kozyrakis, C., 2006. Full-system power analysis and modeling for server environments. Workshop on Modeling Benchmarking and Simulation.
Fan, X., Weber, W.-D., Barroso, L.A., 2007. Power provisioning for a warehouse-sized computer. Proceedings of the 34th Annual International Symposium on Computer Architecture.
Feng, X., Ge, R., Cameron, K., 2005. Power and energy profiling of scientific applications on distributed systems. Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium.
Flinn, J., Satyanarayanan, M., 1999. PowerScope: a tool for profiling the energy usage of mobile applications. Proceedings of the Second IEEE Workshop on Mobile Computer Systems and Applications.
Ge, R., Feng, X., Song, S., Chang, H.-C., Li, D., Cameron, K., 2010. PowerPack: energy profiling and analysis of high-performance systems and applications. IEEE Trans. Parallel Distrib. Syst.
Havet, A., Schiavoni, V., Felber, P., Colmant, M., Rouvoy, R., Fetzer, C., 2017. GenPack: a generational scheduler for cloud data centers. Proceedings of the 5th IEEE International Conference on Cloud Engineering (IC2E).
Heath, T., Diniz, B., Carrera, E.V., Meira Jr., W., Bianchini, R., 2005. Energy conservation in heterogeneous server clusters. Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
Intel, 2015. Intel 64 and IA-32 Architectures Software Developer's Manual.
Intel, 2015. PowerTOP. URL https://01.org/powertop.
Isci, C., Martonosi, M., 2003. Runtime power monitoring in high-end processors: methodology and empirical data. Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture.
Jia, W., Shaw, K., Martonosi, M., 2012. Stargazer: automated regression-based GPU design space exploration. Performance Analysis of Systems and Software.
Kansal, A., Zhao, F., 2008. Fine-grained energy profiling for power-aware application design. SIGMETRICS Perform. Eval. Rev.
Kansal, A., Zhao, F., Liu, J., Kothari, N., Bhattacharya, A.A., 2010. Virtual machine power metering and provisioning. Proceedings of the 1st ACM Symposium on Cloud Computing.
Krioukov, A., Mohan, P., Alspaugh, S., Keys, L., Culler, D., Katz, R.H., 2010. NapSAC: design and implementation of a power-proportional web cluster. Proceedings of the
Rasmussen, M., 2015. Sched: energy cost model for energy-aware scheduling.
Rivoire, S., Ranganathan, P., Kozyrakis, C., 2008. A comparison of high-level full-system power models. Proceedings of the Conference on Power Aware Computing and Systems.
Rivoire, S., Shah, M.A., Ranganathan, P., Kozyrakis, C., 2007. JouleSort: a balanced energy-efficiency benchmark. Proceedings of the ACM SIGMOD International Conference on Management of Data.
Rousseeuw, P.J., Leroy, A.M., 1987. Robust Regression and Outlier Detection.
Saborido, R., Arnaoudova, V., Beltrame, G., Khomh, F., Antoniol, G., 2015. On the impact of sampling frequency on software energy measurements. PeerJ PrePrints 3, e1219.
Schulte, E., Dorn, J., Harding, S., Forrest, S., Weimer, W., 2014. Post-compiler software optimization for reducing energy. Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems.
SPEC, 2013. SPECjbb2013 Design Document.
Sterling, C., 2013. Energy Consumption tool in Visual Studio 2013.
Tang, G., Jiang, W., Xu, Z., Liu, F., Wu, K., 2015. Zero-cost, fine-grained power monitoring of datacenters using non-intrusive power disaggregation. Proceedings of the 16th Annual Middleware Conference.
Vapnik, V., Golowich, S.E., Smola, A., 1997. Support vector method for function approximation, regression estimation, and signal processing. Advances in Neural Information Processing Systems.
Versick, D., Wassmann, I., Tavangarian, D., 2013. Power consumption estimation of CPU and peripheral components in virtual machines. SIGAPP Appl. Comput. Rev.
Wang, S., Chen, H., Shi, W., 2011. SPAN: a software power analyzer for multicore computer systems. Sustainable Comput.
Yang, H., Zhao, Q., Luan, Z., Qian, D., 2014. iMeter: an integrated VM power model based on performance profiling. Fut. Gener. Comput. Syst.
Yoon, C., Kim, D., Jung, W., Kang, C., Cha, H., 2012. AppScope: application energy metering framework for Android smartphone using kernel activity monitoring. USENIX Annual Technical Conference. 12. pp. 1–14.
Zamani, R., Afsahi, A., 2012. A study of hardware performance monitoring counter selection in power modeling of computing systems. Proceedings of the 2012 International Green Computing Conference.
Zhai, Y., Zhang, X., Eranian, S., Tang, L., Mars, J., 2014. HaPPy: hyperthread-aware power profiling dynamically. Proceedings of the USENIX Annual Technical Conference.
First ACM SIGCOMM Workshop on Green Networking. Zhang, L., Tiwana, B., Qian, Z., Wang, Z., Dick, R.P., Mao, Z.M., Yang, L., 2010. Accurate
Kurpicz, M., Sobe, A., Felber, P., 2014. Using power measurements as a basis for workload online power estimation and automatic battery behavior based power model gen-
placement in heterogeneous multi-cloud environments. Proceedings of the 2nd eration for smartphones. Proceedings of the Eighth IEEE/ACM/IFIP International
International Workshop on CrossCloud Systems. Conference on Hardware/Software Codesign and System Synthesis. ACM, New York,
Landin, P.J., 1966. The next 700 programming languages. Commun. ACM. NY, USA, pp. 105–114. https://doi.org/10.1145/1878961.1878982.
Maxime Colmant holds a PhD in Computer Science from the University of Lille and focuses on the design and implementation of sustainable software systems. As part of his PhD thesis, he designed new methods and tools to measure the power consumption of server-side infrastructures.

Romain Rouvoy is a Professor of Computer Science at the University of Lille, a member of the Inria Spirals project-team, and a Junior Fellow of the Institut Universitaire de France. His research expertise spans the fields of distributed systems and software engineering.

Mascha Kurpicz holds a PhD in Computer Science from the University of Neuchâtel and focuses on the design and implementation of sustainable software systems. As part of her PhD thesis, she worked in the field of energy-efficient cloud computing.

Anita Sobe holds a PhD in Computer Science from Alpen-Adria-Universität Klagenfurt. She worked as a post-doctoral research fellow at the University of Neuchâtel, Switzerland, contributing to the ParaDIME project, which targets software-hardware solutions for energy-efficient data centers.

Pascal Felber is a Professor of Computer Science at the University of Neuchâtel. His research expertise focuses on dependable, concurrent, and distributed computing.

Lionel Seinturier is a Professor of Computer Science at the University of Lille and a member of the Inria Spirals project-team. His research expertise spans the fields of distributed systems and software engineering.