
IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, VOL. 3, NO. 1, MARCH 2013

Synchronous-Logic and Asynchronous-Logic 8051 Microcontroller Cores for Realizing the Internet of Things: A Comparative Study on Dynamic Voltage Scaling and Variation Effects
Kok-Leong Chang, Member, IEEE, Joseph S. Chang, Bah-Hwee Gwee, Senior Member, IEEE, and
Kwen-Siong Chong, Member, IEEE

Manuscript received July 31, 2012; revised October 29, 2012; accepted January 01, 2013. Date of publication February 26, 2013; date of current version March 07, 2013. This paper was recommended by Guest Editor A.-Y. Wu.
K.-L. Chang is with Institute of Materials Research and Engineering (IMRE), A*STAR, Synthesis and Integration Group, 117602, Singapore (e-mail: changkl@imre.a-star.edu.sg).
J. S. Chang is with Nanyang Technological University, School of Electrical and Electronic Engineering, Division of Circuits and Systems, 639798, Singapore (e-mail: ejschang@ntu.edu.sg).
B.-H. Gwee is with Nanyang Technological University, School of Electrical and Electronic Engineering, 639798 Singapore (e-mail: ebhgwee@ntu.edu.sg).
K.-S. Chong is with Temasek Laboratories, Nanyang Technological University, Integrated Systems Research Laboratory, 639798, Singapore (e-mail: kschong@ntu.edu.sg).
Digital Object Identifier 10.1109/JETCAS.2013.2243031
2156-3357/$31.00 © 2013 IEEE

Abstract—Microcontrollers play a vital role in embodying intelligence into battery-powered everyday objects to realize the internet of things (IoT). The desirable attributes of such a microcontroller and the like include high energy and area efficiency, and robust error-free operation under dynamic voltage scaling (DVS), workload, process, voltage, and temperature (PVT) variation effects. In this work, a synchronous-logic (S8051) and a quasi-delay-insensitive asynchronous-logic (A8051) 8051 microcontroller core are designed and fabricated for full-range DVS from nominal to deep sub-threshold. The performance of the S8051 and A8051 is largely comparable at nominal conditions and the entire DVS range, but differs when PVT and workload are varied. At nominal VDD, both the microcontroller cores feature comparable energy and speed, with the electromagnetic interference of the A8051 […] lower and the area […] larger than the S8051. When DVS is applied, both the microcontroller cores feature comparable energy and speed; the S8051 requires simultaneous adjustment of clock frequency with VDD. At wide PVT variations, up to […] delay margins are required for the S8051, whereas the A8051 operates at actual speed. When the workload of both microcontrollers is varied, the A8051 features lower energy dissipation per workload due to the exploitation of its asynchronous-logic protocols. For IoT applications that incur wide PVT and workload variations, the A8051 is more suitable due to its self-timed nature, whereas when PVT and workload variations are less severe, the S8051 is more suitable due to a smaller IC area.

Index Terms—Asynchronous logic, dynamic voltage scaling, microcontrollers, ubiquitous sensors.

I. INTRODUCTION

IN THE future, everyday objects will ubiquitously embody digital sensing, communication, and processing capabilities and will be capable of collecting, processing, and relaying information. This paradigm is known as the internet of things (IoT). The realization of IoT requires everyday objects to be able to harvest/store/use energy. In general, objects are either ac-powered (power consumption not critical), rechargeable (runs for a few days with one recharge), or battery-powered (runs for a few years with standard batteries, or indefinitely on harvested energy). AC-powered or rechargeable objects are currently well-connected to the internet via various well-established wired/wireless protocols. However, these objects, in comparison, play only a very small part in the IoT. A large number of everyday battery-powered objects are energy-constrained and thus challenging for efficient internet connectivity. Nevertheless, these objects are usually low-cost, purpose-specific, and highly suitable for crowd sourcing. Connecting these objects to the internet not only completes the IoT framework but also serves as an enabler for applications in remote health-care, remote monitoring, smart transport, and logistics.

Embodying internet/networking capabilities on everyday battery-powered objects challenges circuit and system designers in all aspects of software and hardware microcontroller design [1]. Designing microcontrollers for use in battery-powered objects is challenging primarily due to the unpredictable quantity and quality of the energy source. The quantity of energy is unpredictable due to the energy-constrained operation modality of such objects, e.g., reliability/capacity of standard batteries, and sparse availability of solar, temperature, piezoelectric, radio-frequency energy, etc. The quality of energy is also unpredictable due to the highly variable nature of such sources of energy. Hence, the desirable attributes of such a microcontroller and the like include high speed, high energy efficiency, small integrated circuit (IC) area, and error-free robust operation at wide operation space (dynamic voltage scaling (DVS) and workload) and wide variation space [process, voltage, and temperature (PVT)].

At the system-level, there is a choice between two general digital-logic design philosophies, the prevalent conventional (clocked) synchronous-logic (sync) approach [2] and the (clockless) asynchronous-logic (async) approach [3], as well as the hybrid globally async locally sync (GALS) approach [4]. In sync systems, the operations between circuit modules are synchronized according to a global timing reference (the global clock or sub-multiple clocks thereof) distributed via a clock infrastructure. The clock frequency is ascertained such that the delay of every
operation is within the period of said clock, and the critical path or slowest module typically defines the maximum clock frequency. For the sake of error-free robust operation, the maximum clock frequency is limited by the worst-case condition in the entire specified operation and variation space. It is well established that in order to optimize energy and speed, the clock frequency should be adjusted accordingly to match the delay of the critical path [5] as the operating condition changes within the specified operation and variation space, particularly at nominal and best-case conditions thereof. The async approach, on the other hand, employs distributed local timing references instead of a global timing reference. Async systems, particularly quasi-delay-insensitive (QDI) async [6], are self-timed where local async circuit modules independently operate at their maximum (actual) speed [7] in the specified operation and variation space, whereas their sync counterparts require clock frequency adjustments to achieve the same.

In many applications, there are instances of the need for both high and modest computation speeds. An established and well-adopted method to increase the energy efficiency (hence reduce power dissipation) of these applications is DVS where the supply voltage, VDD, is at nominal (high) voltage for the former instances and reduced (near-threshold and sub-threshold voltages) for the latter instances, thereby facilitating a trade-off between energy and speed: from high energy, high speed (nominal voltage), through low energy, medium speed (near-threshold voltage), to highest energy efficiency, low speed (sub-threshold voltage). Other methods include clock gating, power gating, etc. One possible application is the energy-critical hearing aid [8], [9] where there are instances of high computation (e.g., noise reduction [10] and microphone directivity) and low computation (quiet conditions [11]).

For sync systems, this method is appropriately known as dynamic voltage and frequency scaling (DVFS) [12], [13] where DVFS simultaneously scales the clock frequency (speed) with respect to VDD (the clock frequency is reduced for lower VDD) to accommodate the ensuing increased delay of the critical path(s). Correlating the clock frequency (delays) with VDD typically involves firmware programming by means of post-silicon timing characterization at every possible VDD in the DVFS [14]. Further, delay margins are added to the ascertained clock frequency at every possible VDD to accommodate the worst-case condition in the specified PVT variation space. Delay margins incur both energy and speed overheads, particularly at the nominal and best-case conditions in the specified PVT variation space. It is generally accepted that because variations (due to PVT) increase as VDD reduces (especially at sub-threshold voltages vis-à-vis at nominal voltage) and as the minimum feature size of the fabrication technology scales down [15], delay margins would need to be even more conservative. For example, at sub-threshold voltages for 130-nm complementary metal–oxide–semiconductor (CMOS), a delay margin of […] [14] is required to accommodate variations due solely to process (in PVT variations); a larger delay margin is not excessive in view of the other variations (voltage and temperature in PVT variations) [1], [16]. It is worth noting that when the specified operation and variation space is limited, statistical static timing analysis (SSTA) can be employed to significantly reduce delay margins of sync circuits. Nevertheless, when the entire operation and variation space is specified, SSTA and the conventional approaches are comparable.

In the case of QDI async systems, DVS simply involves varying VDD and the circuits therein innately operate at their maximum (actual) speed. Put simply, the QDI async system innately accommodates the change in the operation space and variation space to achieve error-free operation.

From a system's viewpoint, it is essential to identify a system (conventional sync or QDI async) for a specific operation and variation space. In [17], a comparison between a sync and an async 8051 microcontroller is reported. However, the microcontrollers are fabricated on different dies (embodying different memories) and are compared at the nominal supply voltage with no variations considered. In [18], a synchronous microcontroller is desynchronized to an asynchronous equivalent; however, the microcontrollers are similarly fabricated on different dies, and comparisons are done at nominal supply voltage with no variations considered. To the best of our knowledge, there has not hitherto been a direct comparison between the conventional sync system and the QDI async system to delineate and compare their attributes at wide operation and variation space, and of particular interest, where the range of the DVS is full-range, from nominal voltage to sub-threshold voltage. Put differently, at this juncture, despite the maturity of the conventional sync and QDI async design philosophies and reported designs, the question of which design philosophy is advantageous for full DVS remains somewhat contentious, noting that the availability of EDA tools, test and verification methodologies and general acceptance are also important considerations in this question.

In this paper, we attempt to compare the speed, energy and IC area at wide operation and variation space between a sync and an async system: a sync 8051 (S8051) and an async 8051 (A8051). To enable the comparison to be as equitable as possible, they are both fabricated on the same die using 130 nm CMOS, and are based on the same standard library cells and shared standard library memories; clock-gating is applied on the S8051 where possible. Both the S8051 and A8051 are designed for low to mid speed (50 MHz), and are realized with static-logic gates for robustness for full-range DVS; the reasons for adopting static-logic (vis-à-vis dynamic-logic) are delineated in Section II. There is no deliberate attempt to optimize various cell designs by means of custom designs. On the basis of simulations and measurements on prototype ICs, the S8051 and A8051 are compared at wide operation and variation spaces.

The energy and speed of the S8051 and A8051 are largely comparable at nominal conditions and the entire DVS range, but differ when PVT and workload are varied. At nominal VDD, both the microcontroller cores feature comparable energy and speed, with the electromagnetic interference of the A8051 […] lower and the area […] larger than the S8051. When DVS is applied, both the microcontroller cores feature comparable energy and speed; the S8051 requires simultaneous adjustment of clock frequency with VDD. At wide PVT variations, up to […] delay margins are required for the S8051, whereas the A8051 operates at its actual speed. When the workload of both
microcontrollers is varied, the A8051 features lower energy dissipation per workload due to the exploitation of its async-logic protocols.

This paper is organized in the following fashion. A succinct review of async approaches is presented in Section II. The architecture of the S8051 and the proposed A8051 are presented in Section III. Section IV presents the simulation and measurement results on prototype ICs and benchmarking. Conclusions are drawn in Section V.

II. REVIEW OF ASYNCHRONOUS-LOGIC APPROACHES

In general, there are four async design philosophies [19]–[23]: the matched-delay approach (also known as bundled-data, assumes correlation between circuit delays and independently generated delays), the speed-independent approach (assumes arbitrary delays for circuits and all wire forks are isochronic [24]), the QDI approach (assumes arbitrary delays for circuits and some wire forks are isochronic), and the delay-insensitive approach (assumes arbitrary delays for circuits and wires). Isochronic wire forks are wire fanouts that have matching signal transitions at all ends of the fanout. The QDI async approach offers the best compromise between circuit complexity and robustness in view of the state-of-the-art fabrication processes and DVS operation [20].

The QDI async approach is generally based on one of two broad logic implementations: dynamic-logic or static-logic. In dynamic-logic circuits, keepers are usually employed to maintain voltages in wires that are occasionally tri-stated. On the other hand, in static-logic circuits, voltages in wires are always maintained by either inputs or combinational feedbacks. Keepers that are appropriately sized (for speed and energy) to operate at nominal voltage may fail to operate at lower voltages. Conversely, keepers that are appropriately sized to operate at sub-threshold voltages suffer from significant speed and energy penalties when operating in the near-threshold and nominal voltage regions. Fig. 1(a) and (b), respectively, depict the energy dissipation and speed of four async 8051 arithmetic logic units (ALUs) at the typical process corner and […]: an async 8051 ALU implemented using static-logic gates, and three async 8051 ALUs implemented using dynamic-logic gates with the ratio of the strength of the keeper and the strength of the critical path KCR = […], 0.2, and 0.5 (i.e., the larger the ratio, the larger the keeper); for ease of comparison, the results are normalized to the static-logic async 8051 ALU @500 mV. In the near-threshold voltage region […], the dynamic-logic async 8051 ALU with KCR = […] is faster, but dissipates higher energy than the static-logic async 8051 ALU. At this KCR, the minimum VDD of the dynamic-logic async 8051 ALU is limited, […]. This is because at lower voltages, the […] reduces significantly, and KCR needs to be increased for keepers to function robustly. Specifically, for the dynamic-logic async 8051 ALU to operate in the sub-threshold voltage region [e.g., at the minimum energy point, […] (see Section IV-B)], it is necessary that […]. At […], the dynamic-logic async 8051 ALU with KCR = […] is slower and dissipates higher energy than the static-logic async 8051 ALU. Further, when the dynamic-logic async 8051 ALU operates at the near-threshold region […], it dissipates higher energy than the static-logic async 8051 ALU. For completeness, the dynamic-logic async 8051 ALU with KCR = […] is operable down to 100 mV, at significant energy and speed overheads.

Fig. 1. Simulated (a) energy and (b) speed of the static-logic async 8051 ALU and the dynamic-logic async 8051 ALUs implemented using KCR = […], 0.2, and 0.5, normalized with respect to the static-logic async 8051 ALU @500 mV. KCR is the ratio of the strength of the keeper and the strength of the critical path.

In short, dynamic-logic sized at above-threshold is faster, but dissipates higher energy than static-logic and is not robust at sub-threshold voltages. Dynamic-logic sized for robustness at sub-threshold voltages is slower and dissipates higher energy than static-logic for the entire DVS range. The static-logic approach, on the other hand, is ratioless and hence appropriate for full range DVS [25].

To further qualify the robustness of static-logic for full-range DVS, consider the noise margin of standard sized static-logic gates with various fan-in; dynamic-logic circuits are not considered here due to their lower robustness at low voltages. Nevertheless, for completeness, the noise margin of dynamic-logic circuits is reported in [25]. The static-logic gates consist of NOT (fan-in […]), NOR […], and NAND […] gates. Fig. 2(a) and (b), respectively, depict the […] logic-low (NML) and logic-high (NMH) noise margins [26] at the typical process corner and 25 °C, normalized to VDD (and expressed as a percentage thereof). It can be seen, as expected, that higher fan-in gates have lower noise margins, and at very low voltages, the noise margins are unacceptable (negative values). For a minimum NML and NMH of 10% (depicted in Fig. 2(a) and (b) with a horizontal dashed line) [27], employing static-logic gates with […] features a minimum operating VDD of 180 mV [operable at the minimum energy point, […] (see Section IV-B)]. Further, to operate at lower voltages with the same minimum NML and NMH, gates of lower fan-in should be used, e.g., employing static-logic gates with […] features a minimum operating VDD of 115 mV.

Fig. 2. Simulated (a) logic-low noise margin (NML) and (b) logic-high noise margin (NMH) of static-logic gates with fan-in of 1–4 for VDD = […] to 500 mV. NML and NMH are normalized to VDD and expressed as a percentage thereof.

III. ARCHITECTURE OF THE S8051 AND A8051

This section first succinctly reviews the general architecture of the S8051 and of the proposed A8051, and serves as preamble to the detailed design of the latter.

Fig. 3 depicts the architecture of S8051, the proposed A8051, and the shared embedded 1 kB ROM (read-only memory for program), 128 B RAM (random-access memory for data), and 1 kB XRAM (external random-access memory for data). The S8051 and A8051 also share three groups of inputs and outputs: the control I/O (main control signals), program I/O (programmable and debug signals), and general-purpose I/O (general-purpose input/output signals). They also share two controller blocks: the Prog Port Control (programmable port controller) and the GP Port Control (general-purpose port controller). Prog Port Control allows the initialization of the on-chip ROM by means of an off-chip mnemonic programmer (through the program I/O) during power-up. GP Port Control serves as an interface controller between the […] and […] (I/O of S8051 and A8051, respectively), and the general-purpose I/O [[…], which comprises […], […], […], and […] (where […] is 0–3)].

Table I tabulates the operation modes of the S8051, A8051 and their shared memories. The master reset (RSTN), program enable (PROGN), microcontroller core selector ([…]), and external interrupt (INTN) inputs apply to both S8051 and A8051. The active low input PROGN disables both the S8051 and A8051, and allows the ROM to be programmed via the Prog Port Control block by means of the program I/O. The active high (active low) input […] activates the […]. The active low input INTN triggers the interrupt system of the S8051 and A8051.

A. Synchronous Microcontroller Core—S8051

The design of sync microcontrollers, including the S8051, is mature and extensively reported in literature [28]. The design of S8051 herein is based on the technology-independent Synopsys microcontroller core macro cell and synthesized for low-mid speed (50 MHz). The S8051 is based on standard library cells and clock-gating is employed where necessary. The standard library cells are based on static-logic gates; this approach is adopted together with the limit on the fan-in for each gate to […] [25]. The reasons for employing static-logic and limiting fan-in have been delineated in Section II earlier. The microcontroller core features four clocks per instruction cycle and 1–5 instruction cycles per instruction (i.e., 4–20 clocks per instruction). This synthesized design, being a practical design, embodies 30% delay margins to accommodate modest PVT variations at nominal voltage operation.
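The S8051 timing figures quoted above can be turned into a small worked calculation. The sketch below is illustrative only: it assumes the core is clocked at the stated 50 MHz design speed with no stalls, and it assumes that a "delay margin" means the clock period equals the critical-path delay padded by that fraction. The function names `mips_bounds` and `clock_with_margin` are hypothetical and do not come from the paper.

```python
def mips_bounds(clock_hz, clocks_per_cycle=4, cycles_per_instr=(1, 5)):
    """Return (lowest, highest) instruction rate in MIPS for a fixed clock.

    Four clocks per instruction cycle and 1-5 instruction cycles per
    instruction give 4-20 clocks per instruction, as stated in the text.
    """
    fewest, most = cycles_per_instr
    highest = clock_hz / (clocks_per_cycle * fewest) / 1e6
    lowest = clock_hz / (clocks_per_cycle * most) / 1e6
    return lowest, highest


def clock_with_margin(critical_path_s, margin=0.30):
    """Clock frequency in Hz when the period is the critical-path delay
    padded by `margin` (an assumed reading of "30% delay margin")."""
    return 1.0 / (critical_path_s * (1.0 + margin))


lowest, highest = mips_bounds(50e6)
print(f"{lowest:.1f} to {highest:.1f} MIPS at 50 MHz")
```

Under these assumptions, a critical path that could support a 65 MHz clock without margin would be clocked at 50 MHz once padded by 30%, which illustrates the speed overhead that delay margins incur at nominal conditions.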

Fig. 3. Block diagram of the QDI async 8051 microcontroller (A8051), the sync 8051 microcontroller (S8051), and the shared blocks.

B. Asynchronous Microcontroller Core—A8051

This section first presents the general architecture of the A8051, and thereafter, the A8051 general-purpose I/O that exploits the modality of the local handshake protocols therein.

The A8051 is implemented using null convention logic (NCL) gates [29] based on the same standard library cells as the S8051, and their shared (with the S8051) ROM, RAM, and XRAM are implemented using standard library memories; as delineated earlier, standard library cells are used to make the comparison with the S8051 as equitable as possible. Using standard library cells is potentially disadvantageous for the A8051 because standard library cells are highly optimized for sync. Nevertheless, this work attempts to make an equitable sync and async comparison by means of standard library cells to demonstrate the viability of both approaches, especially when operation and variation spaces are considered.

Fig. 4 depicts the block diagram of the A8051, which is in part specified using Balsa [30], and in part [flow controller (FCont) block and memory controller (MemCont) block] handcrafted; Balsa is an async behavioral synthesis EDA tool based on the syntax-directed translation approach. The syntax-directed translation approach translates every software code into an equivalent circuit. In the case of Balsa, the circuits are based on standard library cells, and as delineated earlier, the generated designs may be inefficient [31], [32]. The FCont block synchronizes the sync control I/O with the A8051 via the ARSTN (async reset) and AINTN (async interrupt) signals, while the MemCont block synchronizes the sync embedded memories with the A8051 via the AROM (async ROM), ARAM (async RAM), and AXRAM (async XRAM) signals.

Both the S8051 and A8051 are designed as two-stage pipeline systems. The instruction fetch (IF) block forms the first pipeline stage and manages the fetching and grouping
of instructions, including the handling of exceptions (e.g., initialization, interrupts, and branching). This stage includes IF, FCont, instruction pointer (IP), instruction pointer arithmetic unit (IPAU), and MemCont blocks. The decode and execute (D&X) block forms the second pipeline stage and manages the fetching of operands, execution of operation, and writing back the results. This stage includes the D&X, register file (ReF), ALU, and MemCont blocks.

TABLE I
OPERATION MODES OF THE S8051, A8051 AND THEIR SHARED BLOCKS

Fig. 4. Block diagram of the async 8051 microcontroller core, A8051.

The general architecture of the A8051 described until now is comparable to the S8051, with the exception of the general-purpose I/O block that is designed slightly differently to exploit the potential advantages of the async protocol. The general-purpose I/O block will now be delineated in turn.

Workload is defined as the minimum required computation speed (MIPS at a given VDD that satisfies a given computation task) divided by the maximum speed at the given VDD. It is well established that speed reduces as VDD reduces, and thus for a given computation speed requirement, the workload increases as VDD decreases. Generally, for efficient DVS operation, operating at 100% workload features minimum VDD (and energy) for a given computation speed requirement. As VDD is reduced, energy and speed reduce until the minimum energy point, where reducing VDD further reduces speed but increases energy due to leakage. Intuitively, DVS should not operate below the minimum energy point. However, in situations when the required computation speed is modest, and when the speed at the minimum energy point is still higher (i.e., workload at the minimum energy point is <100%), applying DVS alone is insufficient. Consider now the exploitation of async protocols for the A8051 general-purpose I/O ports to stall the microcontroller under modest computation requirements so as to operate at 100% workload.

Fig. 5. (a) Block diagram of a sensor application with an 8051 microcontroller, and the associated (b) C program and (c) 8051 program for the 8051 microcontroller after compilation.

Fig. 5(a) depicts the block diagram of an acoustic sensor application embodying an 8051 microcontroller. The acoustic sensor consists of a microphone block, analog-to-digital converter (ADC) block, detector block, and an 8051 block (microcontroller). The 8051 block waits for the assertion of the Start signal, and then samples ADCOut from the ADC block and initiates the digital signal processing algorithm DoDSP; refer to Fig. 5(b). The Complete signal asserts at the end of one cycle and the cycle repeats for the next Start signal. Fig. 5(b) and (c), respectively, depict the C program and corresponding compiled 8051 program for the microcontroller. The conditional while loop in the C program translates to a conditional branch instruction that polls for the assertion of the Start signal. In situations when the microcontroller executes every loop faster than the assertion of the Start signal, the speed of the microcontroller is higher than the required speed (workload is <100%), and thus operating the microcontroller at a lower VDD is advantageous. Nevertheless, as the frequency of the Start signal reduces further, VDD approaches the minimum energy point and energy dissipation increases thereafter. Further (dynamic) energy reduction can be achieved at the minimum energy point by stalling the microcontroller: the IF block stalls the fetching of the conditional branch instruction until the Start signal asserts.
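The workload definition and the limit of DVS at the minimum energy point can be illustrated with a short sketch. Everything numeric here is a hypothetical assumption for illustration: the `SPEED` table of maximum MIPS per VDD, the `MIN_ENERGY_VDD` value, and the function name `choose_vdd` are not taken from the paper or its measurements. The sketch picks the lowest VDD, at or above the minimum energy point, that still meets a required speed, and reports the resulting workload.

```python
def workload(required_mips, max_mips):
    """Workload = required computation speed / maximum speed at the given VDD."""
    return required_mips / max_mips


def choose_vdd(required_mips, max_mips_by_vdd, min_energy_vdd):
    """Pick the lowest VDD that meets the speed requirement without going
    below the minimum energy point; return (vdd, workload at that vdd)."""
    for vdd in sorted(max_mips_by_vdd):
        if vdd < min_energy_vdd:
            continue  # below the minimum energy point leakage dominates
        if max_mips_by_vdd[vdd] >= required_mips:
            return vdd, workload(required_mips, max_mips_by_vdd[vdd])
    raise ValueError("requirement exceeds the speed at the highest VDD")


# Hypothetical characterization data: VDD in volts -> maximum speed in MIPS.
SPEED = {0.2: 0.0625, 0.4: 0.5, 0.8: 4.0, 1.2: 8.0}
MIN_ENERGY_VDD = 0.2  # hypothetical minimum energy point

for need in (4.0, 0.5, 0.015625):
    vdd, w = choose_vdd(need, SPEED, MIN_ENERGY_VDD)
    note = "DVS alone suffices" if w >= 1.0 else "idle time left: poll or stall"
    print(f"need {need} MIPS -> VDD {vdd} V, workload {w:.0%} ({note})")
```

In the last case the requirement is below the speed available at the assumed minimum energy point, so the workload stays under 100% however VDD is chosen. This is the situation in which the text proposes stalling the instruction fetch rather than polling.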

Fig. 6. General-purpose I/O configuration for (a) S8051, (b) A8051 active mode, and (c) A8051 passive mode. The timing diagram of (d) S8051, (e) A8051 active mode, and (f) A8051 passive mode. The state transition diagram for the general-purpose I/O for (g) S8051, (h) A8051 active mode, and (i) A8051 passive mode.

Fig. 7. Chip microphotograph of the S8051, A8051, and their shared memory blocks.

The general-purpose I/O block configurations for the S8051 and the A8051 are, respectively, depicted in Fig. 6(a)–(c). Fig. 6(a) depicts the S8051 general-purpose I/O block configuration where the Start and ADCOut signals are connected to […], and the Complete signal is connected to […]. A similar (to the S8051) A8051 configuration of the general-purpose I/O block is depicted in Fig. 6(b) where the Start and ADCOut signals are also connected to […], and the Complete signal is connected to […]. In order to leave the initiation of data transfer on […] to the microcontroller, the […] and […] handshake signals are connected together. This configuration of the A8051 general-purpose I/O block is termed the active mode. A different configuration of the A8051 general-purpose I/O block is depicted in Fig. 6(c) where the ADCOut signal is connected to […], and the Complete signal is connected to […]. In order to leave the initiation of data transfer on […] to the Start signal, the […] and Start signals are joined by means of a Muller C-element with the output connected to the […] handshake signal. This configuration of the A8051 general-purpose I/O block is termed the passive mode.

Fig. 6(d) and (e), respectively, depict the main signals of the acoustic sensor using S8051 and A8051 (general-purpose I/O block in the active mode), and Fig. 6(g) and (h), respectively, depict their state transition diagrams. It can be observed from their state transition diagrams that the S8051 and A8051 (general-purpose I/O block in the active mode) perform iterative conditional branching to poll the assertion of Start after the de-assertion of the Complete signal. Expectedly, both microcontrollers are always in the active state and the associated energy dissipation indicates energy is continuously expended. Fig. 6(f) depicts the main signals of the acoustic sensor using A8051 (general-purpose I/O block in the passive mode), and Fig. 6(i) depicts its state transition diagram. It can be observed that the A8051 (general-purpose I/O block in the passive mode) stalls whenever Start is not asserted after the de-assertion of the Complete signal. Expectedly, the A8051 is in the idling state and the associated energy dissipation indicates energy is only dissipated between the assertion of the Start and de-assertion of the Complete signals. The Start signal stalls the IF block by stalling the handshake sequence: […], which in turn stalls the IF block via the GP Port Control block, ReF block, and D&X block (see Figs. 3 and 4). Similar approaches have been reported
TABLE II
MEASURED EPI AND MIPS OF THE AND THE PROPOSED AT NOMINAL OPERATING CONDITION

Fig. 8. Measured power spectrum from 0 Hz to 1 GHz of (a) and (b) .

for the exploitation of async protocols for stalling operations in a microprocessor [33], [34]. Nevertheless, this work extends the energy efficiency of an async microcontroller beyond the minimum energy point in DVS.

IV. BENCHMARKING THE S8051 AND A8051 CORES

In this section, we will first compare the S8051 and A8051 at the nominal operating condition ( , , full workload and no PVT variations). The figures-of-merit (FOM) for the nominal operating condition include energy per instruction (EPI) and millions of instructions per second (MIPS). EPI delineates the energy dissipated for executing an instruction, and MIPS delineates the instruction execution rate. Second, we will compare the EPI and MIPS of the S8051 and A8051 at wide operation space and variation space (PVT). Third, we compare the energy per computation (EPC) of the S8051 and A8051 at wide operation space (workload). EPC is employed to delineate energy dissipation at different workloads because it takes into consideration energy expended due to execution of redundant instructions, if any. All comparisons in the first and second parts are based on comparable S8051 and A8051, where the general-purpose I/O operates in the active mode. The comparison in the third part includes the A8051 with the general-purpose I/O operating in the passive mode to demonstrate the potential advantages of the async protocol.

The S8051, A8051 and their shared memory blocks are realized using 130-nm CMOS and the chip microphotograph of one of the 40 prototypes is shown in Fig. 7. Collectively, they occupy 4.12 mm², with the A8051 ( mm²) occupying the area of the S8051 ( mm²), and this is largely due to the dual-rail-encoded QDI async (and based on standard library cells); this area overhead can be mitigated (to [6]) with custom cells.

A. Performance at Nominal Operating Condition

Six benchmark programs are used to evaluate the EPI and MIPS of the S8051 and A8051: arithmetic, logical, data transfer, Boolean variable, branching, and Dhrystone v2.1. The first five benchmark programs each evaluate the performance of one particular instruction type, whereas the last evaluates the average performance. The average measurement results on 40 prototype ICs are tabulated in Table II and, for ease of comparison, the results are normalized to the and shown in parentheses. Based on Dhrystone v2.1, the performance of the S8051 and A8051 is comparable: the MIPS is 10% higher and EPI is 10% lower than the .

Fig. 8(a) and (b), respectively, depict the power spectrum (0 Hz–1 GHz) of the S8051 and A8051. The 50-MHz clock of the S8051 causes peaks at its harmonic frequencies, and the highest peak is at 400 MHz. In comparison and not unexpectedly, the A8051 features a more evenly distributed power spectrum and the highest peak at 330 MHz is lower than the S8051. A low and evenly distributed power spectrum is often desirable in many applications, including sensitive ubiquitous RF transceivers.

B. Performance at Wide Operation Space and Variation Space (PVT)

Consider now the performance at wide operation space where V_DD for the S8051 and A8051 is varied from deep sub-threshold (at 100 mV) to nominal (at 1.2 V), at full workload and ; for shared blocks (e.g., memories), V_DD is maintained at nominal and for all blocks, no PVT variations are assumed. Fig. 9(a) and (b), respectively, depict the measured

Fig. 9. Measured (a) EPI and (b) MIPS performance when V_DD changes from 100 mV to 1.2 V.

Fig. 10. Simulated (a) EPI and (b) MIPS performance when , normalized to the A8051 at .

EPI and MIPS of the S8051 and A8051. From the S8051 DVFS and A8051 DVS lines, their EPI and MIPS are comparable. However, note that the EPI for S8051 is only comparable to A8051 when it is operating at minimum V_DD at all clock frequencies. The A8051 is operable from 100 mV to 1.2 V (full-range DVS) while the S8051 is operable in the same variation space with clock frequency reduced to 57 kHz to operate at the minimum energy point , and to 3 kHz to operate at (full-range DVFS); is operable down to due to the 30% delay margins already added for practicality during synthesis. For completeness, the EPI and MIPS trajectories for 57 kHz and 3 kHz operation for S8051 are plotted in Fig. 9(a) and (b), respectively. Interestingly, both the S8051 and A8051 that are designed for minimum V_DD of 180 mV (see Section II) are operable at 100 mV; nonetheless, exhaustive benchmarks are required for reliable operation.

Consider now the condition where (minimum energy point), and full workload, and at wide variation (PVT) space: , and to ; for shared blocks (e.g., memories), V_DD is maintained at nominal and no PVT variations are assumed. The variation spaces are introduced by means of simulations because it is challenging to introduce precise variations into the prototypes. The S8051 is clocked at 57 kHz for comparable EPI to A8051. Fig. 10(a) and (b), respectively, depict the simulated EPI and MIPS of the S8051 and A8051 when ; and for ease of comparison, the results are normalized to the A8051 at . The A8051 innately accommodates the variation space, whereas S8051 requires a delay margin to achieve the same. The delay margin translates to lower speed, and higher energy dissipation compared to the A8051 when conditions are nominal . For

Fig. 11. Simulated (a) EPI and (b) MIPS performance when , normalized to the A8051 at .

Fig. 12. Simulated (a) EPI and (b) MIPS performance when to , normalized to the A8051 at .

completeness, the EPI and MIPS for the are depicted. At this clock frequency, the S8051 fails to operate below .

Fig. 11(a) and (b), respectively, depict the simulated EPI and MIPS of the S8051 and A8051 when ; for ease of comparison, the results are normalized to the A8051 at . As before, the A8051 innately accommodates the variation space, whereas S8051 requires a delay margin to achieve the same. The delay margin translates to lower speed, and higher energy dissipation when conditions are nominal . For completeness, the EPI and MIPS for the are depicted. At this clock frequency, the S8051 is operable down to .

Fig. 12(a) and (b), respectively, depict the simulated EPI and MIPS of the S8051 and A8051 for to ; for ease of comparison, the results are normalized to the A8051 at . The A8051 innately accommodates the variation space, whereas S8051 requires a delay margin to achieve the same. The delay margin translates to lower speed, and higher energy dissipation when conditions are nominal . For completeness, the is operable down to .

Overall, the S8051 requires a delay margin to operate at the minimum energy point (250 mV) and over the combined variation space of , and to . The delay margin accounts for worst-case conditions in the operation and variation (PVT) space, but not unexpectedly, results in worst-case speeds and incurs excess energy due to leakage currents at nominal conditions.

C. Performance at Wide Operation Space (Workload)

Consider now the performance at wide operation space where the required computation speed is varied from 10% to 100%, at and ; no PVT variations are assumed. Fig. 13 depicts the measured EPC of the S8051 and A8051. The required computation speed is normalized to the computation speed of S8051 at 100% workload and expressed as a

percentage thereof. It is observed that as the required computation speed reduces, the EPC of both the S8051 and A8051 (general-purpose I/O in the active mode) increases and this is largely due to the energy expended in the iterative execution of conditional branch instructions (dynamic) and leakage (static). On the other hand, as the required computation speed reduces, the EPC of the A8051 with the general-purpose I/O in the passive mode increases only due to leakage. As delineated earlier, this is because the A8051 with the general-purpose I/O in the passive mode stalls the IF block when necessary, thereby maintaining 100% workload, and reducing dynamic energy dissipation as the required computation speed reduces. Consequently, at 10% required computation speed, the EPC of A8051 with its general-purpose I/O in the passive mode is lower than the .

Fig. 13. Measured EPC of the S8051 and the A8051. The required computation speed is normalized to the computation speed of S8051 at 100% workload and expressed as a percentage thereof.

V. CONCLUSION

We have compared two 8051 microcontroller cores for IoT applications: one based on the conventional sync (S8051) and the other based on QDI async (A8051). For the sake of equitable comparison, they both embody standard library cells and standard library memories, and are fabricated on the same die. For IoT applications that are less prone to PVT and workload variations, the energy and speed of the S8051 and A8051 are largely comparable at nominal conditions and the entire DVS (or DVFS) range, rendering the A8051 less attractive due to its larger IC area. For IoT applications that are more prone to PVT and workload variations, the S8051 requires , and delay margins for process , voltage and temperature ( to ) variations, thereby making the A8051 more advantageous despite its larger IC area.

REFERENCES

[1] R. D. Jorgenson, L. Sorensen, D. Leet, M. S. Hagedorn, D. R. Lamb, T. H. Friddell, and W. P. Snapp, “Ultralow-Power operation in subthreshold regimes applying clockless logic,” Proc. IEEE, vol. 98, no. 2, pp. 299–314, Feb. 2010.
[2] W. Keister, A. E. Ritchie, and S. H. Washburn, The Design of Switching Circuits. New York: Van Nostrand, 1951.
[3] C. J. Myers, Asynchronous Circuit Design. New York: Wiley, 2001.
[4] K.-S. Chong, K.-L. Chang, B.-H. Gwee, and J. S. Chang, “Synchronous-Logic and globally-asynchronous-locally-synchronous (GALS) acoustic digital signal processors,” IEEE J. Solid-State Circuits, vol. 47, no. 3, pp. 769–780, Mar. 2012.
[5] Y. Ikenaga, M. Nomura, Y. Nakazawa, and Y. Hagihara, “A circuit for determining the optimal supply voltage to minimize energy consumption in LSI circuit operations,” IEEE J. Solid-State Circuits, vol. 43, no. 4, pp. 911–918, Apr. 2008.
[6] A. J. Martin and M. Nystrom, “Asynchronous techniques for system-on-Chip design,” Proc. IEEE, vol. 94, no. 6, pp. 1089–1120, Jun. 2006.
[7] J. Wuu, D. Weiss, C. Morganti, and M. Dreesen, “The asynchronous 24 MB on-chip level-3 cache for a dual-core itanium-family processor,” in Proc. IEEE Int. Solid-State Circuits Conf., 2005, vol. 1, pp. 488–612.
[8] J. S. Chang, M.-T. Tan, Z. Cheng, and Y.-C. Tong, “Analysis and design of power efficient class D amplifier output stages,” IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 47, no. 6, pp. 897–902, Jun. 2000.
[9] J. S. Chang and Y. C. Tong, “A micropower-compatible time-multiplexed SC speech spectrum analyzer design,” IEEE J. Solid-State Circuits, vol. 28, no. 1, pp. 40–48, Jan. 1993.
[10] B. L. Sim, Y. C. Tong, J. S. Chang, and C. T. Tan, “A parametric formulation of the generalized spectral subtraction method,” IEEE Trans. Speech Audio Process., vol. 6, no. 4, pp. 328–337, Jul. 1998.
[11] R. Sarpeshkar, C. Salthouse, J.-J. Sit, M. W. Baker, S. M. Zhak, T. K. T. Lu, L. Turicchia, and S. Balster, “An ultra-low-power programmable analog bionic ear processor,” IEEE Trans. Biomed. Eng., vol. 52, no. 4, pp. 711–727, Apr. 2005.
[12] A. H. Farrahi, C. Chen, A. Srivastava, G. Tellez, and M. Sarrafzadeh, “Activity-driven clock design,” IEEE Trans. Computer-Aided Design Integr. Circuits Syst., vol. 20, no. 6, pp. 705–714, Jun. 2001.
[13] D. Ma and R. Bondade, “Enabling power-efficient DVFS operations on silicon,” IEEE Circuits Syst. Mag., vol. 10, no. 1, pp. 14–30, Mar. 2010.
[14] R. G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, “Near-Threshold computing: Reclaiming Moore’s law through energy efficient integrated circuits,” Proc. IEEE, vol. 98, no. 2, pp. 253–266, Feb. 2010.
[15] International roadmap for semiconductors 2009 [Online]. Available: http://www.itrs.net
[16] C. I. Joon, P. S. Phill, and K. Roy, “Exploring asynchronous design techniques for process-tolerant and energy-efficient subthreshold operation,” IEEE J. Solid-State Circuits, vol. 45, no. 2, pp. 401–410, Feb. 2010.
[17] H. van Gageldonk, K. van Berkel, A. Peeters, D. Baumann, D. Gloor, and G. Stegmann, “An asynchronous low-power 80C51 microcontroller,” in Proc. Int. Symp. Adv. Res. Asynch. Circuits Syst., 1998, pp. 96–107.
[18] J. Cortadella, A. Kondratyev, L. Lavagno, and C. P. Sotiriou, “Desynchronization: Synthesis of asynchronous circuits from synchronous specifications,” IEEE Trans. Computer-Aided Design Integr. Circuits Syst., vol. 25, no. 10, pp. 1904–1921, Oct. 2006.
[19] D. E. Muller and W. S. Bartky, “A theory of asynchronous circuits,” in Proc. Int. Symp. Theory Switch., 1959, pp. 204–243.
[20] A. J. Martin, S. M. Burns, T. K. Lee, D. Borkovic, and P. J. Hazewindus, “The design of an asynchronous microprocessor,” in Caltech Conf. Adv. Res. VLSI, 1989, pp. 351–373.
[21] K.-S. Chong, B.-H. Gwee, and J. S. Chang, “Energy-Efficient synchronous-logic and asynchronous-logic FFT/IFFT processors,” IEEE J. Solid-State Circuits, vol. 42, no. 9, pp. 2034–2045, Sep. 2007.
[22] K.-L. Chang and B.-H. Gwee, “A low-energy low-voltage asynchronous 8051 microcontroller core,” in Proc. IEEE Int. Symp. Circuits Syst., 2006, pp. 3181–3184.
[23] H. van Gageldonk, K. van Berkel, A. Peeters, D. Baumann, D. Gloor, and G. Stegmann, “An asynchronous low-power 80C51 microcontroller,” in Proc. Int. Symp. Adv. Res. Asynch. Circuits Syst., 1998, pp. 96–107.
[24] S. Keller, M. Katelman, and A. J. Martin, “A necessary and sufficient timing assumption for speed-independent circuits,” in Proc. IEEE Symp. Asynch. Circuits Syst., 2009, pp. 65–76.
[25] A. Wang, B. H. Calhoun, and A. P. Chandrakasan, Sub-threshold Design for Ultra Low-Power Systems. New York: Springer, 2006.
[26] J. Lohstroh, E. Seevinck, and J. de Groot, “Worst-case static noise margin criteria for logic circuits and their mathematical equivalence,” IEEE J. Solid-State Circuits, vol. 18, no. 6, pp. 803–807, Dec. 1983.

[27] M. Alioto, “Understanding DC behavior of subthreshold CMOS logic through closed-form analysis,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 7, pp. 1597–1607, Jul. 2010.
[28] I. S. MacKenzie and R. C.-W. Phan, The 8051 Microcontroller. Upper Saddle River, NJ: Pearson Prentice Hall, 2007.
[29] K. M. Fant and S. A. Brandt, “NULL convention logic: A complete and consistent logic for asynchronous digital circuit synthesis,” in Proc. Int. Conf. Appl. Specific Syst., Archit. Processors, 1996, pp. 261–273.
[30] A. Bardsley, “Implementing Balsa handshake circuits,” Ph.D. dissertation, Dept. Comput. Sci., Univ. Manchester, Manchester, U.K., 2000.
[31] C.-F. Law, B.-H. Gwee, and J. S. Chang, “Asynchronous control network optimization using fast minimum-cycle-time analysis,” IEEE Trans. Computer-Aided Design Integr. Circuits Syst., vol. 27, no. 6, pp. 985–998, Jun. 2008.
[32] S. F. Nielsen, J. Sparso, and J. Madsen, “Behavioral synthesis of asynchronous circuits using syntax directed translation as backend,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 2, pp. 248–261, Feb. 2009.
[33] C. I. Kelly, V. Ekanayake, and R. Manohar, “SNAP: A Sensor-network asynchronous processor,” in Proc. Int. Symp. Asynch. Circuits Syst., 2003, pp. 24–33.
[34] V. Ekanayake, I. C. Kelly, and R. Manohar, “An ultra low-power processor for sensor networks,” in Int. Conf. Archit. Support Program. Languages Operat. Syst., 2004, pp. 27–36.

Kok-Leong Chang (S’07–M’11) received the B.Eng. (first-class honors) and Ph.D. degrees in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2004 and 2011, respectively.
He is a Scientist with the Institute of Materials Research and Engineering, A*STAR, Singapore. His research interests include robust asynchronous-logic circuits, microprocessor architectures, and printed electronics.

Joseph S. Chang received the B.Eng. in electrical and computer engineering from Monash University, Melbourne, Australia, and the Ph.D. degree from the Department of Otolaryngology, University of Melbourne, Melbourne, Australia.
He is currently with Nanyang Technological University (NTU), Singapore, where he was previously the Associate Dean of Research and Graduate Studies at the College of Engineering. He is also an Adjunct at Texas A&M University, College Station, TX, USA. He is a multi-disciplinary engineer and his research interests encompass emerging technologies and traditional circuits and system-related fields, including printed electronics, microfluidics, life sciences, audiology, psychophysics, acoustics, and biomedical and electronic devices. He publishes prolifically and has been awarded 10 patents with several pending. He has founded two startups in the field of electroacoustics, and has designed numerous related products, adopted for industry and commercially.
Dr. Chang served as Editor of the Open Column, IEEE CIRCUITS AND SYSTEMS MAGAZINE, Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–PART I and the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–PART II, Guest Editor for the PROCEEDINGS OF THE IEEE, Guest Editor of the CIRCUITS AND SYSTEMS MAGAZINE (life sciences special issue), and Chair of the Life Sciences Systems and Applications Technical Committee of the IEEE CAS Society. He has chaired several international conferences, including the IEEE-National Institutes of Health (NIH) Life Sciences Systems and Applications Workshop, the IEEE-NIH CAS Medical and Environmental Workshop, and the International Symposium on Integrated Circuits and Systems. He has also been awarded numerous academic, defense and industrial grants, exceeding $11M, including from Defense Advanced Research Projects Agency (USA), E.U. grants, from multinational corporations, etc.

Bah-Hwee Gwee (S’93–M’97–SM’03) received the B.Eng. degree in electrical and electronic engineering from University of Aberdeen, Aberdeen, U.K., in 1990, and the M.Eng. and Ph.D. degrees from Nanyang Technological University, Singapore, in 1992 and 1998, respectively.
He was an Assistant Professor in the School of Electrical and Electronic Engineering, Nanyang Technological University (NTU) from 1999 to 2005 and has been an Associate Professor since 2005. He has held the appointment of Assistant Chair (Students) of the School of Electrical and Electronic Engineering since 2010. He was the Principal Investigator (PI) of a number of research projects including the ASEAN-European Union University Network Programme, Ministry of Education Tier-1 and Tier-2, Defence Science and Technology Agency and Temasek Laboratories projects. He was also a co-PI of DARPA (USA), NTU-Panasonic, and NTU-Linköping research projects. His total research grant amounts to more than U.S. $5m. He has filed and been granted several USA and Singapore patents in circuit design. His research interests include sub-threshold, dynamic voltage scaling asynchronous circuit, GALS NoC and Class-D amplifier designs. He has been an Associate Editor for Journal of Circuits, Systems and Signal Processing (2007–2012).
Dr. Gwee was the Chairman of IEEE Singapore Circuits and Systems Chapter in 2005, 2006 and 2013. He has been a member of IEEE CAS Society DSP, VLSI and Bio-CAS Technical Committees since 2004. He has served in the Organizing Committees for IEEE BioCAS-2004, IEEE APCCAS-2006, Technical Program Chair for ISIC-2007, co-Chair for ISIC-2011 and served in the steering committee for IEEE APCCAS 2006–2008. He has been an Associate Editor for IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–II: EXPRESS BRIEFS 2010–2011, and an Associate Editor for IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–PART I: REGULAR PAPERS since 2012. He was an IEEE Distinguished Lecturer for CAS Society in 2009/2010.

Kwen-Siong Chong (S’03–M’09) received the B.Eng., M.Phil., and Ph.D. degrees in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2001, 2002, and 2007, respectively.
He is presently a Senior Research Scientist with Temasek Laboratories, Nanyang Technological University, Singapore. He was a visiting researcher in Nara Institute of Science and Technology, Japan, in 2010, and in the University of Michigan, USA, in 2012. He was the co-principal investigator/collaborator of the Defense Advanced Research Projects Agency (USA) and Ministry of Education Tier-2 (Singapore) research projects. His research interests include asynchronous VLSI designs, low-voltage low-power VLSI circuits, audio signal processing, and soft-error tolerant designs.
Dr. Chong was the Secretary of IEEE Circuits and Systems (CAS) Society, Singapore Chapter, in 2011 and 2012. He has been a member of IEEE CAS Society VLSI Technical Committee since 2009.
