Series Editors
Nikil D. Dutt, Department of Computer Science, Zot Code 3435,
Donald Bren School of Information and Computer Sciences, University of
California, Irvine, CA 92697-3435, USA
Peter Marwedel, TU Dortmund, Informatik 12, Otto-Hahn-Str. 16, 44227,
Dortmund, Germany
Grant Martin, Tensilica Inc., 3255-6 Scott Blvd., Santa Clara, CA 95054, USA
Integrated Optical
Interconnect Architectures
for Embedded Systems
Editors
Ian O'Connor
Lyon Institute of Nanotechnology
École Centrale de Lyon
Écully, France

Gabriela Nicolescu
Dépt. Génie Informatique & Génie Logiciel
École Polytechnique de Montréal
Montréal, QC, Canada
Preface
In the first part of this book, we examine both ends of the optical interconnect
domain, as a convergence of application needs (performance metrics of communi-
cation infrastructure in systems on chip) and enabling technology (building blocks
of silicon photonics).
In Chap. 1, the system on chip concept is introduced, with a particular focus on
SoC communication systems and the main features and limitations of various types
of on-chip interconnect. The author examines both performance and physical inte-
gration issues and stresses that on-chip interconnect, rather than logic gates, is the
bottleneck to system performance. Much research and industry effort are focused
today on vertical solutions, at the packaging level (system in package, or SiP) or at the
integration level (3D integrated circuits, or 3DICs). These approaches can indeed, for a
time relax the SoC interconnect bottleneck and allow the implementation of com-
plex, heterogeneous, and high-performance systems. However, the author concludes
with the observation that increasing complexity and requirements in terms of com-
putation capability of new generation systems will reach the limit of electrical inter-
connect quite soon, driving the need for novel solutions and different approaches for
reliable and effective on-chip and die-to-die communication. Optical interconnect is
one such potential solution.
In Chap. 2, the authors give a review of silicon photonics technology and focus
on explaining the main principles and orders of magnitude of the various compo-
nents that are required for on-chip optical interconnects and in particular for WDM
(wavelength division multiplexing) links. Achieving true CMOS compatibility at
material and process level is a driving factor for silicon photonics, and the high
refractive index contrast also makes it possible to scale down photonic building
blocks to a footprint compatible with thousands of components on a single chip.
The authors highlight the fast pace of progress of this technology and their convic-
tion that on-chip optical links will become a reality before 2020, while also singling
out the two most significant issues that still need to be solved when using silicon
photonics for on-chip links: which approach for the light source and how to handle
thermal issues (both to lower thermal dissipation and to minimize sensitivity to
temperature variation).
The second part of this book looks at various proposals for communication topolo-
gies based on silicon photonics, in particular for MPSoCs at a scale of tens to hun-
dreds of cores. Indeed, there have been several proposals in recent years for optical
interconnect networks attempting to provide improved performance and energy
efficiency compared to electrical networks. Three chapters review some of these
topologies and make further novel proposals, while introducing critical concepts to
this domain such as multilevel design and analysis and system integration and
interfacing.
Chapter 3 introduces this part of the book with a further review of basic nano-
photonic devices as integrated with a standard CMOS process. The authors then
propose a structured approach to clearly analyze previous proposals at relevant
abstraction levels (here considered to be architectural, microarchitectural, and phys-
ical) and use this approach to identify opportunities for new designs and make the
link between application requirements and technology constraints. The design pro-
cess is illustrated in an on-chip tile-to-tile network, processor-to-DRAM network,
and DRAM memory channel, and the authors conclude with a discussion of lessons
learned throughout such a design process with a set of guidelines for designers.
Chapter 4 proposes a fat tree-based optical NoC (FONoC) at several levels of
detail including the topology, floorplan, and protocols. Central to the proposal is a
low-power and low-cost optical turnaround router (OTAR) with an associated rout-
ing algorithm. In contrast to some other optical NoCs, FONoC does not require a
separate electronic NoC for network control, since it carries both payload data and
network control data on the same optical network. The authors describe the proto-
cols, which are designed to minimize network control data and related power con-
sumption. The overall power consumption and performance (delay and throughput)
is evaluated by an analytical model and compared to a matched electronic 64-node
NoC in 45 nm CMOS under different offered loads and packet sizes.
In Chap. 5, the authors describe an optical ring bus (ORB)-based hybrid opto-
electric on-chip communication architecture. This topology uses an optical ring
waveguide to replace global pipelined electrical interconnects while maintaining
the interface with typical bus protocol standards such as AMBA AXI3. The pro-
posed ORB architecture supports serialization of uplinks/downlinks to optimize
communication power dissipation and is shown to reduce transfer latency and power
consumption compared to a pipelined, electrical, bus-based communication architecture
at the 22 nm CMOS technology node.
Using this as a platform, the authors develop a novel 4-layered hardware stack
architecture consisting of the physical layer, the physical-adapter layer, the data link
layer, and the network layer, allowing the modular design of each building block
and boosting the interoperability and design reuse. Crucial to proving the industrial
viability of the approach, the authors have made significant effort to model and
integrate the proposed protocol stack within an industrial simulation environment
(ST OCCS GenKit) using an industrial standard (VSTNoC) protocol. As
in Chap. 3, the authors use this approach to introduce the micro-architecture of a
new electrical distributed router as a wrapper for the ONoC and evaluate the perfor-
mance of the layered architecture both at the system level (for network latency and
throughput) and at the physical (optical) level. Experimental results prove the scal-
ability of the network and demonstrate that it is able to deliver a comparable band-
width or even better (in large network sizes).
In Chap. 7, the authors exploit application characteristics by examining the
behavior of on-chip network traffic to understand how its locality in space and time
can be advantageously exploited by slowly reconfiguring networks, such as a
reconfigurable photonic NoC. The authors provide implementation details and a
performance and power characterization in which the topology is adapted automati-
cally (at the microsecond scale) to the evolving traffic situation by use of silicon
microrings.
Finally, while the previous chapter focused on exploiting application character-
istics, Chap. 8 explores new physical integration strategies by coupling the optical
interconnect concept to the emerging paradigm of 3DICs. The authors investigate
design trade-offs for a 3D MPSoC using a specific optical interconnect layer and
highlight current and short-term design trends. A system-level design space explo-
ration flow is also proposed, taking routing capabilities of optical interconnect into
account. The resulting application-to-architecture mappings demonstrate the
benefits of the 3D MPSoC architectures and the efficiency of the system-level
exploration flow.
We would like to take this opportunity to thank all the contributors to this book
for having undertaken the writing of each chapter and for their patience during the
review process. We also wish to extend our appreciation to the team at Springer for
their editorial guidance as well as of course for giving us the opportunity to compile
this book together.
Alberto Scandurra
Abstract Systems on chip (SoCs) are complex systems containing billions of transistors
integrated on a single silicon chip, implementing highly complex functionalities by
means of a variety of modules communicating with the system memories and/or with
each other through a suitable communication system. Integration density is now so high
that many issues arise when a SoC has to be implemented, and the electrical limits of
interconnect wires are a limiting factor for performance. The main SoC building block
affected by these problems is the on-chip communication system (or on-chip interconnect),
whose task is to ensure effective and reliable communication between all the
functional blocks of the SoC. A novel methodology aimed at solving the problems
mentioned above consists of splitting a complex system over multiple dice, exploiting the
so-called system in package (SiP) approach and opening the way to dedicated high-
performance communication layers such as optical interconnect. This chapter deals with
SoC technology, describes current solutions for on-chip interconnect, illustrates the
issues faced during the SoC design and integration phases, and introduces the SiP
concept and its benefits.
Outlook
Systems on chip (SoCs) are complex systems containing billions of transistors
integrated on a single silicon chip, implementing highly complex functionalities by
means of a variety of modules communicating with the system memories and/or with
each other through a proper communication system.

A. Scandurra (*)
OCCS Group, STMicroelectronics,
Stradale Primosole 50, 95121, Catania, Italy
e-mail: Alberto.scandurra@st.com
The system on chip (SoC) is now the essential solution for delivering competitive
and cost-efficient performance in today's challenging electronics market. Consumers
using PCs, PDAs, cell-phones, games, toys and many other products demand more
features, instant communications and massive data storage in ever smaller and more
affordable products. The unstoppable drive in silicon fabrication has delivered
technology to meet this demand: chips with hundreds of millions of gates using 130 nm
processes are no more than the size of a thumbnail. These SoCs present one of the
biggest challenges that engineers have ever faced: how to manage and integrate
enormously complex designs that combine the richest imaginable mix of micropro-
cessors, memories, buses, architectures, communication standards, protocol proces-
sors, interfaces and other intellectual property components where system level
considerations of synchronization, testability, conformance and verification are cru-
cial. Integrated circuit (IC) design has become a multi-million-gate challenge for
which the demands on design teams are ever greater.
The techniques used in designing multi-million-gate SoCs employ the world's
most advanced electronic design automation (EDA), with a level of sophistication
that requires highly trained and experienced engineers. Key issues to be managed in
the design process include achieving timing closure that accounts for wire delays in
the metal interconnects inside the chip, and designs for tests so that the chips can be
manufactured economically. Early selection of the right architecture, design flow
and best use of EDA solutions is required to achieve first-silicon success and to
decrease the time-to-market from years to months.
1 Interconnect Issues in High-Performance Computing Architectures
Peripherals
Slow memory controllers
Peripherals are slow devices, such as I2C and Smartcard controllers, used where no
high performance is required, operating at around 50-100 MHz.
Normally the CPUs run at the highest speed and the memory system represents
the SoC bottleneck in terms of performance.
Hence within a single chip, different circuit islands run at different frequen-
cies; this approach is called GALS (globally asynchronous locally synchronous)
and is widely used today. The different clock frequencies required to operate the
various subsystems are generated by the clock generator (clockgen), while the sub-
systems are linked together by the on-chip interconnect, such as the STBus/STNoC
[1] in the case of STMicroelectronics products. Typically the on-chip interconnect
optimizes the CPU path, i.e. the interconnect structure normally operates at the
same frequency as the CPU. Since the other subsystems often operate at a different
frequency, dedicated frequency converters have to be placed between the intercon-
nect and the other subsystems to enable inter-block communication.
As already shown in Fig. 1.1, a SoC can be seen as a number of intellectual proper-
ties (IPs) properly connected by an on-chip communication architecture (OCCA),
an infrastructure that interconnects the various IPs and provides the communication
mechanisms necessary for distributed computation over a set of heterogeneous pro-
cessing modules. The throughput and latency of the communication infrastructure,
and also the relevant power consumption, often limit the overall SoC performance.
Until now the prominent type of OCCA has been the on-chip bus, such as the
STBus from STMicroelectronics, the AMBA bus from ARM [2], CoreConnect
from IBM [3], which represent the traditional shared-communication medium. This
type of OCCA, while not at all scalable, has been able to fulfill SoC requirements
because the performance bottleneck has always been the memory system. However,
with the growing requirements of more modern SoCs and CMOS technology scal-
ing, the performance bottleneck is moving from memories to interconnect, as
detailed in Sect. 4.
In order to overcome this limit, a new generation architecture, called network on
chip (NoC), has been deeply studied and proposed; it is an attempt to translate the
networking and parallel computing domain experience into the SoC world, relying on
a packet-switched micro-network backbone based on a well-defined protocol stack.
Innovative NoC architectures include STNoC from STMicroelectronics [4], Æthereal
from Philips Research Labs [5], and Xpipes from the University of Bologna [6].
On-Chip Bus
A bus connects a set of initiators (PEs able to generate traffic) and a set of
targets (PEs able to receive and process traffic) through a set of physical channels,
over which the traffic flows are routed from initiators to targets and vice versa.
The peculiarities of a bus, which are also its main drawbacks, are:
Limited available bandwidth, given by the product of the bus size (width) and the
bus operating frequency. Achieving a higher available bandwidth implies either
widening the bus, thereby amplifying physical issues such as wire congestion, or
increasing the operating frequency, which leads to increased power consumption
and is moreover limited by physical issues such as capacitive load and capacitive
coupling.
Lack of bandwidth scalability, since connecting more IPs to the bus implies
dividing the total available bandwidth among all the IPs, thereby allocating a
lower bandwidth to each of them.
Limited system scalability, since connecting more IPs to the bus results in an
increase of the capacitive load, which leads to a drop in operating frequency.
Limited quality of service, since there is no possibility to process different classes
of traffic (such as low latency CPUs, high bandwidth video/audio processors,
DMAs) in a different way.
High occupation area, due to the large number of wires required to transport all
the protocol information, i.e. data and control signals (STBus interfaces for
example are characterized by hundreds of wires).
High power consumption, which is determined by the switching activity and
potentially affects all the wires of the bus.
Network on Chip
The new requirements of modern applications impose the need for new solutions to
overcome the previously mentioned drawbacks of on-chip buses, both for the clas-
sic shared-bus (such as AMBA AHB) and the more advanced communication sys-
tems supporting crossbar structures (such as the STBus). In conjunction with the
most recent technology features, a novel on-chip communication architecture, called
network on chip (NoC), has been proposed.
It is important to highlight that the NoC concept is not merely an adaptation to
the SoC context of parallel computing or wide area network domains; many issues
are in fact still open in this new field, and the highly complex design space requires
detailed exploration. The key open points are, for instance, the choice of the net-
work topology, the message format, the end-to-end services, the routing strategies,
the flow control and the queuing management. Moreover, the type of quality of
service (QoS) to be provided is another open issue, as is the most suitable software
view to allow the applications to exploit NoC infrastructure peculiarities.
From lessons learned by the telecommunications community, the global on-chip
communication model is decomposed into layers similar to the ISO-OSI reference
model (see Fig. 1.2). The protocol stack enables different services and allows QoS to be provided.
Fig. 1.2 maps the ISO-OSI reference layers (application, presentation, session, transport, network) onto the reduced NoC protocol stack (network, data link, physical).
Topology
A first parameter for the topology is its scalability; a topology is said to be scalable
if it is possible to create larger networks of any size, by simply adding new nodes.
Two different approaches can be followed for the specification of the topology of a
NoC: topology-dependent and topology-independent. The former approach specifies
the network architecture and its building blocks assuming a well defined topology.
The latter aims at providing flexibility to the SoC architect in choosing the topology
for the interconnect, depending on the application. This means that it is possible to
build any kind of topology by plugging together the NoC building-blocks in the
proper way. While this second approach is more versatile because of the higher
configurability allowed, it also has the following drawbacks:
A very wide design and verification space, which would require significant effort
to ensure a high quality product to the NoC user.
Exposure of the complexity of the network layer design (including issues such as
deadlock) to the SoC architect, thus requiring novel specific competencies and a
high effort in defining an effective (in terms of performance) and deadlock-free
architecture.
A need for highly parametric building blocks, with few cost optimization possibilities.
Moreover, a NoC built on top of a specific topology still needs a high degree of
flexibility (routing, flow control, queues, QoS) in order to properly configure the
interconnect to support different application requirements.
Routing Algorithms
Routing algorithms are responsible for the selection of a path from a source node to
a destination node in a particular topology of a network. A good routing algorithm
balances the load across the various network channels even in the presence of non-
uniform and heavy traffic patterns. A well designed routing algorithm also keeps
path lengths as short as possible, thus reducing the overall latency of a message.
Another important aspect of a routing algorithm is its ability to operate in the
presence of faults in the network. If a particular algorithm is hardwired into the rout-
ers and a link or node fails, the entire network fails. However, if the algorithm can
be reprogrammed or adapted to bypass the failure, the system can continue to oper-
ate with only a slight loss in performance.
Routing algorithms are classified depending on how they select between the possible
paths from a source node to a destination node. Three main categories are specified:
Deterministic, where the same path is always chosen between a source and a
destination node, even if multiple paths exist.
Oblivious, where the path is chosen without taking into account the present state
of the network; oblivious routing algorithms include deterministic routing algo-
rithms as a subset.
Adaptive, where the current state of the network is used to select the path.
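A minimal instance of the deterministic category is XY routing on a 2D mesh, where a packet always travels fully along the X dimension before turning into Y; this is a generic sketch for illustration, not an algorithm prescribed by the chapter:

```python
def xy_route(src, dst):
    # Deterministic XY routing on a 2D mesh: for a given (src, dst) pair
    # the same path is always chosen, even if other minimal paths exist.
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    while x != dx:                      # first route along the X dimension
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then route along the Y dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path
```

Because the path never depends on network state, XY routing is also oblivious; an adaptive router would instead consult link occupancy before each hop.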
Deadlock
Deadlock can arise from the dependencies between incoming requests and outgoing
responses: the node does not act as a sink for incoming packets, due to the finite
size of the buffers and the dependencies between requests and responses.
In shared memory architectures, complex cache-coherent protocols could lead to
a deeper level of dependencies. The effect of these protocol dependencies can be
eliminated by using disjoint networks to handle requests and replies. The following
two approaches are possible:
Two physical networks, i.e., separated physical data buses for requests and
responses.
Two virtual networks, i.e., separated virtual channels for requests and responses.
Quality of Service
The set of services requested by the IPs connected to the network (called network
clients) and the mechanisms used to provide these services are commonly referred
to as QoS.
Generally, it is useful to classify the traffic across the network into a number of
classes, in order to efficiently allocate network resources to packets. Different
classes of packets usually have different requirements in terms of importance, toler-
ance to latency, bandwidth and packet loss.
Two main traffic categories are specified:
Guaranteed service
Best effort
Traffic classes belonging to the former category are guaranteed a certain level of
performance as long as the injected traffic respects a well-defined set of constraints.
Traffic classes belonging to the latter category do not get any strong guarantee from
the network; instead, it will simply make its best effort to deliver the packets to their
destinations. Best effort packets may then have arbitrary delay, or even be
dropped.
The key quality of service concern in implementing best effort services is provid-
ing fairness among all the best effort flows. Two alternative solutions exist in terms
of fairness:
Latency-based fairness, aiming at providing equal delays to flows competing for
the same resource.
Throughput-based fairness, aiming at providing equal bandwidth to flows com-
peting for the same resource.
While latency-based fairness can be achieved by implementing a fair arbitration scheme
[such as round-robin or least recently used (LRU)], throughput-based fairness can be
achieved in hardware by separating each flow requesting a resource into a separate
queue, and then serving the queues in round-robin fashion. The implementation of such
a separation can be expensive; in fact while physical channels (links) do not have to be
replicated because of their dynamic allocation, virtual channels and buffers, requiring
FIFOs, have to be replicated for each different class of traffic. So it is very important to
choose the proper number of classes needing true isolation, keeping in mind that in
many situations it may be possible to combine classes without a significant degradation
of quality of service but gaining a reduction in hardware complexity.
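The throughput-based scheme described above, one FIFO queue per flow served in round-robin fashion, can be sketched as follows; the flow names and service budget are hypothetical:

```python
from collections import deque

def round_robin_serve(queues, budget):
    # Serve per-flow FIFO queues in round-robin order: each non-empty flow
    # gets an equal share of service slots (throughput-based fairness).
    served = []
    while budget > 0 and any(queues):
        for q in queues:
            if q and budget > 0:
                served.append(q.popleft())
                budget -= 1
    return served

# Three flows with unequal backlogs: each round serves one packet per
# non-empty flow, so no flow can monopolize the shared resource.
flows = [deque(['a1', 'a2', 'a3']), deque(['b1']), deque(['c1', 'c2'])]
```

The hardware cost noted in the text shows up here as one queue per isolated flow; merging two classes into one queue removes that cost but also their mutual isolation.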
Error Recovery
A high performance, reliable and energy efficient NoC architecture requires a good
utilization of error-avoidance and error-tolerance techniques, at most levels of its
layered organization. Using modern technologies to implement the present day sys-
tems (in order to improve performance and reduce power consumption), means
adopting lower levels of power supply voltage, leading to lower margins of noise
immunity for the signals transmitted over the communication network of the sys-
tem. This leads to a noisy interconnect, which behaves as an unreliable transport
medium, and introduces errors in the transmitted signals. So the communication
process needs to be fault-tolerant to ensure correct information transfer. This can be
achieved through the use of channel coding. Such schemes introduce a controlled
amount of redundancy in the transmitted data, increasing its noise immunity.
Linear block codes are commonly used for channel encoding. Using an (n, k)
linear block code, a data block of k bits length is mapped onto an n bit code word,
which is transmitted over the channel. The receiver examines the received signal
and declares an error if it is not a valid code word.
Once an error has been detected, it can be handled in one of two different ways:
Forward error correction (FEC), where the properties of the code are used to cor-
rect the error.
Retransmission, also called automatic repeat request (ARQ), where the receiver
asks the sender to retransmit the code word affected by the error.
FEC schemes require a more complex decoder, while ARQ schemes require the
existence of a reverse channel from the receiver to the transmitter, in order to ask for
the retransmission.
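As a minimal illustration of an (n, k) linear block code, the sketch below uses a single even-parity bit (so n = k + 1), which lets the receiver detect any single-bit error and, under an ARQ scheme, request retransmission; it is an example chosen for brevity, not a code mandated by the text:

```python
def encode_parity(data_bits):
    # (k+1, k) single-parity block code: append a bit making overall parity even.
    return data_bits + [sum(data_bits) % 2]

def is_valid_codeword(word):
    # Receiver check: every valid codeword has even overall parity.
    # Any single-bit error flips the parity and is therefore detected.
    return sum(word) % 2 == 0

word = encode_parity([1, 0, 1, 1])   # k = 4 data bits -> n = 5 bit codeword
corrupted = word[:]
corrupted[2] ^= 1                    # single-bit channel error
# The corrupted word fails the check, so an ARQ scheme would ask the
# sender to retransmit; an FEC code would need more redundancy to also
# locate and correct the flipped bit.
```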
In decananometric CMOS technologies, deep submicron (DSM) effects are significant
and the physical design of a SoC is increasingly faced with two types of issue:
Performance issues, related mainly to the bandwidth requirements of the different
IPs, that in order to be fulfilled, would require SoCs to run at very high speeds.
Integration issues, related to the difficulties encountered mainly during the place-
ment of the hard macros and the standard cells, and during the routing of clock
nets and communication system wires.
Performance Issues
Integration Issues
Figure 1.5 is an illustration of the physical issues; it shows the floorplan of an exam-
ple CMOS chip for a consumer application.
In this figure the rectangles represent the various IPs of the chip (both initiators
and targets); the space available for the communication system is the very irregular
shape between all the different IPs. In such an area the Network Interfaces, repre-
senting the access points of the IPs to the on-chip network, the nodes, responsible
for arbitration and propagation of information, and all the physical channels con-
necting the different NoC building-blocks have to be placed. Because of the shape,
which is quite irregular with thin regions, and of the available area, it is evident
that the placement of the interconnect standard cells can be difficult, and that the
routing of the wires, which can also be very long, will likely suffer from congestion.
enhance the total system performance. Global interconnects have the largest
pitch and a delay typically longer than one or two clock cycles.
Intermediate interconnect, having dimensions that are between those of local and
global interconnects.
A key difference between local and global interconnect is that the length of the
former scales with the technology node, while for the latter the length is approxi-
mately constant.
From a functional point of view, the two main important and performance-
demanding applications of interconnects in SoC are signaling (i.e. the communica-
tion of different logic units) and clock distribution. In this context they can be
classified as:
Point-to-point links, used for critical data-intensive links, such as CPU-memory
buses in processor architectures.
The continuous evolution and scaling down of CMOS technologies has been the
basis of most of today's information technologies. It has allowed the improvement
of the performance of electronic circuits, increasing their yield and lowering the
cost per function on chip. Through this, the processing and storage of information
(in particular digitally encoded information) has become a cheap commodity.
Computing powers not imaginable only a few years ago have been brought to the
desktops of every researcher and every engineer. Electronic ICs and their ever
increasing degree of integration have been at the core of our current knowledge-
based society and they have formed the basis of a large part of the growth of
efficiency and competitiveness of large as well as small industries.
Continuing this evolution will however require a major effort. A further scaling
down of feature sizes in microelectronic circuits will be necessary. To reach this
goal, major challenges have to be overcome, and one of these is the interconnect
bottleneck.
The rate of inter-chip communication is now the limiting factor in high perfor-
mance systems. The function of an interconnect or wiring system is to distribute
clock and other signals to and among the various circuits/systems on a chip. The
fundamental development requirement for interconnect is to meet the high-speed
transmission needs of chips despite further scaling of feature sizes. This scaling
down however, has been shown to increase the signal runtime delays in the global
interconnect layers severely. Indeed, while the reduction in transistor gate lengths
increases the circuit speed, the signal delay time for global wires continues to
increase with technology scaling, primarily due to the increasing resistance of the
wires and their increasing lengths. Current trends to decrease the runtime delays,
the power consumption and the crosstalk, focus on lowering the RC-product of the
wires, by using metals with lower resistivity (e.g. Copper instead of Aluminum) and
by the use of insulators with lower dielectric constant. Examples of the latter include
nanoporous SiOC-like or organic (SiLK-type) materials, which have dielectric
constants below 2.0, or air-gap approaches, which reach values close to 1.8-1.7.
Integration of these materials results in an increased complexity however, and they
have inherent mechanical weaknesses. Moreover, introducing ultra low dielectric
constant materials finds its fundamental physical limit when one considers that the
film permittivity cannot be less than 1 (that of a vacuum).
Therefore, several researchers have come to the conclusion that the global inter-
connect performance needed for future generations of ICs cannot be achieved even
with the most optimistic values of metal resistivity and dielectric constants.
Evolutionary solutions will not suffice to meet the performance roadmap and there-
fore radical new approaches are needed.
Several such possibilities are now envisaged, the most prominent of which are
the use of RF or microwave interconnects, optical interconnects, 3D interconnects
and cooled conductors. The ITRS roadmap suggests that research and evaluation
are greatly needed for all these solutions over the next few years. Subsequently, a
narrowing down of the remaining solutions and the start of an actual development
effort are expected.
As has already been stated, the main limitations due to metallic interconnects are
the crosstalk between lines and the noise on transmitted signals, the delay, the con-
nection capability and the power consumption (due to repeaters). As a result, the
Semiconductor Research Corporation has cited interconnect design and planning as
a primary research thrust.
An ideal interconnect should be able to transmit any signal with no delay, no degra-
dation (either inherent or induced by external causes), over any distance without
consuming any power, requiring zero physical footprint and without disturbing the
surrounding environment.
Accordingly, a number of metrics have been defined in order to characterize
the performance and the quality of real interconnects, such as:
Propagation delay
Bandwidth density
Power-delay product
Bit error rate
Fig. 1.7 plots delay per unit length (ps/mm, from 20 to 45) against normalized interconnect width (1-7) for the 90, 65, 45, 32 and 22 nm technology nodes.
Propagation Delay
The propagation delay is the time required by a signal to cross a wire. Pure intercon-
nect delay depends on the link length and the speed of propagation of the wavefront
(time of flight). Electrical regeneration introduces additional delay through buffers
and transistor switching times. Additionally, delay can be induced by crosstalk.
It can be reduced by increasing the interconnect width at the expense of a smaller
bandwidth density.
Technology scaling has an insignificant effect on the delay of an interconnect with
an optimal number of repeaters. The minimum achievable interconnect delay
remains effectively fixed at approximately 20 ps/mm when technology scales from
90 to 22 nm, as shown in Fig. 1.7.
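As a rough worked example (our own sketch, not from the chapter), the quoted ~20 ps/mm figure translates directly into link delay:

```python
# Hypothetical sketch: estimate the delay of an optimally repeated on-chip
# link, using the roughly scaling-independent ~20 ps/mm figure quoted above.
DELAY_PER_MM_PS = 20.0  # ps/mm, optimally repeated metal interconnect

def link_delay_ps(length_mm: float) -> float:
    """Delay of an optimally repeated wire of the given length."""
    return DELAY_PER_MM_PS * length_mm

# A 10 mm global wire crossing a large die costs about 200 ps,
# i.e. several clock cycles at multi-GHz frequencies.
print(link_delay_ps(10.0))  # -> 200.0
```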
Bandwidth Density
Power-Delay Product
realistic cases, power will also be required in emitter and receiver circuitry, and in
regeneration circuits.
A distinction can also be made between static and dynamic power consumption
by introducing a factor α representing the switching activity of the interconnect link (0 < α < 1).
The power-delay product (PDP) is routinely used in the technology design pro-
cess to evaluate circuit performance.
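A minimal sketch (our own illustration, with assumed component values) of how the switching activity factor α enters the dynamic power and the PDP:

```python
def dynamic_power_w(alpha: float, c_load_f: float, vdd_v: float, f_hz: float) -> float:
    """Dynamic power of a link: P = alpha * C * Vdd^2 * f, with alpha the switching activity."""
    return alpha * c_load_f * vdd_v ** 2 * f_hz

def power_delay_product_j(power_w: float, delay_s: float) -> float:
    """Power-delay product: the energy figure of merit used to compare circuits."""
    return power_w * delay_s

# Assumed illustrative values: 1 pF wire capacitance, 1 V supply,
# 1 GHz clock, 25% switching activity, 200 ps link delay.
p = dynamic_power_w(0.25, 1e-12, 1.0, 1e9)   # 0.25 mW
pdp = power_delay_product_j(p, 200e-12)      # 50 fJ
print(p, pdp)
```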
Bit Error Rate
The bit error rate (BER) may be defined as the rate of error occurrences and is the main criterion in evaluating the performance of digital transmission systems.
For an on-chip communication system a BER of 10⁻¹⁵ is acceptable; electrical interconnects typically achieve BER figures better than 10⁻⁴⁵. That is why the BER
is not commonly considered in integrated circuit design circles. However, future
operation frequencies are likely to change this, since the combination of necessarily
faster rise and fall times, lower supply voltages and higher crosstalk increases the
probability of wrongly interpreting the signal that was sent.
Errors come from signal degradation. Real signals are characterized by their
actual frequency content and by their voltage or current value limits. The frequency
content will define the necessary channel bandwidth, according to the Shannon–Hartley theorem. Analogue signals are highly sensitive to degradation, and the
preferred mode of signal transmission over interconnect is digital.
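The Shannon–Hartley theorem bounds the achievable bit rate for a given channel bandwidth and signal-to-noise ratio; a quick sketch with illustrative (assumed) numbers:

```python
import math

def shannon_capacity_bps(bandwidth_hz: float, snr_linear: float) -> float:
    """Shannon-Hartley limit: C = B * log2(1 + SNR)."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)

# A 10 GHz channel at 20 dB SNR (linear SNR = 100) tops out near 66.6 Gb/s:
c_bps = shannon_capacity_bps(10e9, 100.0)
print(c_bps / 1e9)
```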
Signal degradation can be classed as time-based, inherent and externally
induced:
Time-based: non-zero rise time, overshoot, undershoot, and ringing. Time-based degradation can be incorporated into the delay term for digital signals: as long as these degradations stay within a quasi-deterministic envelope that does not exceed the noise margins of the digital circuits, they can be mapped into the temporal domain (contributing to the regeneration delay term). This assumption is, however, destined to break down in nanometric technologies, because of increasingly probabilistic behavior and, especially, weaker noise margins.
Inherent: attenuation (dB/cm), skin effect, and reflections (dB).
Externally induced: crosstalk (dB/cm) and sensitivity to ambient noise.
The allowable tolerance on signal degradation and delay for a given bandwidth
and power budget forces a limit to the transmission distance. The maximum inter-
connect segment length can in fact be calculated, a segment being defined as a por-
tion of interconnect not requiring regeneration at a receiver point spatially distant
from its emission point.
Signal regeneration in turn leads to a further problem, i.e., the energy used to
propagate the signal in the transmission medium can escape into the surrounding
environment and perturb the operation of elements close to the transmission path.
3D Interconnect
The typical electronics product/system of the near future is expected to include all
the following types of building-blocks:
Several studies and technology roadmaps have highlighted that these electronics
products of the future will be characterized by a high level of heterogeneity, in terms
of the following mix:
Technology: digital, analog, RF, optoelectronic, MEMS, embedded passives.
Frequency: from hundreds of MHz in the digital components domain to hundreds of GHz in the RF, microwave and optical domains.
Signal: digital circuits coexisting with ultra low-noise amplifier RF circuits.
Architecture: heterogeneous architectures, i.e. event driven, data driven and time
driven models of computation, regular versus irregular structures, tradeoffs
required between function, form and fit over multiple domains of computational
elements and multiple hierarchies of design abstraction.
Design: electrical design to be unified with physical and thermal design across
multiple levels of design abstraction.
In order to simplify the design and manufacturing of such complex and hetero-
geneous systems, relying on different technologies, an adequate approach would be
to split them over a number of independent dice. Some, or even many, of the dice
will need to be in communication with each other. This approach is known as system in package (SiP) [9]; however, many almost-synonymous terms are in use: high density packaging (HDP), multi-chip module (MCM), multi-chip package (MCP) and few-chip package (FCP) [10]. In general the term SiP is used when a whole system, rather than a part, is placed into a single MCM.
The SiP paradigm moves packaging design to the early phases of system design
including chip/package functionality partitioning and integration, which is a para-
digm shift from the conventional design approach. Packaging has always played an
important role in electronic products manufacturing; however in the early days its
role was primarily structural in nature, while today and tomorrow it is playing
22 A. Scandurra
Footprint: more functionality fits into a small space. This extends Moore's Law and enables a new generation of tiny but powerful devices.
Heterogeneous integration: circuit layers can be built with different processes, or even on different types of wafers. This means that components can be optimized to a much greater degree than if they were built together on a single wafer. Even more interesting, components with completely incompatible manufacturing could be combined in a single device (see Fig. 1.8). It is worth noting that non-digital functions (memory, analog) are best built in non-digital processes, and can be integrated at low noise and low cost by placing them in a package, rather than in a chip with additional process steps.
Speed: the average wire length becomes much shorter. Because propagation delay is proportional to the square of the wire length, overall performance increases.
Power: keeping a signal on-chip reduces its power consumption by 10 to 100 times. Shorter wires also reduce power consumption through lower parasitic capacitance. Reducing the power budget leads to less heat generation, extended battery life, and lower cost of operation.
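The quadratic length dependence mentioned under Speed holds for an unrepeated distributed RC wire (Elmore delay); a hypothetical sketch, with assumed per-unit-length resistance and capacitance:

```python
def rc_wire_delay_s(r_per_m: float, c_per_m: float, length_m: float) -> float:
    """Elmore delay of an unrepeated distributed RC wire: ~0.38 * r * c * L^2."""
    return 0.38 * r_per_m * c_per_m * length_m ** 2

# Halving a wire (here 1 mm -> 0.5 mm, with assumed r and c) quarters its delay:
d_full = rc_wire_delay_s(1e5, 2e-10, 1e-3)
d_half = rc_wire_delay_s(1e5, 2e-10, 0.5e-3)
print(d_full / d_half)  # -> 4.0
```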
limits of electrical I/O channels between dice. Using other interconnect technolo-
gies (as previously mentioned) within single chips or even a dedicated interconnect
layer in a chip stack may alleviate these issues.
Conclusion
In this chapter the system on chip concept is introduced, and current SoC commu-
nication systems are described. The main features, as well as the limitations, of the
various types of on-chip interconnect are illustrated. Some details are given about
both performance issues and physical integration issues, highlighting why today
interconnect, rather than logic gates, is seen as the system bottleneck.
The system in package approach is then introduced, seen as a possibility to relax
the issues affecting SoC technology and allow the implementation of complex, het-
erogeneous and high performance systems.
However the increasing complexity and requirements in terms of computation
capability of new generation systems will reach the limit of electrical interconnect
quite soon, requesting novel solutions and different approaches for reliable and
effective on-chip and die-to-die communication.
References
1. STMicroelectronics. UM0484 User manual: STBus communication system concepts and defi-
nitions. http://www.st.com/internet/com/TECHNICAL_RESOURCES/TECHNICAL_
LITERATURE/USER_MANUAL/CD00176920.pdf. Last accessed on October 8, 2012
Introduction
In this chapter we will discuss the most common technology aspects to implement an
optical interconnect system, and more specifically an on-chip optical interconnect sys-
tem. From an application point of view, optical interconnects should seamlessly replace
the function of electrical interconnects. This means that an optical interconnect, or an
interconnect fabric, should always have an electrical interface. With this in mind, an
optical interconnect should be a self-contained system, including the electro-optical
and opto-electrical conversions, as well as the control and switching electronics.
We will dissect optical interconnects into their constituent building blocks, and
will explore the different options for their technological implementation. We will
then go into more details on the most promising technology for on-chip intercon-
nects: silicon photonics.
The simplest optical interconnect is a point-to-point optical link connecting two
electrical systems. Such a link typically consists of a unit converting an electrical
signal into an optical signal, a medium to carry the optical signal and a unit to con-
vert it back into an electrical signal. In an on-chip link, the medium is typically an
optical waveguide, confining light along an optical transmission line. Such wave-
guides are discussed in detail in section Waveguide Circuits. Converting the opti-
cal signal into an electrical one is done through a photodetector, typically combined
with a trans-impedance amplifier to convert the photocurrent into a voltage.
Photodetectors are discussed in section Photodetectors. For the conversion of the
electrical signal into an optical one, there are basically two main approaches, based
on the choice of the light source. This is shown in Fig. 2.1.
The most straightforward way of converting an electrical signal into light is by
directly modulating a light source. In the case of a high-speed optical interconnect,
this would be a laser. In case of many links on a chip, this would require a dedicated
laser per link. As will be discussed in section Light Sources, integrating many
small laser sources on a chip is certainly technologically feasible. However, these
sources can also generate a significant amount of heat.
An alternative is to use a continuous wave (CW) light source, and subsequently
modulate a signal onto it. This approach has the advantage that only a single common
Fig. 2.1 Optical link implementation using (a) an internal directly modulated light source, and
(b) a CW external light source with a signal modulator
2 Technologies and Building Blocks for On-Chip Optical Interconnects 29
light source is required, and this source can even be placed off-chip and fed through
an optical waveguide or fiber. As will be shown in section Modulation, Switching,
Tuning, the actual signal modulators can be implemented in simpler technology than the lasers. They should also reduce the on-chip heat generation. Another
advantage of modulating a CW source, compared to a directly modulated laser, is
the possibility to use advanced phase modulation formats, effectively coding more
bits in the same bandwidth. This is very difficult to achieve using a directly modu-
lated source, where typically intensity modulation is used.
However, an external source could pose additional topological constraints, as it
requires feed-in lines for all the modulators. This could be alleviated by integrating
an on-chip CW light source per link, accompanied by a signal modulator, but this
would again carry a penalty in power consumption and chip area.
Fig. 2.2 Optical networks on a chip. (a) Circuit switched, (b) wavelength switched, (c) WDM bus
High-Contrast Photonics
Now that we have an idea of which building blocks are required for on-chip optical interconnects, we should look for the best-suited technologies and materials to
implement them. Unlike integrated electronics, photonic integrated circuits come in
a large variety of materials: glasses, semiconductors (silicon, germanium and III–V compounds), lithium niobate, polymers, etc., and each of these has its strong and
weak points. But when we look towards technologies for optical interconnects, we
can already impose some boundary conditions. The foremost constraint is one of
density: in compliance with Moore's law, electronics are steadily shrinking, and if
Silicon Photonics
Silicon is the most prominent semiconductor for electronics. But in recent years it has been shown to be a promising material for integrated photonics as well [10, 36]. It has
a high refractive index contrast with its own native oxide, and is transparent at the
commonly used communications wavelengths around 1,550 and 1,310 nm. But the
main attraction for silicon as a material for photonic integration is that it can be
processed with the same tools and similar chemistry as now used for the fabrication
of electronic circuitry [10] and even monolithically with CMOS on the same sub-
strate [7, 36, 85]. This not only leverages the huge investments in wafer-scale pro-
cessing and patterning technologies, but also facilitates the direct integration of
silicon photonics with electronics.
However, while silicon is a good material for waveguides, it is notoriously bad for active photonic functions, especially the emission of light. So to
implement a full optical link there will always be a requirement to integrate other
materials for sources and detectors. As will be discussed in section Photodetectors,
detectors can be implemented in germanium, a material that can be deposited or
epitaxially grown on silicon. However, light sources require the inclusion of efficient light-emitting materials, and III–V semiconductors are currently considered to be the best option.
III–V materials, based either on gallium arsenide (GaAs) or indium phosphide (InP), are commonly used for efficient light sources and photodetectors. They can also be used for photonic integrated circuits, and can provide a similar index contrast with glass as silicon. Also, different integration schemes to integrate active and passive functions on the same III–V chips have been demonstrated, and some are commercially available today. However, the wafer-scale fabrication technologies for III–V
32 W. Bogaerts et al.
semiconductors somewhat lag those for silicon, missing the drive of the electronics industry, and typical III–V semiconductors are not available in large-size wafers (200 or 300 mm).
Therefore, an attractive route is to combine active functions in III–V semiconductors with silicon photonics. This can be done by integrating ready-made III–V components onto a silicon photonics chip. This is definitely possible using flip-chip based technologies, but it is a relatively cumbersome process that limits the number of components that can be integrated simultaneously. Also, alignment tolerances can be quite tough to meet, which translates into a significantly higher integration cost.
The alternative is to integrate unprocessed III–V material onto the silicon in the form of a thin (local) film, and subsequently use wafer-scale processing technologies to pattern the III–V devices. The obvious technique to integrate the III–V material would seem to be direct epitaxy, but the crystal lattice mismatch of III–V materials with silicon is typically too large for this to be effective, and while there are some demonstrations of III–V growth on silicon (directly or through a germanium interface layer), the large number of dislocations generated degrades the optical quality of the III–V material.
The alternative to direct epitaxy is the use of bonding. Small III–V dies are locally bonded to a silicon wafer, which can already be patterned with photonic circuitry. After bonding, the III–V material can be thinned down to a thin film. The actual bonding can be done in different ways, either directly, making use of molecular forces, or through the use of an intermediate adhesive or metal layer. The merits of the different technologies are discussed in section Light Sources.
After the integration of the III–V material on silicon, the actual devices can be further processed on wafer scale, and patterned using the same lithographic techniques used for silicon processing. However, when this processing is done in silicon fabs that also process electronics, care should be taken not to contaminate tools with III–V compounds. Also, the integration of III–V material into a fully functional photonic/electronic chip, including the electrical contacting, is not straightforward.
An optical interconnect only makes sense when it is tightly integrated with the elec-
tronic systems it needs to interconnect. While the optical interconnect is primarily
devised to support the electronics, the interconnect subsystems also require dedi-
cated electronics for driving and control. The actual integration strategy, i.e. how to
combine the optical interconnect layer with the electronics, can have a strong impact
on the performance, the floor space and ultimately the cost of the full component.
However, the essential point from an integration perspective, one that holds for all the technologies discussed throughout this chapter, is that everything should ultimately be compatible with wafer-scale processing.
In section Integration in an Electronics Interconnect we discuss a number of
integration options for silicon photonics interconnect layers in a traditional
electronics chip. One of the main criteria is the position of the photonics fabrica-
tion in the entire electronics fabrication flow. Here we can discern between front-
end-of-line processes (the photonics sitting at the same level as the transistors),
back-end-of-line processing (the photonics is positioned between or directly on
top of the metal interconnect layers) or 3-D integrated (the photonics is processed
separately and integrated as a complete stack on the electronics). These options
are also illustrated in Fig. 2.21.
Waveguide Circuits
Photonic integrated circuits can combine many functions on a single chip. Key to
this, and especially in the context of interconnects, is to transport light efficiently
between functional elements of the chip. The most straightforward way for this is
through optical waveguides which confine light to propagate along a line-shaped
path. As we will see further, these waveguides can also be used as functional ele-
ments themselves, especially by manipulating multiple delays to obtain interfer-
ence, which in turn can be used to construct filters for particular wavelengths.
Optical Waveguides
Optical waveguides need to confine light along a path on chip, so it can be used to
transport a signal between two points. The most straightforward way to construct a
waveguide is to use a core with a high refractive index surrounded by a lower refrac-
tive index. A well-known example of such a waveguide is an optical fiber, consist-
ing of two types of glass with a slight difference in refractive index. Most optical
waveguides have an invariant cross section along the propagation direction. The
propagation inside the waveguide can then be described in terms of eigenmodes: a
field distribution in and around the core that propagates as a single entity at a fixed
velocity. Such an eigenmode is characterized by a propagation vector β or an effective refractive index neff. The propagation speed of the mode in the waveguide is given by c/neff, with c the speed of light in vacuum. Depending on their dimensions, waveguides can support multiple eigenmodes that propagate independently with their own neff.
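As a small worked example (our own, with an assumed effective index), the phase velocity follows directly from neff:

```python
C_VACUUM = 299_792_458.0  # speed of light in vacuum, m/s

def phase_velocity_m_s(n_eff: float) -> float:
    """Propagation speed of a guided eigenmode: v = c / n_eff."""
    return C_VACUUM / n_eff

# A quasi-TE mode of a silicon photonic wire with an assumed n_eff of 2.4
# propagates at roughly 1.25e8 m/s:
v = phase_velocity_m_s(2.4)
print(v)
```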
On a chip, there are many more ways to construct a high-index waveguide core: glasses, polymers and different types of semiconductor are the most
straightforward. Especially the last category is relevant: as already explained, optical
waveguides can be made more compact when there is a high index contrast between
core and cladding. This makes semiconductors extremely attractive, and silicon in
particular, because of its compatibility with CMOS fabrication processes.
To construct a submicrometer waveguide in silicon, a cladding material with a
low refractive index is needed. Silica (SiO2) is perfectly suited for the purpose,
Fig. 2.3 High-contrast silicon waveguide geometries. (a) Photonic wire strip waveguide, (b) rib
waveguide
resulting in an index contrast of 3.45:1.45. The cladding material should surround the entire waveguide core, however, and this requires a layer stack of silicon and
silica. Such silicon-on-insulator (SOI) wafers are already used for the fabrication of
electronics, and can be commercially purchased from specialized manufacturers,
such as SOITEC. The high-quality substrates are typically fabricated through wafer
bonding, where partially oxidized wafers are fused together by molecular bonding.
By carefully implanting one of the wafers with hydrogen prior to bonding, a defect
layer can be formed at a precise depth, and the substrate of the wafer can be removed,
leaving a thin layer of silicon on top of a buried oxide (BOx). Such an SOI stack for nanophotonic waveguides has a typical silicon thickness of 200–400 nm, and a buried oxide at least 1 µm, and preferably 2 µm, thick, to avoid leakage of light into the silicon substrate. This gives a high refractive index contrast in the vertical direction.
To create an in-plane index contrast, the SOI layer is patterned, typically using
a combination of lithography and plasma etching [10, 15, 81]. Depending on the
etch depth, different waveguide geometries can be obtained. The most common are
illustrated in Fig. 2.3. A strip waveguide, often called a photonic wire, has a fully
etched-through cladding and offers the highest possible contrast in all directions.
Alternatively, a rib waveguide has a partially etched cladding and has a weaker
lateral contrast.
The lateral contrast has a direct impact on the confinement. The larger the lateral
confinement, the smaller the mode size can be, the closer the waveguides can be
spaced without inducing crosstalk, and the tighter the bend radius can be.
Photonic wires typically consist of a silicon core of 300–500 nm width and 200–400 nm height. Several groups have standardized on 220 nm thick silicon, as substrates with this thickness can be purchased off the shelf. The core dimensions are
dictated by several factors. First, it is desirable to confine the light as tightly as possible. This does not mean that the core can be shrunk indefinitely: at certain dimensions, the size of the optical mode will be minimal, and for smaller cores the
mode will expand again. For a 220 nm thick silicon core, the optical mode is smallest
for a width around 450 nm at wavelengths around 1,550 nm. For this configuration,
not all the light is confined to the silicon, but a significant fraction of the light (about
25%) is in the cladding. With such waveguides, it is possible to make bends with a 2–3 µm bend radius with no significant losses.
A second thing to consider is the single-mode behavior of the waveguide.
As optical waveguides get larger (for the same index contrast), they can support
more eigenmodes. While these propagate independently they can couple when there
is a change in cross section (e.g. a bend, a crossing, a splitter). This can give rise to
unwanted effects such as multi-mode interference, losses and crosstalk. Therefore,
it is best to have a waveguide which only supports a single guided mode. This can
be done by keeping the cross section sufficiently small. In the same SOI layer stack
of 220 nm thickness, all higher-order modes are suppressed for widths below 480 nm,
again for wavelengths around 1,550 nm.
Finally, there is the issue of polarization: modes in optical waveguides can be
classified according to their polarization, i.e. the orientation of the electric field components. On an optical chip, this classification is typically done with respect to the plane of the chip. We find quasi-TE (Transverse-Electric field) modes with the E-field (almost) in the plane of the chip, and quasi-TM (Transverse-Magnetic field) modes which have their E-field in the (mostly) vertical direction. In the case of a vertically
symmetric waveguide cross section (e.g. a rectangular silicon wire completely sur-
rounded by oxide), the waveguide will always support both a TE and a TM mode (so
the waveguide is never truly single-mode), but the TE and TM modes are fully decou-
pled: as long as the vertical symmetry is maintained, there will be no mode mixing
between the TE and the TM ground mode, not even in bends or splitters. Whether the
TE or the TM mode is the actual ground mode of the waveguide depends on the cross
section: the mode with the E-field along the largest core dimension will have the highest effective index. In the case of a waveguide cross section which is wider than its height, the TE mode is the ground mode. For a perfectly square cross section, the TE and TM modes are degenerate. Typically, photonic wires are wider than they are high, because this is easier to fabricate (printing wider lines, and etching less deep). They are therefore most commonly used in the TE polarization.
The essential figure of merit for photonic wires is their propagation loss: the
lower the loss, the longer an optical link can be for a given power budget. Photonic
wires fabricated with high-resolution e-beam lithography have been demonstrated
with losses as low as 1 dB/cm [48], meaning they still retain 80% of the optical
power after 1 cm propagation. For waveguides defined with optical lithography,
such as used for the fabrication of electronics, the propagation losses are slightly
higher, of the order of 1.4 dB/cm [13]. These losses are mainly attributed to scatter-
ing at roughness induced by the fabrication process, and absorption at surface states.
Making waveguides wider reduces the modal overlap with the sidewall, which
reduces the waveguide loss, even down to 0.3 dB/cm [99], but at the cost of a taper-
ing section and loss of single-mode behaviour.
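The dB/cm figures above convert to a transmitted power fraction as 10^(−loss·L/10); a quick sketch:

```python
def transmitted_fraction(loss_db_per_cm: float, length_cm: float) -> float:
    """Fraction of optical power remaining after propagation."""
    return 10.0 ** (-loss_db_per_cm * length_cm / 10.0)

# 1 dB/cm over 1 cm leaves ~79% (the ~80% quoted in the text);
# the widened 0.3 dB/cm waveguides leave ~93% over the same distance.
print(transmitted_fraction(1.0, 1.0))
print(transmitted_fraction(0.3, 1.0))
```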
Because of their small feature size, the properties of photonic wires are fairly
wavelength dependent: the effective index as well as the exact mode profile changes
Coupling Structures
Fig. 2.4 Coupling structures for optical chips. (a) Spot-size converter for edge coupling, (b) vertical
grating coupler
optical interconnects, this is not a main concern, as light does not have to leave the
chip. However, there are two easily identified exceptions: in the case where an
external light source is used, this light has to be coupled to the chip. A second aspect is the extension of on-chip interconnects to multi-chip modules. For the sake
of completeness, we will briefly discuss the two main options for coupling light to
a chip: edge coupling in the plane of the chip, and vertical coupling. As the on-chip
waveguides typically have a different cross-section than the off-chip mode, a spot-
size converter will be necessary. The most relevant coupling structures are illus-
trated in Fig. 2.4.
As the on-chip waveguides are oriented in the plane of the chip, it is relatively
easy to transport the light to the edge of the chip. At the edge, the small wire mode
should be converted to a fiber-matched mode. The traditional approach to this is
including an adiabatic taper, consisting of a gradually narrowing silicon waveguide:
for very small widths, the light is no longer confined and the mode expands. This
larger mode is then captured by a larger waveguide (in oxide, oxynitride or poly-
mers) which can couple directly to a fiber at the polished facet of the chip. This
tapering approach has two advantages: it is a fairly simple and tolerant concept to
manufacture once you have a patterning technology capable of sub-100 nm features,
and it works over a broad wavelength range. Coupling efficiencies of 90% have
been demonstrated [96]. However, the edge-coupling approach has significant draw-
backs as well. The number of ports that can be accommodated at the edge is limited,
and the path of the optical waveguide to the edge should not be crossed by any
obstacle, such as metal interconnects. The taper structures are also quite large,
requiring lengths of several hundred micrometers. Finally, the optical ports are only
accessible after dicing the wafer and polishing the facets: this makes wafer-scale
testing and selecting known-good-dies for further processing difficult.
The alternative is vertical coupling: using a diffraction grating, light can be cou-
pled from an on-chip waveguide to a fiber positioned above the chip. The grating
can be implemented as etched grooves [10, 102], metal lines [103], or subwave-
length structures [74]. Such structures attain coupling efficiencies of over 30%. By
engineering the grating layer structure, higher coupling efficiencies of 70% have
been demonstrated [112, 114]. The gratings can be made quite compact by design-
ing them such that the fiber-size spot is focused directly into the core of a photonic
wire waveguide [111].
However, because the grating is a diffractive structure, its behavior is wavelength
dependent. The typical operational bandwidth (at 50% or 3 dB) is quite large:
60–80 nm. This is possible because of the very high refractive index contrast of the
silicon waveguides.
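The wavelength dependence follows from the first-order phase-matching condition of the grating, Λ = λ/(neff − nc·sinθ); a sketch with assumed values (effective index of the grating region, oxide top cladding, near-vertical fiber angle):

```python
import math

def grating_period_m(wavelength_m: float, n_eff: float,
                     n_clad: float, angle_deg: float) -> float:
    """First-order phase matching for a fiber grating coupler:
    n_eff - n_clad * sin(theta) = wavelength / period."""
    return wavelength_m / (n_eff - n_clad * math.sin(math.radians(angle_deg)))

# Assumed values: 1550 nm light, grating-region n_eff ~ 2.8, oxide cladding
# (n = 1.45), fiber tilted 10 degrees off vertical -> period of roughly 610 nm.
period_nm = grating_period_m(1.55e-6, 2.8, 1.45, 10.0) * 1e9
print(period_nm)
```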
There is also the matter of fabrication: the best devices require more complex
fabrication techniques, and deviations in the fabrication will quickly lead to a shift
in wavelength or a drop in efficiency. The vertical couplers do have significant oper-
ational advantages: they are more tolerant to fiber alignment errors, can be used
directly on the wafer for testing and die selection, and can be positioned anywhere
on the chip, giving more flexibility for packaging or testing.
The term "vertical" should be treated with caution, though. When designing
the diffraction grating for true vertical coupling, one introduces several (unwanted)
parasitic effects. For one, the grating will become strongly reflective: it will also act
as a Bragg reflector, reflecting light from the waveguide back into the waveguide.
This can be partially reduced by engineering the grating [89]. Also, a vertical grat-
ing is symmetric, and symmetry-breaking schemes should be implemented to avoid
the grating coupling to both directions of the waveguide. For fiber coupling, the
solution is to use fibers polished at an angle. But in situations where vertical coupling is a necessity (e.g. integration of a vertical light source), additional measures, such as a lens or a refracting wedge, are required [92].
Both the multiplexing and the routing require wavelength selective elements.
On a chip, these are best implemented by interference of two or more waves with a
wavelength-dependent path length difference. This can be self-interference in a
resonator, two-path interference in a (cascaded) MachZehnder interferometer, or
multipath interference. In all cases, the physical length of the delay line scales
inversely with the group index, so photonic wires are well placed to implement
these wavelength-selective functions. On the other hand, the free spectral range
(FSR) of a filter is the wavelength spacing between two adjacent filter peaks, and a
filter should have a sufficiently large FSR to cover a broad band of signal wavelengths
in WDM. For this, the delay length should be sufficiently short, and here the photonic
wires' sharp bend radius and tight spacing allow FSRs which are difficult to construct
with other waveguide technologies.
Fig. 2.5 Ring-resonator filters. (a) All-pass filter consisting of a single ring on a bus waveguide.
(b) Add-drop filter, which drops wavelength channels from the bus waveguide to the drop port
A Mach-Zehnder interferometer (MZI) splits the light over two delay arms and recombines
it; its transmission is periodic in wavelength, with a period (free spectral range, or FSR)
inversely proportional to the arm length difference and the group
index of the waveguide (Fig. 2.6).
For splitting and combining the light, one can make use of directional couplers
or multi-mode interferometers (MMI). In the former, light can couple between two
adjacent waveguides, and the coupling strength can be controlled by the length of
the coupler or the width of the gap. However, in a photonic wire geometry this gap
is difficult to control accurately. MMIs use a broad waveguide section which supports
multiple modes to distribute the light to two or more output waveguides. They have
been proven to be more tolerant than directional couplers for 50% coupling ratios,
but arbitrary ratios are more difficult to design accurately.
While single MZIs have a sinusoidal wavelength response, they can be cascaded
to obtain more complex filter behavior. This can be done through a cascade where
the MZIs are stacked in series, or directly using common splitter and combiner sec-
tions [12, 35, 105, 123].
As MZI-based filters are nonresonant, they don't suffer from nonlinear effects,
but they typically require a much larger footprint than a ring-resonator-based filter
for a similar filter response.
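The cascading idea can be sketched numerically. The single-stage transmission below and the choice of stage delays (doubling per stage) are illustrative assumptions, not a specific design from the text:

```python
import math

def mzi_T(wavelength_nm, group_index, delay_nm):
    """Power transmission of a single lossless MZI: cos^2 of half the phase difference."""
    phase = math.pi * group_index * delay_nm / wavelength_nm
    return math.cos(phase) ** 2

def cascade_T(wavelength_nm, group_index, delay_nm, stages):
    """Series cascade with delays dL, 2*dL, 4*dL, ...: the responses multiply."""
    t = 1.0
    for k in range(stages):
        t *= mzi_T(wavelength_nm, group_index, delay_nm * 2 ** k)
    return t

n_g = 4.3
dL = 20 * 1550.0 / n_g   # delay chosen so a transmission peak sits at 1550 nm
print(cascade_T(1550.0, n_g, dL, 3))  # ~1.0 at the peak
# off the peak, the cascade is never more transmissive than a single stage:
print(cascade_T(1560.0, n_g, dL, 3) <= mzi_T(1560.0, n_g, dL))
```

Because each extra stage multiplies in another factor at most equal to one, the cascade sharpens the passband at the cost of footprint, as the text notes.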
The principle of an MZI can be extended to multiple delay lines: input light can be
split up between an array of waveguides with a wavelength-dependent phase delay.
When the outputs of these delay lines are arranged in a grating configuration, the
distributed light will be refocused in a different location depending on the phase
2 Technologies and Building Blocks for On-Chip Optical Interconnects 41
delay (and thus, on the wavelength). This way, one component can (de)multiplex a
multitude of wavelength channels simultaneously [32]. Again, silicon photonic
wires can make the delay lines of the arrayed waveguide grating (AWG) shorter and
arrange them in a more compact way than other waveguide technologies (Fig. 2.7).
A 1 × N AWG with one input waveguide can serve as a multiplexer. However, if
designed properly, an N × N AWG can be used to route light from any input to any
output, based on the choice of wavelength at the input. This is done by carefully
matching the FSR of the AWG to N times the wavelength channel spacing [34].
Because of the high index contrast, silicon AWGs typically perform worse than
glass-based components (but with a much smaller footprint), with crosstalk levels
around 20–25 dB [14].
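The routing rule of such a cyclic N × N AWG can be sketched as follows (an idealized model; the actual port ordering depends on the design):

```python
# Idealized cyclic routing in an N x N AWG with FSR = N x channel spacing:
# wavelength channel k entering input i exits output (i + k) mod N.
# (Illustrative model, not a specific device from the text.)

def awg_output_port(input_port: int, channel: int, n: int) -> int:
    return (input_port + channel) % n

N = 4
for i in range(N):
    outputs = [awg_output_port(i, k, N) for k in range(N)]
    # every input reaches all N outputs, each on a distinct wavelength channel
    assert sorted(outputs) == list(range(N))
print("non-blocking wavelength routing verified for N =", N)
```

This is why matching the FSR to N times the channel spacing turns a multiplexer into a passive, non-blocking wavelength router.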
AWGs typically have a Gaussian-shaped transmission band. However, by engi-
neering the geometry of the access waveguides, a more uniform pass band can be
obtained, typically with a penalty in insertion loss of approximately 3 dB. More
elaborate synchronized schemes can reduce this loss by cascading an additional
interferometer to the AWG [33, 118]. Such techniques have been demonstrated for
mature silica waveguide technology.
Fig. 2.7 Arrayed waveguide grating in silicon on insulator. (a) Operating principle. (b) Eight-
channel AWG in silicon, presented in [14]. (c) Plot of the transmission of the eight output channels
based on the data of the device from [14]
A planar curved grating (PCG), a curved diffraction grating etched into a slab region,
images the input light wavelength by wavelength onto the different output
waveguides [19, 20]. Here, too, it is possible to configure
the input and output waveguides in a router configuration. Performance of PCGs
is similar to that of silicon AWGs, with crosstalk levels around 20–30 dB [19].
The choice whether to use an AWG or PCG is very much dependent on the chan-
nel spacing and number of channels [14] (Fig. 2.8).
All filters discussed here rely on a wavelength-dependent phase delay. This implies
a good control of the dispersion (the wavelength dependence of the neff and ng of the
waveguide or slab area). In silicon photonics, the dispersion is very dependent on the
actual fabricated geometry. For wire-based filters, nanometer variations in line
width can result in wavelength shifts in the order of nanometers. Therefore, accurate
control of the fabrication process is extremely important, and on top of that, active
tuning or trimming of the delay lines is often necessary.
Fabrication Accuracy
Fig. 2.8 Planar curved grating (or Echelle grating) in silicon on insulator. (a) Operation principle. (b)
Example of a four-channel PCG from [20]. (c) The transmission plotted based on the data from [20]
This level of control requires state-of-the-art fabrication environments. It has been shown that it is indeed possible to control the
average delay line width (and therefore the peak wavelength) of a ring resonator or
an MZI to within a nanometer between two devices on the same chip, and within
2–3 nm for devices on the same wafer or even between wafers [94].
Even with this process control, it is not possible to manufacture wavelength
filters with subnanometer accuracy while still maintaining practical process tolerances.
In a typical CMOS fabrication process, tolerances of 5–10% are used.
While photonic wire waveguides are much larger than today's state-of-the-art
transistor features, the tolerances are stricter, and thus well below 1% of the
critical dimensions.
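The sensitivity behind these numbers can be estimated with a first-order relation, Δλ ≈ (∂neff/∂w)·Δw·λ/ng; the coefficient used below is an assumed typical value for a silicon wire, not a number from the text:

```python
# Back-of-envelope sensitivity of a wire-based filter to line width errors.
# dn_eff/dw on the order of 1e-3 to 2e-3 per nm is an assumed typical value.

def peak_shift_nm(dneff_per_nm, dw_nm, wavelength_nm=1550.0, n_g=4.3):
    """First-order peak wavelength shift for a width error dw_nm."""
    return dneff_per_nm * dw_nm * wavelength_nm / n_g

# A 1 nm width error already shifts the filter peak by most of a nanometer:
shift = peak_shift_nm(2e-3, 1.0)
print(f"{shift:.2f} nm peak shift per nm of width error")
```

This is consistent with the statement above that nanometer width variations produce nanometer-order wavelength shifts, and it explains the sub-1% tolerance requirement.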
Temperature Control
In addition to fabrication control, the WDM filters should also be tolerant to different
operational conditions, most notably a broad temperature range. Temperatures can
vary wildly within an electronics chip, with hot-spots popping up irregularly. Silicon
photonic wires are very susceptible to temperature variations, and in filters this results
44 W. Bogaerts et al.
in a peak wavelength shift of the order of 50–100 pm/°C. This effect can be reduced
in various ways. One can design the waveguide to have no thermal dependence, by
incorporating materials which have the opposite thermal behavior to silicon: some
polymer claddings have been demonstrated to work well for this purpose [104], but
these then introduce many questions on fabrication and reliability.
Alternatively, active thermal compensation could be used by including heaters,
or even coolers, in or near the waveguides. Such thermal tuning can compensate the
remaining process variations, but it introduces additional power consumption for
the heaters, as well as the necessity for control and monitoring circuitry.
The heaters themselves can be incorporated as metallic resistors [29, 44], using
silicides or doped silicon [108, 117] or even use the silicon of the waveguide itself
as a heater element [50].
Light Sources
The light source problem is probably the most controversial technological
challenge in silicon photonic optical interconnects. It is well known that
crystalline silicon cannot emit light efficiently due to its indirect bandgap. This makes
monolithically integrated lasers very difficult, and opens the door to a large number of
light source alternatives.
In the specific case of on-chip optical interconnect, a number of requirements
are imposed on candidate light sources. First of all, they have to be electrically
pumped and work under continuous wave or be directly modulated, depending on
the interconnect scheme from Fig. 2.1. Also, they should be efficient and have a low
threshold current, in order to reduce the energy per bit in the link.
The most straightforward solution is to use a commercially available InP based
laser diode and integrate it onto the SOI circuits. The laser power will then be dis-
tributed over the whole chip and shared as a common optical power supply by all
the links, as shown in Fig. 2.1b. A more challenging scheme is to implement an
individual on-chip laser for each link, which can then either be used in CW or be
directly modulated (cf. Fig. 2.1a). In the following part of the section, we will dis-
cuss in detail the implementation and challenges of the two schemes.
Off-Chip Lasers and Interfaces to the SOI Circuits
Using an off-chip laser decouples the light source problem from the silicon photonics,
obviating the need to build a light source in/on silicon. Also, the laser diode can
be tested and selected prior to assembly. The challenging issue here is the optical
coupling interface between a laser diode and an SOI waveguide.
In its simplest form, the laser diode is just fiber-pigtailed and connected to the sili-
con chip using the fiber couplers discussed in section Waveguide Circuits. But the
Fig. 2.9 Off-chip laser sources. (a) Fiber pigtailed source, (b) laser subassembly mounted on a
non-vertical grating coupler [36]. (c) VCSEL mounted on a vertical grating coupler, (d) VCSEL
mounted on a non-vertical grating coupler with a refracting wedge [92]
laser diode can also be mounted on the chip itself, either as a bare chip or a subas-
sembly. In that case, the coupling scheme should be adapted to the particular laser
diode. An example is the laser package developed by Luxtera, which couples the
horizontal laser light into a vertical grating coupler by means of a reflecting mirror
and a ball lens integrated in a micropackage on top of a silicon photonics chip [36].
When using vertical coupling, vertical-cavity surface emitting lasers (VCSEL)
are very attractive: such devices can be flip-chipped directly on top of a grating
coupler. However, this means the grating coupler should work for the vertical
direction. As discussed in section Waveguide Circuits it is not straightforward
to implement truly vertical grating couplers. One problem is that due to symmetry
the grating coupler will diffract the source light into the waveguide on both sides
of the grating. A solution suggested by Schrauwen et al. is to use a refractive
angled interface to deflect the perfectly vertical light to a non-vertical grating
coupler (Fig. 2.9).
The disadvantage of using an off-chip device is the strict optical alignment needed for
integration, and such a process has to be repeated sequentially if multiple lasers are
going to be used. This is especially true in a WDM environment, where a light source
for each wavelength channel is needed. This can be accomplished by a multi-wavelength
laser or comb-laser, or by connecting an individual laser for each channel.
On-Chip Lasers
When bringing the lasers to the chip, a number of new possibilities arise. When the
lasers are integrated with the photonic circuitry, a much higher density can be
achieved. Lasers can be integrated close to the modulators, and much denser link
networks can be built. Also, the lasers can be batch processed.
In order to achieve optical gain on SOI, one needs to integrate new materials with
optical gain or modify silicon itself at the early stage of the chip fabrication. Various
methods have been proposed. One of the most successful so far is the heterogeneous
integration of III–V materials on SOI based on bonding technology, which will be
discussed in the following part. Some advanced on-chip lasers based on other
approaches will also be reviewed.
In this bonding approach, unprocessed III–V epitaxial dies are placed upside down
on top of an SOI wafer. The SOI wafer can be either patterned or
unpatterned. If necessary, multiple III–V dies with different epi-structures can be
bonded on the same SOI wafer for realizing different functionalities. The III–V dies
are still unpatterned at this stage. Thus, only a coarse alignment to the underlying
SOI structures is necessary. Then, the InP substrate is removed by mechanical grinding
and chemical etching. To isolate the etching solution from the device layers, an
etch stop layer (usually InGaAs/InP), which will be removed subsequently, is
embedded between these layers and the substrate. The devices in the III–V layers
are then lithographically aligned and fabricated with standard wafer-scale processing.
As compared to the approach of using off-chip lasers mentioned in section
Off-Chip Lasers and Interfaces to the SOI Circuits, the tight alignment tolerance
during the bonding process is much relaxed here.
To realize bonding between the III–V dies and the SOI wafer, there are two
common techniques: direct (molecular) bonding and adhesive bonding. In the first
approach, a thin layer of SiO2 is first deposited on top of the III–V dies and the SOI
wafer. For a patterned SOI, the wafer should be planarized and polished through a
chemical mechanical polishing (CMP) process [53, 110]. The initial bonding of the
III–V dies and SOI wafer is achieved through the van der Waals force. Such an
attraction force is only noticeable when the two surfaces are brought close together,
within a few atomic layers. Thus, in order to make the van der Waals attraction take
place over a large portion of the bonded interfaces, the surfaces of the III–V dies and
SOI wafer must be particle-free, curvature-free, and ultra-smooth. The bonded
stack is subsequently annealed, usually at a relatively low temperature (up to
300 °C), in order to avoid cracks induced by the thermal expansion coefficient mismatch
between the III–V and silicon. A stronger covalent bond will then form
if the two bonded surfaces are chemically activated before contacting. Without
the aid of SiO2, direct bonding of III–V material and silicon is also possible through
O2 plasma activation of both surfaces and incorporation of vertical outgassing
channels on SOI [66].
Alternatively, in the adhesive bonding approach, a bonding agent, usually a poly-
mer film, will be applied in between the two bonded surfaces. Due to the liquid form
of the polymer before curing, the topography of surfaces can thus be planarized, and
some particles, at least with diameters smaller than the polymer layer thickness, are
acceptable. The whole stack will also undergo a curing step at an appropriate tem-
perature depending on the chosen polymer. The most successful implementation of
this technology on the related devices mentioned in this book is done through DVS-
BCB polymer, due to its good planarization properties, low curing temperature
(250 °C), and resistance to common acids and bases [88].
The optical coupling from the bonded active laser cavities to the passive SOI wave-
guides is one of the most challenging issues in designing a micro-laser based on the
III–V/SOI heterogeneous integration technology. In order to accommodate the
Fig. 2.11 III–V bonded stripe laser geometries. (a) Fabry-Perot laser with integrated polymer
mode converter between III–V and silicon waveguides [90], (b) fabricated laser, (c) bonded III–V
laser with thick gain section and inverted-taper mode conversion to the silicon waveguide [63], (d)
III–V/SOI hybrid waveguide structure with evanescent gain section [40]
pin junction and facilitate an efficient current injection, a thick III–V epi-layer
structure is necessary. This will normally result in a low index contrast in the vertical
direction for the III–V waveguide. However, a single-mode SOI wire waveguide
has a high index contrast in all directions. The mismatch in the mode profiles and
the effective mode indices makes the out-coupling of the laser light difficult. A solution
is to use a mode converter, made of an SOI inverse taper and a polymer waveguide,
for interfacing a single-mode SOI waveguide to a III–V Fabry-Perot (FP) laser
cavity, as shown in Fig. 2.11a,b [90]. The structure is designed and fabricated in a
self-aligned manner. Despite the fact that this mode converter has a large footprint,
efficient light output with power up to 1 mW in the SOI waveguide was obtained,
and subsequent optimizations of such mode converters (Fig. 2.11c) have demonstrated
power up to 3 mW on both ends of the laser cavity [63]. The advantage of this
approach is that in the bonded region most of the light is in the III–V material and
will experience strong gain, while the laser mirrors can be implemented in the
silicon.
An alternative approach, proposed by Fang and coworkers, is to use an ultra-thin
bonding layer [40]. As shown in Fig. 2.11d, the III–V layers and silicon in this case
can be considered together as one hybrid waveguide. Here, a large portion of the
guided power is still located in silicon, and the overlap with the active III–V materials
is smaller. This implies that the gain per unit of length of such a structure will also be
smaller. Still, with proper design a sufficient overlap with the gain medium can be
achieved. Based on such a waveguide structure, stand-alone FP lasers were introduced
initially, and integrated distributed feedback (DFB) lasers, distributed Bragg reflector
(DBR) lasers, and ring lasers were also demonstrated subsequently [38–40].
Partly because of the limited gain caused by the small modal overlap, the laser
devices mentioned above still have a relatively large footprint (100 µm to 1 mm).
They can give a lasing power of several mW with performance similar to that of an
off-chip laser. This kind of device is also ideal for the implementation shown in
Fig. 2.1b, where a CW laser is used as an optical power supply. However, because
they are still quite large, such lasers cannot be modulated directly at speeds
required for optical links, as in Fig. 2.1a.
For this, a true micro-laser with a dimension of several microns is the logical
candidate. Such small lasers can be implemented at any position where an electro-optical
interface is needed. The best examples of such microlasers are based on
microdisks, coupled to a single-mode silicon wire waveguide [110]. This is shown
in Fig. 2.12a, b. Different from the approaches mentioned above, the out-coupling
here is based on the evanescent coupling from the cavity resonant mode to the
guided mode in the silicon waveguide. Due to the mode index mismatch, a very high
coupling is still difficult to achieve, but the coupling should not be too high anyway,
so as not to destroy the cavity resonance. Single-mode output power over 100 µW
under continuous-wave operation with a microdisk cavity of 7.5 µm diameter and a threshold
current of 0.38 mA was obtained, as shown in Fig. 2.12c, d [100]. These lasers are
quite small, and they can be directly modulated. Direct current modulation up to
4 Gb/s was achieved [75]. Also, as the lasers are evanescently coupled to the bus
waveguides, several microdisks can be cascaded on one silicon waveguide: for
instance, a 4-channel multiwavelength laser source for WDM applications has been
demonstrated, as shown in Fig. 2.12e, f [109].
A different form of such a micro-laser uses a micro-ring cavity which is laterally
coupled to a silicon waveguide. Continuous wave lasing was achieved with rings of
diameters as small as 50 µm [67].
Optical gain in silicon can be achieved through various optical nonlinear effects [16,
43, 91], which led to the first realization of a silicon laser [16, 17]. However, such a
device based on purely optical effects cannot possibly be pumped electrically, which
makes such a laser unsuitable for on-chip optical interconnect. Gain through carrier
population inversion, which can be electrically pumped, is quasi-impossible in silicon,
since it is an indirect bandgap material with very inefficient radiative recombination
of carriers. Still, locally confining the carriers in, e.g., silicon nanocrystals,
provides an approach to increase the radiative recombination probability, and net
optical gain was demonstrated [84]. However, for silicon nanocrystals the gain
Fig. 2.12 (a) Schematic structure, (b) light-current-voltage curve, and (c) spectrum of a III–V
microdisk laser on an SOI waveguide [100, 113]. The light power was measured in the access fiber,
which is about one third of that in the SOI waveguide. (d) Spectrum and fabricated structure (inset)
of a multiwavelength laser [109]
wavelength is within the visible band, which is not suitable for integration with silicon
waveguides. A nanocrystal-based gain material for longer wavelengths would
require IV–VI semiconductors, with bulk bandgaps beyond 2 µm.
Erbium doping, which is widely used in fiber amplifiers, provides another route
to implement gain in silicon. Net material gain was achieved in the 1.55 µm wavelength
band, but no laser action was reported so far [51, 77].
Finally, an approach which has drawn a considerable amount of interest and some
recent promising results is the epitaxial growth of germanium on silicon for mono-
lithic lasers. Although Ge is also an indirect bandgap material, the offset between the
direct and the indirect bandgap is sufficiently small that bandgap engineering can be
done to stimulate radiative recombination from the direct bandgap valley. By using a
combination of strain and heavy n-type doping the germanium can be turned into a
direct-bandgap material [73]. Based on this approach, an FP laser working under
pulsed operation has been demonstrated through optical pumping [72]. With electri-
cal pumping, such a laser could provide an ideal light source for on-chip optical
interconnects, as germanium is already present in many CMOS fabs.
For many practical purposes it is essential that the function of an optical chip can
be electrically controlled. This is especially true in interconnects, where an electri-
cal signal should be imprinted on an optical carrier, transported through an optical
link or network, and then converted back to an electrical signal. This requires several
functions in which optical components are electrically actuated:
Signal modulation. The electrical signal should be imposed on an optical carrier,
which requires a very fast mechanism to change the optical properties of a waveguide
circuit. Required modulation speeds range from 1 GHz over 10 GHz and 40 GHz
to even beyond 100 GHz.
Switching. The optical signal should be routed through the network. In the case of
a switched network topology, the switch should be sufficiently fast to rapidly estab-
lish and reroute connections, but it should consume as little power as possible to
maintain its state once the switching operation is performed. Depending on the
configuration, switching speeds can be ms to ns.
Tuning. As discussed in the section on passive waveguides, the fabrication technol-
ogy is far from perfect, and especially in WDM configuration the operation condi-
tions often require active tuning to keep the WDM filters spectrally aligned. Tuning
is typically a rather slow process (µs to ms) but should require low power.
To imprint a signal on an optical carrier at a given wavelength, one can modulate either
the amplitude or the phase. When propagating through a (waveguide) medium this
involves a modulation of the absorption or the refractive index, respectively. The
simplest form of modulation is direct amplitude modulation, or on-off keying
(OOK). This scheme is exceptionally easy to decode at the receiver side, as it only
requires a photodetector. Electrical amplitude modulators can be based on elec-
troabsorption effects, i.e. band-edge shifts driven by an external electric field, but
this only works at a given wavelength. Alternatively, phase modulation encodes the
signal in the phase of the light, and this is generally a broadband effect. This makes
a more efficient use of the spectrum, and an electrical modulator now requires only
a change in refractive index, not absorption: this is much easier to achieve. At the
receiver side, things become more involved, though, requiring multiple detectors
or interferometric structures. More advanced modulation schemes involve multiple
amplitude or phase levels, making much more efficient use of the spectrum, but
requiring much more complex detection schemes at the receiving end. The modu-
lation format (phase or absorption) can be decoupled from the actual physical
modulation effect (absorption or index change). This is shown in Fig. 2.13. OOK
can be achieved using direct absorption modulation, but also using phase modula-
tors in conjunction with an interferometer or resonator. A phase modulator can also
Fig. 2.13 Electro-optic amplitude and phase modulation. (a) An electrical signal drives an electro-absorber,
where the absorption edge shifts as a function of the electric field. (b) An electro-optic
phase shifter changes the optical length of the cavity, and thus shifts the resonance wavelength.
(c) A phase shifter in the MZI changes the phase difference in the two arms from constructive to
destructive interference. (d) Two amplitude modulators in a MZI will act as a phase modulator
Thermal Tuning
Mechanical Tuning
Fig. 2.15 Integrated heating mechanisms: (a) metal top heater. (b) Silicide (or highly-doped) side
heater. (c) Top metal heater with insulation trenches. (d) Heater inside the waveguide core
Fig. 2.16 Mechanical waveguide actuation: (a) apply strain by actuating from the substrate. (b)
Electrostatically moving waveguide butt coupling. (c) Actuating the spacing in a directional cou-
pler. (d) Actuating the slot width of a slot waveguide
Carrier Manipulation
As discussed, the fastest and most efficient phase modulators are based on direct electro-
optic effects. However, because it has a centro-symmetric lattice, silicon does not have
the required second-order (Pockels) effect. While it is possible to induce this effect using
strain [57] to break the lattice symmetry, this requires substantial substrate engineering.
Therefore, the most common solution today for all-silicon modulators is to use
the carrier dispersion effect [86]: the refractive index (both the real and imaginary
part) of silicon depends on the concentration of electrons and holes in the material
[97]. Injection into or extraction of carriers out of a waveguide core will change
its effective index, and therefore its optical length. This results in a phase modula-
tion at the output. To manipulate the carrier density, one can use injection, deple-
tion or accumulation mechanisms, as shown in Fig. 2.17. The strongest effect is
carrier injection into the intrinsic region of a pin diode, located in the center of
the waveguide core to maximize the overlap with the optical mode. Applying a
forward bias on the diode forces majority carriers from the p and n regions into
Fig. 2.17 Silicon modulator geometries. (a) Forward-biased pin diode, (b) reverse-biased pn
diode, (c) vertical pn diode and (d) vertical siliconoxidesilicon capacitor
the core [49, 122]. As this involves a lot of carriers, the effect is quite strong.
However, it is limited in speed by the recombination time of the carriers in the
core. To obtain modulation speeds well in excess of 1 Gbps, special driving
schemes using pre-emphasis are required.
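The strength of the injection effect can be estimated with the well-known empirical carrier-dispersion fits of Soref and Bennett at 1.55 µm; the confinement factor and carrier densities below are illustrative assumptions, not values from the text:

```python
import math

# Empirical Soref-Bennett fits at 1.55 um: refractive index change vs. injected
# electron (dN) and hole (dP) densities in cm^-3. The confinement factor gamma
# and the densities used below are illustrative assumptions.

def delta_n(dN_cm3, dP_cm3):
    return -(8.8e-22 * dN_cm3 + 8.5e-18 * dP_cm3 ** 0.8)

def phase_shift_rad(dN_cm3, dP_cm3, length_um, gamma=0.8, wavelength_um=1.55):
    return 2 * math.pi * gamma * abs(delta_n(dN_cm3, dP_cm3)) * length_um / wavelength_um

# Injecting ~5e17 cm^-3 of both carrier types reaches a pi phase shift
# in well under a millimeter of waveguide:
L_um = 1.0
while phase_shift_rad(5e17, 5e17, L_um) < math.pi:
    L_um += 1.0
print(f"pi phase shift after ~{L_um:.0f} um")
```

This kind of estimate shows why injection devices can be so short, while the same carriers also add absorption, as discussed for the depletion case above.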
A faster alternative is based on the depletion of a pn diode in the core. Reverse-
biasing such a diode will expand or shrink the depletion region in the junction.
Because the number of carriers involved is much smaller than with the injection
scheme, the effect is much weaker. However, it is not limited by the carrier recom-
bination time, only by the mobility and the capacitance formed by the depletion
region [45, 69, 71]. The effect can be enhanced by using complex junction geometries,
or multiple junctions inside the waveguide core, creating a larger overlap
with the optical mode [78, 79]. However, as the modulation efficiency is directly
linked to the amount of carriers that are moved around, a high modulation efficiency
is typically combined with rather high absorption losses. Reverse-biased pn diode
configurations have been demonstrated with a VπLπ of about 1 V·cm.
The main effect of the carrier manipulation is a change in refractive index, although
a change in absorption is also induced. To make an amplitude modulator
out of the resulting phase modulator, the junction or capacitor must be incorporated
in an interferometer or (ring) resonator. In a Mach-Zehnder interferometer,
one can put a modulator in both arms and operate the device in push-pull: this
essentially halves the device length or operating voltage. Injection modulators with
very high modulation efficiency have been demonstrated with a length of only
150 µm [49], small enough to be driven as a lumped electrical load. Carrier depletion
modulators, on the other hand, require lengths of millimeters to get a decent
modulation depth at CMOS operating voltages. As the effects could support modulation
at 40 GHz or beyond, special care has to be taken with the electrical drivers
to avoid unwanted RF effects over the length of the modulator. The simplest
approach is to drive the diode from a coplanar microwave waveguide which runs
parallel to the optical waveguide: the electrical wave will copropagate with the
optical mode, and with careful design the propagation velocities can be matched.
The drawback of this approach is that the microwave waveguide needs to be termi-
nated, which dissipates a lot of power.
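The length-voltage trade-off implied by a VπLπ of about 1 V·cm, and the factor of two gained by push-pull drive, amount to a one-line calculation (the drive voltage below is an illustrative assumption):

```python
# Trade-off for a depletion phase shifter characterized by V_pi * L_pi:
# the arm length needed for a pi phase shift at a given drive voltage.
# Push-pull drive of both MZI arms halves the required length (or voltage).

def arm_length_mm(v_pi_l_pi_v_cm, drive_v, push_pull=False):
    length_cm = v_pi_l_pi_v_cm / drive_v
    if push_pull:
        length_cm /= 2.0
    return length_cm * 10.0   # cm -> mm

print(arm_length_mm(1.0, 1.0))        # 10.0 mm at a 1 V single-arm drive
print(arm_length_mm(1.0, 1.0, True))  # 5.0 mm in push-pull
```

Millimeter-scale arms at CMOS voltages are exactly why traveling-wave electrodes, and their lossy terminations, become necessary.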
The alternative is to use a resonator-based structure in combination with a phase
modulator. The most common and practical resonator geometry for this purpose is
a ring resonator [11]: the modulator diode is curved into a compact ring [122, 124]
or disk [119], and on resonance light will circulate in this ring thousands of times.
The rings can be as small as 10 µm in diameter, which means that they can be electrically
actuated as lumped elements. This obviates the need for coplanar electrodes
and significantly reduces the power dissipation. Making use of a resonator intro-
duces some drawbacks: the main one is that the modulator resonance should be
spectrally aligned with the operating wavelength. This imposes stringent fabrication
requirements and the requirement of some tuning mechanism to compensate for
operating conditions. The modulator could be tuned by applying a bias to the modu-
lation voltage, but as the modulation effects are typically quite small, in most cases
the tuning range will be too small. So an additional tuning mechanism, such as a
heater, is required.
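A toy model shows both why a small carrier-induced resonance shift suffices and why spectral alignment is so critical (the Lorentzian line shape, Q value, and extinction below are assumed for illustration, not a specific device from the text):

```python
import math

# Toy all-pass ring near critical coupling: a roughly Lorentzian transmission
# dip of full width lambda/Q. Shifting the resonance by a fraction of a
# linewidth toggles the transmission at a fixed laser wavelength.
# Q, extinction, and the shift below are illustrative assumptions.

def ring_T(detune_nm, fwhm_nm, extinction=0.99):
    lorentz = 1.0 / (1.0 + (2.0 * detune_nm / fwhm_nm) ** 2)
    return 1.0 - extinction * lorentz

Q = 10_000
fwhm = 1550.0 / Q                # ~0.155 nm linewidth
on = ring_T(0.0, fwhm)           # laser on resonance: strongly attenuated
off = ring_T(0.5 * fwhm, fwhm)   # resonance shifted by half a linewidth
print(f"extinction: {10 * math.log10(off / on):.1f} dB")
```

A half-linewidth shift here is well under 100 pm, which is why modest carrier densities suffice, but the same sensitivity means fabrication offsets and temperature drift of similar magnitude detune the modulator completely, motivating the extra tuning mechanism.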
Carrier-Based Switches
Instead of using heaters or carriers for modulation, they can also be used for switching.
While the mechanism is the same, the operational requirements for switching are
different. Response times are of the order of µs or ms, and power efficiency is important,
as switches operate as passive devices, and like with WDM components, all
dissipation will add to the link power budget. In this respect, thermal switches seem
the simplest solution, and thermal switches have been demonstrated [27, 47, 117].
Alternatively, one can use carrier injection. This effect is still quite strong, and one of
the main drawbacks can now be turned into an advantage: as a modulator, the carrier
injection device is limited by the carrier lifetime in the intrinsic region of the junc-
tion. However, if the structure can be engineered to increase this lifetime, a switch
can maintain its state for longer without additional power consumption [108]. This is
an effect which also applies to charge accumulation devices, where the switching
action is controlled by charging or draining a capacitor.
As already mentioned, silicon is not necessarily the best material for electro-optic
modulation, given its lack of intrinsic first-order electro-optic effects. Therefore, an
efficient way could be to integrate the silicon with other optical materials or structures
which allow efficient modulation. One possibility is the integration of III–V
semiconductors, in a similar way as the light sources. Alternatively, electro-optic
materials can be directly integrated with the silicon.
Silicon/III–V Modulators
III–V semiconductors are well known for their good electro-optic properties, making
them an interesting candidate to realize high-performance modulators on a silicon
photonic platform. Similar to silicon, typically carrier-depletion-type modulators [25]
and Stark-effect electro-absorption modulators [62] are used. Also, III–V microdisk
modulator structures relying on the change in Q-factor by bleaching of the quantum
well absorption through current injection were demonstrated [76]. The first two
approaches were implemented in a hybrid waveguide approach, in which the optical
mode is partially confined to the silicon and partially to the IIIV waveguide, similar
to the hybrid IIIV/silicon laser platform. Realized electro-absorption modulators
show 5 dB extinction ratio at 10 Gbit/s with a sub-volt drive and 30 nm optical band-
width. MachZehnder type modulators, based on a carrier-depletion approach, show
a modulation efficiency of 1.5 V mm and over 100 nm optical bandwidth. In these
60 W. Bogaerts et al.
cases the modulation bandwidth was RC-limited; with proper traveling-wave electrode designs and terminations, much higher speeds can be envisioned. In the
microdisk approach, evanescent coupling between a silicon waveguide layer and a III–V microdisk mode is used. The microdisk supports several resonances, the
Q-factor of which, and hence the transmission characteristic of the disk, can be
altered by current injection in the quantum well active region, which bleaches the
absorption.
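The quoted Mach–Zehnder efficiency can be put in perspective: the VπL figure of merit directly sets the drive voltage for a given phase-shifter length. A minimal sketch (the 1.5 V·mm value is the one quoted above; the phase-shifter lengths are illustrative assumptions):

```python
# Drive-voltage estimate from the V_pi*L figure of merit of a
# Mach-Zehnder modulator (illustrative sketch, not a device model).

def v_pi(v_pi_l: float, length_mm: float) -> float:
    """Voltage needed for a pi phase shift, given V_pi*L in V*mm."""
    return v_pi_l / length_mm

# With V_pi*L = 1.5 V*mm (the efficiency quoted in the text):
print(v_pi(1.5, 1.0))  # 1.5 V for a 1 mm phase shifter
print(v_pi(1.5, 0.5))  # 3.0 V if the shifter is halved to 0.5 mm
```

The trade-off this exposes is the usual one: a shorter device saves chip area and capacitance but demands a larger voltage swing from the driver.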
Photodetectors
Introduction
At the end of the optical link, the optical signals need to be converted back to the electrical domain. This has to be done at high speed and with as little signal degradation due to noise as possible (high sensitivity). Integrated photodetectors, which convert the incident optical power into a photocurrent, connected to an integrated transimpedance amplifier, enable the optical-electrical conversion on the photonic interconnection layer. For an intra-chip optical interconnect application, the photodetector
Fig. 2.18 Hybrid silicon modulator concepts. (a) Vertical sandwich waveguide, (b) slot-based silicon hybrid modulator
should satisfy several requirements. The speed of the detector, its responsivity, and its dark current are important performance metrics; the device footprint and the available thermal budget for incorporating the photodetectors in the electronic/photonic integrated circuit matter as well. Several material systems can be considered to realize the photodetectors. While crystalline silicon is transparent at near-infrared wavelengths (λ > 1.1 µm), silicon photodetectors can still be used in an on-chip interconnect context, for example for optical clock distribution through free space at 850 nm wavelengths, or by inducing defects in the silicon (through ion implantation) which render the material absorbing at near-infrared communication wavelengths. The use of silicon-germanium or III–V semiconductors is, however, a more straightforward route to high-performance integrated photodetectors on a silicon waveguide circuit. In the following subsections we give a brief overview of the state of the art in the integration of photodetectors on a silicon waveguide platform.
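The optical-to-electrical conversion described above is commonly characterized by the responsivity R = I_photo / P_opt. A small sketch with assumed numbers (a 0.8 A/W detector receiving −10 dBm is an illustration, not a figure from this chapter):

```python
# Photocurrent delivered to the transimpedance amplifier, from the
# detector responsivity (A/W) and the received optical power (dBm).
# The numbers used below are assumed for illustration.

def photocurrent_ua(responsivity_a_w: float, power_dbm: float) -> float:
    """Photocurrent in microamps: I = R * P, with P converted from dBm."""
    power_mw = 10.0 ** (power_dbm / 10.0)     # dBm -> mW
    return responsivity_a_w * power_mw * 1e3  # (A/W * mW = mA) -> uA

# A 0.8 A/W detector receiving -10 dBm (0.1 mW) yields about 80 uA,
# which the transimpedance amplifier then converts into a voltage swing.
print(round(photocurrent_ua(0.8, -10.0), 1))  # 80.0
```

This also makes concrete why dark current matters: it adds directly to this photocurrent as a signal-independent offset and noise source.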
Photodetector Geometry
Basically, two types of photodetector structures are used, differing in the way an electrical field is applied in the absorbing region. In one approach, a reverse-biased p-i-n structure is used to extract the generated electron–hole pairs from the absorbing region.
Fig. 2.20 Integrated photodetector geometries: coupling from optical waveguide to integrated
photodetector
Silicon Photodetectors
III–V Photodetectors
High-frequency (> 1 GHz) infrared detectors, sensitive between 1,100 and 1,700 nm, are usually fabricated in InGaAs. Although this semiconductor has the advantage of bandgap tailoring, additional technology is required to integrate these III–V semiconductor materials on the electronic/photonic integrated circuit. A straightforward (and the most rugged) approach is flip-chip integration; this approach, however, limits the integration density. With discrete devices, receiver sensitivity is limited by the capacitance of the bulky detector. Thanks to the much smaller capacitance of waveguide detectors, the receiver electronics can be redesigned for much higher performance, implying a performance improvement when detectors can be integrated.
To integrate III–V semiconductors on the silicon waveguide platform, a die-to-wafer bonding procedure can be used, as discussed in section "Light Sources", to transfer the III–V epitaxial layer stack onto the silicon waveguide circuit. This approach has the advantage that a dense integration of photodetectors can be achieved and that all alignment is done by lithographic techniques. An alternative approach is to hetero-epitaxially grow III–V compounds on the silicon waveguide circuit. The large lattice mismatch between silicon and InP-based semiconductors, however, makes it difficult to form high-quality III–V semiconductor layers on silicon, although a
lot of progress has been made in this field in recent years. Layer quality has a direct influence on photodetector dark current, responsivity, and maximum operation speed.
Most of the research so far has been geared towards demonstrating high-performance III–V semiconductor photodetectors on silicon, without, however, addressing issues such as the compatibility of the metallization with CMOS integration: typically, Au-based electrodes are used for these devices. Surface illumination on a silicon waveguide platform can be accomplished by using a diffraction grating to deflect the light from the silicon waveguide to the III–V layer stack. This approach has the advantage that the photodetector does not have to be closely integrated with the silicon waveguide layer; it can easily be placed a few micrometers away from it. Proof-of-principle devices based on this concept were realized in [87] on a 10 × 10 µm² footprint, however showing a limited responsivity due to the sub-optimal epitaxial layer structure that was used.
Both p-i-n-type [9] and MSM-type [21] waveguide photodetectors were realized on a silicon-on-insulator waveguide platform. Responsivities in the range of 0.5–1 A/W were obtained this way, in devices of about 50 µm² in size. In the case of the metal–semiconductor–metal photodetector, the device speed is determined by the spacing between the electrodes and the applied bias. Using a conservative 1 µm spacing between the electrodes and a 5 V reverse bias, simulations predict a bandwidth over 35 GHz. For the p-i-n structure, a bandwidth of 33 GHz was experimentally obtained.
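The scaling of MSM detector speed with electrode spacing can be illustrated with a first-order transit-time estimate. This is a rule-of-thumb sketch, not the simulation from [21]: it assumes carriers drift at saturation velocity (the value used is an assumed order of magnitude) and uses the common approximation f ≈ 0.45 / t_transit.

```python
# First-order transit-time bandwidth estimate for an MSM photodetector.
# Assumes carriers drift at saturation velocity across the electrode gap;
# f_3dB ~ 0.45 / t_transit is a common rule of thumb for photodiodes.

def transit_bandwidth_ghz(spacing_um: float, v_sat_cm_s: float = 1e7) -> float:
    """Transit-time-limited 3 dB bandwidth in GHz."""
    t_transit = (spacing_um * 1e-6) / (v_sat_cm_s * 1e-2)  # gap / velocity, in s
    return 0.45 / t_transit / 1e9

# A 1 um gap with v_sat on the order of 1e7 cm/s gives a 10 ps transit
# time and roughly 45 GHz -- the same order as the >35 GHz simulated above.
print(round(transit_bandwidth_ghz(1.0), 1))  # 45.0
```

Halving the electrode spacing doubles this estimate, which is why the spacing (together with the bias needed to keep the field high enough for velocity saturation) sets the device speed.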
Germanium Photodetectors
When all building blocks of a photonic interconnect link or network are in place, they need to be integrated with electronics. Here we should distinguish between the actual functional blocks that need to be interconnected (e.g. processor cores or large blocks of memory) and the additional electronics that supports the optical link itself (laser drivers, tuning current sources, monitor readouts, amplifiers, …). In essence, the latter is part of the photonic circuitry rather than of the electronics proper. Given the modulation speeds of the optical links, it is essential that the driver electronics be as close as possible to the actual photonic components.
In the integration of photonics and electronics there are many trade-offs that
need to be considered. Operation speed is definitely one of them, but also power
consumption, heat dissipation strategies, chip real estate, yield, and finally bring-
ing the electronics and photonics together. The most commonly considered inte-
gration scenarios are illustrated in Fig. 2.21: Integration of the photonics layer
directly with the transistors, integrating the photonics in or on top of the metal
interconnect layers, or fabricating the photonics layer separately and using a 3-D
integration strategy to bring both layers together. In this section we will compare
the merits of those options.
When integrating photonics and electronics, one is always faced with the same questions: what is the impact of one technology on the other, and which compromises does it force? And of course there is the problem of compound yield: the overall yield of the integrated photonic–electronic circuit is the product of the electronics yield, the photonics yield, and the yield of the integration process. If one of the steps has a low yield, the compound yield might make the approach unviable, unless one can incorporate a selection step with intermediate testing.
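The compound-yield argument is simple multiplication; a sketch with hypothetical yield numbers (not data from any real process):

```python
# Compound yield of an integrated photonic-electronic circuit: the
# product of the yields of all independent contributions
# (electronics, photonics, integration). Numbers are hypothetical.

def compound_yield(*yields: float) -> float:
    """Overall yield as the product of independent partial yields."""
    result = 1.0
    for y in yields:
        result *= y
    return result

# 90% electronics yield, 80% photonics yield and 95% integration yield
# already drop the overall yield below 70%:
print(round(compound_yield(0.90, 0.80, 0.95), 3))  # 0.684
```

This is exactly why intermediate testing pays off: screening known-good dies before integration removes a low-yield factor from the product.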
Front-End-of-Line
Fig. 2.21 Integration strategies for photonics and electronics. A photonic circuit can be integrated in the front-end-of-line (FEOL), in the back-end-of-line (BEOL), or using 3-D stacking
circuitry as well as elementary logic. Even at a 130 nm node, the electronics on the chip consumes much less real estate than the photonics. Given that transistors and photonics compete for the same real estate, this approach only makes sense in situations where the primary function of the chip is photonic, and not where the photonics supports the electronics. The current products of Luxtera are therefore focused on active optical cables, not on-chip interconnects [36].
Given that chip real estate is extremely precious, front-end photonic/electronic integration for on-chip interconnects only makes sense if the photonics does not encroach on the transistor space of the logic it is serving. Still, the argument for keeping the driver electronics close to the photonics holds. One solution is to use an SOI process for co-integrating the driver electronics and the photonics, and then use a 3-D integration technology to connect this photonic/electronic link layer to the actual logic. Such techniques are discussed further below.
While front-end integration does not seem to make sense for on-chip optical
interconnects, for longer interconnects it can still be the most attractive proposition.
However, only a few electronics manufacturers run their transistors in an SOI pro-
cess; the majority makes CMOS on bulk silicon. Because a CMOS manufacturer is
very unlikely to modify their processes to such an extent as to accommodate SOI
processes (which would need redevelopment or at least recalibrating their entire
front-end process), several groups in the world have explored the possibilities of
building a photonic substrate in a bulk CMOS process.
The main obstacle for making a waveguide in bulk silicon is the lack of a buffer
layer which optically insulates the waveguide from the high-index substrate. One
solution is to build a hybrid substrate with local SOI regions where the photonic
waveguides will be. This can be done in two ways: starting from a bulk Si wafer, or
starting from an SOI wafer. When using an SOI wafer, one can etch away the top
silicon and the buried oxide in the regions which will accommodate electronics.
Subsequently, a selective silicon epitaxial regrowth can be done to create a bulk sili-
con substrate at the same level as the waveguide layer. To finish, a chemical/mechan-
ical polishing (CMP) step is required. Alternatively, one could start with a bulk
silicon substrate, where a deep trench is etched in the waveguide regions. Using an
oxide deposition and CMP, a planar substrate with local areas of buried oxide is
created. Subsequently, the core layer of silicon can be deposited. This can be amor-
phous silicon, which can be recrystallized using solid-phase epitaxy, seeding off the
bulk silicon substrate [95]. Finally, it is even possible to create a local waveguide
layer by undercutting the bulk silicon substrate [83]. This results in waveguides
formed in the polysilicon gate layer which do have a higher propagation loss than
high-quality single-crystal waveguides.
Both approaches allow the integration of SOI waveguides in a bulk CMOS pro-
cess. However, this does not solve all issues. CMOS processing typically relies on a
very uniform distribution of features to achieve reliable processing over the chip
and the wafer (especially for dry etching and CMP). The same also holds for photo-
nics, and the density and length-scale of the features do not necessarily match.
Therefore, careful consideration is needed when combining both types of features
onto the same substrate in the same process layer, respecting the proper spacing and
inclusion of dummy structures to guarantee the correct densities.
Also, while one could devise a process flow where many steps are shared between
the transistors and the photonics, there will be a need for additional steps, which
could have an impact on chip/wafer yield. The compound yield of the process flow
can drop dramatically with the number of steps.
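The sensitivity of compound yield to the number of steps can be made concrete with a back-of-the-envelope calculation (the per-step yields and step counts below are illustrative, not process data):

```python
# Yield of a process flow of n nominally identical, independent steps:
# even a very high per-step yield erodes quickly with the step count.
# The numbers used are illustrative, not actual process data.

def flow_yield(per_step_yield: float, n_steps: int) -> float:
    """Overall flow yield assuming independent steps."""
    return per_step_yield ** n_steps

print(round(flow_yield(0.999, 100), 3))  # 0.905 -- 100 steps at 99.9%
print(round(flow_yield(0.999, 400), 3))  # 0.670 -- 400 steps at 99.9%
```

So quadrupling the number of steps does far more damage than the linear intuition suggests, which is the core of the argument against long shared flows.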
Back-End-of-Line
Electronic chips already have multiple metal interconnect layers. An optical interconnect layer embedded in or deposited on top of these metal layers would definitely make sense from a separation-of-concerns point of view. However, this conflicts somewhat with the technologies required for silicon photonics: in particular, the high-quality single-crystal silicon layer needed for waveguides and modulators is impossible to incorporate monolithically. There is no epitaxial substrate in the back-end-of-line interconnect layers, and the temperature budget does not allow silicon epitaxy: in BEOL processes, the process temperature is limited to ca. 450 °C. As we have discussed, amorphous silicon is a possibility, but with the penalty of higher optical losses and the difficulty of making good junctions for carrier-based modulators. Other optical materials will not allow the same integration density as silicon photonics and might only be suitable for truly global interconnects.
3D Integration
To overcome this problem, the photonics layer can also be integrated on top of the electronics using 3-D integration techniques. This would allow both layers to be fabricated separately (in their own optimized process flows, or even in different fabs). This means fewer compromises are needed in both the electronics and the photonics, and there is no real competition for real estate. It also becomes possible to make the photonics layer in one technology and still remain compatible with various CMOS technology nodes: the photonics does not need to scale down as aggressively as advanced CMOS.
3-D integration can be accomplished in different ways, depending on the appli-
cation. The photonics can be stacked on the electronics or the other way around,
depending on the die size. On-chip interconnects will likely require a similar die
size for photonics and electronics, but for applications in sensing or spectroscopy,
or even off-chip datacomm, the photonics die could be larger than the electronics
die. In general, the smaller die will be stacked on the larger die.
3-D integration technology generally relies on through-silicon vias (TSVs) to connect the metal layers of both chips. Here we can distinguish between processes where the TSVs are fabricated before or after the stacking. In via-first processes, the TSVs can be fabricated in the photonics wafer, which then requires no modifications to the electronics process. The photonics die would then sit on top of the electronics die, facing upwards (needed for applications where access to the
Fig. 2.22 3-D integration approaches. (a) Photonics face-down. The photonics wafer is bonded
upside-down on the electronics wafer and metal TSV connections are processed after bonding and
substrate removal. (b) Photonics face-up. TSVs are processed in the photonics wafer and stick out
after substrate thinning. No wafer-scale processing is needed after stacking
waveguides is essential). Such TSVs are typically large, with the diameter and pitch
proportional to the thickness of the substrate. Large TSVs will then introduce para-
sitic resistance and capacitance [8, 54, 60], which can be a dominant speed bump for
high-speed interconnect.
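The impact of such TSV parasitics can be gauged with a first-order RC estimate. The resistance and capacitance values below are assumptions for illustration, not figures from [8, 54, 60]:

```python
import math

# First-order bandwidth of a node loaded by TSV parasitics:
# f_3dB = 1 / (2*pi*R*C). The R and C values used below are
# illustrative assumptions, not measured TSV data.

def rc_bandwidth_ghz(r_ohm: float, c_farad: float) -> float:
    """3 dB bandwidth in GHz of a simple RC-limited node."""
    return 1.0 / (2.0 * math.pi * r_ohm * c_farad) / 1e9

# A driver with ~1 kOhm output resistance loaded by a large 100 fF TSV
# is limited to ~1.6 GHz; a 10 fF microbump-scale via would allow ~16 GHz.
print(round(rc_bandwidth_ghz(1e3, 100e-15), 2))  # 1.59
print(round(rc_bandwidth_ghz(1e3, 10e-15), 1))   # 15.9
```

Since bandwidth scales inversely with the capacitance, shrinking the via is the most direct way to keep the electrical hop from dominating the link speed.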
When the vias are processed post-bonding, this problem can be largely overcome. Silicon photonics uses an SOI wafer, so the buried oxide can be used as a very selective stopping layer for substrate removal: an SOI photonics wafer can be bonded upside-down on a CMOS wafer and the entire silicon substrate, and even the buried oxide, can be removed [41, 61]. Afterwards, deep-etched vias can connect to the underlying CMOS metallization layers, and additional metal interconnects can even be processed on top. The layers here can be so thin that the parasitics of the large TSVs are avoided (Fig. 2.22).
Backside Integration
To decouple the photonics and the electronics process but still process everything on a single high-quality substrate, one can make use of the back side of the wafer as well as the front side: e.g., the silicon photonics layer could be processed on the front side of an SOI wafer, and bulk electronics on the back side. The high-temperature steps for both sides could be executed first, after which the metal interconnects and the TSVs are defined. Depending on the TSV technology, wafer thinning and bonding to a handling wafer may be required; using an unthinned wafer requires relatively large TSVs [41].
Competition for chip area is less than with FEOL integration, but as with some 3-D integration approaches, the TSVs need to pass through the transistor layer [41, 61]. And even though this approach makes optimal use of high-quality substrates, two-sided processing introduces problems of wafer handling, packaging, and testing.
Flip-Chip Integration
Summary
In this chapter we took a closer look at the different components that are required for on-chip optical interconnects, and WDM links in particular. The technology we focused on was silicon photonics, as it is the most obvious candidate to realize on-chip optical links: its materials and processes are the closest to true CMOS compatibility, and the high refractive index contrast makes it possible to scale down the photonic building blocks to a footprint which allows thousands of components on a single chip.
While most of the technology is already there, many issues still need to be solved when using silicon photonics for on-chip links. The big question is that of the light source: while we discussed the various options, there is as yet no clear-cut winner, and all options have their advantages and disadvantages.
The second main challenge with silicon photonic links is the thermal manage-
ment. Especially in a WDM setting, where spectral filters are required, silicon
photonics is extremely temperature-sensitive, and it is not inconceivable that a
significant portion of the power budget of the optical link will be needed for ther-
mal feedback and control.
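This temperature sensitivity can be quantified with the thermo-optic figures commonly quoted for silicon (dn/dT ≈ 1.86e-4 per K, group index ≈ 4.3 near 1,550 nm — typical literature values, not numbers from this chapter):

```python
# Thermal drift of a silicon ring/filter resonance:
# dlambda/dT = lambda * (dn/dT) / n_group.
# The material parameters are typical literature values (assumptions).

def resonance_shift_pm_per_k(wavelength_nm: float = 1550.0,
                             dn_dt: float = 1.86e-4,
                             n_group: float = 4.3) -> float:
    """Resonance wavelength shift in picometers per kelvin."""
    return wavelength_nm * 1e3 * dn_dt / n_group  # nm -> pm

# Roughly 67 pm/K: a 10 K excursion moves a resonance by ~0.7 nm,
# on the order of a dense WDM channel spacing, which is why active
# thermal tuning and feedback are usually considered necessary.
print(round(resonance_shift_pm_per_k()))  # 67
```

Keeping dozens of such filters on-channel is what can make the thermal control loop a significant entry in the link power budget.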
The world of silicon photonics is moving extremely rapidly, and new breakthrough developments are being reported every year. This chapter is therefore only intended as a snapshot, and for that reason we focused mostly on explaining the principles, rather than giving a complete report of the latest and greatest results. Given this fast technological progress, and the strong need for higher bandwidth, we are convinced that on-chip optical links will become a reality later in this decade.
References
1. Agarwal AM, Liao L, Foresi JS, Black MR, Duan X, Kimerling LC (1996) Low-loss polycrystalline silicon waveguides for silicon photonics. J Appl Phys 80(11):6120–6123
2. Alloatti L, Korn D, Hillerkuss D, Vallaitis T, Li J, Bonk R, Palmer R, Schellinger T, Koos C, Freude W, Leuthold J, Fournier M, Fedeli J, Barklund A, Dinu R, Wieland J, Bogaerts W, Dumon P, Baets R (2010) Silicon high-speed electro-optic modulator. In: 2010 7th IEEE international conference on group IV photonics (GFP), Beijing, China, pp 195–197
3. Almeida VR, Panepucci RR (2007) NOEMS devices based on slot-waveguides. In: Conference on lasers and electro-optics/quantum electronics and laser science conference and photonic applications systems technologies, Washington, DC, USA, p JThD104
4. Anderson PA, Schmidt BS, Lipson M (2006) High confinement in silicon slot waveguides with sharp bends. Opt Express 14(20):9197–9202
5. Assefa S, Xia F, Vlasov YA (2010) Reinventing germanium avalanche photodetector for nanophotonic on-chip optical interconnects. Nature 464(7285):U80–U91
6. Baehr-Jones T, Hochberg M, Wang GX, Lawson R, Liao Y, Sullivan PA, Dalton L, Jen AKY, Scherer A (2005) Optical modulation and detection in slotted silicon waveguides. Opt Express 13(14):5216–5226
7. Barwicz T, Watts MR, Popovic MA, Rakich PT, Socci L, Kartner FX, Ippen EP, Smith HI (2007) Polarization-transparent microphotonic devices in the strong confinement limit. Nat Photon 1:57–60
8. Bermond C, Cadix L, Farcy A, Lacrevaz T, Leduc P, Flechet B (2009) High frequency characterization and modeling of high density TSV in 3D integrated circuits. In: 2009 IEEE workshop on signal propagation on interconnects (SPI '09), Strasbourg, France, pp 1–4
9. Binetti PRA, Leijtens XJM, de Vries T, Oei YS, Di Cioccio L, Fedeli J-M, Lagahe C, Van Campenhout J, Van Thourhout D, van Veldhoven PJ, Notzel R, Smit MK (2009) InP/InGaAs photodetector on SOI circuitry. In: 2009 6th IEEE international conference on group IV photonics (GFP), San Francisco, USA, pp 214–216
10. Bogaerts W, Baets R, Dumon P, Wiaux V, Beckx S, Taillaert D, Luyssaert B, Van Campenhout J, Bienstman P, Van Thourhout D (2005) Nanophotonic waveguides in silicon-on-insulator fabricated with CMOS technology. J Lightwave Technol 23(1):401–412
11. Bogaerts W, De Heyn P, Van Vaerenbergh T, De Vos K, Kumar Selvaraja S, Claes T, Dumon P, Bienstman P, Van Thourhout D, Baets R (2012) Silicon microring resonators. Laser Photon Rev 6:47–73. doi:10.1002/lpor.201100017
12. Bogaerts W, Dumon P, Van Thourhout D, Taillaert D, Jaenen P, Wouters J, Beckx S, Wiaux V, Baets R (2006) Compact wavelength-selective functions in silicon-on-insulator photonic wires. J Sel Top Quantum Electron 12(6):1394–1401
13. Bogaerts W, Selvaraja SK (2011) Compact single-mode silicon hybrid rib/strip waveguide with adiabatic bends. IEEE Photon J 3(3):422–432
14. Bogaerts W, Selvaraja SK, Dumon P, Brouckaert J, De Vos K, Van Thourhout D, Baets R (2010) Silicon-on-insulator spectral filters fabricated with CMOS technology. J Sel Top Quantum Electron 16(1):33–44
15. Bogaerts W, Wiaux V, Taillaert D, Beckx S, Luyssaert B, Bienstman P, Baets R (2002) Fabrication of photonic crystals in silicon-on-insulator using 248-nm deep UV lithography. IEEE J Sel Top Quantum Electron 8(4):928–934
16. Boyraz O, Jalali B (2004) Demonstration of a silicon Raman laser. Opt Express 12(21):5269–5273
17. Boyraz O, Jalali B (2005) Demonstration of directly modulated silicon Raman laser. Opt Express 13(3):796–800
18. Bravo-Abad J, Ippen EP, Soljacic M (2009) Ultrafast photodetection in an all-silicon chip enabled by two-photon absorption. Appl Phys Lett 94:241103
19. Brouckaert J, Bogaerts W, Dumon P, Van Thourhout D, Baets R (2007) Planar concave grating demultiplexer fabricated on a nanophotonic silicon-on-insulator platform. J Lightwave Technol 25(5):1269–1275
20. Brouckaert J, Bogaerts W, Selvaraja S, Dumon P, Baets R, Van Thourhout D (2008) Planar concave grating demultiplexer with high reflective Bragg reflector facets. IEEE Photon Technol Lett 20(4):309–311
21. Brouckaert J, Roelkens G, Van Thourhout D, Baets R (2007) Compact InAlAs/InGaAs metal–semiconductor–metal photodetectors integrated on silicon-on-insulator waveguides. IEEE Photon Technol Lett 19(19):1484–1486
22. Bulgan E, Kanamori Y, Hane K (2008) Submicron silicon waveguide optical switch driven by microelectromechanical actuator. Appl Phys Lett 92(10):101110
23. Casalino M, Coppola G, Iodice M, Rendina I, Sirleto L (2010) Near-infrared sub-bandgap all-silicon photodetectors: state of the art and perspectives. Sensors 10:10571–10600
24. ChaiChuay C, Yupapin PP, Saeung P (2009) The serially coupled multiple ring resonator filters and Vernier effect. Opt Appl XXXIX(1):175–194
25. Chen H-W, Kuo Y-H, Bowers JE (2008) High speed hybrid silicon evanescent Mach–Zehnder modulator and switch. Opt Express 16:20571–20576
26. Chen L, Lipson M (2009) Ultra-low capacitance and high speed germanium photodetectors on silicon. Opt Express 17(10):7901–7906
27. Chu T, Yamada H, Ishida S, Arakawa Y (2005) Compact 1 × N thermo-optic switches based on silicon photonic wire waveguides. Opt Express 13(25):10109–10114
28. Cunningham JE, Shubin I, Zheng X, Pinguet T, Mekis A, Luo Y, Thacker H, Li G, Yao J, Raj K, Krishnamoorthy AV (2010) Highly-efficient thermally-tuned resonant optical filters. Opt Express 18(18):19055–19063
29. Dai D, Yang L, He S (2008) Ultrasmall thermally tunable microring resonator with a submicrometer heater on Si nanowires. J Lightwave Technol 26(58):704–709
30. Debaes C, Agarwal D, Bhatnagar A, Thienpont H, Miller DAB (2002) High-impedance high-frequency silicon detector response for precise receiverless optical clock injection. In: SPIE Photonics West 2002, San Jose, CA, USA. Proc SPIE 4654:78–88
31. Ding R, Baehr-Jones T, Liu Y, Bojko R, Witzens J, Huang S, Luo J, Benight S, Sullivan P, Fedeli J-M, Fournier M, Dalton L, Jen A, Hochberg M (2010) Demonstration of a low VπL modulator with GHz bandwidth based on electro-optic polymer-clad silicon slot waveguides. Opt Express 18(15):15618–15623
32. Dragone C (1991) An N×N optical multiplexer using a planar arrangement of two star couplers. IEEE Photon Technol Lett 3(9):812–814
33. Dragone C (1998) Efficient techniques for widening the passband of a wavelength router. J Lightwave Technol 16(10):1895–1906
34. Dumon P, Bogaerts W, Van Thourhout D, Taillaert D, Baets R, Wouters J, Beckx S, Jaenen P (2006) Compact wavelength router based on a silicon-on-insulator arrayed waveguide grating pigtailed to a fiber array. Opt Express 14(2):664–669
35. Dumon P, Bogaerts W, Wiaux V, Wouters J, Beckx S, Van Campenhout J, Taillaert D, Luyssaert B, Bienstman P, Van Thourhout D, Baets R (2004) Low-loss SOI photonic wires and ring resonators fabricated with deep UV lithography. IEEE Photon Technol Lett 16(5):1328–1330
36. Duran P (2008) Blazar 40 Gbps optical active cable. Luxtera white paper, from www.luxtera.com
37. Espinola RL, Tsai M-C, Yardley JT, Osgood RM Jr (2003) Fast and low-power thermooptic switch on thin silicon-on-insulator. IEEE Photon Technol Lett 15(10):1366–1368
38. Fang AW, Koch BR, Gan K-G, Park H, Jones R, Cohen O, Paniccia MJ, Blumenthal DJ, Bowers JE (2008) A racetrack mode-locked silicon evanescent laser. Opt Express 16(2):1393–1398
39. Fang AW, Koch BR, Jones R, Lively E, Liang D, Kuo Y-H, Bowers JE (2008) A distributed Bragg reflector silicon evanescent laser. IEEE Photon Technol Lett 20(20):1667–1669
40. Fang AW, Park H, Cohen O, Jones R, Paniccia MJ, Bowers JE (2006) Electrically pumped hybrid AlGaInAs-silicon evanescent laser. Opt Express 14(20):9203–9210
41. Fedeli JM, Augendre E, Hartmann JM, Vivien L, Grosse P, Mazzocchi V, Bogaerts W, Van Thourhout D, Schrank F (2010) Photonics and electronics integration in the HELIOS project. In: 2010 7th IEEE international conference on group IV photonics (GFP), Beijing, China, pp 356–358
42. Foresi JS, Black MR, Agarwal AM, Kimerling LC (1996) Losses in polycrystalline silicon waveguides. Appl Phys Lett 68(15):2052–2054
43. Foster MA, Turner AC, Sharping JE, Schmidt BS, Lipson M, Gaeta AL (2006) Broad-band optical parametric gain on a silicon photonic chip. Nature 441(7096):960–963
44. Gan F, Barwicz T, Popovic MA, Dahlem MS, Holzwarth CW, Rakich PT, Smith HI, Ippen EP, Kartner FX (2007) Maximizing the thermo-optic tuning range of silicon photonic structures. In: 2007 photonics in switching, San Francisco, USA, pp 67–68
45. Gardes F, Reed G, Emerson N, Png C (2005) A sub-micron depletion-type photonic modulator in silicon on insulator. Opt Express 13(22):8845–8854
46. Geis MW, Spector SJ, Grein ME, Yoon JU, Lennon DM, Lyszczarz TM (2009) Silicon waveguide infrared photodiodes with over 35 GHz bandwidth and phototransistors with 50 A/W response. Opt Express 17(7):5193–5204
47. Geis MW, Spector SJ, Williamson RC, Lyszczarz TM (2004) Submicrosecond submilliwatt silicon-on-insulator thermooptic switch. IEEE Photon Technol Lett 16(11):2514–2516
48. Gnan M, Thoms S, Macintyre DS, De La Rue RM, Sorel M (2008) Fabrication of low-loss photonic wires in silicon-on-insulator using hydrogen silsesquioxane electron-beam resist. Electron Lett 44(2):115–116
49. Green WMJ, Rooks MJ, Sekaric L, Vlasov YuA (2007) Ultra-compact, low RF power, 10 Gb/s silicon Mach–Zehnder modulator. Opt Express 15(25):17106–17113
50. Gunn C (2006) CMOS photonics for high-speed interconnects. IEEE Micro 26(2):58–66
51. Han H-S, Seo S-Y, Shin JH, Park N (2002) Coefficient determination related to optical gain in erbium-doped silicon-rich silicon oxide waveguide amplifier. Appl Phys Lett 81(20):3720–3722
52. Harke A, Krause M, Mueller J (2005) Low-loss singlemode amorphous silicon waveguides. Electron Lett 41(25):1377–1379
53. Hattori HT, Seassal C, Touraille E, Rojo-Romeo P, Letartre X, Hollinger G, Viktorovitch P, Di Cioccio L, Zussy M, Melhaoui LE, Fedeli JM (2006) Heterogeneous integration of microdisk lasers on silicon strip waveguides for optical interconnects. IEEE Photon Technol Lett 18(1):223–225
54. Healy MB, Lim SK (2009) A study of stacking limit and scaling in 3D ICs: an interconnect perspective. In: 2009 59th electronic components and technology conference (ECTC 2009), San Diego, USA, pp 1213–1220
55. Heebner J, Grover R, Ibrahim T (2008) Optical microresonators: theory, fabrication and applications. Springer series in optical sciences, 1st edn. Springer, Berlin
56. Ikeda T, Takahashi K, Kanamori Y, Hane K (2010) Phase-shifter using submicron silicon waveguide couplers with ultra-small electro-mechanical actuator. Opt Express 18(7):7031–7037
57. Jacobsen RS, Andersen KN, Borel PI, Fage-Pedersen J, Frandsen LH, Hansen O, Kristensen M, Lavrinenko AV, Moulin G, Ou H, Peucheret C, Zsigri B, Bjarklev A (2006) Strained silicon as a new electro-optic material. Nature 441(7090):199–202
58. Kang Y, Liu H-D, Morse M, Paniccia MJ, Zadka M, Litski S, Sarid G, Pauchard A, Kuo Y-H, Chen H-W, Zaoui WS, Bowers JE, Beling A, McIntosh DC, Zheng X, Campbell JC (2009) Monolithic germanium/silicon avalanche photodiodes with 340 GHz gain-bandwidth product. Nat Photon 3(1):59–63
59. Kazmierczak A, Bogaerts W, Drouard E, Dortu F, Rojo-Romeo P, Gaffiot F, Van Thourhout D, Giannone D (2009) Highly integrated optical 4 × 4 crossbar in silicon-on-insulator technology. J Lightwave Technol 27(16):3317–3323
60. Kim DH, Mukhopadhyay S, Lim SK (2009) TSV-aware interconnect length and power prediction for 3D stacked ICs. In: 2009 IEEE international interconnect technology conference (IITC 2009), Sapporo, Japan, pp 26–28
61. Koester SJ, Young AM, Yu RR, Purushothaman S, Chen K-N, La Tulipe DC, Rana N, Shi L, Wordeman MR, Sprogis EJ (2008) Wafer-level 3D integration technology. IBM J Res Dev 52(6):583–597
62. Kuo Y-H, Chen Y-H, Bowers JE (2008) High speed hybrid silicon evanescent electroabsorption modulator. Opt Express 16:9936–9941
81. McNab SJ, Moll N, Vlasov YA (2003) Ultra-low loss photonic integrated circuit with membrane-type photonic crystal waveguides. Opt Express 11(22):2927–2939
82. Michel J, Liu J, Kimerling LC (2010) High-performance Ge-on-Si photodetectors. Nat Photon 4(8):527–534
83. Orcutt JS, Khilo A, Holzwarth CW, Popović MA, Li H, Sun J, Bonifield T, Hollingsworth R, Kärtner FX, Smith HI, Stojanović V, Ram RJ (2011) Nanophotonic integration in state-of-the-art CMOS foundries. Opt Express 19(3):2335–2346
84. Pavesi L, Dal Negro L, Mazzoleni C, Franzo G, Priolo F (2000) Optical gain in silicon nanocrystals. Nature 408(6811):440–444
85. Pinguet T, Analui B, Balmater E, Guckenberger D, Harrison M, Koumans R, Kucharski D, Liang Y, Masini G, Mekis A, Mirsaidi S, Narasimha A, Peterson M, Rines D, Sadagopan V, Sahni S, Sleboda TJ, Song D, Wang Y, Welch B, Witzens J, Yao J, Abdalla S, Gloeckner S, De Dobbelaere P (2008) Monolithically integrated high-speed CMOS photonic transceivers. In: 2008 5th IEEE international conference on group IV photonics, Sorrento, Italy, pp 362–364
86. Reed GT, Mashanovich G, Gardes FY, Thomson DJ (2010) Silicon optical modulators. Nat Photon 4(8):518–526
87. Roelkens G, Brouckaert J, Taillaert D, Dumon P, Bogaerts W, Van Thourhout D, Baets R (2005) Integration of InP/InGaAsP photodetectors onto silicon-on-insulator waveguide circuits. Opt Express 13(25):10102–10108
88. Roelkens G, Brouckaert J, Van Thourhout D, Baets R, Nötzel R, Smit M (2006) Adhesive bonding of InP/InGaAsP dies to processed silicon-on-insulator wafers using DVS-bis-benzocyclobutene. J Electrochem Soc 153(12):G1015–G1019
89. Roelkens G, Van Thourhout D, Baets R (2007) High efficiency grating couplers between silicon-on-insulator waveguides and perfectly vertical optical fibers. Opt Lett 32(11):1495–1497
90. Roelkens G, Van Thourhout D, Baets R, Nötzel R, Smit M (2006) Laser emission and photodetection in an InP/InGaAsP layer integrated on and coupled to a silicon-on-insulator waveguide circuit. Opt Express 14(18):8154–8159
91. Rong HS, Liu AS, Jones R, Cohen O, Hak D, Nicolaescu R, Fang A, Paniccia M (2005) An all-silicon Raman laser. Nature 433(7023):292–294
92. Schrauwen J, Scheerlinck S, Van Thourhout D, Baets R (2009) Polymer wedge for perfectly vertical light coupling to silicon. In: Broquin J-M, Greiner CM (eds) Integrated optics: devices, materials, and technologies, vol XIII. Proceedings of SPIE, vol 7218, SPIE, p 72180B
93. Selvaraja S, Sleeckx E, Schaekers M, Bogaerts W, Van Thourhout D, Dumon P, Baets R (2009) Low-loss amorphous silicon-on-insulator technology for photonic integrated circuitry. Opt Commun 282(9):1767–1770
94. Selvaraja SK, Bogaerts W, Dumon P, Van Thourhout D, Baets R (2010) Subnanometer linewidth uniformity in silicon nanophotonic waveguide devices using CMOS fabrication technology. IEEE J Sel Top Quantum Electron 16(1):316–324
95. Shin DJ, Lee KH, Ji H-C, Na KW, Kim SG, Bok JK, You YS, Kim SS, Joe IS, Suh SD, Pyo J, Shin YH, Ha KH, Park YD, Chung CH (2010) Mach–Zehnder silicon modulator on bulk silicon substrate: toward DRAM optical interface. In: 2010 7th IEEE international conference on group IV photonics (GFP), Beijing, China, pp 210–212
96. Shoji T, Tsuchizawa T, Watanabe T, Yamada K, Morita H (2002) Low loss mode size converter from 0.3 μm square Si waveguides to singlemode fibres. Electron Lett 38(25):1669–1670
97. Soref R, Bennett B (1987) Electrooptical effects in silicon. IEEE J Quantum Electron 23(1):123–129
98. Sparacin DK, Sun R, Agarwal AM, Beals MA, Michel J, Kimerling LC, Conway TJ, Pomerene AT, Carothers DN, Grove MJ, Gill DM, Rasras MS, Patel SS, White AE (2006) Low-loss amorphous silicon channel waveguides for integrated photonics. In: 2006 3rd IEEE international conference on group IV photonics, Ottawa, Canada, pp 255–257
99. Spector S, Geis MW, Lennon D, Williamson RC, Lyszczarz TM (2004) Hybrid multi-mode/single-mode waveguides for low loss. In: Optical amplifiers and their applications/integrated photonics research, San Francisco. Optical Society of America, p IThE5
100. Spuesens T, Liu L, De Vries T, Rojo-Romeo P, Regreny P, Van Thourhout D (2009) Improved design of an InP-based microdisk laser heterogeneously integrated with SOI. In: 6th IEEE international conference on group IV photonics, Sorrento, Italy, p FA3
101. Sun P, Reano RM (2010) Submilliwatt thermo-optic switches using free-standing silicon-on-insulator strip waveguides. Opt Express 18(8):8406–8411
102. Taillaert D, Bogaerts W, Bienstman P, Krauss TF, Van Daele P, Moerman I, Verstuyft S, De Mesel K, Baets R (2002) An out-of-plane grating coupler for efficient butt-coupling between compact planar waveguides and single-mode fibers. IEEE J Quantum Electron 38(7):949–955
103. Taillaert D, Van Laere F, Ayre M, Bogaerts W, Van Thourhout D, Bienstman P, Baets R (2006) Grating couplers for coupling between optical fibers and nanophotonic waveguides. Jpn J Appl Phys 45(8A):6071–6077
104. Teng J, Dumon P, Bogaerts W, Zhang H, Jian X, Han X, Zhao M, Morthier G, Baets R (2009) Athermal silicon-on-insulator ring resonators by overlaying a polymer cladding on narrowed waveguides. Opt Express 17(17):14627–14633
105. Tsuchizawa T, Yamada K, Fukuda H, Watanabe T, Takahashi J, Takahashi M, Shoji T, Tamechika E, Itabashi S, Morita H (2005) Microphotonics devices based on silicon microfabrication technology. IEEE J Sel Top Quantum Electron 11(1):232–240
106. Van Acoleyen K, Roels J, Claes T, Van Thourhout D, Baets RG (2011) NEMS-based optical phase modulator fabricated on silicon-on-insulator. In: 2011 8th IEEE international conference on group IV photonics, London, UK, p FC6
107. Van Campenhout J, Green WMJ, Assefa S, Vlasov YA (2010) Integrated NiSi waveguide heaters for CMOS-compatible silicon thermo-optic devices. Opt Lett 35(7):1013–1015
108. Van Campenhout J, Green WM, Assefa S, Vlasov YuA (2009) Low-power, 2 × 2 silicon electro-optic switch with 110-nm bandwidth for broadband reconfigurable optical networks. Opt Express 17(26):24020–24029
109. Van Campenhout J, Liu L, Romeo PR, Van Thourhout D, Seassal C, Regreny P, Di Cioccio L, Fedeli J-M, Baets R (2008) A compact SOI-integrated multiwavelength laser source based on cascaded InP microdisks. IEEE Photon Technol Lett 20(16):1345–1347
110. Van Campenhout J, Rojo Romeo P, Regreny P, Seassal C, Van Thourhout D, Verstuyft S, Di Cioccio L, Fedeli J-M, Lagahe C, Baets R (2007) Electrically pumped InP-based microdisk lasers integrated with a nanophotonic silicon-on-insulator waveguide circuit. Opt Express 15(11):6744–6749
111. Van Laere F, Claes T, Schrauwen J, Scheerlinck S, Bogaerts W, Taillaert D, O'Faolain L, Van Thourhout D, Baets R (2007) Compact focusing grating couplers for silicon-on-insulator integrated circuits. IEEE Photon Technol Lett 19(23):1919–1921
112. Van Laere F, Roelkens G, Ayre M, Schrauwen J, Taillaert D, Van Thourhout D, Krauss TF, Baets R (2007) Compact and highly efficient grating couplers between optical fiber and nanophotonic waveguides. J Lightwave Technol 25(1):151–156
113. Van Thourhout D, Spuesens T, Selvaraja SK, Liu L, Roelkens G, Kumar R, Morthier G, Rojo-Romeo P, Mandorlo F, Regreny P, Raz O, Kopp C, Grenouillet L (2010) Nanophotonic devices for optical interconnect. IEEE J Sel Top Quantum Electron 16(5):1363–1375
114. Vermeulen D, Selvaraja S, Verheyen P, Lepage G, Bogaerts W, Absil P, Van Thourhout D, Roelkens G (2010) High-efficiency fiber-to-chip grating couplers realized using an advanced CMOS-compatible silicon-on-insulator platform. Opt Express 18(17):18278–18283
115. Vivien L, Osmond J, Fédéli J-M, Marris-Morini D, Crozat P, Damlencourt J-F, Cassan E, Lecunff Y, Laval S (2009) 42 GHz p.i.n. germanium photodetector integrated in a silicon-on-insulator waveguide. Opt Express 17(8):6252–6257
116. Vivien L, Rouvière M, Fédéli J-M, Marris-Morini D, Damlencourt J-F, Mangeney J, Crozat P, El Melhaoui L, Cassan E, Le Roux X, Pascal D, Laval S (2007) High speed and high responsivity germanium photodetector integrated in a silicon-on-insulator microwaveguide. Opt Express 15(15):9843–9848
117. Vlasov Y, Green WMJ, Xia F (2008) High-throughput silicon nanophotonic wavelength-insensitive switch for on-chip optical networks. Nat Photon 2(4):242–246
78 W. Bogaerts et al.
118. Wang Z, Chen Y-Z, Doerr CR (2009) Analysis of a synchronized flattop AWG using low coherence interferometric method. IEEE Photon Technol Lett 21(8):498–500
119. Watts MR, Trotter DC, Young RW, Lentine AL (2008) Ultralow power silicon microdisk modulators and switches. In: 2008 5th IEEE international conference on group IV photonics, Sorrento, Italy, pp 4–6
120. Webster MA, Pafchek RM, Sukumaran G, Koch TL (2005) Low-loss quasi-planar ridge waveguides formed on thin silicon-on-insulator. Appl Phys Lett 87(23):231108
121. Xia F, Rooks M, Sekaric L, Vlasov Yu (2007) Ultra-compact high order ring resonator filters using submicron silicon photonic wires for on-chip optical interconnects. Opt Express 15(19):11934–11941
122. Xu Q, Manipatruni S, Schmidt B, Shakya J, Lipson M (2007) 12.5 Gbit/s carrier-injection-based silicon micro-ring silicon modulators. Opt Express 15(2):430–436
123. Yamada K, Shoji T, Tsuchizawa T, Watanabe T, Takahashi J, Itabashi S (2005) Silicon-wire-based ultrasmall lattice filters with wide free spectral ranges. IEEE J Sel Top Quantum Electron 11:232–240
124. Ye T, Cai X (2010) On power consumption of silicon-microring-based optical modulators. J Lightwave Technol 28(11):1615–1623
125. Zhang L, Yue Y, Xiao-Li Y, Wang J, Beausoleil RG, Willner AE (2010) Flat and low dispersion in highly nonlinear slot waveguides. Opt Express 18(12):13187–13193
126. Zheng X, Patil D, Lexau J, Liu F, Li G, Thacker H, Luo Y, Shubin I, Li J, Yao J, Dong P, Feng D, Asghari M, Pinguet T, Mekis A, Amberg P, Dayringer M, Gainsley J, Moghadam HF, Alon E, Raj K, Ho R, Cunningham J, Krishnamoorthy A (2011) Ultra-efficient 10 Gb/s hybrid integrated silicon photonic transmitter and receiver. Opt Express 19(6):5172–5186
127. Zhu S, Fang Q, Yu MB, Lo GQ, Kwong DL (2009) Propagation losses in undoped and n-doped polycrystalline silicon wire waveguides. Opt Express 17(23):20891–20899
Part II
On-Chip Optical Communication
Topologies
Chapter 3
Designing Chip-Level Nanophotonic
Interconnection Networks
C. Batten (✉)
School of Electrical and Computer Engineering,
College of Engineering, Cornell University, 323 Rhodes Hall, Ithaca, NY 14853, USA
e-mail: cbatten@cornell.edu
A. Joshi
Department of Electrical and Computer Engineering, Boston University, Boston, MA 02215, USA
V. Stojanović
Department of Electrical Engineering and Computer Science, Massachusetts Institute of
Technology, Cambridge, MA 02139, USA
K. Asanović
Department of Electrical Engineering and Computer Science, University of California at
Berkeley, Berkeley, CA 94720, USA
Introduction
Today's graphics, network, embedded, and server processors already contain many
cores on one chip, and this number will continue to increase over the next decade.
Intra-chip and inter-chip communication networks are becoming critical compo-
nents in such systems, affecting not only performance and power consumption, but
also programmer productivity. Any future interconnect technology used to address
these challenges must be judged on three primary metrics: bandwidth density,
energy efficiency, and latency. Enhancements of current electrical technology
might enable improvements in two metrics while sacrificing a third. Nanophotonics
is a promising disruptive technology that can potentially achieve simultaneous
improvements in all three metrics, and could therefore radically transform chip-
level interconnection networks. Of course, there are many practical challenges
involved in using any emerging technology including economic feasibility, effec-
tive system design, manufacturing issues, reliability concerns, and mitigating vari-
ous overheads.
There has recently been a diverse array of proposals for network architectures
that use nanophotonic devices to potentially improve performance and energy
efficiency. These proposals explore different single-stage topologies from buses [9,
14, 29, 53, 74, 76] to crossbars [39, 44, 64, 65, 76] and different multistage topolo-
gies from quasi-butterflies [6, 7, 26, 32, 34, 41, 56, 63] to tori [18, 48, 69]. Note that
we specifically focus on chip-level networks as opposed to cluster-level optical net-
works used in high-performance computing and data-centers. Most proposals use
different routing algorithms, flow control mechanisms, optical wavelength organi-
zations, and physical layouts. While this diversity makes for an exciting new
research field, it also makes it difficult to see relationships between different propos-
als and to identify promising directions for future network design.
In previous work, we briefly described our approach for designing nanophoto-
nic interconnection networks, which is based on thinking of the design at three
levels: the architectural level, the microarchitectural level, and the physical
level [8]. In this chapter, we expand on this earlier description, provide greater
detail on design trade-offs at each level, and categorize previous proposals in the
literature. Architectural-level design focuses on choosing the best logical network
topology and routing algorithm. This early phase of design should also include a
detailed design of an electrical baseline network to motivate the use of nanophoto-
nic devices. Microarchitectural-level design considers which buses, channels, and
routers should be implemented with electrical versus nanophotonic technology.
This level of design also explores how to best implement optical switching, tech-
niques for wavelength arbitration, and effective flow control. Physical-level design
determines where to locate transmitters and receivers, how to map wavelengths to waveguides, and how to lay out the photonic devices across the chip.
Nanophotonic Technology
This section briefly reviews the basic devices used to implement nanophotonic
interconnection networks, before discussing the opportunities and challenges
involved with this emerging technology. See [10, 68] for a more detailed review of
recent work on nanophotonic devices. This section also describes in more detail the
specific nanophotonic technology that we assume for the case studies presented
later in this chapter.
Fig. 3.1 Nanophotonic devices. Four point-to-point nanophotonic channels implemented with wavelength-division multiplexing. Such channels can be used for purely intra-chip communication or seamless intra-chip/inter-chip communication. The number inside each ring indicates its resonant wavelength; each input (I1–I4) is passively connected to the output with the corresponding subscript (O1–O4); the link corresponding to I2 → O2 on wavelength λ2 is highlighted (from [8], courtesy of IEEE)
For example, a transmitter at input I2 modulates data onto wavelength λ2, which is passively routed to output O2, and so on. For higher-bandwidth channels we can either increase the modulation rate of each wavelength, or we can use multiple wavelengths to implement a single logical channel. The same devices can be used for a purely intra-chip interconnect by simply integrating transmitters and receivers on the same chip.
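As a quick illustration, the aggregate bandwidth of such a WDM channel is simply the product of the wavelength count and the per-wavelength modulation rate. The sketch below makes this explicit; the numbers are illustrative, not figures from any specific device.

```python
# Sketch: aggregate bandwidth of a WDM nanophotonic channel.
# Wavelength counts and per-wavelength rates are illustrative assumptions.

def channel_bandwidth_gbps(num_wavelengths: int, rate_per_wavelength_gbps: float) -> float:
    """Aggregate logical channel bandwidth under wavelength-division multiplexing."""
    return num_wavelengths * rate_per_wavelength_gbps

# A 4-wavelength channel at 10 Gb/s per wavelength, as in Fig. 3.1:
print(channel_bandwidth_gbps(4, 10.0))   # 40.0
# Widening the same logical channel to 64 wavelengths:
print(channel_bandwidth_gbps(64, 10.0))  # 640.0
```

The same arithmetic underlies the higher-bandwidth option mentioned above: doubling either the wavelength count or the modulation rate doubles the logical channel bandwidth.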
As shown in Fig. 3.1, the silicon ring resonator is used in transmitters, passive filters, and receivers. Although other photonic structures (e.g., Mach–Zehnder interferometers) are possible, ring modulators are extremely compact (3–10 μm radius), resulting in reduced area and power consumption. Although not shown in Fig. 3.1, many nanophotonic interconnection networks also use active filtering to implement optical switching. For example, we might include multiple receivers with active filters for wavelength λ1 on chip B. Each receiver's ring filter would be detuned by default, and we can then actively tune a single receiver's ring filter into resonance
using charge injection. This actively steers the light to one of many possible outputs.
Some networks use active ring filters in the middle of the network itself. For exam-
ple, we might replace the passive ring filters on chip A in Fig. 3.1 with active ring
filters to create an optical switch. When detuned, inputs I1, I2, I3, and I4 are con-
nected to outputs O1, O4, O3, and O2, respectively. When the ring filters are actively
tuned into resonance, then the inputs are connected to the outputs with the corre-
sponding subscripts. Of course, one of the challenges with these actively switched
filters is in designing the appropriate electrical circuitry for routing and flow control
that determines when to tune or detune each filter.
Most recent nanophotonic interconnection designs use the devices shown in
Fig. 3.1, but some proposals also use alternative devices such as vertical cavity surface
emitting lasers combined with free-space optical channels [1, 78] or planar wave-
guides [48]. This chapter focuses on the design of networks with the more common
ring-based devices and linear waveguides, and we leave a more thorough treatment of
interconnect network design using alternative devices for future work.
Opto-Electrical Integration
Tightly integrating optical and electrical devices is critical for achieving the potential
bandwidth density and energy efficiency advantages of nanophotonic devices. There
are three primary approaches for opto-electrical integration in intra-chip and inter-
chip interconnection networks: hybrid integration, monolithic back-end-of-line
(BEOL) integration, and monolithic front-end-of-line (FEOL) integration.
Hybrid Integration. The highest-performing optical devices are fabricated
through dedicated processes customized for building such devices. These optical
chips can then be attached to a micro-electronic chip fabricated with a standard
electrical CMOS process through package-level integration [2], flip-chip bonding
the two wafers/chips face-to-face [73, 81], or 3D integration with through-silicon
vias [35]. Although this approach is feasible using integration technologies avail-
able currently or in the near future, it requires inter-die electrical interconnect (e.g.,
micro-bumps or through-silicon vias) to communicate between the micro-electronic
and active optical devices. It can be challenging to engineer this inter-die intercon-
nect to avoid mitigating the energy efficiency and bandwidth density advantages of
chip-level nanophotonics.
Monolithic BEOL Integration. Nanophotonic devices can be deposited on top of
the metal interconnect stack using amorphous silicon [38], poly-silicon [67], silicon
nitride [5], germanium [52], and polymers [15, 33]. Ultimately, a combination of these
materials can be used to create a complete nanophotonic link [79]. Compared to
hybrid integration, BEOL integration brings the optical devices closer to the micro-
electronics which can improve energy efficiency and bandwidth density. BEOL inte-
gration does not require changes to the front end, does not consume active area, and
can provide multiple layers of optical devices (e.g., multi-layer waveguides). Although
some specialized materials can be used in BEOL integration, the nanophotonic devices
must be deposited within a strict thermal processing envelope and of course require
modifications to the final layers of the metal interconnect stack. This means that
BEOL devices often must trade-off bandwidth density for energy efficiency (e.g.,
electro-optic modulator devices [79] operate at relatively high drive voltages to achieve
the desired bandwidth and silicon-nitride waveguides have large bending losses limit-
ing the density of photonic devices). BEOL integration is suitable for use with both
SOI and bulk CMOS processes, and can potentially also be used in other applications
such as for depositing optics on DRAM or FLASH chips.
Monolithic FEOL Integration. Photonic devices without integrated electrical cir-
cuitry have been implemented in monocrystalline silicon-on-insulator (SOI) dies
with a thick layer of buried oxide (BOX) [23, 49], and true monolithic FEOL integra-
tion of electrical and photonic devices have also been realized [25, 28]. Thin-BOX
SOI is possible with localized substrate removal under the optical devices [31]. On
the one hand, FEOL integration can support high-temperature process modifications
and enables the tightest possible coupling to the electrical circuits, but also consumes
valuable active area and requires modifications to the sensitive front-end processing.
These modifications can include incorporating pure germanium or high-percentage
silicon-germanium on the active layer, additional processing steps to reduce wave-
guide sidewall roughness, and improving optical cladding with either a custom thick
buried-oxide or a post-processed air gap under optical devices. In addition, FEOL
integration usually requires an SOI CMOS process, since the silicon waveguides are
implemented in the same silicon film used for the SOI transistors. There has, how-
ever, been work on implementing FEOL polysilicon nanophotonic devices with
localized substrate removal in a bulk process [58, 61].
Ring-resonator devices have extremely high Q-factors, which enhance the electro-optical properties of modulators and active filters and enable dense wavelength-division multiplexing. Unfortunately, this also means small unwanted changes in the resonance can quickly shift a device out of the required frequency operating range. Common sources of variation include process variation, which can result in unwanted ring geometry variation within the same die, and thermal variation, which can result in spatial and temporal variation in the refractive index of silicon-photonic devices. Several simulation-based and experimental studies have reported that a 1 nm variation in the ring width can shift a ring's resonance by approximately 0.5 nm [47, 70], and a single degree change in temperature can shift a ring's resonance by approximately 0.1 nm [22, 47, 51]. Many nanophotonic network proposals assume tens of wavelengths per waveguide [6, 32, 63, 74, 76], which results in a channel spacing of less than 1 nm (100 GHz). This means that even modest fabrication or thermal variation can shift a ring across one or more channels, so some form of active tuning is generally required.
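The sensitivities quoted above can be turned into a rough drift estimate. The sketch below uses the ~0.5 nm-per-nm width sensitivity and ~0.1 nm-per-kelvin thermal sensitivity from the cited studies; the specific width variation, temperature swing, and channel spacing are illustrative assumptions.

```python
# Sketch: resonance drift of a ring versus the WDM channel spacing.
# Sensitivity coefficients are the approximate figures quoted in the text;
# the variation magnitudes below are illustrative.

WIDTH_SENSITIVITY_NM_PER_NM = 0.5   # resonance shift per nm of width variation
THERMAL_SENSITIVITY_NM_PER_K = 0.1  # resonance shift per kelvin

def resonance_shift_nm(width_variation_nm: float, delta_temp_k: float) -> float:
    """First-order resonance shift from geometric plus thermal variation."""
    return (WIDTH_SENSITIVITY_NM_PER_NM * width_variation_nm
            + THERMAL_SENSITIVITY_NM_PER_K * delta_temp_k)

def channels_drifted(shift_nm: float, channel_spacing_nm: float) -> float:
    """Drift expressed in units of the WDM channel spacing."""
    return shift_nm / channel_spacing_nm

# 2 nm of width variation plus a 10 K temperature swing, against 0.9 nm spacing:
shift = resonance_shift_nm(2.0, 10.0)        # 2.0 nm
print(round(channels_drifted(shift, 0.9), 2))  # 2.22
```

Even these modest variations move a ring more than two channels away from its target, which is why active tuning is a recurring theme in these designs.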
Optical power is another key constraint. Light must travel from the laser source through couplers, waveguides, and the ring filters shown in Fig. 3.1, and eventually to the photodetector, with each device contributing some loss. In addition to the photonic device losses, there is also a limit to the total amount of optical power that can be transmitted through a waveguide without large non-linear losses. High optical losses per wavelength necessitate distributing those wavelengths across many waveguides (increasing the overall area) to stay within this non-linearity limit. Minimizing optical loss is a key device design objective, and meaningful system-level design must take into account the total optical power overhead.
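A first-order power budget can be sketched as follows. The receiver sensitivity, path loss, and per-waveguide power ceiling below are hypothetical placeholders rather than measured values; the point is only to show how high per-wavelength losses force wavelengths to be spread across more waveguides.

```python
# Sketch: first-order optical power budgeting for a WDM link.
# All numeric inputs (sensitivity, loss, non-linearity ceiling) are
# hypothetical assumptions for illustration.
import math

def required_laser_power_mw(receiver_sensitivity_mw: float, path_loss_db: float) -> float:
    """Laser power needed per wavelength to overcome the end-to-end path loss."""
    return receiver_sensitivity_mw * 10 ** (path_loss_db / 10)

def waveguides_needed(num_wavelengths: int,
                      power_per_wavelength_mw: float,
                      max_power_per_waveguide_mw: float) -> int:
    """Waveguides required to keep each guide below the non-linearity limit."""
    per_guide = max(1, int(max_power_per_waveguide_mw // power_per_wavelength_mw))
    return math.ceil(num_wavelengths / per_guide)

# 10 uW receiver sensitivity behind 20 dB of path loss needs ~1 mW per wavelength;
# with a 20 mW per-waveguide ceiling, 64 wavelengths must spread over 4 waveguides.
print(required_laser_power_mw(0.01, 20.0))
print(waveguides_needed(64, 1.0, 20.0))  # 4
```

Note how the exponential dependence on loss (in dB) makes every decibel saved in device design directly shrink either the laser power or the waveguide count.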
In the case studies presented later in this chapter, we will be assuming a monolithic
FEOL integration strategy. Our approach differs from other integration strategies,
since we attempt to integrate nanophotonics into state-of-the-art bulk-CMOS micro-
electronic chips with no changes to the standard CMOS fabrication process. In this
section, we provide a brief overview of the specific technology we are developing
with our colleagues at the Massachusetts Institute of Technology. We use our experi-
ences with a 65 nm test chip [60], our feasibility studies for a prototype 32 nm pro-
cess, predictive electrical device models [80], and interconnect projections [36] to
estimate both electrical and photonic device parameters for a target 22 nm technology
node. Device-level details about the MIT nanophotonic technology assumed in the
rest of this chapter can be found in [30, 5861], although the technology is rapidly
evolving such that more recent device-level work uses more advanced device and
circuit techniques [24, 25, 46]. Details about the specific technology assumptions for
each case study can be found in our previous system-level publications [6, 7, 9, 32].
Waveguide. To avoid process changes, we design our waveguides in the polysili-
con layer on top of the shallow-trench isolation in a standard bulk CMOS process
(see Fig. 3.2a). Unfortunately, the shallow-trench oxide is too thin to form an effec-
tive cladding and shield the core from optical-mode leakage into the silicon sub-
strate. We have developed a novel self-aligned post-processing procedure to etch
away the silicon substrate underneath the waveguide forming an air gap. A reason-
ably deep air gap provides a very effective optical cladding. For our case studies, we
assume eight-waveguide bundles can share the same air gap with a 4-μm waveguide pitch and an extra 5-μm of spacing on either side of the bundle. We estimate a time-of-flight latency of approximately 10.5 ps/mm, which enables raw interconnect latencies for crossing a 400-mm² chip on the order of one to three cycles at a 5-GHz core clock frequency.
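The time-of-flight figure above translates directly into clock cycles. A minimal sketch, using the ~10.5 ps/mm estimate and the 5-GHz clock from the text; the specific route lengths are illustrative.

```python
# Sketch: raw waveguide time-of-flight latency in core clock cycles.
# The 10.5 ps/mm and 5-GHz figures come from the text; route lengths
# below are illustrative assumptions.

TOF_PS_PER_MM = 10.5
CLOCK_GHZ = 5.0

def tof_cycles(length_mm: float) -> float:
    """Time-of-flight latency for a waveguide of the given length, in cycles."""
    cycle_ps = 1000.0 / CLOCK_GHZ  # 200 ps per cycle at 5 GHz
    return length_mm * TOF_PS_PER_MM / cycle_ps

# Crossing one 20-mm edge of a 400-mm^2 chip, and a longer 40-mm serpentine route:
print(round(tof_cycles(20), 2))  # 1.05
print(round(tof_cycles(40), 2))  # 2.1
```

This is consistent with the "one to three cycles" estimate above: even a route twice the chip edge stays within a few core clock cycles.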
Transmitter. Our transmitter design is similar to past approaches that use minor-
ity charge-injection to change the resonant frequency of ring modulators [50]. Our
racetrack modulator design is implemented by doping the edges of a polysilicon
modulator structure creating a lateral PiN diode with undoped polysilicon as the
intrinsic region (see Fig. 3.2b). Our device simulations indicate that with polysilicon
carrier lifetimes of 0.1–1 ns it is possible to achieve sub-100 fJ per bit time (fJ/bt)
modulator driver energy for random data at up to 10 Gb/s with advanced digital
Fig. 3.2 MIT monolithic FEOL nanophotonic devices. (a) Polysilicon waveguide over SiO2 film
with an air gap etched into the silicon substrate to provide optical cladding; (b) polysilicon ring
modulator that uses charge injection to modulate a single wavelength: without charge injection the
resonant wavelength is filtered to the drop port while all other wavelengths continue to the
through port; with charge injection, the resonant frequency changes such that no wavelengths are
filtered to the drop port; (c) cascaded polysilicon rings that passively filter the resonant wave-
length to the drop port while all other wavelengths continue to the through port (adapted
from [7], courtesy of IEEE)
equalization circuits. To avoid robustness and power issues from distributing a mul-
tiple-GHz clock to hundreds of transmitters, we propose implementing an optical
clock delivery scheme using a simple single-diode receiver with duty-cycle correc-
tion. We estimate the serialization and driver circuitry will add less than a single cycle of latency at a 5-GHz core clock frequency.
Passive Filter. We use polysilicon passive filters with two cascaded rings for
increased frequency roll-off (see Fig. 3.2c). As mentioned earlier in this section, the
rings resonance is sensitive to temperature and requires active thermal tuning.
Fortunately, the etched air gap under the ring provides isolation from the thermally
conductive substrate, and we add in-plane polysilicon heaters inside most rings to
improve heating efficiency. Thermal simulations suggest that we will require
40–100 μW of static power for each double-ring filter assuming a temperature range
of 20 K. These ring filters can also be designed to behave as active filters by using
charge injection as in our transmitters, except at lower data rates.
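At the system level, this per-filter tuning power must be multiplied by the total filter count, which can easily reach the thousands in a large network. A small sketch, using the per-filter range quoted above; the filter counts are hypothetical.

```python
# Sketch: aggregate static thermal-tuning power across a whole network.
# Per-filter power is the range quoted in the text; filter counts are
# hypothetical assumptions.

def total_tuning_power_mw(num_filters: int, per_filter_uw: float) -> float:
    """Total static tuning power in mW for a given double-ring filter count."""
    return num_filters * per_filter_uw / 1000.0

# 1,000 double-ring filters at the 40 uW and 100 uW ends of the quoted range:
print(total_tuning_power_mw(1000, 40.0))   # 40.0 mW
print(total_tuning_power_mw(1000, 100.0))  # 100.0 mW
```

Even at tens of microwatts per filter, a ring-rich topology can accumulate a tuning power budget that rivals the active link energy, which is why heater efficiency matters so much.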
Receiver. The lack of pure Ge presents a challenge for mainstream bulk CMOS
processes. We use the embedded SiGe (20–30 % Ge) in the p-MOSFET transistor
source/drain regions to create a photodetector operating at around 1200 nm. Simulation
results show good capacitance (less than 1 fF/μm) and dark current (less than 10 fA/μm)
at near-zero bias conditions, but the sensitivity of the structure needs to be improved
to meet our system specifications. In advanced process nodes, the responsivity and
speed should improve through better coupling between the waveguide and the photo-
detector in scaled device dimensions, and an increased percentage of Ge for device
strain. Our photonic receiver circuits would use the same optical clocking scheme as
our transmitters, and we estimate that the entire receiver will consume less than 50 fJ/
bt for random data. We estimate the deserialization and driver circuitry will add less than a single cycle of latency at a 5-GHz core clock frequency.
Based on our device simulations and experiments we project that it may be pos-
sible to multiplex 64 wavelengths per waveguide at a 60-GHz spacing, and that by
interleaving wavelengths traveling in opposite directions (which helps mitigate
interference) we can possibly have up to 128 wavelengths per waveguide. With a 4-μm waveguide pitch and 64–128 wavelengths per waveguide, we can achieve a bandwidth density of 160–320 Gb/s/μm for intra-chip nanophotonic interconnect. With a 50-μm fiber coupler pitch, we can achieve a bandwidth density of 12–25 Gb/s/μm for inter-chip nanophotonic interconnect. Total link latencies including serialization, modulation, time-of-flight, receiving, and deserialization could range from three to eight cycles depending on the link length. We also project that the total electrical and thermal on-chip energy for a complete 10 Gb/s nanophotonic intra-chip or inter-chip link (including a racetrack modulator and a double-ring filter at the receiver) can be as low as 100–250 fJ/bt for random data. These projections
suggest that optical communication should support significantly higher bandwidth
densities, improved energy efficiency, and competitive latency compared to both
optimally repeated global intra-chip electrical interconnect (e.g., [36]) and projected
inter-chip electrical interconnect.
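These bandwidth-density projections follow directly from the wavelength count, per-wavelength rate, and physical pitch, and can be recomputed as below (the 10 Gb/s per-wavelength rate and the pitches are the figures assumed in the text).

```python
# Sketch: recomputing the bandwidth-density projections from the text's
# assumed wavelength counts, per-wavelength rate, and physical pitches.

def bandwidth_density_gbps_per_um(num_wavelengths: int,
                                  rate_gbps: float,
                                  pitch_um: float) -> float:
    """Aggregate bandwidth per unit of chip-edge (or coupler) pitch."""
    return num_wavelengths * rate_gbps / pitch_um

# Intra-chip: 64-128 wavelengths at 10 Gb/s over a 4-um waveguide pitch:
print(bandwidth_density_gbps_per_um(64, 10, 4))    # 160.0
print(bandwidth_density_gbps_per_um(128, 10, 4))   # 320.0
# Inter-chip: the same wavelengths over a 50-um fiber-coupler pitch:
print(bandwidth_density_gbps_per_um(64, 10, 50))   # 12.8
print(bandwidth_density_gbps_per_um(128, 10, 50))  # 25.6
```

The roughly 12× gap between the intra-chip and inter-chip figures is entirely a pitch effect: the fiber coupler, not the photonics, sets the off-chip density limit.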
Architectural-Level Design
Fig. 3.3 Logical topologies for various 64-terminal networks. (a) 64-writer/64-reader single global bus; (b) 64 × 64 global non-blocking crossbar; (c) 8-ary 2-stage butterfly; (d) 8-ary 2-dimensional torus. Squares: input and/or output terminals; dots: routers; in (c) inter-dot lines: uni-directional channels; in (d) inter-dot lines: two channels in opposite directions (from [8], courtesy of IEEE)
Many logical topologies are possible (see [4] for a study specifically focused on intra-chip networks). At this preliminary phase of design,
we can begin to determine the bus and channel bandwidths that will be required to
meet application requirements assuming ideal routing and flow-control algorithms.
Usually this analysis is in terms of theoretical upper-bounds on the networks band-
width and latency, but we can also begin to explore how more realistic routing
algorithms might impact the networks performance. When designing nanophotonic
interconnection networks, it is also useful to begin by characterizing state-of-the-art
electrical networks. Developing realistic electrical baseline architectures early in
the design process can help motivate the best opportunities for leveraging nanopho-
tonic devices. This subsection discusses a range of topologies used in nanophotonic
interconnection networks.
A global bus is perhaps the simplest of logical topologies, and involves N input
terminals arbitrating for a single shared medium so that they can communicate with
one of N−1 output terminals (see Fig. 3.3a). Buses can make good use of scarce
wiring resources, serialize messages which can be useful for some higher-level pro-
tocols, and enable one input terminal to easily broadcast a message to all output
terminals. Unfortunately, using a single shared medium often limits the performance
of buses due to practical constraints on bus bandwidth and arbitration latency as the
number of network terminals increases. There have been several nanophotonic bus
designs that explore these trade-offs, mostly in the context of implementing efficient
DRAM memory channels [9, 29, 53, 74, 76] (discussed further in case study #3),
although there have also been proposals for specialized nanophotonic broadcast
buses to improve the performance of application barriers [14] and cache-coherence
protocols [76]. Multiple global buses can be used to improve system throughput,
and such topologies have also been designed using nanophotonic devices [62].
A global crossbar topology is made up of N buses with each bus dedicated to a
single terminal (see Fig. 3.3b). Such topologies present a simple performance model
Internal concentration shares multiple terminals across a single router port at the edge of the network, while external concentration integrates multiple terminals into a unified higher-radix router. There has
been some work investigating how to best use nanophotonics in both two-dimen-
sional torus [69] and mesh [18, 48] topologies.
While many nanophotonic interconnection networks can be loosely categorized
as belonging to one of the four categories shown in Fig. 3.3, there are also more radi-
cal alternatives. For example, Koohi et al. propose a hierarchical topology for an
on-chip nanophotonic network where a set of global rings connect clusters each
with their own local ring [42].
Table 3.1 is an example of the first-order analysis that can be performed at the
architectural level of design. In this example, we compare six logical topologies for
a 64-terminal on-chip symmetric network. For the first-order latency metrics we
assume a 22-nm technology, 5-GHz clock frequency, and a 400-mm² chip. The bus
and channel bandwidths are sized so that each terminal can sustain 128 b/cycle
under uniform random traffic assuming ideal routing and flow control. Even from
this first-order analysis we can start to see that some topologies (e.g., crossbar,
butterfly, and Clos) require fewer channels but they are often long, while other
topologies (e.g., torus and mesh) require more channels but they are often short.
We can also see which topologies (e.g., crossbar and Clos) require more global
bisection wiring resources, and which topologies require higher-radix routers (e.g.,
crossbar, butterfly, Clos, and cmesh). First-order zero-load latency calculations can
help illustrate trade-offs between hop count, router complexity, and serialization
latency. Ultimately, this kind of rough analysis for both electrical and nanophoto-
nic networks helps motivate the microarchitectural-level design discussed in the
next section.
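As a rough illustration of this kind of first-order analysis, the zero-load latency of a topology can be modeled as the sum of router, channel, and serialization latencies. The sketch below uses purely illustrative parameter values, not the exact assumptions behind Table 3.1:

```python
# First-order zero-load latency model for a logical topology, in cycles.
# All parameter values below are illustrative assumptions.

def zero_load_latency(hops, router_cycles, channel_cycles,
                      packet_bits, channel_bits_per_cycle):
    """Zero-load latency = router latency + channel latency + serialization."""
    hop_latency = hops * router_cycles        # cycles spent traversing routers
    wire_latency = hops * channel_cycles      # cycles spent on inter-router channels
    serialization = packet_bits / channel_bits_per_cycle  # cycles to push packet onto channel
    return hop_latency + wire_latency + serialization

# Compare a low-hop-count, long-channel topology (crossbar-like) against a
# high-hop-count, short-channel topology (mesh-like).
crossbar_like = zero_load_latency(hops=1, router_cycles=3, channel_cycles=8,
                                  packet_bits=512, channel_bits_per_cycle=128)
mesh_like = zero_load_latency(hops=8, router_cycles=2, channel_cycles=1,
                              packet_bits=512, channel_bits_per_cycle=128)
print(crossbar_like, mesh_like)
```

The hypothetical numbers make the hop-count versus serialization trade-off visible: the crossbar-like topology pays one long channel traversal, while the mesh-like topology pays many short hops.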
Microarchitectural-Level Design
Fig. 3.4 Symbols used in nanophotonic schematics and layouts. For all ring-based devices, the
number next to the ring indicates the resonant wavelength, and a range of numbers next to the ring
indicates that the symbol actually represents multiple devices each tuned to a distinct wavelength
in that range. The symbols shown include: (a) coupler for attaching a fiber to an on-chip wave-
guide; (b) transmitter including driver and ring modulator for λ1; (c) multiple transmitters includ-
ing drivers and ring modulators for each of λ1–λ4; (d) receiver including passive ring filter for λ1
and photodetector; (e) receiver including active ring filter for λ1 and photodetector; (f) passive ring
filter for λ1; (g) active ring filter for λ1 (from [8], courtesy of IEEE)
Fig. 3.5 Microarchitectural schematics for nanophotonic four-terminal buses. The buses connect one
or more input terminals (I1–I4) to one or more output terminals (O1–O4) via a single shared wavelength:
(a) single-writer broadcast-reader bus; (b) single-writer multiple-reader bus; (c) multiple-writer sin-
gle-reader bus; (d) multiple-writer multiple-reader bus (adapted from [8], courtesy of IEEE)
medium (see Fig. 3.5). Assuming a fixed modulation rate per wavelength, we can
increase the bus bandwidth by using multiple parallel wavelengths. In the single-
writer broadcast-reader (SWBR) bus shown in Fig. 3.5a, a single input terminal
modulates the bus wavelength that is then broadcast to all four output terminals. This
form of broadcast bus does not need any arbitration because there is only one input
terminal. The primary disadvantage of a SWBR bus is simply the large amount of
optical power required to broadcast packets to all output terminals. If we wish to send
a packet to only one of many outputs, then we can significantly reduce the optical
power by using active filters in each receiver. Figure 3.5b shows a single-writer mul-
tiple-reader (SWMR) bus where by default the ring filters in each receiver are
detuned such that none of them drop the bus wavelength. When the input terminal
sends a packet to an output terminal, it first ensures that the ring filter at the destina-
tion receiver is actively tuned into the bus wavelength. The control logic for this
active tuning usually requires additional optical or electrical communication from
the input terminal to the output terminals. Figure 3.5c illustrates a different bus net-
work called a multiple-writer single-reader (MWSR) bus where four input terminals
arbitrate to modulate the bus wavelength that is then dropped at a single output ter-
minal. MWSR buses require global arbitration, which can be implemented either
electrically or optically. The most general bus network enables multiple input termi-
nals to arbitrate for the shared bus and also allows a packet to be sent to one or more
output terminals. Figure 3.5d illustrates a multiple-writer multiple-reader (MWMR)
bus with four input terminals and four output terminals, but multiple-writer broadcast-
reader (MWBR) buses are also possible. Here arbitration will be required at
both the transmitter side and the receiver side. MWBR/MWMR buses will require
O(Nbλ) transceivers, where N is the number of terminals and bλ is the number of
shared wavelengths used to implement the bus.
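The trade-offs among these four bus variants can be summarized by a first-order transceiver count. The per-variant formulas in the sketch below are our own first-order reading of the microarchitectures in Fig. 3.5 (the text only states the O(Nbλ) bound explicitly for MWBR/MWMR buses), and they ignore any extra devices needed for arbitration or control:

```python
# First-order transceiver counts for the nanophotonic bus variants of
# Fig. 3.5, for n terminals on the writing/reading side and b_lam shared
# wavelengths. These are rough estimates, not exact device counts.

def bus_transceivers(kind, n, b_lam):
    counts = {
        "SWBR": (b_lam, n * b_lam),      # one writer; every output terminal reads
        "SWMR": (b_lam, n * b_lam),      # one writer; one actively tuned reader drops
        "MWSR": (n * b_lam, b_lam),      # every input can write; one reader
        "MWMR": (n * b_lam, n * b_lam),  # every input writes; every output reads
    }
    tx, rx = counts[kind]
    return tx, rx

for kind in ("SWBR", "SWMR", "MWSR", "MWMR"):
    tx, rx = bus_transceivers(kind, n=4, b_lam=8)
    print(f"{kind}: {tx} transmitters, {rx} receivers")
```

Note that SWBR and SWMR buses have the same device counts in this model; they differ in how much optical power the receivers consume, not in how many rings they require.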
There are several examples of nanophotonic buses in the literature. Several
researchers have described similar techniques for using a combination of nanopho-
tonic SWBR and MWSR buses to implement the command, write-data, and read-
data buses in a DRAM memory channel [29, 53, 74, 76]. In this context the
arbitration for the MWSR read-data bus is greatly simplified since the memory
controller acts as a master and the DRAM banks act as slaves. We investigate various
ways of implementing such nanophotonic DRAM memory channels as part of the
Fig. 3.6 Microarchitectural schematics for nanophotonic 4×4 crossbars. The crossbars connect all
inputs (I1–I4) to all outputs (O1–O4) and are implemented with either: (a) four single-writer multi-
ple-reader (SWMR) buses; (b) four SWMR buses with additional output buffering; or (c) four
multiple-writer single-reader (MWSR) buses (adapted from [8], courtesy of IEEE)
case study in section Case Study #3: DRAM Memory Channel. Binkert et al. dis-
cuss both single-wavelength SWBR and SWMR bus designs for use in implement-
ing efficient on-chip barrier networks, and the results suggest that a SWMR bus
can significantly reduce the required optical laser power as compared to a SWBR
bus [14]. Vantrease et al. also describe a nanophotonic MWBR bus used to broad-
cast invalidate messages as part of the cache-coherence protocol [76]. Arbitration
for this bus is performed optically with tokens that are transferred between input
terminals using a specialized arbitration network with a simple ring topology. Pan
et al. proposed several techniques to help address scaling nanophotonic MWMR
buses to larger numbers of terminals: multiple independent MWMR buses improve
the total network bisection bandwidth while still enabling high utilization of all
buses, a more optimized optical token scheme improves arbitration throughput,
and concentrated bus ports shared by multiple terminals reduce the total number of
transceivers [62].
Global crossbars have several attractive properties including high throughput and
a short fixed latency. Nanophotonic crossbars use a dedicated nanophotonic bus per
input or output terminal to enable every input terminal to send a packet to a different
output terminal at the same time. Implementing such crossbars with nanophotonics
has many of the same advantages and challenges as nanophotonic buses, except at
a larger scale. Figure 3.6 illustrates three types of nanophotonic crossbars. In the
SWMR crossbar shown in Fig. 3.6a, there is one bus per input and every output can
3 Designing Chip-Level Nanophotonic Interconnection Networks 97
read from any of these buses. As an example, if I2 wants to send a packet to O3 it first
arbitrates for access to the output terminal, then (assuming it wins arbitration) the
receiver for wavelength λ2 at O3 is actively tuned, and finally the transmitter at I2
modulates wavelength λ2 to send the packet. SWBR crossbars are also possible
where the packet is broadcast to all output terminals, and each output terminal is
responsible for converting the packet into the electrical domain and determining if
the packet is actually destined for that terminal. Although SWBR crossbars enable
broadcast communication, they use significantly more optical power than a SWMR
crossbar for unicast communication. Note that even SWMR crossbars usually
include a low-bandwidth SWBR crossbar to implement distributed redundant arbi-
tration at the output terminals and/or to determine which receivers at the destination
should be actively tuned. A SWMR crossbar needs one transmitter per input, but
requires O(N²bλ) receivers. Figure 3.6b illustrates an alternative called a buffered
SWMR crossbar that avoids the need for any global or distributed arbitration. Every
input terminal can send a packet to any output terminal at any time assuming it has
space in the corresponding queue at the output. Each output locally arbitrates among
these queues to determine which packet can access the output terminal. Buffered
SWBR/SWMR crossbars simplify global arbitration at the expense of an additional
O(N²) buffering. Buffered SWMR crossbars can still include a low-bandwidth
SWBR crossbar to determine which receivers at the destination should be actively
tuned. The MWSR crossbar shown in Fig. 3.6c is an alternative microarchitecture
that uses one bus per output and allows every input to write any of these buses. As
an example, if I2 wants to send a packet to O3, it first arbitrates, and then (assuming
it wins arbitration) it modulates wavelength λ3. A MWSR crossbar needs one
receiver per output, but requires O(N²bλ) transmitters. For larger networks with
wider channel bitwidths, the quadratic number of transmitters or receivers required
to implement nanophotonic crossbars can significantly impact optical power, ther-
mal tuning power, and area.
There have been several diverse proposals for implementing global crossbars
with nanophotonics. Many of these proposals use global on-chip crossbars to imple-
ment L2-to-L2 cache-coherence protocols for single-socket manycore processors.
Almost all of these proposals include some amount of concentration, so that a small
number of terminals locally arbitrate for access to a shared crossbar port. This con-
centration helps leverage electrical interconnect to reduce the radix of the global
crossbar, and can also enable purely electrical communication when sending a
packet to a physically close output terminal. Kırman et al. describe three on-chip
SWBR nanophotonic crossbars for addresses, snoop responses, and data for imple-
menting a snoopy-based cache-coherence protocol [39]. The proposed design uses
distributed redundant arbitration to determine which input port can write to which
output port. A similar design was proposed by Pasricha et al. within the context of a
multiprocessor system-on-chip [64]. Kırman et al. have recently described a more
sophisticated SWMR microarchitecture with connection-based arbitration that is
tightly coupled to the underlying physical layout [40]. Miller et al. describe a buff-
ered SWBR nanophotonic crossbar for implementing a directory-based cache-
coherence protocol, and the broadcast capabilities of the SWBR crossbar are used
for invalidation messages [44]. The proposed design requires several hundred thou-
sand receivers for a 64×64 crossbar with each shared bus using 64 wavelengths
modulated at 10 Gb/s. Vantrease et al. describe a MWSR nanophotonic crossbar for
implementing a directory-based cache-coherence protocol, and a separate MWBR
nanophotonic bus for invalidation messages [76]. The proposed design requires
about a million transmitters for a 64×64 crossbar with each shared bus using 256
wavelengths modulated at 10 Gb/s. Arbitration in the MWSR nanophotonic cross-
bar is done with a specialized optical token scheme, where tokens circle around a
ring topology. Although this scheme does enable round-robin fairness, later work
by Vantrease et al. investigated techniques to improve the arbitration throughput for
these token-based schemes under low utilization [75]. Petracca et al. proposed a
completely different microarchitecture for a nanophotonic crossbar that uses optical
switching inside the network and only O(Nbλ) transmitters and completely passive
receivers [65]. The proposed design requires a thousand optical switches for a 64×64
crossbar with each shared bus using 96 wavelengths modulated at 10 Gb/s. Each
switch requires around O(8bλ) actively tuned filters. The precise number of active
filters depends on the exact switch microarchitecture and whether single-wavelength
or multiple-wavelength active filters are used. Although such a microarchitecture
has many fewer transmitters and receivers than the designs shown in Fig. 3.6, a
separate multi-stage electrical network is required for arbitration and to setup the
optical switches.
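The device counts quoted for these proposals follow directly from the quadratic scaling of nanophotonic crossbars, and a quick first-order calculation reproduces them:

```python
# Sanity-check the first-order device counts quoted for the 64x64 crossbar
# proposals above. A SWMR/SWBR crossbar needs O(N^2 * b_lam) receivers,
# and a MWSR crossbar needs O(N^2 * b_lam) transmitters.

def crossbar_devices(n, wavelengths_per_bus):
    """Devices on the 'many' side of an N x N nanophotonic crossbar."""
    return n * n * wavelengths_per_bus

# Miller et al. [44]: buffered SWBR crossbar, 64 wavelengths per bus.
miller_receivers = crossbar_devices(64, 64)          # "several hundred thousand receivers"
# Vantrease et al. [76]: MWSR crossbar, 256 wavelengths per bus.
vantrease_transmitters = crossbar_devices(64, 256)   # "about a million transmitters"

print(miller_receivers, vantrease_transmitters)
```

The calculation confirms the rough figures in the text: 262,144 receivers for the first design and 1,048,576 transmitters for the second.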
There are additional design decisions when implementing a multi-stage topol-
ogy, since each network component can use either electrical or nanophotonic
devices. Figure 3.7 illustrates various microarchitectural designs for a 2-ary
2-stage butterfly topology. In Fig. 3.7a, the routers are all implemented electrically
and the channels connecting the first and second stage of routers are implemented
with point-to-point nanophotonic channels. This is a natural approach, since we
can potentially leverage the advantages of nanophotonics for implementing long
global channels and use electrical technology for buffering, arbitration, and
switching. Note that even though these are point-to-point channels, we can still
draw the corresponding nanophotonic implementations of these channels as being
wavelength-division multiplexed in a microarchitectural schematic. Since a schematic
is simply meant to capture the high-level interaction between electrical and
nanophotonic devices, designers should use the simplest representation at
this stage of the design. Similarly, the input and output terminals may be co-
located in the physical design, but again the schematic is free to use a more abstract
representation. In Fig. 3.7b, just the second stage of routers are implemented with
nanophotonic devices and the channels are still implemented electrically. Since
nanophotonic buffers are currently not feasible in intra-chip and inter-chip networks,
the buffering is done electrically and the router's 2×2 crossbar is implemented
with a nanophotonic SWMR microarchitecture. As with any nanophotonic
crossbar, additional logic is required to manage arbitration for output ports. Such
a microarchitecture seems less practical since the router crossbars are localized,
and it will be difficult to outweigh the opto-electrical conversion overhead when
working with short buses. In Fig. 3.7c, both the channels and the second stage of
Fig. 3.7 Microarchitectural schematics for nanophotonic 2-ary 2-stage butterflies. Networks con-
nect all inputs (I1–I4) to all outputs (O1–O4) with each network component implemented with either
electrical or nanophotonic technology: (a) electrical routers and nanophotonic channels; (b) elec-
trical first-stage routers, electrical channels, and nanophotonic second-stage routers; (c) electrical
first-stage routers, nanophotonic channels, and nanophotonic second-stage routers; (d) similar to
previous subfigure except that the channels and intra-router crossbars are unified into a single stage
of nanophotonic interconnect (adapted from [8], courtesy of IEEE)
Fig. 3.8 Microarchitectural schematics for a nanophotonic 4-ary 1-dim torus. Networks connect all
inputs (I1–I4) to all outputs (O1–O4) with each network component implemented with either electri-
cal or nanophotonic technology: (a) electrical routers and nanophotonic channels or (b) nanopho-
tonic routers and channels. Note that this topology uses a single unidirectional channel to connect
each of the routers (from [8], courtesy of IEEE)
uses packet-based flow control while the nanophotonic data network uses circuit-
switched flow control. The radix-4 blocking routers require special consideration by
the routing algorithm, but later work by Sherwood-Droz et al. fabricated alternative
non-blocking optical router microarchitectures that can be used in this nanophoto-
nic torus network [71]. Poon et al. survey a variety of designs for optical routers that
can be used in on-chip multi-stage nanophotonic networks [66]. Li et al. propose a
two-dimensional circuit-switched mesh topology with a second broadcast nanopho-
tonic network based on planar waveguides for the control network [48]. Cianchetti
et al. proposed a fully optical two-dimensional mesh topology with packet-based
flow control [18]. This proposal sends control bits on dedicated wavelengths ahead
of the packet payload. These control bits undergo an opto-electrical conversion at
each router hop in order to quickly conduct electrical arbitration and flow control. If
the packet wins arbitration, then the router control logic sets the active ring filters
such that the packet payload proceeds through the router optically. If the packet
loses arbitration, then the router control logic sets the active ring filters to direct the
packet to local receivers so that it can be converted into the electrical domain and
buffered. If the packet loses arbitration and no local buffering is available then the
packet is dropped, and a nack is sent back to the source using dedicated optical
channels. Later work by the same authors explored optimizing the optical router
microarchitecture, arbitration, and flow control [17]. To realize significant advan-
tages over electrical networks, fully optical low-dimensional torus networks need to
carefully consider waveguide crossings, drop losses at each optical router, the total
tuning cost for active ring filters in all routers, and the control network overhead.
Physical-Level Design
The final phase of design is at the physical level and involves mapping wavelengths
to waveguides, waveguide layout, and placing nanophotonic devices along each
waveguide. We often use abstract layout diagrams that are similar to microarchitec-
tural schematics but include additional details to illustrate the physical design.
Ultimately, we must develop a detailed layout diagram that specifies the exact place-
ment of each device, and this layout is then used to calculate the area consumed by
nanophotonic devices and the total optical power required for all wavelengths. This
subsection discusses a range of physical design issues that arise when implementing
the nanophotonic microarchitectures described in the previous section.
Figure 3.9 illustrates general approaches for the physical design of nanophoto-
nic buses. These examples implement a four-wavelength SWMR bus, and they
differ in how the wavelengths are mapped to each waveguide. Figure 3.9a illus-
trates the most basic approach where all four wavelengths are multiplexed onto the
same waveguide. Although this produces the most compact layout, it also requires
all nanophotonic devices to operate on the same waveguide which can increase the
total optical loss per wavelength. In this example, each wavelength would experience
one modulator insertion loss, O(Nbλ) through losses in the worst case, and a
drop loss at the desired output terminal. As the number of wavelengths for this bus
increases, we will need to consider techniques for distributing those wavelengths
across multiple waveguides, both to stay within the waveguide's total bandwidth
capacity and within the waveguide's total optical power limit. Figure 3.9b illus-
trates wavelength slicing, where a subset of the bus wavelengths are mapped to
distinct waveguides. In addition to reducing the number of wavelengths per wave-
guide, wavelength slicing can potentially reduce the number of through losses and
thus the total optical power. Figure 3.9c–e illustrate reader slicing, where a subset
of the bus readers are mapped to distinct waveguides. The example shown in
Fig. 3.9c doubles the number of transmitters, but the input terminal only needs to
drive transmitters on the waveguide associated with the desired output terminal.
Reader slicing does not reduce the number of wavelengths per waveguide, but it
does reduce the number of through losses. Figure 3.9d illustrates a variation of
reader slicing that uses optical power splitting. This split nanophotonic bus requires
a single set of transmitters but needs more optical power, since this power must
be split among the multiple bus branches. Figure 3.9e illustrates another variation
of reader slicing that uses optical power guiding. This guided nanophotonic bus
also only requires a single set of transmitters, but it uses active ring filters to guide
the optical power down the desired bus branch. Guided buses require more control
overhead but can significantly reduce the total optical power when the optical loss
per branch is large. Reader slicing can be particularly effective in SWBR buses,
since it can reduce the number of drop losses per wavelength. It is possible to
implement MWSR buses using a similar technique called writer slicing, which can
help reduce the number of modulator insertion losses per wavelength. More com-
plicated physical design (e.g., redundant transmitters and optical power guiding)
Fig. 3.9 Physical design of nanophotonic buses. The four wavelengths for an example four-output
SWMR bus are mapped to waveguides in various ways: (a) all wavelengths mapped to one wave-
guide; (b) wavelength slicing with two wavelengths mapped to one waveguide; (c) reader slicing
with two readers mapped to one waveguide and two redundant sets of transmitters; (d) reader slic-
ing with a single transmitter and optical power passively split between two branches; (e) reader
slicing with a single transmitter and optical power actively guided down one branch (adapted
from [8], courtesy of IEEE)
may have some implications for the electrical control logic and thus the network's
microarchitecture, but it is important to note that these techniques are solely
focused on mitigating physical design issues and do not fundamentally change the
logical network topology. Most nanophotonic buses in the literature use wave-
length slicing [29, 74, 76] and there has been some exploration of the impact of
using a split nanophotonic bus [14, 74]. We investigate the impact of using a guided
nanophotonic bus in the context of a DRAM memory channel as part of the case
study in section Case Study #3: DRAM Memory Channel.
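A back-of-the-envelope loss model makes the benefit of wavelength slicing concrete. The sketch below charges each wavelength one modulator insertion loss, a through loss for every other ring on its waveguide (a rough worst-case bound that counts all transmitter and receiver rings), and one drop loss; all loss values are illustrative assumptions rather than measured device numbers:

```python
# Worst-case per-wavelength optical loss (in dB) for a SWMR bus like the one
# in Fig. 3.9, and the effect of wavelength slicing. All loss values below
# are illustrative assumptions, not measured device numbers.

INSERTION_DB = 1.0   # assumed modulator insertion loss
THROUGH_DB = 0.01    # assumed loss per ring passed "through"
DROP_DB = 1.5        # assumed drop loss at the destination filter

def worst_case_loss_db(n_outputs, b_lam, num_waveguides):
    """One insertion loss, a through loss for every other ring sharing the
    waveguide (rough worst case), and one drop loss."""
    rings_per_waveguide = (n_outputs * b_lam) // num_waveguides
    return INSERTION_DB + (rings_per_waveguide - 1) * THROUGH_DB + DROP_DB

full = worst_case_loss_db(n_outputs=64, b_lam=64, num_waveguides=1)
sliced = worst_case_loss_db(n_outputs=64, b_lam=64, num_waveguides=8)
print(f"single waveguide: {full:.2f} dB, 8-way sliced: {sliced:.2f} dB")
```

Even with these hypothetical numbers, slicing the wavelengths across eight waveguides cuts the worst-case through-loss term by roughly 8×, which is exactly why higher losses push designs toward fewer wavelengths per waveguide.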
Most nanophotonic crossbars use a set of shared buses, and thus wavelength slic-
ing, reader slicing, and writer slicing are all applicable to the physical design of
these crossbars. Figure 3.10a illustrates another technique called bus slicing, where
a subset of the crossbar buses are mapped to each waveguide. In this example, a 4×4
SWMR crossbar with two wavelengths per bus is sliced such that two buses are
mapped to each of the two waveguides. Bus-sliced MWSR crossbars are also pos-
sible. Bus slicing reduces the number of wavelengths per waveguide and the number
of through losses in both SWMR and MWSR crossbars. In addition to illustrating
how wavelengths are mapped to waveguides, Fig. 3.10a also illustrates a serpentine
layout. Such layouts minimize waveguide crossings by snaking all waveguides
Fig. 3.10 Physical design of nanophotonic crossbars. In addition to the same techniques used with
nanophotonic buses, crossbar designs can also use bus slicing: (a) illustrates a 4×4 SWMR cross-
bar with two wavelengths per bus and two buses per waveguide. Co-locating input and output ter-
minals can impact the physical layout. For example, a 4×4 SWMR crossbar with one wavelength
per bus and a single waveguide can be implemented with either: (b) a double-serpentine layout
where the light travels in one direction or (c) a single-serpentine layout where the light travels in
two directions (from [8], courtesy of IEEE)
around the chip, and they result in looped, U-shaped, and S-shaped waveguides. The
example in Fig. 3.10a assumes that the input and output terminals are located on
opposite sides of the crossbar, but it is also common to have pairs of input and out-
put terminals co-located. Figure 3.10b illustrates a double-serpentine layout for a
4×4 SWMR crossbar with one wavelength per bus and a single waveguide. In this
layout, waveguides are snaked by each terminal twice with light traveling in one
direction. Transmitters are on the first loop, and receivers are on the second loop.
Figure 3.10c illustrates an alternative single-serpentine layout where waveguides
are snaked by each terminal once, and light travels in both directions. A single-
serpentine layout can reduce waveguide length but requires additional transmitters
to send the light for a single bus in both directions. For example, input I2 uses λ2 to
Fig. 3.11 Physical design of nanophotonic point-to-point channels. An example with four point-
to-point channels, each with four wavelengths, can be implemented with either: (a) all wavelengths
mapped to one waveguide; (b) wavelength slicing with two wavelengths from each channel mapped
to one waveguide; (c) partial channel slicing with all wavelengths from two channels mapped to
one waveguide and a serpentine layout; (d) partial channel slicing with a ring-filter matrix layout
to passively shuffle wavelengths between waveguides; (e) full channel slicing with each channel
mapped to its own waveguide and a point-to-point layout (adapted from [8], courtesy of IEEE)
Fig. 3.7a. Figure 3.11a illustrates the most basic design where all sixteen wave-
lengths are mapped to a single waveguide with a serpentine layout. As with nano-
photonic buses, wavelength slicing reduces the number of wavelengths per
waveguide and total through losses by mapping a subset of each channel's wave-
lengths to different waveguides. In the example shown in Fig. 3.11b, two wave-
lengths from each channel are mapped to a single waveguide resulting in eight total
wavelengths per waveguide. Figure 3.11c–e illustrate channel slicing, where all
wavelengths from a subset of the channels are mapped to a single waveguide.
Channel slicing reduces the number of wavelengths per waveguide, the through
losses, and can potentially enable shorter waveguides. The example shown in
Fig. 3.11c maps two channels to each waveguide but still uses a serpentine layout.
The example in Fig. 3.11d has the same organization on the transmitter side, but
uses a passive ring filter matrix layout to shuffle wavelengths between waveguides.
These passive ring filter matrices can be useful when a set of channels is mapped to
one waveguide, but the physical layout requires a subset of those channels to also be
passively mapped to a second waveguide elsewhere in the system. Ring filter matri-
ces can shorten waveguides at the cost of increased waveguide crossings and one or
more additional drop losses. Figure 3.11e illustrates a fully channel-sliced design
with one channel per waveguide. This enables a point-to-point layout with wave-
guides directly connecting input and output terminals. Although point-to-point lay-
outs enable the shortest waveguide lengths they usually also lead to the greatest
number of waveguide crossings and layout complexity. One of the challenges with
ring-filter matrix and point-to-point layouts is efficiently distributing the unmodu-
lated laser light to all of the transmitters while minimizing the number of laser
couplers and optical power waveguide complexity. Optimally allocating channels to
waveguides can be difficult, so researchers have investigated using machine learn-
ing [39] or an iterative algorithm [11] for specific topologies. There has been some
exploratory work on a fully channel-sliced physical design with a point-to-point
layout for implementing a quasi-butterfly topology [41], and some experimental
work on passive ring filter network components similar in spirit to the ring-filter
matrix [82]. Point-to-point channels are an integral part of the case studies in sec-
tions Case Study #1: On-Chip Tile-to-Tile Network and Case Study #2: Manycore
Processor-to-DRAM Network.
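As a minimal illustration of the allocation problem, the following greedy first-fit sketch assigns channels to waveguides under a per-waveguide wavelength capacity. It is a deliberately simple stand-in for the machine-learning and iterative approaches cited above, not a reimplementation of either:

```python
# Greedy first-fit sketch for allocating channels to waveguides under a
# per-waveguide wavelength capacity. Purely illustrative; real allocators
# must also account for layout, crossings, and optical power limits.

def allocate_channels(channel_widths, capacity):
    """Assign each channel (given as its wavelength count) to the first
    waveguide with enough spare wavelength capacity; open a new waveguide
    when none fits. Returns a list of channel-index lists, one per waveguide."""
    waveguides, free = [], []
    for idx, width in enumerate(channel_widths):
        for w, spare in enumerate(free):
            if width <= spare:
                waveguides[w].append(idx)
                free[w] -= width
                break
        else:
            waveguides.append([idx])
            free.append(capacity - width)
    return waveguides

# Four 16-wavelength channels with 32 wavelengths per waveguide: full channel
# slicing would use four waveguides, while first fit packs them into two.
print(allocate_channels([16, 16, 16, 16], capacity=32))
```

First fit is clearly suboptimal in general (it ignores which terminals each channel connects), which is one reason the cited work resorts to more sophisticated allocation techniques for specific topologies.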
Much of the above discussion about physical-level design is applicable to micro-
architectures that implement multiple-stages of nanophotonic buses, channels, and
routers. However, the physical layout in these designs is often driven more by the
logical topology, leading to inherently channel-sliced designs with point-to-point
layouts. For example, nanophotonic torus and mesh topologies are often imple-
mented with regular grid-like layouts. It is certainly possible to map such topologies
onto serpentine layouts or to use a ring filter matrix to pack multiple logical chan-
nels onto the same waveguide, but such designs would probably be expensive in
terms of area and optical power. Wavelength slicing is often used to increase the
bandwidth per channel. The examples in the literature for fully optical fat-tree net-
works [26], torus networks [69], and mesh networks [18, 48] all use channel slicing
and regular layouts that match the logical topology. Since unmodulated light will
need to be distributed across the chip to each injection port, these examples will
most likely require more complicated optical power distribution, laser couplers
located across the chip, or some form of hybrid laser integration.
Figures 3.12 and 3.13 illustrate several abstract layout diagrams for an on-chip
nanophotonic 64×64 global crossbar network and an 8-ary 2-stage butterfly net-
work. These layouts assume a 22-nm technology, 5-GHz clock frequency, and 400-
mm² chip with 64 tiles. Each tile is approximately 2.5×2.5 mm and includes a
co-located network input and output terminal. The network bus and channel band-
widths are sized according to Table 3.1. The 64×64 crossbar topology in Fig. 3.12
uses a SWMR microarchitecture with bus slicing and a single-serpentine layout.
Both layouts map a single bus to each waveguide with half the wavelengths directed
from left to right and the other half directed from right to left. Both layouts are able
to co-locate the laser couplers in two locations along one edge of the chip to sim-
plify packaging. Figure 3.12a uses a longer serpentine layout, while Fig. 3.12b uses
a shorter serpentine layout which reduces waveguide lengths at the cost of increased
electrical energy to communicate between the more distant tiles and the nanophoto-
nic devices. The 8-ary 2-stage butterfly topology in Fig. 3.13 is implemented with
16 electrical routers (eight per stage) and 64 point-to-point nanophotonic channels
connecting every router in the first stage to every router in the second stage.
Figure 3.13a uses channel slicing with no wavelength slicing and a point-to-point
layout to minimize waveguide length. Note that although two channels are mapped
to the same waveguide, those two channels connect routers in the same physical
locations meaning that there is no need for any form of ring-filter matrix. Clever
waveguide layout results in 16 waveguide crossings located in the middle of the
chip. If we were to reduce the wavelengths per channel but maintain the total wave-
lengths per waveguide, then a ring-filter matrix might be necessary to shuffle chan-
nels between waveguides. Figure 3.13b uses a single-serpentine layout. The
serpentine layout increases waveguide lengths but eliminates waveguide crossings
in the middle of the chip. Notice that the serpentine layout requires co-located laser
couplers in two locations along one edge of the chip, while the point-to-point layout
requires laser couplers on both sides of the chip. The point-to-point layout could
position all laser couplers together, but this would increase the length of the optical
power distribution waveguides. Note that in all four layouts eight waveguides share
the same post-processing air gap, and that some waveguide crossings may be neces-
sary at the receivers to avoid positioning electrical circuitry over the air gap.
Figure 3.14 illustrates the kind of quantitative analysis that can be performed at the
physical level of design. Detailed layouts corresponding to the abstract layouts in
Figs. 3.12b and 3.13b are used to calculate the total optical power and area overhead
as a function of optical device quality and the technology assumptions in the earlier
section on nanophotonic technology. Higher optical losses increase the power per
waveguide which eventually necessitates distributing fewer wavelengths over more
waveguides to stay within the waveguide's total optical power limit. Thus higher
optical losses can increase both the optical power and the area overhead. It is clear that for
these layouts, the crossbar network requires more optical power and area for the same
quality of devices compared to the butterfly network. This is simply a result of the
108 C. Batten et al.
Fig. 3.12 Abstract physical layouts for 64×64 SWMR crossbar. In a SWMR crossbar each tile
modulates a set of wavelengths which then must reach every other tile. Two waveguide layouts are
shown: (a) uses a long single-serpentine layout where all waveguides pass directly next to each
tile; (b) uses a shorter single-serpentine layout to reduce waveguide loss at the cost of greater
electrical energy for more distant tiles to reach their respective nanophotonic transmitter and
receiver block. The nanophotonic transmitter and receiver block shown in (c) illustrates how bus
slicing is used to map wavelengths to waveguides. One logical channel (128 b/cycle or 64 λ per
channel) is mapped to each waveguide, but as required by a single-serpentine layout, the channel
is split into 64 λ directed left to right and 64 λ directed right to left. Each ring actually represents
64 rings each tuned to a different wavelength; a = λ1–λ64; b = λ64–λ128; couplers indicate where laser
light enters the chip (from [8], courtesy of IEEE)
cost of providing O(N²bλ) receivers in the SWMR crossbar network versus the
simpler point-to-point nanophotonic channels used in the butterfly network. We can also
perform rough terminal tuning estimates based on the total number of rings in each
layout. Given the technology assumptions in the earlier section on nanophotonic
3 Designing Chip-Level Nanophotonic Interconnection Networks 109
Fig. 3.13 Abstract physical layouts for 8-ary 2-stage butterfly with nanophotonic channels. In a
butterfly with nanophotonic channels each logical channel is implemented with a set of wave-
lengths that interconnect two stages of electrical routers. Two waveguide layouts are shown:
(a) uses a point-to-point layout; (b) uses a serpentine layout that results in longer waveguides but
avoids waveguide crossings. The nanophotonic transmitter and receiver block shown in (c) illus-
trates how channel slicing is used to map wavelengths to waveguides. Two logical channels (128 b/
cycle or 64 λ per channel) are mapped to each waveguide, and by mapping channels connecting the
same routers but in opposite directions we avoid the need for a ring-filter matrix. Each ring actually
represents 64 rings each tuned to a different wavelength; a = λ1–λ64; b = λ64–λ128; k is seven for the
point-to-point layout and 21 for the serpentine layout; couplers indicate where laser light enters the chip
(from [8], courtesy of IEEE)
technology the crossbar network requires 500,000 rings and a fixed thermal tuning
power of over 10 W. The butterfly network requires only 14,000 rings and a fixed
thermal tuning power of 0.28 W. Although the crossbar is more expensive to implement,
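The ring counts above translate directly into fixed tuning power. A minimal sketch of this first-order estimate, assuming an illustrative fixed tuning cost of 20 µW per ring (a value chosen to be consistent with the figures quoted above, not a number taken from [8]):

```python
# First-order thermal tuning estimate. The 20 uW/ring constant is an
# assumed illustrative value consistent with the ring counts and tuning
# powers quoted in the text.
TUNING_W_PER_RING = 20e-6

def thermal_tuning_power_w(num_rings: int) -> float:
    """Fixed thermal tuning power (W) for a given total ring count."""
    return num_rings * TUNING_W_PER_RING

print(thermal_tuning_power_w(500_000))  # crossbar: ~10 W
print(thermal_tuning_power_w(14_000))   # butterfly: ~0.28 W
```

Because tuning power scales linearly with ring count, reducing the number of rings by an order of magnitude reduces this fixed overhead by the same factor.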
Network Design
Table 3.1 shows configurations for various topologies that meet the MTBw target.
Nanophotonic implementations of the 64×64 crossbar and 8-ary 2-stage butterfly
networks were discussed in section "Designing Nanophotonic Interconnection
Networks". Our preliminary analysis suggested that the crossbar network could
achieve good performance but with significant optical power and area overhead,
while the butterfly network could achieve lower optical power and area overhead
but might perform poorly on adversarial traffic patterns. This analysis motivated our
interest in high-radix, low-diameter Clos networks. A classic three-stage (m,n,r)
Clos topology is characterized by the number of routers in the middle stage (m), the
radix of the routers in the first and last stages (n), and the number of input and output
switches (r). For this case study we explore an (8,8,8) Clos topology which is similar
to the 8-ary 2-stage butterfly topology shown in Fig. 3.3c except with three stages of
routers. The associated configuration for the MTBw target is shown in Table 3.1.
This topology is non-blocking which can enable significantly higher performance
than a blocking butterfly, but the Clos topology also requires twice as many bisec-
tion channels which requires careful design at the microarchitectural and physical
level. We use an oblivious non-deterministic routing algorithm that efficiently bal-
ances load by always randomly picking a middle-stage router.
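The routing algorithm described above can be sketched in a few lines. This is a hedged illustration of oblivious middle-stage selection (the router indexing is hypothetical, not the authors' implementation):

```python
import random

def route_clos(src_router: int, dst_router: int, m: int, rng=random):
    """Oblivious non-deterministic Clos routing: pick one of the m
    middle-stage routers uniformly at random. The rest of the path
    (first-stage router -> middle -> last-stage router) is then fixed."""
    middle = rng.randrange(m)  # uniform choice balances load on bisection channels
    return (src_router, middle, dst_router)
```

Because every middle-stage router is chosen with equal probability, load is balanced across the bisection channels even on adversarial traffic patterns.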
[Fig. 3.14 contour plots, panels (a)–(d): (a) crossbar optical power (W); (b) crossbar area overhead (%); (c) butterfly optical power (W); (d) butterfly area overhead (%). Each panel plots ring through loss (dB/ring) against waveguide loss (dB/cm).]
Fig. 3.14 Comparison of 64×64 crossbar and 8-ary 2-stage butterfly networks. Contour plots
show optical laser power in Watts and area overhead as a percentage of the total chip area for the
layouts in Figs. 3.12b and 3.13b. These metrics are plotted as a function of optical device quality
(i.e., ring through loss and waveguide loss) (from [8], courtesy of IEEE)
The 8-ary 2-stage butterfly in Fig. 3.13b has low optical power and area overhead
due to its use of nanophotonics solely for point-to-point channels and not for optical
switching. For the Clos network we considered the two microarchitectures illus-
trated in Fig. 3.15. For simplicity, these microarchitectural schematics are for a
smaller (2,2,2) Clos topology. The microarchitecture in Fig. 3.15a uses two sets of
nanophotonic point-to-point channels to connect three stages of electrical routers.
All buffering, arbitration, and flow control is done electrically. As an example, if
input I2 wants to communicate with output O3 then it can use either middle router. If
the routing algorithm chooses R2,2, then the network will use wavelength λ2 on the
first waveguide to send the message to R2,2 and wavelength λ4 on the second wave-
guide to send the message to O3. The microarchitecture in Fig. 3.15b implements
both the point-to-point channels and the middle stage of routers with nanophoton-
ics. We chose to pursue the first microarchitecture, since preliminary analysis
suggested that the energy advantage of using nanophotonic middle-stage routers was
outweighed by the increased optical laser power. We will revisit this assumption
later in this case study. Note how the topology choice impacted our microarchitec-
tural-level design; if we had chosen to explore a low-radix, high-diameter Clos
Fig. 3.15 Microarchitectural schematic for nanophotonic (2,2,2) Clos. Both networks have four
inputs (I1–I4), four outputs (O1–O4), and six 2×2 routers (R1–3;1–2) with each network component
implemented with either electrical or nanophotonic technology: (a) electrical routers with four
nanophotonic point-to-point channels; (b) electrical first- and third-stage routers with a unified stage
of nanophotonic point-to-point channels and middle-stage routers (from [32], courtesy of IEEE)
topology then optical switching would probably be required to avoid many opto-
electrical conversions. Here we opt for a high-radix, low-diameter topology to mini-
mize the complexity of the nanophotonic network.
We use a physical layout similar to that shown for the 8-ary 2-stage butterfly in
Fig. 3.13b except that we require twice as many point-to-point channels and thus
twice as many waveguides. For the Clos network, each of the eight groups of routers
includes three instead of two radix-8 routers. The Clos network will have twice the
optical power and area overhead as shown for the butterfly in Fig. 3.14c and 3.14d.
Note that even with twice the number of bisection channels, the Clos network still
uses less than 10 % of the chip area for a wide range of optical device parameters.
This is due to the impressive bandwidth density provided by nanophotonic techno-
logy. The Clos network requires an order of magnitude fewer rings than the crossbar
network resulting in a significant reduction in optical power and area overhead.
Evaluation
Design Themes
This case study illustrates several important design themes. First, it can be challenging
to show a compelling advantage for purely on-chip nanophotonic interconnection
[Fig. 3.16 plots: average latency (cycles) versus offered bandwidth (kb/cycle) for (a) ecmeshx2 in the LTBw configuration; (b) ecmeshx2 in the HTBw configuration; (c) pclos in the LTBw configuration; (d) pclos in the HTBw configuration.]
Fig. 3.16 Latency versus offered bandwidth for on-chip tile-to-tile networks. LTBw systems have
a theoretical throughput of 64 b/cycle per tile, while HTBw systems have a theoretical throughput
of 256 b/cycle, both for the uniform random traffic pattern (adapted from [32], courtesy of IEEE)
networks if we include fixed power overheads, use a more aggressive electrical base-
line, and consider local as well as global traffic patterns. Second, point-to-point nano-
photonic channels (or at least a limited amount of optical switching) seems to be a
more practical approach compared to global nanophotonic crossbars. This is espe-
cially true when we are considering networks that might be feasible in the near future.
Third, it is important to use an iterative design process that considers all levels of the
design. For example, Fig. 3.17 shows that the router power begins to consume a
significant portion of the total power at higher bandwidths in the nanophotonic Clos
network, and in fact, follow-up work by Kao et al. began exploring the possibility of using
both nanophotonic channels and one stage of low-radix nanophotonic routers [34].
[Fig. 3.17 bar charts: electrical power (W) for the emesh, ecmeshx2, eclos, and pclos configurations under various traffic patterns; one configuration is annotated with Laser = 33 W. (a) Dynamic power at 2 kb/cycle; (b) dynamic power at 8 kb/cycle.]
Fig. 3.17 Dynamic power breakdown for on-chip tile-to-tile networks. Power of eclos and pclos
did not vary significantly across traffic patterns. (a) LTBw systems at 2 kb/cycle offered bandwidth
(except for emesh/p2d and ecmeshx2/p2d which saturated before 2 kb/cycle, HTBw system shown
instead); (b) HTBw systems at 8 kb/cycle offered bandwidth (except for emesh/p2d and ecmeshx2/
p2d which are not able to achieve 8 kb/cycle). pclos-c (pclos-a) corresponds to conservative
(aggressive) nanophotonic technology projections (from [32], courtesy of IEEE)
shared cache, and each DRAM module includes multiple memory controllers and
DRAM chips to provide large bandwidth with high capacity. We assume that the
address space is interleaved across DRAM modules at a fine granularity to maxi-
mize performance, and any structure in the address stream from a single core is
effectively lost when we consider hundreds of tiles arbitrating for tens of DRAM
modules. This case study assumes a 22-nm technology, 2.5-GHz clock frequency,
512-bit packets for transferring cache lines, and 400-mm² chip. We also assume that
the total power of the processor chip is one of the key design constraints limiting
achievable performance. More details on this case study can be found in [6, 7].
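The fine-grained interleaving mentioned above can be sketched as follows; the cache-line granularity is an assumption for illustration:

```python
CACHE_LINE_BYTES = 64  # assumed interleaving granularity

def dram_module_for_addr(addr: int, num_modules: int) -> int:
    """Map a physical address to a DRAM module by striping consecutive
    cache lines across modules."""
    return (addr // CACHE_LINE_BYTES) % num_modules

# Consecutive lines from one core scatter across all modules, which is
# why per-core structure in the address stream is effectively lost at
# the module level once hundreds of tiles interleave their requests.
```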
Network Design
Fig. 3.18 Logical topology for processor-to-DRAM network. Two (3,9,2,2) LMGS networks are
shown: one for the memory request network and one for the memory response network. Each LMGS
network includes three groups of nine tiles arranged in a small 3-ary 2-dimensional mesh cluster and
two global 3×2 routers that interconnect the clusters and DRAM memory controllers (MC). Lines in the
cluster mesh network represent two unidirectional channels in opposite directions; other lines repre-
sent one unidirectional channel heading from left to right (from [8], courtesy of IEEE)
The LMGS topology is well suited to this system since it decouples the number of tiles, clusters, and memory controllers. In this case
study, we explore LMGS topologies supporting 256 tiles and 16 DRAM modules
with one, four, and 16 clusters. Since the DRAM memory controller design is not
the focus of this case study, we ensure that the memory controller bandwidth is not
a bottleneck by providing four electrical DRAM memory controllers per DRAM
module. Note that the high-bandwidth nanophotonic DRAM described as part of the
case study in section "Case Study #3: DRAM Memory Channel" could potentially
provide an equivalent amount of memory bandwidth with fewer memory controllers
and lower power consumption.
As mentioned above, our design uses a hybrid opto-electrical microarchitecture
that targets the advantages of each medium: nanophotonic interconnect for energy-
efficient global communication, and electrical interconnect for fast switching,
efficient buffering, and local communication. We use first-order analysis to size the
nanophotonic point-to-point channels such that the memory system power con-
sumption on uniform random traffic is less than a 20 W power constraint. Initially,
we balance the bisection bandwidth of the cluster mesh networks and the global
channel bandwidth, but we also consider overprovisioning the channel bandwidths
in the cluster mesh networks to compensate for intra-mesh contention. Configurations
with more clusters will require more nanophotonic channels, and thus each channel
will have lower bandwidth to still remain within this power constraint.
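The sizing trade-off described above can be illustrated with a first-order calculation; the energy-per-bit figure in the example call is a hypothetical assumption, not a value from the study:

```python
def channel_bandwidth_gbps(power_budget_w: float, num_clusters: int,
                           num_modules: int, energy_pj_per_bit: float) -> float:
    """First-order channel sizing: the power budget is divided across
    clusters x modules dedicated point-to-point channels, so adding
    clusters lowers the bandwidth available to each channel."""
    channels = num_clusters * num_modules
    total_bits_per_second = power_budget_w / (energy_pj_per_bit * 1e-12)
    return total_bits_per_second / channels / 1e9

# With 16 clusters and 16 DRAM modules there are 256 channels; under a
# 20 W budget and an assumed ~0.5 pJ/bit, each channel gets ~160 Gb/s.
print(channel_bandwidth_gbps(20, 16, 16, 0.49))
```

Halving the number of clusters quarters the channel count here (16 modules per cluster), which is why fewer clusters permit proportionally higher per-channel bandwidth under the same budget.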
Figure 3.19 shows the abstract layout for our target system with 16 clusters. Since
each cluster requires one dedicated global channel to each DRAM module, there are a
total of 256 cluster-to-memory channels with one nanophotonic access point per chan-
nel. Our first-order analysis determined that 16 λ (160 Gb/s) per channel should enable
the configuration to still meet the 20 W power constraint. A ring-filter matrix layout is
used to passively shuffle the 16-λ channels on different horizontal waveguides destined
for the same DRAM module onto the same set of four vertical waveguides. We assume
that each DRAM module includes a custom switch chip containing the global router for
both the request and response networks. The switch chip on the memory side arbitrates
between the multiple requests coming in from the different clusters on the processor
chip. This reduces the power density of the processor chip and could enable multi-
socket configurations to easily share the same DRAM modules. A key feature of this
layout is that the nanophotonic devices are not only used for inter-chip communication,
but can also provide cross-chip transport to off-load intra-chip global electrical wiring.
Figure 3.20 shows the laser power as a function of optical device quality for two differ-
ent power constraints and thus two different channel bandwidths. Systems with greater
aggregate bandwidth have quadratically more waveguide crossings, making them more
sensitive to crossing losses. Additionally, certain combinations of waveguide and cross-
ing losses result in large cumulative losses and require multiple waveguides to stay
within the waveguide power limit. These additional waveguides further increase the total
number of crossings, which in turn continues to increase the power per wavelength,
meaning that for some device parameters it is infeasible to achieve a desired aggregate
bandwidth with a ring-filter matrix layout.
Fig. 3.19 Abstract physical layout for nanophotonic processor-to-DRAM network. Target
(16,16,16,4) LMGS network with 256 tiles, 16 DRAM modules, and 16 clusters each with a 4-ary
2-dimensional electrical mesh. Each tile is labeled with a hexadecimal number indicating its clus-
ter. For simplicity the electrical mesh channels are only shown in the inset, the switch chip includes
a single memory controller, each ring in the main figure actually represents 16 rings modulating or
filtering 16 different wavelengths, and each optical power waveguide actually represents 16 wave-
guides (one per horizontal waveguide). NAP = nanophotonic access point; nanophotonic request
channel from group 3 to DRAM module 0 is highlighted (adapted from [7], courtesy of IEEE)
[Fig. 3.20 contour plots, panels (a) and (b): crossing loss (dB/crossing) on the vertical axis, with infeasible regions marked in both panels.]
Fig. 3.20 Optical power for nanophotonic processor-to-DRAM networks. Results are for a
(16,16,16,4) LMGS topology with a ring-filter matrix layout and two different power constraints:
(a) low power constraint and thus low aggregate bandwidth and (b) high power constraint and thus
high aggregate bandwidth (from [7], courtesy of IEEE)
Evaluation
[Fig. 3.21 plots, panels (a)–(c): average latency (cycles) versus offered bandwidth (kb/cycle).]
Fig. 3.21 Latency versus offered bandwidth for processor-to-DRAM networks. E electrical,
P nanophotonics, 1/4/16 number of clusters, x1/x2/x4 over-provisioning factor (adapted from [7],
courtesy of IEEE)
Table 3.2 shows the power breakdown for the E4x2 and P16x1 configurations
near saturation. As expected, the majority of the power in the electrical configuration
is spent on the global channels that connect the access points to the DRAM mod-
ules. By implementing these channels with energy-efficient photonic links we
have a larger portion of our energy budget for higher-bandwidth on-chip mesh
networks even after including the overhead for thermal tuning. Note that the laser
power is not included here as it is highly dependent on the physical layout and
photonic device design as shown in Fig. 3.20. The photonic configurations con-
sume close to 15 W leaving 5 W for on-chip optical power dissipation as heat.
Ultimately, photonics enables an 8–10× improvement in throughput at similar
power consumption.
Design Themes
This case study suggests it is much easier to show a compelling advantage for
implementing an inter-chip network with nanophotonic devices, as compared to a
purely intra-chip nanophotonic network. Additionally, our results show that once
we have made the decision to use nanophotonics for chip-to-chip communication, it
makes sense to push nanophotonics as deep into each chip as possible (e.g., by using
more clusters). This approach for using seamless intra-chip/inter-chip nanophotonic
links is a general design theme that can help direct future directions for nanophoto-
nic network research. Also notice that our nanophotonic LMGS network was able
to achieve an order-of-magnitude improvement in throughput at a similar power
constraint without resorting to more sophisticated nanophotonic devices, such as
active optical switching. Again, we believe that using point-to-point nanophotonic
channels offers the most promising approach for short-term adoption of this
technology. The choice of the ring-filter matrix layout was motivated by its regularity, short
waveguides, and the need to aggregate all of the nanophotonic couplers in one place
for simplified packaging. However, as shown in Fig. 3.20, this layout puts significant
constraints on the maximum tolerable losses in waveguides and crossings. We are
currently considering alternate serpentine layouts that can reduce the losses in cross-
ings and waveguides. However, the serpentine layout needs couplers at multiple
locations on the chip, which could increase packaging costs. An alternative would
be to leverage the multiple nanophotonic device layers available in a monolithic
BEOL integration approach. Work by Biberman et al. has shown how multilayer
deposited devices can significantly impact the feasibility of various network archi-
tectures [13], and this illustrates the need for a design process that iterates across the
architecture, microarchitecture, and physical design levels.
Fig. 3.22 PIDRAM designs. Subfigures illustrate a single DRAM memory channel (MC) with
four DRAM banks (B) at various levels of design: (a) logical topology for DRAM memory chan-
nel; (b) shared nanophotonic buses where optical power is broadcast to all banks along a shared
physical medium; (c) split nanophotonic buses where optical power is split between multiple direct
connections to each bank; (d) guided nanophotonic buses where optical power is actively guided
to a single bank. For clarity, command bus is not shown in (c) and (d), but it can be implemented
in a similar fashion as the corresponding write-data bus or as a SWBR bus (adapted from [9],
courtesy of IEEE)
Network Design
Figure 3.22a illustrates the logical topology for a DRAM memory channel. A mem-
ory controller is used to manage a set of DRAM banks that are distributed across
one or more DRAM chips. The memory system includes three logical buses: a com-
mand bus, a write-data bus, and a read-data bus. Figure 3.22b illustrates a straight-
forward nanophotonic microarchitecture for a DRAM memory channel with a
combination of SWBR, SWMR, and MWSR buses.
The microarchitecture in Fig. 3.22b can also map to a similar layout that we call
a shared nanophotonic bus. In this layout, the memory controller first broadcasts a
command to all of the banks and each bank determines if it is the target bank for the
command. For a PIDRAM write command, just the target bank will then tune-in its
nanophotonic receiver on the write-data bus. The memory controller places the write
data on this bus; the target bank will receive the data and then perform the corre-
sponding write operation. For a PIDRAM read command, just the target bank will
perform the read operation and then use its modulator on the read-data bus to send
the data back to the memory controller. Unfortunately, the losses multiply together
in this layout making the optical laser power an exponential function of the number
of banks. If all of the banks are on the same PIDRAM chip, then the losses can be
manageable. However, to scale to larger capacities, we will need to daisy-chain
the shared nanophotonic bus through multiple PIDRAM chips. Large coupler losses
and the exponential scaling of laser power combine to make the shared nanophoto-
nic bus feasible only for connecting banks within a PIDRAM chip as opposed to
connecting banks across PIDRAM chips.
Figure 3.22c shows the alternative reader-/writer-sliced split nanophotonic bus
layout, which divides the long shared bus into multiple branches. In the command
and write-data bus, modulated laser power is still sent to all receivers, and in the
read-data bus, laser power is still sent to all modulators. The split nature of the bus,
however, means that the total laser power is roughly a linear function of the number
of banks. If each bank was on its own PIDRAM chip, then we would use a couple
of fibers per chip (one for modulated data and one for laser power) to connect the
memory controller to each of the PIDRAM chips. Each optical path in the write-
data bus would only traverse one optical coupler to leave the processor chip and one
optical coupler to enter the PIDRAM chip regardless of the total number of banks.
This implementation reduces the extra optical laser power as compared to a shared
nanophotonic bus at the cost of additional splitter and combiner losses in the mem-
ory controller. It also reduces the effective bandwidth density of the nanophotonic
bus, by increasing the number of fibers for the same effective bandwidth.
To further reduce the required optical power, we can use a reader-/writer-sliced
guided nanophotonic bus layout, shown in Fig. 3.22d. Each nanophotonic demulti-
plexer uses an array of either active ring or comb filters. For the command and
write-data bus, the nanophotonic demultiplexer is placed after the modulator to
direct the modulated light to the target bank. For the read-data bus, the nanophoto-
nic demultiplexer is placed before the modulators to allow the memory controller to
manage when to guide the light to the target bank for modulation. Since the optical
power is always guided down a single branch, the total laser power is roughly con-
stant and independent of the number of banks. The optical loss overhead due to the
nanophotonic demultiplexers and the reduced bandwidth density due to the branch-
ing make a guided nanophotonic bus most attractive when working with relatively
large per-bank optical losses.
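The scaling arguments for the three bus layouts can be made concrete with a hedged first-order model; the baseline power and loss values are illustrative, not figures from [9]:

```python
def bus_laser_power_w(num_banks, per_bank_loss_db, kind, base_power_w=1e-4):
    """First-order scaling sketch for the three bus layouts of Fig. 3.22
    (all constants are illustrative assumptions)."""
    if kind == 'shared':   # losses along the chain accumulate: exponential in banks
        return base_power_w * 10 ** (num_banks * per_bank_loss_db / 10)
    if kind == 'split':    # each branch pays a fixed loss: roughly linear in banks
        return num_banks * base_power_w * 10 ** (per_bank_loss_db / 10)
    if kind == 'guided':   # power guided down a single branch: roughly constant
        return base_power_w * 10 ** (per_bank_loss_db / 10)
    raise ValueError(kind)
```

The model captures why the shared bus is only viable within a single PIDRAM chip, the split bus scales to a modest number of chips, and the guided bus is attractive when per-bank losses are large.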
Figure 3.23 illustrates in more detail our proposed PIDRAM memory system.
The figure shows a processor chip with multiple independent PIDRAM memory
channels; each memory channel includes a memory controller and a PIDRAM
DIMM, which in turn includes a set of PIDRAM chips. Each PIDRAM chip con-
tains a set of banks, and each bank is completely contained within a single PIDRAM
chip. We use a hybrid approach to implement each of the three logical buses. The
memory scheduler within the memory controller orchestrates access to each bus to
Fig. 3.23 PIDRAM memory system organization. Each PIDRAM memory channel connects to a
PIDRAM DIMM via a fiber ribbon. The memory controller manages the command bus (CB),
write-data bus (WDB), and read-data bus (RDB), which are wavelength division multiplexed onto
the same fiber. Nanophotonic demuxes guide power to only the active PIDRAM chip.
B = PIDRAM bank; each ring represents multiple rings for multi-wavelength buses (from [9], cour-
tesy of IEEE)
Fig. 3.24 Abstract physical layout for PIDRAM chip. Two layouts are shown for an example
PIDRAM chip with eight banks and eight array blocks per bank. For both layouts, the nanophoto-
nic command bus ends at the command access point (CAP), and an electrical H-tree implementa-
tion efficiently broadcasts control bits from the command access point to all array blocks. For
clarity, the on-chip electrical command bus is not shown. The difference between the two layouts
is how far nanophotonics is extended into the PIDRAM chip: (a) P1 uses nanophotonic chip I/O
for the data buses but fully electrical on-chip data bus implementations, and (b) P2 uses seamless
on-chip/off-chip nanophotonics to distribute the data bus to a group of four banks. CAP = com-
mand access point; DAP = data access point (adapted from [9], courtesy of IEEE)
and the optically passive ring filter banks at the bottom and top of the waterfall
ensure that each of these vertical waveguides only contains a subset of the channel's
wavelengths. Each of these vertical waveguides is analogous to the electrical vertical
buses in P1, so a bank can still be striped across the chip horizontally to allow easy
access to the on-chip nanophotonic interconnect. Various layouts are possible that
correspond to more or less nanophotonic access points. For a Pn layout, n indicates
the number of partitions along each vertical electrical data bus. All of the nanopho-
tonic circuits have to be replicated at each data access point for each bus partition.
This increases the fixed link power due to link transceiver circuits and ring heaters.
It can also potentially lead to higher optical losses, due to the increased number of
rings on the optical path. Our nanophotonic layouts all use the same on-chip com-
mand bus implementation as traditional electrical DRAM: a command access point
is positioned in the middle of the chip and an electrical H-tree command bus broad-
casts the control and address information to all array blocks.
Evaluation
To evaluate the energy efficiency and area trade-offs of the proposed DRAM chan-
nels, we use a heavily modified version of the CACTI-D DRAM modeling tool.
Since nanophotonics is an emerging technology, we explore the space of possible
results with both aggressive and conservative projections for nanophotonic devices.
To quantify the performance of each DRAM design, we use a detailed cycle-level
microarchitectural simulator. We use synthetic traffic patterns to issue loads and
stores at a rate capped by the number of in-flight messages. We simulate a range of
different designs with each configuration name indicating the layout (Pn), the num-
ber of banks (b8/b64), and the number of I/Os per array core (io4/io32). We use the
events and statistics from the simulator to animate our DRAM and nanophotonic
device models to compute the energy per bit.
Figure 3.25 shows the energy-efficiency breakdown for various layouts imple-
menting three representative PIDRAM configurations. Each design is subjected to a
random traffic pattern at peak utilization and the results are shown for the aggressive
and conservative photonic technology projections. Across all designs it is clear that
replacing the off-chip links with photonics is advantageous, as E1 towers above the
rest of the designs. How far photonics is taken on chip, however, is a much richer
design space. To achieve the optimal energy efficiency requires balancing both the
data-dependent and data-independent components of the overall energy. The
data-independent energy includes: electrical laser power for the write bus, electrical
laser power for the read bus, fixed circuit energy including clock and leakage, and
thermal tuning energy. As shown in Fig. 3.25a, P1 spends the majority of the energy
on intra-chip communication (write and read energy) because the data must traverse
long global wires to get to each bank. Taking photonics all the way to each array
block with P64 minimizes the cross-chip energy, but results in a large number of
photonic access points (since the photonic access points in P1 are replicated 64
times in the case of P64), contributing to the large data-independent component of
the total energy. This is due to the fixed energy cost of photonic transceiver circuits
and the energy spent on ring thermal tuning. By sharing the photonic access points
across eight banks, the optimal design is P8. This design balances the data-dependent
savings of using intra-chip photonics with the data-independent overheads due to
electrical laser power, fixed circuit power, and thermal tuning power.
Once the off-chip and cross-chip energies have been reduced (as in the P8 layout
for the b64-io4 configuration), the activation energy becomes dominant. Figure 3.25b
shows the results for the b64-io32 configuration which increases the number of bits
we read or write from each array core to 32. This further reduces the activate energy
cost, and overall this optimized design is 10× more energy-efficient than the
baseline electrical design. Figure 3.25c shows similar trade-offs for the low-bandwidth
b8-io32 configuration.
In addition to these results, we also examined the energy as a function of utiliza-
tion and the area overhead. Figure 3.26 illustrates this trade-off for configurations
with 64 banks and four I/Os per array core. As expected, the energy per bit increases
as utilization goes down due to the data-independent power components. The large
fixed power in electrical DRAM interfaces helps mitigate the fixed power overhead
in a nanophotonic DRAM interface at low utilization; these results suggest the
potential for PIDRAM to be an energy efficient alternative regardless of utilization.
Although not shown, the area overhead for a PIDRAM chip is actually quite minimal, since any extra active area for the nanophotonic devices is compensated for by the more area-efficient, higher-bandwidth array blocks.
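The utilization effect has a simple form: per-bit energy is the dynamic energy plus the fixed power amortized over the delivered bits. A sketch (the pJ figures below are hypothetical, chosen only to mirror the qualitative comparison in the text, in which the electrical interface's own large fixed power offsets the nanophotonic fixed-power overhead):

```python
def pj_per_bit(utilization, dynamic_pj, fixed_pj_at_full):
    """Energy per bit grows as utilization drops, because the
    data-independent (fixed) power is amortized over fewer bits.
    fixed_pj_at_full is the fixed energy share per bit at 100% utilization."""
    assert 0 < utilization <= 1
    return dynamic_pj + fixed_pj_at_full / utilization

# Hypothetical interfaces: photonic has lower dynamic energy and a
# comparable fixed share, so it stays ahead even at low utilization.
for u in (1.0, 0.5, 0.1):
    electrical = pj_per_bit(u, dynamic_pj=5.0, fixed_pj_at_full=1.5)
    photonic = pj_per_bit(u, dynamic_pj=0.4, fixed_pj_at_full=0.6)
```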
3 Designing Chip-Level Nanophotonic Interconnection Networks 127
[Figure 3.25 appears here: six bar-chart panels (a)–(f) plotting energy (pJ/bit) for the electrical baseline E1 and the photonic floorplans P1–P64, covering the b64-io4, b64-io32, and b8-io32 configurations under conservative and aggressive nanophotonic projections.]
Fig. 3.25 Energy breakdown for DRAM memory channels. Energy results are for uniform random traffic with enough in-flight requests to saturate the DRAM memory channel. (a–c) Assume conservative nanophotonic device projections, while (d–f) assume more aggressive nanophotonic projections. Results for (a), (b), (d), and (e) are at a peak bandwidth of 500 Gb/s and (c) and (f) are at a peak bandwidth of 60 Gb/s with random traffic. Fixed circuit energy includes clock and leakage. Read energy includes chip I/O read, cross-chip read, and bank read energy. Write energy includes chip I/O write, cross-chip write, and bank write energy. Activate energy includes chip I/O command, cross-chip row address energy, and bank activate energy (from [9], courtesy of IEEE)
Design Themes
Point-to-point nanophotonic channels were a general theme in the first two case
studies, but in this case study point-to-point channels were less applicable. DRAM
memory channels usually use bus-based topologies to decouple bandwidth from
capacity, so we use a limited form of active optical switching in reader-sliced
SWMR and MWSR nanophotonic buses to reduce the required optical power. We
see this as a gradual approach to nanophotonic network complexity: a designer can
start with point-to-point nanophotonic channels, move to reader-sliced buses if there
is a need to scale terminals but not the network bandwidth, and finally move to fully
switched nanophotonic buses when even more flexible bandwidth sharing is needed.
128 C. Batten et al.
Fig. 3.26 Energy versus utilization. Energy results are for uniform random traffic with varying
numbers of in-flight messages. To reduce clutter, we only plot the three most energy efficient
waterfall floorplans (P4, P8, P16) (adapted from [9], courtesy of IEEE)
Conclusions
The nanophotonic network design space is large, and since this is still an emerging technology there is less intuition on which
parts of the design space are the most promising. Only exploring a single topology,
microarchitecture, or layout ignores some of the trade-offs involved in alternative
approaches. For example, restricting a design to only use optical switching eliminates
some high-radix topologies. These high-radix topologies can, however, be imple-
mented with electrical routers and point-to-point nanophotonic channels. As another
example, only considering wavelength slicing or only considering bus/channel slicing
artificially constrains bus and channel bandwidths as opposed to using a combination
of wavelength and bus/channel slicing. Iterating through the three levels of design can
enable a much richer exploration of the design space. For example, as discussed in
section Case Study #2: Manycore Processor-to-DRAM Network, an honest evaluation of our final results suggests that it may be necessary to revisit some of our earlier
design decisions about the importance of waveguide crossings.
Use an Aggressive Electrical Baseline. There are many techniques to improve
the performance and energy-efficiency of electrical chip-level networks, and most
of these techniques are far more practical than adopting an emerging technology.
Designers should assume fairly aggressive electrical projections in order to make a
compelling case for chip-level nanophotonic interconnection networks. For exam-
ple, with an aggressive electrical baseline technology in section Case Study #1:
On-Chip Tile-to-Tile Network, it becomes more difficult to make a strong case for
purely on-chip nanophotonic networks. However, even with aggressive electrical
assumptions it was still possible to show significant potential in using seamless
intra-chip/inter-chip nanophotonic links in sections Case Study #2: Manycore
Processor-to-DRAM Network and Case Study #3: DRAM Memory Channel.
Assume a Broad Range of Nanophotonic Device Parameters. Nanophotonics is
an emerging technology, and any specific instance of device parameters is currently meaningless for realistic network design. This is especially true when parameters are mixed from different device references that assume drastically different
fabrication technologies (e.g., hybrid integration versus monolithic integration). It
is far more useful for network designers to evaluate a specific proposal over a range
of device parameters. In fact, one of the primary goals of nanophotonic interconnec-
tion network research should be to provide feedback to device experts on the most
important directions for improvement. In other words, are there certain device
parameter ranges that are critical for achieving significant system-level benefits?
For example, the optical power contours in section Case Study #2: Manycore
Processor-to-DRAM Network helped motivate not only alternative layouts but also an interest in very low-loss waveguide crossings.
Carefully Consider Nanophotonic Fixed-Power Overheads. One of the primary
disadvantages of nanophotonic devices is the many forms of fixed power, including
fixed transceiver circuit power, static thermal tuning power, and optical laser power.
These overheads can impact the energy efficiency, on-chip power density, and sys-
tem-level power. Generating a specific amount of optical laser power can require
significant off-chip electrical power, and this optical laser power ultimately ends up
as heat dissipation in various nanophotonic devices. Ignoring these overheads or
only evaluating designs at high utilization rates can lead to overly optimistic results.
For example, section Case Study #1: On-Chip Tile-to-Tile Network suggested
that static power overhead could completely mitigate any advantage for purely on-
chip nanophotonic networks, unless we assume relatively aggressive nanophotonic
devices. This is in contrast to the study in section Case Study #3: DRAM Memory
Channel, which suggests that even at low utilization, PIDRAM can achieve similar
performance at lower power compared to projected electrical DRAM interfaces.
Motivate Nanophotonic Network Complexity. There will be significant practical
risk in adopting nanophotonic technology. Our goal as designers should be to
achieve the highest benefit with the absolute lowest amount of risk. Complex nano-
photonic interconnection networks can require many types of devices and many
instances of each type. These complicated designs significantly increase risk in
terms of reliability, fabrication cost, and packaging issues. If we can achieve the same
benefits with a much simpler network design, then ultimately this increases the
potential for realistic adoption of this emerging technology. Two of our case studies
make use of just nanophotonic point-to-point channels, and our hope is that this
simplicity can reduce risk. Once we decide to use nanophotonic point-to-point
channels, then high-radix, low-diameter topologies seem like a promising direction
for future research.
We would like to thank our co-authors on the various publications that served as the basis for the
three case studies, including Y.-J. Kwon, S. Beamer, I. Shamim, and C. Sun. We would like to
acknowledge the MIT nanophotonic device and circuits team, including J. S. Orcutt, A. Khilo,
M. A. Popović, C. W. Holzwarth, B. Moss, H. Li, M. Georgas, J. Leu, J. Sun, C. Sorace,
F. X. Kärtner, J. L. Hoyt, R. J. Ram, and H. I. Smith.
References
1. Abousamra A, Melhem R, Jones A (2011) Two-hop free-space based optical interconnects for
chip multiprocessors. In: International symposium on networks-on-chip (NOCS), May 2011.
http://dx.doi.org/10.1145/1999946.1999961 Pittsburgh, PA
2. Alduino A, Liao L, Jones R, Morse M, Kim B, Lo W, Basak J, Koch B, Liu H, Rong H, Sysak
M, Krause C, Saba R, Lazar D, Horwitz L, Bar R, Litski S, Liu A, Sullivan K, Dosunmu O, Na
N, Yin T, Haubensack F, Hsieh I, Heck J, Beatty R, Park H, Bovington J, Lee S, Nguyen H, Au
H, Nguyen K, Merani P, Hakami M, Paniccia MJ (2010) Demonstration of a high-speed
4-channel integrated silicon photonics WDM link with silicon lasers. In: Integrated photonics
research, silicon, and nanophotonics (IPRSN), July 2010. http://www.opticsinfobase.org/
abstract.cfm?URI=iprsn-2010-pdiwi5 Monterey, CA
3. Amatya R, Holzwarth CW, Popović MA, Gan F, Smith HI, Kärtner F, Ram RJ (2007) Low-power thermal tuning of second-order microring resonators. In: Conference on lasers and electro-optics (CLEO), May 2007. http://www.opticsinfobase.org/abstract.cfm?URI=CLEO-2007-CFQ5 Baltimore, MD
4. Balfour J, Dally W (2006) Design tradeoffs for tiled CMP on-chip networks. In: International
symposium on supercomputing (ICS), June 2006. http://dx.doi.org/10.1145/1183401.1183430
Queensland, Australia
5. Barwicz T, Byun H, Gan F, Holzwarth CW, Popović MA, Rakich PT, Watts MR, Ippen EP, Kärtner F, Smith HI, Orcutt JS, Ram RJ, Stojanović V, Olubuyide OO, Hoyt JL, Spector S, Geis M, Grein M, Lyszczarz T, Yoon JU (2007) Silicon photonics for compact, energy-efficient interconnects. J Opt Networks 6(1):63–73
6. Batten C, Joshi A, Orcutt JS, Khilo A, Moss B, Holzwarth CW, Popović MA, Li H, Smith HI, Hoyt JL, Kärtner FX, Ram RJ, Stojanović V, Asanović K (2008) Building manycore processor-to-DRAM networks with monolithic silicon photonics. In: Symposium on high-performance interconnects (hot interconnects), August 2008. http://dx.doi.org/10.1109/HOTI.2008.11 Stanford, CA
7. Batten C, Joshi A, Orcutt JS, Khilo A, Moss B, Holzwarth CW, Popović MA, Li H, Smith HI, Hoyt JL, Kärtner FX, Ram RJ, Stojanović V, Asanović K (2009) Building manycore processor-to-DRAM networks with monolithic CMOS silicon photonics. IEEE Micro 29(4):8–21
8. Batten C, Joshi A, Stojanović V, Asanović K (2012) Designing chip-level nanophotonic interconnection networks. IEEE J Emerg Sel Top Circuits Syst. http://dx.doi.org/10.1109/JETCAS.2012.2193932
9. Beamer S, Sun C, Kwon Y-J, Joshi A, Batten C, Stojanović V, Asanović K (2010) Rearchitecting DRAM memory systems with monolithically integrated silicon photonics. In: International symposium on computer architecture (ISCA), June 2010. http://dx.doi.org/10.1145/1815961.1815978 Saint-Malo, France
10. Beausoleil RG (2011) Large-scale integrated photonics for high-performance interconnects.
ACM J Emerg Technol Comput Syst 7(2):6
11. Beux SL, Trajkovic J, O'Connor I, Nicolescu G, Bois G, Paulin P (2011) Optical ring network-on-chip (ORNoC): architecture and design methodology. In: Design, automation, and test in Europe (DATE), March 2011. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=5763134 Grenoble, France
12. Bhuyan LN, Agrawal DP (1984) Generalized hypercube and hyperbus structures for a computer network. IEEE Trans Comput 33(4):323–333
13. Biberman A, Preston K, Hendry G, Sherwood-Droz N, Chan J, Levy JS, Lipson M, Bergman
K (2011) Photonic network-on-chip architectures using multilayer deposited silicon materials
for high-performance chip multiprocessors. ACM J Emerg Technol Comput Syst 7(2):7
14. Binkert N, Davis A, Lipasti M, Schreiber R, Vantrease D (2009) Nanophotonic barriers. In:
Workshop on photonic interconnects and computer architecture, December 2009. Atlanta, GA
15. Block BA, Younkin TR, Davids PS, Reshotko MR, Chang BMPP, Huang S, Luo J, Jen AKY (2008) Electro-optic polymer cladding ring resonator modulators. Opt Express 16(22):18326–18333
16. Christiaens I, Thourhout DV, Baets R (2004) Low-power thermo-optic tuning of vertically coupled microring resonators. Electron Lett 40(9):560–561
17. Cianchetti MJ, Albonesi DH (2011) A low-latency, high-throughput on-chip optical router
architecture for future chip multiprocessors. ACM J Emerg Technol Comput Syst 7(2):9
18. Cianchetti MJ, Kerekes JC, Albonesi DH (2009) Phastlane: a rapid transit optical routing net-
work. In: International symposium on computer architecture (ISCA), June 2009. http://dx.doi.
org/10.1145/1555754.1555809 Austin, TX
19. Clos C (1953) A study of non-blocking switching networks. Bell Syst Techn J 32:406–424
20. Dally WJ, Towles B (2004) Principles and practices of interconnection networks. Morgan
Kaufmann. http://www.amazon.com/dp/0122007514
21. DeRose CT, Watts MR, Trotter DC, Luck DL, Nielson GN, Young RW (2010) Silicon micror-
ing modulator with integrated heater and temperature sensor for thermal control. In: Conference
on lasers and electro-optics (CLEO), May 2010. http://www.opticsinfobase.org/abstract.
cfm?URI=CLEO-2010-CThJ3 San Jose, CA
22. Dokania RK, Apsel A (2009) Analysis of challenges for on-chip optical interconnects. In:
Great Lakes symposium on VLSI, May 2009. http://dx.doi.org/10.1145/1531542.1531607
Paris, France
23. Dumon P, Bogaerts W, Baets R, Fedeli J-M, Fulbert L (2009) Towards foundry approach for
silicon photonics: silicon photonics platform ePIXfab. Electron Lett 45(12):581–582
24. Georgas M, Leu JC, Moss B, Sun C, Stojanović V (2011) Addressing link-level design tradeoffs for integrated photonic interconnects. In: Custom integrated circuits conference (CICC), September 2011. http://dx.doi.org/10.1109/CICC.2011.6055363 San Jose, CA
25. Georgas M, Orcutt J, Ram RJ, Stojanović V (2011) A monolithically-integrated optical receiver in standard 45 nm SOI. In: European solid-state circuits conference (ESSCIRC), September 2011. http://dx.doi.org/10.1109/ESSCIRC.2011.6044993 Helsinki, Finland
26. Gu H, Xu J, Zhang W (2009) A low-power fat-tree-based optical network-on-chip for multiprocessor system-on-chip. In: Design, automation, and test in Europe (DATE), May 2009. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=5090624 Nice, France
27. Guha B, Kyotoku BBC, Lipson M (2010) CMOS-compatible athermal silicon microring resonators. Opt Express 18(4):3487–3493
28. Gunn C (2006) CMOS photonics for high-speed interconnects. IEEE Micro 26(2):58–66
29. Hadke A, Benavides T, Yoo SJB, Amirtharajah R, Akella V (2008) OCDIMM: scaling the
DRAM memory wall using WDM-based optical interconnects. In: Symposium on high-
performance interconnects (hot interconnects), August 2008. http://dx.doi.org/10.1109/
HOTI.2008.25 Stanford, CA
30. Holzwarth CW, Orcutt JS, Li H, Popović MA, Stojanović V, Hoyt JL, Ram RJ, Smith HI (2008) Localized substrate removal technique enabling strong-confinement microphotonics in bulk-Si-CMOS processes. In: Conference on lasers and electro-optics (CLEO), May 2008. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4571716 San Jose, CA
31. Hwang E, Bhave SA (2010) Nanophotonic devices on thin buried-oxide silicon-on-insulator substrates. Opt Express 18(4):3850–3857
32. Joshi A, Batten C, Kwon Y-J, Beamer S, Shamim I, Asanović K, Stojanović V (2009) Silicon-photonic Clos networks for global on-chip communication. In: International symposium on networks-on-chip (NOCS), May 2009. http://dx.doi.org/10.1109/NOCS.2009.5071460 San Diego, CA
33. Kalluri S, Ziari M, Chen A, Chuyanov V, Steier WH, Chen D, Jalali B, Fetterman H, Dalton
LR (1996) Monolithic integration of waveguide polymer electro-optic modulators on VLSI
circuitry. Photon Technol Lett 8(5):644–646
34. Kao Y-H, Chao JJ (2011) BLOCON: a bufferless photonic Clos network-on-chip architecture.
In: International symposium on networks-on-chip (NOCS), May 2011. http://dx.doi.
org/10.1145/1999946.1999960 Pittsburgh, PA
35. Kash JA (2008) Leveraging optical interconnects in future supercomputers and servers. In:
Symposium on high-performance interconnects (hot interconnects), August 2008. http://dx.
doi.org/10.1109/HOTI.2008.29 Stanford, CA
36. Kim B, Stojanović V (2008) Characterization of equalized and repeated interconnects for NoC applications. IEEE Design Test Comput 25(5):430–439
37. Kim J, Balfour J, Dally WJ (2007) Flattened butterfly topology for on-chip networks. In:
International symposium on microarchitecture (MICRO), December 2007 http://dx.doi.
org/10.1109/MICRO.2007.15 Chicago, IL
38. Kimerling LC, Ahn D, Apsel AB, Beals M, Carothers D, Chen Y-K, Conway T, Gill DM,
Grove M, Hong C-Y, Lipson M, Liu J, Michel J, Pan D, Patel SS, Pomerene AT, Rasras M,
Sparacin DK, Tu K-Y, White AE, Wong CW (2006) Electronic-photonic integrated circuits on
the CMOS platform. In: Silicon Photonics, March 2006. http://dx.doi.org/10.1117/12.654455
San Jose, CA
39. Kırman N, Kırman M, Dokania RK, Martínez JF, Apsel AB, Watkins MA, Albonesi DH (2006) Leveraging optical technology in future bus-based chip multiprocessors. In: International symposium on microarchitecture (MICRO), December 2006. http://dx.doi.org/10.1109/MICRO.2006.28 Orlando, FL
40. Kırman N, Martínez JF (2010) A power-efficient all-optical on-chip interconnect using wavelength-based oblivious routing. In: International conference on architectural support for
59. Orcutt JS, Khilo A, Popović MA, Holzwarth CW, Li H, Sun J, Moss B, Dahlem MS, Ippen EP, Hoyt JL, Stojanović V, Kärtner FX, Smith HI, Ram RJ (2009) Photonic integration in a commercial scaled bulk-CMOS process. In: International conference on photonics in switching, September 2009. http://dx.doi.org/10.1109/PS.2009.5307769 Pisa, Italy
60. Orcutt JS, Khilo A, Popović MA, Holzwarth CW, Moss B, Li H, Dahlem MS, Bonifield TD, Kärtner FX, Ippen EP, Hoyt JL, Ram RJ, Stojanović V (2008) Demonstration of an electronic photonic integrated circuit in a commercial scaled bulk-CMOS process. In: Conference on lasers and electro-optics (CLEO), May 2008. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4571838 San Jose, CA
61. Orcutt JS, Tang SD, Kramer S, Li H, Stojanovi V, Ram RJ (2011) Low-loss polysilicon wave-
guides suitable for integration within a high-volume polysilicon process. In: Conference on
lasers and electro-optics (CLEO), May 2011. http://ieeexplore.ieee.org/xpl/articleDetails.
jsp?arnumber=5950452 Baltimore, MD
62. Pan Y, Kim J, Memik G (2010) FlexiShare: energy-efficient nanophotonic crossbar architec-
ture through channel sharing. In: International symposium on high-performance computer
architecture (HPCA), January 2010. http://dx.doi.org/10.1109/HPCA.2010.5416626
Bangalore, India
63. Pan Y, Kumar P, Kim J, Memik G, Zhang Y, Choudhary A (2009) Firefly: illuminating on-chip
networks with nanophotonics. In: International symposium on computer architecture (ISCA),
June 2009. http://dx.doi.org/10.1145/1555754.1555808 Austin, TX
64. Pasricha S, Dutt N (2008) ORB: an on-chip optical ring bus communication architecture for
multi-processor systems-on-chip. In: Asia and South Pacific design automation conference
(ASP-DAC), January 2008 http://dx.doi.org/10.1109/ASPDAC.2008.4484059 Seoul, Korea
65. Petracca M, Lee BG, Bergman K, Carloni LP (2009) Photonic NoCs: system-level design exploration. IEEE Micro 29(4):74–77
66. Poon AW, Luo X, Xu F, Chen H (2009) Cascaded microresonator-based matrix switch for silicon on-chip optical interconnection. Proc IEEE 97(7):1216–1238
67. Preston K, Manipatruni S, Gondarenko A, Poitras CB, Lipson M (2009) Deposited silicon high-speed integrated electro-optic modulator. Opt Express 17(7):5118–5124
68. Reed GT (2008) Silicon photonics: the state of the art. Wiley-Interscience. http://www.ama-
zon.com/dp/0470025794
69. Shacham A, Bergman K, Carloni LP (2008) Photonic networks-on-chip for future generations of chip multiprocessors. IEEE Trans Comput 57(9):1246–1260
70. Sherwood-Droz N, Preston K, Levy JS, Lipson M (2010) Device guidelines for WDM inter-
connects using silicon microring resonators. In: Workshop on the interaction between nano-
photonic devices and systems (WINDS), December 2010. Atlanta, GA
71. Sherwood-Droz N, Wang H, Chen L, Lee BG, Biberman A, Bergman K, Lipson M (2008) Optical 4×4 hitless silicon router for optical networks-on-chip. Opt Express 16(20):15915–15922
72. Skadron K, Stan MR, Huang W, Velusamy S, Sankaranarayanan K, Tarjan D (2003) Temperature-aware microarchitecture. In: International symposium on computer architecture (ISCA), June 2003. http://dx.doi.org/10.1145/871656.859620 San Diego, CA
73. Thourhout DV, Campenhout JV, Rojo-Romeo P, Regreny P, Seassal C, Binetti P, Leijtens XJM, Nötzel R, Smit MK, Cioccio LD, Lagahe C, Fedeli J-M, Baets R (2007) A photonic interconnect layer on CMOS. In: European conference on optical communication (ECOC), September 2007. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5758445 Berlin, Germany
74. Udipi AN, Muralimanohar N, Balasubramonian R, Davis A, Jouppi N (2011) Combining
memory and a controller with photonics through 3D-stacking to enable scalable and energy-
efficient systems. In: International symposium on computer architecture (ISCA), June 2011.
http://dx.doi.org/10.1145/2000064.2000115 San Jose, CA
75. Vantrease D, Binkert N, Schreiber R, Lipasti MH (2009) Light speed arbitration and flow
control for nanophotonic interconnects. In: International symposium on microarchitecture
(MICRO), December 2009. http://dx.doi.org/10.1145/1669112.1669152 New York, NY
J. Xu (✉) · H. Gu · W. Liu
Mobile Computing System Laboratory, Department of Electronic
and Computer Engineering, The Hong Kong University of Science and Technology,
Clear Water Bay, Kowloon, Hong Kong
e-mail: jiang.xu@ust.hk
W. Zhang
Nanyang Technological University, Singapore, Singapore
comparing with the electronic NoC on a 64-core MPSoC. We simulate the FONoC
for the 64-core MPSoC and show the end-to-end delay and network throughput
under different offered loads and packet sizes.
Introduction
optical router, which significantly reduces the cost and optical power loss of 2D
mesh/torus optical NoCs [17]. Previous optical NoC and router studies concentrate on 2D topologies, such as mesh and torus.
In this work, we propose a new optical NoC, FONoC (fat tree-based optical NoC),
including its topology, protocols, as well as a low-power and low-cost optical router,
OTAR (optical turnaround router). In contrast to previous optical NoCs, FONoC
does not require the building of a separate electronic NoC. It transmits both payload
data and network control data over the same optical network. FONoC is based on a
fat tree topology, which is a hierarchical multistage network, and has been used by
multi-computer systems [18]. It has also attracted attention in electronic NoC studies [19–21]. While electronic fat tree-based NoCs use packet switching for both payload
data and network control data, FONoC uses circuit switching for payload data and
packet switching for network control data. The protocols of FONoC minimize the
network control data and the related power consumed by optical-electronic conver-
sions. An optimized turnaround routing algorithm is designed to utilize the mini-
mized network control data and a low-power feature of OTAR, which can passively
route packets without powering on any microresonator in 40% of cases. An analyti-
cal model is developed to assess the power consumption of FONoC. Based on the
analytical model and SPICE simulations, we compare FONoC with a matched elec-
tronic NoC in 45 nm. We simulate the FONoC for the 64-core MPSoC and show its
performance under various offered loads and packet sizes.
The rest of the chapter is organized as follows. Section Optical Turnaround
Router for FONoC describes the optical router proposed for FONoC. Section Fat
Tree-Based Optical NoC details FONoC, including the topology, floorplan, and
protocols. Section Comparison and Analysis evaluates and analyzes the power
consumption, optical power loss, and network performance of FONoC. Conclusions
are drawn in section Conclusions.
The two switching elements used by OTAR are crossing and parallel elements,
which implement the basic 1 × 2 switching function (Fig. 4.1). Both switching
140 J. Xu et al.
The switching fabric of an optical router can be implemented by the traditional fully-connected crossbar. An n × n optical router requires an n × n crossbar, which is composed of n² microresonators and 2n crossing waveguides. Figure 4.2a shows a 4 × 4 fully-connected crossbar, which has four input ports and four output ports. The fully-connected crossbar can be optimized based on the routing algorithm used by an optical router. The turnaround routing algorithm (also called the least common ancestor
routing algorithm) has been favored by many fat tree-based networks [22, 23]. In this
algorithm, a packet is first routed upstream until it reaches the common ancestor node
of the source and destination of the packet; then, the packet is routed downstream to
4 FONoC: A Fat Tree Based Optical Network-on-Chip 141
the destination. Turnaround routing is a minimal path routing algorithm and is free of
deadlock and livelock. In addition, it is a low-complexity adaptive algorithm which
does not use any global information. These features make the turnaround routing
algorithm particularly suitable for optical NoCs, which require both low latency and
low cost at the same time. Some microresonators can be removed from the fully-
connected crossbar based on the turnaround routing algorithm (Fig. 4.2b). Compared
with the fully-connected crossbar, the optimized crossbar saves six microresonators,
but still has the same number of waveguide crossings, and hence does not improve the
optical power loss or, by extension, the power consumption.
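As a rough sketch of the turnaround idea for a binary fat tree (the leaf-index convention and binary arity are assumptions for illustration; the chapter's OTAR implements an optimized variant of this algorithm):

```python
def turnaround_route(src, dst, k):
    """Sketch of turnaround (least-common-ancestor) routing in a binary
    fat tree with k leaves: route upstream until source and destination
    fall in the same subtree, then downstream toward the destination.
    Returns (upstream hops, list of downstream branch choices)."""
    assert src != dst and 0 <= src < k and 0 <= dst < k
    up = 0
    while (src >> up) != (dst >> up):   # still in different subtrees
        up += 1
    # Downstream phase: at each level, pick the child subtree holding dst.
    down = [(dst >> lvl) & 1 for lvl in range(up - 1, -1, -1)]
    return up, down
```

The path is minimal (2·up hops), and because the upstream phase never needs global state, the algorithm stays deadlock- and livelock-free, as the text notes.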
We propose a new router, OTAR, for FONoC (Fig. 4.3). OTAR is a 4 × 4 optical
router using the turnaround routing algorithm. It consists of an optical switching
fabric, a control unit, and four control interfaces (CI). The switching fabric uses
only six microresonators and four waveguides. The control unit uses electrical sig-
nals to configure the switching fabric according to the routing requirement of each
packet. The control interfaces inject and eject control packets to and from optical
waveguides.
The OTAR router has four bidirectional ports, called UP right, UP left, DOWN
right, and DOWN left. OTAR has a low-power feature, and can passively route
packets which travel on the same side. Packets travelling between UP left and DOWN left, as well as between UP right and DOWN right, do not require any microresonator to be powered on. There are a total of ten possible input–output port combinations. The passive cases account for four out of the ten possible combinations, so if traffic arrives at each port with equal probability, 40% of traffic will be routed passively without activating any microresonator. The four ports are aligned to their intended directions, and the input and output of each port are also properly aligned. The microresonators in OTAR are identical, and have the same
on-state and off-state resonant wavelengths, λon and λoff. OTAR uses the wavelength λon to transmit the payload packets which carry payload data, and λoff to transmit control packets which carry network control data.
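This counting can be checked mechanically. One plausible reading (the exact pair of eliminated turns is inferred from the text's statement that packets flowing down the tree never turn, so UP-to-UP moves never occur):

```python
from itertools import product

ports = ["UP_left", "UP_right", "DOWN_left", "DOWN_right"]

def allowed(i, o):
    """Port combinations used under turnaround routing: no U-turns, and
    no UP-to-UP moves (downstream packets never turn)."""
    if i == o:
        return False                                  # U-turn not implemented
    if i.startswith("UP") and o.startswith("UP"):
        return False                                  # eliminated turns (assumed)
    return True

combos = [(i, o) for i, o in product(ports, ports) if allowed(i, o)]
# Passive cases: same-side traffic (left<->left, right<->right) needs no
# microresonator powered on.
passive = [(i, o) for i, o in combos if i.split("_")[1] == o.split("_")[1]]
```

Under this reading the enumeration reproduces the text's figures: ten legal combinations, of which four are passive, giving the 40% passive-routing fraction.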
The switching fabric implements a 4 × 4 switching function for the four bidirectional ports. It is designed to minimize waveguide crossings. The U-turn function is
not implemented because the routing algorithm does not use it. Two unnecessary
turns are also eliminated since payload packets will not make turns when they flow
down the fat tree in turnaround routing. The OTAR router is strictly non-blocking
while using the turnaround routing algorithm. This can be proved by exhaustively
examining all the possible cases. The non-blocking property can help to increase the
network throughput.
The control unit processes the control packets and configures the optical switch-
ing fabric. Control packets are used to set up and maintain optical paths for payload
packets, and are processed in the electronic domain. The control unit is built from
CMOS transistors and uses electrical signals to power each microresonator on and
off according to the routing requirement of each packet. It uses an optimized routing
algorithm, which we will describe in the next section. Each port of the OTAR has a
control interface. The control interface includes two parallel switching elements (to
minimize the optical loss), an optical-electronic (OE) converter to convert optical
control packets into electronic signals, and an electronic-optical (EO) converter to
carry out the reverse conversion. The microresonators in the control interface are
always in the off-state and identical to those in the optical switching fabric. Their
off-state resonant wavelength λoff is used to transmit control packets.
We propose a new optical NoC, FONoC (fat tree-based optical NoC), for MPSoCs
including its topology, floorplan, and protocols. In contrast to other optical NoCs,
FONoC transmits both payload packets and control packets over the same optical
network. This obviates the need for building a separate electronic NoC for control
packets. The hierarchical network topology of FONoC makes it possible to connect
the FONoCs of multiple MPSoCs and other chips, such as off-chip memories, into an
inter-chip optical network and thus form a more powerful multiprocessor system.
FONoC is based on a fat tree topology to connect OTARs and processor cores
(Fig. 4.4). It is a non-blocking network, and provides path diversity to improve per-
formance. Processors are connected to OTARs by optical-electronic and electronic-
optical interfaces (OEEO), which convert signals between optical and electronic
domains. The notation FONoC(m,k) describes a FONoC connecting k processors using an m-level fat tree. There are k processors at level 0 and k/2 OTARs at each of the other levels. Based on the fat tree topology, connecting k processors requires m = log2(k) + 1 network levels, and all the processors are in the first network level. While connecting with other MPSoCs and off-chip memories, OTARs at the topmost level route the packets from FONoC to an inter-chip optical network. In this case, the number of OTARs required is (k/2)·log2(k). If an inter-chip optical network is not used, OTARs at the topmost level can be omitted; in this case, only (k/2)·(log2(k) − 1) OTARs are required. In Fig. 4.4, each optical interconnect is bidirectional, and includes two optical waveguides.
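For a concrete check of these sizing formulas (a direct transcription, using k = 64 as in the chapter's running example):

```python
import math

def fonoc_size(k, inter_chip=True):
    """Network levels and OTAR count for FONoC(m, k): m = log2(k) + 1
    levels in total, with k/2 OTARs at each router level. The topmost
    router level is only needed when an inter-chip network is attached."""
    log2k = int(math.log2(k))
    m = log2k + 1
    router_levels = log2k if inter_chip else log2k - 1
    return m, (k // 2) * router_levels

levels, otars = fonoc_size(64)                       # with inter-chip network
levels2, otars2 = fonoc_size(64, inter_chip=False)   # topmost OTARs omitted
```

For the 64-core MPSoC this gives 7 network levels and 32 OTARs per router level: 192 OTARs with the inter-chip network, 160 without.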
The corresponding floorplan of FONoC for a 64-core MPSoC is shown in
Fig. 4.5. Starting from level 2, multiple OTARs are grouped into a router cluster for
floorplanning purposes. The router clusters are connected by optical interconnects.
FONoC can be built on the same device layer as the processors; alternatively, to reduce chip area, 3D chip technology can be used to fabricate FONoC on a separate device layer and stack it above the device layer for the processor cores [24].
FONoC Protocols
[Listing of the optimized turnaround routing algorithm, mostly lost in extraction; the recoverable tail reads: ELSE pout = pup in / RETURN output port pout]
We analyze the power consumption, optical power loss, and network performance
for FONoC. The power consumption of FONoC is compared to a matched elec-
tronic NoC. The optical power loss of OTAR is compared to three other optical
routers under different conditions. We simulate and compare the network perfor-
mance of the FONoC for the 64-core MPSoC under various offered loads and
packet sizes.
Power Consumption
The energy consumed to transfer a packet, E_PK, is the sum of the payload energy and the control energy:

E_PK = E^o_payload + E^o_ctrl (4.1)

E^o_payload can be calculated by Eq. (4.2), where m is the number of microresonators in the on-state while transferring the payload packet, P^o_mr is the average power consumed by a microresonator when it is in the on-state, L^o_payload is the payload packet size, R is the data rate of the OEEO interfaces, d is the distance traveled by the payload packet, c is the speed of light in a vacuum, n is the refractive index of the silicon optical waveguide, and E^o_oeeo is the energy consumed for 1-bit OE and EO conversions.

E^o_payload = m · P^o_mr · (L^o_payload / R + d · n / c) + E^o_oeeo · L^o_payload (4.2)

E^o_ctrl can be calculated by Eq. (4.3). Additional variables are defined as follows: L^o_ctrl is the total size of the control packets used, h is the number of hops to transfer the payload packet, and E_cue is the average energy required by the control unit to make decisions for the payload packet.

E^o_ctrl = E^o_oeeo · L^o_ctrl · h + E_cue · (h + 1) (4.3)
For the 64-core MPSoC, FONoC consumes 87% less power than the matched electronic NoC. The results show that the power saving could increase to 93% when using 128-byte packets in a 1024-core MPSoC.
We analyze and compare the optical power loss of OTAR with three other optical routers: the fully-connected crossbar, the optimized crossbar, and a previously proposed 4 × 4 optical router, referred to here as COR for clarity. In our comparison, we considered the two major sources of optical power loss: waveguide crossing insertion loss and microresonator insertion loss. The waveguide crossing
insertion loss is 0.12 dB per crossing [16], and the microresonator insertion loss is
0.5 dB [27]. In an optical router, packets transferring between different input and
output ports may encounter different losses. We analyze the maximum loss, mini-
mum loss, and average loss of all possible cases (Fig. 4.7). The results show that
OTAR is the best in all comparisons. OTAR has 4% less minimum loss, 23% less
average loss, and 19% less maximum loss than the optimized crossbar. COR has the
same maximum loss as OTAR, but has higher average and minimum losses.
The number of microresonators used by an optical router is an indicator for the
area cost. While the optimized crossbar uses fewer microresonators than the fully-
connected crossbar, they have the same losses. OTAR uses 6 microresonators; the
fully-connected crossbar uses 16; the optimized crossbar uses 10; and COR uses 8.
OTAR uses the lowest number of microresonators, at 40% fewer than the optimized
crossbar.
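Given the two loss sources and the per-element figures cited above, the loss of any router path is a simple weighted sum. A minimal sketch; the example crossing/ring counts are illustrative, since the actual counts depend on each router's internal layout:

```python
CROSSING_LOSS_DB = 0.12   # waveguide crossing insertion loss, dB [16]
RING_LOSS_DB = 0.5        # microresonator insertion loss, dB [27]

def path_loss_db(crossings, rings_passed):
    """Optical loss of one input-to-output path from the two dominant
    sources considered in the text."""
    return crossings * CROSSING_LOSS_DB + rings_passed * RING_LOSS_DB

# e.g. a hypothetical path with 3 waveguide crossings and 1 on-resonance ring
print(round(path_loss_db(3, 1), 2))  # 0.86 dB
```

Enumerating this over all input/output port pairs of a router yields the minimum, average, and maximum losses compared in Fig. 4.7.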
Network Performance
We simulate the FONoC for the 64-core MPSoC and study the network perfor-
mance in terms of end-to-end (ETE) delay and network throughput. The ETE delay
is the average time between the generation of a packet and its arrival at its destination.
It is the sum of the connection-oriented path-setup time and the time used to trans-
mit optical packets. We simulated a range of packet sizes used by typical MPSoC
applications. We assumed a moderate bandwidth of 12.5 Gbps for each intercon-
nect. In the simulations, we assume that processors generate packets independently
and the packet generation time intervals follow a negative exponential distribution.
We used the uniform traffic pattern, i.e. each processor sends packets to all other
processors with the same probability. FONoC is simulated in a network simulator,
OPNET (www.opnet.com).
The ETE delay under different offered loads and packet sizes is shown in Fig. 4.8.
It shows that FONoC saturates at different loads with different packet sizes. The
ETE delay is very low before the saturation load, and increases dramatically after it.
For 32-byte packets, ETE delay is 0.06 ms before the saturation load 0.2, and goes
up to 110 ms after it. Packets larger than 32 bytes have a higher saturation load. This is
due to the lower number of control packets when using larger packets under the
same offered load. In addition, larger packets also have longer transmission times
and cause longer inter-packet arrival gaps compared with smaller packets under the
same offered load. These both help to reduce network contention during path setup,
and lead to higher saturation loads. Figure 4.8 also shows the network throughput
under various offered load and packet sizes. Ideally, throughput should increase
with the offered load. However, when the network becomes saturated, it will not be
able to accept a higher offered load beyond its capacity. The results show that the
throughput remains at a certain level after the saturation point.
Conclusions
This work proposes FONoC including its protocols, topology, floorplan, and a low-
power and low-cost optical router, OTAR. FONoC carries payload data as well as
network control data on the same optical network, while using circuit switching for
the former and packet switching for the latter. We analyze the power consumption,
optical power loss, and network performance of FONoC. An analytical model is
developed to assess the power consumption of FONoC. Based on the analytical
model and SPICE simulations, we compare FONoC with a matched electronic
NoC in 45 nm. The results show that FONoC can save 87% power to achieve the
same performance for a 64-core MPSoC. OTAR can passively route packets with-
out powering on any microresonator in 40% of all cases. Compared with three
other optical routers, OTAR has the lowest optical power loss and uses the lowest
number of microresonators. We simulate the FONoC for a 64-core MPSoC and
show the end-to-end delay and network throughput under various offered loads and
packet sizes.
Acknowledgments This work is partially supported by HKUST PDF and RGC of the Hong
Kong Special Administrative Region, China.
References
1. Benini L, De Micheli G (2002) Networks on chip: a new paradigm for systems on chip design.
In: Design, automation and test in Europe conference and exhibition, Paris, France
2. Sgroi M, Sheets M, Mihal A, Keutzer K, Malik S, Rabaey J, Sangiovanni-Vincentelli A (2001)
Addressing the system-on-a-chip interconnect woes through communication-based design. In:
Design automation conference, Las Vegas, NV, USA
25. Xu Q, Manipatruni S, Schmidt B, Shakya J, Lipson M (2007) 12.5 Gbit/s carrier-injection-based silicon micro-ring silicon modulators. Opt Express 15(2):430–436
26. Kromer C, Sialm G, Berger C, Morf T, Schmatz ML, Ellinger F et al (2005) A 100-mW 4 × 10 Gb/s transceiver in 80-nm CMOS for high-density optical interconnects. IEEE J Solid-State Circuits 40(12):2667–2679
27. Xiao S, Khan MH, Shen H, Qi M (2007) Multiple-channel silicon micro-resonator based filters for WDM applications. Opt Express 15:7489–7498
Chapter 5
On-Chip Optical Ring Bus Communication
Architecture for Heterogeneous MPSoC
S. Pasricha
Electrical and Computer Engineering Department, Colorado State University,
1373 Campus Delivery, Fort Collins, CO 80523-1373, USA
e-mail: sudeep@colostate.edu
N.D. Dutt
University of California, Irvine, Irvine, CA 92617, USA
Introduction
Recently, it has been shown that it may be beneficial to replace global on-chip
electrical interconnects with optical interconnects [9]. Optical interconnects can
theoretically offer ultra-high communication bandwidths in the terabits-per-second range, in addition to lower access latency and lower susceptibility to electromagnetic interference [10]. Optical signaling also has low power consumption, as the
power consumption of optically transmitted signals at the chip level is indepen-
dent of the distance covered by the light signal [11]. While optical interconnects
at the chip-to-chip level are already being actively developed [12], on-chip optical
interconnects have only lately begun to receive attention. This is due to the rela-
tively recent advances in the field of nanoscale silicon (Si) optics that have led to
the development of CMOS compatible silicon-based optical components such as
light sources [13], waveguides [14], modulators [15, 16], and detectors [17, 18].
As a result, while on-chip optical interconnects were virtually inconceivable with
previous generations of photonic technologies, these recent
advances have enabled the possibility of creating highly integrated CMOS com-
patible optical interconnection fabrics that can send and receive optical signals
with superior power efficiencies. In order to practically implement an on-chip
optical interconnect based fabric, it is highly likely that future CMOS ICs will
utilize 3D integration [19] as shown conceptually in Fig. 5.2. 3D integration will
allow logic and Si photonics planes to be separately optimized [20, 21]. In the
figure, the bottom plane consists of a CMOS IC with several microprocessor and
memory cores, while the top plane consists of an optical waveguide that transmits
optical signals at the chip level. It is also possible for all memory cores to be
implemented on a dedicated layer, separate from the microprocessor layer. Vertical
through silicon via (TSV) interconnects provide interconnections between cores
in different layers. As optical memories and optical buffered transfers cannot be
Fig. 5.3 Proposed optical ring bus (ORB) on-chip communication architecture for MPSoCs
Related Work
The concept of optical interconnects for on-chip communication was first introduced
by Goodman et al. [22]. Several works in recent years have explored chip-to-chip
photonic interconnects [2329]. With advances in the fabrication and integration of
optical elements on a CMOS chip in recent years, several works have presented a
comparison of the physical and circuit-level properties of non-pipelined on-chip
electrical (copper-based) and optical interconnects [9, 3035]. In particular, Collet
et al. [30] compared simple optical and electrical point-to-point links using a Spice-like simulator. Tosik et al. [31] studied more complex interconnects, comparing optical and
electrical clock distribution networks, using physical simulations, synthesis tech-
niques and predictive transistor models. Both works studied power consumption
and bandwidth, and highlighted the benefits of on-chip optical interconnect technol-
ogy. Intel's Technology and Manufacturing Group also performed a preliminary
study evaluating the benefits of optical intra-chip interconnects [32]. They con-
cluded that while optical clock distribution networks are not especially beneficial,
wave division multiplexing (WDM) based on-chip optical interconnects offer inter-
esting advantages for intra-chip communication over copper in UDSM process
technologies. While all of these studies have shown the promise of on-chip optical
interconnects, they have primarily focused on clock networks and non-pipelined
point-to-point links. One of the contributions of this chapter is to contrast on-chip
optical interconnects with all-electrical pipelined global bus-based communication
architectures that are used by designers to support high bandwidth on-chip data
transfers today.
transmission distance at the chip level, and (4) routing and placement is simplified
since it is possible to physically intersect light beams with minimal crosstalk. Once
a path is acquired, the transmission latency of the optical data is very short, depend-
ing only on the group velocity of light in a silicon waveguide: approximately
6.6 × 10^7 m/s, or about 300 ps for a 2-cm path crossing a chip [45]. After an optical path is
established, data can be transmitted end to end without the need for repeating or
buffering, which can lead to significant power savings.
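The flight-time figure quoted above follows directly from the group velocity. A one-line sketch (constant taken from the text's cited value):

```python
V_GROUP = 6.6e7  # group velocity of light in a silicon waveguide, m/s [45]

def propagation_delay_ps(path_m):
    """Photon flight time over an already set-up optical path, in ps
    (no repeaters or buffers are needed once the path exists)."""
    return path_m / V_GROUP * 1e12

# 2 cm chip crossing -> ~303 ps, matching the ~300 ps figure in the text
print(round(propagation_delay_ps(0.02)))
```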
Realizing on-chip optical interconnects as part of our proposed ORB communi-
cation architecture requires several CMOS compatible optical devices. Although
there are various candidate devices that exist for these optical elements, we select
specific devices that satisfy on-chip requirements. Figure 5.4 shows a high level
overview of the various components that make up our ORB optical interconnect
architecture. There are four primary optical components: a multi-wavelength laser
(light source), an opto-electric modulator/transmitter, an optical ring waveguide and
an optical receiver. The modulator converts electrical signals into optical light (E/O),
which is propagated through the optical waveguide, and then detected and converted
back into an electrical signal at the receiver (O/E). Integrating such an optical sys-
tem on a chip requires CMOS compatibility which puts constraints on the types of
materials and choices of components to be used. Recent technological advances
indicate that it is possible to effectively fabricate various types of optical compo-
nents on a chip. However, there are still significant challenges in efficiently integrat-
ing a silicon based laser on a chip. Using an off-chip laser can actually be beneficial
because it leads to lower on-chip area and power consumption. Consequently, in our
optical interconnect system we use an off-chip laser from which light is coupled
onto the chip using optical fibers, much like what is done in chip-to-chip optical
interconnects today [12, 46].
The transmission part in Fig. 5.4 consists of a modulator and a driver circuit. The
electro-optic modulator converts an input electrical signal into a modulated optical
wave signal for transmission through the optical waveguide. The modulators are
responsible for altering the refractive index or absorption coefficient of the optical
path when an electrical signal arrives at the input. Two types of electrical structures
have been proposed for opto-electric modulation: pin diodes [47] and MOS
capacitors [15]. Micro-ring resonator based pin diode type modulators [16, 47]
are compact in size (10–30 µm) and have low power consumption, but possess low modulation speeds (several MHz). Such micro-ring resonators couple light when the relation λ·m = Neff,ring · 2πR is satisfied, where R is the radius of the microring resonator, Neff,ring is the effective refractive index, m is an integer value, and λ is the resonant wavelength [48]. As the resonance wavelength is a function of R and Neff,ring, changing either quantity alters the resonant wavelength of the microring, enabling it to function as an optical modulator (a wavelength on-off switch). In general, the resonance wavelength shift Δλc produced by tuning the effective refractive index by ΔNeff is given by Δλc = λ · ΔNeff / Neff,ring. In contrast to microring resonators, MOS capacitor structures such as the Mach–Zehnder interferometer based silicon modulators [15, 46] have higher modulation speeds (several GHz) but a large power consumption and greater silicon footprint (around 10 mm). While these electro-optical modulators
today are by themselves not very attractive for on-chip implementation, there is a lot
of ongoing research which is attempting to combine the advantages of both these
modulator types [16]. Consequently, we make use of a predictive modulator model
which combines the advantages of both structures. We assume a modulator capaci-
tance that scales linearly with modulator length at the rate of 1.7 pF/mm [33].
The modulator is driven by a series of tapered inverters (i.e., driver). The first stage
consists of a minimum sized inverter. The total number of stages N is given as

N = log(cm / cg) / log(3.6)

where cm is the modulator capacitance and cg is the input capacitance of a minimum sized inverter.
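The resonance relations and the driver-sizing rule can be sketched numerically. All numeric inputs below (ring radius, index values, capacitances) are illustrative assumptions, not the chapter's design values:

```python
import math

def resonant_wavelength_nm(radius_um, n_eff, m):
    """Microring resonance condition: m * lambda = n_eff * 2*pi*R [48]."""
    return n_eff * 2 * math.pi * radius_um * 1e3 / m  # result in nm

def resonance_shift_nm(lam_nm, d_neff, n_eff):
    """Wavelength shift from index tuning: d_lambda = lam * d_neff / n_eff."""
    return lam_nm * d_neff / n_eff

def driver_stages(c_mod, c_gate, taper=3.6):
    """Tapered-inverter chain length N = log(c_mod / c_gate) / log(taper).
    c_mod: modulator capacitance; c_gate: input capacitance of a
    minimum-sized inverter."""
    return max(1, round(math.log(c_mod / c_gate) / math.log(taper)))

# Illustrative: a 5 um ring at order m=71 with n_eff=3.48 lands near 1540 nm;
# a 100 um modulator at 1.7 pF/mm (170 fF) driven from a 0.2 fF inverter
lam = resonant_wavelength_nm(5.0, 3.48, 71)
print(round(lam, 1), driver_stages(170e-15, 0.2e-15))
```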
Silicon-on-insulator (SOI) waveguides have a smaller cross-section (i.e., width) and lower area footprint compared to polymer waveguides. This leads
to better bandwidth density (i.e., transmitted bits per unit area). However polymer
waveguides have lower propagation delay than SOI waveguides. The area over-
head for polymer waveguides is mitigated if they are fabricated on a separate,
dedicated layer. Additionally, if wavelength division multiplexing (WDM) is
used, polymer waveguides provide superior performance and bandwidth density
compared to SOI waveguides [49]. Consequently, in our optical ring bus, we make
use of a low refractive index polymer waveguide with an effective index of 1.4.
We chose a ring shaped optical waveguide to avoid sharp turns in the waveguide
which can lead to significant signal loss. The optical ring is implemented on a
dedicated layer and covers a large portion of the chip so that it can effectively
replace global electrical pipelined interconnects.
The receiver part in Fig. 5.4 consists of a photo-detector to convert the light sig-
nal into an electrical signal, and a circuit to amplify the resulting analog electrical
signal to a digital voltage level. In order to support WDM, where transmission
occurs on multiple wavelengths, the receiver includes a wave-selective microring
resonator filter for each wavelength that is received. An important consideration in
the selection of a photo-detector is the trade-off between detection speed and sensi-
tivity (quantum efficiency) of the detector. Interdigitated metalsemiconductor
metal (MSM) Ge and SiGe photo-detectors have been proposed [17, 50] that have
fast response, excellent quantum efficiency and low power consumption. These
attributes make the MSM detector a suitable candidate as the photo-detector in our
optical interconnect architecture. We assume a detector capacitance of 100 fF based
on a realistic detector realization [34].
A trans-impedance amplifier (TIA) is used to amplify the current from the photo-
detector [33]. The TIA consists of an inverter and a feedback resistor, implemented
as a PMOS transistor. Additional minimum sized inverters are used to amplify the
signal to a digital level. The size of the inverter and feedback transistor in the TIA
is determined by bandwidth and noise constraints. To achieve high-gain and high-
speed detection, a higher analog supply voltage than the digital supply voltage may
be required, which may consume higher power. We assume a TIA supply voltage
that is 20% higher than the nominal supply for our power calculations. The amplified
digital signal is subsequently sent to the receiving bridge (Rx Bridge) component,
which decodes the destination address, and passes the received data to a specific
core in the cluster.
The previous section gave an overview of the various components that are part
of our on-chip optical interconnect architecture. In this section we elaborate on
the operation of our optical ring bus based hybrid opto-electric communication
architecture.
The number of wavelengths allocated to a cluster is proportional to its bandwidth requirement:

λi = λ · BWi / Σ(j=1..N) BWj

where BWi is the bandwidth requirement of cluster i, N is the number of clusters, and the number of allocated wavelengths λi is rounded to the nearest integer. The total number of transmitters for cluster i on the optical ring bus is

Ti_total = λi · (ai + di + ci)

and the total number of receivers is

Ri_total = (λ − λi) · (ai + di + ci)

where ai, di, and ci are the numbers of address, data, and control lines for cluster i.
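A sketch of this allocation arithmetic; the function names, the example bandwidth demands, and the receiver-count reading (each cluster listening on the wavelengths owned by the others) are illustrative:

```python
def allocate_wavelengths(bw, total_lambda):
    """lambda_i = total_lambda * BW_i / sum_j BW_j, rounded to nearest int."""
    s = sum(bw)
    return [round(total_lambda * b / s) for b in bw]

def interface_counts(lam_i, total_lambda, a, d, c):
    """Transmitters/receivers for one cluster gateway; (a, d, c) are the
    address, data, and control line counts. The receiver count uses
    (total_lambda - lam_i): the wavelengths the cluster must listen on."""
    lines = a + d + c
    return lam_i * lines, (total_lambda - lam_i) * lines

# Hypothetical demands for 4 clusters sharing 32 wavelengths; line counts
# 32 + 128 + 68 = 228 match the AXI3 configuration used later in the chapter
lams = allocate_wavelengths([4, 2, 1, 1], 32)
print(lams, interface_counts(lams[0], 32, 32, 128, 68))
```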
The photonic waveguides in ORB are logically partitioned into four channels:
reservation, reservation acknowledge, data (a combination of address, data, and
control signals), and data acknowledge, as shown in Fig. 5.5. In order to reserve an
optical path for a data transfer, ORB utilizes a single write multiple read (SWMR)
configuration on dedicated reservation channel waveguides. The source cluster uses
one of its available wavelengths (λt) to multicast the destination ID via the reservation channel to other gateway interfaces. This request is detected by all of the other
interfaces, with the destination interface accepting the request, while the other inter-
faces ignore it. As each gateway interface has a dedicated set of wavelengths allo-
cated to it, the destination can determine the source of the request, without the
sender needing to send its ID with the multicast.
If the request can be serviced by the available wavelength and buffer resources at
the destination, a reservation acknowledgement is sent back via the reservation
ACK channel on an available wavelength. The reservation ACK channel also has a
SWMR configuration, but a single waveguide per gateway interface is sufficient to
indicate the success or failure of the request. Once the photonic path has been
reserved in this manner, data transfer proceeds on the data channel, which has a low
cost multiple writer multiple reader (MWMR) configuration. In ORB, the number
of data channel waveguides is equal to the total number of address bus, data bus, and
control lines. The same wavelength (λt) used for the reservation phase is used by the
source to send data on. The destination gateway interface tunes one of its available
microring resonators to receive data from the sender on that wavelength after the
reservation phase. Once data transmission has completed, an acknowledgement is
sent back from the destination to the source interface via a data ACK channel that
also has a SWMR configuration with a single waveguide per interface to indicate whether the data transfer completed successfully or failed.
The advantage of having a fully optical path setup and acknowledgement based
flow control in ORB is that it avoids using the electrical interconnects for path setup,
as is proposed with some other approaches [39, 43], which our analysis shows can
be a major latency and power bottleneck to the point of mitigating the advantage of
having fast and low power photonic paths.
One final important design consideration is to ensure that light does not circulate
around the optical ring for more than one cycle, because that could lead to undesir-
able interference from older data. This is resolved by using attenuators with each
modulator, to act as a sink for the transmitted wavelength(s), once the signal has
completely traversed the optical ring.
Communication Serialization
Serialization of electrical communication links has been widely used in the past to
reduce wiring congestion, lower power consumption (by reducing link switching and
buffer resources), and improve performance (by reducing crosstalk) [5254]. As
reducing power consumption is a critical design goal in future MPSoCs, we propose
using serialization at the transmitting/receiving interfaces, to reduce the number of
optical components (waveguides, transmitters/receivers) and consequently reduce
area and complexity on the photonic layer as well as lower the power consumption.
Fig. 5.6 Serialization scheme for interface (a) serializer, (b) de-serializer
the transmitter ring oscillator) and the ring counter. The n-bit data word is read bit
by bit from the serial line into a shift register, in the next n clock cycles. Thus, after
n clock cycles, the n-bit data will be available on the parallel output lines, while the
least significant bit output of the ring counter (r0) becomes 1 to indicate data
word availability at the output. With the assertion of r0, the RS flip-flop is also
reset, disabling the ring oscillator. At this point the receiver is ready to start receiv-
ing the next data frame. In case of a slight mismatch between the transmitter and
receiver ring oscillator frequencies, correct operation can be ensured by adding a
small delay in the clock path of the receiver shift register.
The preceding discussion assumed n:1 serialization, where n data bits are trans-
mitted on one serial line (i.e., a serialization degree of n). If wider links are used,
this scheme can be easily extended. For instance, consider the scenario where 4n
data bits need to be transmitted on four serial lines. In such a case, the number of
shift registers in the transmitter must be increased from 1 to 4. However the control
circuitry (flip-flop, ring oscillator, ring counter) can be reused among the multiple
shift registers and remains unchanged. At the destination, every serial line must
have a separate receiver to eliminate jitter and mismatch between parallel lines.
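The n:1 scheme and its wide-link extension can be modeled functionally. This is a behavioral sketch only (no clocking or jitter); the LSB-first bit order and the interleaving of bits across serial lines are assumptions, not stated in the text:

```python
def serialize(word, n):
    """n:1 serialization: shift an n-bit word out LSB-first on one line."""
    return [(word >> i) & 1 for i in range(n)]

def deserialize(bits):
    """Receiver shift register: reassemble the word after n clock cycles
    (the ring counter's r0 output would flag word availability here)."""
    word = 0
    for i, b in enumerate(bits):
        word |= b << i
    return word

def serialize_wide(word, n, lines):
    """4n-bits-on-4-lines case (and its generalization): one shift
    register per serial line with shared control circuitry; here each
    line carries every `lines`-th bit of the word."""
    bits = [(word >> i) & 1 for i in range(n * lines)]
    return [bits[i::lines] for i in range(lines)]

# round trip: what goes out the serial line comes back as the same word
assert deserialize(serialize(0b10110010, 8)) == 0b10110010
```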
Experiments
In this section we present comparison studies between ORB and traditional all-
electrical on-chip communication architectures. The ORB communication architec-
ture uses an optical ring bus as a global interconnect, whereas the traditional
all-electrical communication architecture uses electrical pipelined global intercon-
nects. Both configurations use electrical buses as local interconnects within clusters.
Experimental Setup
We select several MPSoC applications for the comparison between our hybrid opto-
electric ORB communication architecture and the traditional pipelined electrical
communication architectures. These applications are selected from the well known
SPLASH-2 benchmark suite [58] (Barnes, Ocean, FFT, Radix, Cholesky, Raytrace,
Water-NSq). We also select applications from the networking domains (proprietary
benchmarks Netfilter and Datahub [59]). These applications are parallelized and
implemented on multiple cores. Table 5.1 shows the characteristics of the imple-
mentations of these applications, such as the number of cores (e.g., memories,
peripherals, processors), programmable processors and clusters on the MPSoC chip.
The die size is assumed to be 2 × 2 cm.
The applications described above are modeled in SystemC [60] and simulated at
the transaction level bus cycle accurate abstraction [61] to quickly and accurately
estimate performance and power consumption of the applications. The various cores
are interconnected using the AMBA AXI3 [8] standard bus protocol. A high level
simulated annealing floorplanner based on sequence pair representation PARQUET
[62] is used to create an early layout of the MPSoC on the die, and Manhattan dis-
tance based wire routing estimates are used to determine wire lengths for accurate
delay and power dissipation estimation.
The global optical ring bus length is calculated using simple geometric calcula-
tions and found to be approximately 43 mm. Based on this estimate, as well as opti-
cal component delay values (see section Performance Estimation Models), we
determine the maximum operating frequencies for ORB as 1.4 GHz (65 nm), 2 GHz
(45 nm), 2.6 GHz (32 nm) and 3.1 GHz (22 nm). To ensure a fair comparison, we
clock the traditional all-electrical global pipelined interconnect architecture at the
same frequencies as the optical ring bus architecture in our experiments. The cores
in the clusters are assumed to operate at twice the interconnect frequencies. We set
the width of the address bus as 32 bits and that of the separate read and write data
buses as 64 bits. The bus also uses 68 control bits, based on the AMBA AXI3 pro-
tocol. These translate into a total of 228 (address + read data + write data + control)
data optical waveguides, as discussed in section ORB On-Chip Communication
Architecture. Finally, WDM is used, with a maximum of λ = 32 wavelengths allocated based on cluster bandwidth requirements, on a per-application basis.
For the global electrical interconnect, wire delay and optimal delay repeater insertion
points are calculated using an RLC transmission line wire model described in [63].
Latches are inserted based on wire length (obtained from routing estimates), wire
delay, and clock frequency of the bus, to pipeline the bus and ensure correct opera-
tion [7]. For instance, a corner-to-corner wire of length 4 cm for a 2 cm × 2 cm die size
has a projected delay of 1.6 ns in 65 nm technology, for a minimum width wire size
Wmin [63]. To support a frequency of 2.5 GHz (corresponding to a clock period of
0.4 ns), 4 latches need to be inserted to ensure correct (multi-cycle) operation. It has
been shown that increasing wire width can reduce propagation delay at the cost of
Table 5.2 Delay (in ps) of optical components for 1 cm optical data path
Tech node 65 nm 45 nm 32 nm 22 nm
Modulator driver 45.8 25.8 16.3 9.5
Modulator 52.1 30.4 20.0 14.3
Polymer waveguide 46.7 46.7 46.7 46.7
Photo detector 0.5 0.3 0.3 0.2
Receiver amplifier 16.9 10.4 6.9 4.0
Total optical delay 162.0 113.6 90.2 74.7
area. For our global interconnects, we therefore consider wider interconnects with a
width 3Wmin which results in a near optimal power delay product at the cost of only a
slight area overhead. The delay of such a wide, repeater-inserted wire is found to be
approximately 26 ps/mm, varying only slightly (±1 ps/mm) between the 65 and 22 nm
nodes. Delay due to bridges, arbitration, and the serialization/de-serialization at the
interfaces was considered by annotating SystemC models with results of detailed
analysis for the circuits from the gate-level.
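The latch-insertion rule above (fit each pipeline segment into one clock period) reduces to a ceiling division. A minimal sketch using the text's own example numbers:

```python
import math

def pipeline_latches(wire_delay_ns, clock_ghz):
    """Latches needed so each segment of a pipelined bus wire fits in
    one clock period, per the multi-cycle scheme described in the text."""
    period_ns = 1.0 / clock_ghz
    return math.ceil(wire_delay_ns / period_ns)

# Text's example: 4 cm corner-to-corner wire with 1.6 ns delay at 65 nm,
# clocked at 2.5 GHz (0.4 ns period) -> 4 latches
print(pipeline_latches(1.6, 2.5))  # 4
```

With the ~26 ps/mm figure for 3Wmin wires, the same 4 cm wire (about 1.04 ns) at a 2 GHz bus clock would need 3 latches.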
For the optical ring bus (ORB) architecture, we model all the components
described in section Optical Ring Bus Architecture: Building Blocks, annotated
with appropriate delays. Table 5.2 shows delays of the various optical interconnect
components used in ORB, for a 1 cm optical data path, calculated based on estimates
from [5, 33]. It can be seen that the optical interconnect delay remains constant for
the waveguide, while the delay due to other components reduces with each technol-
ogy generation. This is in contrast to the minimum electrical interconnect delay
which is expected to remain almost constant (or even increase slightly) despite opti-
mal wire sizing (i.e., increasing wire width) and repeater insertion to reduce delay.
To estimate the power consumption for the electrical interconnects, we must account
for power consumed in the wires, repeaters, serialization/de-serialization circuits,
and bus logic components (latches, bridges, arbiters and decoders). For bus wire
power estimation, we determine wire lengths using our high level floorplan and rout-
ing estimates as described earlier. We then make use of bus wire power consumption
estimates from [64], and extend them to account for repeaters [65]. Static repeater
power and capacitive load is obtained from data sheets. Capacitive loads for compo-
nents connected to the bus are obtained after logic synthesis. Other capacitances (e.g.
ground, coupling) are obtained from the Berkeley Predictive Technology Model
(PTM) [66], and ITRS estimates [5]. The power consumed in the serialization cir-
cuitry and bus logic components is calculated by first creating power models for the
components, based on our previous work on high-level power estimation of com-
munication architecture components using regression based analysis of gate level
simulation data [65]. These power models are then plugged into the SystemC simula-
tion models. Power numbers are obtained for the components after simulation and
are within 5% accuracy of gate-level estimates [65]. Simulation is also used to obtain
accurate values for switching activity, which is used for bus wire power estimation.
For the optical interconnect, power consumption estimates for a transmitter and
receiver in a single optical data path are derived from [33] and shown in Table 5.3. It
can be seen that the power consumed by the transmitter dominates power consumed
by the receiver. The size as well as the capacitance of the modulator is large, requir-
ing a large driving circuit. To maintain their resonance under on-die temperature
variations, microring resonators need to be thermally tuned. We assume a dedicated
thermal tuner for every microring resonator in the proposed communication fabric,
dissipating approximately 20 mW/K, with a mean temperature deviation of about
20 K. In addition, we also consider the laser power driving the optical interconnect. As
an optical message propagates through the waveguide, it is attenuated through wave-
guide scattering and ring resonator insertion losses, which translates into optical
insertion loss. This loss sets the required optical laser power and correspondingly the
electrical laser power as it must be sufficient to overcome losses due to electrical
optical conversion efficiencies as well as transmission losses in the waveguide. We
conservatively set an electrical laser power of 3.3 W (with 30% laser efficiency) in
our power calculations based on per component optical losses for the coupler/splitter,
non-linearity, waveguide, ring modulator, receiver filter, and photodetector.
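The arithmetic behind such a budget can be sketched as follows (the per-component losses and receiver sensitivity below are placeholders, not the chapter's figures): sum the losses in dB, scale the required receiver power up by that factor, then divide by the laser's wall-plug efficiency.

```python
# Back-of-the-envelope optical power budget. All loss values and the
# receiver sensitivity are illustrative assumptions.

losses_db = {
    "coupler/splitter": 1.0,
    "waveguide": 2.0,          # total scattering loss along the path
    "ring modulator": 0.5,
    "receiver filter": 1.5,
    "photodetector": 0.5,
    "non-linearity margin": 1.0,
}

def required_electrical_power(rx_sensitivity_mw, laser_efficiency):
    total_db = sum(losses_db.values())
    # optical power the laser must emit to leave rx_sensitivity_mw
    # at the detector after all insertion losses
    optical_mw = rx_sensitivity_mw * 10 ** (total_db / 10.0)
    return optical_mw / laser_efficiency   # electrical mW into the laser

# e.g., 0.1 mW needed at the detector, 30% wall-plug laser efficiency
print(round(required_electrical_power(0.1, 0.30), 3))
```

The 30% efficiency matches the figure quoted in the text; everything else is an assumed example value.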
Performance Comparison
[Figure: bar charts of latency reduction factor (y-axis) versus SPLASH-2 application, for the 65, 45, 32, and 22 nm technology nodes; panels (a), (b), (c)]
Fig. 5.7 Latency reduction for ORB over traditional all-electrical bus-based communication
architectures (a) no serialization, (b) serialization degree of 2, (c) serialization degree of 4
It can be seen that the ORB architecture provides a latency reduction over tradi-
tional all-electrical bus-based communication architectures for UDSM technology
nodes. The speedup is small for 65 nm because of the relatively lower global clock
frequency (1.4 GHz) which does not require as much pipelining. However, from
the 45 nm down to the 22 nm nodes, the speedup increases steadily because of ris-
ing clock frequencies which introduce more pipeline stages in the electrical global
interconnect, increasing its latency, compared to the ORB architecture. The speedup
for radix is lower than other applications due to the smaller length of global inter-
connect wires, which reduces the advantage of having an optical link for global
data transfers. On the other hand, the lower speedup for ocean is due to the smaller
number of global inter-cluster data transfers, despite having long global intercon-
nects. With the increasing degree of serialization, a notable reduction in improve-
ment is observed. This is primarily because of the latency overhead of the
serialization/de-serialization process. The applications are impacted differently
depending upon the amount of inter-cluster communication each must support. For
instance, the smaller number of inter-cluster transfers in ocean results in smaller
latency degradation because of serialization than for an application with a higher
proportion of inter-cluster transfers, such as datahub. While increase in latency is
an undesirable side effect of serialization, it nonetheless brings other benefits, as
discussed in the next section. Overall, the ORB architecture speeds up global data
transfers due to the faster optical waveguide. Despite the costs associated with
converting the electrical signal into an optical signal and back, it can be seen that
at 22 nm, ORB can provide up to a 4.7× speedup without serialization, up to a 4.1×
speedup with a serialization degree of 2, and up to a 3.5× speedup with a serialization
degree of 4. With improvements in optical component fabrication over the next few
years, the opto-electrical conversion delay is expected to decrease leading to even
greater performance benefits.
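The trend just described can be captured with a rough cycle-count model (the constants below are illustrative assumptions, not the chapter's calibrated values): an electrical wire needs one pipeline stage per repeated segment, and the reach per cycle shrinks as the clock speeds up, while the optical path pays a fixed electro-optic conversion cost plus time of flight.

```python
import math

# Rough cycle-count comparison of pipelined electrical wires vs. an
# optical link, across rising clock frequencies. Illustrative constants.

REACH_MM_GHZ = 10.0   # assumed repeated-wire reach: mm per cycle at 1 GHz
CONV_NS = 0.3         # assumed per-end electro-optic conversion delay
V_OPT_MM_NS = 100.0   # assumed optical propagation speed in the waveguide

def electrical_cycles(wire_mm, freq_ghz):
    # one clock cycle per pipelined wire segment
    return math.ceil(wire_mm * freq_ghz / REACH_MM_GHZ)

def optical_cycles(wire_mm, freq_ghz):
    # fixed conversion cost at each end plus time of flight
    flight_ns = 2 * CONV_NS + wire_mm / V_OPT_MM_NS
    return math.ceil(flight_ns * freq_ghz)

for f in (1.4, 3.0, 4.5, 6.0):   # rising clock frequency across nodes
    e, o = electrical_cycles(20, f), optical_cycles(20, f)
    print(f, e, o, round(e / o, 2))
```

Even in this crude model the latency advantage of the optical link grows monotonically with clock frequency, which is the qualitative behavior reported above.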
Power Comparison
With increasing core counts on a chip aimed at satisfying ever increasing bandwidth
requirements of emerging applications, the on-chip power dissipation has been ris-
ing steadily. High power dissipation on a chip significantly increases cooling costs.
It also increases chip temperature which in turn increases the probability of timing
errors and overall system failure. On-chip communication architectures have been
shown to dissipate an increasing proportion of overall chip power in multicore chips
(e.g., ~40% in the MIT RAW chip [67] and ~30% in the Intel 80-core Teraflop chip
[68]) due to the large number of network interface (NI), router, link, and buffer
components in these architectures. Thus it is vital for designers to focus on reducing
power consumption in the on-chip interconnection architecture.
Figure 5.8a–c shows the power savings that can be obtained when using the
ORB architecture instead of an all-electrical pipelined interconnect architecture,
for three degrees of serialization: 1 (no serialization), 2, and 4. It can be
seen that the ORB architecture consumes more power for the 65 nm node, com-
pared to the all-electrical pipelined interconnect architecture. However, for tech-
nology nodes from 45 nm onwards, there is a significant reduction in ORB power
172 S. Pasricha and N.D. Dutt
[Figure: bar charts of power reduction factor (y-axis) versus SPLASH-2 application, for the 65, 45, 32, and 22 nm technology nodes; panels (a), (b), (c)]
Fig. 5.8 Power consumption reduction for ORB over traditional all-electrical bus-based communi-
cation architectures (a) no serialization, (b) serialization degree of 2, (c) serialization degree of 4
dissipation. As depicted in Fig. 5.8a, ORB can provide up to a 10.3× power reduction at 22 nm,
compared to all-electrical pipelined global interconnect architectures, which is a
strong motivation for adopting it in the near future. When serialization of degree 2
is utilized, there is actually a slight increase in power consumption for the 65 nm
node due to the overhead of the serialization/de-serialization circuitry (Fig. 5.8b).
However, from the 45 nm node and below, the reduction in power dissipation due
to fewer active microring resonators and associated heaters overshadows the
serialization overhead, and leads to a slight reduction in power consumption. At
the 22 nm node, up to a 7% reduction in power consumption is observed, com-
pared to the base case without any serialization. A similar trend is observed when
a serialization degree of 4 is utilized, shown in Fig. 5.8c. At the 22 nm node, up
to a 13% reduction in power consumption is observed compared to the base case
with no serialization. These results indicate the usefulness of serialization as a
mechanism to reduce on-chip power dissipation. In addition, serialization also
reduces the complexity of the photonic layer by reducing the number of resona-
tors, waveguides, and photodetectors, which can tremendously boost yield and
lower costs for fabricating hybrid opto-electric communication architectures
such as ORB.
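The component-count argument is simple enough to state directly (the 64-bit link width is an illustrative assumption): a link of width W serialized by degree s needs only W/s wavelengths, and hence W/s microring modulators, each with its own thermal tuner.

```python
# Sketch of why serialization shrinks the photonic layer: ring count,
# and hence thermal tuning power, scales as 1/s. Link width is an
# example value, not the chapter's exact configuration.

LINK_BITS = 64

def rings_per_link(serialization_degree):
    return LINK_BITS // serialization_degree

def tuning_power_relative(serialization_degree):
    # thermal tuning power scales with the number of active rings
    return rings_per_link(serialization_degree) / rings_per_link(1)

for s in (1, 2, 4):
    print(s, rings_per_link(s), tuning_power_relative(s))
```

The same proportionality applies to photodetectors and receiver filters, which is the yield argument made above.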
References
1. Pham D et al (2005) The design and implementation of a first-generation CELL processor. In: Proceedings of the IEEE ISSCC, San Francisco, CA, pp 184–185
2. Vangal S et al (2007) An 80-tile 1.28 TFLOPS network-on-chip in 65 nm CMOS. In: Proceedings of the IEEE international solid state circuits conference, paper 5.2, San Francisco, CA
3. Tilera Corporation (2007) TILE64 Processor. Product Brief
4. Ho R, Mai KW, Horowitz MA (2001) The future of wires. Proc IEEE 89(4):490–504
5. International Technology Roadmap for Semiconductors (2006) http://www.itrs.net/ Accessed
on Oct 2011
6. Adler V, Friedman E (1998) Repeater design to reduce delay and power in resistive intercon-
nect. In: IEEE TCAS
7. Nookala V, Sapatnekar SS (2005) Designing optimized pipelined global interconnects: algorithms and methodology impact. IEEE ISCAS 1:608–611
8. AMBA AXI Specification. www.arm.com/armtech/AXI Accessed on Oct 2011
9. Haurylau M et al (2006) On-chip optical interconnect roadmap: challenges and critical directions. IEEE J Sel Top Quantum Electron 12(6):1699–1705
10. Miller DA (2000) Rationale and challenges for optical interconnects to electronic chips. Proc IEEE 88:728–749
11. Ramaswami R, Sivarajan KN (2002) Optical networks: a practical perspective, 2nd edn.
Morgan Kaufmann, San Francisco
12. Young I (2004) Intel introduces chip-to-chip optical I/O interconnect prototype. Technology@
Intel Magazine
13. Rong H et al (2005) A continuous-wave Raman silicon laser. Nature 433:725–728
14. McNab SJ, Moll N, Vlasov YA (2003) Ultra-low loss photonic integrated circuit with membrane-type photonic crystal waveguides. Opt Express 11(22):2927–2939
15. Liu A et al (2004) A high-speed silicon optical modulator based on a metal-oxide-semiconductor capacitor. Nature 427:615–618
16. Xu Q et al (2007) 12.5 Gbit/s carrier-injection-based silicon microring silicon modulators. Opt Express 15(2):430–436
17. Reshotko MR, Kencke DL, Block B (2004) High-speed CMOS compatible photodetectors for optical interconnects. Proc SPIE 5564:146–155
18. Koester SJ et al (2004) High-efficiency, Ge-on-SOI lateral PIN photodiodes with 29 GHz bandwidth. In: Proceedings of the Device Research Conference, Notre Dame, pp 175–176
19. Haensch W (2007) Is 3D the next big thing in microprocessors? In: Proceedings of interna-
tional solid state circuits conference (ISSCC), San Francisco
20. Pasricha S, Dutt N (2008) Trends in emerging on-chip interconnect technologies. IPSJ Trans Syst LSI Design Methodology 1:2–17
21. Pasricha S (2009) Exploring serial vertical interconnects for 3D ICs. In: IEEE/ACM design automation conference (DAC), San Diego, CA, pp 581–586
22. Goodman JW et al (1984) Optical interconnects for VLSI systems. Proc IEEE 72(7):850–866
23. Tan M et al (2008) A high-speed optical multi-drop bus for computer interconnections. In: Proceedings of the 16th IEEE Symposium on high performance interconnects, pp 3–10
24. Chiarulli D et al (1994) Optoelectronic buses for high performance computing. Proc IEEE
82(11):1701
25. Kodi AK, Louri A (2004) Rapid: reconfigurable and scalable all-photonic interconnect for distributed shared memory multiprocessors. J Lightwave Technol 22:2101–2110
26. Kochar C et al (2007) Nd-rapid: a multidimensional scalable fault-tolerant optoelectronic interconnection for high performance computing systems. J Opt Networking 6(5):465–481
27. Ha J, Pinkston T (1997) Speed demon: cache coherence on an optical multichannel interconnect architecture. J Parallel Distrib Comput 41(1):78–91
28. Carrera EV, Bianchini R (1998) OPNET: a cost-effective optical network for multiprocessors. In: Proceedings of the international conference on supercomputing '98, pp 401–408
29. Batten C et al (2008) Building many core processor-to-DRAM networks with monolithic silicon photonics. In: Proceedings of the 16th annual symposium on high-performance interconnects, Stanford, CA, August 27–28, pp 21–30
30. Collet JH, Caignet F, Sellaye F, Litaize D (2003) Performance constraints for on-chip optical interconnects. IEEE J Sel Top Quantum Electron 9(2):425–432
31. Tosik G et al (2004) Power dissipation in optical and metallic clock distribution networks in new VLSI technologies. IEE Electron Lett 40(3):198–200
32. Kobrinsky MJ et al (2004) On-chip optical interconnects. Intel Technol J 8(2):129–142
33. Chen G, Chen H, Haurylau M, Nelson N, Albonesi D, Fauchet PM, Friedman EG (2005) Predictions of CMOS compatible on-chip optical interconnect. In: Proceedings of the SLIP, San Francisco, CA, pp 13–20
34. O'Connor I (2004) Optical solutions for system-level interconnect. In: Proceedings of the SLIP, Paris, France
35. Pappu AM, Apsel AB (2005) Analysis of intrachip electrical and optical fanout. Appl Opt 44(30):6361–6372
36. Benini L, De Micheli G (2002) Networks on chip: a new SoC paradigm. IEEE Comput 35(1):70–78
37. Dally WJ, Towles B (2001) Route packets, not wires: on-chip interconnection networks. In: Design automation conference, Las Vegas, NV, pp 684–689
38. Vangal S et al (2007) An 80-tile 1.28 TFLOPS network-on-chip in 65 nm CMOS. In: Proceedings of the ISSCC, San Francisco, CA
39. Shacham A, Bergman K, Carloni L (2007) The case for low-power photonic networks on chip. In: Proceedings of the DAC 2007, San Diego, CA
40. Kirman N et al (2006) Leveraging optical technology in future bus-based chip multiprocessors. In: Proceedings of the MICRO, Orlando, FL
41. Vantrease D et al (2008) Corona: system implications of emerging nanophotonic technology. In: Proceedings of the ISCA, Beijing, China
42. Poon AW, Xu F, Luo X (2008) Cascaded active silicon microresonator array cross-connect circuits for WDM networks-on-chip. Proc SPIE Int Soc Opt Eng 6898:689812
43. Kodi A, Morris R, Louri A, Zhang X (2009) On-chip photonic interconnects for scalable multi-core architectures. In: Proceedings of the 3rd ACM/IEEE international symposium on networks-on-chip (NOCS'09), San Diego, 10–13 May 2009, p 90
44. Pan Y et al (2009) Firefly: illuminating future network-on-chip with nanophotonics. In: Proceedings of the ISCA, pp 429–440
45. Hsieh I-W et al (2006) Ultrafast-pulse self-phase modulation and third-order dispersion in Si photonic wire-waveguides. Opt Express 14(25):12380–12387
46. Gunn C (2006) CMOS photonics for high-speed interconnects. IEEE Micro 26(2):58–66
47. Barrios CA et al (2003) Low-power-consumption short-length and high-modulation-depth silicon electro-optic modulator. J Lightwave Technol 21(4):1089–1098
48. Woo S, Ohara M, Torrie E, Singh J, Gupta A (1995) The SPLASH-2 programs: characterization and methodological considerations. In: Proceedings of the international symposium on computer architecture (ISCA), Santa Margherita Ligure, June 1995, pp 24–36
49. Eldada L, Shacklette LW (2000) Advances in polymer integrated optics. IEEE JQE 6(1):54–68
50. Gupta A et al (2004) High-speed optoelectronics receivers in SiGe. In: Proceedings of the VLSI design, pp 957–960
51. Lee BG et al (2007) Demonstrated 4×4 Gbps silicon photonic integrated parallel electronic to WDM interface. OFC
WDM interface. OFC
52. Dobkin R et al (2008) Parallel vs. serial on-chip communication. In: Proceedings of the SLIP, Newcastle, United Kingdom
53. Morgenshtein A et al (2004) Comparative analysis of serial vs parallel links in NoC. In: Proceedings of the SSOC
54. Ghoneima M et al (2005) Serial-link bus: a low-power on-chip bus architecture. In: Proceedings of the ICCAD, San Jose, CA
55. Kimura S et al (2003) An on-chip high speed serial communication method based on independent ring oscillators. In: Proceedings of the ISSCC
56. Wey I-C et al (2005) A 2 Gb/s high-speed scalable shift-register based on-chip serial communication design for SoC applications. In: Proceedings of the ISCAS
57. Saneei M, Afzali-Kusha A, Pedram M (2008) Two high performance and low power serial communication interfaces for on-chip interconnects. In: Proceedings of the CJECE
58. Woo SC et al (1995) The SPLASH-2 programs: characterization and methodological considerations. In: Proceedings of the ISCA, S. Margherita Ligure, Italy
59. Pasricha S, Dutt N (2008) The optical ring bus (ORB) on-chip communication architecture.
CECS technical report, February 2008
60. SystemC initiative. www.systemc.org Accessed on Oct 2011
61. Müller W, Ruf J, Rosenstiel W (2003) SystemC methodologies and applications. Kluwer, Norwell
62. Adya SN, Markov IL (2003) Fixed-outline floorplanning: enabling hierarchical design. IEEE TVLSI
63. Ismail YI, Friedman EG (2000) Effects of inductance on the propagation delay and repeater insertion in VLSI circuits. IEEE TVLSI 8(2):195–206
64. Kretzschmar C et al (2004) Why transition coding for power minimization of on-chip buses
does not work. In: DATE
65. Pasricha S, Park Y, Kurdahi F, Dutt N (2006) System-level power-performance trade-offs in bus matrix communication architecture synthesis. In: CODES+ISSS
66. Berkeley Predictive Technology Model, U.C. Berkeley. http://www-devices.eecs.berkeley.edu/~ptm/ Accessed on Oct 2011
67. Taylor M et al (2002) The Raw microprocessor. IEEE Micro
68. Vangal S et al (2007) An 80-tile 1.28 TFLOPS network-on-chip in 65 nm CMOS. In: Proceedings of the IEEE ISSCC, San Francisco, CA
Part III
System Integration and Optical-Enhanced
MPSoC Performance
Chapter 6
A Protocol Stack Architecture for Optical
Network-on-Chip
Organization and Performance Evaluation
Introduction
The proposed ONoC protocol stack follows the architecture of the classical OSI
reference model. This layered protocol architecture allows the modular design
of each ONoC building block, which boosts interoperability and design reuse. In
addition, it supports scalability and helps manage the design complexity of the ONoC.
The physical layer is concerned with the physical characteristics of the communica-
tion medium [14]. In optical NoCs, the physical layer is realized with devices from
heterogeneous domains. Some components depend on the optical technology, while
others depend on the CMOS technology. The optical physical layer defines the
specifications of the photonic and optoelectronic devices in the communication path.
It specifies the free spectral range (FSR) and the number of working wavelengths,
and the photonic power levels of the optical beam. In addition, the physical layer
specifies the width of wires as well as the levels and timing of the signals. There are
three classes of physical links in the ONoC: (1) the heterogeneous optoelectronic
multi-wavelength transmitter (MWL-Tx) link, (2) the heterogeneous optoelectronic
multi-wavelength receiver (MWL-Rx) link, and (3) the purely optical link composed
of the waveguides through the optical router. Concerning IP connectivity, each
pair of MWL-Tx and MWL-Rx links is dedicated to a single SoC IP.
The multi-wavelength transmitter link converts serial digital signals into a form
suitable for transmission as light through the optical router. It consists of
182 A. Allam and I. O'Connor
the laser drivers, and laser source modules in addition to the on/off demultiplexer
(see Fig. 6.2). Each laser source generates a laser beam with a wavelength corre-
sponding to the packet destination, and with instantaneous photonic power propor-
tional to the level of each input serial data bit. Note that this architecture considers
an array of fixed-wavelength laser sources. While tunable laser sources also exist,
their overall size and inter-wavelength switching speed prohibit their practical use.
However, from the point of view of the model, there is no fundamental reason why
the architecture could not include this type of device. The laser source in our ONoC
is an on-chip, directly modulated, compact III–V type.
The multi-wavelength receiver link converts the router optical signals into electrical
digital format. It consists of the photodiode (PD), transimpedance-amplifier (TIA),
and the comparator modules as shown in Fig. 6.3. Demultiplexing is carried out in
the optical domain, where the incoming photonic beam (composed of several multi-
plexed wavelengths) is exposed to a set of photodiodes, each sensitive to a single
wavelength. When a given photodiode is stimulated by the photonic beam, it produces
an electrical current proportional to the photonic power in that beam. The
photodetector considered in our simulation is a broadband Ge-detector integrated
with silicon nanophotonic waveguides [15], which uses a set of filters before it for
wavelength selection.
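A simplified numeric model of this receive chain can make the signal path concrete (all device values below are assumptions for illustration, not the chapter's parameters): the photodiode converts optical power to current via its responsivity, a transimpedance amplifier (TIA) converts current to voltage, and a comparator slices the result against a threshold.

```python
# Toy model of the MWL-Rx chain: photodiode -> TIA -> comparator.
# Assumed, illustrative device parameters.

RESPONSIVITY_A_PER_W = 0.8   # assumed Ge photodetector responsivity
TIA_GAIN_OHM = 5e3           # assumed transimpedance gain
V_THRESHOLD = 0.2            # comparator decision threshold (V)

def receive_bit(optical_power_w):
    i_pd = RESPONSIVITY_A_PER_W * optical_power_w   # photocurrent (A)
    v_tia = TIA_GAIN_OHM * i_pd                     # TIA output (V)
    return 1 if v_tia > V_THRESHOLD else 0

# A '1' sent at 100 uW, and a '0' seen as 5 uW of residual crosstalk light
print(receive_bit(100e-6), receive_bit(5e-6))
```

In practice the threshold must sit well above the worst-case residual light on a "0", which is what the SNR/BER analysis later in the chapter quantifies.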
6 A Protocol Stack Architecture for Optical Network-on-Chip 183
The physical-adapter layer is a sublayer of the physical (L1) layer of the OSI
reference model. Its main objective is to hide and wrap the heterogeneous (electri-
cal analog and optical) physical layer L1 of the ONoC protocol stack. Two units
define the architecture of the physical-adapter layer: the transmitter physical
adapter (Tx-PhyAdapter) and the receiver physical adapter (Rx-PhyAdapter)
units. Bit encoding is a vital service to be implemented in the physical-adapter
layer, the objective being to reduce the average power consumption. Our optical
bus inverter (OBI) module implements a source encoding technique allowing the
number of 1s within a flit to be reduced, so as to keep the laser switched off as
much as possible during the transmission of information. With this approach, the
signal being serialized contains the smallest number of logic ones, in such a way
that the laser is kept turned on for as short a time as possible. Encoding and decod-
ing are implemented with the OBI unit in the Tx-PhyAdapter and Rx-PhyAdapter,
respectively.
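The OBI scheme described above is essentially bus-invert coding, and a minimal sketch fits in a few lines (the 8-bit flit width is an example; the chapter uses wider flits):

```python
# Minimal sketch of the bus-invert idea behind the OBI unit: if a flit
# contains more 1s than 0s, transmit its complement and set a 1-bit
# flag, so the serialized stream (and thus the laser on-time) carries
# as few logic ones as possible.

FLIT_BITS = 8  # example width for illustration

def obi_encode(flit):
    ones = bin(flit).count("1")
    if ones > FLIT_BITS // 2:
        return flit ^ ((1 << FLIT_BITS) - 1), 1  # inverted, flag set
    return flit, 0

def obi_decode(flit, flag):
    return flit ^ ((1 << FLIT_BITS) - 1) if flag else flit

data = 0b11101101            # six ones out of eight bits
enc, flag = obi_encode(data)
print(bin(enc), flag)        # encoded flit now carries only two ones
```

The 1-bit flag is exactly the OBI field of the data link layer frame described in the next section's frame structure.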
The receiver physical adapter is constructed from the deserializer (DESER) and the
synchronizer (Sync) modules, in addition to the receiving part of the optical bus
inverter (OBI) unit (see Fig. 6.5). Its main function is the synchronization of the
serial communication and data conversion from serial to parallel.
The objective of the data link layer is to provide the reliability and the synchroniza-
tion functionality to the packet flow. Its main task in the ONoC protocol stack is to
ensure reliable communication of data packets along the two complementary rout-
ers used in the ONoC (see Network Layer section). Unlike macro computer net-
works, NoCs have to deliver messages between the IPs with guaranteed zero loss.
Thus, the ONoC has to adopt a rigorous flow control scheme.
Flow control is a key aspect of the data link layer. It is the mechanism that
determines packet movement along the network path, and it is concerned with the
allocation of shared resources (buffers and physical channels) as well as contention resolution
[16]. Since the electrical distributed router in ONoC employs the wormhole packet
switching technique (see Network Layer section), the proposed ONoC protocol stack
uses flit-buffer flow control, which allocates buffers and channel bandwidth in units of
flits. The flow-control is implemented in the electrical domain (see Fig. 6.6) so as to
reduce the complexity of the data conversion modules between different domains. It
can employ any suitable flow-control scheme (such as the credit-based or the backpres-
sure handshake on-off scheme). Our model adopts the handshake on-off scheme.
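The on-off handshake can be illustrated with a toy cycle-by-cycle model (buffer depth, thresholds, and traffic pattern below are arbitrary illustration values): the receiver deasserts an "on" signal when its flit buffer fills past a threshold, and the sender stalls until it drains.

```python
# Toy model of on-off (backpressure) flow control. All sizing values
# are illustrative, not the chapter's configuration.

from collections import deque

class RxBuffer:
    def __init__(self, depth=4, off_at=3, on_at=1):
        self.q = deque()
        self.depth, self.off_at, self.on_at = depth, off_at, on_at
        self.on = True                        # "on" signal to the sender
    def push(self, flit):
        assert len(self.q) < self.depth, "overflow: flow control failed"
        self.q.append(flit)
        if len(self.q) >= self.off_at:
            self.on = False                   # tell sender to stop
    def pop(self):
        flit = self.q.popleft()
        if len(self.q) <= self.on_at:
            self.on = True                    # tell sender to resume
        return flit

rx = RxBuffer()
sent, received, stalls = 0, [], 0
for cycle in range(20):
    if rx.on and sent < 8:        # sender side: honour the on/off signal
        rx.push(sent)
        sent += 1
    elif sent < 8:
        stalls += 1
    if cycle % 2 == 1 and rx.q:   # receiver drains every other cycle
        received.append(rx.pop())

print(sent, len(received), stalls)
```

The `assert` in `push` is the zero-loss guarantee: correctly chosen thresholds mean the buffer can never overflow, no matter how fast the sender runs.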
Figure 6.7 shows the frame structure of the data link layer. It is composed of
three fields: (1) the OBI field that is 1-bit wide used to indicate if the encoded
data is the OBI-inverted or the original bit-stream; (2) the PSS (protocol specific
signals) that is a variable-width field used to carry the protocol specific signal
communication (e.g. flit-id and aux signals in the VSTNoC protocol [17]); and
(3) the Flit field, which carries the flit bits; its width is protocol-dependent
(e.g. 36-, 72-, or 144-bit in the VSTNoC protocol).

[Figure 6.7: data link layer frame structure, with the OBI (1-bit), PSS (protocol-dependent width), and Flit fields]
Fig. 6.8 Electrical distributed-router and optical router connection in 3-2 ONoC
The network layer is responsible for transferring the data packets from the source IP
to the intended destination IP. It provides the routing functionality and the buffering
service to the data packets. The network layer in optical NoC is realized with two
complementary routers, the electrical distributed router (EDR) and the optical cen-
tralized router (OCR) (see Fig. 6.8); and it uses a two-level routing mechanism:
1. The optical routing level, which is implemented using the optical centralized
router. At this optical level, the routing mechanism is contention free and it is
based on wavelength division multiplexing (WDM).
2. The electrical routing level, which is implemented inside the electrical distrib-
uted router. Here, the routing mechanism is distributed among the transmitting-
and receiving-path interface units (TxIU and RxIU) of the electrical distributed
router. Inside the TxIU, the routing information extracted from the header flit is
used to feed the serial data to the corresponding laser driver and to activate its
corresponding laser source. On the other hand, at the RxIU, a packet from one
buffer among the group of buffers associated with different sources is released to
Fig. 6.9 N×N λ-router architecture (a); 4-port optical switch example (b)
transmitter. On the other hand, the data accumulated by the ONoC receiver from
various source IPs needs to be delivered to a single target, complying with a stan-
dard communication protocol. Thus, the main objective of the Electrical Distributed
Router is to adapt the SoC traffic to and from the ONoC data format (according to a
standard network interface protocol) with the signaling and timing required by the
optical NoC transmitter and receiver modules. It consists of two building blocks: (1)
the transmitting-path interface unit, TxIU, which is analogous to the input unit of
the conventional NoC router, and (2) the receiving-path interface unit, RxIU, which
is analogous to the output unit of the conventional NoC router.
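The two-level routing idea lends itself to a compact sketch (the wavelength assignment and network size below are made-up examples, not the chapter's λ-router design): in the TxIU, the destination ID extracted from the header flit selects which fixed-wavelength laser to drive, and the passive optical router then delivers that wavelength to the matching receiver with no arbitration.

```python
# Illustrative sketch of destination-to-wavelength routing in the TxIU.
# The wavelength assignment is a hypothetical example; any conflict-free
# (Latin-square-like) mapping serves the same purpose.

N_IPS = 4

def wavelength_for(src, dst):
    assert src != dst, "no self-loop through the optical router"
    # wavelength index source `src` must use to reach destination `dst`
    return (dst - src) % N_IPS

def route(src, header_flit):
    dst = header_flit & 0b11          # assumed destination field position
    return dst, wavelength_for(src, dst)

dst, lam = route(0, 0b10)  # IP0 sends a packet whose header targets IP2
print(dst, lam)
```

Because each (source, destination) pair maps to a distinct wavelength at each output, the optical level stays contention-free, exactly the WDM property claimed for the optical centralized router.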
The receiving-path interface unit, RxIU, operates as an adapter between the ONoC
receiver physical adapter module and the destination network interface, NI. It includes
FIFO buffers, rxFIFO, to store the receiver data; and an Arbiter module to arbitrate
between buffered packets so as to be delivered to the output NI (see Fig. 6.11). Its
Controller module manages and adapts the flow of released flits according to the NI
protocol signals (PSS), which are extracted by the PSSE module. The Flow-Control unit,
FCU, generates the flow-control signals as part of the flow-control mechanism.
Performance Evaluation
This section presents the performance evaluation of the proposed layered pro-
tocol architecture of the ONoC built with the novel EDR. The ONoC performance
analysis has been carried out both at the system level (network latency and through-
put) and at the physical level. In the physical-level (optical) performance analysis of
the ONoC, we study the communication reliability of the ONoC, expressed through
the signal-to-noise ratio (SNR) and the bit error rate (BER). Optical performance of
the ONoC is evaluated based on the system parameters, component characteris-
tics, and technology. The system-level analysis is carried out through simulation
using a flit-level-accurate SystemC model.
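One common way to connect the two physical-level metrics is the standard Gaussian-noise receiver relation (a textbook optical-communications formula, not one stated in this chapter): BER = ½·erfc(Q/√2), where the Q-factor grows with the signal-to-noise ratio.

```python
import math

# Standard relation between Q-factor and bit error rate for a receiver
# dominated by Gaussian noise. Not specific to this chapter's devices.

def ber_from_q(q):
    return 0.5 * math.erfc(q / math.sqrt(2))

# Q = 6 and Q = 7 bracket the classic 1e-9 .. 1e-12 BER range
for q in (6.0, 7.0):
    print(q, ber_from_q(q))
```

Relations of this kind are what turn the optical loss budget (which sets the received power and hence the SNR) into a concrete reliability figure for the link.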
Communication channels in the optical NoC architecture defined above can be cat-
egorized in equivalence classes. An equivalence class, as introduced by Draper and
Ghosh [20], is defined as a set of channels with similar stochastic properties with
respect to the arrival and service rate. There are five main equivalence classes to
which a channel in ONoC can be assigned:
1. Input channel (ICH), i.e. the input queue interfacing the ONoC to the NI of the connecting source IP.
2. Transmitting-path channel (TxCH), which consists of the buffer of the transmitting-path interface unit, TxIU, in addition to the serializer of the transmitter module.
3. Serial channel (serCH), constructed from the whole serial datapath starting from the laser-driver module up to the comparator module and passing through the optical router.
4. Receiving-path channel (RxCH), which consists of the FIFO buffers of the receiving-path interface unit, RxIU, in addition to the deserializer of the receiver module.
5. Output channel (OCH), which is the output connection interfacing the ONoC to the NI of the connecting destination IP.
The ONoC datapath is constructed from the series connection of these channels
as depicted in Fig. 6.12.
All datapaths through ONoC between any pair of source and destination IP nodes
are symmetric. Using this property in addition to the equivalence classes introduced
earlier, the characterization of ONoC performance metrics can be achieved by ana-
lyzing one ONoC datapath as depicted in Fig. 6.12.
Preliminary Definitions
In the Optical NoC, each input channel, as well as each transmitting-path channel,
is dedicated to a single input port of ONoC, while each output channel can accept
traffic from all associated RxCH channels.
Each SoC IP interacts with the ONoC through the NI, according to a predefined
communication protocol. The number of clock cycles required to transfer one unit
of data (flit) between the NI and the ONoC is defined by the communication
protocol used. In the following, we denote this number of protocol clock cycles as PCC.
The system-on-chip runs with a system clock frequency denoted fsys. Some compo-
nents of ONoC run at this system frequency (such as TxIU and RxIU) while the serial
datapath runs with serialization frequency fser. The ONoC is expected to be clocked
with a frequency higher than the nominal clock frequency, f0, which is the system
clock frequency corresponding to the minimum clock-period T0 (the time required by
a flit to be completely serialized through the serial datapath). As such, we define the
190 A. Allam and I. OConnor
ratio of the system frequency fsys to the nominal frequency f0 as the speed factor,
denoted spf, given as:

spf = fsys / f0 = T0 / Tsys = (fsys / fser)(FS / PCC), (6.1)

where FS is the flit size in bits.
The optical NoC works in linear operation (no saturation) as long as the serial chan-
nel bandwidth is able to accommodate the flow of input traffic, assuming infinite
FIFO buffers. This assumption is only used to characterize the interaction of ONoC
to the traffic flow in order to obtain an upper bound for network throughput.
An output channel, under ideal conditions, can release one flit every PCC clock
cycles. Thus, the output channel bandwidth, OCHBW, can be given (in flits/cycle) as:

OCHBW = 1 / PCC (6.2)

In addition, considering that OCHBW is shared among traffic from all associated
RxCH channels, and defining pij as the probability of sending packets from node i
to node j, the ideal capacity, Cap, can be given (in pkts/cycle) as in (6.3), where Nf
is the number of flits per packet:

Cap = 1 / (PCC · Nf · Σi pij) (6.3)
Maximum throughput occurs when some channel in the network becomes satu-
rated. The throughput upper bound is obtained by considering the role of the speed
factor and the traffic injection rate, assuming infinite RxIU FIFO-buffers.
Running the ONoC with a serialization frequency that is not high enough com-
pared to the operating system frequency will result in high spf, which leads to
flooding the ONoC with the injected traffic. As a result, saturation due to limited
serial channel bandwidth can occur for these injected traffic levels.
Let us define NP0 as the number of injected packets during one nominal clock-
period T0 at input channel and the injection ratio iR as the ratio of injected traffic to
the capacity; and recall that T0 = spf Tsys.
(
NP0 = spf iR / PCC Nf i p ij ) (6.4)
The maximum number of flits that can pass through the serial channel, serCH,
during T0 is 1/PCC [see (6.2)]. Saturation due to the serial channel occurs when
NP0 > 1/(PCC · Nf). Thus, saturation occurs at the injection ratio iRsat = Σi pij / spf.
Thus, the ONoC can work in linear operation, while accommodating the maximum
offered traffic (iR = 1), with a speed factor

spf ≤ Σi pij. (6.7)
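A worked numeric example ties the definitions together, using the chapter's simulation parameters (1 GHz system clock, 12.5 Gbps serial rate, 64-bit flits); the PCC value and the uniform traffic pattern are illustrative assumptions.

```python
# Worked example of the speed factor and the saturation injection
# ratio, under the reading spf = (fsys/fser)(FS/PCC). PCC and the
# traffic pattern are assumptions for illustration.

F_SYS_GHZ = 1.0
F_SER_GBPS = 12.5
FS_BITS = 64
PCC = 2          # assumed protocol clock cycles per flit
SUM_PIJ = 1.0    # uniform traffic: sum over sources of p_ij is 1

spf = (F_SYS_GHZ / F_SER_GBPS) * (FS_BITS / PCC)   # speed factor
ir_sat = SUM_PIJ / spf                             # saturation injection ratio

print(round(spf, 2), round(ir_sat, 3))
```

With these numbers spf exceeds the bound of (6.7), so the network would saturate at roughly 39% of the offered capacity; raising the serial rate or the serialization degree lowers spf back toward the linear-operation region.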
System-Level Simulation
To analyze the ONoC behavior and to evaluate its performance metrics (latency and
throughput), a BCA SystemC model for the ONoC has been developed. The model
implements all the micro-architectural details of the ONoC in addition to the struc-
tural details of its components. This model simulates the network at flit-level so as
to produce very accurate performance information.
The performance evaluation of the ONoC is carried out under two traffic test
sets: (1) a synthetic workload simulating real world traffic characteristics and (2) the
SPLASH-2 benchmark [21] traffic. In addition, it is compared to the performance
of the ENoC with mesh topology.
MPSoCs with 64–128 processors are common today in high-end servers, and
this number is increasing with time. A modern microprocessor executes over 10^9
instructions per second with an average bandwidth of about 400 MB/s. However,
to avoid increasing memory latency, most processors still need larger peak band-
width [22].
The simulation experiment is carried out for various numbers of IPs (8, 16, 32, and
64). The synthetic workload traffic is used to evaluate the ONoC performance under
various bandwidth requirements of 8, 16, 24, and 32 Gbps from each IP. In the
conducted simulation tests, the MPSoC is clocked at a frequency of 1 GHz, while
the ONoC is allowed to deliver serial data at a rate of 12.5 Gbps using the current
state-of-the-art photonic component parameters [15, 23] shown in Table 6.1. The flit
size is set to 64 bits, with a packet length of 4 flits.
192 A. Allam and I. OConnor
The system-level analysis shows that the ONoC under study is a stable network,
as clearly revealed by the simulated throughput in Fig. 6.13. The network is
stable since it continues to deliver the peak throughput (rather than degrading)
beyond the saturation point.
In passive-type ONoCs such as that under study, there is a single central optical
router and the buffering queues are located at the end of the communication datapath
(compared to the ENoC, which has buffers and routing switches at each routing
node). Thus, resource contentions are far less frequent in the ONoC than in the
ENoC, and hence the ONoC's deliverable bandwidth is expected to be higher as the
size of the MPSoC becomes larger. Simulation results in Fig. 6.14 bear out this
hypothesis.
Figure 6.14 shows that the ONoC handles the required bandwidth successfully
as long as it does not exceed the ONoC's physical bandwidth (12.5 Gbps in our
setup), and achieves a bandwidth equal to that of the ENoC for relatively low
bandwidth demands. It also demonstrates the scalability of the ONoC: the ONoC's
achievable bandwidth remains almost constant regardless of the network size, in
contrast to the ENoC.
A similar observation and conclusion can be drawn from the results of simulating
the SPLASH-2 benchmark, as shown in Fig. 6.15. The ONoC delivers a comparable
bandwidth to the ENoC.
In NoCs, the typical packet size is 1,024 bits; a packet is divided into several flits,
with a typical size of 64 bits, for efficient resource utilization. Because of the limited
width of the physical channel in the ENoC, each flit is further subdivided into one
or more physical transfer digits, or phits, typically between 1 and 64 bits in size.
Each phit is transferred across the channel in a single clock cycle. Each input
channel of the ENoC router accepts and deserializes the incoming phits. Once a
complete flit is constructed, it is allocated to the input buffer and can arbitrate for an
output channel. At the other end, the output channel plays the complementary role:
it serializes the buffered flits back into phits for physical channel bandwidth
allocation.
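The flit-to-phit serialization just described can be sketched as follows; the 16-bit phit width, the little-endian phit ordering, and the sample flit value are illustrative assumptions:

```python
def flit_to_phits(flit, flit_bits=64, phit_bits=16):
    """Split one flit into phits; one phit crosses the channel per clock cycle."""
    assert flit_bits % phit_bits == 0
    mask = (1 << phit_bits) - 1
    return [(flit >> (i * phit_bits)) & mask
            for i in range(flit_bits // phit_bits)]

def phits_to_flit(phits, phit_bits=16):
    """Deserialize incoming phits back into a flit at the router input channel."""
    flit = 0
    for i, ph in enumerate(phits):
        flit |= ph << (i * phit_bits)
    return flit

flit = 0xDEADBEEFCAFEF00D
phits = flit_to_phits(flit)          # 4 phits for a 64-bit flit, 16-bit phits
assert phits_to_flit(phits) == flit  # round-trip is lossless
```

A narrower phit needs fewer wires but more clock cycles per flit, which is the trade-off Fig. 6.16 explores for the ENoC.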
Fig. 6.16 ONoC performance (avg BW, Gbps) against ENoC with various phit lengths
The previous section examined the ONoC performance from the system-level
perspective. In our physical-level performance analysis of the ONoC, we study the
communication reliability of the ONoC, formulated in terms of the signal-to-noise
ratio, SNR (the relative level of the signal to the noise), and the bit error rate, BER
(the rate of occurrence of erroneous bits relative to the total number of bits received
in a transmission). This is achieved by analyzing the heterogeneous communication
path of the ONoC based on:
- System parameters, such as ONoC size (passive optical-router structure and its
number of routing elements) and data rate.
- Technology characteristics (micro-resonator roundtrip and coupling losses,
waveguide sidewall roughness and reflection losses, and manufacturing variability).
- Component characteristics (detector responsivity, source threshold current and
efficiency, TIA input-referred noise).
Preliminary Definitions
The path of data through the heterogeneous domains is as follows: first, the laser-driver
generates two electrical current values corresponding to digital data bits 1 and 0. This
current drives the laser-source module to generate an optical beam with photonic
power proportional to this input current. This optical beam is synthesized at a specific
wavelength according to the physical characteristics of the laser-source. Optical
beams are routed inside the passive optical-router (using the wavelength division
multiplexing, WDM, routing mechanism). Then, the photodetector produces an
electrical current proportional to the incident photonic power, which is fed to the TIA
that generates the equivalent voltage. After the received signal (and associated noise)
has been amplified by the TIA, a decision to convert the received signal to a logic 1
or 0 will be carried out by the comparator and will be subject to errors based on the
relative level of the signal to the noise (SNR).
Each micro-resonator switch of the optical-router has a nominal resonant wavelength
λres (see Fig. 6.2); and each λres is i·Δλ distant from the system's base wavelength.
Here, i is the optical channel index (i = 0 … N−1, N being the number of
channels, a function of the ONoC size and structure), and Δλ is the channel spacing
(equal to FSR/N, FSR being the free spectral range of the micro-resonator switches).
In practice, due to manufacturing variations and heating, the actual resonant
wavelength will lie in a range of δλ around the nominal resonant wavelength λres,
with δλ being the maximum error, or detuning, that can occur in the system.
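The wavelength allocation just described can be sketched numerically; the base wavelength and FSR below are illustrative values, not taken from Table 6.1:

```python
def channel_wavelengths(base_nm, fsr_nm, n_channels):
    """Nominal resonant wavelengths: lambda_i = base + i * (FSR / N)."""
    spacing = fsr_nm / n_channels          # Delta-lambda = FSR / N
    return [base_nm + i * spacing for i in range(n_channels)]

def detuned_range(nominal_nm, max_detuning_nm):
    """Actual resonance lies within +/- delta-lambda of the nominal value."""
    return (nominal_nm - max_detuning_nm, nominal_nm + max_detuning_nm)

# Illustrative numbers: 1,550 nm base wavelength, 32 nm FSR, 16-channel ONoC
chans = channel_wavelengths(1550.0, 32.0, 16)
print(chans[0], chans[1])            # 1550.0 1552.0 (2 nm channel spacing)
print(detuned_range(chans[1], 0.2))  # band the detuned resonance can occupy
```

Note that with these illustrative numbers a detuning of 0.2 nm already consumes a tenth of the 2 nm channel spacing, which is why detuning dominates the BER analysis that follows.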
The injected laser signal to the optical-router, on its normal path to the destination,
passes through a number of micro-resonator switches. In each switch, the signal
encounters attenuation in the drop and through channels depending on its wave-
length (see Fig. 6.17). One of the switches will be resonant to the signal's own
nominal wavelength, which directs it to follow the drop path while it is being
manipulated by its drop transfer function. In all other switches (with different wave-
lengths), the signal follows the through path and is manipulated by the switch
through transfer function.
Since the micro-resonator filter cannot achieve perfect wavelength selectivity,
crosstalk and interference from signals on other wavelengths are added to the data
signal in the drop path. Similarly, a small fraction of the data signal, extracted by
the filter's through transfer function, is added to the signals on other wavelengths
in the through path, depending on the wavelength; this is considered one source of
the optical-router losses. The other sources of optical-router losses are the
micro-resonators' drop and through attenuation (which depend on device parameters
such as the ring's roundtrip loss coefficient, r, the coupling coefficient between the
straight waveguide and the ring, k1, and that between the two rings, k2, in the
double-ring micro-resonator filter), in addition to the losses caused by the passive
waveguides (due to sidewall roughness and reflection losses).
To obtain the SNR figure of the ONoC, the N digital sources are allowed to transmit
1s and 0s to the N destinations randomly. The laser signal is represented as a
Gaussian shape around the transmitting wavelength, so that the whole wavelength
spectrum of the signal is accurately manipulated by each micro-disk throughout the
path in the router. At the receiver, the wavelength selection at the photodetector is
carried out with the same type of filter switch as is used inside the optical-router;
and the input-referred noise of the TIA, coupled with the photodetector dark current,
gives the total noise at the input of the TIA circuit. This noise and the received
optical power at the photodetector for logic 1 and 0 are used to calculate the SNR
and the BER using the methodology in [23].
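The chapter follows the methodology of [23] for this last step; a common textbook form of the SNR-to-BER computation, assuming Gaussian noise statistics, is the Q-factor formulation sketched below (the receiver currents and noise values are illustrative assumptions):

```python
import math

def q_factor(i1, i0, sigma1, sigma0):
    """Q = (I1 - I0) / (sigma1 + sigma0): the decision margin between the
    photocurrents for logic 1 and logic 0, relative to their noise sigmas."""
    return (i1 - i0) / (sigma1 + sigma0)

def ber_from_q(q):
    """BER = 0.5 * erfc(Q / sqrt(2)), valid under Gaussian noise."""
    return 0.5 * math.erfc(q / math.sqrt(2))

# Illustrative received photocurrents (A) and noise sigmas (A)
q = q_factor(20e-6, 2e-6, 1.2e-6, 0.8e-6)
print(q)              # ~9.0
print(ber_from_q(q))  # ~1e-19, the order of magnitude quoted in the text
```

This makes the steep SNR/BER trade-off visible: because BER falls roughly as exp(-Q²/2), a few dB of extra loss or crosstalk shifts the BER by many orders of magnitude, which matches the sensitivity to detuning reported below.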
Parametric Exploration
In this section, we explore and analyze the SNR and the BER of the ONoC against
the maximum detuning δλ (upwards from the ideal case of δλ = 0 nm, i.e., no
manufacturing or thermal variations) for various system specifications and
technology parameters. The reference point for the photonic device parameters is
the current state-of-the-art component parameters [15, 23] shown in Table 6.1. We
will contrast our ONoC BER against the typical BER figures required by the
Synchronous Optical NETwork (SONET), Gigabit Ethernet, and Fibre Channel
specifications, which are 10⁻¹⁰ to 10⁻¹² or better [23].
Figures 6.18 and 6.19 show the SNR and the BER for various values of the ring
resonator's roundtrip loss coefficient, r, for a 16-node ONoC working at a data rate
of 12.5 Gbps. When no mistuning exists, the SNR is between 21 and 26 dB, resulting
in a BER in the range of 10⁻²⁴ to 10⁻⁹ bit⁻¹ for a roundtrip loss coefficient r between
0.03 and 0.01, respectively. As the detuning increases, the SNR decreases and the
BER increases. As the detuning value increases beyond 0.4 nm, the BER becomes
unacceptable, resulting in unreliable data communication irrespective of the
roundtrip loss coefficient.
Figure 6.20 shows the BER for a 16-node ONoC operating at various data rates,
corresponding to a roundtrip loss coefficient of 0.02 and coupling coefficients
k1 and k2 of 0.38 and 0.08, respectively. With calibration and careful design
resulting in a maximum detuning value of 0.2 nm, the 16-node ONoC with the
previous parameters working at a 4 Gbps data rate can achieve communication with
a BER of 10⁻¹⁹ bit⁻¹, which is considered highly reliable compared to the SONET
requirement. For the same ONoC configuration, the BER worsens as the data rate
becomes higher, which imposes more constraints on the calibration for achieving an
acceptable BER. On the other hand, implementing the micro-resonator filter with a
photonic technology that can realize a roundtrip loss coefficient of 0.01 achieves a
very low BER and tolerates larger detuning values even at high data rates, as
Fig. 6.21 illustrates.
Fig. 6.20 BER against detuning (nm) for data rates of 4, 8, and 12.5 Gbps
Fig. 6.21 BER against detuning (nm) for data rates of 4, 8, and 12.5 Gbps
Fig. 6.22 BER against detuning (nm) for ONoC sizes of 16, 32, and 64 IPs
As the ONoC size increases (i.e., the number of micro-resonator switches and the
number of required resonant wavelengths increase), the photonic channel spacing
becomes smaller for the same FSR, and the photonic signal encounters through
attenuation in a larger number of micro-disks. This increases the interference and
the router losses, which decreases the SNR and increases the BER. Figure 6.22
shows the BER for different ONoC sizes working at a 12.5 Gbps data rate,
corresponding to a roundtrip loss coefficient of 0.01 and coupling coefficients k1
and k2 of 0.38 and 0.08, respectively. Achieving an acceptable BER in large
ONoCs requires a larger FSR, which would impose more stringent constraints on
the design of the micro-resonator filters (both in terms of choice of parameters and
in the development of improved filter structures).
Conclusion
In this chapter, we have introduced the concept and the micro-architecture of a new
router called Electrical Distributed Router as a wrapper for the ONoC. We have also
presented a novel layered protocol architecture for the ONoC. The Network Layer
in the proposed protocol stack is flexible enough to accommodate various router
architectures realizing the same function. The performance of the ONoC layered
architecture has been investigated both at system level and at the physical level. In
our optical performance analysis, we explored and analyzed the SNR and the BER
of the ONoC against maximum detuning under various system specifications and
technology parameters. In passive-type ONoCs such as that under analysis, there is
a single central optical router and the buffering queues are located at the end of the
communication path (compared to the electrical NoC). Thus, resource contentions
are low in the case of the ONoC, and hence the performance is expected to be high.
The models and analyses described in this work bear out this conclusion. In
particular, the performance analysis showed that the ONoC is capable of absorbing
a high level of traffic before saturation. Moreover, the experimental results
demonstrated the scalability of the ONoC and showed that it delivers bandwidth
comparable to, or in large networks even better than, that of the ENoC.
References
Reconfigurable Networks-on-Chip

Abstract There is little doubt that the most important limiting factors of the
performance of next-generation chip multiprocessors (CMPs) will be the power
efficiency and the available communication speed between cores. Photonic
networks-on-chip (NoCs) have been suggested as a viable route to relieve the off- and
on-chip interconnection bottleneck. Low-loss integrated optical waveguides can
transport very high-speed data signals over longer distances as compared to on-
chip electrical signaling. In addition, novel components such as silicon microrings,
photonic switches and other reconfigurable elements can be integrated to route
signals in a data-transparent way.
In this chapter, we look at the behavior of on-chip network traffic and show how
the locality in space and time which it exhibits can be advantageously exploited by
what we will define as slowly reconfiguring networks. We will review existing
work on photonic reconfigurable NoCs, and provide implementation details and a
performance and power characterization of our own reconfigurable photonic NoC
proposal in which the topology is adapted automatically (on a microsecond scale) to
the evolving traffic situation by use of silicon microrings.
W. Heirman (*)
Computer Systems Laboratory, Ghent University,
Sint-Pietersnieuwstraat 41, Gent, 9000, Belgium
e-mail: wim.heirman@ugent.be
I. Artundo
iTEAM, Universidad Politécnica de Valencia, Valencia, Spain
e-mail: iiarmar@iteam.upv.es
C. Debaes
Department of Applied Physics and Photonics, Vrije Universiteit Brussel, Brussel, Belgium
e-mail: christof.debaes@vub.ac.be
Introduction
Power efficiency has become one of the prime design considerations within today's
ICT landscape. As a result, power density limitations at the chip level have placed
constraints on further clock speed improvements and pushed the field into increased
parallelism. This has led to the development of multicore architectures or chip mul-
tiprocessors (CMPs) [20]. In the embedded domain, a similar evolution resulted in
the emergence of multi-processor systems-on-chip (MPSoCs), which combine sev-
eral general and special purpose processors with memory banks and input/output
(I/O) devices on a single chip [44].
As such, both CMPs and MPSoCs have begun to resemble highly parallel
computing systems integrated on a single chip. One of the most promising paradigm
shifts that has emerged in this domain is the packet-switched network-on-chip
(NoC) [15]. Since interconnect resources in these networks are shared between
different data flows, they can operate at significantly higher power efficiencies than
fixed interconnect topologies. However, due to the relentless increase in required
throughput and number of cores, the links of those networks are starting to stretch
beyond the capabilities of electrical wires. In fact, some recent CMP prototypes
with eighty cores show that the power dissipated by the NoC accounts for up to 25%
of the overall power [40].
Meanwhile, recent developments in integrating photonic devices within CMOS
technology have demonstrated photonic interconnects as a viable alternative for
high performance off-chip and global on-chip communication [67]. This has sparked
interest among several research groups to propose architectures with photonic
NoCs [10, 12, 66]. Nevertheless, using optical links as mere drop-in replacements
for the connections of electronic packet-switched networks is not yet a reality.
Conversion at each routing point from the optical to the electrical domain and back
can be power inefficient and increase latency. But novel components, such as silicon
microring resonators [89], which can now be integrated on-chip, are opening new
possibilities to build optical, switched interconnection networks [49, 77].
In a first step, we will take a look at exactly how reconfiguration helps to improve
network latency and power requirements. As a first approximation, energy usage
and packet latency increase mainly with the number of hops a network packet has to
travel. Once we have fixed the network's topology and the mapping of computational
threads to the processors at each network node, the characteristics of the resulting
network traffic are mostly defined. In reconfigurable networks, one exploits
certain properties of this network traffic to minimize the number of hops
packets have to travel. The following section analyzes these network traffic
properties in detail and describes how they can be used to trigger network
optimization through reconfiguration.
7 Reconfigurable Networks-on-Chip 203
To do this we will look at network traffic at different time scales. At each of these
scales, a different mechanism is at play that gives structure to the network traffic
and, if understood by a network designer, provides insight into how traffic and
network interact. This in turn can lead to opportunities for improving network
performance, lowering power usage, or increasing reliability.
While a large body of existing work on network traffic locality is set in multi-chip
multi-processor systems such as servers or supercomputers, only more recent work
considers the same effect in on-chip settings. Indeed, parallel (super-)computing
has been in existence since the 1980s, and has had much time to mature as a research
field. Yet the growing number of cores per chip [85] will make the conclusions
drawn for off-chip networks valid for on-chip networks as well.
In fact, when compared to off-chip networks, the on-chip variants are usually
situated at an architectural level that is closer to the processor. The bandwidth and
latency requirements imposed on them are therefore much more stringent. Figure 7.1
shows the system-level architectural difference: on-chip networks mostly connect
between the L1 and L2 caches (Fig. 7.1, top), while off-chip networks are connected
after the L2 cache or even after main memory (Fig. 7.1, bottom). In multi-chip
systems, a larger fraction of memory references can therefore be serviced without
requiring use of the interconnection network, yielding lower network bandwidth
and latency requirements.¹ On-chip networks, on the other hand, will be used much
more often: each memory access that doesn't hit in the first-level cache, typically
once every few thousand memory references for each processor, results in a remote
memory operation, versus once every few million memory accesses for a typical
off-chip network. Additionally, the network latency that can be tolerated from an
on-chip network will be much lower, on the order of a few tens of nanoseconds,
versus multiple hundreds of nanoseconds for a typical off-chip network.
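The order-of-magnitude gap described above can be made concrete with a back-of-envelope sketch; the figures below are the text's rough numbers, and treating each instruction as roughly one memory reference is our own illustrative simplification:

```python
def remote_ops_per_second(instr_per_sec, refs_per_remote_op):
    """Rough rate of network transactions a single core generates."""
    return instr_per_sec / refs_per_remote_op

# ~10**9 instructions/s per core; one remote operation every few thousand
# references on-chip versus every few million references off-chip
# (illustrative values of 2,000 and 2,000,000 respectively).
on_chip = remote_ops_per_second(1e9, 2_000)       # ~5e5 network ops/s
off_chip = remote_ops_per_second(1e9, 2_000_000)  # ~5e2 network ops/s

print(on_chip / off_chip)  # 1000.0: roughly three orders of magnitude
```

The same core thus exercises an on-chip network about a thousand times more often than an off-chip one, which is why the latency and bandwidth requirements differ so sharply.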
It is known that memory references exhibit locality in space and time, in a fractal or
self-similar way [24, 60]. This locality is commonly exploited by caches to improve
performance. Due to the self-similar nature of locality, this effect is present at all
time scales, from the very fast nanosecond scales exploited by first-level caches,
to the micro- and millisecond scales which are visible on the interconnection
network.

¹ Or, in a message-passing system, processors can work on local data for a longer time before
messages need to be sent with new data.
204 W. Heirman et al.
² See [51] for the original description of Rent's law, relating the number of devices in a subset of an
electronic circuit to its number of terminals, [14] for a theoretical derivation of the same law, and
[23] for an extension of Rent's rule which replaces the number of terminals with network
bandwidth. In essence, a low Rent exponent (near zero) signifies very localized communication,
such as nearest-neighbor only, while a very high Rent exponent (near one) denotes global,
all-to-all communication.
Fig. 7.2 Estimated Rent exponent (left) and relative per-node bandwidth (right) through time for
the water.sp benchmark run on 64 nodes
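The bandwidth version of Rent's rule referred to in footnote 2 takes the form B = b · g^p, so the exponent plotted in Fig. 7.2 can be estimated by fitting a line in log-log space to (group size, bandwidth) samples. A minimal sketch on synthetic data (the power-law constants are illustrative, not measured values):

```python
import math

def estimate_rent_exponent(group_sizes, bandwidths):
    """Fit B = b * g**p by least squares in log-log space; return p (the slope)."""
    xs = [math.log(g) for g in group_sizes]
    ys = [math.log(b) for b in bandwidths]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Illustrative: bandwidth leaving groups of 2, 4, 8, 16 nodes, generated to
# follow B = 10 * g**0.7 exactly, so the fit recovers p = 0.7.
sizes = [2, 4, 8, 16]
bws = [10 * g ** 0.7 for g in sizes]
print(round(estimate_rent_exponent(sizes, bws), 3))  # 0.7
```

On real traces the samples are noisy, so the fitted slope is an estimate; a value near zero indicates nearest-neighbor traffic, a value near one indicates all-to-all traffic.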
Context Switching
In systems where the number of threads is greater than the number of processors,
multiple threads are time-shared on a single processor. This is usually the case in,
for instance, web and database servers where context switches happen when a thread
needs to wait while an I/O operation is completed. Each time a processor switches
to a different thread, this new thread will proceed to load its data set into the cache.
This causes a large burst of cache misses. Sometimes all of the thread's data can be
found in the local memory of the processor's node, but often remote memory
accesses are required. In this case, the thread switch causes a communication burst.
One such example is the case where a thread has just woken up because its
I/O request was completed: the thread will now read or write new data on another
node's memory or I/O interface.
A study of these context switch induced bursts was done in [4]. One experiment
time-shared multiple SPLASH-2 benchmarks [88] on the same machine, another
used the Apache web server loaded with the SURGE request generator [7] to study
an I/O-intensive workload. A clear correlation was found between context switches
and bursts. This is illustrated in Fig. 7.3, which shows the traffic generated by a
single node through time and the points where context switches occurred. Here, four
instances of the cholesky benchmark, with 16 threads each, were run on a single
16-node machine. Solid lines denote a context switch on this node; at this point, a
burst of outgoing memory requests is generated to fill the local cache with the new
thread's working set. Dashed lines show context switches on other nodes. In some
Fig. 7.3 Traffic flow (MB/s) observed in and out of a single node through time, while running four
16-thread cholesky applications on a single 16-node machine. Solid arrows are shown when a
context switch occurs on this node; dashed lines denote context switches on other nodes [4]
³ Often, the operating system tries to avoid context switches occurring at the same time on all
nodes, as this would initiate communication bursts on all nodes simultaneously, which can easily
saturate the whole network.
Scenarios
A lot of the research currently being done in the context of MPSoC design
revolves around system scenarios. This concept groups system behaviors that are
similar from a multidimensional cost perspective (such as resource requirements,
delay, and energy consumption) in such a way that the system can be configured to
exploit this cost similarity [21]. Often, scenarios can be traced back to a certain
usage pattern of the system. Modern cellular phones, for instance, can be used to
watch video, play games, browse the internet, and even make phone calls. Each of
these usage scenarios imposes its own specific requirements on the device, in terms
of required processing power, use of the various subcomponents (graphics, radios,
on-chip network), etc. At design-time, these scenarios can be individually opti-
mized. Mechanisms for predicting the current scenario at runtime, and for switching
between scenarios, are also being investigated.
The system configuration, which is the result of the system being operated in a
specific scenario, consists of the setting of a number of system knobs which allow
trade-offs to be made between performance and power, among other cost metrics.
One well-known technique used in this case is dynamic voltage and frequency
scaling (DVFS), which changes the processor's clock speed and core voltage [45].
This allows a designer to choose between high-performance, high-power operation
when needed to meet a real-time deadline, and low-power operation when possible.
[21] describes an example system built around an H.264 video decoder, which has
a fixed per-frame deadline (at 30 frames per second) but a variable computational
complexity per frame (depending on the video frame type, complexity, level of
movement, etc.). By choosing the correct DVFS setting for each frame, the energy
required for decoding lower-complexity frames could be reduced by up to 75%,
while keeping the hardware fixed.
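The per-frame DVFS choice described above can be sketched as a simple selection over operating points; the frequency/voltage table and the cycles-times-V² energy model are illustrative assumptions, not the actual design of [21]:

```python
# Illustrative DVFS operating points: (frequency in MHz, core voltage in V)
DVFS_POINTS = [(200, 0.9), (400, 1.0), (600, 1.1), (800, 1.2)]

def pick_dvfs(cycles_needed, deadline_ms):
    """Choose the lowest-energy operating point that still meets the frame
    deadline; per-frame energy is modeled as proportional to cycles * V**2."""
    feasible = [(f, v) for f, v in DVFS_POINTS
                if cycles_needed / (f * 1e3) <= deadline_ms]  # time in ms at f MHz
    if not feasible:
        return DVFS_POINTS[-1]  # deadline miss unavoidable: run flat out
    return min(feasible, key=lambda fv: fv[1] ** 2)  # lowest V**2 wins

# 30 fps gives a 33.3 ms deadline per frame; frame complexity varies.
print(pick_dvfs(5_000_000, 33.3))   # (200, 0.9): easy frame, slow and cool
print(pick_dvfs(25_000_000, 33.3))  # (800, 1.2): hard frame needs full speed
```

Network reconfiguration slots into this pattern as another knob: a scenario change would select a network configuration the same way a frame complexity estimate selects a DVFS point.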
In this design pattern, network reconfiguration can easily be integrated as another
system knob. Communication requirements can be profiled at design-time [37],
while runtime scheduling and mapping can be done to optimize communication
flows and configure the network accordingly [62]. Changes to network parameters
(link speed and width) or topology (adding extra links) can thus be done in response
to system scenario changes.
other. In the same way, information (current velocities and direction, water
temperature) flows between the processors responsible for these parts. Clearly, if the
processors themselves are neighbors on the communication network (i.e. connected
directly), this makes for very efficient communication because a large fraction of
network traffic does not need intermediate nodes. There is a similar communication
pattern in several other physical simulations, where data is distributed by dividing
space in 1-D, 2-D or 3-D grids and communication mainly happens between
neighboring grid points. Other physical mechanisms, such as gravity, work over
long distances. Cosmic simulations therefore require communication among all
processors (although the traffic intensity is not uniform).
An important property here is how many communication partners each processor
has. In some cases, the number of communication partners is higher than the net-
work fan-out; or the topology, created by connecting all communication partners,
cannot be mapped to the network topology using single-hop connections only. Then,
some packets will have to be forwarded by intermediate nodes, making
communication less efficient. For instance, when communication is structured as a
tree, which is the case for several sorting algorithms, it is not obvious how threads
and data should be placed on a ring network. In a client-server architecture, where
one thread is the server answering requests from all other threads, the fan-out of the
server thread is extremely high. The node running this thread will never have an
equally high physical fan-out. In those cases, a large fraction of network traffic will
require forwarding.
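The fraction of traffic that needs forwarding, given a traffic pattern and a topology, can be measured directly; here is a small sketch using an illustrative 8-node ring with uniform all-to-all traffic:

```python
def forwarded_fraction(traffic, adjacent):
    """Fraction of traffic volume whose endpoints are not direct neighbors,
    and which must therefore be forwarded by intermediate nodes."""
    total = fwd = 0.0
    n = len(traffic)
    for s in range(n):
        for d in range(n):
            if s == d:
                continue
            total += traffic[s][d]
            if not adjacent(s, d):
                fwd += traffic[s][d]
    return fwd / total

n = 8
ring = lambda s, d: (s - d) % n in (1, n - 1)  # direct neighbors on a ring
uniform = [[1.0] * n for _ in range(n)]        # all-to-all traffic

# On an 8-node ring each node reaches only 2 of its 7 peers directly,
# so 5/7 of uniform traffic must be forwarded.
print(forwarded_fraction(uniform, ring))
```

Swapping in a measured traffic matrix and the adjacency of a candidate topology gives a quick figure of merit for a reconfiguration decision: the lower the forwarded fraction, the fewer hops, and hence the lower the latency and energy.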
Moreover, for some applications each node's communication partners change over
time. This happens, for instance, in algorithms where the work on each data set is
not equal, and work or data is redistributed to balance the workload of all
processors. Another situation arises in scatter-gather algorithms, in which data is
distributed to or collected from a large number of nodes by a single thread, which
thus communicates in turn with different nodes. And sometimes the data set of one
processor simply does not fit in its local memory and has to be distributed over
several nodes; in this case, external memory accesses are required for part of the data.
Regularity in the application is again visible on the network as communication
bursts. Highly regular applications like the ocean simulation will have bursts,
between the nodes simulating neighboring parts of the ocean, that span the entire
length of the program. For other applications, communication is less regular, but
even there, bursts of significant lengths (several milliseconds) can be detected. They
can also be exploited by the same techniques that exploit bursts caused by other
mechanisms explored here.
Application-Driven Reconfiguration
algorithmic level), by someone with a view of the complete program and algorithm
(programmer or compiler), it can be expected that this method allows for very
accurate prediction of the communication pattern, and would therefore result in the
largest gains. It does, however, require a large effort to analyze the application in
this way. Moreover, due to dependencies on the input data, it is not always possible
to predict, at compile time, a fraction of total communication that is large enough
to be of benefit.
A very early example of the approach of application-driven reconfiguration can
be found in 1982, when Snyder introduced the configurable, highly parallel (CHiP)
computer [79]. Its processing elements are connected to a reconfigurable switching
lattice, in which several (virtualized) topologies can be embedded, such as a mesh
for dynamic programming or a binary tree used for sorting.
Another example is the interconnection cached network (ICN) [26]. This
architecture combines many small, fast crossbars with a large, slow-switching
crossbar. By choosing the right configuration of the large crossbar, a large class
of communication patterns can be supported efficiently: meshes, tori, trees, etc.,
can be embedded in the architecture. The large crossbar thus acts (under the control
of the application) as a cache, hence the name ICN, for the most commonly
used connections. This approach is also used, to some extent, in the Earth
Simulator [28]. Its architecture centers around a 640 × 640 crossbar, on which
communication between 640 processing nodes occurs through application-defined
circuits. Inside each processing node, eight vector processors are connected through
a smaller but higher-data-rate crossbar.
The authors of [8] built on the ICN concept and describe a dual-network approach.
Long-lived burst transfers use a fast optical circuit switching (OCS) network, which
is reconfigured using MEMS mirrors (with a switching time of a few milliseconds)
under the control of the application. The other, irregular traffic, which is usually of
a much lower volume, uses a secondary network, the Electronic Packet Switching
(EPS) network, implemented as a classic electrical network with lower bandwidth
but higher fan-out, to obtain low routing latencies on uniform traffic patterns.
Clearly, descriptions of network traffic locality, and the idea that networks can be
reconfigured to exploit this fact, have been around since the days of the first multi-
processors. Demonstration systems using optical reconfiguration were starting to be
built not much later.
In the free-space interconnects paradigm, [76] demonstrated, with the COSINE-1
system dating back to 1991, a manually reconfigurable free-space optical switch
operating with LEDs. This is to our knowledge the first demonstrator that showed
the possibilities of reconfigurable optical systems. Since then, technology has
advanced drastically and new and improved reconfiguration schemes have appeared
in the research scene. In the following paragraphs, we give an overview of the state
of the art in reconfigurable technology, showing that optical reconfiguration is
becoming feasible in the near future, in light of new studies and practical
implementations.
There have been many proposals for reconfigurable optical architectures in the
past, but only a few have been realized in the form of demonstrators. In recent
years, some of them have achieved remarkable results by implementing
reconfiguration in very different ways. One example is the
OCULAR-II system [58, 59], developed by the University of Tokyo and Hamamatsu
Photonics, which is a two-layer pipelined prototype in which the processing ele-
ments, with VCSEL outputs and photodetector input arrays, are connected via mod-
ular, compactly stacked boards. Between each of the layers there is a free-space
optical interconnection system, and by changing the phase pattern displayed on a
phase-modulating parallel-aligned spatial light modulator (SLM), the light paths
between the nodes can be dynamically altered with a speed of 100 ms.
The latest proposed OCULAR-III architecture [13] is a multistage interconnection
network relying on fixed fiber-based block interconnects between stages. These
interconnections are based on modular, reusable and easy-to-align fiber-based
blocks. Network reconfiguration is electronic, though, performed by setting the states
of local crossbars on the processing plane.
Another reconfigurable architecture actually constructed was the Optical Highway [75],
a free-space design which interconnects multiple nodes through a series of relays
used to add/drop thousands of channels at a time (see Fig. 7.4). The architecture
considered here was a network-based distributed memory system (cluster style),
with a 670 nm laser diode as transmitter and a diffractive optical element to produce
a fan-out simulating a laser array. Polarizing optics defined a fixed network
topology, and a polarizing beam splitter deflected channels of a specific polarization
to the corresponding node, with each channel's polarization state determined by
patterned half-wave plates. The system can also be made reconfigurable using an SLM,
allowing the beam path of a single channel to be switched by an electronic control signal
and routed to one of three detectors.
An alternative modular system was presented in 2002, with a powerful
optical interconnection network [1]. The solution is based on a generic optical
7 Reconfigurable Networks-on-Chip 211
Fig. 7.7 SELMOS system: photoresistive materials are put where vertical coupling paths will be formed,
and write beams through the waveguides construct a self-organized micro-optical network [90]
Another approach using liquid crystals (LC) is the work of [9], which implements
reconfiguration using 8 × 4 multilevel phase-only holograms written in a
nematic LC panel. The splitting diffraction efficiency achieved is rather low (15%),
and so is the switching speed (100 ms), at an operational wavelength of 620 nm.
Fig. 7.9 Architectural overview of RAPID. Every node is connected to two scalable intercon-
nects: an optical intraboard interconnect and a scalable remote superhighway [48]
Apart from these implementations, very interesting architecture models have been
proposed in recent years that, to our knowledge, have not been implemented yet.
One example is RAPID, a high-bandwidth, low-power, scalable and reconfigurable optical
interconnect [48]. It is an all-photonic passive network, composed of tunable VCSELs,
photodetectors, couplers, multiplexers, and demultiplexers (see Fig. 7.9). It provides
large bandwidth by using WDM and space division multiplexing (SDM) techniques,
combining them into a multiple-WDM technique that needs fast switching times,
in the order of nanoseconds, over 2-D torus, hypercube, and fat-tree topologies.
There has also been some work on reconfigurable buses, the linear array
with a reconfigurable pipelined bus system (LARPBS) [74] being the best example of a
complete architecture, although again a complete implementation has not yet been
realized. It is a fiber-based optical parallel bus model that uses three folded
waveguides, one for message passing and the other two for addressing via the coincident
pulse technique. Reconfigurability in this model is provided by pairs of 2 × 2
bus-partition optical switches, located between each processor, that can partition the
system into two subsystems with the same characteristics at any of these pairs of
switches by introducing some conditional delay.
The HFAST [46] architecture is an MPI HPC interconnect; it does not target
shared-memory applications, but is still interesting to discuss here from
an architectural point of view. HFAST attempts to minimize the number of
optical transceivers and switch ports required for a large-scale system design,
since the transceivers are both expensive and power hungry, and uses circuit
switches to wire the packet switches together. It tries to minimize pathways
that have bandwidth contention by measuring explicit MPI communication patterns
rather than shared-memory cache-line transfers. HFAST is based on the
observation that short messages are strictly latency bound and benefit from a
completely different low-power network layer, since they rarely hit the bandwidth
limits of the network. Thus, the problem for large messages reduces to
a strict bandwidth-contention minimization problem.
Other architectures with shared or switched topologies include the simultane-
ous optical multiprocessor exchange bus (SOME-bus) from Drexel University
[47], the optical centralized shared bus from the University of Texas at Austin
[29], and Columbia University's data vortex optical packet switching interconnection network [30].
Finally, it has also been suggested, at Washington University, to dynamically
reconfigure a router switch fabric using optical chip-to-chip communication but
CMOS technology for the decision, control, and signal switching functions [50]. The
obtained speedup in packet latency of 1.71 for a 400 ms reconfiguration period
illustrates the clear potential of slow reconfiguration techniques.
Moreover, since an optical channel can offer very high aggregate bandwidth, one
can also use techniques such as fixed time-division multiplexing, as proposed in
[73] with a technique called reconfiguration with time-division multiplexing
(RTDM). With RTDM, only a subset of all possible connections, as required by the
running applications, needs to be optically multiplexed in the network, letting the
network go through a set of personalized configurations.
As a summary, we include in Table 7.1 a brief comparison of the different
reconfigurable optical interconnects presented in this section, according to several
key parameters.
Fig. 7.10 Mapping and routing of different processes into a tiled-mesh topology [41]
The first approach considers the three-step design flow in systems-on-chip, where
each application is divided into a graph of concurrent tasks which, using a set of
available IPs, are assigned and scheduled. Here, a mapping algorithm decides to
which tile each selected IP should be mapped such that the metrics of interest are
optimized. For this approach, mesh topologies are mostly used, due to their regular
two-dimensional structure that results in IP re-use, easier layout, and predictable
electrical properties. [41] uses a branch-and-bound mapping algorithm to construct
a deadlock-free deterministic routing function such that the total communication
energy is minimized (Fig. 7.10). Others, like the NMAP heuristic algorithm [63],
optimize bandwidth by splitting the traffic between the cores across multiple paths,
and [5] uses a genetic-algorithm-based technique. In [81], the communication energy
is minimized subject not only to bandwidth constraints but also to latency constraints.
Here, the energy consumption of the input and output ports at each router node
varies linearly with the injection and acceptance rates.
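The mapping objective shared by these algorithms can be made concrete with a toy example. The sketch below is our own illustration, not the branch-and-bound of [41] or the NMAP heuristic [63]: it exhaustively maps a four-task graph onto a 2 × 2 mesh so that the bandwidth-weighted hop count, a common proxy for communication energy, is minimized.

```python
from itertools import permutations

# Toy mapping of a task graph onto mesh tiles, minimizing a communication
# energy proxy: sum over all flows of (bandwidth demand * hop distance).
# Exhaustive search is only feasible for a handful of tiles.

TILES = [(x, y) for y in range(2) for x in range(2)]  # 2x2 mesh

# TRAFFIC[(i, j)] = bandwidth demand from task i to task j (arbitrary units)
TRAFFIC = {(0, 1): 10, (1, 2): 4, (2, 3): 8, (0, 3): 2}

def hops(a, b):
    """Manhattan distance between two mesh tiles (XY routing)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def comm_energy(mapping):
    """Energy proxy: bandwidth-weighted hop count over all flows."""
    return sum(bw * hops(mapping[s], mapping[d])
               for (s, d), bw in TRAFFIC.items())

best = min(permutations(TILES), key=comm_energy)
print(comm_energy(best))  # 24: every flow mapped onto adjacent tiles
```

Real mappers replace the exhaustive search with branch-and-bound, heuristics or genetic algorithms, since the search space grows factorially with the number of tiles.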
Fig. 7.12 The network nodes of the ReNoC system consist of a router that is wrapped by a topol-
ogy switch, allowing for different logical topologies [82]
Fig. 7.13 Mesh network with additional long-range links inserted according to hotspot traffic
measurements [65]
Fig. 7.14 The Nostrum backbone with the application resource network interface (RNI), that
maps processes to resources in a mesh architecture [61]
Finally, a third approach tries to adapt the communications among the nodes to
the network infrastructure by the creation or reservation of virtual resources over the
physical topology, to maximize and guarantee application performance in terms of
bandwidth and latency delivered. Most of the time, virtual channels (VCs) are cre-
ated as a response to quality-of-service (QoS) demands from applications, corre-
sponding to a loose classification of their communication patterns in four classes:
signaling (for inter-module control signals), real time (representing delay-con-
strained bit streams), read/writes (modeling short data access) and block transfers
(handling large data bursts). For example, the technique of spatial division multiplexing
(SDM), used in [56], consists of allocating only a subset of the link wires to
a given virtual circuit. Messages are digit-serialized on a portion of the link (i.e.
serialized on a group of wires). The switch configuration is set once and for all at
connection setup; no configuration memory inside the router is therefore needed,
and the constraints on the reservation of the circuits are relaxed. [27] introduces a
simple static timing analysis model that captures virtual-channeled wormhole networks
with different link capacities and eliminates the reliance on simulations for
timing estimations. It proposes an allocation algorithm that greedily assigns link
capacities using the analytical model so that packets of each flow arrive within the
required time. The temporally disjoint networks (TDNs) of [61] are used to
achieve several privileged VCs in the network, alongside the ordinary best-effort traffic.
The TDNs are a consequence of the deflective routing policy used, and give rise to
an explicit time-division multiplexing within the network (Fig. 7.14). The NoC
described in [17] provides tight time-related guarantees by a dynamic link arbitration
process that depends on the current traffic and maximizes link utilization.
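As an illustration of the SDM idea of [56], the following minimal sketch (the class and the wire counts are our own assumptions, not from the paper) allocates a subset of a link's wires to each virtual circuit at set-up time:

```python
# Minimal sketch of spatial division multiplexing on a NoC link: each
# virtual circuit is granted a subset of the link's wires at connection
# set-up and keeps it until tear-down, so the router needs no run-time
# arbitration or per-flit configuration memory for that circuit.

class SDMLink:
    def __init__(self, n_wires: int):
        self.free = set(range(n_wires))   # unallocated wire indices
        self.circuits = {}                # circuit id -> allocated wires

    def setup(self, circuit: str, n: int) -> set:
        """Reserve n wires; messages are serialized on this group of wires."""
        if len(self.free) < n:
            raise RuntimeError("link wires exhausted")
        wires = {self.free.pop() for _ in range(n)}
        self.circuits[circuit] = wires
        return wires

    def teardown(self, circuit: str) -> None:
        self.free |= self.circuits.pop(circuit)

link = SDMLink(8)
link.setup("A", 4)      # circuit A gets half the link
link.setup("B", 2)      # circuit B gets a quarter
print(len(link.free))   # 2 wires left for further circuits
```

Because the wire subset is fixed for the circuit's lifetime, this captures the property the text highlights: the configuration is decided once at setup, with no state changes inside the routers afterwards.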
Fig. 7.15 Implementation of the physical architecture of RePNoC and logical topology of an
application pattern [19]
this way intermediate nodes and optimizing application demands, and latency
performance simulations in this case show a 50% decrease compared to a static
photonic NoC (Fig. 7.15).
The basic building blocks for introducing dynamism into a network are switches and
routers. On a photonic NoC, though, these elements must have very limited area
and power requirements, and must integrate well with the processing and
memory elements. That is why silicon photonics poses itself as an ideal candidate
for integration here. [52] gives an overview of state-of-the-art silicon modulators
and switches, with modulation speeds of 4 Gb/s and switching times down to
1 ns in compact (10 μm) 1 × 2, 2 × 2, and 4 × 4 configurations based on microring
resonators pumped at 1.5 μm [53, 72] (Fig. 7.16). Active devices have been realized
in InP as well, though, like the 1 × 16 phased-array switch presented in [80], with a
more modest response time of 11 ns (Fig. 7.17). On a slower timescale, [84] introduce
the use of electro-optic Bragg grating couplers to demonstrate a reconfigurable
waveguide interconnect with switching times of 75 ms operating at 850 nm.
Lacking a cheap and effective way of optically controlling the routing (and doing
possible buffering), most of the approaches described above necessarily work in a
circuit-switched way. And while the actual switching of the optical components can
nowadays be done in mere nanoseconds or less [18], the set-up of an optical circuit

Fig. 7.17 Micrograph of the 1 × 16 optical switch. The total device size is 4.1 mm × 2.6 mm,
including the input/output bends and lateral tapers [80]

still requires at least one network round-trip time, which accounts for several tens of
nanoseconds. As a result, such proposals only reach their full potential at large
packet sizes, or in settings where software-controlled circuit switching can be used
with relatively long circuit lifetimes. Indeed, in [77], packets of several kilobytes are
needed to reach the point where the overhead of setting up and tearing down the optical
circuits (which is done with control packets sent over an electrical network) can
be amortized by the faster optical transmission.
In SoC architectures, and to a lesser extent in CMPs, large direct memory access
(DMA) transfers can reach packet sizes of multiple kilobytes. However, most packets are
coherency control messages and cache line transfers. These are usually latency
bound and very short. In practice, this would mean that most of the traffic would not
be able to use the optical network, as it does not reach the size necessary to compensate
for the latency overhead introduced, and that the promised power savings
could not be realized!⁴

⁴ One might consider using a larger cache line size to counter this, but an increase to multiple
kilobytes would in most cases only result in excessive amounts of false sharing, negating any
obtained performance increase.
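The kilobyte-scale threshold can be sanity-checked with simple arithmetic. All numbers below are illustrative assumptions, not values taken from [77]: a circuit pays off once the faster transmission amortizes the total set-up/tear-down overhead.

```python
# Back-of-the-envelope break-even packet size for circuit switching.
# A circuit wins once:  t_ov + bits/BW_opt < bits/BW_el
#   =>  bits > t_ov * BW_el * BW_opt / (BW_opt - BW_el)
# Bandwidths are hypothetical (Gb/s == bits per ns).

BW_EL, BW_OPT = 10.0, 40.0

def break_even_bytes(overhead_ns: float) -> float:
    """Smallest packet (bytes) for which the optical circuit is faster."""
    return overhead_ns * BW_EL * BW_OPT / (BW_OPT - BW_EL) / 8

print(round(break_even_bytes(50)))     # bare round-trip overhead: ~83 bytes
print(round(break_even_bytes(1000)))   # overhead incl. blocking/retries: ~1667 bytes
```

The comparison shows why the threshold only reaches multiple kilobytes when the effective set-up cost (including contention, blocking and retries, not just the raw round trip) approaches the microsecond range.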
We propose to use the combination of the electrical control network and the
optical circuit-switched links as a packet-switched network with slow
reconfiguration. This idea is based on existing work such as the Interconnection
Cached Network [26]; see [8] for a modern application of the same idea. But
rather than relying on application control of the network reconfiguration, which
requires explicit software intervention and does not agree with the implicit com-
munication paradigm of the shared memory programming model, our approach
provides for an automatic reconfiguration based on the current network traffic.
This concept has been described in [2], and was proven to provide significant per-
formance benefits in (off-chip) multiprocessor settings. Here, we will apply the
same approach to on-chip networks, and model the physical implementation on the
architecture introduced by [71, 77].
Physical Architecture
The photonic NoC proposed by [71] introduces a non-blocking torus topology, con-
necting the different cores of the system, based on a hybrid approach: a high-bandwidth
circuit-switched photonic network combined with a low-bandwidth packet-switched
electronic network. This way, large data packets are routed through a time and wave-
length multiplexed network, for a combined bandwidth of 960 Gb/s, while delay-criti-
cal control packets and data messages with sizes below a certain threshold are routed
through the low-latency electrical layer. As the basic switching element, a 4 × 4 hitless
silicon router is presented in [78], based on eight silicon microring resonators with a
bandwidth per port of 38.5 GHz in a single-wavelength configuration.
An example 16-node architecture is depicted in Fig. 7.18. Each square represents
a 4 × 4 router containing eight microring resonators. In this architecture, each
node has a dedicated 3 × 3 router to inject and eject packets from the network,
represented by the smaller squares. The network nodes themselves are represented by
discs. By means of the electronic control layer, each node first sends a control
packet to make the reservation of a photonic circuit from source to destination.
Once this is done, transmission is done uninterrupted for all data packets. To end
the transmission phase, a control packet is sent back from the destination to free
the allocated resources.
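The control sequence just described can be sketched as an event trace (the message names are ours, not taken from [71]):

```python
# Schematic trace of one circuit-switched transfer: an electrical control
# packet reserves a photonic path, the data then flows uninterrupted over
# the photonic layer, and a release packet sent back by the destination
# frees the allocated resources.

def transfer(src, dst, packets):
    """Yield the control and data events of one transfer."""
    yield ("ctrl", src, dst, "path-setup")      # electrical layer, hop by hop
    for p in packets:
        yield ("data", src, dst, p)             # photonic layer, no buffering
    yield ("ctrl", dst, src, "path-release")    # destination frees the path

events = list(transfer(0, 5, ["burst-0", "burst-1"]))
print(events[0][3], events[-1][3])   # path-setup path-release
```

Note that the data events carry no routing decisions of their own: once the path is reserved, the microrings stay set until the release event, which is what makes the photonic layer circuit-switched.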
For our architecture, we combine a standard electrical network-on-chip with a
dedicated reconfigurable photonic layer, formed by the architecture proposed by
[71]. The photonic layer establishes a set of extra links in a circuit-switched
fashion for certain intervals of time, depending on automated load measurements
over the base topology. The reconfiguration follows the slowly changing dynamics
of the traffic, while the base electronic network layer remains available to route
control and data messages.
Other architectures, similar to [71], have been proposed and can be interchanged
as the physical layer on which to apply our slow reconfiguration architecture. For
instance, [25] avoids the need for an electrical control layer by sending all packets
through an all-optical network using different wavelengths. Still, the separation
between control and data layers, even when they are sent through the same physical
channels, is maintained. Our approach applies to any network architecture where
this distinction is kept, as the reconfigurable layer can be virtually established
irrespective of the underlying physical implementation.
Our network architecture, originally proposed in [32], starts from a base network
with fixed topology. In addition, we provide a second network that can realize a
limited number of connections between arbitrary node pairs: the extra links, or
elinks. A schematic overview is given in Fig. 7.19.
The elinks are placed such that most of the traffic has a short path (a low number of
intermediate nodes) between source and destination. This way, a large percentage of
packets has a correspondingly low (uncongested) latency. In addition, congestion is
lowered because heavy traffic is no longer spread out over a large number of intermediate
links. For the allocation of the elinks, a heuristic is used that tries to minimize the
aggregate hop distance traveled, multiplied by the size of each packet sent over the
network, under a set of implementation-specific conditions: these can be the maximum
number of elinks n, the number of elinks that can terminate at one node (the fan-out, f),
etc. After each interval of length Δt (the reconfiguration interval), a new optimum
topology is computed using the traffic pattern measured in the previous interval. A more
detailed description of the underlying algorithms can be found in [31].
Although the actual reconfiguration, done by switching the microrings, happens
in mere picoseconds, the execution time of the optimization algorithm, which
includes collecting traffic patterns from all nodes and distributing new configuration
and routing data, cannot be assumed negligible. The time this exchange and calculation
takes will be denoted the selection time (tSe). The actual switching of the optical
reconfigurable components will then take place during a certain switching time (tSw),
after which the new set of elinks will be operational. Traffic cannot flow
through the elinks while they are being reconfigured. Therefore, the reconfiguration
process starts by draining all elinks before switching any of the microrings. This
takes at most 20 ns (the time to send our largest packet, which is 80 bytes, over a
40 Gb/s link). During the whole reconfiguration phase, network packets can still
use the base network, making our technique much less costly than some other, more
intrusive reconfiguration schemes, where all network traffic needs to be stopped and
drained from the complete network during reconfiguration.
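The drain-time bound quoted above is easy to verify:

```python
# Worst-case drain time: one maximum-size packet (80 bytes) still in flight
# on a 40 Gb/s elink. At 40 bits per ns this is 16 ns, safely within the
# 20 ns bound stated in the text.

PACKET_BYTES = 80
LINK_GBPS = 40.0                      # bits per ns

drain_ns = PACKET_BYTES * 8 / LINK_GBPS
print(drain_ns)                       # 16.0
```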
The reconfiguration interval, denoted by Δt, must be chosen as short as possible to be
able to follow the dynamics of the evolving traffic and obtain a close-to-optimal topology.
On the other hand, it must be significantly larger than the switching time of the chosen
implementation technology, to amortize the fraction of time that the elinks are off-line.
Gathering traffic information for each of the nodes to compute the optimal net-
work configuration is straightforward if each node can count the number of bytes
sent to each destination. Collecting this data at a centralized arbiter over our high-
performance interconnect only takes one network round-trip time. Finally, compu-
tation needs to be done on this data at the centralized unit. This computation is
largely based on heuristics and pre-computed tables, and can therefore quickly
determine a near-optimal elink configuration and its corresponding routing tables.
We assume that this selection algorithm can be executed on one of the system's
processors, and even for a 64-node network we expect this to take only a few
microseconds. Of course, this will only hold for slowly reconfiguring networks, where
the reconfiguration interval is long enough to amortize this delay.
If we want to reduce the reconfiguration interval even further, we will have to
move to a decentralized scheme, where traffic information is spread only locally to
neighboring nodes, and the selection is performed at each processor with just that
local information.
Applying this architecture to the specifics of a NoC, we can consider the network
presented in [71] as an instantiation of our general reconfigurable network
model, where the number of elinks n equals the number of processing nodes p, and
with a maximum fan-out per node of one (n = p, f = 1). This way, each extra link
is considered a dedicated circuit of the non-blocking mesh. The
reconfiguration interval Δt was fixed to 1 ms.
With optical components that can switch in the 30 ps range, the switching time (tSw)
takes only a negligible fraction of the reconfiguration interval Δt. However, the selection
time (tSe) remains significant, as it requires the exchange of data over the network.
We propose a schedule in which we allow the selection to take up to a full
reconfiguration interval. The three phases (shown in Fig. 7.20) of collecting traffic
information (measure), making a new elink selection (select), and adjusting the
Fig. 7.20 Sequence of events in the on-chip reconfigurable network. During every reconfiguration
interval of 1 ms, traffic patterns are measured. In the next interval, the optimal network configuration
is computed for such patterns. One interval later, this configuration is enabled. The reconfiguration
itself takes place at the start of each configure box, but the switching time is very short (just 2% of
the reconfiguration interval in this architecture) and is therefore not shown here
C = Σ_{i,j} d(i, j) T(i, j)    (7.1)
with d(i, j) the distance between nodes i and j, which is a function of the elinks that
are selected to be active, and T(i, j) the number of bytes sent from node i to node j in
the time interval of interest.
Since the time available to perform this optimization is equal to the
reconfiguration interval (1 ms here), we use a greedy heuristic that can quickly find a
set of active elinks that satisfies the constraints imposed by the architecture and
has an associated cost close to the global optimum. More details on this algorithm
can be found in [33].
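A minimal sketch of such a greedy selector follows. It is our own illustration, not the actual algorithm of [33]: a p-node ring stands in for the torus distance, elinks are assumed to collapse their endpoints' distance to one hop, and the fan-out constraint f is omitted for brevity.

```python
# Toy greedy elink selection against the cost C = sum_{i,j} d(i,j) * T(i,j)
# of Eq. (7.1).

def base_dist(i, j, p):
    """Hop distance on a p-node ring (stand-in for the torus distance)."""
    return min((i - j) % p, (j - i) % p)

def cost(traffic, elinks, p):
    """Eq. (7.1), with an elink collapsing its pair's distance to one hop."""
    return sum(t * (1 if (i, j) in elinks else base_dist(i, j, p))
               for (i, j), t in traffic.items())

def greedy_select(traffic, p, n):
    """Pick the n source-destination pairs with the largest cost reduction."""
    saving = lambda ij: traffic[ij] * (base_dist(*ij, p) - 1)
    return set(sorted(traffic, key=saving, reverse=True)[:n])

traffic = {(0, 4): 100, (1, 2): 90, (0, 7): 30}   # bytes per interval
elinks = greedy_select(traffic, p=8, n=2)
print(cost(traffic, set(), 8), cost(traffic, elinks, 8))  # 520 220
```

The heavy long-distance flow (0, 4) is selected first, since its traffic volume times saved hops dominates; this is the same traffic-weighted-distance trade-off that Eq. (7.1) formalizes.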
⁵ Actually, another set of VCs is used, since separate request and reply VCs are already employed
to avoid fetch deadlocks at the protocol level.
For routing packets through the elinks we use a static routing table: when
reconfiguring the network, the routing table in each node is updated such that, for
each destination, it tells the node to route packets either through an elink starting at
that node, toward the start of an elink on another node, or straight to the destination,
the latter two using normal dimension-order routing.
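In code, the per-node decision could look like this (the table layout is our own illustration, not the implementation of [32]):

```python
# Sketch of the per-node routing decision: a static table, rebuilt at each
# reconfiguration, tells a node to use a local elink, to forward toward
# another node's elink entry, or to go straight to the destination; the
# latter two cases fall back on dimension-order routing.

# route_table[node][dest] -> ("elink", exit) | ("via", entry) | ("base",)
route_table = {
    0: {5: ("elink", 5),      # elink from node 0 terminates at 5: take it
        6: ("via", 2),        # walk to node 2, whose elink shortens the path
        1: ("base",)},        # close destination: plain dimension-order
}

def next_action(node, dest):
    kind, *arg = route_table[node].get(dest, ("base",))
    if kind == "elink":
        return f"take elink {node}->{arg[0]}"
    if kind == "via":
        return f"dimension-order toward elink entry {arg[0]}"
    return "dimension-order toward destination"

print(next_action(0, 5))
print(next_action(0, 6))
```

Because the table is static between reconfigurations, the per-packet routing decision stays a single lookup, with no dynamic state in the routers.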
Simulation Platform
We have based our simulation platform on the commercially available Simics simu-
lator [57]. It was configured to simulate a multicore processor inspired by the
UltraSPARC T1/T2, which runs multiple threads per core (four in our experiments).
This way, the traffic of 64 threads is concentrated on a 16-node network, stressing
the interconnection network with aggregated traffic. The processor core is modelled
as a single issue scalar core, running at 1 GHz. Stall times for caches and main
memory are set to conservative values for CMP settings (2 cycles access time for
L1 caches, 19 cycles for L2 and 100 cycles for main memory). Cache coherency is
maintained by a directory-based coherency controller at each node, which uses a
full bit-vector directory protocol. The interconnection network models a packet-switched
4 × 4 network with contention and cut-through routing. The time required
for a packet to traverse a router is three cycles. The directory controller and the
interconnection network are custom extensions to Simics. Both the coherency traffic
(read requests, invalidation messages etc.) and data traffic are sent over the base
network. The resulting remote memory access times are around 100 ns, depending
on network size and congestion.
The proposed reconfigurable NoC has been configured with a link throughput
of 10 Gb/s in the base network. To model the elinks, a number of extra point-to-
point links can be added to the base torus topology at the start of each
reconfiguration interval. The speed of these reconfigurable optical elinks was
assumed to be four times that of the base network links (40 Gb/s). For
evaluation, we have compared the proposed solution with three standard, non-
reconfigurable NoCs: a 10 Gb/s electrical NoC, a 40 Gb/s electrical NoC and a
40 Gb/s photonic NoC.
The network traffic is the result of both coherency misses and cold, capacity and
conflict misses. To make sure that private data transfer does not become excessive,
a first-touch memory allocation was used that places data pages of 8 KB on the node
of the processor core that first references them. Also, each thread is pinned to
one processor (using the Solaris processor_bind() system call), so the thread stays
on the same network node as its private data for the duration of the program.
Power Modeling
Workload Generation
While most network performance studies employ simple synthetic network traffic
patterns (such as hotspot, random uniform, butterfly, etc.), and are able to obtain
reasonable accuracy with them, this is not possible for reconfigurable networks.
Indeed, the very nature of reconfigurable networks makes them exploit long-lived
dynamics, present in the network traffic generated by real applications but
absent in most simple synthetic patterns.
The SPLASH-2 benchmark suite [88] was used as the workload. It consists
of a number of scientific and technical algorithms using a multi-threaded,
shared-memory programming model. Still, the detailed execution-driven simu-
lation of a single SPLASH-2 benchmark program takes a significant amount of
computation time. We therefore developed a new method of generating syn-
thetic traffic which does take the required long-lived traffic dynamics into
account. The traffic model and methodology for constructing the traces is
described in [35]. This way, we could quickly yet accurately simulate the per-
formance and power consumption of our network under realistic traffic
conditions.
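A toy generator (our own, far simpler than the model of [35]) shows the kind of long-lived locality such synthetic traffic must preserve:

```python
import random

# Traffic source with long-lived locality: each source keeps a "hot"
# destination for many consecutive packets before re-drawing it, unlike
# memoryless uniform traffic. A reconfigurable network exploits exactly
# this persistence. All parameters are illustrative.

def traffic(src, nodes, hold=1000, p_hot=0.8, n=5000, seed=42):
    """Yield n destinations; the hot spot changes only every `hold` packets."""
    rng = random.Random(seed)
    hot = rng.randrange(nodes)
    for i in range(n):
        if i % hold == 0:
            hot = rng.randrange(nodes)            # slow dynamics
        if rng.random() < p_hot and hot != src:
            yield hot                             # long-lived favorite
        else:
            yield rng.randrange(nodes)            # background uniform traffic

dests = list(traffic(src=0, nodes=16))
print(len(dests))
```

Averaged over a reconfiguration interval, such a trace produces a stable, skewed source-destination distribution, which is what makes an interval-based elink selection worthwhile; a memoryless uniform generator would not.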
Simulation Results
A direct comparison with our reference architecture [78] is difficult, since in the
original case only large DMA transfers (of which there are usually very few in
realistic CMP systems) would use the optical network, while most of the traffic,
both by aggregate size and by latency sensitivity, necessarily sticks to the electrical
control network. Yet, just comparing the performance of our solution with a
Network Performance
Table 7.3 Comparison of the link activity and average remote memory access latency for the
different types of networks-on-chip

                                   BWmax    BWavg    Tmem       dhop     Ptot
                                   (Gbps)   (Gbps)   (#cycles)  (#hops)  (mW)
Electrical NoC                     10       5.70     308.9      2.13     315
Reconfigurable NoC                                   202.1      1.66     378
  Base electrical NoC              10       5.21
  Reconfigurable photonic NoC      40       5.08
High-speed NoC                              17.28    87.2       2.13
  Electrical NoC                   40                                    985
  Photonic NoC                     40                                    814
In Fig. 7.22 and Table 7.3, we show the average number of hops per byte sent.
Compared with the non-reconfigurable topology, in which the network consists
of just a 2-D torus, we obtain a clear 22% reduction of the hop distance. Similar
simulations on larger-scale CMPs with up to 64 cores show a 34.7% reduction in hop
distance. This will increase further as the network scales [3].
There is only a small variability between the different applications measured
because, at any time, there is exactly the same number of elinks present. The only
thing that can differ is that, sometimes, slightly longer routes are created; but
since the elink selection always tries to place the heaviest traffic on short routes, the
average hop distance will also not differ much. Note that the number of active
microrings depends on the shape of the traffic pattern (the source-destination pair
distribution), albeit not by a great amount, but it does not depend on the traffic
magnitude.
Fig. 7.23 Total power consumption per interval under different network architectures
Network Usage
A key factor in understanding the power consumption is the usage of the switches
and links in the network. For a normal r × p/r torus topology, the diameter (maximum
number of hops between any node pair) is [69]:

D = r/2 + p/(2r)    (7.2)

where p is the number of processors and r is the size of the torus. In regular tori this
makes D = 4 hops when p = 16. For our benchmark applications, the average hop
distance is 2.13 for p = 16.
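Eq. (7.2) can be cross-checked against a breadth-first search on an explicit torus graph (a small sketch, assuming the regular 4 × 4 case):

```python
from collections import deque

# Cross-check of Eq. (7.2): for an r x p/r torus, D = r/2 + p/(2r).
# A BFS over an explicit 4x4 torus should reproduce D = 4 hops for p = 16.

def torus_neighbors(x, y, cols, rows):
    return [((x + 1) % cols, y), ((x - 1) % cols, y),
            (x, (y + 1) % rows), (x, (y - 1) % rows)]

def diameter(cols, rows):
    nodes = [(x, y) for x in range(cols) for y in range(rows)]
    best = 0
    for start in nodes:
        dist = {start: 0}
        q = deque([start])
        while q:                      # BFS from this start node
            u = q.popleft()
            for v in torus_neighbors(*u, cols, rows):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

p, r = 16, 4
print(diameter(r, p // r), r // 2 + p // (2 * r))   # both print 4
```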
In our simulations, we use a folded torus topology as shown in Fig. 7.18. The
complete topology contains 4p hitless switches (4 × 4 optical routing elements)
and p gateway switches. We found that the mean number of (non-gateway) switches
used per elink during each reconfiguration interval is 3.28. This results in a total of
37.5 active optical routing elements (out of the 64 available ones), of which 13 routers
are traversed by more than one elink. Across all routers, on average 73.7 microrings
are in the active state.
Table 7.3 furthermore details the average data volume over the different NoC
architectures. For the proposed reconfigurable NoC we can see that the total volume
is almost evenly distributed between the electrical base links and the high-speed
reconfigurable elinks. This clearly indicates that the heuristic to allocate the
Power Consumption
In this section, we evaluate the power consumed by the NoCs, including the
power for the microring resonators when establishing the elinks on the
reconfigurable layer.
An estimate of the power consumed by the NoC can be calculated by combining
the parameters given in Table 7.2 with the activity of the links and optical switches
in sections Network Performance and Network Usage. In comparison to the
low-speed NoC with fixed topology, the reconfigurable NoC consumes modestly more
power (20%) while significantly improving average network performance. Moreover,
in comparison to the high-speed fixed NoCs, the proposed solution consumes
significantly less power, corresponding to reductions of 54% and 62% as compared
to the fixed photonic and electrical NoCs, respectively (Fig. 7.23).
It is important to note at this stage that we have adopted rather conservative
memory stall times (see section Simulation Platform). Future CMPs, equipped
with improved cache hierarchies, will impose significantly higher throughput
demands on the intercore network and further increase the power consumption of
the NoC. In addition, the proposed solution based on the reconfigurable NoC will
benefit from this scaling as it will decrease the network traffic contention between
the most active communicating pairs.
The estimated power consumption is of course highly dependent on the parameters
chosen in Table 7.2, which were taken from [77]. Nevertheless, the conclusions
that we draw from the results are generic. The proposed reconfigurable NoCs will
always perform better than the fixed NoC consisting solely of a base network. The
reason is that in our proposal, links with more bandwidth and lower latency are
added only where and when relevant. When compared to high-speed NoCs our pro-
posal consumes less power since it requires fewer high-speed links and transceivers.
The proposed photonic NoC thus allows for a very efficient resource utilization of
the high-speed transceivers.
In our study, we assumed that the silicon microrings do not consume energy in
their off-state. This justifies our choice to adopt the proposal by [77] for the
photonic links, where a network of p nodes requires 8p² microring switches
(excluding the gateway switches). Temperature detuning of the microrings might
require extra power dissipation to stabilize the temperature locally at each ring.
In recent work [54], however, silicon microrings were demonstrated with a
temperature dependence as low as 0.006 nm/°C.
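The quoted scaling from [77] can be stated as a one-line check; the function below simply evaluates 8p² for a few network sizes.

```python
# Microring switch count for the photonic-link proposal of [77]: a network of
# p nodes requires 8*p**2 microring switches, excluding the gateway switches.

def microring_count(p):
    """Number of microring switches for a p-node network, per the scaling from [77]."""
    return 8 * p * p

for p in (4, 8, 16):
    print(f"p = {p:2d} nodes -> {microring_count(p)} microring switches")
```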
Conclusions
In this chapter, we first described the different forms of network traffic locality, and
acknowledged the possibility of exploiting this locality, through network
reconfiguration, to optimize network performance in terms of several important
characteristics such as bandwidth, latency, power usage and reliability. We also
surveyed existing work on reconfigurable optical on-chip networks, covering both
demonstrators and architectural proposals. Finally, we presented our own proposal for a
self-adapting, traffic-driven reconfigurable optical on-chip network. We believe that
optical, reconfigurable on-chip networks offer a viable and attractive road towards
future, high-performance and high core-count CMP and MPSoC systems.
Acknowledgements This work was supported by the European Commission's 6th FP Network of
Excellence on Micro-Optics (NEMO), the BELSPO IAP P6/10 photonics@be network sponsored
by the Belgian Science Policy Office, the GOA, the FWO, the OZR, and the Methusalem and
Hercules foundations. The work of C. Debaes is supported by the FWO (Fund for Scientific
Research Flanders) under a research fellowship.
References
23. Greenfield D, Banerjee A, Lee JG, Moore S (2007) Implications of Rent's rule for NoC design and its fault-tolerance. In: Proceedings of the first international symposium on networks-on-chips (NOCS'07), Princeton, pp 283-294
24. Greenfield D, Moore S (2008) Fractal communication in software data dependency graphs. In: Proceedings of the 20th ACM symposium on parallelism in algorithms and architectures (SPAA'08), Munich, pp 116-118. DOI 10.1145/1378533.1378555
25. Gu H, Xu J, Zhang W (2009) A low-power fat tree-based optical network-on-chip for multiprocessor system-on-chip. In: Proceedings of the conference on design automation and test in Europe, Nice, pp 3-8
26. Gupta V, Schenfeld E (1994) Performance analysis of a synchronous, circuit-switched interconnection cached network. In: ICS '94: proceedings of the 8th international conference on supercomputing, ACM, Manchester, pp 246-255. DOI 10.1145/181181.181540
27. Guz Z, Walter I, Bolotin E, Cidon I, Ginosar R, Kolodny A (2006) Efficient link capacity and QoS design for network-on-chip. In: Proceedings of the conference on design, automation and test in Europe, pp 9-14
28. Habata S, Umezawa K, Yokokawa M, Kitawaki S (2004) Hardware system of the earth simulator. Parallel Comput 30(12):1287-1313. DOI 10.1016/j.parco.2004.09.004
29. Han X, Chen RT (2004) Improvement of multiprocessing performance by using optical centralized shared bus. Proc SPIE 5358:80-89
30. Hawkins C, Small BA, Wills DS, Bergman K (2007) The data vortex, an all optical path multicomputer interconnection network. IEEE Trans Parallel Distr Syst 18(3):409-420. DOI 10.1109/TPDS.2007.48
31. Heirman W (2008) Reconfigurable optical interconnection networks for shared-memory multiprocessor architectures. PhD Thesis, Ghent University
32. Heirman W, Artundo I, Carvajal D, Desmet L, Dambre J, Debaes C, Thienpont H, Van Campenhout J (2005) Wavelength tuneable reconfigurable optical interconnection network for shared-memory machines. In: Proceedings of the 31st European conference on optical communication (ECOC 2005), vol 3. The Institution of Electrical Engineers, Glasgow, pp 527-528
33. Heirman W, Dambre J, Artundo I, Debaes C, Thienpont H, Stroobandt D, Van Campenhout J (2008) Predicting the performance of reconfigurable optical interconnects in distributed shared-memory systems. Photon Netw Commun 15(1):25-40. DOI 10.1007/s11107-007-0084-z
34. Heirman W, Dambre J, Stroobandt D, Van Campenhout J (2008) Rent's rule and parallel programs: characterizing network traffic behavior. In: Proceedings of the 2008 international workshop on system level interconnect prediction (SLIP'08), ACM, Newcastle, pp 87-94
35. Heirman W, Dambre J, Van Campenhout J (2007) Synthetic traffic generation as a tool for dynamic interconnect evaluation. In: Proceedings of the 2007 international workshop on system level interconnect prediction (SLIP'07), ACM, Austin, pp 65-72
36. Heirman W, Dambre J, Van Campenhout J, Debaes C, Thienpont H (2005) Traffic temporal analysis for reconfigurable interconnects in shared-memory systems. In: Proceedings of the 19th IEEE international parallel & distributed processing symposium, IEEE Computer Society, Denver, p 150
37. Heirman W, Stroobandt D, Miniskar NR, Wuyts R, Catthoor F (2010) PinComm: characterizing intra-application communication for the many-core era. In: Proceedings of the 16th IEEE international conference on parallel and distributed systems (ICPADS), Shanghai, pp 500-507. DOI 10.1109/ICPADS.2010.56
38. Hemenway R, Grzybowski R, Minkenberg C, Luijten R (2004) Optical-packet-switched interconnect for supercomputer applications. J Opt Netw Special Issue Supercomput Interconnects 3(12):900-913. DOI 10.1364/JON.3.000900
39. Henderson CJ, Leyva DG, Wilkinson TD (2006) Free space adaptive optical interconnect at 1.25 Gb/s, with beam steering using a ferroelectric liquid-crystal SLM. IEEE/OSA J Lightwave Technol 24(5):1989-1997. DOI 10.1109/JLT.2006.871015
40. Hoskote Y, Vangal S, Singh A, Borkar N, Borkar S (2007) A 5-GHz mesh interconnect for a teraflops processor. IEEE Micro 27(5):51-61. DOI 10.1109/MM.2007.77
41. Hu J, Marculescu R (2003) Exploiting the routing flexibility for energy/performance aware mapping of regular NoC architectures. In: Proceedings of the conference on design, automation and test in Europe, pp 688-693. DOI 10.1109/DATE.2003.1253687
42. Hu J, Marculescu R (2004) Application-specific buffer space allocation for networks-on-chip router design. In: Proceedings of the IEEE/ACM international conference on computer-aided design, San Jose, pp 354-361. DOI 10.1109/ICCAD.2004.1382601
43. Jalabert A, Murali S, Benini L, Micheli GD (2004) xPipesCompiler: a tool for instantiating application-specific NoCs. In: Proceedings of the conference on design, automation and test in Europe, vol 2, Paris, pp 884-889. DOI 10.1109/DATE.2004.1268999
44. Jerraya A, Wolf W (eds) (2005) Multiprocessor systems-on-chips. Elsevier/Morgan Kaufmann, San Francisco
45. Jha NK (2001) Low power system scheduling and synthesis. In: ICCAD '01: proceedings of the 2001 IEEE/ACM international conference on computer-aided design, IEEE, Piscataway, pp 259-263
46. Kamil S, Pinar A, Gunter D, Lijewski M, Oliker L, Shalf J (2007) Reconfigurable hybrid interconnection for static and dynamic scientific applications. In: Proceedings of the 4th international conference on computing frontiers, ACM, Ischia, pp 183-194. DOI 10.1145/1242531.1242559
47. Katsinis C (2001) Performance analysis of the simultaneous optical multi-processor exchange bus. Parallel Comput 27(8):1079-1115
48. Kodi A, Louri A (2006) RAPID for high-performance computing systems: architecture and performance evaluation. Appl Opt 45:6326-6334
49. Koohi S, Hessabi S (2009) Contention-free on-chip routing of optical packets. In: Proceedings of the 3rd ACM/IEEE international symposium on networks-on-chip, pp 134-143
50. Krishnamurthy P, Chamberlain R, Franklin M (2003) Dynamic reconfiguration of an optical interconnect. In: Proceedings of the 36th annual simulation symposium, pp 89-97
51. Landman BS, Russo RL (1971) On a pin versus block relationship for partitions of logic graphs. IEEE Trans Comput C-20(12):1469-1479
52. Lee BG, Biberman A, Chan J, Bergman K (2010) High-performance modulators and switches for silicon photonic networks-on-chip. IEEE J Sel Top Quantum Electron 16(1):6-22. DOI 10.1109/JSTQE.2009.2028437
53. Lee BG, Biberman A, Sherwood-Droz N, Poitras CB, Lipson M, Bergman K (2009) High-speed 2 × 2 switch for multiwavelength silicon-photonic networks-on-chip. J Lightwave Technol 27(14):2900-2907
54. Lee J, Kim D, Ahn H, Park S, Pyo J, Kim G (2007) Temperature-insensitive silicon nano-wire ring resonator. In: Optical fiber communication conference and exposition and the national fiber optic engineers conference, OSA technical digest series (CD), Anaheim, p OWG4
55. Lee SJ, Lee K, Yoo HJ (2005) Analysis and implementation of practical, cost-effective networks on chips. IEEE Design Test Comput 22(5):422-433
56. Leroy A, Marchal A, Shickova A, Catthoor F, Robert F, Verkest D (2005) Spatial division multiplexing: a novel approach for guaranteed throughput on NoCs. In: Proceedings of the third IEEE/ACM/IFIP international conference on hardware/software codesign and system synthesis, pp 81-86
57. Magnusson PS, Christensson M, Eskilson J, Forsgren D, Hallberg G, Hogberg J, Larsson F, Moestedt A, Werner B (2002) Simics: a full system simulation platform. IEEE Comput 35(2):50-58
58. McArdle N, Fancey SJ, Dines JAB, Snowdon JF, Ishikawa M, Walker AC (1998) Design of parallel optical highways for interconnecting electronics. Proc SPIE Opt Comput 3490:143-146
59. McArdle N, Naruse M, Ishikawa M, Toyoda H, Kobayashi Y (1999) Implementation of a pipelined optoelectronic processor: OCULAR-II. In: Optics in computing, OSA technical digest
60. McNutt B (2000) The fractal structure of data reference: applications to the memory hierarchy. Kluwer Academic, Norwell, MA, USA
61. Millberg M, Nilsson E, Thid R, Jantsch A (2004) Guaranteed bandwidth using looped containers in temporally disjoint networks within the nostrum network on chip. In: Proceedings of the conference on design, automation and test in Europe, pp 890-895
62. Miniskar NR, Wuyts R, Heirman W, Stroobandt D (2009) Energy efficient resource management for scalable 3D graphics game engine. Tech report, IMEC
63. Murali S, De Micheli G (2004) Bandwidth-constrained mapping of cores onto NoC architectures. In: Proceedings of the conference on design, automation and test in Europe, IEEE Computer Society, Washington, p 20896
64. Ogras U, Marculescu R (2006) Prediction-based flow control for network-on-chip traffic. In: Proceedings of the 43rd design automation conference, pp 839-844
65. Ogras UY, Marculescu R (2006) It's a small world after all: NoC performance optimization via long-range link insertion. IEEE Trans Very Large Scale Integr Syst Special Sect Hardware/Software Codesign Syst Synth 14(7):693-706. DOI 10.1109/TVLSI.2006.878263
66. Ohashi K, Nishi K, Shimizu T, Nakada M, Fujikata J, Ushida J, Torii S, Nose K, Mizuno M, Yukawa H, Kinoshita M, Suzuki N, Gomyo A, Ishi T, Okamoto D, Furue K, Ueno T, Tsuchizawa T, Watanabe T, Yamada K, Itabashi S, Akedo J (2009) On-chip optical interconnect. Proc IEEE 97(7):1186-1198. DOI 10.1109/JPROC.2009.2020331
67. Owens JD, Dally WJ, Ho R, Jayasimha D, Keckler SW, Peh LS (2007) Research challenges for on-chip interconnection networks. IEEE Micro 27(5):96-108
68. Pande PP, Grecu C, Jones M, Ivanov A, Saleh R (2005) Performance evaluation and design trade-offs for network-on-chip interconnect architectures. IEEE Trans Comput 54(8):1025-1040
69. Parhami B (1999) Introduction to parallel processing: algorithms and architectures. Kluwer Academic
70. Patel R, Bond S, Pocha M, Larson M, Garrett H, Drayton R, Petersen H, Krol D, Deri R, Lowry M (2003) Multiwavelength parallel optical interconnects for massively parallel processing. IEEE J Sel Top Quantum Electron 9:657-666
71. Petracca M, Lee BG, Bergman K, Carloni LP (2008) Design exploration of optical interconnection networks for chip multiprocessors. In: Proceedings of the 16th IEEE symposium on high performance interconnects, Stanford, pp 31-40. DOI 10.1109/HOTI.2008.20
72. Poon AW, Xu F, Luo X (2008) Cascaded active silicon microresonator array cross-connect circuits for WDM networks-on-chip. In: Proceedings of SPIE photonics west, pp 19-24
73. Qiao C, Melhem R, Chiarulli D, Levitan S (1994) Dynamic reconfiguration of optically interconnected networks with time-division multiplexing. J Parallel Distr Comput 22(2):268-278
74. Roldan R, d'Auriol B (2003) A preliminary feasibility study of the LARPBS optical bus parallel model. In: Proceedings of the 17th annual international symposium on high performance computing systems and applications, pp 181-188
75. Russell G (2004) Analysis and modelling of optically interconnected computing systems. PhD Thesis, Heriot-Watt University
76. Sakano T, Matsumoto T, Noguchi K, Sawabe T (1991) Design and performance of a multiprocessor system employing board-to-board free-space interconnections: COSINE-1. Appl Opt 30:2334-2343
77. Shacham A, Bergman K, Carloni L (2008) Photonic networks-on-chip for future generations of chip multi-processors. IEEE Trans Comput 57(9):1246-1260. DOI 10.1109/TC.2008.78
78. Sherwood-Droz N, Wang H, Chen L, Lee BG, Biberman A, Bergman K, Lipson M (2008) Optical 4 × 4 hitless silicon router for optical networks-on-chip (NoC). Opt Express 16(20):15915-15922. DOI 10.1364/OE.16.015915
79. Snyder L (1982) Introduction to the configurable, highly parallel computer. Computer 15(1):47-56
80. Soganci IM, Tanemura T, Williams KA, Calabretta N, de Vries T, Smalbrugge E, Smit MK, Dorren HJS, Nakano Y (2010) Monolithically integrated InP 1 × 16 optical switch with wavelength-insensitive operation. IEEE Photon Technol Lett 22(3):143-145
81. Srinivasan K, Chatha K (2005) A technique for low energy mapping and routing in network-on-chip architectures. In: Proceedings of the international symposium on low power electronics and design, pp 387-392
82. Stensgaard MB, Sparsø J (2008) ReNoC: a network-on-chip architecture with reconfigurable topology. In: 2nd ACM/IEEE international symposium on networks-on-chip, Newcastle, pp 55-64. DOI 10.1109/NOCS.2008.4492725
83. Stuart MB, Stensgaard MB, Sparsø J (2009) Synthesis of topology configurations and deadlock free routing algorithms for ReNoC-based systems-on-chip. In: Proceedings of the 7th IEEE/ACM international conference on hardware/software codesign and system synthesis, pp 481-490. DOI 10.1145/1629435.1629500
84. Tang S, Tang Y, Colegrove J, Craig DM (2004) Electro-optic Bragg grating couplers for fast reconfigurable optical waveguide interconnects. In: Proceedings of the conference on lasers and electro-optics (CLEO), p 2
85. Vangal S, Howard J, Ruhl G, Dighe S, Wilson H, Tschanz J, Finan D, Singh A, Jacob T, Jain S, Erraguntla V, Roberts C, Hoskote Y, Borkar N, Borkar S (2008) An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS. IEEE J Solid-State Circuits 43(1):29-41. DOI 10.1109/JSSC.2007.910957
86. Vlasov Y, Green WMJ, Xia F (2008) High-throughput silicon nanophotonic wavelength-insensitive switch for on-chip optical networks. Nat Photon 2:242-246
87. Wolkotte PT, Smit GJM, Rauwerda GK, Smit LT (2005) An energy-efficient reconfigurable circuit-switched network-on-chip. In: Proceedings of the 19th IEEE international parallel and distributed processing symposium (IPDPS), Denver, p 155a
88. Woo SC, Ohara M, Torrie E, Singh JP, Gupta A (1995) The SPLASH-2 programs: characterization and methodological considerations. In: Proceedings of the 22nd international symposium on computer architecture, Santa Margherita Ligure, pp 24-36
89. Xu Q, Fattal D, Beausoleil RG (2008) Silicon microring resonators with 1.5-μm radius. Opt Express 16(6):4309
90. Yoshimura T, Ojima M, Arai Y, Asama K (2003) Three-dimensional self-organized microoptoelectronic systems for board-level reconfigurable optical interconnects: performance modeling and simulation. IEEE J Sel Top Quantum Electron 9(2):492-511. DOI 10.1109/JSTQE.2003.812503
Chapter 8
System Level Exploration for the Integration
of Optical Networks on Chip in 3D MPSoC
Architectures
S. Le Beux (*)
École Polytechnique de Montréal, Montreal, QC, Canada
Ecole Centrale de Lyon, Lyon Institute of Nanotechnology, University of Lyon,
36 avenue Guy de Collongue, Ecully Cedex 69134, France
e-mail: sebastien.le-beux@ec-lyon.fr
J. Trajkovic • G. Nicolescu • G. Bois
École Polytechnique de Montréal, Montreal, QC, Canada
e-mail: Jelena.Trajkovic@polymtl.ca; Gabriela.Nicolescu@polymtl.ca;
Guy.Bois@polymtl.ca
I. O'Connor • S. Le Beux (*)
Ecole Centrale de Lyon, Lyon Institute of Nanotechnology, University of Lyon,
36 avenue Guy de Collongue, Ecully Cedex 69134, France
e-mail: sebastien.le-beux@ec-lyon.fr
P. Paulin
STMicroelectronics (Canada) Inc., Ottawa, ON, Canada
e-mail: Pierre.Paulin@st.com
Introduction
Related Work
Many contributions address ONoC design. ONoCs have been considered as full
replacement solutions for electrical NoCs in [12, 13]. Fat tree [12] and 2D Mesh
[13] topologies are implemented using optical interconnects in the context of planar
The more an identical pattern is regularly repeated on a die (e.g. in Mesh and
Torus networks), the more generic and scalable the architecture is. The same
principle can also be applied to 3D architectures: the more an identical layer is
regularly repeated, the more generic and scalable the architecture. The architecture
proposed in this chapter follows this trend: it is composed of identical electrical
layers. Our approach thus has the potential for scaling to complex systems.
Design tradeoffs for various 3D electrical interconnects have been studied in
[27-31]. Methodologies are proposed to design application specific 3D SoCs in [25,
32]. System level methodologies for 3D chip allow the maximum clock frequency
to be evaluated [33], as well as power consumption [34] and even chip cost [35].
Thermal-aware application mapping for 3D chips is investigated in [36]. Our work
is complementary to this related work, since we address design tradeoffs for a 3D
architecture integrating optical interconnects.
This section presents the 3D architecture model used in this work. The architecture
is defined by the extension of two planar approaches: (1) the electrical NoC and (2)
the optical network-on-chip (ONoC). Figure 8.1a illustrates the 3D architecture
integrating an ONoC. It is composed of a set of stacked electrical layers and one
optical layer. The electrical layer is composed of a set of computing nodes intercon-
nected through a NoC while the optical layer integrates only an ONoC (computing
nodes are not a part of the optical layer). Two types of communications are distin-
guished in this architecture:
Intra-layer communications are used for data transfers between nodes situated
within the same electrical layer.
Inter-layer communications are used for data transfers between nodes situated
on different electrical layers.
The ONoC is thus dedicated to inter-layer communications. All electrical layers
are connected to the optical layer using electrical, point-to-point, vertical TSVs
(through-silicon vias). Inter-layer communications require routing composed of three main
steps: (1) electrical routing from the source node to a TSV, (2) optical routing within
the ONoC and (3) electrical routing from a TSV to the destination node (the details
will be presented in the remainder of the section). Given that the ONoC is used for
inter-layer communications, the optical layer is located in the middle of the 3D
architecture to minimize the length of TSVs and eliminate the need for high aspect
ratio TSVs (a notorious difficulty in the 3DIC domain).
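The three-step inter-layer routing described above can be sketched as a small helper. Here `tsv_of`, which maps a node to its TSV/ONI access point, is a hypothetical function, and the (x, y, layer) coordinate convention is an assumption; neither is defined in the chapter.

```python
# Sketch of the routing decision: intra-layer traffic stays on the electrical
# NoC, inter-layer traffic goes electrically to a TSV, optically through the
# ONoC, then electrically from a TSV to the destination.

def route(src, dst, tsv_of):
    """Return the list of routing segments from src to dst.

    src, dst: (x, y, layer) tuples; tsv_of: hypothetical node -> TSV mapping.
    """
    if src[2] == dst[2]:
        return [("electrical-XY", src, dst)]          # intra-layer: NoC only
    return [
        ("electrical-XY", src, tsv_of(src)),          # (1) source node -> TSV
        ("optical-ONoC", tsv_of(src), tsv_of(dst)),   # (2) through the optical layer
        ("electrical-XY", tsv_of(dst), dst),          # (3) TSV -> destination node
    ]
```

For example, a transfer between two nodes on the same layer yields a single electrical segment, while a transfer across layers yields the three segments listed above.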
This architecture relies on a communication hierarchy similar to that proposed
in the Firefly 2D architecture [20]. Due to 3D integration, additional communica-
tion resources could also be considered, especially to provide point-to-point con-
nections between nodes located on different (but adjacent) layers (e.g. to provide a
direct connection between node_{i,j,k} and node_{i,j,k+1}, where k and k + 1 denote adjacent
Fig. 8.1 3D architecture: (a) overview, (b) focus on electrical resources and (c) focus on optical
resources
layers). However, in addition to increased routing complexity (i.e. with these addi-
tional resources, inter-layer communication can be performed with ONoC or with
point-to-point connection), analysis demonstrated that the performance gain was
negligible [20]. Thus we do not consider such point-to-point connections for the
remainder of the chapter.
Intra-Layer Communication
Intra-layer communications are used for data transfers between the nodes situated
on the same electrical layer. Each electrical layer is composed of a set of homoge-
neous nodes interconnected by a 2D Mesh NoC. We define a node as a computing
subsystem including a processor and a local memory. The node accesses the net-
work via an electrical Network Interface (NI) (see Fig. 8.1b). The NoC is composed
of links and switches that are used to route data from a source NI to a destination NI.
The XY routing policy is used in this work.
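The XY routing policy mentioned above routes a packet fully along the X dimension first, then along Y. A minimal sketch of the hop sequence on one layer:

```python
# XY routing on a 2D mesh: correct the X coordinate first, then the Y
# coordinate, one hop at a time.

def xy_hops(src, dst):
    """Return the sequence of (x, y) switch coordinates visited, from src to dst."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                       # move along X first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                       # then along Y
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_hops((0, 0), (2, 1)))  # -> [(0, 0), (1, 0), (2, 0), (2, 1)]
```

XY routing is deterministic and, on a mesh, deadlock-free, which is why it is a common default for NoC studies.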
Inter-Layer Communication
Inter-layer communications are used for data transfers between nodes situated on
different electrical layers. The inter-layer communications are enabled by the opti-
cal network interfaces (ONI). By providing opto-electrical and electro-optical con-
versions, ONIs allow the sending/receiving of data. Their location in the architecture
is illustrated in Fig. 8.1.
The main components of an ONI are shown in Fig. 8.2. The ONI is composed of
an electrical and an optical part. Thus, the transmitter and receiver chains of each
ONI are implemented in both electrical and optical layers. The components of the
transmitter chain in the electrical layer are a serializer (SER) and CMOS driver
circuits. An uploading TSV links the electrical layer to the optical layer. For the
transmitter functionality, the optical layer includes microsource lasers [5]. The
receiver chain includes a photodetector (on the optical layer), a downloading TSV
(connecting an optical to an electrical layer), and a CMOS receiver circuit and a
deserializer (DES) (on an electrical layer). The CMOS receiver circuit consists of a
transimpedance amplifier (TIA) and a comparator. The TIA takes in the electrical
current generated by the photodetector and transforms it into a voltage, while the
comparator decides the value of each bit based on that voltage. For an ONoC
interconnecting N ONIs, transmitter and receiver chains are
replicated N times.
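The receiver chain can be sketched as two stages, mirroring the description above: the TIA converts photocurrent to a voltage, and the comparator thresholds that voltage into a bit. The gain and threshold values below are illustrative placeholders, not device parameters from the chapter.

```python
# Receiver-side sketch: TIA (current -> voltage) followed by a comparator
# (voltage -> bit). Gain and threshold are illustrative assumptions.

def tia(current_a, gain_ohm=1e4):
    """Transimpedance stage: V = I * R_feedback."""
    return current_a * gain_ohm

def comparator(voltage_v, threshold_v=0.05):
    """Decide one bit from the TIA output voltage."""
    return 1 if voltage_v > threshold_v else 0

def receive(photocurrents_a):
    """Recover the bit sequence from a list of photodetector current samples."""
    return [comparator(tia(i)) for i in photocurrents_a]

print(receive([10e-6, 1e-6, 20e-6]))  # -> [1, 0, 1]
```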
The inter-layer communication process starts on an electrical layer when an
ONI receives a data and a destination ID. The data is serialized and the appropriate
CMOS driver circuit then modulates the current flowing through the microsource
Fig. 8.3 Diagonal and straight states: (a) logical view and (b) layout
Optical Layer
The optical layer is composed of the ONoC and the optical part of each ONI. The
ONoC used in this work is composed of waveguides and contention-free optical
switches. The waveguides transmit optical signals and the optical switches manage
the routing of signals into these waveguides.
From a functional point of view, an optical switch operates in a similar way to a
classical electronic switch. From any input port, switching is obtained to one of the
two output ports depending on the wavelength of the optical signal injected
(Fig. 8.3). An optical switch is characterized by its resonant wavelength λn. As
illustrated in Fig. 8.3, there are two possible switch states:
The diagonal state, which occurs when a signal with a wavelength λ different
from λn (λ ≠ λn) is injected. In this case, the optical switch does not resonate,
and the signal is propagated along the same waveguide.
The straight state, which occurs when a signal with a wavelength λ = λn is
injected. In this case, the optical switch resonates and the signal is coupled to
the opposite waveguide.
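The two switch states reduce to a single wavelength comparison; a sketch, with wavelengths represented as plain numbers for illustration:

```python
# The two switch states as a routing rule: a signal at the switch's resonant
# wavelength resonates and is coupled to the opposite waveguide ("straight"
# state); any other wavelength passes along the same waveguide ("diagonal").

def switch_state(signal_wl, resonant_wl):
    """Return (state name, output waveguide) for one signal at one switch."""
    if signal_wl == resonant_wl:
        return ("straight", "opposite waveguide")   # switch resonates
    return ("diagonal", "same waveguide")           # no resonance

print(switch_state(1550, 1550))  # -> ('straight', 'opposite waveguide')
print(switch_state(1551, 1550))  # -> ('diagonal', 'same waveguide')
```

Under WDM, each wavelength on a shared waveguide is evaluated against this rule independently, which is what makes the simultaneous, contention-free transfers described below possible.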
We utilize wavelength division multiplexing (WDM) where multiple signals of
different wavelengths flow through a waveguide. When these multiple signals
Fig. 8.4 ONoC interconnecting 4 ONIs located on (a) 4 layers and (b) 2 layers
encounter an optical switch, each of them is routed through the switch according to
the individual wavelength, as if it were the only signal flowing through the wave-
guide. Thanks to these optical properties, multiple signals can be transmitted
simultaneously, which facilitates the design of high bandwidth and potentially
contention-free ONoC.
The main constraint of an optical interconnect is the number of optical switches
crossed (nOSC) by one optical signal (note that this is different from the total
number of optical switches in the network) [5]. The nOSC of an ONoC is defined
by the path crossing the maximal number of optical switches. Recent work [37]
reports the output power of an integrated laser to be around 2.5 mW/mm². To
achieve an acceptable communication bit error rate (below 10⁻¹⁸) with an
input-referred TIA noise density of 10⁻²⁴ A²/Hz, a total loss of no more than 13 dB
in the passive optical structure may be tolerated. For current technology, 2 cm die
sizes, and typical values for loss in passive waveguides (2 dB/cm) and loss per
optical switch (0.3 dB), the limit for nOSC is reached at 48 optical switches
crossed. Further technology improvements are expected to reduce switch losses to
0.2 dB, which may lead to reliable structures with 64 optical switches crossed.
Given these observations, we consider that a design with nOSC equal to 48
represents a currently feasible solution and a design with nOSC between 48 and 64
represents a feasible solution in the near future. The design feasibility step (section
Study of Optical Layer Complexity) evaluates all the optical paths in the ONoC.
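The form of this loss-budget check can be sketched as a small function. The chapter does not spell out the exact per-path waveguide-length accounting behind its figure of 48 switches, so the parameter values below are illustrative; the function works in tenths of a dB to avoid floating-point surprises.

```python
# Loss-budget sketch: the largest nOSC is the number of per-switch losses that
# fit in the budget remaining after the waveguide loss is subtracted.
# All values are in tenths of a dB (e.g. 130 = 13 dB, 3 = 0.3 dB/switch).

def max_switches(budget_tenth_db, waveguide_loss_tenth_db, per_switch_tenth_db):
    """Largest switch count whose total loss stays within the passive budget."""
    return (budget_tenth_db - waveguide_loss_tenth_db) // per_switch_tenth_db

# Illustrative: 13 dB budget, 4 dB of waveguide loss (2 dB/cm over 2 cm),
# 0.3 dB per switch.
print(max_switches(130, 40, 3))  # -> 30 switches under these assumptions
```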
Since the ONoC aims only to manage inter-layer communications, full connec-
tivity between ONIs is not necessary (ONIs located on the same layer communicate
through the electrical NoCs). As a consequence, the number of optical switches
crossed by an optical signal can be reduced. Figure 8.4 illustrates an ONoC connect-
ing ONIs A, B, C and D. The initiator parts of ONI are shown on the left hand side
and the target parts are on the right. In Fig. 8.4a, the four ONIs are located on dif-
ferent layers while in Fig. 8.4b A and B are located on one layer, and C and D are
located on the other layer. In Fig. 8.4a three targets are reachable from each source,
and the wavelengths used for these communications are illustrated in the
corresponding truth table. For instance, ONI A communicates with ONI C using λ1.
In this case, two optical switches are crossed (nOSC = 2), as illustrated by the
dashed line. In the example illustrated in Fig. 8.4b, there is no need to connect
ONIs on the same layer (e.g. ONI A and ONI B) using the ONoC, and therefore
half the communication scheme is deleted, as illustrated by the corresponding truth
table and the resulting ONoC. A total of just two optical switches are necessary for
the entire network, and a single switch is crossed to perform the communication
from A to C (nOSC = 1). Hence, by using this method, the number of optical
switches is reduced without impacting ONoC performance, which remains
contention free. Only [22, 23] consider contention-free ONoCs, but they do not
propose any method to reduce the implementation complexity when total
connectivity is not required.
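The connectivity reduction illustrated by Fig. 8.4 can be sketched by enumerating ONI pairs and dropping those on the same layer, since those communicate over the electrical NoC. The dict-based representation below is an illustrative choice, not the chapter's data model.

```python
# Sketch: only inter-layer (source, target) ONI pairs need optical paths;
# same-layer pairs use the electrical NoC and are removed from the scheme.

def required_pairs(oni_layer):
    """oni_layer: dict mapping ONI name -> layer index.

    Returns the ordered (source, target) pairs that the ONoC must support.
    """
    onis = sorted(oni_layer)
    return [(s, t) for s in onis for t in onis
            if s != t and oni_layer[s] != oni_layer[t]]

# Four ONIs on four layers (Fig. 8.4a-like): all 12 directed pairs remain.
print(len(required_pairs({"A": 0, "B": 1, "C": 2, "D": 3})))  # -> 12
# A, B on one layer and C, D on another (Fig. 8.4b-like): only 8 pairs remain.
print(len(required_pairs({"A": 0, "B": 0, "C": 1, "D": 1})))  # -> 8
```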
In order to respect the nOSC constraint, we will explore a scenario where only a
subset of the nodes in electrical layer is connected to TSVs through ONIs. For the
nodes that are not connected to TSVs, a routing path to the closest node connected
to TSVs is required. We thus introduce the concept of interconnect ratio (IR). The
IR is defined as the number of ONIs divided by the total number of electrical
nodes (or switches), in percent. In this study, we consider IRs of 100%, 50% and
25%, as illustrated in Fig. 8.5.
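The IR definition, together with the routing of an ONI-less node to its closest ONI-equipped node, can be sketched as follows. The Manhattan-distance tie-break (first minimum wins) is an illustrative assumption, not specified in the chapter.

```python
# Sketch of the interconnect ratio and of nearest-ONI selection on one layer.

def interconnect_ratio(num_onis, num_nodes):
    """IR in percent: number of ONIs over total number of electrical nodes."""
    return 100.0 * num_onis / num_nodes

def closest_oni(node, oni_nodes):
    """Pick the ONI-equipped node at minimal Manhattan (XY-routing) distance.

    Ties go to the first minimum; node and oni_nodes entries are (x, y) tuples.
    """
    return min(oni_nodes,
               key=lambda o: abs(o[0] - node[0]) + abs(o[1] - node[1]))

# A 4x4 layer with ONIs on 4 of its 16 nodes has IR = 25%.
print(interconnect_ratio(4, 16))          # -> 25.0
print(closest_oni((3, 3), [(0, 0), (2, 2)]))  # -> (2, 2)
```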
In addition to their bandwidth, electrical routing resources (i.e. NI, electrical part
of the ONI, electrical links and electrical switches) are also characterized by their
latency. ONIs are abstracted at the transmitter and receiver level.
Waveguides and optical switches are considered latency free and they are
characterized by a set of wavelengths potentially flowing through them. Optical
switches are also characterized by their resonant wavelength. The clock speed of
the architecture is limited by the speed of the opto-electrical interfaces, which
require serialization of data. The maximum conversion frequency currently
supported [5] is 100 MHz; the system frequency is therefore also 100 MHz. Note
that while the electrical layer components operate at 100 MHz, the optical layer
components operate at 3.2 GHz [i.e. the frequency of the optical components
equals the system frequency (100 MHz) multiplied by the data bit width (32 bits)].
The model is configurable in terms of the number of electrical layers, the number
of nodes per layer and the value of IR.
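The configurable model parameters, and the clock relation stated above, can be sketched in a few lines. The class name and fields are illustrative, not part of the chapter's tooling.

```python
# Sketch of the configurable architecture model: layers, nodes per layer, IR,
# and the clocking assumption (optical clock = system clock x bit width).

class ArchConfig:
    def __init__(self, layers, nodes_per_layer, ir_percent,
                 system_hz=100e6, bit_width=32):
        self.layers = layers
        self.nodes_per_layer = nodes_per_layer
        self.ir_percent = ir_percent      # interconnect ratio, in percent
        self.system_hz = system_hz        # limited by opto-electrical conversion [5]
        self.bit_width = bit_width

    def optical_clock_hz(self):
        # optical components run bit_width times faster than the system clock
        return self.system_hz * self.bit_width

    def num_onis(self):
        # number of ONI-equipped nodes implied by the IR
        return round(self.layers * self.nodes_per_layer * self.ir_percent / 100.0)

cfg = ArchConfig(layers=2, nodes_per_layer=16, ir_percent=25)
print(cfg.optical_clock_hz() / 1e9)  # -> 3.2 (GHz), matching the text
print(cfg.num_onis())                # -> 8
```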
Design Tradeoffs
In this section, we evaluate complexity and performance metrics for various archi-
tectural configurations. This evaluation shows some of the design tradeoffs for 3D
architectures including ONoC.
Fig. 8.6 Implementation complexity of the optical layer for four electrical layer architectures (optical switches crossed, 0-100, versus electrical layer configuration from 2 × 2 to 9 × 9 nodes, for IR = 100%, 50% and 25%)
[Figure: throughput (flits/node/cycle) versus injection rate (flits/node/cycle) for 8 × 4 × 2, 4 × 4 × 4 and 4 × 2 × 8 3D Mesh and for 8 × 4 × 2 and 4 × 4 × 4 ONoC configurations]
Throughput reaches constant values at injection rates greater than 0.4, across all
configurations. Furthermore, for ONoC-based architectures, the throughput
increases with the number of layers. This is not the case for 3D Mesh configurations:
the optimal 3D Mesh configuration is 4 × 4 × 4. According to our experiments, not
presented here, when IR is set to 50% the 3D architecture integrating the ONoC still
outperforms the electrical 3D architectures. When IR = 25%, the two architectures
provide similar performance. From these results, we conclude that ONoC-based
architectures scale better for next generation MPSoCs, where hundreds of nodes
located on different layers are expected to communicate with each other.
Fig. 8.8 Average transfer times for 2-layer architectures: (a) all (ONoC at IR = 100%, 50% and 25%, and 3D Mesh), (b) intra-layer and inter-layer communications at IR = 100%, 50% and 25%
configurations with two electrical layers. We observe that the average transfer time
depends on both the number of nodes and the IR. With an increase in the number
of nodes, the average transfer time increases for both 3D Mesh and all ONoC
architectures. The increase for 3D Mesh is linear. As for the ONoC configurations,
the average transfer time also increases with IR. For all values of IR, it may be
observed that the increase in average transfer time is less rapid than for the 3D
Mesh. These results allow system designers to rapidly evaluate the benefits of opti-
cal interconnect (compared to electrical ones), and thus aid in designing the most
efficient interconnect architecture.
Figure 8.8b illustrates the average transfer times for intra-layer and inter-layer
communications:
- The average transfer time for intra-layer communications increases with the number of nodes. This behavior is due to the increasing number of electrical switches that need to be crossed. The average transfer time also depends on the IR, since a reduced number of ONIs causes additional contentions in the electrical NoC.
- The average transfer time for inter-layer communications strongly depends on the IR: when the IR value is reduced, contentions occur for the electro-optical and opto-electronic conversions. Inter-layer communication time increases only slightly with the number of nodes; the main reason for this dependency is the electrical routing to and from the TSVs.
Figure 8.8b also shows that, for a small set of nodes, intra-layer communications
perform faster than inter-layer communications. This trend is reversed for larger
sets (e.g. 5×5 when IR is set to 50%). The number of nodes at which inter-layer
communications become faster depends on the IR. These results help system designers
take advantage of both electrical and optical interconnect technologies, for
short-range and long-range communications respectively. Experiments with 4-layer
and 8-layer architectures (not presented here) validate our observations for this type
of architecture.
In this section, we analyzed the complexity of the optical layer and highlighted
current and possible short-term trends. We also illustrated the potential of optical
interconnect for such architectures.
8 System Level Exploration for the Integration... 255
Case Study
This section presents the results of applying the design space exploration
technique described above to optimize the mapping of the Demosaic image
processing application onto various network architecture configurations. The application is an
industrial reference application, provided by STMicroelectronics. The Demosaic
image processing application performs color filter array interpolation, also called
demosaicing, as a part of the digital image processing chain. Demosaicing is neces-
sary since the standard camera sensor detects only one color value per pixel (green,
blue or red). In order to reconstruct the output image, the Demosaic application
performs three interpolations on an input image: (1) the green interpolation, (2) the
blue interpolation, and (3) the red interpolation. Figure 8.11a represents the
corresponding application model (using the annotations for task t_i and edge e_i
explained in the Design Space Exploration section). To stress communications, a
task set identified as the Demosaic kernel (Fig. 8.10a) is replicated eight times,
allowing the application to manage larger image blocks (see Fig. 8.10b).
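To make the three interpolation steps concrete, here is a deliberately simplified demosaic sketch; the function, the nearest-neighbour/averaging scheme, and the RGGB tile layout are our own assumptions, not the STMicroelectronics implementation:

```python
import numpy as np

def demosaic_nearest(raw):
    """Toy demosaic of an RGGB Bayer mosaic (even-sized 2D array).

    Per 2x2 tile the pattern is:  R G
                                  G B
    Green is averaged from its two samples (interpolation 1); blue and
    red are replicated from their single sample (interpolations 2, 3).
    """
    h, w = raw.shape
    rgb = np.zeros((h, w, 3))
    r = raw[0::2, 0::2]                            # red samples
    g = (raw[0::2, 1::2] + raw[1::2, 0::2]) / 2.0  # two greens per tile
    b = raw[1::2, 1::2]                            # blue samples
    for dy in (0, 1):
        for dx in (0, 1):                          # fill all four tile positions
            rgb[dy::2, dx::2, 0] = r
            rgb[dy::2, dx::2, 1] = g
            rgb[dy::2, dx::2, 2] = b
    return rgb
```

A production demosaicer would interpolate across tile boundaries (bilinear or edge-aware); this sketch only illustrates the three per-channel passes that the task graph parallelizes.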
We use our design space exploration flow to optimize the mapping of 8
Demosaic kernels onto 64-node architectures where nodes are distributed across
2, 4, and 8 stacked electrical layers. For each of the layer configurations, we con-
sider several values of the Interconnect Ratio (IR): 25%, 50%, and 100%. We
present results for the Pareto-optimal mappings obtained by our design space
exploration tool. Figure 8.12 compares the speedup of different configurations of the
ONoC-based architectures against the (reference) speedup of the architectures
integrating only electrical layers (3D Mesh). The speedup is shown relative to the
corresponding 3D architecture: e.g., the speedup of the 8×4×2 ONoC is relative to the
8×4×2 3D Mesh, while that of the 4×4×4 ONoC is relative to the 4×4×4 3D
Mesh. For 2-layer configurations, the ONoC-based architecture and the electrical
architecture provide almost equivalent performance for IR values of 50% and
100%. The configuration where IR is 25% slightly underperforms (by 0.8%) compared
to the 3D Mesh architecture. This is due to the relatively large time required
for intra-layer communication (larger than for architectures with 4×4 or 4×2 mesh
size), in addition to the time required for electro-optical and opto-electrical
conversion. A similar, but less pronounced, effect may be seen for the 4-layer configuration
with 4×4 mesh size. For the remaining 4-layer configurations, ONoC-based
258 S. Le Beux et al.
[Figure: speedup relative to the corresponding 3D Mesh (from 0.8 to 1.4) for the 8×4×2 (2-layer), 4×4×4 (4-layer), and 4×2×8 (8-layer) configurations, comparing the 3D Mesh baseline with ONoC at IR = 25%, 50%, and 100%.]
Fig. 8.12 Execution performance (speedup) for different 64-node architectures executing the Demosaic kernel (replicated eight times)
architectures provide significant speedups: 9% for IR = 50% and 17% for IR = 100%
compared to the 4×4×4 3D Mesh. Note that the 4×4×4 3D Mesh is
the optimal electrical-only architecture, as shown in Fig. 8.7. As for 8-layer
configurations, the ONoC with 25%, 50%, and 100% IR uniformly provides better
performance than the corresponding 3D Mesh, showing 8%, 18% and 35%
speedup, respectively. The corresponding speedup values are directly proportional
to the increase in throughput.
Optical interconnects enable novel communication possibilities (e.g. WDM
offers a new dimension for data or address coding) and provide high performance
levels (e.g. near zero latency for long range communications). However, to maxi-
mize the benefits from these features it is necessary to carry out careful design space
exploration at different levels:
- At the architectural level, reducing the design complexity of the ONoC by taking into account its context (e.g. the number of electrical computing resources, the communication hierarchy, and the resulting communication scenarios)
- At the application level, optimizing the mapping of complex applications while matching ONoC communication performance levels
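As an example of the application-level step, the sketch below assumes a simple greedy heuristic (the chapter does not specify its mapping algorithm at this level of detail): tasks are placed heaviest-communicator first, each on the free mesh node that minimizes hop-weighted traffic to its already-placed partners.

```python
from itertools import product

def greedy_map(edges, dims):
    """Greedy task-to-node mapping on a mesh (illustrative heuristic).

    edges: {(src_task, dst_task): communication volume}
    dims:  mesh dimensions, e.g. (4, 4, 4)
    Returns {task: (x, y, z)}, greedily minimizing the sum of
    volume * Manhattan distance over already-placed partners.
    """
    free = set(product(*(range(d) for d in dims)))
    # Order tasks by total communication volume, heaviest first.
    weight = {}
    for (a, b), v in edges.items():
        weight[a] = weight.get(a, 0) + v
        weight[b] = weight.get(b, 0) + v
    placement = {}
    for t in sorted(weight, key=weight.get, reverse=True):
        def cost(node):
            c = 0
            for (a, b), v in edges.items():
                other = b if a == t else (a if b == t else None)
                if other in placement:
                    c += v * sum(abs(x - y)
                                 for x, y in zip(node, placement[other]))
            return c
        best = min(free, key=cost)
        placement[t] = best
        free.remove(best)
    return placement
```

On a real ONoC target, the Manhattan-distance term would be replaced by a heterogeneous cost (electrical hops intra-layer; conversion plus optical traversal inter-layer).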
In this work, we consider exploration at both architecture and application levels
in order to (1) reduce the number of optical switches crossed by optical signals (thus
reducing communication losses and power consumption) and (2) maximize the
application execution throughput by using WDM. The obtained results demonstrate
that our exploration flow effectively exploits the routing capabilities of the ONoC to
maximize the system speedup factor. We believe that such a methodology allows
energy-efficient 3D MPSoCs to be designed that efficiently execute complex applications.
Conclusion
Index

A
Advanced microcontroller bus architecture (AMBA), 6, 167
Architectural-level design
  Clos and fat-tree topologies, 92
  electrical baseline architectures, 91
  first-order analysis, 93, 94
  global crossbar topology, 92
  hierarchical topology, 93
  logical network topology, 90
  multiple global buses, 92
  symmetric topologies, 91
Arrayed waveguide grating (AWG), 40–42
Automatic repeat request (ARQ), 13

B
Back-end-of-line (BEOL) integration, 67, 85
Bit error rate (BER), 20, 188
Buried oxide (BOX), 86
Bus, 6–7

C
Chemical/mechanical polishing (CMP), 68
Clos topology, 92, 112
Context switching, 205–206
Corona architecture, 244
COSINE-1, 209

D
Dimension order routing (DOR), 227
Directed acyclic graph (DAG), 255
Direct memory access (DMAs), 5
DRAM memory channel
  design themes, 129
  evaluation, 126–128
  network design, 122–126
Dual data rate (DDR), 5
Dynamic voltage and frequency scaling (DVFS), 207

E
Edge coupler, 37–38
Electrical distributed router (EDR), 180, 185
Electronic design automation (EDA), 4, 243
Electronic packet switching (EPS), 209
End-to-end (ETE) delay, 148, 149
Energy-efficient turnaround routing (EETAR), 145

F
Fabry–Perot (FP) laser, 48
Fat tree based optical network-on-chip (FONoC)
  CMOS technologies, 138
  comparison and analysis
    network performance, 148–149
    optical power loss, 148
    power consumption, 146–148
  MPSoC, 138
  multi-computer systems, 139
  OTAR
    control interfaces, 141
    microresonator and switching elements, 139–140
    non-blocking property, 142
    payload packets, 142
    traditional switching fabrics, 140–141
    turnaround routing algorithm, 141

T
Temporally disjoint networks (TDNs), 219
Thermal tuning, 54, 55
Through silicon via (TSV), 69–71, 155
Trans-impedance amplifier (TIA), 161, 182
Transmitter physical adapter (Tx-PhyAdapter), 183
Transmitting-path interface unit (TxIU), 187
Two-photon absorption (TPA), 63–64

U
Ultra deep submicron (UDSM) domain, 154

V
Vertical-cavity surface emitting lasers (VCSEL), 45
Vertical coupler, 38
Virtual channels (VCs), 219

W
Wave division multiplexing (WDM), 157
Waveguide
  coupling structures
    edge coupler, 37–38
    vertical coupler, 38
  rib waveguide, 36
  strip waveguide, 34–36
  wavelength filters and routers
    arrayed waveguide grating, 40–42
    fabrication accuracy, 42–43
    Mach–Zehnder interferometer, 39–41
    planar concave gratings, 41–43
    resonant ring filter, 39, 40
    temperature control, 43–44
Wavelength division multiplexing (WDM), 29–30, 83, 138