
OVERLAY VIRTUAL NETWORKS IN SOFTWARE-DEFINED DATA CENTERS

ARCHITECTURES, TECHNICAL DETAILS AND PRODUCTS


OVERLAY VIRTUAL NETWORKS IN SOFTWARE-DEFINED DATA CENTERS
Ivan Pepelnjak, CCIE#1354 Emeritus

Copyright 2014 ipSpace.net AG

WARNING AND DISCLAIMER


This book is a collection of blog posts written between March 2011 and the book publication date,
providing independent information about overlay virtual networks and software-defined data
centers. Every effort has been made to make this book as complete and as accurate as possible, but
no warranty or fitness is implied. Read the introductory paragraphs before the blog post headings to
understand the context in which the blog posts have been written, and make sure you read the
Introduction section.
The information is provided on an "as is" basis. The authors and ipSpace.net shall have neither
liability nor responsibility to any person or entity with respect to any loss or damages arising from
the information contained in this book.


CONTENT AT A GLANCE
FOREWORD ........................................................................................................................... IX
INTRODUCTION .................................................................................................................... XI

OVERLAY VIRTUAL NETWORKING 101 ...................................................................1-1

OVERLAY VIRTUAL NETWORKING TECHNICAL DETAILS.......................................2-1

OVERLAY VIRTUAL NETWORKING PRODUCT DETAILS .........................................3-1

GATEWAYS TO OVERLAY VIRTUAL NETWORKS.....................................................4-1

LONG-DISTANCE OVERLAY VIRTUAL NETWORKS .................................................5-1

ALTERNATE APPROACHES TO NETWORK VIRTUALIZATION .................................6-1


CONTENTS
FOREWORD ........................................................................................................................... IX
INTRODUCTION .................................................................................................................... XI

OVERLAY VIRTUAL NETWORKING 101 ...................................................................1-1


WHY IS NETWORK VIRTUALIZATION SO HARD? .............................................................. 1-5
VLANS ARE THE WRONG ABSTRACTION FOR VIRTUAL NETWORKING ........................... 1-8
TRANSPARENT BRIDGING (AKA L2 SWITCHING) SCALABILITY ISSUES ........................... 1-10
VMWARE VSWITCH - THE BASELINE OF SIMPLICITY.......................................................... 1-13
VIRTUAL SWITCHES - FROM SIMPLE TO SCALABLE ........................................................... 1-18
COMPLEXITY BELONGS TO THE NETWORK EDGE ............................................................. 1-23
DECOUPLE VIRTUAL NETWORKING FROM THE PHYSICAL WORLD ................................. 1-26
VIRTUAL NETWORKS: THE SKYPE ANALOGY .................................................................... 1-33
SOFT SWITCHING MIGHT NOT SCALE, BUT WE NEED IT................................................... 1-38
EMBRACE THE CHANGE ... RESISTANCE IS FUTILE ............................................................... 1-41
VXLAN AND EVB QUESTIONS .......................................................................................... 1-44
DOES IT MAKE SENSE TO BUILD NEW CLOUDS WITH OVERLAY NETWORKS? ............. 1-48


HOW DO I START MY FIRST OVERLAY VIRTUAL NETWORKING PROJECT?.................. 1-52


VIRTUAL NETWORKING IS MORE THAN VMS AND VLAN DUCT TAPE .......................... 1-54

OVERLAY VIRTUAL NETWORKING TECHNICAL DETAILS.......................................2-1


VIRTUAL NETWORKING IMPLEMENTATION TAXONOMY .................................................. 2-4
A DAY IN A LIFE OF AN OVERLAID VIRTUAL PACKET ........................................................ 2-7
CONTROL PLANE PROTOCOLS IN OVERLAY NETWORKS .............................................. 2-13
VXLAN, IP MULTICAST, OPENFLOW AND CONTROL PLANES ....................................... 2-15
VXLAN SCALABILITY CHALLENGES.................................................................................... 2-20
IGMP AND PIM IN MULTICAST VXLAN TRANSPORT NETWORKS ............................... 2-23
PVLAN, VXLAN AND CLOUD APPLICATION ARCHITECTURES .................................... 2-25
VM-LEVEL IP MULTICAST OVER VXLAN ........................................................................... 2-30
VXLAN RUNS OVER UDP - DOES IT MATTER? ................................................................ 2-32
NVGRE - BECAUSE ONE STANDARD JUST WOULDN'T BE ENOUGH .............................. 2-35
DO WE REALLY NEED STATELESS TRANSPORT TUNNELING (STT)? ................................ 2-38
COULD MPLS-OVER-IP REPLACE VXLAN OR NVGRE? ................................................. 2-44
ARE OVERLAY NETWORKING TUNNELS A SCALABILITY NIGHTMARE? .......................... 2-48
OVERLAY NETWORKS AND QOS FUD ............................................................................. 2-51


MICE, ELEPHANTS AND VIRTUAL SWITCHES ..................................................................... 2-54


HOW MUCH DATA CENTER BANDWIDTH DO YOU REALLY NEED?............................ 2-57
CAN WE REALLY IGNORE SPAGHETTI AND HORSESHOES? ............................................... 2-60
TTL IN OVERLAY VIRTUAL NETWORKS ............................................................................ 2-63
VMOTION AND VXLAN ..................................................................................................... 2-66

OVERLAY VIRTUAL NETWORKING PRODUCT DETAILS .........................................3-1


OVERLAY VIRTUAL NETWORKING SOLUTIONS OVERVIEW .............................................. 3-4
WHAT IS VMWARE NSX? ..................................................................................................... 3-6
VMWARE NSX CONTROL PLANE ........................................................................................ 3-9
LAYER-2 AND LAYER-3 SWITCHING IN VMWARE NSX .................................................. 3-16
LAYER-3 FORWARDING WITH VMWARE NSX EDGE SERVICES ROUTER....................... 3-18
OPEN VSWITCH UNDER THE HOOD ................................................................................. 3-21
ROUTING PROTOCOLS ON NSX EDGE SERVICES ROUTER ............................................. 3-24
UNICAST-ONLY VXLAN FINALLY SHIPPING .................................................................... 3-32
WHAT'S COMING IN HYPER-V NETWORK VIRTUALIZATION (WINDOWS SERVER 2012 R2) ............................ 3-35
NETWORKING ENHANCEMENTS IN WINDOWS SERVER 2012 R2 ................................. 3-38


VIRTUAL PACKET FORWARDING IN HYPER-V NETWORK VIRTUALIZATION ................ 3-40


HYPER-V NETWORK VIRTUALIZATION PACKET FORWARDING IMPROVEMENTS IN WINDOWS SERVER 2012 R2 .............................................. 3-47
COMPLEX ROUTING IN HYPER-V NETWORK VIRTUALIZATION..................................... 3-51
THIS IS NOT THE HOST ROUTE YOU'RE LOOKING FOR ................................................. 3-56
OPENSTACK NEUTRON PLUG-IN: THERE CAN ONLY BE ONE ..................................... 3-61
PACKET FORWARDING IN AMAZON VPC ........................................................................ 3-67
MIDOKURA'S MIDONET: A LAYER 2-4 VIRTUAL NETWORK SOLUTION ......................... 3-69
BIG SWITCH AND OVERLAY NETWORKS .......................................................................... 3-80

GATEWAYS TO OVERLAY VIRTUAL NETWORKS.....................................................4-1


VXLAN TERMINATION ON PHYSICAL DEVICES ................................................................. 4-3
CONNECTING LEGACY SERVERS TO OVERLAY VIRTUAL NETWORKS ............................. 4-9
IT DOESN'T MAKE SENSE TO VIRTUALIZE 80% OF THE SERVERS..................................... 4-12
INTERFACING OVERLAY VIRTUAL NETWORKS WITH MPLS/VPN WAN ..................... 4-14
VMWARE NSX GATEWAY QUESTIONS ............................................................................ 4-17
ARISTA LAUNCHES THE FIRST HARDWARE VXLAN TERMINATION DEVICE ................... 4-19
OVERVIEW OF HARDWARE GATEWAYS TO OVERLAY VIRTUAL NETWORKS ............... 4-21


LONG-DISTANCE OVERLAY VIRTUAL NETWORKS .................................................5-1


HOT AND COLD VM MOBILITY ........................................................................................... 5-3
VXLAN, OTV AND LISP ...................................................................................................... 5-8
VXLAN IS NOT A DATA CENTER INTERCONNECT TECHNOLOGY ............................... 5-12
EXTENDING LAYER-2 CONNECTION INTO A CLOUD ..................................................... 5-15
REVISITED: LAYER-2 DCI OVER VXLAN ........................................................................... 5-17
VXLAN AND OTV: I'VE BEEN SUCKERED ......................................................................... 5-19

ALTERNATE APPROACHES TO NETWORK VIRTUALIZATION .................................6-1


NETWORK VIRTUALIZATION AND SPAGHETTI WALL ....................................................... 6-3
SMART FABRICS VERSUS OVERLAY VIRTUAL NETWORKS .................................................. 6-6
NETWORK VIRTUALIZATION AT TOR SWITCHES? MAKES AS MUCH SENSE AS IP-OVER-APPN .................................................................................................................................... 6-12


FOREWORD
Network virtualization is long overdue, and it's the biggest change in data center networking since
the x86 computer and the Ethernet switch. The reason for a network is to provide a reliable and
performant connectivity service to the computers it connects. There will always be a need
for computers, and hence there will always be a need for a network and networking professionals.
*That* will never change. What will change, many times over, is the form factor and architecture of
how computing is both delivered and consumed. And when the computing architecture evolves, the
network architecture must evolve with it. Over the last decade, both on the server side and user
side, computing has fundamentally changed through the widespread deployment of server
virtualization and the revolution of cloud and mobile computing. To meet the new demands of
instant, anywhere access, on any device, for billions of users, the network needs to become more
virtualized, software-centric, and programmable.
To the reprobation of those invested in the past, overlay virtual networks will play a major role in
this new future of networking. Overlays allow software to construct a persistent and feature-rich
end-to-end networking service from any location, on any device, for any existing or new application.
Like it or not, overlay virtual networking is here to stay. As a networking professional in this new
era of mobile and cloud computing, you will be asked to plan, design, troubleshoot, and operate
networks that implement a variety of overlay-based networking architectures. In this book you will
find the fundamental technical knowledge to equip your career in the era of overlays, from the
world's best networking teacher and practitioner, Ivan Pepelnjak.


In your career you'll find many sources of information on overlay networking, and it is often difficult to
discern which of them are presented with a bias towards one specific solution, or come from someone with a
vested interest in its failure. In this book, Ivan does what he does better than anybody in the
industry: filtering out the hype and bias, and taking you straight to the information that matters and the
things you need to know. Nobody can deliver such comprehensive and wide-ranging material on this
subject better than Ivan Pepelnjak. Pat yourself on the back for acquiring the best assimilation of
independent technical content ever produced on how overlay virtual networking works, covering the
different solutions you will encounter in the marketplace. Now it's up to you to take this valuable
tool, learn from it, and put yourself on a path to lead and prosper in the next era of networking.

Brad Hedlund
Office of the CTO at VMware NSBU, and Blogger
http://BradHedlund.com


INTRODUCTION
Until Cisco launched VXLAN in 2011, server virtualization vendors used VLANs to create virtual
subnets between virtual machines, resulting in rigid architectures with tight coupling between
hypervisor virtual switches and adjacent physical switches. The rigidity of the resulting architecture
and VLAN scalability problems significantly hampered operational efficiency, triggering a flurry of
overlay virtual networking products that transport VM-level payloads across IP infrastructure.
The responses of traditional networking engineers were easy to predict:

Overlay virtual networking is nothing more than tunnels in disguise;

Tunnels are complex and hard to provision;

We'll lose QoS and end-to-end visibility.

It took years to debunk some of these misconceptions and prove that overlay virtual networks
make architectural sense (and even today you can see raging debates between proponents of
hardware-based network virtualization products and overlay virtual networking products). During those
years I wrote over fifty blog posts explaining the architectural details of overlay virtual networks,
design guidelines, and product details.
This book contains a collection of the most relevant blog posts describing overlay virtual networking
concepts, benefits and drawbacks, architectures, technical details and individual products. I cleaned
up the blog posts and corrected obvious errors and omissions, but also tried to leave most of the
content intact. The commentaries between the individual blog posts will help you understand the
timeline or the context in which a particular blog post was written.


The book covers these topics:

Introduction to overlay virtual networking concepts and architectures (Chapter 1);

Overlay virtual networking technical details (Chapter 2);

Product details, covering Cisco Nexus 1000V, VMware NSX, Microsoft Hyper-V network
virtualization and a few smaller vendors (Chapter 3);

Gateways between overlay virtual networks and physical networks (Chapter 4);

Challenges of long-distance overlay virtual networks (Chapter 5);

My opinions on alternate approaches to network virtualization (Chapter 6).

As always, please do feel free to send me any questions you might have; the best way to reach me
is to use the contact form on my web site (www.ipSpace.net).

Happy reading!
Ivan Pepelnjak
August 2014


OVERLAY VIRTUAL NETWORKING 101

IN THIS CHAPTER:
WHY IS NETWORK VIRTUALIZATION SO HARD?
VLANS ARE THE WRONG ABSTRACTION FOR VIRTUAL NETWORKING
TRANSPARENT BRIDGING (AKA L2 SWITCHING) SCALABILITY ISSUES
VMWARE VSWITCH - THE BASELINE OF SIMPLICITY
VIRTUAL SWITCHES - FROM SIMPLE TO SCALABLE
COMPLEXITY BELONGS TO THE NETWORK EDGE
DECOUPLE VIRTUAL NETWORKING FROM THE PHYSICAL WORLD
VIRTUAL NETWORKS: THE SKYPE ANALOGY
SOFT SWITCHING MIGHT NOT SCALE, BUT WE NEED IT
EMBRACE THE CHANGE ... RESISTANCE IS FUTILE
VXLAN AND EVB QUESTIONS


DOES IT MAKE SENSE TO BUILD NEW CLOUDS WITH OVERLAY NETWORKS?


HOW DO I START MY FIRST OVERLAY VIRTUAL NETWORKING PROJECT?
VIRTUAL NETWORKING IS MORE THAN VMS AND VLAN DUCT TAPE


Overlay virtual networking appeared at approximately the same time as OpenFlow and SDN, with
Cisco's introduction of VXLAN (MAC-over-IP encapsulation) in Nexus 1000V.
The movement that started as a simple hack to bypass the limitations of layer-2 switching (aka
bridging) quickly gained momentum; just a few years later all major virtualization platforms
(vSphere, Hyper-V, KVM, Xen) support overlay virtual networks, and most new products targeting
large-scale environments use this technology.
The architectural benefits of overlay virtual networking are easy to validate: Amazon VPC and
Microsoft Azure are using MAC-over-IP encapsulation to build public clouds spanning hundreds of
thousands of physical servers and running millions of virtual machines.
This chapter focuses on the fundamental principles of overlay virtual networking and its benefits as
compared to more traditional (usually VLAN-based) approaches. The subsequent chapters delve into
the technical details.


NEED EVEN MORE INFORMATION?


Check out my virtualization webinars or get in touch if you need a design review or a technology
recommendation.
The webinars to consider include:

Introduction to Virtualized Networking to start the journey.

Cloud Computing Networking if you need a broad technology overview;

Virtual Firewalls if you want to know more about appliance- and NIC-based virtual firewalls;

Overlay Virtual Networking if you're looking for in-depth architecture and product details;

VMware NSX Architecture if you're evaluating the feasibility of VMware NSX;

VXLAN Technical Deep Dive if you plan to build your cloud with VXLAN.


This blog post, written in September 2013, is a perfect introduction to the topic. It explains why it's
so hard to implement a scalable network virtualization solution.

WHY IS NETWORK VIRTUALIZATION SO HARD?


We've been hearing for the last few years how networking is the last bastion of rigidity in the
wonderful unicorn-flavored virtual world. Let's see why it's so much harder to virtualize
networks as opposed to compute or storage capacities (side note: it didn't help that virtualization
vendors had no clue about networking, but things are changing).
When you virtualize compute capacities, you're virtualizing RAM (a well-known problem for at least
40 years), CPU (same thing) and I/O ports (slightly trickier, but doable at least since Intel rolled out
80286 processors). All of these are isolated resources limited to a single physical server. There's
zero interaction or tight coupling with other physical servers, and there's no shared state, so it's a
perfect scale-out architecture; the only limiting factor is the management/orchestration system
(vCenter, System Center ...).
So-called storage virtualization is (in most cases) already a fake: hypervisor vendors are not
virtualizing storage, they're usually using a shared file system on LUNs someone already created for
them (architectures with local disk storage use some variant of a global file system with automatic
replication). I have no problem with that approach, but when someone boasts how easy it is to
create a file on a file system as compared to creating a VLAN (= LUN), I get mightily upset. (Side
note: why do we have to use VLANs? Because the hypervisor vendors had no better idea.)


There's limited interaction between hypervisors using the same file system as long as they only
read/write file contents. The moment a hypervisor has to change directory information (VMware) or
update the logical volume table (Linux), the node doing the changes has to lock the shared resource.
Due to SCSI limitations, the hypervisor doing the changes usually locks the whole shared storage,
which works really well; just ask anyone using large VMFS volumes accessed by tens of vSphere
hosts. Apart from the locking issues and shared throughput (SAN bandwidth and disk throughput)
between hypervisor hosts and storage devices, there's still zero interaction between individual VMs
or hypervisor hosts; scaling storage is as easy (or as hard) as scaling files on a shared file system.
In the virtual networking case, there was extremely tight coupling between virtual switches and
physical switches, and there always will be tight coupling between all the hypervisors running VMs
belonging to the same subnet (after all, that's what networking is all about), be it a layer-2 subnet
(VLAN/VXLAN/...) or a layer-3 routing domain (Hyper-V).
Because of the tight coupling, virtual networking is inherently harder to scale than virtual
compute or storage. Of course, the hypervisor vendors took the easiest possible route, used
simplistic VLAN-based layer-2 switches in the hypervisors and pushed all the complexity to the
network edge/core, while at the same time complaining how rigid the network is compared to their
software switches. Of course it's easy to scale out totally stupid edge layer-2 switches with no
control plane (that have zero coupling with anything else but the first physical switch) if someone
else does all the hard work.


Not surprisingly, once the virtual switches tried to do the real stuff (starting with Nexus 1000V),
things got incredibly complex (no surprise there). For example, Cisco's Nexus 1000V only handles up
to 128 hypervisor hosts (because the VSM runs the control plane protocols). VMware NSX is doing
way better because it decoupled the physical transport (IP) from the virtual networks:
controllers are used solely to push forwarding entries into the hypervisors when the VMs are started
or moved around.
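To make the last point more concrete, here is a deliberately simplified Python sketch (not an actual NSX or Nexus 1000V API, just an illustration of the principle) of the state a controller could push into each hypervisor: a mapping from (virtual network, VM MAC address) to the IP address of the hypervisor currently hosting that VM.

# Conceptual sketch of controller-pushed forwarding state (not a real product API):
# the controller knows where every VM runs, so it can hand each hypervisor a mapping
# from (virtual network, VM MAC) to the remote hypervisor (VTEP) IP address,
# removing any need for flooding-based MAC learning in the transport network.

forwarding_table = {}  # (vni, vm_mac) -> vtep_ip, as seen by one hypervisor

def vm_started_or_moved(vni, vm_mac, vtep_ip):
    """Called by the (hypothetical) controller whenever a VM starts or moves."""
    forwarding_table[(vni, vm_mac)] = vtep_ip

def lookup_remote_vtep(vni, dst_mac):
    """Hypervisor data path: where do I tunnel a frame for this destination?"""
    return forwarding_table.get((vni, dst_mac))  # None -> unknown destination

vm_started_or_moved(5001, "00:50:56:aa:bb:cc", "10.0.1.11")
print(lookup_remote_vtep(5001, "00:50:56:aa:bb:cc"))  # 10.0.1.11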
Summary: Every time someone tells you how network virtualization will get as easy as compute or
storage virtualization, be wary. They probably don't know what they're talking about.


Why do we need overlay virtual networks? Aren't VLANs good enough? This blog post starts the
journey that will tell you why VLANs might not be the best approach.

VLANS ARE THE WRONG ABSTRACTION FOR VIRTUAL NETWORKING
Are you old enough to remember the days when operating systems had no file system? Fortunately I
never had to deal with storing files on one of those (I was using punch cards), but miraculously you
can still find the JCL DLBL/EXTENT documentation online.
On the other hand, you probably remember the days when a SCSI LUN actually referred to a
physical disk connected to a computer, not an extensible virtual entity created through a point-and-click
exercise on a storage array.
You might wonder what this ancient history has to do with virtual networking. Don't worry,
we're getting there in a second ;)

When VMware started creating their first attempt at server virtualization software, they had readily
available storage abstractions (file system) and CPU abstraction (including MS-DOS support under
Windows, but the ideas were going all the way back to the VM operating system on IBM mainframes).
Creating virtual storage and CPU environments was thus a no-brainer, as all the hard problems were
already solved. Most server virtualization solutions use the file system recursively (virtual disk = file
on a file system) and abstract the CPU by catching and emulating privilege-mode instructions
(things got way easier with modern CPUs supporting virtualization in hardware). There was no
readily-available networking abstraction, so they chose the simplest possible option: VLANs (after
all, it's simple to insert a 12-bit tag into a packet and pretend it's no longer your problem).
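To illustrate how thin that abstraction really is, here is a minimal Python sketch of 802.1Q tagging: four extra bytes inserted after the MAC addresses, of which only 12 bits carry the VLAN ID. The frame contents are made up; the tag layout follows the 802.1Q format.

import struct

def add_dot1q_tag(frame: bytes, vlan_id: int, pcp: int = 0) -> bytes:
    """Insert an 802.1Q tag (TPID 0x8100 + 16-bit TCI) after the MAC addresses.

    The VLAN ID is just 12 bits of the TCI, which is why you get at most
    4094 usable segments (IDs 0 and 4095 are reserved).
    """
    if not 0 < vlan_id < 4095:
        raise ValueError("VLAN ID must be 1..4094")
    tci = (pcp << 13) | vlan_id            # 3-bit priority, 1-bit DEI (0), 12-bit VID
    tag = struct.pack("!HH", 0x8100, tci)  # TPID + TCI
    return frame[:12] + tag + frame[12:]   # dst MAC (6) + src MAC (6), then the tag

# A dummy untagged frame: dst MAC, src MAC, EtherType 0x0800 (IPv4), payload
untagged = bytes.fromhex("ffffffffffff" + "005056aabbcc" + "0800") + b"payload"
print(add_dot1q_tag(untagged, vlan_id=100).hex())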
The only problem with using VLANs is that they aren't the right abstraction. Instead of being like
files on a file system, VLANs are more like LUNs on storage arrays: someone has to provision them.
You can probably imagine how successful server virtualization would have been if you had to ask
storage administrators for a new LUN every time you needed a virtual disk for a new VM.
So every time I see how the "Software-Defined Data Center [...] provides unprecedented
automation, flexibility, and efficiency to transform the way you deliver IT", I can't help but read "it
took us more than a decade to figure out the right abstraction." Virtual networking is nothing else
but another application riding on top of IP (storage and voice people got there years before).

Layer-2 switching (the technology originally known as transparent bridging) has numerous
scalability challenges that limit its usability in large-scale virtual networking environments. This blog
post lists some of them.

TRANSPARENT BRIDGING (AKA L2 SWITCHING) SCALABILITY ISSUES
Stephen Hauser sent me an interesting question after the Data Center fabric webinar I did with
Abner Germanow from Juniper:
A common theme in your talks is that L2 does not scale. Do you mean that Transparent
(Learning) Bridging does not scale due to its flooding? Or is there something else that
does not scale?
As is oft the case, I'm not precise enough in my statements, so let's fix that first:
There are numerous layer-2 protocols, but when I talk about layer-2 (L2) scalability in the data center
context, I always talk about Ethernet bridging (also known under its marketing name, switching);
more precisely, transparent bridging that uses flooding of broadcast, unknown unicast, and multicast
frames (I love the BUM acronym) to compensate for the lack of host-to-switch and routing (MAC
reachability distribution) protocols.
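To make that definition concrete, here is a toy Python model of the forwarding logic (an illustration, not a switch implementation): source MAC addresses are learned from the data plane, and anything with an unknown or non-unicast destination is flooded out of every other port, because there is no routing protocol to tell the bridge where the destination lives.

from collections import defaultdict

class LearningBridge:
    """Toy model of transparent bridging: learn source MACs, flood the rest."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # MAC -> port, populated purely by data-plane learning

    def receive(self, in_port, src_mac, dst_mac):
        self.mac_table[src_mac] = in_port          # learn (or refresh) the source
        if dst_mac in self.mac_table:              # known unicast: forward on one port
            return {self.mac_table[dst_mac]} - {in_port}
        # Broadcast, multicast and unknown unicast (BUM): flood everywhere else,
        # because nothing ever told the bridge where the destination MAC lives.
        return self.ports - {in_port}

sw = LearningBridge(ports=["p1", "p2", "p3"])
print(sw.receive("p1", src_mac="A", dst_mac="B"))  # B unknown -> flooded to p2 and p3
print(sw.receive("p2", src_mac="B", dst_mac="A"))  # A learned  -> {'p1'}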


Large transparently bridged Ethernet networks face three layers of scalability challenges:
Dismal control plane protocol (Spanning Tree Protocol in its myriad incarnations), combined with
broken implementations of STP kludges. The forward-before-you-think behavior of Cisco's PortFast and
the lack of CPU protection on some of the switches immediately come to mind.
TRILL (or a proprietary TRILL-like implementation like FabricPath) would solve most of the STP-related
issues once implemented properly (ignoring STP does not count as a properly scalable
implementation in my personal opinion). However, we still have limited operational experience, and
some vendors implementing TRILL might still face a steep learning curve before all the loop
detection/prevention and STP integration features work as expected.
Flooding of BUM frames is an inherent part of transparent bridging and cannot be disabled if you
want to retain its existing properties that are relied upon by broken software implementations.
Every broadcast frame flooded throughout an L2 domain must be processed by every host
participating in that domain (where L2 domain means a transparently bridged Ethernet VLAN or
equivalent). Ethernet NICs do perform some sort of multicast filtering, but it's usually hash-based
and not ideal (for more information, read the multicast-related blog posts written by Chris Marget).
Finally, while Ethernet NICs usually ignore flooded unicast frames (those frames still eat the
bandwidth on every single link in the L2 domain, including host-to-switch links), servers running
hypervisor software are not that fortunate. The hypervisor requirements (number of unicast MAC
addresses within a single physical host) typically exceed the NIC capabilities, forcing hypervisors to
put physical NICs in promiscuous mode. Every hypervisor host thus has to receive, process, and oft
ignore every flooded frame. Some of those frames have to be propagated to one or more VMs
running in that hypervisor and further processed by them (assuming the frame belongs to the
proper VLAN).


In a typical every-VLAN-on-every-access-port design, every hypervisor host has to process every
BUM frame generated anywhere in the L2 domain (regardless of whether its VMs belong to the VLAN
generating the flood or not).
You might be able to make bridging scale better if you implemented a fully IP-aware L2 solution. Such
a solution would have to include an ARP proxy (or central ARP servers), IGMP snooping and a total ban
on other BUM traffic. TRILL as initially envisioned by Radia Perlman was moving in that direction, but
got thoroughly crippled and force-fit into the ECMP bridging rathole by the IETF working group.
Lack of addressing hierarchy is the final stumbling block. Modern data center switches (most of
them using the same hardware) support up to 100K MAC addresses, so other problems will probably
kill you way before you reach this milestone.
Finally, every L2 domain (VLAN) is a single failure domain (primarily due to BUM flooding). There are
numerous knobs you can try to tweak (storm control, for example), but you cannot change two
basic facts:

A software glitch in a switch that causes a forwarding (and thus flooding) loop involving core
links will inevitably cause a network-wide meltdown (due to the lack of a TTL field in L2 headers);

A software glitch (or virus/malware/you-name-it), or uncontrolled flooding started by any host or
VM attached to a VLAN will impact all other hosts (or VMs) attached to the same VLAN, as well
as all core links. A bug resulting in broadcasts will also impact the CPU of all layer-3 (IP)
switches with IP addresses configured in that VLAN.

You can use storm control to reduce the impact of an individual VM, but even the market leader
might have a problem or two with this feature.


Speaking about VLANs, bridging and scalability, let's see what the traditional virtualization solutions
did to address this problem (hint: nothing, as I explained in the following blog post written in late
2011).

VMWARE VSWITCH - THE BASELINE OF SIMPLICITY


If you're looking for a simple virtual switch, look no further than VMware's venerable vSwitch. It
runs very few control protocols (just CDP or LLDP, no STP or LACP), has no dynamic MAC learning,
and only a few knobs and moving parts, making it ideal for simple deployments. Of course you have to pay
for all that ease-of-use: designing a scalable vSwitch-based solution is tough (but then it all depends
on what kind of environment you're building).

HOW DID IT ALL START?


As always, there's a bit of history there. Like many other disruptive technologies (including
Netware and Windows networking), VMware entered the enterprise networks under the radar:
geeks playing with it and implementing totally undercover solutions.
It was important in those days to be able to connect an ESX host to the network with minimum
disruption (even if you had sub-zero networking skills). The decision to avoid STP and implement
split-horizon switching made perfect sense; running STP in an ESX host would get you banned in a
microsecond. vSwitch is also robust enough that you can connect it to a network that was
designed by someone who got all his networking skillz through the Linksys Web UI and it would still
work.
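For readers who have never looked inside a hypervisor switch, here is a rough Python sketch of the split-horizon behavior usually described for vSwitch; the port names and table layout are made up for illustration, but the two rules that matter are visible: vNIC MAC addresses come from VM configuration (no dynamic learning), and frames are never forwarded from one uplink to another, which is why the vSwitch cannot create a forwarding loop even without STP.

# Rough sketch of split-horizon forwarding (illustrative port names, not a real API):
# vNIC MACs are known from VM configuration, uplink-to-uplink forwarding never
# happens, and unknown destinations arriving from an uplink are simply dropped.

UPLINKS = {"vmnic0", "vmnic1"}
VM_PORTS = {"00:50:56:aa:00:01": "vm1-port",   # learned from VM configuration,
            "00:50:56:aa:00:02": "vm2-port"}   # not from observed traffic

def forward(in_port, dst_mac, is_broadcast=False):
    vm_targets = {p for p in VM_PORTS.values() if p != in_port}
    if is_broadcast:
        # Flood to local VMs; add exactly one uplink only for locally sourced frames.
        return vm_targets | ({"vmnic0"} if in_port not in UPLINKS else set())
    if dst_mac in VM_PORTS:
        return {VM_PORTS[dst_mac]}             # local VM: deliver directly
    # Unknown destination: send it upstream if a VM sourced it, drop it otherwise.
    return {"vmnic0"} if in_port not in UPLINKS else set()

print(forward("vmnic0", "00:50:56:aa:00:01"))   # {'vm1-port'}
print(forward("vmnic0", "00:50:56:ff:ff:ff"))   # set() -- never reflected upstream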


THE GROWING PAINS


In the meantime, VMware has grown from a tiny disruptive startup into a major IT company and THE
major virtualization vendor (it literally created that market), and became part of almost every
virtualized data center; but the vSwitch has failed to grow up.
vSwitch got some scalability enhancements (distributed vSwitch), but only on the management
plane; apart from a few features that are enabled in vDS and not in vSwitch, the two products use
the same control/data plane. There's some basic QoS (per-VM policing and 802.1p marking) and
some support for network management and troubleshooting (NetFlow, SPAN, remote SPAN). Still no
STP or LACP.
Lack of LACP is a particularly tough nut. Once you try to do anything a bit more complex, like proper
per-session load balancing, or achieving optimum traffic flow in an MLAG environment, you have to
carefully configure vSwitch and pSwitch just right. You can eventually squeeze the vSwitch into
those spots, and get it to work, but it will definitely be a tight fit, and it won't be nearly as reliable
as it could have been had vSwitch supported proper control-plane protocols.

IS IT JUST VMWARE?
Definitely not. Other virtual switches fare no better, and Open vSwitch is no more intelligent
without an external OpenFlow controller. At the moment, VMware's vSwitch is probably still the most
intelligent vSwitch shipping with a hypervisor.
The only reason XenServer supports LACP is that LACP support is embedded in the underlying
Linux kernel; but even then the LACP-based bonding is not officially supported.


MULTI-TENANT SUPPORT
vSwitch's multi-tenant support reflects its typical use case (virtualized enterprise data center). The
only virtual networking technology it supports is 802.1Q-based VLANs (using a single VLAN tag),
limiting you to 4000 logical networks (assuming the physical switches can support that many
VLANs). There's also no communication between the virtual switches and adjacent physical switches:
a vSwitch embedded in a vSphere host cannot tell the adjacent physical switch which VLANs it
needs.
vCDNI and VXLAN (both scale much better and offer a wider range of logical networks) are not part of
vSwitch. vCDNI is an add-on module using the VMsafe API, and VXLAN exists within Nexus 1000V, or as
a loadable kernel module on top of the VMware virtual distributed switch (vDS).
On top of all that, vSwitch assumes a friends-and-family environment. BPDUs generated by a VM can
easily escape into the wild and trigger BPDU guard on upstream switches; it's also possible to send
tagged packets from VMs into the network (implementing VLAN hopping would take a few extra
steps and a misconfigured physical network), and there's no per-VM broadcast storm control. Using
a vSwitch in a potentially hostile cloud environment is a risky proposition.

SCALABILITY? NO THANKS.
There is an easy way to deploy vSwitch in the worst-case scenario (any VM can be started on any hypervisor
host): configure all VM-supporting VLANs on all switch-to-server access trunks, effectively
turning the whole data center into a single broadcast domain. As hypervisor NICs operate in
promiscuous mode, every hypervisor receives and processes every flooded packet, regardless of its
VLAN and its actual target.


There are three factors that limit the scalability of such a design:
Reliance on bridging, which usually implies reliance on STP. STP is not necessarily a limiting
factor; you can create bridged networks with thousands of ports without having a single blocked link
if you have a well-designed spine & leaf architecture, and large core switches. Alternatively, you
could trust emerging technologies like FabricPath or QFabric.
Single broadcast domain. I don't want to be the one telling you how many hosts you can have in
a broadcast domain, so let's turn to the TRILL Problem and Applicability Statement (RFC 5556). Its section
2.6 (Problems Not Addressed) is very clear: a single bridged LAN supports around 1000 hosts. Due
to physical NICs being in promiscuous mode and all VLANs being enabled on all access trunks, VLAN
segmentation doesn't help us; effectively we still have a single broadcast domain. We're thus talking
about ~1000 VMs (regardless of the number of VLANs they reside in).
I'm positive I'll get comments along the lines of "I'm running 100,000 VMs in a single bridged
domain and they work just fine." Free soloing (rock climbing with zero protection) also works great
until the first fall. Seriously, I would appreciate all data points you're willing to share in the
comments.
Number of VLANs. Although vSphere supports the full 12-bit VLAN range, many physical switches
don't. The number of VLANs doesn't matter in a traditional virtualized data center, with only a few
(or maybe a few tens of) security zones, but it's a major showstopper in a public cloud deployment.
Try telling your boss that your solution supports only around 1000 customers (assuming each
customer wants to have a few virtual subnets) after replacing all the switches you bought last
year.
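A quick back-of-the-envelope calculation (the subnets-per-tenant figures are assumptions, not measurements) shows where the "around 1000 customers" number comes from:

# Back-of-the-envelope check of the "around 1000 customers" claim: with a
# 12-bit VLAN ID and a handful of subnets per tenant, the math runs out fast.
usable_vlans = 4094                 # 12-bit space minus reserved IDs 0 and 4095
for subnets_per_tenant in (2, 3, 4):
    print(subnets_per_tenant, "subnets/tenant ->",
          usable_vlans // subnets_per_tenant, "tenants max")
# 4 subnets per tenant already caps the cloud at roughly 1000 tenants,
# before accounting for VLANs the infrastructure itself consumes.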


CONCLUSIONS
The vSwitch is either the best thing ever invented (if you're running a small data center with a few
VLANs) or a major showstopper (if you're building an IaaS cloud). Use it in environments it was
designed for and you'll have a fantastically robust solution.
There are also a few things you can do in the physical network to improve the scalability of vSwitch-based
networks; I'll describe them in the next post.


Before finally starting the overlay virtual networking discussion, let's ask a simple question: do we
really need more than a simple layer-2 virtual switch? As always, the answer is "it depends", as I
explained in this post written in November 2011.

VIRTUAL SWITCHES - FROM SIMPLE TO SCALABLE


Dan sent me an interesting comment after watching a recording of my Data Center 3.0 webinar:
I have a different view regarding VMware vSwitch. For me it's the best thing that happened in
my network in years. The vSwitch is so simple, and it's so hard to break something in it,
that I let the server team do whatever they want (with one small rule: only one vNIC
per guest). I never have to configure a server port again.
As always, the right answer is "it depends": what kind of vSwitch you need depends primarily on
your requirements.
I'll try to cover the whole range of virtual networking solutions (from very simple ones to pretty
scalable solutions) in a series of blog posts, but before going there, let's agree on the variety of
requirements that we might encounter.
We use virtual switches in two fundamentally different environments today: virtualized data centers
on one end of the spectrum, and private and public clouds at the other end (and you're probably
somewhere between these two extremes).


VIRTUALIZED DATA CENTERS


You'd expect to see only a few security zones (and logical segments) in a typical small data center;
you might even see different applications sharing the same security zone.
The number of physical servers is also reasonably low (in the tens, maybe low hundreds, but definitely
not thousands), as is the number of virtual machines. The workload is more or less stable, and the
virtual machines are moved around primarily for workload balancing / fault tolerance / maintenance
/ host upgrade reasons.

PUBLIC AND PRIVATE CLOUDS


A cloud environment is a completely different beast. The workload is dynamic and unpredictable (after all,
the whole idea of cloudifying the server infrastructure revolves around the ability to start,
stop, move, grow and shrink the workloads instantaneously), there are numerous tenants, and each
tenant wants to have its own virtual networks, ideally totally isolated from other tenants.
The unpredictable workload places extra strain on the networking infrastructure due to the large-scale
virtual networks needed to support it.
You could limit the scope of the virtual subnets in a more static virtualized data center; after all, it
doesn't make much sense to have the same virtual subnet spanning more than one HA cluster (or at
most a few of them).
In a cloud environment, you have to be able to spin up a VM whenever a user requests it, and you
usually start the VM on the physical server that happens to have enough compute (CPU+RAM)
resources. That physical server can be sitting anywhere in the data center, and the tenant's logical
network has to be able to extend to it; you simply cannot afford to be limited by the geography of
the physical network.

HYBRID ENVIRONMENTS
These data centers can offer you the most fun (or headache) there is: a combination of traditional
hosting (with physical servers owned by the tenants) and IaaS cloud (running on hypervisor-powered
infrastructure) presents some very unique requirements; just ask Kurt (@networkjanitor)
Bales about his DC needs.

VIRTUAL MACHINE MOBILITY


One of the major (network-related) headaches we're experiencing in virtualized data centers is
the requirement for VM mobility. You cannot change the IP address of a running VM as you move it
between hypervisor hosts due to the broken TCP stack (or you'd lose all data sessions). The common
way to implement VM mobility without changes to the guest operating system is thus L2 connectivity
between the source and destination hypervisor host.


Figure 1-1: VM mobility usually requires VM-level L2 connectivity between hypervisors

In a virtualized enterprise data center you'd commonly experience a lot of live VM migration; the
workload optimizers (like VMware's DRS) constantly shift VMs around high availability clusters to
optimize the workload on all hypervisor hosts (or even shut down some of the hosts if the overall
load drops below a certain limit). These migration events are usually geographically limited: a
VMware HA cluster can have at most 32 hosts, and while it's prudent to spread them across two
racks or rows (for HA reasons), that's the maximum range that makes sense.


VM migration events are rare in public clouds (at least those that charge by usage). While the cloud
operator might care about server utilization, it's simpler to allocate resources statically when the
VMs are started, implement resource limits to ensure VMs can't consume more than what the users
paid for, and let the users perform their own workload balancing (or not).


OK, so maybe VLANs and layer-2 switching are the wrong approach. What are the alternatives? Let's
start with some fundamental architectural principles that made the global Internet as scalable as it
is.
The blog post was written in 2011 (before Cisco launched VXLAN) and thus refers to VMware's then-popular
solution (vCDNI), which used a totally non-scalable MAC-over-MAC approach. I retained the
now-irrelevant parts of the article to give you a historic perspective on what we had to argue about in
2011.
Finally, please allow me to point out that the virtualization vendors did exactly what I said they should
be doing (because it makes perfect sense, not because I was writing about it).

COMPLEXITY BELONGS TO THE NETWORK EDGE


Whenever I write about vCloud Director Networking Infrastructure (vCDNI), be it a rant or a more
technical post, I get comments along the lines of "What are the network guys going to do once the
infrastructure has been provisioned? With vCDNI there is no need to keep network admins full time."
Once we have a scalable solution that will be able to stand on its own in a large data center, most
smart network admins will be more than happy to get away from provisioning VLANs and focus on
other problems. After all, most companies have other networking problems beyond data center
switching. As for disappearing work, we've seen the demise of DECnet, IPX, SNA, DLSw and multi-protocol
networks (which are coming back with IPv6) without our jobs getting any simpler, so I'm
not worried about the jobless network admin. I am worried, however, about the stability of the


networks we are building, and that's the only reason I'm ranting about the emerging flat-earth
architectures.
In 2002 the IETF published an interesting RFC: Some Internet Architectural Guidelines and Philosophy
(RFC 3439), which should be mandatory reading for anyone claiming to be an architect of solutions
that involve networking (you know who you are). In the End-to-End Argument and Simplicity section
the RFC clearly states: "In short, the complexity of the Internet belongs at the edges, and the IP
layer of the Internet should remain as simple as possible."
We should use the same approach when dealing with virtualized networking: the complexity belongs
to the edges (hypervisor switches), with the intervening network providing the minimum set of
required services. I don't care whether the networking infrastructure uses layer-2 (MAC) addresses or
layer-3 (IP) addresses as long as it scales. Bridging does not scale, as it emulates a logical thick coax
cable. Either get rid of most bridging properties (like packet flooding) and implement proper MAC-address-based
routing without flooding, or use IP as the transport. I truly don't care.
Reading RFC 3439 a bit further, the next paragraphs explain Non-Linearity and Network
Complexity. To quote the RFC: "In particular, the largest networks exhibit, both in theory and in
practice, architecture, design, and engineering non-linearities which are not exhibited at smaller
scale." Allow me to paraphrase this for some vendors out there: just because it works in your lab
does not mean it will work at Amazon or Google scale.
The current state of affairs is just the opposite of what a reasonable architecture would be: VMware
has a barebones layer-2 switch (although it does have a few interesting features) with another non-scalable
layer (vCDNI) on top of (or below) it. The networking vendors are inventing all sorts of
kludges of increasing complexity to cope with that, from VN-Link/port extenders and EVB/VEPA to
large-scale L2 solutions like TRILL, FabricPath, VCS Fabric or 802.1aq, and L2 data center
interconnects based on VPLS, OTV or BGP MAC VPN.
I don't expect the situation to change on its own. VMware knows server virtualization is just a
stepping stone and is already investing in PaaS solutions; the networking vendors are more than
happy to sell you all the extra proprietary features you need just because VMware never
implemented a more scalable solution, increasing their revenues and lock-in. It almost feels like the
more "the network is in my way" complaints we hear, the happier everyone is: the virtualization vendors
because the blame is landing somewhere else, the networking industry because these complaints
give them a door opener to sell their next-generation magic (this time using a term borrowed from
the textile industry).
Imagine for a second that VMware or Citrix actually implemented a virtualized networking
solution using IP transport between hypervisor hosts. The need for new fancy boxes supporting
TRILL or 802.1aq would be gone; all you would need in your data center would be simple high-speed
L2/L3 switches. Clearly not a rosy scenario for the flat-fabric-promoting networking vendors, is it?
Is there anything you can do? Probably not much, but at least you can try. Sit down with the
virtualization engineers, discuss the challenges and figure out the best way to solve problems both
teams are facing. Engage the application teams. If you can persuade them to start writing scale-out
applications that can use proper load balancing, most of the issues bothering you will disappear on
their own: there will be no need for large stretched VLANs and no need for L2 data center
interconnects. After all, if you have a scale-out application behind a load balancer, nobody cares if
you have to shut down a VM and start it in a new IP subnet.


This blog post, written in December 2011, goes into a bit more detail and explains what the
decoupling between the physical network and the virtualized customer networks could look like. It
also documents the problems of Cisco's VXLAN implementation, which Cisco finally fixed in 2013.

DECOUPLE VIRTUAL NETWORKING FROM THE PHYSICAL WORLD
Isn't it amazing that we can build the Internet, run the same web-based application on thousands of
servers, give millions of people access to cloud services, and still stumble badly every time we're
designing virtual networks? No surprise: by trying to keep vSwitches simple (and their R&D and
support costs low), the virtualization vendors violate one of the basic scalability principles:
complexity belongs to the network edge.

VLAN-BASED SOLUTIONS
The simplest possible virtual networking technology (802.1Q-based VLANs) is also the least scalable,
because of its tight coupling between the virtual networking (and VMs) and the physical world.


Figure 1-2: VLAN-based virtual switch network segmentation

VLAN-based virtual networking uses bridging (which doesn't scale) and 12-bit VLAN tags (limiting you to
approximately 4000 virtual segments), and expects all switches to know the MAC addresses of all
VMs. You'll get localized unknown unicast flooding if a ToR switch experiences MAC address table
overflow, and massive core flooding if the same thing happens to a core switch.
In its simplest incarnation (every VLAN enabled on every server port on ToR switches), VLAN-based
virtual networking also causes massive flooding proportional to the total number of VMs in the
network.
VM-aware networking scales better (depending on the number of VLANs you have and the number
of VMs in each VLAN). The core switches still need to know all VM MAC addresses, but at least the
dynamic VLAN changes on the server-facing ports limit the amount of flooding on the switch-to-server
links; flooding becomes proportional to the number of VLANs active in a particular hypervisor
host, and the number of VMs in those VLANs.

Figure 1-3: Arista EOS VM Tracer uses CDP to detect vSphere hosts connected to ToR switches
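The proportionality argument is easy to model with a few (completely made-up) numbers; the sketch below only illustrates the relative difference between the two designs, not any real traffic measurement:

# Toy model (all numbers are assumptions) of the flooding proportionality argument:
# without VLAN pruning every host sees BUM traffic from every VM in the data
# center; with VM-aware pruning it only sees the VLANs it actually hosts.
total_vms = 10_000
vlans_in_dc = 200
vms_per_vlan = total_vms // vlans_in_dc
bum_pps_per_vm = 2                          # assumed background BUM rate per VM
vlans_on_this_host = 10                     # VLANs with at least one local VM

all_vlans_everywhere = total_vms * bum_pps_per_vm
vm_aware_pruning = vlans_on_this_host * vms_per_vlan * bum_pps_per_vm
print(all_vlans_everywhere, "pps vs", vm_aware_pruning, "pps per host")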

OTHER BRIDGING-BASED SOLUTIONS


vCDNI is the first solution that decouples at least one of the aspects of the virtual networks from the
physical world. It uses MAC-in-MAC encapsulation and thus hides the VM MAC addresses from the
network core. vCDNI also removes the VLAN limitations, but causes massive flooding due to its
suboptimal implementation; the amount of flooding is yet again proportional to the total number of
VMs in the vCDNI domain.
Provider Backbone Bridging (PBB) or VPLS implemented in ToR switches fares better. The core
network needs to know the MAC addresses (or IP loopbacks) of the ToR switches; all the other
virtual networking details are hidden.

Figure 1-4: Carrier Ethernet or MPLS as VLAN replacement

Major showstopper: dynamic provisioning of such a network is a major pain; I'm not aware of any
commercial solution that would dynamically create VPLS instances (or PBB SIDs) in ToR switches
based on VLAN changes in the hypervisor hosts ... and the dynamic adaptation to VLAN changes is a
must if you want the network to scale.
While PBB or VPLS solves the core network address table issues, the MAC address table size in ToR
switches cannot be reduced without dynamic VPLS/PBB instance creation. If you configure all VLANs
on all ToR switches, the ToR switches have to store the MAC addresses of all VMs in the network (or
risk unicast flooding after the MAC address table experiences thrashing).

MAC-OVER-IP SOLUTIONS
The only proper way to decouple virtual and physical networks is to treat virtual networking like yet
another application (like VoIP, iSCSI or any other infrastructure application). Virtual switches that
can encapsulate L2 or L3 payloads in UDP (VXLAN) or GRE (NVGRE/Open vSwitch) envelopes appear
as IP hosts to the network; you can use the time-tested large-scale network design techniques to
build truly scalable data center networks.
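
To illustrate how thin the encapsulation layer really is, here's a minimal sketch based on the VXLAN header layout from RFC 7348 (not on any vendor's implementation); the transport network sees nothing but an ordinary UDP packet exchanged between two hypervisor IP addresses:

import socket
import struct

VXLAN_UDP_PORT = 4789                           # IANA-assigned VXLAN port

def vxlan_encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the 8-byte VXLAN header to a VM-generated Ethernet frame."""
    # First 32-bit word: flags (0x08 = "VNI present") followed by reserved bits
    # Second 32-bit word: 24-bit VNI in the upper bits, last byte reserved
    header = struct.pack("!II", 0x08 << 24, vni << 8)
    return header + inner_frame

def send_to_remote_hypervisor(inner_frame: bytes, vni: int, remote_ip: str) -> None:
    """Hand the encapsulated frame to the IP transport network as plain UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(vxlan_encapsulate(inner_frame, vni), (remote_ip, VXLAN_UDP_PORT))
    sock.close()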

Figure 1-5: VXLAN uses IP network to transport VM-level MAC frames

However, MAC-over-IP encapsulation might not bring you to seventh heaven. VXLAN does not have
a control plane and thus has to rely on IP multicast to perform flooding of virtual MAC frames. All
hypervisor hosts using VXLAN have to join VXLAN-specific IP multicast groups, creating lots of (S,G)
and (*,G) entries in the core network. The virtual network data plane is thus fully decoupled from
the physical network; the control plane isn't.
A truly scalable virtual networking solution would require no involvement from the transport IP
network. Hypervisor hosts would appear as simple IP hosts to the transport network, and use only
unicast IP traffic to exchange virtual network payloads; such a virtual network would use the same
transport mechanisms as today's Internet-based applications and could thus run across huge
transport networks. I'm positive Amazon has such a solution, and it seems Nicira's Network
Virtualization Platform is another one (but I'll believe that when I see it).

Figure 1-6: Controller-based virtual networking architecture (Nicira NVP, now VMware NSX)

The idea of overlay virtual networking was extremely hard to grasp for many networking engineers.
The "overlay virtual networking is like Skype" analogy worked pretty well (even though some
vendors disagreed with it).
The blog post was written in May 2012 and still refers to Nicira NVP (which later became VMware
NSX).

VIRTUAL NETWORKS: THE SKYPE ANALOGY


I usually use the "Nicira is the Skype of virtual networking" analogy when describing the differences
between Nicira's NVP and traditional VLAN-based implementations. Cade Metz liked it so much he
used it in his "What Is a Virtual Network? It's Not What You Think It Is" article, so I guess a blog post
is long overdue.
Before going into more details, you might want to browse through my Cloud Networking Scalability
presentation (or watch its recording); the crucial slide is this one:

Figure 1-7: Virtual networking architectural models

IN THE BEGINNING, THERE WAS A PATCH CORD


Typical virtualized data centers we're seeing today are no better than manual service exchanges
using cord pairs to connect the users: the hypervisor virtual switches using VLANs to create virtual
networks are too simplistic to tell the network what they need, so the networking team has to
provision the required VLANs manually.

Figure 1-8: Good morning, which VLAN would you like to talk with today? (source: Wikipedia)

The VM-aware networking is an interesting twist in the story: the exchange operator is
listening to the user traffic and trying to figure out who they want to talk with.

AUTOMATIC SERVICE EXCHANGES FOR VIRTUAL NETWORKS


Following the great example of telephone exchange vendors that heaped tons of complexity into
their gear (ensuring hefty margins and pricey support contracts), the networking vendors are trying
to persuade you that you should keep the edge (hypervisors) as simple as possible and let the
network (= their gear) deal with the complexities of scaling VLANs and L2 switching.
Does it make sense? Let's see: to get a somewhat scalable VLAN-based solution, you'd need at
least the following components:

A signaling protocol between the hypervisors and ToR switches that would tell the ToR switches
which VLANs the hypervisors need. Examples: EVB (802.1Qbg) or VM-FEX.

Large-scale multipath bridging technology. Examples: SPB (802.1aq) or TRILL.

VLAN pruning protocol. Examples: MVRP (802.1ak) or VTP pruning. SPB might also offer
something similar with service instances.

VLAN addressing extension, and automatic mapping of hypervisor VLANs into a wider VLAN
address space used in the network core. Q-in-Q (802.1ad) or MAC-in-MAC (802.1ah) could be
used as the wider address space, and I have yet to see ToR gear performing automatic VLAN
provisioning.

It might be just me, but looking at this list, RFC 1925 comes to mind (with sufficient thrust, pigs fly
just fine).
To understand the implications of the ever-increasing complexity vendors are throwing at us, go through
the phenomenal presentation Randy Bush had @ NANOG26, in which he compared the complexities
of voice switches with those of IP routers. The last slide of the presentation is especially relevant to
the virtual networking environment:

With enough complexity we strongly suspect we can operate [whatever environment] in polynomial time and dollars.

We are working on a proof that [whatever environment] can be made to be NP hard [the list of
emerging technologies you need to scale bridging is a great move in the right direction].

And then you'll just wonder where your margins went [sounds familiar, right?]

ENTER THE SKYPE ERA


Going back to the voice world: eventually someone figured out it's way simpler to move the voice
processing complexity to the end-devices (VoIP phones) and use a simple and cheap transport (the
Internet) between them.
You don't think VoIP scales better than traditional voice? Just compare the costs of doing a Skype
VoIP transatlantic call with the costs of a traditional voice call from two decades ago (international
voice calls became way cheaper in the meantime, partly because most carriers started
using VoIP for long-distance trunks). Enough said.
We can watch the same architectural shift happening in the virtual networking world: VXLAN,
NVGRE and STT are solutions that move the virtual networking complexity to the hypervisor, and
rely on proven, simple, cheap and reliable IP transport in the network. No wonder the networking
companies like you more if you use VLAN-based L2 hypervisor switches (just like the Alcatels, Lucents
and Nortels of the world preferred that you buy stupid phones and costly phone exchanges).
Does that mean that EVB, TRILL, and other similar technologies have no future? Absolutely not.
The networking industry made tons of money deploying RSRB, DLSw and CIPs in SNA environments
years after it was evident TCP/IP-based solutions (mostly based on Unix-based minicomputers) offered
more flexible services for a way lower price. Why should it be any different this time?

Do we need virtual switches in the hypervisors or would it be better to implement all the switching
functionality in the hardware network edge? It turns out we need virtual switches anyway, so we
might as well implement the complex functionality in software, not in hardware.
The blog post was written in August 2011; in the meantime Cisco and VMware implemented vMotion
support for VM-FEX (VM bypassing the hypervisor and accessing a virtualized physical NIC directly).
Youll also notice that I wasnt explicitly arguing for the overlay virtual networking approach, but just
for more intelligence in the virtual switches.

SOFT SWITCHING MIGHT NOT SCALE, BUT WE NEED IT


Following a series of soft switching articles written by Nicira engineers (hint: they are using a similar
approach as Juniper's QFabric marketing team), Greg Ferro wrote a scathing Soft Switching Fails at
Scale reply. While I agree with many of his arguments, the sad truth is that with the current state of
server infrastructure virtualization we need soft switching regardless of the hardware vendors'
claims about the benefits of 802.1Qbg (EVB/VEPA), 802.1Qbh (port extenders) or VM-FEX.
A virtual switch embedded in a typical hypervisor OS serves two purposes: it does (usually abysmal)
layer-2 forwarding and (more importantly) hides the details of the physical hardware from the VM.
Virtual machines think they work with a typical Ethernet NIC, usually based on a well-known
chipset like Intel's 82545 controller or the AMD Lance controller, or you could use special drivers that
allow the VM to interact with the hypervisor more effectively (for example, VMware's VMXNET
driver).

Figure 1-9: Networking stack, from VM application to hypervisor physical NIC

In both cases, the details of the physical hardware are hidden from the VM, allowing you to deploy
the same VM image on any hypervisor host in your data center (or cloudburst it if you believe in that
particular mythical beast), regardless of the host's physical Ethernet NIC. The hardware abstraction
also makes the vMotion process run smoothly: the VM does not need to re-initialize the physical
hardware once it's moved to another physical host. VMware (and probably most other hypervisors)
solves the dilemma in a brute-force way: it doesn't allow you to vMotion a VM that's communicating
directly with the hardware using VMDirectPath.
The hardware abstraction functionality is probably way more resource-consuming than the simple L2
forwarding performed by the virtual switches; after all, how hard could it be to do a hash table
lookup, token bucket accounting, and switch a few ring pointers?

The virtualized networking hardware also allows the hypervisor host to perform all sorts of memory
management tricks. Most modern NICs use packet buffer rings to exchange data between the
operating system and the NIC; both parties (NIC and the CPUs) can read or write the ring structures
at any time. Allowing a VM to talk directly with the physical hardware effectively locks it into the
physical memory, as the hypervisor can no longer control how the VM has set up the NIC hardware
and the ring structures; the Ethernet NIC can write into any location belonging to the VM it's
communicating with at any time.
I am positive there are potential technical solutions to all the problems I've mentioned, but they are
simply not available on any server infrastructure virtualization platform I'm familiar with. The
vendors deploying new approaches to virtual networking thus have to rely on a forwarding element
embedded in the hypervisor kernel, like the passthrough VEM module Cisco is using in its VM-FEX
implementation.
In my opinion, it would make way more sense to develop a technology that tightly integrates
hypervisor hosts with the network (EVB/VDP parts of the 802.1Qbg standard) than to try to push a
square peg into a round hole using VEPA or VM-FEX, but we all know that's not going to happen.
Hypervisor vendors don't seem to care, and the networking vendors want you to buy more of their
gear.

In early 2012 it was already evident that we could not avoid the reality of overlay virtual networking.
The Borg tagline was thus a no-brainer.

EMBRACE THE CHANGE ... RESISTANCE IS FUTILE


After all the laws-of-physics-are-changing hype it must have been anticlimactic for a lot of people to
realize what Nicira is doing (although I've been telling you that for months). Not surprisingly, there
were the usual complaints and twitterbursts:

It's just an overlay solution;

It's yet another tunneling protocol;

It doesn't have end-to-end QoS;

It's a simple solution using too-complex technology;

Why are they playing at the edge instead of solving the whole problem?

All of these complaints have merit ... and I've heard them at least three or four times:

When we started encapsulating SNA in TCP/IP using RSRB and later DLSw;

When we started replacing voice switches with VoIP and transporting voice over IP networks;

When we replaced Frame Relay and ATM switches with MPLS/VPN.

Interestingly, I don't remember a huge outcry when we started using IPsec to build private networks
over the Internet ... maybe the immediate cost savings made everyone forget we were actually
building tunnels with no QoS.

Anyhow, we've proven time and again in the last 20+ years that the only way to scale a networking
solution is to push the complexity to the edge and to decouple the edge from the core (in the case of
virtual networks, decouple them from the physical ones).
Assuming one could design the whole protocol stack from scratch, one could do a proper job of
eliminating all the redundancies. Given the fact that the only ubiquitous transport we have today is
IP, and that you can't expect the equipment vendors to invest in anything but Ethernet+IP in
the foreseeable future, the only logical conclusion is to use IP as the transport for your virtual
networking data ... like any other application is doing these days. It obviously works well enough for
Amazon.
You have to use transport over IP if you want the solution to scale ... or a completely
revamped layer-2 forwarding paradigm, which is not impossible, merely impractical in a
reasonable timeframe ... but of course OpenFlow will bring us there ;)
I'm not saying Nicira's solution is the right one. I'm not saying GRE or VXLAN or NVGRE or
something else is the right tunneling protocol. I'm not saying transporting Ethernet frames in IP
tunnels is a good decision; I would prefer to have full IP routing in the hypervisors and transport IP
datagrams, not L2 frames, between hypervisor hosts. I'm also not saying IP is the right transport
protocol; it's just the only scalable one we have today.

However, I'm positive that the only way to build scalable virtual networks (illustrated with a short sketch after the following list) is to:

Split hypervisor host addressing (which is visible in the core) from VM addressing (which is only
visible to hypervisors);

Use simple routed core transport which allows the edge (hypervisor) addresses to be aggregated
for scalability;

Remove all VM-related state from the transport core;

Use a proper control plane that will minimize the impact of stupidities we have to deal with if we
have to build L2 virtual networks.
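
The following sketch (hypothetical addressing plan, standard Python ipaddress module) illustrates the first two points: the transport core carries a few aggregated hypervisor prefixes, while the VM-level state stays in per-hypervisor mapping tables and never touches the core:

import ipaddress

# Hypothetical plan: all rack-1 hypervisor transport addresses come from one prefix
rack1_prefix = ipaddress.ip_network("10.0.1.0/26")
rack1_hypervisors = [ipaddress.ip_address(f"10.0.1.{i}") for i in range(1, 41)]

# The core only has to carry the per-rack prefix, regardless of the host count
assert all(host in rack1_prefix for host in rack1_hypervisors)

# VM addresses appear only in edge mapping tables: (VM MAC, segment) -> hypervisor IP
vm_to_hypervisor = {
    ("00:50:56:aa:bb:01", 5001): "10.0.1.7",
    ("00:50:56:aa:bb:02", 5001): "10.0.2.13",
}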

But, as always, this is just my personal opinion, and I'm known to be biased.

The blog post I wrote in early 2012 is an ideal conclusion to the architectural part of this chapter. It
compares overlay virtual networks (VXLAN) with VM-aware ToR network virtualization (EVB).

VXLAN AND EVB QUESTIONS


Wim (@fracske) De Smet sent me a whole set of very good VXLAN- and EVB-related questions that
might be relevant to a wider audience.
If I understand you correctly, you think that VXLAN will win over EVB?
I wouldn't say they are competing directly from the technology perspective. There are two ways you
can design your virtual networks: (a) smart core with simple edge (see also: voice and Frame Relay
switches) or (b) smart edge with simple core (see also: Internet). EVB makes option (a) more
viable; VXLAN is an early attempt at implementing option (b).
When discussing virtualized networks I consider the virtual switches in the hypervisors the
network edge and the physical switches (including top-of-rack switches) the network core.
Historically, option (b) (smart edge with simple core) has been proven to scale better ... the largest
example of such an architecture is allowing you to read my blog posts.
Is it correct that EVB isn't implemented yet?
Actually, it is: IBM has just launched its own virtual switch for VMware ESX (a competitor to Nexus
1000V) that has limited EVB support (the way I understand the documentation, it seems to support
VDP, but not the S-component).

Update August 2014: In the meantime, Juniper and HP started offering EVB on their ToR
switches.

But VXLAN has its limitations; for example, only VXLAN-enabled VMs will be able to
speak to each other.
Almost correct. VMs are not aware of VXLAN (they are thus not VXLAN-enabled). From the VM NIC
perspective the VM is connected to an Ethernet segment, which could be (within the vSwitch)
implemented with VLANs, VXLAN, NVGRE, STT or something else.
At the moment, the only implemented VXLAN termination point is Nexus 1000V, which means that
only VMs residing within ESX hosts with Nexus 1000V can communicate over VXLAN-implemented
Ethernet segments. Some vendors are hinting they will implement VXLAN in hardware (switches),
and Cisco already has the required hardware in Nexus 7000 (because VXLAN has the same header
format as OTV).
Update August 2014: Several vendors offer software and hardware VXLAN gateways. Refer
to the Gateways to Overlay Virtual Networks chapter for more details.
VXLAN encapsulation will also take some CPU cycles (thus impacting your VM
performance).
While VXLAN encapsulation will not impact VM performance per se, it will eat CPU cycles that could
be used by VMs. If your hypervisor host has spare CPU cycles, VXLAN overhead shouldn't matter; if
you're pushing it to the limits, you might experience performance impact.

However, the elephant in the room is TCP offload. It can drastically improve I/O performance
(and reduce CPU overhead) of network-intensive VMs. The moment you start using VXLAN, TCP
offload is gone (most physical NICs can't insert the VXLAN header during TCP segmentation), and
the TCP stack overhead increases dramatically.
If your VMs are CPU-bound you might not notice; if they generate lots of user-facing data, lack of
TCP offload might be a killer.
I personally see VXLAN as an end-to-end solution where we can't interact on the network
infrastructure anymore. For example, how would these VMs be able to connect to the
first-hop gateway?
Today you can use VXLAN to implement closed virtual segments that can interact with the outside
world only through VMs with multiple NICs (a VXLAN-backed NIC and a VLAN-backed NIC), which
makes it perfect for environments where firewalls and load balancers are implemented with VMs
(example: VMware's vCloud with vShield Edge and vShield App). As said above, VXLAN termination
points might appear in physical switches.
With EVB we would still have full control and could do the same things we're doing today
on the network infrastructure, and the network will be able to automatically provide the
correct VLANs on the correct ports.
That's a perfect summary. EVB enhances today's VLAN-backed virtual networking infrastructure,
while overlay virtual networks completely change the landscape.
Is then the only advantage of VXLAN that you can scale better because you don't have
the VLAN limitation?

VXLAN and other MAC-over-IP solutions have two advantages: they allow you to break through the
VLAN barrier (but so do vCDNI, Q-in-Q or Provider Backbone Bridging), but they also scale better
because the core network uses routing, not bridging. With MAC-over-IP solutions you don't need
novel L2 technologies (like TRILL, FabricPath, VCS Fabric or SPB), because they run over an IP core
that can be built with existing equipment using well-known (and well-tested) designs.

Overlay virtual networks might be as ubiquitous in a decade as Skype is today, but would it make
sense to use them in a private or public cloud design today? As always, it depends.

DOES IT MAKE SENSE TO BUILD NEW CLOUDS WITH OVERLAY NETWORKS?
TL&DR Summary: It depends on your business model
With the explosion of overlay virtual networking solutions (with every single reasonably-serious
vendor having at least one) one might get the feeling that it doesn't make sense to build greenfield
IaaS cloud networks with VLANs. As usual, there's a significant difference between theory and
practice.
You should always consider the business requirements before launching on a technology
crusade. IaaS networking solutions are no exception.
If you plan to sell your services to customers with complex application stacks, overlay virtual
networks make perfect sense. These customers usually need multiple internal networks and an
appliance between their internal networks and the outside world. If you decide to implement the
Internet-facing appliance with a VM-based solution, and all subnets behind the appliance with
overlay virtual networks, you're almost done.

Figure 1-10: Always focus on the services you're selling and the needs of your customers

Customers buying a single VM, and maybe access to a central MySQL or SQL Server database, are a
totally different story. Having a subnet and a VM-based appliance for each customer paying for a
single VM makes absolutely no sense. We need something similar to PVLANs, and the only overlay
virtual networking product with a reasonably simple PVLAN implementation is VMware NSX for

Multiple Hypervisors. If you want to use any other hypervisor/virtual networking platform, you have
to get creative:

Use a single subnet (VLAN- or overlay-based) and protect individual customer VMs with VM NIC
firewall (or iptables)

Figure 1-11: VM NIC isolation options available on major hypervisor platforms

When using an overlay-based subnet for numerous single-VM customers, use a simple L2 or L3
gateway to connect the subnet to the outside world. Most overlay solutions include hardware or
software gateways, and a 2-NIC Linux VM will easily route 1Gbps of traffic with a single vCPU.

Worst case, use small PVLANs. There's no need for large or stretched VLANs if every customer
has a single VM, more so if you don't give the customers fixed IP addresses but force them to
rely on DNS.

You should always deploy a new and untested technology in a small pilot. This blog post (written in
May 2014) explains how you might do that with overlay virtual networking. The product packaging
and licensing details were still valid in summer 2014, but you might want to check them with the
virtualization vendor of your choice.

HOW DO I START MY FIRST OVERLAY VIRTUAL NETWORKING PROJECT?
After the Designing Private Cloud Infrastructure workshop I had in Slovenia last week (in a packed
room of ~60 people), someone approached me with a simple question: "I like the idea of using
overlay virtual networks in my private cloud, but where do I start?"

PREREQUISITES
As always, it makes sense to start with the prerequisites.
If you're fortunate enough to run Hyper-V 3.0 R2, you already have all you need: Hyper-V Network
Virtualization is included in Hyper-V 3.0, and configurable through the latest version of System
Center (I doubt you'd want to write PowerShell scripts to get your first pilot project off the ground).
vSphere users are having a slightly harder time. VXLAN is part of the free version of Nexus 1000V,
but you still need an Enterprise Plus vSphere license to get the distributed virtual switch functionality
needed by Nexus 1000V, and you have to configure VXLAN segments through the Nexus 1000V CLI
(or write your own NETCONF scripts).

VXLAN configurable through vShield Manager is also included in vCNS (a separate license) starting
with release 5.1. vCNS relies on the distributed virtual switch and thus requires an Enterprise Plus license.
In Linux environments, use the GRE tunneling available in Open vSwitch. OpenStack's default Neutron
plugin can configure inter-hypervisor tunnels automatically (just don't push it too far).
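
If you want to see what the OpenStack Neutron plugin does behind the scenes, here's a minimal sketch (hypothetical bridge name and documentation-range IP address) that creates a GRE tunnel toward another hypervisor with standard ovs-vsctl commands; repeat on the other host with the remote_ip pointing back:

import subprocess

def sh(command: str) -> None:
    """Run a shell command and fail loudly if it doesn't succeed."""
    subprocess.run(command, shell=True, check=True)

# Integration bridge the VM tap interfaces are attached to
sh("ovs-vsctl --may-exist add-br br-int")

# GRE tunnel port pointing at the peer hypervisor's transport IP address
sh("ovs-vsctl add-port br-int gre0 -- "
   "set interface gre0 type=gre options:remote_ip=192.0.2.12")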

WHERE WOULD YOU USE OVERLAY VIRTUAL NETWORKS?


The obvious place to start is as far away from production as possible. You could use overlay virtual
networks to implement development, staging or testing environments.
Ideally, you'd find a development group (or a developer) willing to play with new concepts, set up a
development environment for them (including virtual segments and network services), and help
them move their project all the way to production, creating staging and testing virtual segments and
services on the fly (warning: some programming required; also check out Cloudify).

Finally, don't forget that you still have to use reasonable network design techniques even if you can
create subnets on demand and configure them with a click of a mouse. This fact is slightly more
appreciated today than it was in the heady early days of Software Defined Data Centers in 2012
when I wrote this blog post.

VIRTUAL NETWORKING IS MORE THAN VMS AND VLAN DUCT TAPE
VMware has a fantastic-looking cloud provisioning tool: vCloud Director. It allows cloud tenants to
deploy their VMs and create new virtual networks with a click of a mouse (the underlying network
has to provide a range of VLANs, or you could use VXLAN or vCDNI to implement the virtual
segments).
Needless to say, when engineers not familiar with the networking intricacies create point-and-click
application stacks without firewalls and load balancers, you get some interesting designs.
The following one seems to be particularly popular. Assuming your application stack has three layers
(web servers, app servers and database servers), this is how you are supposed to connect the VMs:

Figure 1-12: Virtual networking gone wild

When I heard about this design being discussed in VMware training I politely smiled (and one of
our CCIEs attending that particular class totally wrecked it). When I saw the same design on a slide
with Cisco's logo on it, my brain wanted to explode.
Let's see if we can list all the things that are wrong with this design:
It's a security joke. Anyone penetrating your web servers gets a free and unlimited pass to try
and hack the app servers. Repeat recursively through the whole application stack.
How will you manage the servers? Usually we'd use SSH to access the servers. How will you
manage the app servers that are totally isolated from the rest of the network? Virtual console? Fine
with me.

How will you update your application code? Effectively this is the same question as above,
without the virtual console "Get Out of Jail" card.
How will you download operating system patches? Pretty interesting one if you happen to
download them from the Internet. Will the database servers go through the app servers and through
the web servers to access the Internet? Will you configure proxy web servers on every layer?
IP routing in vShield Edge (that you're supposed to be using as the firewall, router and load
balancer) is another joke. It supports static routes only. Even if you decide to go through multiple
layers of VMs to get to the outside world, trying to get the return packet forwarding to work will fry
your brain.
VMware NSX Edge Services Router supports routing protocols, but that doesn't make this
particularly broken design any more valid.

How will you access common services? Let's say you use company-wide LDAP services. How will
the isolated VMs access them? Will you create yet another segment and connect all VMs to it ...
exposing your crown jewels to the first intruder that manages to penetrate the web servers? How
about database mirroring or log shipping?
I'm positive I've forgotten at least a few other issues.

SUMMARY
Just because you can design your virtual application stack with a mouse doesn't mean that you can
forget the basic network design principles.
It doesn't matter if you use VLANs or some other virtual networking technology. It doesn't matter if
you use physical firewalls and load balancers or virtual appliances: if you want to build a proper
application stack, you need the same functional components you'd use in the physical world, wired
in approximately the same topology.

OVERLAY VIRTUAL NETWORKING TECHNICAL DETAILS

IN THIS CHAPTER:
VIRTUAL NETWORKING IMPLEMENTATION TAXONOMY
A DAY IN A LIFE OF AN OVERLAID VIRTUAL PACKET
CONTROL PLANE PROTOCOLS IN OVERLAY NETWORKS
VXLAN, IP MULTICAST, OPENFLOW AND CONTROL PLANES
VXLAN SCALABILITY CHALLENGES
IGMP AND PIM IN MULTICAST VXLAN TRANSPORT NETWORKS
PVLAN, VXLAN AND CLOUD APPLICATION ARCHITECTURES
VM-LEVEL IP MULTICAST OVER VXLAN
VXLAN RUNS OVER UDP DOES IT MATTER?
NVGRE: BECAUSE ONE STANDARD JUST WOULDN'T BE ENOUGH

DO WE REALLY NEED STATELESS TRANSPORT TUNNELING (STT)


COULD MPLS-OVER-IP REPLACE VXLAN OR NVGRE?
ARE OVERLAY NETWORKING TUNNELS A SCALABILITY NIGHTMARE?
OVERLAY NETWORKS AND QOS FUD
MICE, ELEPHANTS AND VIRTUAL SWITCHES
HOW MUCH DATA CENTER BANDWIDTH DO YOU REALLY NEED?
CAN WE REALLY IGNORE SPAGHETTI AND HORSESHOES?
TTL IN OVERLAY VIRTUAL NETWORKS
VMOTION AND VXLAN

This chapter describes numerous technical details of overlay virtual networking, starting with virtual
networking taxonomy and control plane challenges.
Further details include:

Step-by-step description of packet forwarding process;

VXLAN IP multicast details and related scalability challenges;

IGMP and PIM in VXLAN-based overlay virtual networks;

Encapsulation details of VXLAN, NVGRE, STT and MPLS-over-IP;

QoS challenges and transport fabric design considerations;

TTL handling in layer-2 and layer-3 overlay virtual networking implementations.

MORE INFORMATION

Watch the Overlay Virtual Networking webinar (and the Following Packets across Overlay Virtual
Networks addendum).

Check out cloud computing and networking webinars and webinar subscription.

Use the ExpertExpress service if you need a short online consulting session, technology discussion or a
design review.

Overlay virtual networking implementations are as diverse as simple bridges and complex routers:
they range from simple Ethernet emulations to complex IP routing environments. This blog post
(written in June 2014) tries to categorize them.

VIRTUAL NETWORKING IMPLEMENTATION TAXONOMY


Overlay virtual networks (and other virtual networking implementations) implement either layer-2
segments or layer-3 networks. Layer-2-based implementations might include layer-3 support, either
in centralized or distributed form.

LAYER-2 OR LAYER-3 NETWORKS?


Some virtual networking solutions emulate a thick coax cable (more precisely, a layer-2 switch), giving
their users the impression of having regular VLAN-like layer-2 segments.
Examples: traditional VLANs, VXLAN on Nexus 1000v, VXLAN on VMware vCNS, VMware NSX,
Nuage Networks Virtual Services Platform, OpenStack Open vSwitch Neutron plugin.
Other solutions perform layer-3 forwarding at the first hop (vNIC-to-vSwitch boundary),
implementing a pure layer-3 network.
Examples: Hyper-V Network Virtualization, Juniper Contrail, Amazon VPC.

LAYER-2 NETWORKS WITH LAYER-3 FORWARDING


Every layer-2 virtual networking solution allows you to implement layer-3 forwarding on top of pure
layer-2 segments with a multi-NIC VM.
Some virtual networking solutions provide centralized built-in layer-3 gateways (routers) that you
can use to connect layer-2 segments.
Examples: inter-VLAN routing, VMware NSX, OpenStack
Other layer-2 solutions provide distributed routing: the same default gateway IP and MAC address
are present in every first-hop switch, resulting in optimal end-to-end traffic flow.
Examples: Cisco DFA, Arista VARP, Juniper QFabric, VMware NSX, Nuage VSP, Distributed layer-3
forwarding in OpenStack Icehouse release.

LAYER-3 NETWORKS AND DYNAMIC IP ADDRESSES


Some layer-3 virtual networking solutions assign static IP addresses to end hosts. The end-to-end
layer-3 forwarding is determined by the orchestration system.
Example: Amazon VPC
Other layer-3 virtual networking solutions allow dynamic IP addresses (example: customer DHCP
server) or IP address migration between cluster members.
Examples: Hyper-V network virtualization in Windows Server 2012 R2, Juniper Contrail

Finally, there are layer-3 solutions that fall back to layer-2 forwarding when they cannot route the
packet (example: non-IP protocols).
Example: Juniper Contrail

WHY DOES IT MATTER?


In a nutshell: the further away from bridging a solution is, the more scalable it is from the
architectural perspective (there's always an odd chance of having a clumsy implementation of a great
architecture). No wonder Amazon VPC and Hyper-V network virtualization (also used within the
Azure cloud) lean so far toward pure layer-3 forwarding.

After establishing the basics, let's see how a customer (VM-generated) packet traverses a typical
overlay virtual networking implementation.
The blog post was written in August 2013 and updated to reflect the changes introduced in recent
releases of overlay virtual networking products.

A DAY IN A LIFE OF AN OVERLAID VIRTUAL PACKET


Overlay virtual networking products that implement layer-2 segments (Cisco Nexus 1000V, VMware
vShield, VMware NSX) don't change the intra-hypervisor network behavior: a virtual machine
network interface card (VM NIC) is still connected to a layer-2 hypervisor switch. The magic happens
between the internal layer-2 switch and the physical (server) NIC.

Figure 2-1: Sample network with two hypervisor hosts connected to an overlay virtual networking segment

The diagrams were taken from the VXLAN course and thus use VXLAN terminology. Hyper-V
uses similar concepts with slightly different acronyms and a different encapsulation format.
The TCP/IP stack (or any other network-related software working with the VM NIC driver) is totally
oblivious to its virtual environment: it looks like the VM NIC is connected to a real Ethernet
segment, and so when the TCP/IP stack needs to send a packet, it sends a full-fledged L2 frame
(including source and destination VM MAC address) to the VM NIC.

Figure 2-2: Virtual machine sends a regular Ethernet frame

The first obvious question you should ask is: how does the VM know the MAC address of the other
VM? Since the VM TCP/IP stack thinks the VM NIC connects it to an Ethernet segment, it uses ARP to
get the MAC address of the other VM.

Second question: how does the ARP request get to the other VM? Please allow me to handwave over
this tiny little detail for the moment; BUM (Broadcast, Unknown Unicast, Multicast) flooding is a
topic for another blog post.
Layer-3-only products like Hyper-V network virtualization in Windows Server 2012 R2,
Amazon VPC or Juniper Contrail use different mechanisms. The Hyper-V network
virtualization behavior is described in the Overlay Virtual Networking Product Details
chapter.
Now let's focus on what happens with the layer-2 frame sent through the VM NIC once it hits the
soft switch. If the destination MAC address belongs to a VM residing in the same hypervisor, the
frame gets delivered to the destination VM (even Hyper-V does layer-2 forwarding within the
hypervisor, as does Nicira's NVP unless you've configured private VLANs).
If the destination MAC address doesn't belong to a local VM, the layer-2 forwarding code sends the
layer-2 frame toward the physical NIC ... and the frame gets intercepted on its way toward the real
world by an overlay virtual networking module (VXLAN, NVGRE, GRE or STT
encapsulation/decapsulation module).
The overlay virtual networking module uses the destination MAC address to find the IP address of
the target hypervisor, encapsulates the virtual layer-2 frame into a VXLAN/(NV)GRE/STT envelope
and sends the resulting IP packet toward the physical NIC (with the added complexity of vKernel
NICs in vSphere environments).
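
A minimal sketch of that forwarding decision (made-up table contents; shipping products keep this state in kernel data structures, and the helper functions below are placeholders for the real delivery, encapsulation and flooding mechanisms) might look like this:

# Per-segment state on one hypervisor (illustrative values only)
LOCAL_VMS = {"00:50:56:01:01:01"}                   # MACs of VMs on this host
MAC_TO_VTEP = {"00:50:56:02:02:02": "192.0.2.12"}   # remote VM MAC -> hypervisor IP

def deliver_locally(dst_mac, frame):                # placeholder: intra-host delivery
    print(f"deliver to local VM {dst_mac}")

def send_encapsulated(frame, vni, vtep_ip):         # placeholder: VXLAN/GRE/STT encap
    print(f"encapsulate into VNI {vni}, send to VTEP {vtep_ip}")

def flood(frame, vni):                              # placeholder: BUM handling
    print(f"flood within VNI {vni}")

def forward(dst_mac: str, frame: bytes, vni: int) -> None:
    if dst_mac in LOCAL_VMS:
        deliver_locally(dst_mac, frame)             # never leaves the hypervisor
    elif dst_mac in MAC_TO_VTEP:
        send_encapsulated(frame, vni, MAC_TO_VTEP[dst_mac])
    else:
        flood(frame, vni)                           # IP multicast, head-end replication
                                                    # or controller lookup, per product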

Figure 2-3: Source hypervisor adds overlay virtual networking header and IP header

Glad you asked the third question: how does the overlay networking module know the IP address of
the target hypervisor? That's the crux of the problem and the main difference between VXLAN and
Hyper-V/NVP. It's clearly a topic for yet another blog post (and here's what I wrote about this
problem a while ago). For the moment, let's just assume it does know what to do.
The physical network (which has to provide nothing more than simple IP transport) eventually
delivers the encapsulated layer-2 frame to the target hypervisor, which uses standard TCP/IP
mechanisms (match on IP protocol for GRE, destination UDP port for VXLAN and destination TCP
port for STT) to deliver the encapsulated layer-2 frame to the target overlay networking module.
Things are a bit more complex: in most cases you'd want to catch the encapsulated traffic
somewhere within the hypervisor kernel to minimize the performance hit (each trip through
the userland costs you extra CPU cycles), but you get the idea.

Figure 2-4: Target hypervisor receives an IP packet with VM-level Ethernet payload

Last step: the target overlay networking module strips the envelope and delivers the raw layer-2
frame to the layer-2 hypervisor switch, which then uses the destination MAC address to send the
frame to the target VM NIC.
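
The receive side is just as simple. The sketch below (again based on the RFC 7348 header layout, not on any product's code) shows how the payload of a UDP packet arriving on port 4789 is split back into the VNI and the original VM-level Ethernet frame:

import struct

def vxlan_decapsulate(udp_payload: bytes):
    """Return (vni, inner_ethernet_frame) from a received VXLAN UDP payload."""
    flags_word, vni_word = struct.unpack("!II", udp_payload[:8])
    if not flags_word & (0x08 << 24):       # the "VNI present" flag must be set
        raise ValueError("not a valid VXLAN packet")
    vni = vni_word >> 8                     # upper 24 bits carry the segment ID
    return vni, udp_payload[8:]             # rest is the original VM-generated frame

# The destination MAC address of the inner frame then selects the target VM
# within that segment, exactly as if the frame had arrived on a local VLAN.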

Figure 2-5: Original Ethernet frame is delivered to the target VM

Summary: major overlay virtual networking implementations are essentially identical when it
comes to frame forwarding mechanisms. The encapsulation wars are thus stupid, with the sole
exception of TCP/IP offload, and some vendors have already started talking about multi-encapsulation support.

The previous blog post described the layer-2 forwarding process in overlay virtual networks, and
ignored the question of how the hypervisor knows where to send the packet. This blog post
(written in August 2013) goes into more detail.

CONTROL PLANE PROTOCOLS IN OVERLAY NETWORKS


Multiple overlay network encapsulations are nothing more than a major inconvenience (and the religious
wars based on individual bit fields are close to meaningless) for anyone trying to support more than one
overlay virtual networking technology (just ask F5 or Arista).
The key differentiator between scalable and not-so-very-scalable architectures and technologies is
the control plane: the mechanism that maps (at the very minimum) a remote VM MAC address into a
transport network IP address of the target hypervisor (see A Day in a Life of an Overlaid Virtual
Packet for more details).
Overlay virtual networking vendors chose a plethora of solutions, ranging from Ethernet-like
dynamic MAC address learning to complex protocols like MP-BGP. Here's an overview of what they're
doing:
The original VXLAN as implemented by Cisco's Nexus 1000V, VMware's vCNS release 5.1, Arista
EOS, and F5 BIG-IP TMOS release 11.4 has no control plane. It relies on transport network IP
multicast to flood BUM traffic and uses Ethernet-like MAC address learning to build mappings between
virtual network MAC addresses and transport network IP addresses.
Unicast VXLAN as implemented in Cisco's Nexus 1000V release 4.2(1)SV2(2.1) has something that
resembles a control plane. VSM distributes segment-to-VTEP mappings to VEMs to replace IP

multicast with headend unicast replication, but the VEMs still use dynamic MAC learning (more about
VSM and VEM).
VXLAN MAC distribution mode is a proper control plane implementation in which the VSM
distributes VM-MAC-to-VTEP-IP information to VEMs. Unfortunately it seems to be based on a
proprietary protocol, so it won't work with hardware gateways from Arista or F5.
Hyper-V Network Virtualization uses PowerShell cmdlets to configure VM-MAC-to-transport-IP
mappings, virtual network ARP tables and virtual network IP routing tables. The same cmdlets can
be implemented by hardware vendors to configure NVGRE gateways.
Nicira NVP (part of VMware NSX) uses OpenFlow to install forwarding entries in the hypervisor
switches and Open vSwitch Database Management Protocol to configure the hypervisor switches.
NVP uses OpenFlow to implement L2 forwarding and VM NIC reflexive ACLs (L3 forwarding uses
another agent in every hypervisor host).
Midokura Midonet doesn't have a central controller or control-plane protocols. Midonet agents
residing in individual hypervisors use shared database to store control- and data-plane state.
Contrail (now Juniper JunosV Contrail) uses MP-BGP to pass MPLS/VPN information between
controllers and XMPP to connect hypervisor switches to the controllers.
IBM SDN Virtual Edition uses a hierarchy of controllers and appliances to implement an NVP-like
control plane for L2 and L3 forwarding using VXLAN encapsulation. I wasn't able to figure out what
protocols they use from their whitepapers and user guides.
Nuage Networks is using MP-BGP to exchange L3VPN or EVPN prefixes with the external devices,
and OpenFlow with extensions between the controller and hypervisor switches.

This blog post (written in late 2011 before Nicira launched its NVP platform, which later became
VMware NSX) describes the challenges of multicast-based VXLAN and outlines an alternate
approach using controller-based distribution of forwarding information.
As it happens, the alternate approach pretty accurately described what Nicira launched a few
months later in its NVP product (and the ARP handling discussion is still relevant in 2014).

VXLAN, IP MULTICAST, OPENFLOW AND CONTROL PLANES
A few days ago I had the privilege of being part of a VXLAN-related tweetfest with @bradhedlund,
@scott_lowe, @cloudtoad, @JuanLage, @trumanboyes (and probably a few others) and decided to
write a blog post explaining the problems VXLAN faces due to the lack of a control plane, how it uses IP
multicast to solve that shortcoming, and how OpenFlow could be used in an alternate architecture to
solve those same problems.

MAC-TO-VTEP MAPPING PROBLEM IN MAC-OVER-IP ENCAPSULATIONS


As long as the vSwitches remained simple layer-2 devices and pretended IP didn't exist, their life
was simple. A vSwitch would send VM-generated layer-2 (MAC) frames straight through one of the
uplinks (potentially applying a VLAN tag somewhere along the way), hoping that the physical
network is smart enough to sort out where the packet should go based on the destination MAC address.

Some people started building networks that were larger than what's reasonable for a single L2
broadcast domain, and after figuring out all the network-based kludges don't work and/or scale too
well, Cisco (Nexus 1000V) and Nicira (Open vSwitch) decided to bite the bullet and implement MAC-over-IP encapsulation in the vSwitch.
In both cases, the vSwitch takes L2 frames generated by VMs attached to it, wraps them in
protocol-dependent envelopes (VXLAN-over-UDP or GRE), attaches an IP header in front of those
envelopes ... and faces a crucial question: what should the destination IP address (Virtual Tunnel
End Point, or VTEP, in VXLAN terms) be? Like any other overlay technology, a MAC-over-IP vSwitch
needs a virtual-to-physical mapping table (in this particular case, a VM-MAC-to-host-IP mapping table).

Figure 2-6: Multicast-based VXLAN architecture

As always, there are two ways to approach such a problem:

Solve the problem within your architecture, using whatever control-plane protocol comes handy
(either reusing existing ones or inventing a new protocol);

Try to make it someone else's problem (most often, the network becomes the solution-of-last-resort).
Nicira's Network Virtualization Platform (NVP) seems to be solving the problem using OpenFlow as
the control-plane protocol; VXLAN offloads the problem to the network.

VXLAN: FLOODING OVER IP MULTICAST


The current VXLAN draft is very explicit: VXLAN has no control plane. There is no out-of-band
mechanism that a VXLAN host could use to discover other hosts participating in the same VXLAN
segment, or MAC addresses of VMs attached to a VXLAN segment.
VXLAN is a very simple technology and uses existing layer-2 mechanisms (flooding and dynamic
MAC learning) to discover remote MAC addresses and MAC-to-VTEP mappings, and IP multicast to
reduce the scope of the L2-over-UDP flooding to those hosts that expressed explicit interest in the
VXLAN frames.
Ideally, you'd map every VXLAN segment (or VNI, VXLAN Network Identifier) into a separate IP
multicast address, limiting the L2 flooding to those hosts that have VMs participating in the same
VXLAN segment. In a large-scale reality, you'll probably have to map multiple VXLAN segments into
a single IP multicast address due to the low number of IP multicast entries supported by typical data
center switches.
According to the VXLAN draft, the VNI-to-IPMC mapping remains a management plane decision.
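
A typical management-plane implementation of that mapping is a simple wrap-around allocation from a configured pool of multicast groups, along these lines (made-up group addresses and pool size; vShield Manager's exact allocation logic is not documented here):

import ipaddress

POOL_START = ipaddress.ip_address("239.1.1.1")   # first group in the configured pool
POOL_SIZE = 16                                   # limited by core switch multicast capacity

def vni_to_group(vni: int) -> str:
    """Map a VXLAN segment ID to a flooding group, wrapping around the pool."""
    return str(ipaddress.ip_address(int(POOL_START) + vni % POOL_SIZE))

# With more segments than groups, unrelated segments share a flooding scope
print(vni_to_group(5001), vni_to_group(5017))    # both land on the same group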

VXLAN AND IP MULTICAST: SHORT SUMMARY

VXLAN depends on IP multicast to discover MAC-to-VTEP mappings and thus cannot work
without an IP-multicast-enabled core;

You cannot implement broadcast reduction features in VXLAN (ARP proxy); they would interfere
with MAC-to-VTEP learning;

A VXLAN segment behaves exactly like a regular L2 segment (including unknown unicast flooding);
you can use it to implement every stupidity ever developed (including Microsoft's NLB in unicast
mode);

In case you do need multicast over a VXLAN segment (including IP multicast), you could use the
physical IP multicast in the network core to provide optimal packet flooding.

The IP multicast tables in the core switches will probably explode if you decide to go from shared
trees to source-based trees in a large-scale VXLAN deployment.

OPENFLOW: A POTENTIAL CONTROL PLANE FOR MAC-OVER-IP VIRTUAL NETWORKS
It's perfectly possible to distribute the MAC-to-VTEP mappings with a control-plane protocol. You
could use a new BGP address family (I'm not saying it would be fast), L2 extensions for IS-IS (I'm
not saying it would scale), a custom-developed protocol, or an existing network- or server-programming solution like OpenFlow or XMPP.
Nicira seems to be going down the OpenFlow path. Open vSwitch uses P2P GRE tunnels between
hypervisor hosts with the GRE tunnel key used to indicate virtual segments (similar to the NVGRE draft).

You can't provision new interfaces with OpenFlow, so Open vSwitch depends on yet another daemon
(OVSDB) to create on-demand GRE tunnels; after the tunnels are provisioned, OpenFlow can be
used to install MAC-to-tunnel forwarding rules.
You could use OVS without OpenFlow: create P2P GRE tunnels, and use VLAN encapsulation and
dynamic MAC learning over them for a truly nightmarish non-scalable solution.

OPEN VSWITCH GRE TUNNELS WITH OPENFLOW: SHORT SUMMARY

Tunnels between OVS hosts are provisioned with OVSDB;

OVS cannot use IP multicast: flooded L2 packets are always replicated at the head-end host;

If you want OVS to scale, you have to install MAC-to-VTEP mappings through a control-plane
protocol. OpenFlow is a good fit as it's already supported by OVS.

Once an OpenFlow controller enters the picture, you're limited only by your imagination (and the
amount of work you're willing to invest):

You could intercept all ARP packets and implement an ARP proxy in the OpenFlow controller (a minimal sketch follows this list);

After implementing ARP proxy you could stop all other flooding in the layer-2 segments for a
truly scalable Amazon-like solution;

You could intercept IGMP joins and install L2 multicast or IP multicast forwarding tables in OVS.
The multicast forwarding would still be suboptimal due to P2P GRE tunnels: the head-end host
would do packet replication.

You could go a step further and implement full L3 switching in OVS based on destination IP
address matching rules.


In early 2013 I wrote another blog post that described the scalability challenges of multicast-based
VXLAN in even more detail.

VXLAN SCALABILITY CHALLENGES


VXLAN, one of the first MAC-over-IP (overlay) virtual networking solutions, is definitely a major
improvement over traditional VLAN-based virtual networking technologies, but not without its own
scalability limitations.

IMPLEMENTATION ISSUES
VXLAN was first implemented in Nexus 1000V, which presents itself as a Virtual Distributed Switch
(vDS) to VMware vCenter. A single Nexus 1000V instance cannot have more than 64 VEMs (vSphere
kernel modules), limiting the Nexus 1000V domain to 64 hosts (or approximately two racks of UCS
blade servers).
It's definitely possible to configure the same VXLAN VNI and IP multicast address on different Nexus
1000V switches (either manually or using vShield Manager), but you cannot vMotion a VM out of the
vDS (that Nexus 1000V presents to vCenter).
VXLAN on Nexus 1000V is thus a great technology if you want to implement HA/DRS clusters spread
across multiple racks or rows (you can do it without configuring end-to-end bridging), but falls way
short of the "deploy any VM anywhere in the data center" holy grail.


VXLAN is also available in VMware's vDS switch ... but can only be managed through vShield
Manager. vDS can span 500 hosts (the vMotion domain is ~8 times bigger than if you use Nexus
1000V), and supposedly vShield Manager configures VXLAN segments across multiple vDS (using
the same VXLAN VNI and IP multicast address on all of them).

IP MULTICAST SCALABILITY ISSUES


VXLAN floods layer-2 frames using IP multicast (Cisco has demonstrated unicast-only VXLAN but
there's nothing I could touch on their web site yet), and you can either manually associate an IP
multicast address with a VXLAN segment, or let vShield Manager do it automatically (using IP
multicast addresses from a single configurable pool).
Cisco launched unicast VXLAN in June 2013.

The number of IP multicast groups (together with the size of the network) obviously influences the
overall VXLAN scalability. Here are a few examples:
- One or a few multicast groups for a single Nexus 1000V instance. Acceptable if you don't need more than 64 hosts. Flooding wouldn't be too bad (not many people would put more than a few thousand VMs on 64 hosts) and the core network would have a reasonably small number of (S/*,G) entries (even with source-based trees the number of entries would be linearly proportional to the number of vSphere hosts).


- Many virtual segments in a large network with a few multicast groups. This would make VXLAN as scalable as vCDNI. Numerous virtual segments (and consequently numerous virtual machines) would map into a single IP multicast address (vShield Manager uses a simple wrap-around IP multicast address allocation mechanism), and vSphere hosts would receive flooded packets for irrelevant segments.
- Use a per-VNI multicast group. This approach would result in minimal excessive flooding but generate a large number of (S,G) entries in the network.

The size of the multicast routing table would obviously depend on the number of hosts, the number of
VXLAN segments, and the PIM configuration - do you use shared trees or switch to source trees as soon
as possible - and keep in mind that the Nexus 7000 doesn't support more than 32000 multicast entries
and Arista's 7500 cannot have more than 4000 multicast routes on a linecard.

RULES-OF-THUMB
VXLAN has no flooding reduction/suppression mechanisms, so the rules-of-thumb from RFC 5556
still apply: a single broadcast domain should have around 1000 end-hosts. In VXLAN terms, that's
around 1000 VMs per IP multicast address.
However, it might be simpler to take another approach: use shared multicast trees (and hope the
amount of flooded traffic is negligible), and assign anywhere between 75% and 90% of the (lowest) IP
multicast table size on your data center switches to VXLAN. Due to vShield Manager's wrap-around
multicast address allocation policy, the multicast traffic should be well-distributed across the
whole allocated address range.
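If you want to play with the numbers, here's the same rule-of-thumb as a tiny Python sketch (the table size and VM count are illustrative assumptions, not vendor data-sheet values):

```python
# Back-of-the-napkin sizing: give VXLAN most of the smallest multicast table
# in the transport path and sanity-check the ~1000 VMs-per-group rule of thumb.
smallest_mcast_table = 4000                      # assumed lowest multicast-entry capacity
vxlan_groups = int(smallest_mcast_table * 0.8)   # 75-90% of it dedicated to VXLAN
total_vms = 50000                                # assumed VM count

vms_per_group = total_vms / vxlan_groups
print(f"{vxlan_groups} groups, roughly {vms_per_group:.0f} VMs per group")
# If the result climbs well past ~1000, hosts receive too much irrelevant flooding.
```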


How does a vSphere host using multicast VXLAN interact with the physical network? Does it have to
use IGMP or PIM? This blog post describes the details.

IGMP AND PIM IN MULTICAST VXLAN TRANSPORT NETWORKS
Got a really interesting question from Michael Haines: "When and how does VXLAN use IGMP and
PIM in transport (underlay) networks?"
Obviously you need IGMP and PIM in multicast environments only (vCNS 5.x, Nexus 1000V
in multicast mode).

IGMP is used by the ESXi hosts to tell the first-hop routers (in case you run VXLAN across multiple
subnets) that they want to participate in a particular multicast group, so the subnet in which they
reside gets added to the distribution tree.
PIM is used between routers to figure out what the IP multicast flooding tree should look like.


DO WE HAVE TO USE IGMP SNOOPING?


Some VXLAN design documents talk about IGMP snooping - here's why that feature might be
relevant in your data center network.
IP multicast datagrams are sent as MAC frames with multicast destination MAC addresses. These
frames are flooded by dumb L2 switches, resulting in wasted network bandwidth and server CPU
cycles.
IGMP snooping gives some L3 smarts to L2 switches - they don't flood IP multicast frames out all
ports, but only on ports from which they've received corresponding IGMP joins. You might decide
not to care in a small data center network with tens of servers; IGMP snooping will definitely help in
large (hundreds of servers) deployments.
Finally, if you want to use IGMP snooping in an L2-only environment (all VXLAN hosts in the same IP
subnet), you need a node that pretends it's a router and sends out IGMP queries, or the L2 switches
have nothing to snoop.


The lack of a control plane in multicast-based VXLAN reintroduces numerous problems that were already
solved in VLAN-based products. For example, it's impossible to implement private VLANs over
VXLAN.

PVLAN, VXLAN AND CLOUD APPLICATION ARCHITECTURES
Aldrin Isaac made a great comment to my "Could MPLS-over-IP replace VXLAN?" article:
As far as I understand, VXLAN, NVGRE and any tunneling protocol that use global ID in
the data plane cannot support PVLAN functionality.
He's absolutely right, but you shouldn't try to shoehorn VXLAN into existing deployment models. To
understand why that doesn't make sense, we have to focus on the typical cloud application
architectures.
To be more precise, any tunneling protocol that uses a global ID in the data plane and uses
flooding to compensate for the lack of a control plane cannot support PVLAN. VMware NSX
for multiple hypervisors has port isolation (which is equivalent to a simple PVLAN); they
could do it because the NSX controller(s) download all MAC-to-IP mappings and MAC
forwarding entries into the hypervisor switches.


SMB LAMP STACK


Numerous service providers that were previously offering simple web hosting are now selling
cloud-washed VM-based services (example: the hosting company I use for one of my private web
sites is now offering Virtual Private Servers). The deployment model is simple: you get a single
Linux (or Windows) server with Internet connectivity (hopefully firewalled to stop the script kiddies),
and you'd usually install the LAMP stack or something similar on that server (LAMP = Linux, Apache,
MySQL, PHP/Perl/Python). Sometimes the service provider offers a hosted database service promising
redundancy and backup.

Figure 2-7: Typical LAMP stack


In this environment, each tenant gets a single VM that has Internet connectivity and (optionally)
access to some central services. VMs are not supposed to communicate with each other (even
though you might buy more than one).
PVLAN is the perfect infrastructure solution for this environment - deploy a PVLAN in each compute
pod (whatever that might be - usually a few racks), and use IP routing between pods. You can still
use vMotion and HA/DRS within a pod, so you can move the customer VMs when you want to
perform maintenance on individual pod components.
Evacuating a whole pod is a bit more complex, but then (hopefully) you won't be doing that every
other day. If you really want to have this capability (because restarting customer VMs every now
and then is not an option), develop a migration process where you temporarily provision a PVLAN
between two pods, move the VMs and shut down the temporary inter-pod L2 connection, thus
minimizing the risk of having a large-scale VLAN across multiple pods.
Summary: you don't need VXLAN if you're selling individual VMs. PVLANs work just fine.

SCALE-OUT APPLICATION ARCHITECTURE


Modern scale-out application architectures are way more complex than a simple non-redundant
LAMP stack. You'd have numerous web servers sitting behind a load balancer, you might be using
web caches (example: Varnish) or offload servers (example: FastCGI), message queues, batch
worker processes, cache daemons, and a bunch of database servers. As @devops_borat wrote:


The servers you'd use in a scale-out application usually belong to different security zones, so you'd
want to use firewalls between them. You might also need load balancing between tiers (some
programmers don't grasp the importance of having redundant database connections ... or the
stupidity of having hard-coded database connection information), and the servers within a tier might
have to communicate with each other (example: database servers or web caches and web servers).
In short, you need multiple isolated virtual network segments with firewalls and load balancers
sitting between the segments and between the web server(s) and the outside world.

Figure 2-8: Simplified scale-out application architecture


VXLAN, NVGRE or NVP, combined with virtual appliances, are an ideal solution for this type of
application architecture. Trying to implement these architectures with PVLANs would result in a
total spaghetti mess of isolated and community VLANs with multiple secondary VLANs per tenant.

SUMMARY
MAC-over-IP virtual networking solutions are not a panacea. They cannot replace some of the
traditional isolation constructs (PVLAN), but then they were not designed to do that. Their primary
use case is an Amazon VPC-like environment with numerous isolated virtual networks per tenant.


Regardless of what you develop, someone is quick to ask "how can I run IP multicast over this?"
Here's the VXLAN answer to that question:

VM-LEVEL IP MULTICAST OVER VXLAN


Dumlu Timuralp (@dumlutimuralp) sent me an excellent question:
I always get confused when thinking about IP multicast traffic over VXLAN tunnels. Since
VXLAN already uses a Multicast Group for layer-2 flooding, I guess all VTEPs would have
to receive the multicast traffic from a VM, as it appears as L2 multicast. Am I missing
something?
Short answer: no, you're absolutely right. IP multicast over VXLAN is clearly suboptimal.
In the good old days when the hypervisor switches were truly dumb and used simple VLAN-based
layer-2 switching, you could control the propagation of IP multicast traffic by deploying IGMP
snooping on layer-2 switches (or, if you had Nexus 1000V, you could configure IGMP snooping
directly on the hypervisor switch).
Those days are gone (finally), but the brave new world still lacks a few features. No ToR switches
are currently capable of digging into the VXLAN payload to find IGMP queries and joins, and it's
questionable whether Nexus 1000V can do IGMP snooping over VXLAN (IGMP snooping on Nexus
1000V is configured on VLANs).


End result: IP multicast running across a VXLAN segment will be delivered to all VMs in the same
segment. Both hypervisor switches and VMs will have to spend CPU cycles to process unwanted
multicast packets.
Hyper-V network virtualization can map individual customer multicast groups to provider
(transport) multicast groups, resulting in way more optimal behavior.


VXLAN runs over UDP, but we know UDP is not a reliable transport protocol. Does that matter, and
do we need a reliable transport protocol to implement virtual networks?

VXLAN RUNS OVER UDP - DOES IT MATTER?


Scott Lowe asked a very good question in his Technology Short Take #20:
VXLAN uses UDP for its encapsulation. What about dropped packets, lack of sequencing,
etc., that is possible with UDP? What impact is that going to have on the inner protocol
that's wrapped inside the VXLAN UDP packets? Or is this not an issue in modern
networks any longer?
Short answer: No problem.
Somewhat longer one: VXLAN emulates an Ethernet broadcast domain, which is not reliable anyway.
Any layer-2 device (usually known as a switch, although a bridge would be more correct) can drop
frames due to buffer overflows or other forwarding problems, or the frames could become corrupted
in transit (although the drops in switches are way more common in modern networks).
UDP packet reordering is usually not a problem - packet/frame reordering is a well-known challenge
and all forwarding devices take care not to reorder packets within a layer-4 (TCP or UDP) session.
The only way to introduce packet reordering is to configure per-packet load balancing somewhere in
the path (hint: don't do that).


Brocade uses very clever tricks to retain proper order of packets while doing per-packet
load balancing across intra-fabric links.

Using UDP to transport Ethernet frames thus doesn't break the expected behavior. Things might get
hairy if you'd extend VXLAN across truly unreliable links with high error rate, but even then VXLAN-over-UDP wouldn't perform any worse than other L2 extensions (for example, VPLS or OTV) or any
other tunneling techniques. None of them uses a reliable transport mechanism.

GETTING ACADEMIC
Running TCP over TCP (which would happen in the end if one would want to run VXLAN over TCP) is
a really bad idea. This paper describes some of the nitty-gritty details, or you could just google for
TCP-over-TCP.
Some history: The last protocol stacks that had reliable layer-2 transport were SNA and X.25. SDLC
or LAPB (for WAN links) and LLC2 (for LAN connections) were truly reliable - LLC2 peers
acknowledged every L2 packet ... but even LLC2 was running across Token Ring or Ethernet bridges
that were never truly reliable. We used reliable SNA-over-TCP/IP WAN transport (RSRB and later
DLSw+) simply because the higher error rates experienced on WAN links (transmission errors and
packet drops) caused LLC2 performance problems if we used plain source-route bridging.


And finally, a storage digression: Some people think Fibre Channel (FC) offers reliable transport. It
doesn't ... it just tries to minimize the packet loss by over-provisioning every device in the path
because its primary application (SCSI) lacks fast retransmission/recovery mechanisms. We use FCIP
(FC-over-TCP) on WAN links to reduce the packet drop rate, not to retain the end-to-end reliable
transport.


My initial thoughts on NVGRE were vitriolic - do we really need another encapsulation standard just
to be different? The details of the Hyper-V control plane were still a year off (the blog post was written
in September 2011), so I looked only at the encapsulation (and I still don't understand why
Microsoft used NVGRE instead of VXLAN).
The blog post also describes why it makes more sense to transport virtual network traffic over UDP
than over GRE.

NVGRE - BECAUSE ONE STANDARD JUST WOULDN'T BE ENOUGH
Two weeks after VXLAN (backed by VMware, Cisco, Citrix and Red Hat) was launched at VMworld,
Microsoft, Intel, HP & Dell published the NVGRE draft (Arista and Broadcom are cleverly sitting on both
chairs), which solves the same problem in a slightly different way.
If you're still wondering why we need VXLAN and NVGRE, read my VXLAN post (and the one
describing how VXLAN, OTV and LISP fit together), watch the Introduction to Virtual
Networking webinar recording or read the Introduction section of the NVGRE draft.
It's obvious the NVGRE draft was a rushed affair; its only significant and original contribution to
knowledge is the idea of using the lower 24 bits of the GRE key field to indicate the Tenant Network
Identifier (but then, lesser ideas have been patented time and again). Like with VXLAN, most of the
real problems are handwaved to other or future drafts.


The way to obtain remote VM MAC to physical IP mapping "will be covered in a different draft"
(section 3.1). VXLAN specifies the use of IP multicast to flood within the virtual segment and relies
on dynamic MAC learning.
The NVGRE approach is actually more scalable than the VXLAN one because it does not mandate the
use of flooding-based MAC address learning. Even more, NVGRE acknowledges that there might be
virtual L2 networks that will not use flooding at all (like Amazon EC2).
Mapping between TNIs (virtual segments) and IP multicast addresses "will be specified in a
future version of this draft." VXLAN solves the problem by delegating it to the management layer.
IP fragmentation (due to oversized VXLAN/NVGRE frames). The NVGRE draft at least
acknowledges the problem and indicates that a future version might use Path MTU Discovery to
detect end-to-end MTU size and reduce the intra-virtual-network MTU size for IP packets.
VXLAN ignores the problem and relies on jumbo frames. This might be a reasonable approach
assuming VXLAN would stay within a Data Center (keep dreaming, vendors involved in VXLAN are
already peddling long-distance VXLAN snake oil).
ECMP-based load balancing is the only difference between NVGRE and VXLAN worth mentioning.
VXLAN uses UDP encapsulation and pseudo-random values in the UDP source port (computed by
hashing parts of the inner MAC frame), resulting in automatic equal-cost load balancing in every
device that uses the 5-tuple to load balance.
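Here's a minimal Python sketch of that trick: hash a few bytes of the inner frame into the dynamic port range and use the result as the outer UDP source port (the exact hash function and field selection are illustrative; real implementations choose their own):

```python
# Derive the outer UDP source port from the inner frame so that 5-tuple ECMP
# hashing spreads VXLAN flows across equal-cost paths.
import zlib

def vxlan_udp_source_port(inner_frame: bytes) -> int:
    flow_hash = zlib.crc32(inner_frame[:34])      # roughly the inner MAC + IP headers
    return 49152 + (flow_hash % 16384)            # stay in the dynamic port range

print(vxlan_udp_source_port(b"\x00\x50\x56\x01\x02\x03"
                            b"\x00\x50\x56\x04\x05\x06"
                            b"rest of the inner packet"))
```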
GRE is harder to load balance, so the NVGRE draft proposes an interim solution using multiple IP
addresses per endpoint (hypervisor host) with no details on the inter-VM-flow-to-endpoint-IP-address mapping. The final solution?


The Key field may provide additional entropy to the switches to exploit path diversity
inside the network. One such example could be to use the upper 8 bits of the Key field to
add flow based entropy and tag all the packets from a flow with an entropy label.
OK, might even work. But do the switches support it? Oh, don't worry...
A diverse ecosystem play is expected to emerge as more and more devices become
multitenancy aware.
I know they had to do something different from VXLAN (creating another UDP-based scheme and
swapping two fields in the header would be a too-obvious me-too attempt), but wishful thinking like
this usually belongs to another type of RFC.

SUMMARY
Two (or more) standards solving a single problem seems to be the industry norm these days. I'm
sick and tired of the obvious me-too/I'm-different/look-who's-innovating ploys. Making matters
worse, VXLAN and NVGRE are half-baked affairs today.
VXLAN has no control plane and relies on IP multicast and flooding to solve MAC address learning
issues, making it less suitable for very large scale or inter-DC deployments.
NVGRE has the potential to be a truly scalable solution: it acknowledges there might be a need for
networks not using flooding, and at least mentions the MTU issues, but it has a long way to go
before getting there. In its current state, it's worse than VXLAN because it's way more
underspecified.


A few months after VXLAN and NVGRE were introduced, Nicira described yet another encapsulation
standard optimized to use the capabilities of existing Ethernet NICs.

DO WE REALLY NEED STATELESS TRANSPORT TUNNELING (STT)
The first question everyone asked after Nicira had published yet another MAC-over-IP tunneling
draft was probably "do we really need yet another encapsulation scheme? Aren't VXLAN or NVGRE
enough?" Bruce Davie tried to answer that question in his blog post (and provided more details in
another one), and I'll try to make the answer a bit more graphical.
The three drafts (VXLAN, NVGRE and STT) have the same goal: provide emulated layer-2 Ethernet
networks over scalable IP infrastructure. The main difference between them is the encapsulation
format and their approach to the control plane:

- VXLAN ignores the control plane problem and relies on flooding emulated with IP multicast;
- NVGRE authors handwave over the control plane issue ("the way to obtain [MAC-to-IP mapping] information is not covered in this document");
- STT authors claim the draft describes just the encapsulation format.

Everything else being equal, why does STT make sense at all? The answer lies in the capabilities of
modern server NICs.


TCP SEGMENTATION OFFLOAD


Applications using TCP (for example, a web server) are not aware of the intricacies of TCP (window
size, maximum segment size, retransmissions) and perceive a TCP connection as a reliable byte
stream. Applications send streams of bytes to an open socket and the operating system's TCP/IP
stack slices and dices the data into individual TCP+IP packets, prepends the MAC header (built from the
ARP cache) in front of the IP header, and sends the L2 frames to the Network Interface Card (NIC)
for transmission.

Figure 2-9: Segmentation performed within the TCP stack

Modern NICs allow the TCP stacks to offload some of the heavy lifting to the hardware - most
commonly the segmentation and reassembly (retransmissions are still performed in software). A


TCP stack using a Large Segment Offload (LSO)-capable NIC would send a single jumbo MAC frame
to the NIC and the NIC would slice it into properly sized TCP segments (increasing byte counts and
computing IP+TCP checksums while doing that).

Figure 2-10: Segmentation performed by VM NIC

LSO significantly increases the TCP performance. If you don't believe me (and you shouldn't), run
iperf tests on your server with TCP offload turned on and off (and report your results in a comment).


MAC-OVER-IP KILLS TCP OFFLOAD


Typical NICs can segment TCP-IP-MAC frames. They cannot segment TCP-IP-MAC-VXLAN-UDP-IP-MAC frames (or TCP-IP-MAC-NVGRE-IP-MAC frames). Sending L2 frames over VXLAN or NVGRE
incapacitates TCP offload on most server NICs available today (I didn't want to write "all" - if you're
aware of a NIC that could actually handle IP-over-MAC-over-GRE encapsulation, please write a
comment). Does that matter? Do the tests I suggested in the previous paragraph to figure out
whether it matters to you.

STT - A CLEVER TCP OFFLOAD HACK


STT uses a header that looks just like the TCP header to the NIC. The NIC is thus able to perform
Large Segment Offload on what it thinks is a TCP datagram.
The reality behind the scenes is a bit more complex: what gets handed to the NIC is an oversized
TCP-IP-MAC frame (up to 64K long) with an STT-IP-MAC header. The TCP segments produced by the
NIC are thus not the actual TCP segments, but segments of the STT frame passed to the NIC.


Figure 2-11: STT uses TCP segmentation offload to increase performance


WHY DO WE HAVE THREE DIFFERENT STANDARDS


Here's my cynical view: every single vendor launching a MAC-over-IP encapsulation scheme tried to
make its life easy. Cisco already has VXLAN-capable hardware (VXLAN header format is similar to
OTV and LISP), you can probably figure out who has GRE-capable hardware by going through the
list of NVGRE draft authors, and Nicira focused on what they see as the most important piece of the
puzzle - the performance of the servers where the VMs are running.
Randy Bush called this approach to standard development "throwing spaghetti at the wall to see
what sticks", which is definitely an amusing pastime unless you happen to be the wall.


Do we really need new encapsulation standards? Wouldn't it be easier to use MPLS (perhaps MPLS-over-GRE)? This blog post reflects my thinking in summer 2012; I've inserted notes in summer 2014
to describe what has changed in the meantime.

COULD MPLS-OVER-IP REPLACE VXLAN OR NVGRE?


A lot of engineers are concerned with what seems to be frivolous creation of new encapsulation
formats supporting virtual networks. While STT makes technical sense (it allows soft switches to use
existing NIC TCP offload functionality), it's harder to figure out the benefits of VXLAN and NVGRE.
Scott Lowe wrote a great blog post recently where he asked a very valid question: "Couldn't we use
MPLS over GRE or IP?" We could, but we wouldn't gain anything by doing that.
RFC 4023 specifies two methods of MPLS-in-IP encapsulation: MPLS label stack on top of IP (using
IP protocol 137) and MPLS label stack on top of GRE (using the MPLS protocol type in the GRE header).
We could use either one of these and use either the traditional MPLS semantics or misuse the MPLS
label as a virtual network identifier (VNI). Let's analyze both options.

MISUSING MPLS LABEL AS VNI


In theory, one could use MPLS-over-IP or MPLS-over-GRE instead of VXLAN (or NVGRE) and use the
first MPLS label as the VNI. While this might work (after all, NVGRE reuses the GRE key as the VNI), it
would not gain us anything. The existing equipment would not recognize this creative use of MPLS
labels, and we still wouldn't have a control plane and would have to rely on IP multicast to
emulate virtual network L2 flooding.


The MPLS label = VNI approach would be totally incompatible with existing MPLS stacks and would
thus require new software in virtual-to-physical gateways. It would also go against the gist of MPLS -
labels should have local significance (whereas the VNI has network-wide significance) and should be
assigned independently by individual MPLS nodes (egress PE-routers in the MPLS/VPN case).
It's also questionable whether the existing hardware would be able to process MAC-in-MPLS-in-GRE-in-IP packets, which would be the only potential benefit of this approach. I know that some
(expensive) linecards in Catalyst 6500 can process IP-in-MPLS-in-GRE packets (as do some switches
from Juniper and HP), but can they process MAC-in-MPLS-in-GRE? Who knows.
Finally, like NVGRE, MPLS-over-GRE or MPLS-over-IP framing with the MPLS label being used as the VNI
lacks the entropy that could be used for load balancing purposes; existing switches would not be able to
load balance traffic between two hypervisor hosts unless each hypervisor host used multiple
IP addresses.

REUSING EXISTING MPLS PROTOCOL STACK


Reusing the MPLS label as a VNI buys us nothing; we're thus better off using STT or VXLAN (at least
equal-cost load balancing works decently well). How about using MPLS-over-GRE the way it was
intended to be used - as part of the MPLS protocol stack? Here we're stumbling across several major
roadblocks:

- No hypervisor vendor is willing to stop supporting L2 virtual networks because they just might be
required for mission-critical craplications running over Microsoft's Network Load Balancing, so
we can't use L3 MPLS VPN.


- There's no usable Ethernet-over-MPLS standard. VPLS is a kludge (= full mesh of pseudowires)
and alternate approaches (draft-raggarwa-mac-vpn and draft-ietf-l2vpn-evpn) are still on the
drawing board.
Summer 2014: EVPN is becoming a viable standard, and is used by Juniper Contrail and
Nuage VSP to integrate overlay layer-2 segments with external layer-2 gateways.

- MPLS-based VPNs require a decent control plane, including control-plane protocols like BGP, and
that would require some real work on hypervisor soft switches. Implementing an ad-hoc solution
like VXLAN based on the doing-more-with-less approach (= let's push the problem into someone
else's lap and require IP multicast in the network core) is cheaper and faster.
Juniper Contrail and Nuage VSP implemented the MPLS/VPN control plane (including MP-BGP) in
their controller, and use simpler protocols (Contrail: XMPP, VSP: OpenFlow) to distribute the
forwarding information to the hypervisor virtual switches.


SUMMARY
Using MPLS-over-IP/GRE to implement virtual networks makes marginal sense, does not solve the
load balancing problems NVGRE is facing, and requires significant investment in the hypervisor-side
control plane if you want to do it right. I don't expect to see it implemented any time soon (although
Nicira could do it pretty quickly should they find a customer who would be willing to pay for it).
There are two shipping implementations: Juniper Contrail and Nuage VSP. Both are coming
from traditional networking vendors that already had a field-tested MPLS protocol stack. Cisco
is talking about adding EVPN support to Nexus 1000V.


Open vSwitch-based overlay virtual networking solutions use tunnels between hypervisor hosts,
resulting in claims of scalability problems by competing camps (example: hardware networking
vendors). In reality, those tunnels are nothing more than a pure software construct required by
OpenFlow's forwarding model, as I explained in this blog post written in August 2013.

ARE OVERLAY NETWORKING TUNNELS A SCALABILITY NIGHTMARE?
Every time I mention overlay virtual networking tunnels someone starts worrying about the
scalability of this approach along the lines of "In a data center with hundreds of hosts, do I have an
impossibly high number of GRE tunnels in the full mesh? Are there scaling limitations to this
approach?"
Not surprisingly, some ToR switch vendors abuse this fear to the point where they look downright
stupid (but I guess that's their privilege), so let's set the record straight.
What are these tunnels?
The tunnels mentioned above are point-to-point GRE (or STT or VXLAN) tunnel interfaces between
Linux-based hypervisors. VXLAN implementations on Cisco Nexus 1000V, VMware vCNS or
(probably) VMware NSX for vSphere don't use tunnel interfaces (or at least we can't see them from
the outside).


Why do we need the tunnel interfaces?


The P2P overlay tunnels are an artifact of the OpenFlow-based forwarding implementation in Open
vSwitch. The OpenFlow forwarding model assumes point-to-point interfaces (switch-to-switch or
switch-to-host links) and cannot deal with multipoint interfaces (mGRE tunnels in Cisco IOS parlance).
The OpenFlow controller (Nicira NVP) thus cannot set the transport network next hop (VTEP in VXLAN)
on a multi-access tunnel interface in a forwarding rule; the only feasible workaround is to create
numerous P2P tunnel interfaces, associating one (or more) of them with every potential destination
VTEP.
Do I have to care about them?
Absolutely not. They are auto-provisioned by one of the Open vSwitch daemons (using ovsdb-proto),
exist only on Linux hosts, and add no additional state to the transport network (apart from
the MAC and ARP entries for the hypervisor host, which the transport network has to have anyway).
Will they scale?
Short summary: Yes. The real scalability bottleneck is the controller and the number of hypervisor
hosts it can manage.
Every hypervisor host has only the tunnels it needs. If a hypervisor host runs 50 VMs and every VM
belongs to a different logical subnet with another 50 VMs in the same subnet (scattered across 50
other hypervisor hosts), the host needs 2500 tunnel interfaces going to 2500 destination VTEPs.
In recent releases of Open vSwitch, the tunnel interfaces remain a pure software construct
within the Open vSwitch implementation - they are not Linux kernel interfaces.


Obviously you could potentially hit Open vSwitch scalability limits if you have large virtualization
ratios and huge subnets (and I couldn't find what they are - comments welcome), and distributed
L3 forwarding makes things an order of magnitude worse, but since a single NVP controller cluster
doesn't scale beyond 5000 hypervisors at the moment that also puts an upper bound on the number
of tunnel interfaces a Linux host might need.
So what's all the fuss then?
As I wrote in the introductory paragraph, it's pure FUD created by hardware vendors. Now that you
know what's going on behind the scenes, lean back and enjoy every time someone mentions it (and you
might want to ask a few pointed questions ;).


Another FUD item raised by traditional networking vendors is the lack of QoS support in overlay
virtual networking solutions. As I explained in September 2013, those claims have no basis in
reality.

OVERLAY NETWORKS AND QOS FUD


One of the usual complaints I hear whenever I mention overlay virtual networks is "with overlay
networks we lose all application visibility and QoS functionality" ... that worked so phenomenally in
the physical networks, right?

THE WONDERFUL QOS THE PHYSICAL HARDWARE GIVES YOU


To put my ramblings into perspective, let's start with what we do have today. Most hardware
vendors give you basic DiffServ functionality: classification based on L2-4 information, DSCP or
802.1p (CoS) marking, policing and queuing. Shaping is rare. Traffic engineering is almost
nonexistent (while some platforms support MPLS TE, I haven't seen many people brave enough to
deploy it in their data center network).
Usually a single vendor delivers an inconsistent set of QoS features that vary from platform to
platform (based on the ASIC or merchant silicon used) or even from linecard to linecard (don't even
mention Catalyst 6500). Sometimes you need different commands or command syntax to configure
QoS on different platforms from the same hardware vendor.


I don't blame the vendors. Doing QoS at gigabit speeds in a terabit fabric is hard. Really hard.
Having thousands of hardware output queues per port or hardware-based shaping is expensive (why
do you think we had to pay an arm and a leg for ATM adapters?).

DO WE NEED QOS?
Maybe not. Maybe it's cheaper to build a leaf-and-spine fabric with more bandwidth than your
servers can consume. Learn from the global Internet - everyone talks about QoS, but the emperor is
still naked.

HOW SHOULD QOS WORK?


The only realistic QoS technology that works at terabit speeds is DiffServ - packet classification is
encoded in DSCP or CoS (802.1p bits). In an ideal world the applications (or host OS) set the DSCP
bits based on their needs, and the network accepts (or rewrites) the DSCP settings and provides the
differentiated queuing, shaping and dropping.
In reality, the classification is usually done on the ingress network device, because we prefer playing
MacGyvers instead of telling our customers (= applications) "what you mark is what you get".
Finally, there are the poor souls that do QoS classification and marking in the network core because
someone bought them edge switches that are too stupid to do it.


HOW MUCH QOS DO WE GET IN THE VIRTUAL SWITCHES?


Now let's focus on the QoS functionality of the new network edge: the virtual switches. As in the
physical world, there's a full range of offerings, from minimalistic to pretty comprehensive:

- vDS in vSphere 5.1 has minimal QoS support: per-pool 802.1p marking and queuing;
- Nexus 1000V has a full suite of classification, marking, policing and queuing tools. It also copies inner DSCP and CoS values into the VXLAN+MAC envelope;
- VMware NSX (the currently shipping NVP 3.1 release) uses a typical service provider model: you can define minimal (affecting queuing) and maximal (triggering policing) bandwidth per VM, accept or overwrite DSCP settings, and copy DSCP bits from virtual traffic into the transport envelopes;
- vDS in vSphere 5.5 has a full 5-tuple classifier and CoS/DSCP marking. NSX for vSphere uses vDS and probably relies on its QoS functionality.

In my opinion, you get pretty much par for the course with the features of Nexus 1000V, VMware
NSX or vSphere 5.5 vDS, and you get DSCP-based classification of overlay traffic with VMware NSX
and Nexus 1000V.
It is true that you won't be able to do per-TCP-port classification and marking of overlay virtual
traffic in your ToR switch any time soon (but I'm positive there are at least a few vendors working
on it).
It's also true that someone will have to configure classification and marking on the new network
edge (in virtual switches) using a different toolset, but if that's an insurmountable problem, you
might want to start looking for a new job anyway.


Virtual switches can make more informed QoS decisions than their physical counterparts, as they have
more visibility into the VM behavior. Here's a description of some initial ideas (written in the summer
of 2014):

MICE, ELEPHANTS AND VIRTUAL SWITCHES


The "Mice and Elephants" is a traditional QoS fable - latency-sensitive real-time traffic (or a
request-response protocol like HTTP) stuck in the same queue behind megabytes of file transfer (or backup
or iSCSI) traffic.
The solution is also well known - color the elephants pink (aka DSCP marking) and sort them into a
different queue - until reality intervenes.
It seems oh-so-impossible to figure out which applications might generate elephant flows and mark
them accordingly on the originating server; there's no other way to explain the need for traffic
classification and marking on the ingress switch, and other MacGyver contraptions the networking
team uses to make sure "it's not the network's fault" instead of saying "we're a utility - you're
getting exactly what you've asked for."
Matching TCP and UDP port numbers on the server (because FTP sessions tend to be more
elephantine than DNS requests) and setting DSCP values of outbound packets is also obviously
mission impossible for some people; it's way easier to pretend the problem doesn't exist and blame
the network for lack of proper traffic classification.


One has to wonder how well the recent surge of application-aware networking solutions will
fare if the server/application teams cannot be bothered to tell the network what type of
traffic it's facing by setting a simple one-byte value in each packet, but let's not go there.
Anyway, the situation gets worse in environments with truly unclassifiable traffic (as the ultimate
abomination, imagine a solution doing backups over HTTP) where it's impossible to separate elephants
from mice based on their TCP/UDP port numbers.
If, however, one had insight into the operating system TCP buffers, or measured the per-flow
rate, one might be able to figure out which flows exhibit overweight tendencies - and that's exactly
what the Open vSwitch (OVS) team did.
Additionally, OVS appears as a TCP-offload-capable NIC to the virtual machines, and the bulk
applications happily dump megabyte-sized TCP segments straight into the output queue of the VM
NIC, where it's easy for the underlying hypervisor software (OVS) to spot them and mark them with
a different DSCP value (this idea is marked as pending in Martin Casado's presentation).
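Here's a minimal Python sketch of the rate-based flavor of that idea: measure per-flow byte counts over a short interval and remark anything above a threshold (the threshold, interval and DSCP values are my assumptions, not what OVS actually implements):

```python
# Toy rate-based elephant detector: flows moving more than a threshold amount
# of data per measurement interval get remarked into a "bulk" DSCP class.
ELEPHANT_BPS = 10_000_000          # assumed threshold, roughly 80 Mbps sustained
INTERVAL = 1.0                     # measurement interval in seconds
DSCP_BULK, DSCP_DEFAULT = 8, 0     # CS1 for elephants, best effort for mice

def classify(flow_bytes):
    """Map each 5-tuple to the DSCP value its packets should get next interval."""
    return {flow: DSCP_BULK if count / INTERVAL > ELEPHANT_BPS else DSCP_DEFAULT
            for flow, count in flow_bytes.items()}

sample = {
    ("10.0.0.1", "10.0.0.2", 6, 51000, 443): 200_000_000,   # backup over HTTPS
    ("10.0.0.1", "10.0.0.3", 17, 53001, 53): 2_000,         # DNS query
}
print(classify(sample))
```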
The results (documented in a presentation) shouldn't be surprising - we know ping isn't affected by
an ongoing FTP transfer if they happen to be in different queues since the days Fred Baker proudly
presented the first measurement results of the then-revolutionary Weighted Fair Queuing
mechanism (this is the only presentation I could find, but WFQ already existed in late 1995) at some
mid-90s incarnation of Cisco Live (probably even before the days Cisco Live was called
"Networkers").
The OVS-based elephant identification is a cool idea, although one has to wonder how well it works
in practice if it measures the flow rate (see also OVS scaling woes).


Telling people how awesome it is that Cumulus-powered switches react to elephant flows in
hardware is pure marketing - every switch works well when faced with properly marked packets.
Calling DSCP marking "overlay-to-underlay integration" is hogwash (no, I will not link to the
source); we've been using DSCP marking for decades with no need for fancy names.


Taking a step back from the QoS discussions - do we need QoS at all? How much bandwidth do we
really need in a typical enterprise data center?

HOW MUCH DATA CENTER BANDWIDTH DO YOU REALLY NEED?
Networking vendors are quick to point out how the opaqueness (read: we don't have the HW to look
into it) of overlay networks presents visibility problems and how their favorite shiny gizmo
(whatever it is) gives you better results (they usually forget to mention the lock-in that it creates).
Now let's step back and ask a fundamental question: how much bandwidth do we need?
Disclaimer: If you're running a large public cloud or anything similarly sized, this is not the
post you're looking for.

Let's assume:

- We have a mid-sized workload of 10,000 VMs (that's probably more than most private clouds see, but let's err on the high side);
- The average long-term sustained network traffic generated by every VM is around 100 Mbps (I would love to see a single VM that's not doing video streaming or network services doing that, but that's another story).


The average bandwidth you need in your data center is thus 1 Tbps. Every pizza-box ToR switch you
can buy today has at least 1.28 Tbps of non-blocking bandwidth. Even discounting for marketing
math, you don't need more than two ToR switches to satisfy your bandwidth needs (remember: if
you have only two ToR switches you have 1.28 Tbps of full-duplex non-blocking bandwidth).
If that's not enough (or you think you should take traffic peaks into account), take a pair of Nexus
6000s or build a leaf-and-spine fabric.
In many cases VMs have to touch storage to deliver data to their clients, and that's where the real
bottleneck is. Assuming only 10% of the VM-generated data comes from the spinning rust (or SSDs),
I'd love to see the storage delivering a sustained average throughput of 100 Gbps.
How about another back-of-the-napkin calculation:

- A data center has two 10Gbps WAN links;
- 90% of the traffic stays within the data center (yet again on the high side - supposedly 70-80% is a more realistic number).

Based on these figures, the total bandwidth needed in the data center is 200 Gbps. Adjust the
calculation for your specific case, but I don't think many of you will get above 1-2 Tbps.
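The same arithmetic in a few lines of Python, in case you want to plug in your own numbers (the inputs are the assumptions from the two examples above):

```python
# Back-of-the-napkin data center bandwidth estimates from the text.
vms, per_vm_mbps = 10_000, 100
print(f"VM-driven estimate: {vms * per_vm_mbps / 1_000_000:.1f} Tbps")    # 1.0 Tbps

wan_gbps, internal_share = 2 * 10, 0.9
print(f"WAN-driven estimate: {wan_gbps / (1 - internal_share):.0f} Gbps")  # 200 Gbps
```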
Obviously you might have bandwidth/QoS problems if:

- You use legacy equipment full of oversubscribed GE linecards;

- You still run a three-tier DC architecture with heavy oversubscription between tiers;


- You built a leaf-and-spine fabric with 10:1 oversubscription (yeah, I've seen that);

- You have no idea how much traffic your VMs generate and thus totally miscalculate the
oversubscription factor;

... but that has nothing to do with overlay virtual networks - if anything of the above is true you
have a problem regardless of what you run in your data center.


Although QoS seems to be irrelevant, and we might have plenty of bandwidth in new leaf-and-spine
fabrics, it doesn't mean that we can simply ignore the physical network (which is unfortunately the
message coming from the marketing department of a large virtualization vendor).

CAN WE REALLY IGNORE SPAGHETTI AND HORSESHOES?


Brad Hedlund wrote a thought-provoking article a few weeks ago, claiming that the horseshoes (or
trombones) and spaghetti created by virtual workloads and appliances deployed anywhere in the
network don't matter much with new data center designs that behave like distributed switches. In
theory, he's right. In practice, less so.
Brad has some other interesting ideas, for example "L2 doesn't matter anymore" (absolutely agree),
so make sure you read the whole article.
Let's take a step back. Brad started his reasoning by comparing data center fabrics with physical
switches, saying "We don't need to engineer the switch" and "We don't worry too much about how
this internal fabric topology or forwarding logic is constructed, or the consequential number of
hops."
Well, we don't ... until we stumble across an oversubscribed linecard, or a linecard design that allows
us to send a total of 10Gbps through four 10GE ports. The situation gets worse when we have to
deal with stackable switches, where it matters a lot whether the traffic has to traverse the stacking
cables or not (not to mention designs where switches in a stack are connected with one or a few
regular links).


The situation is no different in the virtual switch scenario. We don't care about the number of hops
across the virtual switch and the latency as long as it's consistent and predictable. If the virtual
switch uses a Clos-like architecture that Brad can build from tens of switches, or Cisco can build from
Nexus 7K/5K/2K (watch the Evolving Network Fabrics video from Cisco's presentation @ Networking
Tech Field Day 2 - the fun part starts at approximately 26:00) ... or that you can buy prepackaged
and tested from Juniper, then the traffic flows truly don't matter - any two points not connected to
the same access-layer switch are equidistant in terms of bandwidth and latency. As soon as you go
beyond a single fabric, or out of a single data center, the situation changes dramatically, and
bandwidth and latency become yet again very relevant.
Then there's also the question of costs. Given an infinite budget, it's easy to build very large fabrics
that give the "location doesn't matter" illusion to as many servers or virtual machines as needed.
Some of us are not that lucky; we have to live with fixed budgets, and we're usually caught in a
catch-22 situation. Wasting bandwidth to support spaghetti-like traffic flows costs real CapEx money
(not to mention not-so-very-cheap maintenance contracts); trying to straighten cooked spaghetti
continuously being made by virtualized workloads generates OpEx costs - you have to figure out
which one costs you less in the long run.
Last but not least, very large fabrics are more expensive (per port) than smaller ones due to the
increased number of Clos stages, so you have to stop somewhere - supporting constant
bandwidth/latency across the whole data center is simply too expensive.
I'm positive Brad knows all that, as do a lot of very smart people doing large-scale data center
designs. Unfortunately, not everyone will get the right message, and a lot of people will subscribe
to the "traffic flows don't matter anymore" mantra without understanding the underlying
assumptions (like they did to the "stretched clusters make perfect sense" one), and get burnt in the
process because they'll deploy workloads across uneven fabrics or even across lower-speed links.


For example, a few microseconds after VXLAN was launched, someone in his infinite wisdom claimed
that VXLAN solves inter-data center VM mobility (no, it doesn't). It seems that every generation of
engineers has to rediscover the fallacies of distributed computing, and the new cycle of discoveries
has just begun, fueled by the hype surrounding virtual switches and software-defined networking.


Another interesting question to ask is: "What happens with IP TTL in overlay virtual networks?" This
post (written in October 2013) tries to answer that question for multiple varieties of overlay virtual
networking.

TTL IN OVERLAY VIRTUAL NETWORKS


After we get rid of the QoS FUD, the next question I usually get when discussing overlay networks is
"how should these networks treat IP TTL?"
As (almost) always, the answer is "It depends."

LAYER-2 VIRTUAL NETWORKS


Overlay virtual networking solutions like VXLAN that implement layer-2 segments (effectively
Ethernet-over-something) should not modify the VM-generated traffic. These solutions are
emulating a transparent bridge and should NOT interact with the user traffic; all they can do is
forward, flood or drop.
There's the minor annoyance of CoS or DSCP packet marking, but let's ignore that detail.


Obviously the transport TTL (the TTL generated by the hypervisor when encapsulating the VM traffic)
shouldn't reflect the VM-generated TTL. The VM-generated TTL could be anything (the VM could also
generate non-IP traffic), while the transport TTL needs to be high enough to allow the packet to
traverse the data center core.
Conclusions:

Dont touch the overlay (VM) TTL;

Use whatever TTL makes sense in the transport network.

LAYER-3 VIRTUAL NETWORKS


Solutions that implement layer-3 forwarding are usually emulating Ethernet segments (layer-2
segments) connected with routers. In some cases the whole virtual network acts as a single virtual
router (VMware NSX Distributed Router, Hyper-V, NEC ProgrammableFlow ...), in others the inter-subnet
traffic flows through a gateway appliance or a VM (VMware NSX Services Router, default
OpenStack networking ...).
These solutions SHOULD decrement TTL like any other router (or layer-3 switch) would do. If they
wish to stay as close to the emulated Ethernet behavior as possible, they SHOULD decrement TTL if
and only if the packet crosses subnet boundaries (or you might get crazy problems with application
software that sends packets with TTL = 1).
For example, Hyper-V Network Virtualization SHOULD NOT decrement TTL if the source and
destination VM belong to the same subnet (even though the HNV module actually performs L3
lookup to figure out where to send the packet) but SHOULD decrement TTL if the destination VM
belongs to another IPv4 or IPv6 subnet.


In fact, Hyper-V network virtualization in Windows Server 2012 R2 does not decrement TTL
even when a packet crosses the subnet boundary.
Like in the layer-2 case, the transport TTL has nothing in common with the VM-generated TTL;
hypervisors should use whatever TTL they need to get the encapsulated traffic across the data
center fabric.
Conclusions (illustrated in the sketch below):

• Decrement TTL like a router would do;
• Don't copy overlay TTL into transport TTL or vice versa;
• Use whatever TTL makes sense in the transport network.
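To make those rules concrete, here's a minimal Python sketch of both the layer-2 and the layer-3 behavior; the dict-based packet representation, the fixed transport TTL and the subnet comparison are invented simplifications, not any vendor's forwarding code.

import ipaddress

TRANSPORT_TTL = 64                              # whatever makes sense in the transport network

def l2_encapsulate(inner_packet, remote_vtep):
    """Layer-2 overlay: never touch the VM-generated TTL."""
    return {"outer_dst": remote_vtep, "outer_ttl": TRANSPORT_TTL,
            "payload": inner_packet}            # inner TTL travels unchanged

def l3_forward(packet, src_subnet, dst_subnet):
    """Layer-3 overlay: decrement TTL only when crossing subnet boundaries."""
    if dst_subnet != src_subnet:                # inter-subnet = routed hop
        if packet["ttl"] <= 1:
            return None                         # drop (TTL exceeded)
        packet["ttl"] -= 1
    return packet                               # intra-subnet: TTL untouched

# A VM-generated packet with TTL 1 still reaches a neighbor in the same
# subnet, but is dropped as soon as it has to cross into another subnet.
pkt = {"dst": "10.0.1.20", "ttl": 1}
same_subnet = l3_forward(dict(pkt), ipaddress.ip_network("10.0.1.0/24"),
                         ipaddress.ip_network("10.0.1.0/24"))
other_subnet = l3_forward(dict(pkt), ipaddress.ip_network("10.0.1.0/24"),
                          ipaddress.ip_network("10.0.2.0/24"))
print(same_subnet, other_subnet, l2_encapsulate(pkt, "192.168.1.2"))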

BUT THIS IS NOT HOW MPLS WORKS


Really? Well, this is EXACTLY how L2VPNs (EoMPLS, VPLS, EVPN) work.
MPLS-based L3VPN (the original MPLS/VPN) is a totally different story: it's not supposed to
emulate a single virtual router, but a whole WAN. Copying customer TTL into provider TTL (and vice
versa) is the most natural thing to do under those circumstances (unless the provider wants to hide
the internal network details).


The final blog post in this chapter is in the "other" category. It answers the question "Does it make
sense to run vMotion over VXLAN?"

VMOTION AND VXLAN


A while ago I wrote "vMotion over VXLAN is stupid and unnecessary" in a comment to a blog post by
Duncan Epping, assuming everyone knew the necessary background details. I was wrong (again).
Trying to understand my reasoning, Jon sent me a very nice question:

vMotion is an exchange of data between hypervisor kernels, VXLAN is a VM networking
solution. I get that, but in one of your videos you say VXLAN is a solution to meet the L2
adjacency requirements of vMotion.
If you design a L3 IP transport network (i.e. L3 from access to aggregation) but you wanted
to use vMotion, then how could you do that unless you used an overlay technology such
as VXLAN to extend the VLAN across the underlying IP network?

My somewhat imprecise claims often get me in trouble (this wouldn't be the first time), so let me try to
straighten things out.
vMotion requires:

• L2 adjacency between the source and target hypervisor hosts for the port group in which the VM
resides. Without L2 adjacency you cannot move a live IP address and retain all sessions (solutions
like Enterasys host routing are an alternative if you don't mind longer traffic interruptions caused
by routing protocol convergence time);


• IP connectivity between the vmkernel interfaces of the hypervisor hosts (vMotion uses TCP to
transport data between hypervisors). VMware always claimed that you need L2 connectivity
between hypervisor hosts, and that vMotion between hosts residing in multiple subnets is
unsupported (supposedly it became supported last year), but it always worked.

In other words, when you move a VM, it must reside in the same L2 segment after the move (the
source and target hypervisor hosts can be in different subnets). You can implement that
requirement with VLANs (which require end-to-end L2 connectivity) or VXLAN (which can emulate L2
segments across L3 infrastructure).
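If you prefer to see the two requirements side by side, here's a trivial Python sketch; the function and its arguments are invented purely for illustration.

def vmotion_possible(vm_segment_on_source, vm_segment_on_target, vmk_ip_reachable):
    # 1) the VM-facing port group must map to the same L2 segment on both hosts
    #    (a VLAN stretched end-to-end, or the same VXLAN segment over L3)
    same_segment = vm_segment_on_source == vm_segment_on_target
    # 2) the vmkernel interfaces only need IP (TCP) connectivity
    return same_segment and vmk_ip_reachable

print(vmotion_possible("VXLAN-5001", "VXLAN-5001", vmk_ip_reachable=True))   # True
print(vmotion_possible("VLAN-10", "VLAN-20", vmk_ip_reachable=True))         # False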


OVERLAY VIRTUAL NETWORKING PRODUCT DETAILS

IN THIS CHAPTER:
OVERLAY VIRTUAL NETWORKING SOLUTIONS OVERVIEW
WHAT IS VMWARE NSX?
VMWARE NSX CONTROL PLANE
LAYER-2 AND LAYER-3 SWITCHING IN VMWARE NSX
LAYER-3 FORWARDING WITH VMWARE NSX EDGE SERVICES ROUTER
OPEN VSWITCH UNDER THE HOOD
ROUTING PROTOCOLS ON NSX EDGE SERVICES ROUTER
UNICAST-ONLY VXLAN FINALLY SHIPPING
WHAT'S COMING IN HYPER-V NETWORK VIRTUALIZATION (WINDOWS SERVER 2012 R2)


NETWORKING ENHANCEMENTS IN WINDOWS SERVER 2012 R2


VIRTUAL PACKET FORWARDING IN HYPER-V NETWORK VIRTUALIZATION
HYPER-V NETWORK VIRTUALIZATION PACKET FORWARDING IMPROVEMENTS IN
WINDOWS SERVER 2012 R2
COMPLEX ROUTING IN HYPER-V NETWORK VIRTUALIZATION
THIS IS NOT THE HOST ROUTE YOU'RE LOOKING FOR
OPENSTACK NEUTRON PLUG-IN: THERE CAN ONLY BE ONE
PACKET FORWARDING IN AMAZON VPC
MIDOKURA'S MIDONET: A LAYER 2-4 VIRTUAL NETWORK SOLUTION
BIG SWITCH AND OVERLAY NETWORKS


This chapter contains a collection of blog posts describing architectural and implementation details of
various overlay virtual networking products, including:

• VMware NSX;
• VXLAN on Cisco Nexus 1000V;
• Hyper-V Network Virtualization;
• OpenStack Neutron;
• Amazon VPC;
• Midokura MidoNet.

You'll find even more details in the Overlay Virtual Networking webinar.


This overview of overlay virtual networking solutions was written in early 2014 and updated in
August 2014.

OVERLAY VIRTUAL NETWORKING SOLUTIONS OVERVIEW

2013 was definitely the year of overlay virtual networks, with every major networking and
virtualization vendor launching a new product or adding significant functionality to an existing one.
Here's a brief overview of what they're offering:

Figure 3-1: Overlay virtual networking solution overview (from Cloud Computing Networking webinar)


NOTES

• The table includes shipping products with publicly available documentation.
• One can always implement a gateway between an overlay network and the physical world with a
multi-NIC VM. Those solutions aren't listed in the Gateways column.
• Arista (7150), F5 (LTM) and Cisco (Nexus 9300) are shipping hardware VXLAN VTEPs using the IP
multicast control plane. There's no hardware gateway supporting unicast VXLAN as implemented
by Cisco Nexus 1000V or VMware NSX for vSphere;
• VMware NSX for multi-hypervisors has bare-metal L2 and L3 gateways.
• VMware NSX for vSphere has VM-based gateways with in-kernel L2 and L3 packet forwarding.
• Brocade is shipping a hardware gateway for VMware NSX for multiple hypervisors using the OVSDB
protocol.
• Multiple vendors have announced hardware NVGRE gateways. No major vendor has shipped one.
• Scalability of Hyper-V depends on the orchestration system. I couldn't find the maximum number of
managed hosts supported by SC VMM 2012; earlier versions support up to 400 hosts.
• I couldn't find the number of hosts or ports supported by a Contrail controller. The scale-out
architecture using controller federation and BGP route reflectors is probably limited by the
number of MP-BGP routes supported by the BGP route reflector (the Contrail solution uses one
VPNv4 host route per VM IP address).


This post was written in August 2013 a few days after the VMware NSX launch. It accurately
describes VMware NSX for multiple hypervisors release 4.0 and VMware NSX for vSphere release
6.0.

WHAT IS VMWARE NSX?


Answer#1: An overlay virtual networking solution providing logical bridging (aka layer-2 forwarding
or switching), logical routing (aka layer-3 switching), distributed or centralized firewalls, load
balancers, NAT and VPNs.
Answer#2: A merger of Nicira NVP and VMware vCNS (a product formerly known as vShield).
Oh, and did I mention it's actually two products, not one?
VMware NSX for multi-hypervisor environments is Nicira NVP with ESXi and VXLAN
enhancements:

• The OVS-in-VM approach has been replaced with an NSX vSwitch within the ESXi kernel;
• VMware NSX supports GRE, STT and VXLAN encapsulation, with VXLAN operating in unicast
mode with either source node or service node packet replication. The unicast mode is not
compatible with the Nexus 1000V VXLAN unicast mode;
• The NSX unicast VXLAN implementation will eventually work with third-party VTEPs (there's usually a
slight time gap between a press release and a shipping product) using ovsdb-proto as the control
plane.


Apart from that, the feature list closely matches existing Nicira NVP functionality: distributed L2
forwarding, distributed or centralized L2 or L3 forwarding, reflexive VM NIC ACLs, controllers and
L2/L3 gateways as physical appliances.
Use cases: OpenStack and CloudStack deployments using Xen, KVM or ESXi hypervisors.
VMware NSX optimized for vSphere is a totally different beast:

• While the overall architecture looks similar to Nicira NVP, it seems there's no OVS or OpenFlow
under the hood.
• Hypervisor virtual switches are based on vDS switches; VXLAN encapsulation, distributed
firewalls and distributed layer-3 forwarding are implemented as loadable ESXi kernel modules.
• NVP controllers run in virtual machines and are tightly integrated with vCenter through NSX
Manager (which replaces vShield Manager);
• Distributed layer-3 forwarding uses a central control plane implemented in the NSX Edge Distributed
Router, which can run BGP or OSPF with the outside (physical) world;
• Another variant of NSX Edge (the Services Router) provides centralized L3 forwarding, N/S firewall,
load balancing, NAT, and VPN termination;
• Most components support IPv6 (hooray, finally!).

The Nicira NVP roots of NSX are evident. It's also pretty easy to see how individual NSX
components map into vCNS/vShield Edge: NSX Edge Services Router definitely looks like vShield
Edge on steroids, and the distributed firewall is probably based on vShield App.
Unfortunately, it seems that the goodies from the vSphere version of NSX (routing protocols, in-kernel
firewall) won't make it to vCNS 5.5 (but let's wait and see how the packaging/licensing looks when
the products launch).


DOES IT ALL MAKE SENSE?


Sure it does. VMware NSX seems to be a successful blend of two pretty mature products with loads
of improvements (some of them badly needed), and it seems that once all the wrinkles have been
ironed out, VMware NSX for vSphere will be the most comprehensive virtual networking product you
can get (unfortunately you can't get your own copy of Amazon VPC).
The only problem I see is the breadth of the offering. VMware has three semi-competing, partially
overlapping products implementing overlay virtual networks:

• NSX for multi-hypervisor environments using NVP controllers, NVP gateways and OVS (for Linux
and ESXi environments);
• NSX for vSphere using NVP controllers, vSphere kernel modules and NSX Edge gateways;
• vCNS with vShield App firewall and vShield Edge firewall/load balancer/router.

It will be fun to see how the three products evolve in the future and how the diverging code base
will impact feature parity.
To learn more about NSX architecture, watch the videos from the free VMware NSX Architecture
webinar sponsored by VMware.


This blog post describes the control plane used by VMware NSX for Multiple Hypervisors release 4.0.

VMWARE NSX CONTROL PLANE


In the previous posts I described how a typical overlay virtual networking data plane works and
what technologies vendors use to implement the associated control plane that maps VM MAC
addresses to transport IP addresses. Now let's walk through the details of a particular
implementation: Nicira Network Virtualization Platform (NVP), now rebranded as VMware NSX for
multiple hypervisors.

COMPONENTS
VMware NSX for multiple hypervisors release 4.0 (NSX for the rest of this blog post) relies on Open
vSwitch (OVS) to implement hypervisor soft switching. OVS could use dynamic MAC learning (and it
does when used with the OpenStack OVS Quantum plugin) or an external OpenFlow controller.
A typical OpenFlow-based Open vSwitch implementation has three components:

• A flow-based forwarding module loaded in the Linux kernel;
• A user-space OpenFlow agent that communicates with the OpenFlow controller and provides the
kernel module with flow forwarding information;
• A user-space OVS database (ovsdb) daemon that keeps track of the local OVS configuration
(bridges, interfaces, controllers ...).


Figure 3-2: VMware NSX control plane

NSX uses a cluster of controllers (currently 3 or 5) to communicate with OVS switches (OVS
switches can connect to one or more controllers with automatic failover). It uses two protocols to
communicate with the switches: OpenFlow to download forwarding entries into the OVS and
ovsdb-proto to configure bridges (datapaths) and interfaces in OVS.


SIMPLE TWO HOST SETUP


Let's start with a simple two-host setup, with a single VM running on each host. The GRE tunnel
between the hosts is used to encapsulate the layer-2 traffic between the VMs.

Figure 3-3: Simple two host setup

The NSX OpenFlow controller has to download just a few OpenFlow entries into the two Open vSwitches
to enable communication between the two VMs (for the moment, we're ignoring BUM flooding).


Figure 3-4: OpenFlow forwarding entries downloaded by the VMware NSX OpenFlow controller

ADDING A THIRD VM AND HOST


When the user starts a third VM in the same segment on host C, two things have to happen (see the
sketch after this list):

• The NVP controller must tell the ovsdb-daemon on all three hosts to create new tunnel interfaces
and connect them to the correct OVS datapath;
• The NVP controller downloads new flow entries to the OVS switches on all three hosts.
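Here's a hypothetical Python sketch of the bookkeeping behind those two steps; the data structures and print statements stand in for the real ovsdb and OpenFlow messages and have nothing to do with the actual controller code.

from collections import defaultdict

segment_members = defaultdict(set)    # segment -> hypervisor transport IPs
mac_to_vtep = {}                      # (segment, VM MAC) -> transport IP

def vm_started(segment, vm_mac, host_ip):
    # step 1: ovsdb - create the tunnel interfaces that don't exist yet
    new_tunnels = [(host_ip, peer) for peer in segment_members[segment]
                   if peer != host_ip]
    segment_members[segment].add(host_ip)
    mac_to_vtep[(segment, vm_mac)] = host_ip
    for a, b in new_tunnels:
        print(f"create tunnel {a} <-> {b}")
    # step 2: OpenFlow - refresh MAC-to-tunnel forwarding entries on every host
    for host in segment_members[segment]:
        print(f"push flow entries for segment {segment} to {host}")

vm_started("blue", "aa:bb:cc:00:00:01", "192.168.1.1")    # host A
vm_started("blue", "aa:bb:cc:00:00:02", "192.168.1.2")    # host B
vm_started("blue", "aa:bb:cc:00:00:03", "192.168.1.3")    # host C: two new tunnels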


Figure 3-5: OpenFlow entries needed in a scenario with three VMs running on three separate hosts

BUM FLOODING
NVP supports two flooding mechanisms within a virtual layer-2 segment:
Flooding through a service node: all hypervisors send the BUM traffic to a service node (an
extra server that can serve numerous virtual segments), which replicates the traffic and sends it to
all hosts within the same segment. We would need a few extra tunnels and a handful of OpenFlow
entries to implement the service node-based flooding in our network:


Figure 3-6: VMware NSX BUM flooding through a service node

If the above description gives you heartburn caused by ATM LANE flashbacks, you're not
the only one ... but obviously the number of solutions to a certain networking problem isn't
infinite.


You can also tell the NVP controller to use source node replication: the source hypervisor sends
unicast copies of an encapsulated BUM frame to all other hypervisors participating in the same
virtual segment.
These are the flow entries that an NVP controller would configure in our network when using source
node replication:

Figure 3-7: VMware NSX source hypervisor flooding
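For comparison, here's a toy Python illustration of the two flooding options, assuming a simple per-segment membership table; NVP expresses the same logic as tunnels plus OpenFlow flow entries, not as code.

members = {"blue": ["192.168.1.1", "192.168.1.2", "192.168.1.3"]}
SERVICE_NODE = "192.168.9.9"

def flood(segment, src_host, bum_frame, use_service_node=False):
    if use_service_node:
        # single copy to the service node, which replicates it further
        return [(SERVICE_NODE, bum_frame)]
    # source node replication: one unicast copy per remote hypervisor
    return [(peer, bum_frame) for peer in members[segment] if peer != src_host]

print(flood("blue", "192.168.1.1", "ARP request"))
print(flood("blue", "192.168.1.1", "ARP request", use_service_node=True))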


This blog post, written in November 2013, describes the forwarding mechanisms used by VMware
NSX for multiple hypervisors release 4.0 and VMware NSX for vSphere release 6.0.

LAYER-2 AND LAYER-3 SWITCHING IN VMWARE NSX


All overlay virtual networking solutions look similar from far away: many provide layer-2 segments,
most of them have some sort of distributed layer-3 forwarding, gateways to the physical world are
ubiquitous, and you might find security features in some products.
The implementation details (usually hidden behind the scenes) vary widely, and I'll try to document
at least some of them in a series of blog posts, starting with VMware NSX.

LAYER-2 FORWARDING
VMware NSX supports traditional layer-2 segments with proper flooding of BUM (Broadcast,
Unknown unicast, Multicast) frames. NSX controller downloads forwarding entries to individual
virtual switches, either through OpenFlow (NSX for multiple hypervisors) or a proprietary protocol
(NSX for vSphere). The forwarding entries map destination VM MAC addresses into destination
hypevisor (or gateway) IP addresses.
On top of static forwarding entries downloaded from the controller, virtual switches perform dynamic
MAC learning for MAC addresses reachable through layer-2 gateways.
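A tiny Python model of that lookup behavior, with invented table contents; the real forwarding entries are OpenFlow flows or kernel-module tables, not dictionaries.

controller_entries = {                 # VM MAC -> destination hypervisor/gateway IP
    "aa:bb:cc:00:00:02": "192.168.1.2",
}
learned_entries = {}                   # MACs behind layer-2 gateways, learned dynamically

def learn(src_mac, src_vtep):
    if src_mac not in controller_entries:
        learned_entries[src_mac] = src_vtep      # dynamic MAC learning

def l2_lookup(dst_mac):
    if dst_mac in controller_entries:            # static entry from the controller
        return controller_entries[dst_mac]
    if dst_mac in learned_entries:               # learned behind an L2 gateway
        return learned_entries[dst_mac]
    return "flood (BUM handling)"                # unknown unicast

learn("00:11:22:33:44:55", "192.168.1.10")       # MAC seen behind an L2 gateway
print(l2_lookup("aa:bb:cc:00:00:02"))
print(l2_lookup("00:11:22:33:44:55"))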


LAYER-3 FORWARDING
NSX implements a distributed forwarding model with shared gateway IP and MAC addresses, very
similar to the optimal IP forwarding offered by Arista or Enterasys. NSX virtual switches aren't
independent devices, so they don't need independent IP addresses like physical ToR switches do.
Layer-3 lookup is always performed by the ingress node (hypervisor host or gateway); packet
forwarding from the ingress node to the egress node and destination host uses layer-2 forwarding. Every
ingress node thus needs (for every tenant) the three tables shown in the sketch after this list:

• IP routing table;
• ARP entries for all the tenant's hosts;
• MAC-to-underlay-IP mappings for all the tenant's hosts (see layer-2 forwarding above).
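Here's a simplified sketch of the resulting ingress lookup chain, reusing the same MAC-to-VTEP idea as the previous sketch; all table contents are invented, and the real lookups obviously happen in kernel modules rather than Python.

import ipaddress

routes = [(ipaddress.ip_network("10.0.2.0/24"), "connected")]    # tenant routing table
arp    = {"10.0.2.20": "aa:bb:cc:00:00:20"}                      # tenant IP -> VM MAC
vteps  = {"aa:bb:cc:00:00:20": "192.168.1.2"}                    # VM MAC -> underlay IP

def ingress_l3_forward(dst_ip):
    dst = ipaddress.ip_address(dst_ip)
    if not any(dst in net for net, _ in routes):
        return "drop (no route)"
    vm_mac = arp[dst_ip]                  # ARP entry supplied by the controller
    underlay_ip = vteps[vm_mac]           # layer-2 forwarding step (see above)
    return f"rewrite dst MAC to {vm_mac}, encapsulate toward {underlay_ip}"

print(ingress_l3_forward("10.0.2.20"))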

NSX for vSphere implements layer-3 forwarding in a separate vSphere kernel module. The User
World Agent (UWA) running within the vSphere host uses the proprietary protocol (mentioned above) to
get layer-3 forwarding information (routing tables) and ARP entries from the controller cluster. ARP
entries are cached in the layer-3 forwarding kernel module, and cache misses are propagated to the
controller.
NSX for multiple hypervisors implements the layer-3 forwarding data plane in the OVS kernel module,
but does not use OpenFlow to install forwarding entries.
A separate layer-3 daemon (running in user mode on the hypervisor host) receives forwarding
information from the NSX controller cluster through the OVSDB protocol, and handles all ARP processing
(sending ARP requests, caching responses ...) locally.


This blog post (written in November 2013) describes layer-3 forwarding on VMware NSX Edge
Services Router in VMware NSX for vSphere release 6.0.

LAYER-3 FORWARDING WITH VMWARE NSX EDGE SERVICES ROUTER
The easiest way of connecting overlay virtual networks implemented with VMware NSX for vSphere
to the outside world is the NSX Edge Services Router. It's a much improved version of vShield Edge and
provides way more than just layer-3 forwarding services: it's also a firewall, load balancer, NAT and
VPN termination device.
You can use a VMware NSX Edge Services Router (ESR) to connect multiple VXLAN-backed layer-2
segments within an application stack. You would configure the services router through NSX Manager
(the improved vShield Manager), and you'd get a VM connected to multiple VXLAN-based port groups
(and probably one or more VLAN-based port groups) behind the scenes.


Figure 3-8: VMware NSX Edge Services Router

In this scenario, VXLAN kernel modules resident in individual vSphere hosts perform layer-2
forwarding, sending packets between VM and ESR NICs. ESR performs layer-3 forwarding within the
VM context.


The NSX Edge Services Router is the ideal solution when you need network services (firewalls, load
balancers ...) between the client and the server. It's more than good enough for smaller
deployments or when the majority of the traffic leaves the overlay virtual networking world (you can
push up to 10 Gbps of traffic through it), but don't use it in high-volume environments with large
amounts of inter-subnet east-west traffic.
In those environments you might collapse multiple subnets into a single layer-2 segment (assuming
your security engineers approve the change in security paradigm introduced with VM NIC firewalls)
or use the distributed routing functionality of VMware NSX.


Open vSwitch is a Linux virtual switch used by numerous overlay virtual networking products, from
VMware NSX for multiple hypervisors release 4.0 to OpenStack Neutron, Nuage's VSP and
Midokura's MidoNet. This blog post describes some of the implementation details of Open vSwitch
when used with the VMware NSX for multiple hypervisors (formerly Nicira NVP) OpenFlow controller.
Note: Open vSwitch 1.11 added support for megaflows (OpenFlow flow entries copied directly into
the kernel packet forwarding module), replacing the exact-match per-flow kernel entries with entries
that correspond to OpenFlow flow entries. The following text has been updated to reflect the new
functionality.

OPEN VSWITCH UNDER THE HOOD


Hatem Naguib claimed that the NSX controller cluster "is completely out-of-band, and never handles
a data packet" when describing the VMware NSX Network Virtualization architecture, preemptively
avoiding the "flow-based forwarding doesn't scale" arguments usually triggered by stupidities like
this one.
Does that mean there's no packet punting in the NSX/Open vSwitch world? Not so fast.
First, to set the record straight, the NVP OpenFlow controller (NSX controller cluster) does not touch
actual packets. There's no switch-to-controller punting; NVP has enough topology information to
proactively download OpenFlow flow entries to Open vSwitch (OVS).
However, Open vSwitch has two components: the user-mode daemon (process switching in Cisco
IOS terms) and the kernel forwarding module, which implements flow matching, forwarding and
corresponding actions, but not the full complement of OpenFlow matching rules.


There's a third component present in every OVS environment: the ovsdb (OVS database) daemon,
but it's not relevant to this discussion, so we'll conveniently ignore it.
Whenever the first packet of a new flow passes through the Open vSwitch kernel module, it's sent to
the Open vSwitch daemon, which evaluates the OpenFlow rules downloaded from the OpenFlow
controller, accepts or drops the packet, and installs the corresponding forwarding rule into the kernel
module.
The Open vSwitch 2.x user-mode daemon copies OpenFlow matching rules to the kernel module
instead of creating per-flow entries. The initial packet matching a new OpenFlow rule is still
forwarded through the user-mode daemon.
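The following toy sketch mimics that cache-miss behavior; the dictionary plays the role of the kernel flow cache and the function the role of the user-mode daemon. It's a conceptual illustration only and doesn't resemble the actual OVS code paths.

kernel_cache = {}                    # simplified flow cache: dst MAC -> action

openflow_rules = [                   # proactively downloaded by the controller
    (lambda pkt: pkt["dst_mac"] == "aa:bb:cc:00:00:02", "output:tunnel_to_B"),
    (lambda pkt: True, "drop"),      # lowest-priority catch-all
]

def user_mode_daemon(pkt):
    for match, action in openflow_rules:
        if match(pkt):
            kernel_cache[pkt["dst_mac"]] = action    # install the kernel entry
            return action

def kernel_module(pkt):
    action = kernel_cache.get(pkt["dst_mac"])
    if action is None:               # cache miss: upcall to the user-mode daemon
        action = user_mode_daemon(pkt)
    return action

pkt = {"dst_mac": "aa:bb:cc:00:00:02"}
print(kernel_module(pkt))            # first packet goes through the daemon...
print(kernel_module(pkt))            # ...subsequent packets hit the kernel cache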
Does this sound similar to Multi-Layer Switching or the way Cisco's VSG and Nexus 1000V VEM
work? It's exactly the same concept, implemented in kernel/user space of a single hypervisor host.
There really is nothing new under the sun.
I would strongly recommend you read the well-written developer documentation if you want
to know the dirty details.

This approach keeps the kernel module simple and tidy, and allows the Open vSwitch architecture to
support other flow programming paradigms, not just OpenFlow: you can use OVS as a simple
learning bridge supporting VLANs, sFlow and NetFlow (not hard once you've implemented per-flow
forwarding), or you could implement your own forwarding paradigm while leveraging the stability of
the Open vSwitch kernel module that's included with version 3.3 of the Linux kernel and has already
made its way into standard Linux distributions.


Just to give you an example: Midokura chose to use the Open vSwitch kernel module in combination
with their user-mode daemon in the MidoNet product; you can install MidoNet on recent Linux
distributions without touching the kernel.


VM-based layer-3 gateways bundled with VMware NSX for vSphere support multiple routing
protocols. This blog post describes the behavior of NSX Edge Services Router in VMware NSX for
vSphere release 6.0.

ROUTING PROTOCOLS ON NSX EDGE SERVICES ROUTER
VMware gave me early access to an NSX hands-on lab a few days prior to VMworld 2013. The lab was
meant to demonstrate the basics of NSX, from VXLAN encapsulation to cross-subnet flooding, but I
quickly veered off the beaten path and started playing with routing protocols in NSX Edge
appliances.
I won't bore you with the configuration process. Let's just say that I got mightily annoyed with the
mandatory mouse-chasing skills, confirmed every single CLI-versus-GUI prejudice I ever got, but
nonetheless managed to get OSPF and BGP running on an NSX Edge appliance. Here's what I
configured:

• OSPF routing process with area 0 on the external interface and route redistribution of connected
routes into OSPF;
• BGP routing process with an IBGP neighbor and route redistribution of connected routes into
BGP.


The fun started after I managed to log into the appliance console. You might find this printout
familiar ;)

Figure 3-9: OSPF interfaces on NSX Edge Services Router

How about this one?

Figure 3-10: BGP routing table on NSX Edge Services Router


Here's another one to warm your heart:

Figure 3-11: IP routing table on NSX Edge Services Router

As you can see, they still have plenty of work to do (example: the subnet length is missing in the
BGP table printout), but the code is still a few months from being shipped, so I'm positive they'll fix
the obvious gotchas in the meantime.
Time to deploy the second appliance to see whether all this fun stuff actually works. It does.
You can see an OSPF neighbor...

Figure 3-12: OSPF neighbor on NSX Edge Services Router


... and a BGP neighbor.

Figure 3-13: BGP neighbor on NSX Edge Services Router


If you wish you can inspect the OSPF database:

Figure 3-14: OSPF database on NSX Edge Services Router

The NSX Edge OSPF process inserts some funky stuff into the OSPF database (you might want to check
how that impacts other OSPF gear before deploying NSX Edge in a production environment) and it
seems type-5 LSAs are not displayed (probably a bug).


The BGP table has prefixes from both appliances...

Figure 3-15: BGP table after BGP adjacency establishment

...and the routing and forwarding tables look OK. The whole thing just might work outside of a lab
environment.


Figure 3-16: Final routing table

THE GRUMPY PERSPECTIVE


The addition of routing protocols to NSX Edge is a great next step toward implementing a more
dynamic networking infrastructure. Does that mean that I'd use NSX Edge as a router? You must be
kidding; it's a great edge device, with just enough features to integrate with the core routing
functionality of your network.


Not unexpectedly, the configuration process really sucks. It takes forever to implement what one
could do with 10 CLI commands ... but then you probably wouldn't use the NSX Manager GUI but API
calls or PowerCLI to configure appliances in large-scale deployments.
Finally, does it make sense to run routing protocols on L4-7 appliances? If you ever spent hours
debugging a static route pointing in the wrong direction, you know the answer.


Cisco shipped unicast VXLAN on Nexus 1000V in June 2013. This blog post describes the details of
the initial unicast VXLAN implementation.

UNICAST-ONLY VXLAN FINALLY SHIPPING


The long-promised unicast-only VXLAN has finally shipped with the Nexus 1000V release
4.2(1)SV2(2.1) (there must be some logic behind those numbers, but they all look like madness to
me). The new Nexus 1000V release brings two significant VXLAN enhancements: unicast-only mode
and MAC distribution mode.

UNICAST-ONLY VXLAN
The initial VXLAN design and implementation took the traditional doing-more-with-less approach:
VXLANs behave exactly like VLANs (including most of the scalability challenges VLANs have) and rely
on a third-party tool (IP multicast) to solve the hard problems (MAC address learning) that both Nicira
and Microsoft solved with control-plane solutions.
Unicast-only VXLAN comes closer to what other overlay virtual networking vendors are doing: the
VSM knows which VEMs have VMs attached to a particular VXLAN segment and distributes that
information to all VEMs; each VEM receives a per-VXLAN list of destination IP addresses to use for
flooding purposes.


Read the Nexus 1000V blog post or watch my VMware Networking Technical Deep Dive
webinar for in-depth description of VSM and VEM.

MAC DISTRIBUTION MODE


MAC distribution mode goes a step further: it eliminates the process of data-plane MAC address
learning and replaces it with a control-plane solution (similar to Nicira/VMware NVP); the VSM
collects the list of MAC addresses and distributes the MAC-to-VTEP mappings to all VEMs
participating in a VXLAN segment.

OTHER GOODIES
Cisco also increased the maximum number of VEMs a single VSM can control to 128, and the
maximum number of virtual ports per VSM (DVS) to 4096.

DOES IT MATTER?
Sure it does. The requirement to use IP multicast to implement VXLAN flooding was a major
showstopper in data centers that have no other need for IP multicast (almost everyone apart from
financial institutions dealing with multicast-based market feeds). Unicast-only VXLAN will definitely
simplify VXLAN deployments and increase its adoption.


MAC distribution mode is a nice-to-have feature that you'd need primarily in large-scale cloud
deployments. Most reasonably sized enterprise data centers can probably live happily without it (of
course I might be missing something fundamental; do write a comment).

THE CAVEATS
The original VXLAN proposal was a data-plane-only solution: boxes from different vendors (not that
there would be that many of them) could freely interoperate as long as you configured the same IP
multicast group everywhere.
Unicast-only VXLAN needs a signaling protocol between the VSM (or another control/orchestration entity)
and individual VTEPs. The current protocol used between VSM and VEMs is probably proprietary;
Cisco claims to plan to use VXLAN over EVPN for inter-VSM connectivity, but who knows when the
Nexus 1000V code will ship. In the meantime, you cannot connect a VXLAN segment using
unicast-only VXLAN to a third-party gateway (example: Arista 7150).
Due to the lack of an inter-VSM protocol, you cannot scale a single VXLAN domain beyond 128 vSphere
hosts, probably limiting the size of your vCloud Director deployment. In multicast VXLAN
environments the vShield Manager automatically extends VXLAN segments across multiple
distributed switches (or so my VMware friends are telling me); it cannot do the same trick in
unicast-only VXLAN environments.


This blog post (written in summer 2013) introduced new features that Microsoft shipped with
Windows Server 2012 R2 in late 2013.

WHAT'S COMING IN HYPER-V NETWORK VIRTUALIZATION (WINDOWS SERVER 2012 R2)
Right after Microsoft's TechEd event CJ Williams kindly sent me links to videos describing new
features in the upcoming Windows Server (and Hyper-V) release. I would strongly recommend you
watch What's New in Windows Server 2012 R2 Networking and Deep Dive on Hyper-V Network
Virtualization in Windows Server 2012 R2, and here's a short(er) summary.

HYPER-V NETWORK VIRTUALIZATION


Support for dynamically learned customer IP addresses. The initial release of HNV relied
exclusively on PowerShell scripts to supply MAC, ARP and IP forwarding information. The next release of
HNV will support dynamic IP addresses used in environments with customer-owned DHCP servers or
HA solutions with IP address failover.
Unicast-based flooding. The first HNV release did not need flooding: all the necessary information
was provided by the orchestration system through HNV policies. Support of dynamic address
learning and customer-owned DHCP servers obviously requires flooding of DHCP requests and ARP
requests/replies.


HNV in Windows Server 2012 R2 will use provider network IP multicast to emulate flooding (similar
to the initial VXLAN implementation) or unicast IP with replication at the source host (similar to the current
VXLAN implementation). The process is further optimized: once the hypervisor hosts learn the IP
addresses of customer VMs, they can use the orchestration system (SC VMM) to propagate the ARP
and IP forwarding information to other hosts participating in the same virtual subnet (similar to what
Cisco's Nexus 1000V does in MAC distribution mode).
Performance improvements. Lack of TCP offload is the biggest hurdle in overlay network
deployments (that's why Nicira decided to use STT). HNV will include NVGRE Task Offload in WS
2012 R2, and Emulex and Mellanox have already announced NVGRE-capable NICs. Mellanox
performance numbers mentioned in the Deep Dive video claim 10GE line-rate forwarding (a 2x
improvement) while reducing CPU overhead by a factor of 6.
HNV will also be able to do smarter NIC teaming and load balancing, resulting in better utilization of
all server NICs.
Built-in gateways. The WS 2012 R2 distribution will include a simple NVGRE-to-VLAN gateway similar to
early vShield Edge (VPN concentrator, NAT, basic L3 forwarding). F5 has announced NVGRE
gateway support, but as always I'll believe it when the product documentation appears on their
web site.
Improved diagnostics. The next release of HNV will include several interesting troubleshooting tools:
the ability to ping the provider network IP address from a customer VM, the ability to insert or intercept
traffic in the customer network (for example, emulate pings to external destinations), and cloud
administrator access to customer VM traffic statistics.


MORE INFORMATION

• Cloud Computing Networking webinar
• Overlay Virtual Networks Explained webinar
• Hyper-V Network Virtualization: Simply Amazing
• What's New in Hyper-V Network Virtualization in R2 (Microsoft blog post)
• What's New in Windows Server 2012 R2 Networking (TechEd video)
• Deep Dive on Hyper-V Network Virtualization in Windows Server 2012 R2 (TechEd video)
• How to Design and Configure Networking in Microsoft System Center - Virtual Machine Manager and Hyper-V Part 1 (TechEd video)
• How to Design and Configure Networking in Microsoft System Center - Virtual Machine Manager and Hyper-V Part 2 (TechEd video)
• Everything You Need to Know about the Software Defined Networking Solution from Microsoft (TechEd video)


This blog post (written in August 2013) describes the other networking improvements of Windows
Server 2012 R2 that shipped in late 2013.

NETWORKING ENHANCEMENTS IN WINDOWS SERVER 2012 R2
The What's Coming in Hyper-V Network Virtualization (Windows Server 2012 R2) blog post got way
too long, so I had to split it into two parts: Hyper-V Network Virtualization and the rest of the features
(this post).
Stateful VM NIC firewalls. Windows Server 2012 included some basic VM NIC filtering
functionality. Release 2 has a built-in stateful firewall. It's similar to vShield App or Juniper's VGW: it
can create per-flow ACL entries for return traffic, but does not inspect TCP session validity or
perform IP/TCP reassembly.
Dynamic NIC teaming can spread a single TCP flow across multiple outbound NICs, a great
solution for I/O-intensive applications that need more than 10GE per single flow (obviously it only
works with ToR switches that have 40GE uplinks; 10GE port channel uplinks on ToR switches would
quickly push all traffic of the same flow onto the same 10GE uplink).
Hyper-V Network Virtualization is now part of the extensible switch. The initial release of HNV
was implemented as a device filter sitting between a physical NIC and the extensible switch. Switch
extensions had no access to HNV (just to customer VM traffic) as all the encap/decap operations
happened after the traffic had already left the extensible switch on its way toward the physical NIC.


HNV in Windows Server 2012 R2 integrates with the extensible switch, giving switch extensions
access to customer (VM) and provider (underlay) traffic, which is ideal if you want to capture or filter both
VM-side traffic and encapsulated traffic.
Virtual RSS (vRSS) uses VMQ to extend Receive Side Scaling into VMs: traffic received by a VM
can be spread across multiple vCPUs. Ideal for high-performance appliances (firewalls, load balancers).
Remote live monitoring: similar to SPAN and ERSPAN, including traffic captures for offline analysis.
Network switch management. Microsoft is trying to extend their existing OMI network
management solutions into physical switches because we desperately need yet another switch
management platform ;)

MORE INFORMATION

• Hyper-V Network Virtualization: Simply Amazing
• What's New in Windows Server 2012 R2 Networking (Microsoft blog post)
• What's New in Windows Server 2012 R2 Networking (TechEd video)
• Deep Dive on Hyper-V Network Virtualization in Windows Server 2012 R2 (TechEd video)
• How to Design and Configure Networking in Microsoft System Center - Virtual Machine Manager and Hyper-V Part 1 (TechEd video)
• How to Design and Configure Networking in Microsoft System Center - Virtual Machine Manager and Hyper-V Part 2 (TechEd video)
• Everything You Need to Know about the Software Defined Networking Solution from Microsoft (TechEd video)


This blog post (written in late 2013) describes the behavior of Hyper-V network virtualization in the
original Windows Server 2012, and is included in this chapter for historical reasons, as it documents
the evolution of Hyper-V network virtualization from a mixed L2/L3 architecture to the pure L3
architecture used in Windows Server 2012 R2 that started shipping in late 2013.

VIRTUAL PACKET FORWARDING IN HYPER-V NETWORK VIRTUALIZATION
Last week I explained how layer-2 and layer-3 packet forwarding works in VMware NSX, a solution
that closely emulates traditional L2 and L3 networks. Hyper-V Network Virtualization (HNV) is
different: it's almost a layer-3-only solution with only a few ties to layer-2.

HNV ARCHITECTURE
Hyper-V Network Virtualization started as an add-on module (an NDIS lightweight filter) for the Hyper-V
3.0 extensible switch (it is fully integrated with the extensible switch in Windows Server 2012 R2).


Figure 3-17: Hyper-V network virtualization was an add-on module for layer-2 switch

The Hyper-V extensible switch is a layer-2-only switch; the Hyper-V network virtualization module is a layer-3-only solution, an interesting mix with some unexpected side effects.
A distributed layer-3 forwarding architecture could use a single IP routing table to forward traffic
between IP hosts. Similar to traditional IP routing solutions, the end-user would configure directly
connected IP subnets and prefix routes (with IP next hops), and the virtual networking controller (or
the orchestration system) would add host routes for every reachable host. Forwarding within the
virtual domain would use host routes; forwarding toward external gateways would use configured IP
next hops (which would be recursively resolved from host routes).
Hyper-V network virtualization cannot use a pure layer-3 solution due to layer-2 forwarding within
the extensible switch: two VMs connected to the same VLAN within the same hypervisor would
communicate directly (without HNV involvement) and would exchange MAC addresses through ARP
requests. The same communication path has to exist after one of them is moved to a different
hypervisor with Hyper-V live migration; HNV must thus support a mix of layer-2 and layer-3
forwarding.

CONTROL PLANE SETUP


A distributed layer-2 + layer-3 forwarding architecture needs at least three tables to forward traffic:

• IP routing table;
• ARP table (mapping of IP addresses into MAC addresses);
• MAC reachability information: outbound ports in a pure layer-2 world or destination transport IP
addresses in overlay virtual networks.

The IP routing table is installed in the Hyper-V hosts with the New-NetVirtualizationCustomerRoute
PowerShell cmdlet; the ARP table and MAC reachability table are installed as CustomerIP-MAC-TransportIP
triplets with the New-NetVirtualizationLookupRecord cmdlet.


Figure 3-18: Set up NVGRE virtual network with PowerShell commands


Hyper-V Network Virtualization supports IPv4 and IPv6. An IP address mentioned in this
blog post means an IPv4 or IPv6 address, but do keep in mind that you have to configure
IPv4 and IPv6 network virtualization lookup records independently.

INTRA-SUBNET PACKET FORWARDING


When the Hyper-V extensible switch receives a packet from a VM, it has to decide where to send it.
At this point the extensible switch uses layer-2 forwarding rules:

• If the destination MAC address exists within the same segment, send the packet to the
destination VM;
• Flood multicast or broadcast frames to all VMs and the uplink interface;
• Send frames with unknown destination MAC addresses to the uplink interface.

The Hyper-V network virtualization module intercepts packets forwarded by the extensible switch toward
the uplink interface and performs layer-3 forwarding and local ARP processing (see the sketch after
this list):

• All ARP requests are answered locally using the information installed with the
New-NetVirtualizationLookupRecord cmdlet;
• IP packets are forwarded to the destination hypervisor based on their destination IP address (not
the destination MAC address);
• Flooded frames, frames sent to unknown MAC addresses, and non-IP frames are dropped.
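Here's a rough Python sketch of that two-stage decision (extensible switch first, HNV filter on the uplink path second); the table layout and packet fields are illustrative only, not the actual Windows data structures.

lookup_records = {                    # customer IP -> (customer MAC, transport IP)
    "10.0.1.20": ("aa:bb:cc:00:00:20", "192.168.1.2"),
}
local_vm_macs = {"aa:bb:cc:00:00:10"}

def extensible_switch(frame):
    if frame["dst_mac"] in local_vm_macs:
        return "deliver locally"                  # pure layer-2 decision
    return hnv_uplink_filter(frame)               # unknown/broadcast -> uplink

def hnv_uplink_filter(frame):
    if frame["type"] == "ARP":
        entry = lookup_records.get(frame["target_ip"])
        return f"answer ARP locally with {entry[0]}" if entry else "drop"
    if frame["type"] == "IP":
        entry = lookup_records.get(frame["dst_ip"])
        return f"encapsulate toward {entry[1]}" if entry else "drop"
    return "drop"                                 # flooded and non-IP frames

print(extensible_switch({"dst_mac": "ff:ff:ff:ff:ff:ff", "type": "ARP",
                         "target_ip": "10.0.1.20"}))
print(extensible_switch({"dst_mac": "aa:bb:cc:00:00:20", "type": "IP",
                         "dst_ip": "10.0.1.20"}))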


Figure 3-19: Intra-subnet packet forwarding across hypervisors

INTER-SUBNET PACKET FORWARDING


Traffic between IP subnets is intercepted by the HNV module based on the default gateway destination
MAC address (which belongs to HNV). The Hyper-V extensible switch sends the traffic toward the default
gateway MAC address to the uplink interface (unknown destination MAC address rule), where it's
intercepted by HNV, which performs a layer-3 lookup.
The true difference between intra-subnet and inter-subnet layer-3 forwarding is thus the destination
MAC address:

• Intra-subnet IP packets are sent to the MAC address of the destination VM, intercepted by the HNV
module, and forwarded based on the destination IP address;
• Inter-subnet IP packets are sent to the MAC address of the default gateway (a virtual MAC address
shared by all HNV modules), also intercepted by the HNV module, and forwarded based on the
destination IP address (when the HNV module has a New-NetVirtualizationLookupRecord for the
destination IP address) or the destination IP prefix (when there's no
New-NetVirtualizationLookupRecord for the destination IP address).
Summary: Even though it looks like Hyper-V Network Virtualization in Windows Server 2012 works
like any other L2+L3 solution, it's a layer-3-only solution between hypervisors and a layer-2+layer-3
solution within a hypervisor.


This blog post (written in late 2013) describes the packet forwarding behavior of Hyper-V network
virtualization in Windows Server 2012 R2.

HYPER-V NETWORK VIRTUALIZATION PACKET FORWARDING IMPROVEMENTS IN WINDOWS SERVER 2012 R2
The initial release of Hyper-V Network Virtualization (HNV) was an add-on to the Hyper-V Extensible
Switch, resulting in an interesting mixture of bridging and routing. In Windows Server 2012 R2 the
two components became tightly integrated, resulting in a pure layer-3 solution.

HNV ARCHITECTURE
In Windows Server 2012 R2, Hyper-V Network Virtualization became an integral part of Hyper-V
Extensible Switch. It intercepts all packets traversing the switch and thus behaves exactly like a
virtual switch forwarding extension while still allowing another forwarding extension (Cisco Nexus
1000V or NEC PF1000) to work within the same virtual switch.


Figure 3-20: Hyper-V network virtualization in Windows Server 2012 R2

Hyper-V Network Virtualization module effectively transforms the Hyper-V layer-2 virtual switch into
a pure layer-3 switch.

CONTROL PLANE SETUP


A pure layer-3 switch could work without MAC reachability information (that's how early versions of
Amazon VPC behaved), but Microsoft decided to retain the semblance of layer-2 networks and IP
subnets. The Hyper-V Network Virtualization forwarding module thus still requires an IP routing table,
an ARP table, and remote VM reachability information (mapping of VM IP and MAC addresses into
transport network IP addresses).
The PowerShell scripts used to configure HNV haven't changed. The IP routing table is installed in
the Hyper-V hosts with the New-NetVirtualizationCustomerRoute PowerShell cmdlet. Mappings
of VM IP and VM MAC addresses into transport IP addresses are created with the
New-NetVirtualizationLookupRecord cmdlet.

ARP PROCESSING
The Hyper-V Network Virtualization module is an ARP proxy: it replies to all broadcast ARP requests and
multicast ND requests, assuming it has a NetVirtualizationLookupRecord for the destination IP
address.
ARP requests for unknown IPv4 destinations and ND requests for unknown IPv6 destinations are
flooded if the virtual network contains layer-2-only NetVirtualizationLookupRecord entries (used to
implement dynamic IP addresses).
Unicast ARP/ND requests are forwarded to the destination VM to support IPv6 Neighbor
Unreachability Detection (NUD) functionality.
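A short sketch of those three ARP/ND cases; the record structure is a guess used purely for illustration.

records = {"10.0.1.20": {"mac": "aa:bb:cc:00:00:20", "provider_ip": "192.168.1.2"}}
has_l2_only_records = True            # present when dynamic IP learning is in use

def handle_arp(request):
    entry = records.get(request["target_ip"])
    if request["unicast"]:
        return "forward to the target VM (NUD keepalive)"
    if entry:
        return f"proxy-reply with {entry['mac']}"
    return "flood" if has_l2_only_records else "drop"

print(handle_arp({"target_ip": "10.0.1.20", "unicast": False}))   # proxy reply
print(handle_arp({"target_ip": "10.0.1.99", "unicast": False}))   # flooded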

PACKET FORWARDING
The Hyper-V Network Virtualization module intercepts all packets received by the Hyper-V extensible
switch and drops all non-IP/ARP packets. ARP/ND packets are intercepted (see above), and the
forwarding of IP datagrams relies solely on the destination IP address:

• A routing table lookup is performed to find the next-hop customer IP address. Every HNV module
has the full routing table of the tenant virtual network; the next-hop customer IP address equals the
destination IP address for all destinations within the virtual network;
• A lookup in the NetVirtualizationLookupRecord table transforms the next-hop customer IP address
into a transport IP address;
• The destination MAC address in the forwarded packet is rewritten using the value from the
NetVirtualizationLookupRecord (implementing the destination MAC rewrite on inter-subnet
forwarding).

When the destination transport IP address equals the local IP address (the destination customer IP
address resides within the local host), HNV sends the packet back to the Hyper-V Extensible Switch,
which delivers the packet to the destination VM.
Extra lookup steps within the transport network are performed for non-local destinations:
1. A routing table lookup in the global IP routing table transforms the transport destination IP address
into a transport next-hop IP address;
2. An ARP/ND table lookup transforms the transport next-hop IP address into a transport MAC address.
HNV has full support and feature parity for IPv4 and IPv6. Whenever this blog post
mentions IP, the behavior described applies equally well to IPv4 and IPv6.
Multicast and broadcast IP traffic is flooded. The flooding mechanism uses IP multicast in the transport
network if there's a NetVirtualizationLookupRecord mapping the destination multicast/broadcast
customer IP address into a transport multicast IP address, and source node packet replication in all
other cases.
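Here's an end-to-end Python sketch of that lookup chain (customer routing table, then lookup record, then local-versus-remote delivery); all table contents are invented for the example and error handling is omitted.

import ipaddress

LOCAL_PROVIDER_IP = "192.168.1.1"

customer_routes = [(ipaddress.ip_network("10.0.0.0/16"), "on-link")]
lookup_records = {                      # customer IP -> (customer MAC, provider IP)
    "10.0.2.20": ("aa:bb:cc:00:00:20", "192.168.1.2"),
    "10.0.1.10": ("aa:bb:cc:00:00:10", LOCAL_PROVIDER_IP),
}

def hnv_forward(dst_ip):
    dst = ipaddress.ip_address(dst_ip)
    if not any(dst in net for net, _ in customer_routes):
        return "drop (no customer route)"
    next_hop = dst_ip                   # on-link: next hop equals the destination
    cust_mac, provider_ip = lookup_records[next_hop]
    if provider_ip == LOCAL_PROVIDER_IP:
        return f"rewrite dst MAC to {cust_mac}, hand back to the extensible switch"
    return f"rewrite dst MAC to {cust_mac}, encapsulate toward {provider_ip}"

print(hnv_forward("10.0.1.10"))         # local destination
print(hnv_forward("10.0.2.20"))         # remote destination in another customer subnet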


This blog post describes complex routing scenarios using Hyper-V Network Virtualization in Windows
Server 2012 R2.

COMPLEX ROUTING IN HYPER-V NETWORK VIRTUALIZATION
The layer-3-only Hyper-V Network Virtualization forwarding model implemented in Windows Server 2012 R2 thoroughly confuses engineers used to dealing with traditional layer-2 subnets connected via layer-3 switches.
As always, it helps to take a few steps back and focus on the principles; the unexpected behavior then becomes crystal clear.

SAMPLE NETWORK
Let's start with a virtual network that's a bit more complex than a single VLAN:

Figure 3-21: Two independent overlay segments

Next, assume the following connectivity requirements:

VM A must use the gateway (GW) to communicate with the outside world;

VM B and VM X must communicate.

It's obvious we need a virtual router to link the two segments (otherwise B and X cannot communicate). In a traditional VLAN-based network you'd use a layer-3 switch somewhere in the
network to connect the two VLANs; you could use a VM-based router in a layer-2 overlay virtual
network world (for example: VXLAN).

Figure 3-22: Connecting the virtual segments with a distributed virtual router

There are at least two ways to achieve the desired connectivity in a traditional layer-2 world:

Set the default gateway to the router (.1) on B and X, and set the default gateway to .250 (GW) on A. Pretty bad design if I've ever seen one.

Set the default gateway to the router (.1) on all VMs and configure a static default route pointing to .250 (GW) on the router.

DISTRIBUTED ROUTING IN HYPER-V NETWORK VIRTUALIZATION


Hyper-V Network Virtualization (HNV) doesn't have layer-2 segments. Every VM is connected
directly to the distributed layer-3 switch (implemented in all hypervisors). The virtual network
topology thus looks like this:

Figure 3-23: Final setup implemented with Hyper-V network virtualization

Even though the HNV documentation talks about subnets and prefixes, that's not how the forwarding works. HNV forwarding works like a router that has the same IP address on all interfaces (in the same subnet) and a host route for every connected VM.
After redrawing the diagram it's obvious what needs to be configured to get the desired connectivity:

Set the default gateway to the distributed router (.1) on all VMs;

Configure a default route in the HNV environment with the New-NetVirtualizationCustomerRoute PowerShell cmdlet (see the sketch below).
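A hypothetical one-liner for that default route; the routing domain GUID, virtual subnet ID and next-hop address are invented for this example:

# Default route for the whole tenant routing domain, pointing at the external gateway (GW, .250)
New-NetVirtualizationCustomerRoute -RoutingDomainID "{11111111-2222-3333-4444-555555555555}" `
  -VirtualSubnetID 5001 -DestinationPrefix "0.0.0.0/0" -NextHop "192.168.2.250" -Metric 255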

BUT I LOST MY PRECIOUS SECURITY


I'm positive someone will start complaining at this point. In the pretty bad design I mentioned above, VM B couldn't communicate with the outside world (unless the layer-3 switch connecting the two segments had a default route pointing to the GW VM), and it's impossible to achieve the same effect with routing in an HNV environment.
However, keep in mind that security through obscurity is never a good idea (told you it was a bad design), and there's a good reason layer-3 switches have ACLs. Speaking of ACLs, you can configure them in Hyper-V with the New-NetFirewallRule cmdlet, and since all VM ports are equivalent (there are no layer-2 and layer-3 ports), you get consistent results regardless of whether the source and destination VM are in the same or different subnets.
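Purely as an illustration of the cmdlet the blog post mentions (the rule name and addresses are invented, and where exactly you'd apply such a rule depends on your setup), a rule blocking traffic between two tenant subnets could look roughly like this:

# Hypothetical example: block traffic from the web subnet toward the database subnet
New-NetFirewallRule -DisplayName "Block web-to-db" -Direction Inbound `
  -RemoteAddress 10.0.1.0/24 -LocalAddress 10.0.2.0/24 -Protocol TCP -Action Block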

IP forwarding behavior has been traditionally explained in terms of routing table entries and ARP
entries. Many modern forwarding architectures fold the two into a single-step process as explained
in this blog post written in early 2014.

THIS IS NOT THE HOST ROUTE YOU'RE LOOKING FOR


When describing Hyper-V Network Virtualization packet forwarding I briefly mentioned that the hypervisor switches create (an equivalent of) a host route for every VM they need to know about, prompting some readers to question the scalability of such an approach. As it turns out, layer-3 switches have been doing the same thing under the hood for years.

HOW WE THINK IT WORKS


The IP forwarding process is traditionally explained along these lines:

The destination IP address is looked up in the IP forwarding table (FIB), resulting in an IP next hop or a connected interface (in which case the next hop is the destination IP address itself);

The ARP cache is looked up to find the MAC address of the IP next hop.

According to this explanation, the IP FIB contains the prefixes copied from the IP routing table.
However, this is not how most layer-3 switches work.

HOW IT ACTUALLY WORKS


I'll use a recent implementation of Cisco Express Forwarding (CEF) to illustrate what's really going on behind the scenes. The printouts were taken from vIOS running within Cisco CML (it's great to have cloud-based routers when you can't access your home lab due to a 10-day-long power outage).
This is the routing table I had on the router (static route and default route were set through DHCP).
R1#show ip route 10.11.12.0 longer
[...]
Gateway of last resort is 10.11.12.1 to network 0.0.0.0

      10.0.0.0/8 is variably subnetted, 3 subnets, 2 masks
C        10.11.12.0/24 is directly connected, GigabitEthernet0/0
S        10.11.12.2/32 [254/0] via 10.11.12.1, GigabitEthernet0/0
L        10.11.12.3/32 is directly connected, GigabitEthernet0/0

The CEF table closely reflects the IP routing table, but there are already a few extra entries:
R1#show ip cef | include 10.11.12
10.11.12.0/24       attached              GigabitEthernet0/0
10.11.12.0/32       receive               GigabitEthernet0/0
10.11.12.1/32       attached              GigabitEthernet0/0
10.11.12.2/32       10.11.12.1            GigabitEthernet0/0
10.11.12.3/32       receive               GigabitEthernet0/0
10.11.12.255/32     receive               GigabitEthernet0/0

Now let's ping a directly connected host ...


R1#ping 10.11.12.4
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.11.12.4, timeout is 2 seconds:
.!!!!
... and there's an extra entry in the CEF table:
R1#show ip cef | include 10.11.12
10.11.12.0/24       attached              GigabitEthernet0/0
10.11.12.0/32       receive               GigabitEthernet0/0
10.11.12.1/32       attached              GigabitEthernet0/0
10.11.12.2/32       10.11.12.1            GigabitEthernet0/0
10.11.12.3/32       receive               GigabitEthernet0/0
10.11.12.4/32       attached              GigabitEthernet0/0
10.11.12.255/32     receive               GigabitEthernet0/0

WAIT, WHAT?
Does that mean that the ping command created an extra entry in the CEF table? Of course not, but it did trigger the ARP process, which indirectly created a new glean adjacency in the CEF table (these adjacencies don't expire due to repeated ARPing done by Cisco IOS). The glean adjacency looks exactly like any other host route (although you can see from various fields in the detailed CEF printout that it's an adjacency route):

R1#show ip cef 10.11.12.4 internal
10.11.12.4/32, epoch 0, flags [att], refcnt 5, per-destination sharing
sources: Adj
subblocks:
Adj source: IP adj out of GigabitEthernet0/0, addr 10.11.12.4 0D0AB300
Dependent covered prefix type adjfib, cover 10.11.12.0/24
ifnums:
GigabitEthernet0/0(2): 10.11.12.4
path list 0D14E48C, 3 locks, per-destination, flags 0x4A [nonsh, rif, hwcn]
path 0D581C30, share 1/1, type adjacency prefix, for IPv4
attached to GigabitEthernet0/0, IP adj out of GigabitEthernet0/0, addr
10.11.12.4 0D0AB300
output chain:
IP adj out of GigabitEthernet0/0, addr 10.11.12.4 0D0AB300

HOW DO WE KNOW HARDWARE SWITCHES WORK THE SAME WAY?


Obviously it's impossible to claim with any certainty how a particular switch works without seeing the hardware specs (mission impossible for vendor ASICs as well as Broadcom's merchant silicon). Some switch vendors still talk about IP routing entries and ARP entries, others (for example, Nexus 3000) already use IP prefix and IP host entry terminology. I chose Nexus 3000 for a reason: many data center switches use the same chipset and thus probably use the same forwarding techniques.
Intel is way more forthcoming than Broadcom: the FM4000 data sheet contains plenty of details about its forwarding architecture, and if I understand it correctly, the IP lookup must result in an ARP entry (which means that the IP lookup table must contain host routes).

SUMMARY
Hardware layer-3 switches need an IP forwarding entry for every attached IP host, although some
vendors might call these entries ARP entries. Virtual layer-3 switches are no different, and might use totally different terminology for the host route forwarding entries to confuse the casual reader.

The OpenStack Neutron (previously Quantum) module had a nasty limitation described in this blog post written in October 2013: it supported a single networking plug-in. The limitation was removed in the Icehouse release with the introduction of the ML2 (Modular Layer 2) architecture.

OPENSTACK NEUTRON PLUG-IN: THERE CAN ONLY BE ONE
OpenStack seems to have a great architecture: all device-specific code is abstracted into plugins
that have a well-defined API, allowing numerous (more or less innovative) implementations under
the same umbrella orchestration system.
Looks great in PowerPoint, but whoever designed the network (Quantum, now Neutron) plugin must
have been either a vendor or a server-focused engineer using NIC device driver concepts.
You see, the major problem the Quantum plug-in architecture has is that there can only be one
Quantum plugin in a given OpenStack deployment, and that plugin has to implement all the
networking functionality: layer-2 subnets are mandatory, and there are extensions for layer-3
forwarding, security groups (firewalls) and load balancing.

Figure 3-24: Quantum plugin in OpenStack architecture

This approach worked well in the early OpenStack days when the Quantum plugin configured virtual switches (similar to what VMware's vCenter does) and ignored the physical world. You could choose to work with the Linux bridge or Open vSwitch and use VLANs or GRE tunnels (OVS only).
However, once the networking vendors started tying their own awesomesauce into OpenStack, they had to replace the original Quantum plugin with their own. No problem there if the vendor controls the end-to-end forwarding path like NEC does with its ProgrammableFlow controller, or if the vendor implements end-to-end virtual networks like VMware does with NSX or Midokura does with MidoNet, but what most hardware vendors want to do is control their physical switches, not the hypervisor virtual switches.

You can probably guess what happened next: there's no problem that cannot be solved by another layer of indirection, in this case a layered approach where a networking vendor provides a top-level Quantum plugin that relies on a sub-plugin (usually OVS) to control the hypervisor soft switches.

Figure 3-25: Typical vendor plugin implementation

Remember that OpenStack supports a single plugin. Yeah, you got it right: if you want to use the above architecture, you're locked into a single networking vendor. Perfect vendor lock-in within an open-source architecture. Brilliant. Also, do remember that your vendor has to update the plugin to reflect potential changes to the Quantum/Neutron API.

I never claimed it was a good idea to mix multiple switching vendors in the same data center (it's not, regardless of what HP is telling you), but imagine you'd like to have switches from vendor A and load balancers from vendor B, all managed through a single plugin. Good luck with that.
Alas, wherever there's a problem, there's a solution, in this case a Quantum plugin that ties OpenStack to a network services orchestration platform (Tail-f NCS or Anuta nCloudX). These platforms can definitely configure multi-vendor network environments, but if you're willing to go this far down the vendor lock-in path, you just might drop the whole OpenStack idea and use VMware or Hyper-V.

Figure 3-26: Tail-f OpenStack plugin

The next OpenStack release might give you a different option: a generic plugin that would implement the high-level functionality, work with virtual switches, and provide a hardware abstraction layer (Modular Layer 2, or ML2) where the vendors could plug in their own device drivers.

Figure 3-27: ML2 plugin architecture

This approach removes the vendor lock-in of the monolithic vendor-supplied Quantum plugins, but limits you to the lowest common denominator: VLANs (or the equivalent). Not necessarily something I'd want to have in my greenfield revolutionary forward-looking OpenStack-based data center, even though Arista's engineers are quick to point out that you can implement a VXLAN gateway on ToR switches and use VLANs in the hypervisors and IP forwarding in the data center fabric. No thanks, I prefer Skype over a fancier PBX.

Finally, there's an OpenDaylight Quantum plugin, giving you total vendor independence (assuming you love riding OpenFlow unicorns). It seems OpenDaylight already supports layer-3 OpenFlow-based forwarding, so this approach might be an interesting option a year from now when OpenDaylight gets some traction and bug fixes.
Cynical summary: Reinventing the wheel while ensuring a comfortable level of lock-in seems to be a popular pastime of the networking industry. Let's see how this particular saga evolves, and do keep in mind that some people remain deeply skeptical of OpenStack's future.

It's pretty easy to reverse-engineer what Amazon VPC is doing based on Amazon documentation,
packet traces and ARP tables. The details (as I understood them in early 2014) are described in this
blog post.

PACKET FORWARDING IN AMAZON VPC


The packet forwarding behavior of VMware NSX and Hyper-V Network Virtualization is well documented; no such documentation exists for Amazon VPC. However, even though Amazon uses a proprietary solution (a heavily modified Xen hypervisor with a homemade virtual switch), it's pretty easy to figure out the basics from the observed network behavior and the extensive user documentation.
Chiradeep Vittal ran a number of tests between virtual machines in an Amazon VPC network and
shared the results in a blog post and extensive comments on one of my posts. Here's a short
summary:

Virtual switches in Amazon VPC perform layer-3-only unicast IPv4 forwarding (similar to recent
Hyper-V Network Virtualization behavior). All non-IPv4 traffic and multicast/broadcast IPv4
traffic is dropped.

Layer-3 forwarding in the hypervisor virtual switch does not decrement the TTL; it looks like all virtual machines reside in the same subnet;

The hypervisor proxies all ARP requests and replies with the expected MAC address of the target VM or first-hop gateway (early implementations of Amazon VPC used the same destination MAC address in all ARP replies);

The virtual switch implements limited router-like functionality. For example, the default gateway IP address replies to pings, but a VM cannot ping the default gateway of another subnet.

Seems like a run-of-the-mill virtual networking implementation, but wait, that's not all. The beauty of the Amazon VPC forwarding model is the multi-VRF approach: you can create multiple routing tables in your VPC and assign one of them to each subnet.
You could, for example, use a default route toward the Internet for the web server subnet, a default route toward your data center for the database server subnet, and no default routing (local connectivity only) for your application server subnet. Pretty cool stuff if you're an MPLS/VPN geek used to schizophrenic routing tables, and quite a tough nut to crack for people who want to migrate their existing layer-2 networks into the cloud. Massimo Re Ferre made a perfect summary: everyone else is virtualizing the network, Amazon VPC is abstracting it.
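If I understand the AWS tooling correctly, the per-subnet routing table assignment looks roughly like this when done through the AWS Tools for PowerShell (the cmdlet names follow the CreateRouteTable/CreateRoute/AssociateRouteTable API calls; all IDs are placeholders):

# Create a dedicated routing table for the web server subnet and give it an Internet default route
$rt = New-EC2RouteTable -VpcId vpc-11111111
New-EC2Route -RouteTableId $rt.RouteTableId -DestinationCidrBlock "0.0.0.0/0" -GatewayId igw-22222222

# Attach that routing table to the web server subnet only; other subnets keep their own tables
Register-EC2RouteTable -RouteTableId $rt.RouteTableId -SubnetId subnet-33333333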

Midokura's MidoNet is an interesting architecture that distributes forwarding state across the
network edge devices, and keeps it synchronized using a fast central database.
The architecture is definitely intriguing, but we have yet to see how well it copes with fast state
changes in large-scale cloud deployments.
The blog post was written in 2012; it was updated in summer 2014 to reflect improvements in other
network virtualization products mentioned in the blog post.

MIDOKURA'S MIDONET: A LAYER 2-4 VIRTUAL NETWORK SOLUTION
Almost everyone agrees that the current way of implementing virtual networks with dumb hypervisor switches and top-of-rack kludges (including Edge Virtual Bridging, aka EVB or 802.1Qbg, and 802.1BR) doesn't scale. Most people working in the field (with the notable exception of some hardware vendors busy protecting their turfs in the NVO3 IETF working group) also agree that virtual networks running as applications on top of an IP fabric are the only reasonable way to go ... but that's all they currently agree upon.

Figure 3-28: Traditional VLAN-based virtual networks implemented in the physical switches

Figure 3-29: Virtual networks implemented in the hypervisor switches on top of an IP fabric

A BRIEF OVERVIEW OF WHERE WE ARE


Initial overlay virtual networking implementations chose the easiest way out: VXLAN-based MAC-over-IP virtual networks with no control plane. The layer-2 virtual networks are supposedly needed to support existing applications (like Microsoft's load balancing marvels), and multicast-based VXLAN relies on flooding (emulated with IP multicast) to build the remote-MAC-to-remote-IP mappings in the hypervisor virtual switches.

Figure 3-30: VXLAN architecture

VMware NSX and Hyper-V are way better: they rely on a central controller to distribute MAC-to-IP mapping information to individual hypervisors.

Figure 3-31: VMware NSX architecture

THE MISSING L3-4 PROBLEM


We know application teams trying to deploy their application stacks on top of virtual networks
usually need more than a single virtual network (or security zone). A typical scale-out application
has multiple tiers that have to be connected with load balancers or firewalls.

Figure 3-32: Simplified scale-out application architecture

All the vendors mentioned above are dancing around that requirement, claiming you can always implement whatever L4-7 functionality you need with software appliances running as virtual machines on top of virtual networks. A typical example of this approach is vShield Edge, a VM with baseline load balancing, NAT and DHCP functionality.

Figure 3-33: Load balancer as a VM appliance

To keep the record straight:

VMware, Cisco, Juniper and a few others offer hypervisor-level firewalls; traffic going between security zones doesn't have to go through an external appliance (although it still goes through a VM if you're using VMware's vShield Zones/App);

VMware vDS (in vSphere 5.5), VMware NSX, Cisco Nexus 1000V and Hyper-V provide ACL-like
functionality.

MIDOKURA'S MIDONET: A L2-4 VIRTUAL SDN


A month ago Ben Cherian left a comment on my blog saying "Our product, MidoNet, supports BGP, including multihoming and ECMP, for interfacing MidoNet virtual routers with external L3 networks." Not surprisingly, I wanted to know more, and he quickly organized a phone call with Dan Mihai Dimitriu, Midokura's CTO. This is one of the slides they shared with me ... showing exactly what I was hoping to see in a virtual networks solution:

Figure 3-34: Typical MidoNet virtual network topology

As expected, they decided to implement virtual networks with GRE tunnels between hypervisor hosts. A typical virtual network topology mapped onto the underlying IP transport fabric would thus look like this:

Figure 3-35: MidoNet virtual networks implemented with commodity compute nodes on top of an IP fabric

Short summary of what they're doing:

Their virtual networks solution has layer-2 virtual networks that you can link together with
layer-3 virtual routers.

Each virtual port (including VM virtual interface) has ingress and egress firewall rules and
chains (inspired by Linux iptables).

Virtual routers support baseline load balancing and NAT functionality.

Virtual routers are not implemented as virtual machines; they are an abstract concept used by hypervisor switches to calculate the underlay IP next hop.

As one would expect in an L3 solution, hypervisors answer ARP and DHCP requests locally.

The edge nodes run EBGP with the outside world, appearing as a single router to external
BGP speakers.

Interestingly, they decided to go against the current centralized control plane religion, and implemented most of the intelligence in the hypervisors. They use the Open vSwitch (OVS) kernel module as the switching platform (proving my claim that OVS provides all you need to implement L2-4 functionality), but replaced the OpenFlow agents and the centralized controller with their own distributed software.

MIDONET PACKET FORWARDING PROCESS


This is how Dan and Ben explained a day in the life of an IP packet passing through the MidoNet
overlay virtual networks (I haven't set it up to see how it really works):
Their forwarding agents (running in user space on all hypervisor hosts) intercept traffic belonging to unknown flows (much like ovs-vswitchd does), but process the unknown packets locally instead of sending them to a central OpenFlow controller.
The forwarding agent receiving an unknown packet would check the security rules, consult the virtual network configuration, calculate the required flow transformation(s) and the egress next hop, install the flow in the local OVS kernel module, insert the flow data into a central database for stateful firewall filtering of return traffic, and send the packet toward the egress node encapsulated in a GRE envelope with the GRE key indicating the egress port on the egress node.
According to Midokura, the forwarding agents generate the most generic flow specification they can: load balancing obviously requires microflows, simple L2 or L3 forwarding doesn't. While the OVS kernel module supports only microflow-based forwarding, the forwarding agent doesn't have to recalculate the virtual network topology for each new flow.
The egress OVS switch has pre-installed flows that map GRE keys to output ports. The packet is thus forwarded straight to the destination port without going through the forwarding agent on the egress node. As in MPLS/VPN or QFabric, the ingress node makes all the forwarding decisions; the only difference is that MidoNet runs as a cluster of distributed software switches on commodity hardware.

Asymmetrical return traffic is no longer an issue because MidoNet uses a central flow database for stateful firewall functionality: all edge nodes act as a single virtual firewall.

The end result: MidoNet (Midokura's overlay virtual networking solution) performs simple L2-4 operations within the hypervisor, and forwards packets of established flows within the OVS kernel module. Midokura claims they achieved line-rate (10GE) performance on commodity x86 hardware ... but of course you shouldn't blindly trust me or them. Get in touch with Ben and test-drive their solution.

For more details and a longer (and more comprehensive) analysis, read Brad Hedlund's blog post.

In one of their pivoting reiterations, Big Switch Networks decided to implement overlay virtual networking. Here's my take on that move (written in summer 2012).

BIG SWITCH AND OVERLAY NETWORKS


A few days ago Big Switch announced they'll support overlay networks in their upcoming software release. After a brief "told you so" moment (because virtual networks in physical devices don't scale all that well) I started wondering whether they simply gave up and decided to become a Nicira copycat, so I was more than keen to have a brief chat with Kyle Forster (graciously offered by Isabelle Guis).

THE BACKGROUND
Big Switch is building a platform that would allow you to create virtual networks out of any
combination of physical or virtual devices. Their API or CLI allows you to specify which VLANs or
MAC/IP addresses belong to a single virtual network, and their OpenFlow controller does the rest of
the job. To see their software in action, watch the demo they had during the OpenFlow webinar.

THE SHIFT
When I had a brief chat with Kyle during the OpenFlow symposium, I mentioned that (in my opinion) they were trying to reinvent MPLS ... and he replied along the lines of "if only DC switches would have MPLS support, our life would be much easier."

Because most vendors still think MPLS has no life in the Data Center (and Derick Winkworth
forcefully disagrees with that), Big Switch had to implement all sorts of kludges to emulate virtual
circuits with OpenFlow ... and faced the hard reality of real life.
Most switches installed in today's data centers don't support OpenFlow in GA software (the HP ProCurve series and IBM's G8264 are obvious exceptions with a tiny market share); on top of that, some customers are understandably reluctant to deploy OpenFlow-enabled switches in their production environment. Time to take a step back and refocus on the piece of the puzzle that is easiest to change, the hypervisor, combined with L2 or L3 tunneling across the network core.
Not surprisingly, they decided to use Open vSwitch with their own OpenFlow controller in Linux-based hypervisors (KVM, Xen), and they claim they have a solution for the VMware environment, but Kyle was a bit tight-lipped about that one.

THE DIFFERENCE?
Based on the previous two paragraphs, it does seem that Big Switch is following Nicira's steps ...
only a year or two later. However, they claim there are significant technical differences between the
two approaches:

Using Big Switch's OpenFlow controller, you can mix-and-match physical and virtual switches. Kyle claimed we'll see switches supporting OpenFlow-controlled tunneling encap/decap in Q3/Q4 of this year. That would be a real game-changer, but I'll believe it when I see this particular unicorn (update: in summer 2014 I'm still waiting for them).

Nicira is focused primarily on the hypervisor environments; Big Switch can create virtual
networks out of a set of hypervisors, a set of physical switches, or a combination of both
(currently without tunneling).

And finally, as expected, there's a positioning game going on. According to Big Switch, all alternatives (VXLAN from Cisco, NVGRE from Microsoft and STT/GRE/OpenFlow from Nicira) expect you to embrace a fully virtually integrated stack, whereas Big Switch's controller creates a platform for integration partners. If they manage to pull this off, they just might become another 6WIND or Tail-f, not exactly a bad position to be in, but also not particularly exciting to the investors.
Whatever the case might be, we will definitely live in interesting times in the next few years, and I'm anxiously waiting for the moment when Big Switch decides to make its product a bit more public (and I'm still waiting for publicly available product documentation in summer 2014).

GATEWAYS TO OVERLAY VIRTUAL NETWORKS

IN THIS CHAPTER:
VXLAN TERMINATION ON PHYSICAL DEVICES
CONNECTING LEGACY SERVERS TO OVERLAY VIRTUAL NETWORKS
IT DOESN'T MAKE SENSE TO VIRTUALIZE 80% OF THE SERVERS
INTERFACING OVERLAY VIRTUAL NETWORKS WITH MPLS/VPN WAN
VMWARE NSX GATEWAY QUESTIONS
ARISTA LAUNCHES THE FIRST HARDWARE VXLAN TERMINATION DEVICE
OVERVIEW OF HARDWARE GATEWAYS TO OVERLAY VIRTUAL NETWORKS

Gateways between overlay segments and the physical world are an important aspect of every overlay virtual networking solution.

You could implement the gateways with network services devices (load balancers or firewalls), or with dedicated layer-3 (routers) or layer-2 (bridges) gateways.

Low-bandwidth (a few Gbps) environments are easily served by VM-based solutions. Bare-metal servers or in-kernel gateways provide at least 10 Gbps of throughput. Environments that need higher throughput between the physical and the virtual world require dedicated hardware solutions.
This chapter describes several aspects of overlay virtual networking gateways, from design
considerations to an overview of hardware gateway products.

The need to connect overlay virtual networks with the outside world was obvious from the very beginning of the idea (when Cisco announced VXLAN at VMworld 2011). I wrote the following blog post outlining the concepts in October 2011; the notes I added in summer 2014 describe the evolution of the concept in the intervening three years. You'll notice that the architectural details haven't changed, but we gained way more operational experience.

VXLAN TERMINATION ON PHYSICAL DEVICES


Every time I'm discussing the VXLAN technology with a fellow networking engineer, I inevitably get the question "how will I connect this to the outside world?" Let's assume you want to build a pretty typical 3-tier application architecture (next diagram) using VXLAN-based virtual subnets and you already have firewalls and load balancers; can you use them? Today the answer is NO.
In the meantime, F5 supports multicast-based VXLAN on BIG-IP LTM. No other firewall or
load balancing vendor had overlay virtual networking support in a shipping software release
in August 2014.

Figure 4-1: Typical multi-tier application architecture

The only product supporting VXLAN Tunnel End Point (VTEP) in the near future is the Nexus 1000V
virtual switch; the only devices you can connect to a VXLAN segment are thus Ethernet interface
cards in virtual machines. If you want to use a router, firewall or load balancer (sometimes lovingly
called application delivery controller) between two VXLAN segments or between a VXLAN segment
and the outside world (for example, a VLAN), you have to use a VM version of the layer-3 device.
That's not necessarily a good idea; virtual networking appliances have numerous performance drawbacks and consume way more CPU cycles than needed ... but if you're a cloud provider billing your customers by VM instances or CPU cycles, you might not care too much.
The performance of VM-based network services products has increased to the point where it became a non-issue. See the Virtual Appliances chapter in the second volume of the Software Defined Data Centers book for more details.
The virtual networking appliances also introduce extra hops and unpredictable traffic flows into your
network, as they can freely move around the data center at the whim of workload balancers like
VMware's DRS. A clean network design (left) is thus quickly morphed into a total spaghetti mess
(right):

Figure 4-2: Virtual and physical traffic flows

I've totally changed my opinion in the meantime. It doesn't matter whether you use virtual or physical network services appliances within a single leaf-and-spine fabric.

Cisco doesn't have any L3 VM-based product, and the only thing you can get from VMware is vShield Edge, a dumbed-down Linux with a fancy GUI. If you're absolutely keen on deploying VXLAN, that
shouldn't stop you; there are numerous VM-based products, including the BIG-IP load balancer from F5 and Vyatta's routers. Worst case, you can turn a standard Linux VM into a usable router, firewall or NAT device by removing less functionality from it than VMware did. Not that I would necessarily like doing that, but it's one of the few options we have at the moment.

NEXT STEPS?
Someone will have to implement VXLAN on physical devices sooner or later; running networking functions in VMs is simply too slow and too expensive. While I don't have any firm information (not even roadmaps), do keep in mind Ken Duda's enthusiasm during the VXLAN Packet Pushers podcast (and remember that both Arista and Broadcom appear in the author lists of the VXLAN and NVGRE drafts).
Arista was the first data center switching vendor to ship a working VXLAN implementation in
2012.

HOW COULD YOU DO IT?


Layer-3 termination of VXLAN segments is actually pretty easy (from the architectural and control
plane perspective):

VMs attached to a VXLAN segment are configured with the default gateway's IP address (the intra-VXLAN-subnet logical IP address of the physical termination device);

A VM sending an IP packet to an off-subnet destination has to send it to the default gateway's IP address, and therefore performs an ARP request;

One or more layer-3 VXLAN termination devices respond to the ARP request sent in the VXLAN encapsulation, and the Nexus 1000V switch in the hypervisor running the VM remembers the RouterVXLANMAC-to-RouterPhysicalIP address mapping;

When the VM sends an IP packet to the default gateway's MAC address, the Nexus 1000V switch forwards the IP-in-MAC frame to the nearest RouterPhysicalIP address.

No broadcast or flooding is involved in the layer-3 termination, so you could easily use the same
physical IP address and the same VXLAN MAC address on multiple routers (anycast) and achieve
instant redundancy without first hop redundancy protocols like HSRP or VRRP.
As of August 2014, no data center switching vendor has a shipping layer-3 VXLAN gateway due to the limitations of the Broadcom Trident-2 chipset everyone is using. The hardware of the Cisco Nexus 9300 is capable of layer-3 gateway functionality, but it hasn't been implemented in the software yet.
Layer-2 extension of VXLAN segments into VLANs (which you might need to connect VXLAN-based hosts to an external firewall) is a bit tougher. As you're bridging between VXLAN and an 802.1Q VLAN, you have to ensure that you don't create a forwarding loop.
You could configure the VXLAN layer-2 extension (bridging) on multiple physical switches and run STP over VXLAN ... but I hope we'll never see that implemented. It would be way better to use IP functionality to select the VXLAN-to-VLAN forwarder. You could, for example, run VRRP between redundant VXLAN-to-VLAN bridges and use the VRRP IP address as the VXLAN physical IP address of the bridge (all off-VXLAN MAC addresses would appear as being reachable via that IP address to other
VTEPs). The VRRP functionality would also control the VXLAN-to-VLAN forwarding: only the active VRRP gateway would perform the L2 forwarding. You could still use a minimal subset of STP to prevent forwarding loops, but I wouldn't use it as the main convergence mechanism.
Brocade is the first vendor shipping a redundant VTEP implementation (see another blog post at the end of this chapter). Arista supposedly has an MLAG-based solution, but that code still hasn't shipped in August 2014.

SUMMARY
VXLAN is a great concept that gives you clean separation between virtual networks and the physical IP-based transport infrastructure, but we need VXLAN termination in physical devices (switches, potentially also firewalls and load balancers) before we can start considering large-scale deployments. Till then, it will remain an interesting proof-of-concept tool or a niche product used by infrastructure cloud providers.

Now let's move into a bit more detail: how many gateways should we have, what gateway functionality do we need based on the services we offer, and should we be looking for a software or hardware implementation?

CONNECTING LEGACY SERVERS TO OVERLAY VIRTUAL NETWORKS
I wrote (and spoke) at length about layer-2 and layer-3 gateways between VLANs and overlay virtual networks, but I still get questions along the lines of "how will you connect legacy servers to the new cloud infrastructure that uses VXLAN?"

GATEWAY TYPES
You can connect an overlay virtual network and a physical subnet with:

A network services device (firewall, load balancer ...);

Layer-3 gateway (router);

Layer-2 gateway (bridge).

A network services device is the best choice if you have to connect a wholly virtualized application
stack to the outside world, or if you're connecting components that have to be isolated by a firewall
or load balancer anyway.

A layer-3 gateway is the best option when you're connecting physical and virtual subnets and don't have to retain the IP addresses of newly virtualized physical servers. A layer-2 gateway is the last-resort option used when you have to stretch the same IP subnet across physical and virtual domains.

PHYSICAL OR VIRTUAL GATEWAYS?


It doesn't make sense to waste our gray matter on this question in low-bandwidth environments (up to 1 Gbps of traffic between the legacy servers and the overlay virtual networks). VM-based virtual gateways are good enough and extremely easy to deploy. You're also avoiding any hardware lock-in: it's pretty simple to replace the gateway solution if you don't like it.
Some overlay virtual networking solutions (example: unicast VXLAN on Cisco Nexus 1000V) don't work with any existing hardware gateway anyway.
x86-based gateways can provide at least 10 Gbps of throughput. If you need more than that across a single VLAN or tenant, you should be looking at dedicated hardware. If you need more than 10 Gbps of aggregate throughput, but not more than a Gbps or two per tenant, you might be better served with a scale-out farm of x86-based gateways; after all, you might be able to reuse them if your needs change (and there's no hardware lock-in).

HOW MANY GATEWAYS SHOULD ONE HAVE?


Short answer: as few as possible.

Every gateway has to be managed and configured. Numerous gateways between physical and virtual
worlds are a potential source of forwarding or routing loops, and some vendors limit the number of
gateways you can have anyway.
In the ideal world, you'd have just two gateways (for redundancy purposes) connecting the legacy servers to the cloud infrastructure using overlay virtual networking; you might need more than that in high-bandwidth environments if you decide to use VM-based or x86-based gateways (see above).

The gateways would run either in an active/backup configuration (example: Cisco VXLAN gateway, VM-based or x86-based VMware NSX gateways) or in an MLAG-type deployment where two physical switches present themselves as a single VTEP (IP address) to the overlay virtual networking fabric (example: Arista VXLAN gateways, NSX VTEP on Brocade Logical Chassis, Cisco Nexus 9300).

Retaining a mix of bare-metal and virtualized servers indefinitely doesn't make sense: it only increases the overall complexity of the network and its operational costs, as I explained in the following blog post written in May 2014.

IT DOESN'T MAKE SENSE TO VIRTUALIZE 80% OF THE SERVERS
A networking engineer was trying to persuade me of the importance of hardware VXLAN VTEPs. We quickly agreed physical-to-virtual gateways are the primary use case, and he tried to illustrate his point by saying, "Imagine you have 1000 servers in your data center and you manage to virtualize 80% of them. How will you connect them to the other 200?" to which I replied, "That doesn't make any sense." Here's why.

HOW MANY HYPERVISOR HOSTS WILL YOU NEED?


Modern servers have ridiculous amounts of RAM and CPU cores, as I explained in the Designing Private Cloud Infrastructure webinar. Servers with 512 GB of RAM and 16 cores are quite common and becoming relatively inexpensive.
Assuming an average virtualized server needs 8 GB of RAM (usually they need less than that), you can pack over 60 virtualized servers into a single hypervisor host. The 800 virtualized servers thus need fewer than 15 physical servers (for example, four Nutanix appliances), or 30 10GE ports, less than half a ToR switch.

BACK TO THE PHYSICAL WORLD


The remaining 200 physical servers need 400 ports, most commonly a mixture of everything from
Fast Ethernet to 1GE and (rarely) 10GE. Mixing that hodgepodge of legacy gear with high-end
hypervisor hosts and line-rate 10GE switches makes no sense.

WHAT SHOULD YOU DO?


I've seen companies doing network refreshes without virtualizing and replacing the physical servers. They had to buy almost-obsolete gear to get the 10/100/1000 ports required by existing servers, and thus closed the door on 10GE deployment (because they won't get a new CapEx budget for the next 5 years).
Don't do that. When you're building a new data center network or refreshing an old one, start with its customers, the servers: buy new high-end servers with plenty of RAM and CPU cores, virtualize as much as you can, and don't mix the old and the new world.
This does require synchronizing your activities with the server and virtualization teams,
which might be a scary and revolutionary thought in some organizations; we'll simply have
to get used to talking with other people.
Use one or two switches as L2/L3 gateways, and don't even think about connecting the old servers to the new infrastructure. Make it abundantly clear that the old gear will not get any upgrades (the server team should play along) and that the only way forward is through server virtualization, and let the legacy gear slowly fade into obsolescence.

Service providers deploying public cloud environments or NFV-based network services have to
integrate the overlay virtual networking environment with the existing MPLS/VPN WAN network. This
blog post outlines some of the alternatives.

INTERFACING OVERLAY VIRTUAL NETWORKS WITH MPLS/VPN WAN
During my ExpertExpress engagements with engineers building multi-tenant cloud infrastructure I often get questions along the lines of "How do I integrate my public IaaS cloud with my MPLS/VPN WAN?" Here are a few ideas.

DON'T OVERCOMPLICATE
Let's eliminate the trivial options first.

If your public cloud offers hosting of individual VMs with no per-customer virtual segments, use one of the mechanisms I described in the Does It Make Sense to Build New Clouds with Overlay Networks? post and ask the customers to establish a VPN from their VM to their home network.

If your public cloud offers virtual private networks, but you don't plan to integrate the cloud infrastructure with a multi-tenant transport network (using, for example, MPLS/VPN as the WAN transport technology), establish VPN tunnels between the virtual network edge appliance (example: vShield Edge) and the customer's VPN concentrator.

The rest of this post applies to multi-tenant cloud providers that offer private virtual networks to
their customers and want to integrate those private networks directly with the MPLS/VPN service
they offer to the same customers.

VLAN-BASED VIRTUAL NETWORKS


Many public cloud deployments use the legacy VLAN-based virtual network approach. Interfacing these networks with MPLS/VPN is trivial: create a VLAN (sub)interface in a customer VRF for each outside customer VLAN on the data center WAN edge PE-routers (Inter-AS Option A comes to mind).

OVERLAY VIRTUAL NETWORKS WITHOUT MPLS/VPN SUPPORT


If you use an overlay virtual networking technology that has no integrated MPLS/VPN support (example: VMware NSX, OpenStack Neutron OVS plugin with GRE tunnels), you have to use VLANs as the demarcation point:

Create a VLAN per customer;

Use a VM-based appliance (firewall, load balancer) or L2/L3 gateway to connect the customer's
outside overlay virtual network with the per-customer VLAN;

Read the previous section.

DIRECT INTEGRATION WITH MPLS/VPN INFRASTRUCTURE


Some overlay virtual networking solutions (Juniper Contrail, Nuage Virtualized Services Platform)
communicate directly with PE-routers, exchanging VPNv4 routes via MP-BGP and using MPLS-over-GRE encapsulation to pass IP traffic between hypervisor hosts and PE-routers.

Integrating these solutions with the MPLS/VPN backbone is a trivial undertaking: establish MP-BGP sessions between the overlay virtual network controllers and the WAN edge PE-routers. I would use Inter-AS Option B to establish a demarcation point between the cloud infrastructure and the WAN network and perform route summarization on the PE-router (it doesn't make much sense to leak the host routes created by the Contrail solution into the WAN network).

VM-LEVEL INTEGRATION
If you don't want to use one of the MPLS/VPN-based overlay virtual networking solutions (they both require Linux-based hypervisors and provide off-the-shelf integration with OpenStack and CloudStack), use a VM-based PE-router. You could deploy Cisco's Cloud Services Router (CSR) as a PE-router, connect one of its interfaces to a VLAN-based network and all other interfaces to customer overlay virtual networks.
The number of customer interfaces (each in a separate VRF) on the CSR router is limited by the hypervisor, not by the CSR (VMware maximum: 10).

This blog post is trying to answer some common VMware NSX gateway questions. The answers are valid for VMware NSX for multiple hypervisors release 4.0 and VMware NSX for vSphere release 6.0.

VMWARE NSX GATEWAY QUESTIONS


Gordon sent me a whole list of NSX gateway questions:
A) Do you need a virtual gateway for each VXLAN segment or can a gateway be the entry/exit
point across multiple VXLAN segments?
B) Can you set up multiple gateways and specify which VXLAN segments use each gateway?
C) Can you cluster gateways together (Active/Active) or do you set them up as Active/Standby?
The answers obviously depend on whether you're deploying NSX for multiple hypervisors or NSX for vSphere. Let's start with the former.

GATEWAYS IN NSX FOR MULTIPLE HYPERVISORS RELEASE 4.0


NSX gateways are implemented on NSX gateway transport nodes, which run on bare-metal servers or in dedicated VMs. NSX also supports third-party L2 gateways (VTEPs) with VXLAN encapsulation.
Each gateway node can run multiple instances of L2 or L3 gateway services (but not both). Each L2 gateway service can bridge between numerous overlay networks and VLANs (there must be a 1:1 mapping between an overlay network segment and an outside VLAN); each L3 gateway service can route between numerous logical networks and a single uplink.

Each gateway service can run on two gateway nodes in Active/Standby mode.

GATEWAYS IN NSX FOR VSPHERE RELEASE 6.0


The control plane of every NSX gateway is always implemented in a VM running NSX Edge software. The data plane of L2 gateways and distributed routers is implemented in loadable kernel modules. The data plane of the NSX Edge Services Router is implemented within the VM (like the traditional vShield Edge).
Each L2 gateway instance (an NSX Edge VM running as an L2 gateway) can bridge a single VXLAN segment to a VLAN segment. Multiple L2 gateway instances can run on the same vSphere host.
An NSX Edge router (running just the control plane) can have up to eight uplinks and up to 1000 internal (VXLAN-based) interfaces. An NSX Edge Services Router (with the data plane implemented within the VM) can have up to ten interfaces (the well-known vSphere limit on the number of interfaces of a single VM). Multiple NSX Edge routers or NSX Edge Services Routers can run on the same vSphere host.
Each NSX Edge instance can run in Active/Standby HA mode.

In theory you might have more than one NSX Edge instance connecting a VXLAN segment with the
outside world, but even if the NSX Manager software allows you to configure that, I wouldn't push
my luck.

Arista was the first data center switching vendor with a hardware VXLAN gateway implementation. Prerelease code was available in late 2012 (when this blog post was written); GA code shipped in 2013.
In August 2014 the shipping EOS code supports multicast-based VXLAN or an Arista-specific implementation of unicast VXLAN. There's no support for redundant hypervisors or VMware NSX-controlled VTEP (Arista claims both will become available soon).

ARISTA LAUNCHES THE FIRST HARDWARE VXLAN TERMINATION DEVICE

Arista is launching a new product line today shrouded in mists of SDN and cloud buzzwords: the
7150 series top-of-rack switches. As expected, the switches offer up to 64 10GE ports with wire
speed L2 and L3 forwarding and 400 nanosecond(!) latency.
Also expected from Arista: unexpected creativity. Instead of providing a 40GE port on the switch
that can be split into four 10GE ports with a breakout cable (like everyone else is doing), these
switches group four physical 10GE SFP+ ports into a native 40GE (not 4x10GE LAG) interface.
The 7150 switches are also the first devices that offer VXLAN termination in hardware. Broadcom's upcoming Trident-2 chipset supports VXLAN and NVGRE, so when Arista demonstrated VXLAN termination at the recent VMworld 2012, everyone expected the product to be available next spring ... but according to Arista it's orderable now and shipping in Q4. Turns out Arista decided to use Intel's chipset this time, proving yet again that they can be remarkably agile with regard to merchant silicon.
Another goodie: you can run IEEE 1588 (Precision Time Protocol) on these devices to establish an extremely precise time base in your network, drifting only a few nanoseconds per day (the precision clock module seems to be optional). Such precision might not make sense at first glance (unless you're working in high-frequency trading), until you discover you can timestamp mirrored (Arista's name for SPAN) or sFlow packets. Imagine being able to collect packets across the whole network and having an (almost) totally reliable timestamp attached to all of them.
Finally (and my friend Tom Hollingsworth will love this part), 7150 switches can do NAT in hardware. Yeah, you got that right: they do NAT in silicon (don't even try to ask me whether it's NAT44, NAT64, or NAT66 ;) with less than one microsecond latency.


In April 2014 Brocade shipped a VMware NSX hardware gateway in their Network OS. This blog post provides an overview of hardware gateways to overlay virtual networks.

OVERVIEW OF HARDWARE GATEWAYS TO OVERLAY VIRTUAL NETWORKS

A comment by Brook Reams on my recent blog post was a fantastic surprise: Brocade is the first vendor that actually shipped a VXLAN VTEP controlled by a VMware NSX controller. It's amazing to see how Brocade leapfrogged everyone else (they also added tons of other new functionality in NOS releases 4.0 and 4.1).

THE REALLY INTERESTING PART


Every other shipping hardware VXLAN gateway (Arista and F5) that integrates with a vSphere environment implements multicast-based VXLAN.
Brocade decided to skip multicast VXLAN support and implemented only the VMware NSX gateway functionality. Obviously they don't believe in the viability of Cisco's Nexus 1000V (or VMware's vCNS).

ANYONE ELSE?

Arista has a shipping L2 VTEP that uses IP multicast. They might have an OVSDB agent (which is needed to work with the NSX controller), but it's not yet documented in the public EOS documentation.


Nuage VSG 7850 is also shipping, but does not work with either multicast-based VXLAN or the NSX controller. It uses MP-BGP to integrate with other controllers within Nuage's Virtual Services Platform (VSP).

Cisco Nexus 9300 has L2 VTEP using IP multicast.

Dell seems stuck: the Z9000 documentation published on their web site is almost a year old, and everything else is older.

HP claims hardware VXLAN support on FlexFabric 5930AF switches (which probably means they're using the Trident-2 chipset); I haven't found anything VXLAN-related in their manuals.

Juniper is promising VXLAN on QFX5100 and MX-series routers; it looks like they haven't shipped yet.


LONG-DISTANCE OVERLAY VIRTUAL NETWORKS

IN THIS CHAPTER:
HOT AND COLD VM MOBILITY
VXLAN, OTV AND LISP
VXLAN IS NOT A DATA CENTER INTERCONNECT TECHNOLOGY
EXTENDING LAYER-2 CONNECTION INTO A CLOUD
REVISITED: LAYER-2 DCI OVER VXLAN
VXLAN AND OTV: I'VE BEEN SUCKERED


Literally minutes after VXLAN was launched at VMworld 2011, a marketing executive with a slightly vague touch with reality claimed that overlay virtual networks enable seamless long-distance VM mobility by eliminating all the constraints of the physical networks.
The few blog posts collected in this chapter try to bring some realism into the picture: VXLAN (or
any other virtual networking solution) is not a data center interconnect technology, but it could be
used as the technology-of-last-resort to minimize the impact of suboptimal and/or unrealistic
requirements.

MORE INFORMATION

Cloud Computing Networking webinar describes several over-the-top solutions that you can use
to connect a private cloud with a public cloud;

Check out other cloud computing and networking webinars;

Use the ExpertExpress service if you need a short online consulting session, a technology discussion or a design review.


Seamless VM mobility coveted by application and virtualization teams comes in two flavors: cold VM
mobility that requires mobile IP endpoints, and hot VM mobility that imposes additional
requirements on the network infrastructure.
This blog post (written in early 2013) describes the differences between the two flavors.

HOT AND COLD VM MOBILITY


Another day, another interesting Expert Express engagement, another stretched layer-2 design solving the usual requirement: "We need inter-DC VM mobility."
The usual question, "And why would you want to vMotion a VM between data centers?", was met with a refreshing answer: "Oh, no, that would not work for us."

THE CONFUSION
There are two different mechanisms we can use to move VMs around a virtualized environment: hot VM mobility, where a running VM is moved from one hypervisor host to another, and cold VM mobility, where a VM is shut down and its configuration moved to another hypervisor, where the VM is restarted.
Some virtualization vendors might offer a third option: warm VM mobility, where you pause a VM (saving its memory to a disk file) and resume its operation on another hypervisor.


WHY DO WE CARE?
You might not care about the mechanisms hypervisors use to move VMs around the data center, but you probably do care about the totally different networking requirements of hot and cold VM moves. Before going there, let's look at the typical use cases.

WHERE WOULD YOU NEED ONE OR THE OTHER?


Hot VM mobility is used by automatic resource schedulers (ex: DRS) that move running VMs between hypervisors in a cluster to optimize their resource (CPU, RAM) utilization. It is also heavily used for maintenance purposes: for example, you have to evacuate a rack of servers before shutting it down for maintenance or upgrade.
You'll find cold VM mobility in almost every high-availability (ex: VMware HA restarts a VM after a server failure) and disaster recovery solution (ex: VMware's SRM). It's also the only viable technology for VM migration into the brave new cloudy world (aka cloudbursting).

HOT VM MOVE
VMware's vMotion is probably the best-known example of hot VM mobility technology. vMotion copies memory pages of a running VM to another hypervisor, repeating the process for pages that have been modified while the memory was transferred. After most of the VM memory has been successfully transferred, vMotion freezes the VM on the source hypervisor, moves its state to another hypervisor, and restarts it there.
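
The iterative pre-copy logic is easy to visualize in a few lines of Python. This is a minimal sketch of the general algorithm, not VMware's actual implementation; the toy VM and the page-dirtying model are entirely made up:

```python
class ToyVM:
    """Toy VM with 1000 memory pages used to simulate the copy rounds."""
    def __init__(self, pages=1000):
        self.pages = set(range(pages))

    def dirtied_pages(self, round_nr):
        # Toy model: each pre-copy round is shorter than the previous one,
        # so the guest dirties roughly half as many pages as before.
        return set(range(len(self.pages) // (2 ** round_nr)))

def live_migrate(vm, max_rounds=30, stop_copy_threshold=20):
    """Iterative pre-copy, in the spirit of vMotion: copy all memory once,
    then keep re-copying whatever the guest dirtied, until the remaining
    delta is small enough to transfer during a sub-second freeze."""
    to_copy = vm.pages                      # round 1: full memory copy, VM keeps running
    for round_nr in range(1, max_rounds + 1):
        print(f"copying {len(to_copy)} pages while the VM runs on the source host")
        to_copy = vm.dirtied_pages(round_nr)
        if len(to_copy) < stop_copy_threshold:
            break
    print(f"freezing VM, moving final {len(to_copy)} pages and CPU state, resuming on target")

live_migrate(ToyVM())
```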


A hot VM move must not disrupt the existing network connections (why else would you insist on moving a running VM?). There are a number of elements that have to be retained to reach that goal:

The VM must have the same IP address (obvious);

The VM should have the same MAC address (otherwise we have to rely on hypervisor-generated gratuitous ARP to update ARP caches on other nodes in the same subnet; see the sketch after this list);

After the move, the VM must be able to reach the first-hop router and all other nodes in the same subnet using their existing MAC addresses (a hot VM move is invisible to the VM, so the VM doesn't know it should purge its ARP cache).
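
What the hypervisor sends on behalf of the moved VM is a plain gratuitous ARP. Here's a minimal Scapy sketch of such a frame (assuming the scapy package is installed and the script runs with root privileges; the addresses and interface name are made-up examples):

```python
from scapy.all import Ether, ARP, sendp

vm_ip, vm_mac = "10.1.1.10", "00:50:56:aa:bb:cc"   # hypothetical addresses of the moved VM

# Gratuitous ARP: a broadcast ARP reply in which sender and target IP are both
# the VM's own address, refreshing the ARP caches of every node on the segment.
garp = (Ether(src=vm_mac, dst="ff:ff:ff:ff:ff:ff") /
        ARP(op=2, hwsrc=vm_mac, psrc=vm_ip,
            hwdst="ff:ff:ff:ff:ff:ff", pdst=vm_ip))

sendp(garp, iface="eth0")   # sent out of the (new) host-facing uplink
```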

The only mechanisms we can use today to meet all these requirements are:

Stretched layer-2 subnets, whether in a physical (VLAN) or virtual (VXLAN) form;

Hypervisor switches with layer-3 capabilities. Hyper-V 3.0 Network Virtualization is pretty good, and the virtual switch used by Amazon's VPC would be perfect.

You might also want to keep in mind that:

Stretched layer-2 domains are not the best idea ever invented (server/OS engineers that
understand networking usually agree with that).

A layer-2 subnet with BUM flooding represents a single failure domain and a scalability roadblock.

Corollary: Keep the hot VM mobility domain small.


COLD VM MOVE
Cold VM move is a totally different beast: a VM is shut down and restarted on another hypervisor. It could easily survive a change in its IP and MAC address were it not for the enterprise craplications written by programmers that have never heard of DNS. Let's thus assume we have to deal with a broken application that relies on hard-coded IP addresses.
The IP address of the first-hop router is usually manually configured in the VM (yeah, I'm yearning for the ideal world where people use DHCP to get network-related parameters) and thus cannot be changed, but nothing stops us from configuring the same IP address on multiple routers (a trick used by first-hop localization kludges).
We can also use routing tricks (ex: host routes generated by load balancers) or overlay networks (ex: LISP) to make the moved VM reachable by the outside world, a major use case promoted by LISP enthusiasts.
The last time I was explaining how cold VM mobility works with LISP in an ExpertExpress WebEx session, I got a nice question from the engineer on the other end: "And how exactly is that different from host routes?" The best summary I've ever heard.
However, there's a gotcha: even though the VM has moved to a different location, it left residual traces of its presence in the original subnet: entries in ARP caches of adjacent hosts and routers. Routers are usually updated with new forwarding information (be it a routing protocol or LISP update), adjacent hosts aren't. These hosts would try to reach the moved VM using its old MAC address and fail unless there's a L2 subnet between the old and the new location.


Does all this sound like a complex spaghetti mess with loads of interdependencies and layers of kludges? You're not far from the truth. But wait, there's more: eventually LISP will be integrated with VXLAN for a seamless globe-spanning overlay network. It just might be easier to fix the applications, don't you think so?


In autumn of 2011 (when the following blog post was written) it looked like VXLAN might be the
answer to all questions. I tried to dispel that myth and describe how VXLAN, OTV and LISP might fit
together.
The notes in the blog post describe the additional implementation options that became available
between the time the blog post was written and summer of 2014.

VXLAN, OTV AND LISP


Immediately after VXLAN was announced @ VMworld, the twittersphere erupted in speculations and
questions, many of them focusing on how VXLAN relates to OTV and LISP, and why we might need a
new encapsulation method.
VXLAN, OTV and LISP are point solutions targeting different markets. VXLAN is an IaaS
infrastructure solution, OTV is an enterprise L2 DCI solution and LISP is ... whatever you want it to
be.
VXLAN tries to solve a very specific IaaS infrastructure problem: replace VLANs with something that
might scale better. In a massive multi-tenant data center having thousands of customers, each one
asking for multiple isolated IP subnets, you quickly run out of VLANs. VMware tried to solve the
problem with MAC-in-MAC encapsulation (vCDNI), and you could potentially do the same with the
right combination of EVB (802.1Qbg) and PBB (802.1ah), very clever tricks a-la Network Janitor, or
even with MPLS.


Compared to all these, VXLAN has a very powerful advantage: it runs over IP. You don't have to touch your existing well-designed L3 data center network to start offering IaaS services. The need for multipath bridging voodoo magic that a decent-sized vCDNI deployment would require is gone.
VXLAN gives Cisco and VMware the ability to start offering reasonably-well-scaling IaaS cloud infrastructure. It also gives them something to compete against the Open vSwitch/Nicira combo.
Reading the VXLAN draft, you might notice that all the control-plane aspects are solved with handwaving. Segment ID values just happen, IP multicast addresses are defined at the management layer, and the hypervisors hosting the same VXLAN segment don't even talk to each other, but rely on layer-2 mechanisms (flooding and dynamic MAC address learning) to establish inter-VM communication. VXLAN is obviously a QDS (Quick-and-Dirty Solution) addressing a specific need: increasing the scalability of IaaS networking infrastructure.
In the meantime, Cisco and VMware shipped unicast VXLAN implementations: Cisco on Nexus 1000V, VMware with VMware NSX.
VXLAN will indeed scale way better than a VLAN-based solution, as it provides total separation between the virtualized segments and the physical network (no need to provision VLANs on the physical switches), and it will scale somewhat better than MAC-in-MAC encapsulation because it relies on L3 transport (and can thus work well in existing networks), but it's still a very far cry from Amazon EC2. People with extensive (bad) IP multicast experience are also questioning the wisdom of using IP multicast instead of source-based unicast replication ... but if you want to remain control-plane ignorant, you have to rely on third parties (read: IP multicast) to help you find your way around.
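
To see why this is data-plane learning rather than a real control plane, here's a minimal sketch of what a flood-and-learn VTEP does (my own toy model, not any vendor's code): unknown destinations are flooded to every peer VTEP in the segment (head-end replication here, IP multicast in the original draft), and the source MAC of every received frame populates the forwarding table.

```python
class FloodAndLearnVtep:
    """Toy VTEP for a single VXLAN segment relying purely on data-plane learning."""

    def __init__(self, peer_vteps):
        self.peer_vteps = set(peer_vteps)   # remote VTEP IPs in the same segment
        self.mac_table = {}                 # inner MAC address -> remote VTEP IP

    def forward_from_local_vm(self, dst_mac, frame):
        if dst_mac in self.mac_table:
            return [(self.mac_table[dst_mac], frame)]          # known: unicast to one VTEP
        # Unknown unicast / broadcast: replicate to every peer (or send to the
        # segment's IP multicast group in the multicast-based variant).
        return [(vtep, frame) for vtep in self.peer_vteps]

    def receive_from_remote_vtep(self, src_vtep, src_mac, frame):
        self.mac_table[src_mac] = src_vtep   # learn: this MAC lives behind that VTEP
        return frame                         # decapsulated frame handed to local VMs

vtep = FloodAndLearnVtep(["10.0.0.2", "10.0.0.3"])
print(len(vtep.forward_from_local_vm("00:00:0c:00:00:01", b"ARP request")))  # flooded: 2 copies
vtep.receive_from_remote_vtep("10.0.0.2", "00:00:0c:00:00:01", b"ARP reply")
print(vtep.forward_from_local_vm("00:00:0c:00:00:01", b"data"))              # now a single unicast
```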
It seems there have already been claims that VXLAN solves inter-DC VM mobility (I sincerely hope I've got a wrong impression from Duncan Epping's summary of Steve Herrod's general session @ VMworld). If you've ever heard about traffic trombones, you should know better (but it does prove a point @etherealmind made recently). Regardless of the wishful thinking and beliefs in flat earth, holy grails and unicorn tears, a pure bridging solution (and VXLAN is no more than that) will never work well over long distances.
Here's where OTV kicks in: if you do become tempted to implement long-distance bridging, OTV is the least horrendous option (BGP MPLS-based MAC VPN will be even better, but it still seems to be working primarily in PowerPoint). It replaces dynamic MAC address learning with deterministic routing-like behavior, provides proxy ARP services, and stops unicast flooding. Until we're willing to change the fundamentals of transparent bridging, that's almost as good as it gets.
EVPN is the standardized BGP MPLS-based MAC VPN solution. Some hardware vendors already have EVPN-capable products; it's also used in several overlay virtual networking solutions.
As you can see, it makes no sense to compare OTV and VXLAN; it's like comparing a racing car to a downhill mountain bike. Unfortunately, you can't combine them to get the best of both worlds; at the moment, OTV and VXLAN live in two parallel universes. OTV provides long-distance bridging-like behavior for individual VLANs, and VXLAN cannot even be transformed into a VLAN.
LISP is yet another story. It provides a very rudimentary approximation to IP address mobility across layer-3 subnets, and it might be able to do it better once everyone realizes the hypervisor is the only place to do it properly. However, it's a layer-3 solution running on top of layer-2 subnets, which means you might run LISP in combination with OTV (not sure it makes sense, but nonetheless) and you could be able to run LISP in combination with VXLAN once you can terminate VXLAN on a LISP-capable L3 device.


Three years later, there's still no integration between VXLAN, OTV and LISP.

So, with the introduction of VXLAN, the networking world hasn't changed a bit: the vendors are still serving us all isolated incompatible technologies ... and all we're asking for is tightly integrated and well-architected designs.


Once you launch a catchy idea, it gets a life of its own. Almost three years after I wrote the following blog post, networking engineers still ask me whether they could use VXLAN to implement a data center interconnect.

VXLAN IS NOT A DATA CENTER INTERCONNECT TECHNOLOGY

In a comment to the Firewalls in a Small Private Cloud blog post I wrote "VXLAN is _NOT_ a viable inter-DC solution" and Jason wasn't exactly happy with my blanket response. I hope Jason got a detailed answer in the VXLAN Technical Deep Dive webinar; here's a somewhat shorter explanation.
VXLAN is a layer-2 technology. If you plan to use VXLAN to implement a data center interconnect, you'll be stretching a single L2 segment across two data centers.
You probably know my opinion about the usability of L2 DCI, but even ignoring the obvious problems, current VXLAN implementations don't have the features one would want to see in a L2 DCI solution.


WHAT SHOULD A L2 DCI SOLUTION HAVE?


Assuming someone forced you to implement a L2 DCI, the technology you plan to use SHOULD have
these features:

Per-VLAN flooding control at the data center edge. Broadcasts/multicasts are usually not rate-limited within the data center, but should be tightly controlled at the data center edge (bandwidth between data centers is usually orders of magnitude lower than bandwidth within a data center). Ideally, you'd be able to control them per VLAN to reduce the noisy neighbor problems (a minimal policer sketch follows these lists).

Broadcast reduction at the data center edge. Devices linking the DC fabric to the WAN core should implement features like ARP proxy.

Controlled unicast flooding. It should be possible to disable flooding of unknown unicasts at the DC-WAN boundary.

It's also nice to have the following features to reduce the traffic trombones going across the DCI link:

First hop router localization. Inter-subnet traffic should not traverse the DCI link to reach the
first-hop router.

Ingress traffic optimization. Traffic sent to a server in one data center should not arrive in the other data center first.

OTV in combination with FHRP localization and LISP (or load balancers with Route Health Injection) gives you a solution that meets these criteria. VXLAN with hypervisor VTEPs has none of the above-mentioned features.
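
For illustration, per-VLAN flooding control at the DC edge (the first requirement above) boils down to a per-VLAN policer applied only to broadcast/unknown-unicast/multicast frames on the DCI-facing port. A minimal token-bucket sketch (my own toy model with made-up rate values, not a feature of any particular product):

```python
import time

class BumPolicer:
    """Token-bucket policer for BUM (broadcast/unknown-unicast/multicast) traffic,
    one bucket per VLAN, applied only on the DCI-facing port."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0          # refill rate in bytes per second
        self.burst = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def allow(self, frame_len):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if frame_len <= self.tokens:
            self.tokens -= frame_len
            return True                     # forward the flooded frame across the DCI link
        return False                        # drop it and protect the much thinner DCI bandwidth

# one policer per stretched VLAN, e.g. 1 Mbps of BUM traffic with a 64 kB burst
policers = {vlan: BumPolicer(rate_bps=1_000_000, burst_bytes=64_000) for vlan in (100, 101)}
print(policers[100].allow(frame_len=1500))
```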


The VXLAN gateway on Arista's 7150 is somewhat better, so you might be tempted to use it as a solution that would connect two VLANs across an IP network, but don't forget that they haven't solved the redundancy issues yet: you can have a single switch acting as a VXLAN gateway for a particular VLAN.
Conclusion: The current VXLAN implementations (as of November 2012) are a far cry from what I would like to see if being forced to implement a L2 DCI solution. Stick with OTV (it's now available on ASR 1K).


If you have to offer your customers a layer-2 interconnect into a public cloud, and don't want to risk the stability of the underlying network infrastructure by extending layer-2 segments into customer sites, overlay networking might be a viable alternative.

EXTENDING LAYER-2 CONNECTION INTO A CLOUD


Carlos Asensio was facing an interesting challenge: someone has sold a layer-2 extension into
their public cloud to one of the customers. Being a good engineer, he wanted to limit the damage
the customer could do to the cloud infrastructure and thus immediately rejected the idea to connect
the customer straight into the layer-2 network core ... but what could he do?
Overlay virtual networks just might be a solution if you have to solve a similar problem:

Build the cloud portion of the customer's layer-2 network with an overlay virtual networking technology;

Install an extra NIC in one (or more) physical host and run a VXLAN-to-VLAN gateway in a VM on that host; the customer's VLAN is thus completely isolated from the data center network core;

Connect the extra NIC to the WAN edge router or switch on which the customer's link is terminated.
Whatever stupidity the customer does in its part of the stretched layer-2 network won't spill further than the gateway VM and the overlay network (and you could easily limit the damage by reducing the CPU cycles available to the gateway VM).


The diversity of overlay virtual networking solutions available today gives you plenty of choices:

You could use Nexus 1000V with VXLAN or the OVS/GRE/OpenStack combo at no additional cost (combining VLANs with GRE-encapsulated subnets might be an interesting challenge in the current OpenStack Quantum release);

VMware's version of VXLAN comes with vCNS (a product formerly known as vShield), so you'll need a vCNS license;

You could also use Nicira NVP (part of VMware NSX) with a layer-2 gateway (included in the NVP platform).
Hyper-V Network Virtualization might have a problem dealing with dynamic MAC addresses coming from the customer's data center; this is one of the rare use cases where dynamic MAC learning works better than a proper control plane.

The VXLAN-to-VLAN gateway linking the cloud portion of the customer's network with the customer's VLAN could be implemented with Cisco's VXLAN gateway or a simple Linux or Windows VM on which you bridge the overlay and VLAN interfaces (yet again, one of those rare cases where VM-based bridging makes sense). Arista's 7150 or F5 BIG-IP is probably an overkill.
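
If you go down the Linux-VM route, the gateway is nothing more than a VXLAN interface and the customer-facing NIC joined by a bridge. Here's a minimal sketch using standard iproute2 commands driven from Python (interface names, VNI and multicast group are made-up examples, and this is multicast-based VXLAN, not an NSX-controlled VTEP):

```python
import subprocess

def sh(cmd):
    """Run a single iproute2 command and fail loudly if it doesn't work."""
    subprocess.run(cmd.split(), check=True)

# eth0 = IP-facing NIC (VXLAN transport), eth1 = NIC patched to the customer's VLAN
sh("ip link add vxlan100 type vxlan id 100 group 239.1.1.1 dev eth0 dstport 4789")
sh("ip link add name br-cust type bridge")
sh("ip link set vxlan100 master br-cust")   # overlay side of the bridge
sh("ip link set eth1 master br-cust")       # customer VLAN side of the bridge
for ifname in ("vxlan100", "eth1", "br-cust"):
    sh(f"ip link set {ifname} up")
```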
And now for a bit of totally unrelated trivia: once we solved the interesting part of the problem, I asked about the details of the customer interconnect link; they planned to have a single 100 Mbps link and thus a single point of failure. I can only wish them luck and hope they'll try to run stretched clusters over that link.


In 2014 I somewhat revised my perspective on VXLAN as the layer-2 DCI technology. It's still not the right tool for the job, but it might be better than some alternatives.

REVISITED: LAYER-2 DCI OVER VXLAN


I'm still getting questions about layer-2 data center interconnect; it seems this particular bad idea isn't going away any time soon. In the face of that sad reality, let's revisit what I wrote about layer-2 DCI over VXLAN.
VXLAN hasn't changed much since the time I explained why it's not the right technology for long-distance VLANs:

I haven't seen integration with OTV or LISP that was promised years ago (or maybe I missed something; please write a comment);

VXLAN-to-VLAN gateways are still limited to a single gateway (or MLAG cluster) per VXLAN segment, generating traffic trombones with long-distance VLANs;

Traffic trombones generated by stateful appliances (inter-subnet firewalls or load balancers) are
obviously impossible to solve.

Then there's the obvious problem of data having gravity (or applications being used to being close to the data): if you move a VM away from the data, the performance quickly drops way below acceptable levels.
However, if you're forced to implement a stretched VLAN (because the application team cannot possibly deploy their latest gizmo without it, or because the server team claims they need it for disaster recovery that has no chance of working) that nobody will ever use, VXLAN is the least horrible technology. After all, you've totally decoupled the physical infrastructure from the follies of virtual networking, and even if someone manages to generate a forwarding loop between two VXLAN segments, the network infrastructure won't be affected assuming you implemented some basic traffic policing rules.


The final blog post in this chapter describes my grudge with misleading marketing claims. VXLAN uses a packet format that's very similar to what's described in the OTV IETF draft, but that's a far cry from the packet format that OTV is actually using.

VXLAN AND OTV: I'VE BEEN SUCKERED


When VXLAN came out a year ago, a lot of us looked at the packet format and wondered why Cisco and VMware decided to use UDP instead of the more commonly used GRE. One explanation was evident: UDP port numbers give you more entropy that you can use in 5-tuple-based load balancing. The other explanation looked even more promising: VXLAN and OTV use very similar packet formats, so the hardware already doing OTV encapsulation (Nexus 7000) could be used to do VXLAN termination. Boy, have we been suckered.
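
The entropy argument (the first explanation) is easy to demonstrate: the VTEP can hash the inner frame's headers into the outer UDP source port, so different inner flows land on different ECMP paths even though the outer IP addresses stay the same. A minimal sketch of the idea (my own illustration, not any particular VTEP implementation):

```python
import zlib

def outer_udp_source_port(inner_src_mac, inner_dst_mac, inner_5tuple=b""):
    """Derive the outer UDP source port from a hash of the inner headers so that
    ECMP/port-channel hashing in the underlay spreads different inner flows."""
    flow_key = inner_src_mac + inner_dst_mac + inner_5tuple
    # The VXLAN spec recommends picking the source port from the dynamic range (49152-65535)
    return 49152 + (zlib.crc32(flow_key) % 16384)

print(outer_udp_source_port(b"\x00\x50\x56\x01\x02\x03", b"\x00\x50\x56\x0a\x0b\x0c",
                            b"tcp:10.1.1.1:10.1.1.2:49152:80"))
```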
It turns out nobody took the time to analyze an OTV packet trace with Wireshark; everyone believed whatever the IETF drafts were telling us. Here's the packet format from draft-hasmit-otv-03:


 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|      Fragment Offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Time to Live  | Protocol = 17 |        Header Checksum        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Source-site OTV Edge Device IP Address             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Destination-site OTV Edge Device (or multicast) Address   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       Source Port = xxxx      |        Dest Port = 8472       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           UDP length          |        UDP Checksum = 0       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R|R|R|R|I|R|R|R|                  Overlay ID                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                Instance ID                    |    Reserved   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|              Frame in Ethernet or 802.1Q Format               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

And here's the packet format from draft-mahalingam-dutt-dcops-vxlan. Apart from a different UDP port number, the two match perfectly.


 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1

Outer Ethernet Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             Outer Destination MAC Address                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Outer Destination MAC Address | Outer Source MAC Address      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                Outer Source MAC Address                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|OptnlEthtype = C-Tag 802.1Q    | Outer.VLAN Tag Information    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Ethertype = 0x0800            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Outer IPv4 Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|      Fragment Offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live |Protocl=17(UDP)|        Header Checksum        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 Outer Source IPv4 Address                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             Outer Destination IPv4 Address                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Outer UDP Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       Source Port = xxxx      |     Dest Port = VXLAN Port    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           UDP Length          |         UDP Checksum          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

VXLAN Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R|R|R|R|I|R|R|R|                 Reserved                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        VXLAN Network Identifier (VNI)         |    Reserved   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Inner Ethernet Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             Inner Destination MAC Address                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Inner Destination MAC Address | Inner Source MAC Address      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                Inner Source MAC Address                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|OptnlEthtype = C-Tag 802.1Q    | Inner.VLAN Tag Information    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Payload:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Ethertype of Original Payload |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                  Original Ethernet Payload                    |
|                                                               |
| (Note that the original Ethernet Frame's FCS is not included) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
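
As a sanity check on the diagram: the entire VXLAN shim is just eight bytes, a flags byte (with the I bit set for a valid VNI), three reserved bytes, a 24-bit VNI and one more reserved byte. A quick Python illustration (mine, not taken from either draft):

```python
import struct

def vxlan_header(vni):
    """Build the 8-byte VXLAN header: flags (I bit set), 24 reserved bits,
    24-bit VNI, 8 reserved bits."""
    flags = 0x08                              # 0b00001000: only the I (valid VNI) bit set
    return struct.pack("!II", flags << 24, vni << 8)

hdr = vxlan_header(vni=5001)
print(hdr.hex())                              # 0800000000138900 -> VNI 0x001389 = 5001
```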

However, it turns out the OTV draft four Cisco engineers published in 2011 has nothing to do with the actual implementation and encapsulation format used by Nexus 7000. It seems Brian McGahan was the first one to actually do an OTV packet capture and analysis and publish his findings. He discovered that OTV is nothing else than the very familiar EoMPLSoGREoIP. No wonder the first VXLAN gateway device Cisco announced at Cisco Live is not the Nexus 7000 but a Nexus 1000V-based solution (at least that's the way I understood this whitepaper).


ALTERNATE APPROACHES TO NETWORK VIRTUALIZATION

IN THIS CHAPTER:
NETWORK VIRTUALIZATION AND SPAGHETTI WALL
SMART FABRICS VERSUS OVERLAY VIRTUAL NETWORKS
NETWORK VIRTUALIZATION AT TOR SWITCHES? MAKES AS MUCH SENSE AS IP-OVER-APPN


When we move the network virtualization functionality into the virtual switches residing in
hypervisor hosts or bare-metal servers, the physical network becomes exceedingly simple: all we
need is end-to-end IP connectivity, potentially with equidistant endpoints, which are easy to get in a
leaf-and-spine (aka Clos) fabric.
Hardware networking vendors are trying to stem the shift by offering new platforms and
architectures that keep the traditional VLAN-based virtual switches and implement the network
virtualization functionality in the (hardware) network edge. Needless to say, these approaches
usually make as much sense as trying to keep X.25 a viable alternative to TCP/IP.
This chapter contains a few rants I wrote in the last three years. You won't find many technical arguments in this chapter: all the high-level arguments are listed in the Overlay Virtual Networking 101 chapter, and I've seen little value in repeating them in every blog post.


The title of this blog post was inspired by an article written by Randy Bush in which he compares IT vendors to people throwing spaghetti at a wall to see what sticks. Hope you'll enjoy a long list of technologies that will never work.

NETWORK VIRTUALIZATION AND SPAGHETTI WALL


I was reading What Network Virtualization Isn't from Jon Onisick the other day and started experiencing all sorts of unpleasant flashbacks caused by my overly long exposure to networking industry missteps and dead ends touted as the best possible solutions or architectures in the days of their glory:

X.25 gurus telling me how Telnet will never take off because the TCP/IP header has a much larger overhead than the X.29 PAD service. X.25 is dead (OK, maybe a zombie) and nobody complains about TCP/IP header overhead anymore. BTW, we solved the header overhead problem decades ago with TCP/IP header compression.

ATM gurus telling me how each application needs its own dedicated QoS settings and how the only way to implement that is to run ATM to the desktop. ATM to the desktop never took off, and global QoS remains a provider's dream and a vendor's bonanza. In the meantime, we're watching Netflix videos and talking over Skype with absolutely no QoS guarantees from our ISPs.

IBM sales engineers telling my customers how it would be totally irresponsible to transport bank teller application data over a TCP/IP network (the data would still be in an SNA session, but transported across an unreliable routed network). SNA is another zombie, and everyone is using the TCP/IP protocol stack ... oh, and we're running e-banking over the Internet.


An Alcatel guy telling us how it's absolutely impossible to run VoIP because you can never get the voice quality (and MOS) the customers are used to. Most of our customers run VoIP today because it's cheaper than ISDN, and Skype is more than good enough for most uses. Also, do I have to mention how the voice zealots dropped their standards when faced with the realities of mobile calls? I usually get better voice quality with Skype than with my mobile phone.

An enterprise network designer who chose ATM LANE over Fast Ethernet (in the days when FE
was still bleeding edge technology) ... and failed miserably ... because he believed ATM
evangelists and wanted to transport data, voice and video over a common ATM campus
backbone. In the end, they did use ATM on long-distance WAN links (where it made perfect
sense) and Fast Ethernet in the campus.

All the above-mentioned technologies and architectures went extinct for a simple reason: whenever there's a clash between competing solutions, the ones that move the complexity as far out to the edge as possible usually win, and those that tried to keep the complexity and micro-state in the core failed (X.25, ATM, traditional voice circuits) because keeping state is too expensive at scale. Draw your own conclusions, and remember that in a server virtualization environment the edge is in the hypervisor, not in the ToR switch.
Does that mean the overlay virtual networks are a perfect solution? Far from it: we're probably where TCP/IP and the Internet were in the early nineties, and virtualization vendors don't have a perfect track record when it comes to network virtualization, but they're catching up fast.


In the meantime, I'm positive that numerous network-focused startups and traditional vendors trying to solve the network virtualization challenges in ToR switches or in the network core will launch great products trying to capture the legacy enterprise market (because the cloud providers already moved on). Some of those products will be well executed and highly profitable, but in the end all of them will go down the same path as X.25, ATM and LANE went a decade ago, because they're architecturally suboptimal.
Finally, if you're wondering why I mentioned the spaghetti wall in the title: that's how I feel when being faced with a barrage of competing (and incompatible) ToR-based network virtualization solutions.


In this blog post (written in July 2013) I expanded the "overlay virtual networking is like Skype" analogy with a few more technical details.

SMART FABRICS VERSUS OVERLAY VIRTUAL NETWORKS


With the recent plethora of overlay networking startups and Cisco Live Dynamic Fabric Architecture announcements, it's time to revisit a blog post I wrote a bit more than a year ago, comparing virtual networks and voice technologies.
They say a picture is worth a thousand words: here are a few slides from my Interop 2013 Overlay Virtual Networking Explained presentation.
This is how most enterprise data centers provision virtual networks these days (if you're working for a cloud provider and still doing something similar, run away as fast as you can).


Good morning! To which VLAN would you like to connect today?

The networking industry would love to keep the complexity (and related margins) in the network,
keeping the edge (hypervisors) approximately as smart as the following device:


Which VLAN would you like to dial today?

With the edge being mostly stupid (and 802.1Qbg playing the role of rotary dialing), you need loads
of technologies in the network to compensate for the edge stupidity, just like the voice exchanges
needed more and more complex technologies and protocols to establish voice circuits.


Figure 6-1: Networking industry's answers to the challenges of network virtualization

The details have changed a bit (Cisco seems to be embracing L3 forwarding at the ToR switches), but the architectural options haven't: you have to have the complex stuff somewhere, and it will be either in the end systems (hypervisors) or in the network.


Figure 6-2: JBOT (Just a Bunch of Technologies)

We all know how the voice saga ended: you can't sell a mobile phone if it doesn't support Skype, and while there are still plenty of loose ends when you have to connect the old and the new worlds, more or less everyone essentially gave up and started using VoIP for new deployments. Yes, it took us more than a decade to get there, and the road was bumpy, but I don't think you could persuade anyone to invest money in a PBX-with-SS7 startup these days.


Figure 6-3: Let's move from a PBX-like technology toward Skype

We'll probably see the same game played out twenty years later in the virtual networking space (one can only hope the remains of the past won't hinder us as long as they do in the VoIP world): the established networking vendors selling us smarter and smarter exchanges (switches) and the virtualization vendors and startups selling us end-system solutions running on top of IP. It's easy to predict the final outcome; it's just the question of how long it will take to get there (and don't forget that Alcatel, Lucent and Nortel made plenty of money selling PBXes to legacy enterprises while Cisco and others tried to boost low VoIP adoption).


Finally, here's my take on vendors that try to implement network virtualization at the ToR switches. They might have excellent, well-thought-out solutions, but they're still trying to swim against the tide.

NETWORK VIRTUALIZATION AT TOR SWITCHES? MAKES AS MUCH SENSE AS IP-OVER-APPN

One of my blogger friends sent me an interesting observation:
After talking to networking vendors I'm inclined to think they are going to focus on a
mesh of overlays from the ToR, with possible use of overlays between vswitch and ToR
too if desired - drawing analogies to MPLS with ToR a PE and vSwitch a CE. Aside from
selling more hardware for this, I'm not drawn towards a solution like this bc it doesn't
help with full network virtualization and a network abstraction for VMs.
The whole situation reminds me of the good old SNA and APPN days, with networking vendors playing the IBM part of the comedy.
I apologize to the younglings in the audience: the rest of the blog post will sound like total gibberish to you, but I do hope the grumpy old timers will get a laugh or two out of it.
Once upon a time, there were mainframes (and nobody called them clouds), and all you could do was to connect your lowly terminal (80 x 24 fluorescent green characters) to a mainframe. Not surprisingly, the networking engineers were building hub-and-spoke networks with the mainframes (actually their sidekicks called Front End Processors) tightly controlling all the traffic. The whole thing was called Systems Network Architecture (SNA) and life was good (albeit a bit slow).
Years later, seeds of evil started appearing in the hub-and-spoke wonderland. There were rumors of
coax cables being drilled and vampire taps being installed onto said cables. Workstations were able
to communicate without the involvement of the central controller ... and there was a new protocol
called Internet Protocol that powered all these evil ideas.

Figure 6-4: Vampire biting marks on an original yellow thick coax


Not surprisingly, IBM (the creator of SNA) tried a tweak-embrace-and-extend strategy. First they
introduced independent logical units (clients and servers in IP terminology), later on they launched
what seemed like a Crazy Ivan (not related to my opinions) to the orthodox hub-and-spoke
believers: Advanced Peer-to-Peer Networking (APPN), still using the time-tested (and unbelievably
slow) SNA protocols.

Figure 6-5: What is APPN (Source: Cisco)

At the same time IBM tried to persuade us that 4Mbps Token Ring works faster than 10Mbps switched Ethernet. Brocade recently tried a similar stunt, trying to tell us how Gen5 Fibre Channel (also known as 16GB FC) is better than anything else (including 40GE FCoE), yet another proof that marketers never learn from past blunders.


Faced with the dismal adoption of APPN (I haven't seen a live network running APPN, although I was told some people were using it for AS/400 networking) and the inevitable rise of IP, IBM tried yet another approach: let's transport IP over APPN. Crazy as it sounds, I remember someone proposing to run a datagram service (IP) on top of a layer-7 (LU6.2) transport ... and there are people today running IP over SSH, proving yet again that every bad idea resurfaces after a while.
François Roy provided the necessary IP-over-APPN detail in his comment to my blog post:
IBM implemented it in 2217 Nways Multiprotocol Concentrator. Straight from the
documentation: "TCP/IP data is routed over SNA using IBM's multiprotocol transport
networking (MPTN) formats."
Regardless of IBM's huge marketing budget, the real world took a different turn. First we started transporting SNA over IP (remember DLSw?), then deployed Telnet 3270 (TN3270) gateways to give PCs TCP/IP-based access to mainframe applications. Oh, and IBM seems to have APPN over IP.
A few years later, IBM was happily selling Fast Ethernet mainframe attachments and running the TCP/IP stack with TN3270 on the mainframes (you see, they never really cared about networking; their core businesses are services, software and mainframes) ... and one of the first overlay virtual network implementations was VXLAN in Nexus 1000V.
And so I finally managed to mention overlay virtual networking ... but don't rush to conclusions; before drawing analogies keep in mind that most organizations couldn't get rid of the mainframes: there were millions of lines of COBOL code written for an environment that could not be easily replicated anywhere else. Migrating those applications to any other platform was mission impossible.


On the other hand, all it takes in the server virtualization world is an upgrade to vSphere 5.1 (or Hyper-V 3.0) and a hardware refresh cycle (to flush the physical appliances out of the data center), and the networking vendors will be left wondering where all the VMs and VLANs disappeared. And you did notice that HP finally delivered TRILL and EVB on their ToR switches, didn't you?
