
Preface
    Introduction
    Authors
    Acknowledgements
    Organization of this book
    Intended Audience
    Book Writing Methodology

Why a New Approach
    Introduction
    Why VXLAN Overlay
    Why a Control Plane
    Looking Ahead

Fundamental Concepts
    Introduction
    What is VXLAN?
    How Does VXLAN Work?
    Networking in a VXLAN Fabric

Software Overlays
    Introduction
    Host-Based Overlay

Single-POD VXLAN Design
    Introduction
    Underlay
    Overlay
    Host Connectivity

External Connectivity for VXLAN Fabric
    Introduction
    Layer 3 Connectivity
    Layer 2 Connectivity
    Integration and Migration

Layer4-Layer7 Services
    Introduction
    Use Cases

Multi-POD & Multi-Site Designs
    Introduction
    Fundamentals
    Multi-POD Design
    Design Options
    Building the Multi-Site Inter-Connectivity

Operations & Management
    Introduction
    Management tasks
    Available Tools

Acronyms
    Acronyms

Preface

Introduction

VXLAN EVPN
For many years, VLANs have been the de facto method for providing network segmentation in data center networks. Standardized as IEEE 802.1Q, VLANs rely on traditional loop prevention techniques such as the Spanning Tree Protocol, which not only imposes restrictions on network design and resiliency but also results in inefficient use of the available network links, because redundant paths must be blocked to ensure a loop-free network topology.

VLANs also use a 12-bit VLAN identifier to address Layer 2 segments, which sets a practical limit of roughly 4,000 VLANs. In modern data center network deployments, VLANs have become a limiting factor for IT departments and cloud providers as they build increasingly large and complex multi-tenant data centers.

Modern data centers require an evolution from the constraints of traditional Layer 2 networks. Cisco, in partnership with other leading vendors, proposed the Virtual Extensible LAN (VXLAN) standard to the IETF as a solution to the data center network challenges posed by traditional VLAN technology and the Spanning Tree Protocol. At its core, VXLAN provides the benefits of elastic workload placement, higher scalability of Layer 2 segmentation, and connectivity extension across the Layer 3 network boundary. However, without an intelligent control plane, VXLAN has its limits due to its flood and learn behavior.

Multi-Protocol Border Gateway Protocol (MP-BGP) introduced new Network Layer Reachability Information (NLRI) to carry both Layer 2 MAC and Layer 3 IP information at the same time. With the combined set of MAC and IP information available for forwarding decisions, optimized routing and switching within a network becomes feasible, and the need for flood and learn behavior, which limits the ability to scale, is minimized. The extension that allows BGP to transport Layer 2 MAC and Layer 3 IP information is called EVPN (Ethernet Virtual Private Network).

In summary, the advantages provided by a VXLAN EVPN solution are as follows:

• Standards-based overlay (VXLAN) with standards-based control plane (BGP)
• Layer 2 MAC and Layer 3 IP information distribution by control plane (BGP)
• Forwarding decision based on scalable control plane (minimizes flooding)
• Integrated Routing/Bridging (IRB) for optimized forwarding in the overlay
• Leverages Layer 3 ECMP (all links forwarding) in the underlay
• Significantly larger namespace in the overlay (16M segments)
• Integration of physical and virtual networks with hybrid overlays
• Facilitation of Software-Defined Networking (SDN)
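
The difference in namespace size follows directly from the width of the identifier fields. The following is a quick arithmetic check, assuming the 12-bit VLAN ID of IEEE 802.1Q and the 24-bit VXLAN Network Identifier (VNI) used by VXLAN; the variable names are illustrative only.

```python
# Identifier widths: 12 bits for an 802.1Q VLAN ID, 24 bits for a VXLAN VNI.
VLAN_ID_BITS = 12
VNI_BITS = 24

vlan_ids = 2 ** VLAN_ID_BITS      # raw number of VLAN ID values
usable_vlans = vlan_ids - 2       # IDs 0 and 4095 are reserved in 802.1Q
vni_values = 2 ** VNI_BITS        # raw number of VNI values ("16M segments")

print(f"VLAN IDs: {vlan_ids} raw, {usable_vlans} usable")
print(f"VXLAN VNIs: {vni_values}")
print(f"Growth factor: {vni_values // vlan_ids}x")
```

The 12 additional identifier bits multiply the number of addressable segments by 4,096, which is where the "~4,000 VLANs" versus "16M segments" comparison comes from.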

This book explores VXLAN EVPN, beginning with an introduction to the terms and concepts and progressing through deployments within a single data center to deployments across multiple data centers. The book also addresses the design and integration of L4-L7 network services, co-existence with brownfield environments, and the tools needed to build, operate, and maintain a VXLAN EVPN Fabric. At the conclusion of this book, readers will have a solid foundation in VXLAN EVPN and an understanding of real-world use cases that can be used immediately to help develop a plan for a successful transition to a next-generation data center Fabric.

VXLAN BGP EVPN features and functionality discussed within are available on the following Cisco Nexus Series Switches:

• Cisco Nexus 9000 Series starting with NX-OS 7.0
• Cisco Nexus 7000 Series and Nexus 5600 starting with NX-OS 7.3

Authors

This book represents a collaborative effort between Technical Marketing and Sales Engineers during a week-long intensive session at Cisco Headquarters in San Jose, CA.

• Brenden Buresh - Systems Engineering
• Dan Eline - Systems Engineering
• David Jansen - Systems Engineering
• Jason Gmitter - Systems Engineering
• Jeff Ostermiller - Systems Engineering
• Jose Moreno - Systems Engineering
• Kenny Lei - Technical Marketing
• Lilian Quan - Technical Marketing
• Lukas Krattiger - Technical Marketing
• Max Ardica - Technical Marketing
• Rahul Parameswaran - Technical Marketing
• Rob Tappenden - Systems Engineering
• Satish Kondalam - Technical Marketing

Acknowledgements

A special thanks to Cisco’s Insieme and EISG BU Executives, Technical Marketing and Engineering teams, who supported the realization of this book. Thanks to Carl Solder, James Christopher, Joe Onisick, Matt Smorto, Victor Moreno and Yousuf Khan for supporting this effort. Thanks to Cisco Sales Leadership for supporting the group of individual contributors who dedicated their time to authoring this book.

We would also like to thank Cynthia Broderick for her exceptional resource organization and support throughout our journey, and Shilpa Grandhi for making sure that everything worked smoothly and that all thirteen writers made it to the end.

We are also genuinely grateful to our Book Sprint (www.booksprints.net) team:

• Adam Hyde (Founder)
• Henrik van Leeuwen (Illustrator)
• Juan Carlos Gutiérrez Barquero (Technical Support)
• Julien Taquet (Book Producer)
• Laia Ros (Facilitator)
• Raewyn Whyte (Proof Reader)

Laia and the team created an enabling environment that allowed us to exercise our collaborative and technical skills to produce this technical publication to meet a growing demand.

Organization of this book

Readers can read this book sequentially or go directly to individual chapters. Chapters are generally self-contained, but to avoid duplicating information, they may include pointers to other parts of the book. Where applicable, hyperlinks take the reader to Internet pages that provide additional detail.

Why a New Approach


The motivation for the development of this new technology is explored, as well as the benefits organizations can extract from it. A brief overview explains how the data and control planes of VXLAN EVPN contribute to solving business challenges that many organizations are confronted with.

Fundamental Concepts
This chapter shifts the focus from the "Why" to the "What". Essential concepts for understanding the technology are laid out, to set the necessary foundation for understanding the rest of the book. The basics of VXLAN technology are articulated, as well as the fundamentals of networking in a VXLAN Fabric.

Software Overlays
The intersection of virtual and physical networking is discussed in order to help the reader gain the required perspective to decide how to best implement VXLAN technology to support these virtualized environments.

Single-POD VXLAN Design




A deep dive into the inner workings of the VXLAN protocol, including best practices, design recommendations and lessons learned about both the underlay and overlay elements of a VXLAN Fabric with MP-BGP EVPN. Even though this publication should not be considered a configuration guide, command examples are included so that readers interested in the actual deployment can understand the configuration required for each of the individual components of the technology.

External Connectivity for VXLAN Fabrics


After a VXLAN Fabric is up and running, the next step is connecting it to the rest of the world. This chapter discusses the details of connecting the VXLAN Fabric over Layer 2 and Layer 3 with non-VXLAN networks. These techniques are used in the last section of the chapter to demonstrate a procedure for migrating a brownfield legacy network into a VXLAN Fabric.

Layer4-Layer7 Services
Ethernet routers and switches are not the only elements providing network services in a data center. Layer 4-Layer 7 devices like firewalls or application delivery controllers are often indispensable for secure and efficient application delivery. This chapter addresses how to connect these network appliances to the VXLAN Fabric so the data center network offers the best performance and availability end-to-end.

Multi-POD and Multi-Site Designs


Most organizations today have business continuity requirements that determine how network infrastructure is deployed in multiple geographical locations. Whether networks are deployed across two rooms in the same building or across sites thousands of miles apart, this chapter illustrates how to solve the distributed network problem while achieving both workload mobility and fault containment.

Operations and Management




Efficient management practices can optimize the way computer networks are monitored and deployed, and VXLAN Fabrics are no exception. This chapter describes how to use both traditional and modern network techniques to manage a VXLAN Fabric. Off-the-shelf network management software is discussed, as well as open source approaches and DevOps-inspired tools such as Puppet, Chef and Ansible.

Intended Audience

The intended audience for this book is network professionals with a general need to understand how to deploy VXLAN networks in their organizations to unleash the full potential of modern networking. While interested network administrators will reap the most benefit from this content, the information included within this book may be of use to every IT professional interested in networking technologies. Elements in this book explore how VXLAN and EVPN solve network challenges that have daunted the industry for years, as well as how to deploy constructs that are typically seen in traditional networks with this new technology.

Book Writing Methodology

How many engineers do you need to write a book? Thirteen! Thirteen highly skilled professionals got together in Building 31 at Cisco headquarters in San Jose, California. Thirteen In, One Out: thirteen individually selected professionals from diverse backgrounds accepted the challenge to duel thoughts over the course of five days. Figuring out how to harness the brain power and collaborate effectively at first seemed nearly impossible; however, opposites attracted and the team persisted through the hurdles. The Book Sprints (www.booksprints.net) methodology captured each of our strengths, fostered a team-oriented environment, and accelerated the overall time to completion. The assembled group leveraged nearly two hundred years of experience and a thousand hours of diligent authorship, which resulted in this publication. Representing four continents and seven nationalities, after five long days, one book was produced. Fueled by Chinese, Indian, Japanese, Mexican and Italian food, but first and foremost by American coffee, together with their facilitator from Book Sprints, the writers poured their experience and knowledge into this publication.
Why a New Approach

Introduction

IT is evolving toward a cloud consumption model. This transition affects the way applications are being architected and implemented, driving an evolution in data center infrastructure design to meet these changing requirements. As the foundation of the modern data center, the network must take part in this evolution while also meeting the increasing demands of server virtualization and new microservices-based architectures. This demands a new paradigm that must deliver on the following areas:

• Flexibility to allow workload mobility across any floor tile in any site
• Resiliency to maintain service levels even in failure conditions (better fault isolation)
• Multi-tenancy capabilities and better workload segmentation
• Performance to provide for adequate bandwidth and predictable latency, independent of scale for demanding workloads
• Scalability from small environments to cloud scale while maintaining the above characteristics

As a result, modern data center networks are evolving from traditional hierarchical designs to horizontally-oriented spine-leaf architectures with hosts and services distributed throughout the network. These networks are capable of supporting the increasingly common east-west traffic flows experienced in modern applications. In addition, there are clustering technologies and virtualization techniques that require Layer 2 adjacency.

Evolving user demands and application requirements suggest a different approach that is simpler and more agile. Ease of provisioning and speed are now critical performance metrics for data center network infrastructure that supports physical, virtual, and cloud environments, without compromising scalability or security. These are the main drivers for the industry to look at Software-Defined Networking (SDN) solutions.

Cisco Application Centric Infrastructure (ACI) is an innovative data center architecture that simplifies, optimizes and accelerates the entire application lifecycle through a common policy management framework. ACI provides a turnkey solution to build and operate an automated cloud infrastructure. An alternative option is a VXLAN Fabric with BGP EVPN control plane that provides a scalable, flexible and manageable solution to support the growing demands of cloud environments.

This chapter introduces the concepts of VXLAN EVPN and the problems it has been designed to solve.

Why VXLAN Overlay

Network overlays are a technique used in state-of-the-art data centers to create a flexible infrastructure over an inherently static network by virtualizing the network. Before going into the details of how overlays work, the challenges they face, and the solutions to overlay problems, it's worth spending some time to understand why traditional networks are so static.

When networks were first developed, there was no such thing as an application moving from one place to another while it was in use. As a result, the original architects of TCP/IP used the IP address as both the identity of a device and its location on the network. This was a perfectly reasonable thing to do as computers and their applications did not move, or at least they did not move very fast or very often.

Today, in the modern data center, applications are often deployed on virtual machines (VMs) or containers. The virtualized application workload can be stretched across multiple locations. The application endpoints (VMs, containers) can also move among different hosts. Their identities (IP addresses) no longer indicate their location. Due to the tight coupling of an endpoint's location with its identity in the traditional network model, the endpoint may need to change its IP address to indicate the new location when it moves. This breaks the seamless mobility model required by virtualized applications. Therefore, the network needs to evolve from the static model to a flexible one in order to continuously support communications among application endpoints regardless of where they are. One approach is to separate the identity of an endpoint from its physical location on the network so the location can be changed at will without breaking communications to the endpoint. This is where overlays come into the picture.

An overlay takes the original message sent by an application and encapsulates it with the location it needs to be delivered to before sending it through the network. Once the message arrives at its final destination, it is decapsulated and delivered as desired. The identities of the devices (applications) communicating are in the original message, and the locations are in the encapsulation, thus separating the location from the identity. This encapsulation and decapsulation is done on a per-packet basis and therefore must be done very quickly and efficiently.
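
The separation of identity and location described above can be illustrated with a minimal sketch. This is a conceptual model only, not an implementation of VXLAN: the class names, function names, endpoint names and addresses are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InnerPacket:
    """Original message: carries the identities of the endpoints."""
    src_id: str
    dst_id: str
    payload: bytes


@dataclass(frozen=True)
class OuterPacket:
    """Encapsulated message: the outer addresses are network locations."""
    src_location: str
    dst_location: str
    inner: InnerPacket


def encapsulate(inner, local_location, remote_location):
    # The original message is wrapped unchanged; only locations are added.
    return OuterPacket(local_location, remote_location, inner)


def decapsulate(outer):
    # At the destination, the wrapper is removed and the original delivered.
    return outer.inner


original = InnerPacket("vm-a", "vm-b", b"hello")
in_transit = encapsulate(original, "10.0.0.1", "10.0.0.2")
delivered = decapsulate(in_transit)

# If vm-b moves, only the outer location changes; its identity does not.
after_move = encapsulate(original, "10.0.0.1", "10.0.0.3")
print(delivered == original, after_move.inner == original)
```

Because the inner message is never modified, an endpoint can change location without changing its identity; only the outer header differs.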

Today, according to market research, approximately 60-70% of all application workloads are virtualized; however, more than 80% of the servers in use today are not running a hypervisor. Of course, every data center is unique and the mix of servers running virtualized workloads vs. non-virtualized workloads covers the entire spectrum. Any network solution for the data center must address this mix.

Cisco, in partnership with other leading vendors, proposed the Virtual Extensible LAN (VXLAN) standard to the IETF as a solution to the data center network challenges posed by traditional VLAN technology. The VXLAN standard provides for the elastic workload placement and higher scalability of Layer 2 segmentation that is required by today’s application demands.

VXLAN is designed to provide the same Ethernet Layer 2 network services as VLANs do today, but with greater extensibility and flexibility. Implementing VXLAN technologies in the network will provide the following benefits to every workload in the data center:

• Flexible placement of any workload in any rack throughout and between data centers
• Decoupling between physical and virtual networks
• Large Layer 2 network to provide workload mobility
• Centralized management, provisioning, and automation, from a controller
• Scale, performance, agility and streamlined operations
• Better utilization of available network paths in the underlying infrastructure

Why a Control Plane

When implementing an overlay, there are three major tasks that have to be accomplished. Firstly, there must be a mechanism to forward packets through the network. Traditional networking mechanisms are effective for this.

Secondly, there must be a control plane where the location of a device or application can be looked up and the result used to encapsulate the packet so that it may be forwarded to its destination.

Thirdly, there must be a way to update the control plane such that it is always accurate. Having the wrong information in the control plane could result in packets being sent to the wrong location and likely dropped.

The first task, forwarding the packet, is something that networking equipment has always delivered. Performance, cost, reliability, and supportability are fundamental considerations for the network, and they must apply equally to both the physical and overlay networks.

The second task, control plane lookup and encapsulation, is really an issue of performance and capacity. If these functions were performed in software, they would consume valuable CPU resources and add latency when compared to hardware solutions.

The third component of an overlay is the means by which modifications to the control plane are updated across all network elements. This updating is a real challenge and a concern for any data center administrator due to the potential for application impact from packet loss if the control plane malfunctions.
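
The second and third tasks can be sketched as a small mapping database. This is a simplified illustration under stated assumptions: the `ControlPlane` class, the `forward` function, and the endpoint and leaf names are hypothetical, and a real control plane distributes this state across many devices rather than holding it in one dictionary.

```python
class ControlPlane:
    """Hypothetical mapping database: endpoint identity -> current location."""

    def __init__(self):
        self._locations = {}

    def update(self, endpoint, location):
        # Task three: keep the mapping accurate when endpoints appear or move.
        self._locations[endpoint] = location

    def lookup(self, endpoint):
        # Task two: find where an endpoint currently lives.
        return self._locations.get(endpoint)


def forward(control_plane, dst_endpoint, payload):
    """Return (location, payload) to send, or None if the packet is dropped."""
    location = control_plane.lookup(dst_endpoint)
    if location is None:
        return None  # no mapping: the packet cannot be encapsulated
    return (location, payload)


cp = ControlPlane()
cp.update("vm-b", "leaf-2")
first = forward(cp, "vm-b", b"data")

cp.update("vm-b", "leaf-3")           # vm-b moves to another leaf
second = forward(cp, "vm-b", b"data")

unknown = forward(cp, "vm-z", b"data")
print(first, second, unknown)
```

If the update for the move to `leaf-3` were lost, packets would continue to be sent to `leaf-2`, which is the failure mode the text warns about.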

VXLAN Control Plane




VXLAN as an overlay technology does not provide many of the mechanisms for scale and fault tolerance that other networking technologies have developed and that are now taken for granted. In a VXLAN network, each switch builds a database of its locally connected hosts. A mechanism is required so that other switches learn about those hosts. In a traditional network, there is no mechanism to distribute this information. The only control plane previously available was a data plane-driven model called flood and learn. For a host to be reachable, its information has to be flooded across the network. Ethernet networks have operated with this deficiency for decades.

As the demand for scalable networks increases, the effects of flood and learn need to be mitigated. For a VXLAN overlay, a control plane is required that is capable of distributing the Layer 2 and Layer 3 host reachability information across the network. Early implementations of VXLAN lacked the ability to carry Layer 2 network reachability information; therefore, Ethernet VPN (EVPN) extensions were added to Multi-Protocol BGP (MP-BGP) to carry this information.

MP-BGP EVPN
MP-BGP EVPN for VXLAN provides a distributed control plane solution that significantly improves the ability to build and interconnect SDN overlay networks. MP-BGP EVPN control plane for VXLAN offers the following key benefits:

• Control plane learning for end host Layer 2 and Layer 3 reachability information
• Ability to build a more robust and scalable VXLAN overlay network
• Supports multi-tenancy
• Provides integrated routing and bridging
• Minimizes network flooding through protocol-driven host MAC/IP route distribution
• ARP suppression to minimize unnecessary flooding
• Peer discovery and authentication to improve security
• Optimal east-west and north-south traffic forwarding
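
Two of the benefits listed above, protocol-driven host MAC/IP distribution and ARP suppression, can be illustrated together. The sketch below is conceptual and rests on loud assumptions: the `Leaf` class and its methods are hypothetical, direct method calls between objects stand in for MP-BGP EVPN route advertisements, and no real BGP behavior is modeled.

```python
class Leaf:
    """Toy leaf switch with a control-plane-populated host table."""

    def __init__(self, name):
        self.name = name
        self.peers = []
        self.hosts = {}  # host IP -> (host MAC, name of the leaf it sits behind)

    def learn_local_host(self, ip, mac):
        # A locally attached host is recorded and advertised to every peer.
        # (Assumption: this loop stands in for an EVPN MAC/IP advertisement.)
        self.hosts[ip] = (mac, self.name)
        for peer in self.peers:
            peer.receive_route(ip, mac, self.name)

    def receive_route(self, ip, mac, next_hop):
        self.hosts[ip] = (mac, next_hop)

    def handle_arp_request(self, target_ip):
        """Answer locally when the target is known, otherwise flood."""
        entry = self.hosts.get(target_ip)
        if entry is None:
            return ("flood", None)
        return ("reply", entry[0])


leaf1, leaf2 = Leaf("leaf-1"), Leaf("leaf-2")
leaf1.peers, leaf2.peers = [leaf2], [leaf1]

leaf2.learn_local_host("10.1.1.20", "00:00:00:00:00:20")

known = leaf1.handle_arp_request("10.1.1.20")
unknown = leaf1.handle_arp_request("10.1.1.99")
print(known, unknown)
```

Because `leaf-1` already holds the MAC/IP binding learned through the control plane, it can answer the ARP request itself instead of flooding it across the fabric.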

Looking Ahead

Even though VXLAN technology has attained a considerable degree of maturity in a very short time, the industry is already designing the next evolution of this technology.

Generic Protocol Encapsulation (VXLAN-GPE)


VXLAN is one of many data plane encapsulations available. Examples of other UDP-based encapsulations are LISP (Locator/ID Separation Protocol) and OTV (Overlay Transport Virtualization). These three encapsulations are very similar; the differences lie in the overlay shim header. While all three use the same size header, the field allocation and the naming are slightly different. Within the encapsulation, there are also variations. While VXLAN maintains an inner MAC header, LISP only carries an inner IP header. It becomes evident that an approach for header extensions is needed to avoid adding yet another UDP-based encapsulation.
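
To make the notion of an overlay shim header concrete, the sketch below packs and unpacks the 8-byte VXLAN header as laid out in RFC 7348: an 8-bit flags field, 24 reserved bits, the 24-bit VNI, and 8 more reserved bits. The function names are illustrative, and the sketch covers only the shim header, not the outer Ethernet, IP and UDP headers.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: the VNI field is valid


def pack_vxlan_header(vni):
    """Build the 8-byte VXLAN shim header for a given VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    word0 = VXLAN_FLAG_VNI_VALID << 24   # flags (8 bits) + reserved (24 bits)
    word1 = vni << 8                     # VNI (24 bits) + reserved (8 bits)
    return struct.pack("!II", word0, word1)


def unpack_vxlan_header(header):
    """Return (flags, vni) from the first 8 bytes of a VXLAN shim header."""
    word0, word1 = struct.unpack("!II", header[:8])
    return word0 >> 24, word1 >> 8


header = pack_vxlan_header(5000)
flags, vni = unpack_vxlan_header(header)
print(header.hex(), flags, vni)
```

Most of the header is reserved, which leaves room for the kind of header extensions discussed in the text.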

VXLAN-GPE was invented to bring some consolidation to the UDP-based encapsulation family. A major part of VXLAN-GPE is the inclusion of a protocol-type field to define what is being encapsulated and to set the meaning of the various flags and options in the overlay shim header. This protocol type describes the packet payload; currently defined types include IPv4, IPv6, Ethernet, and Network Service Header (NSH).

A prominent example of the need for this flexible protocol extension is Service Chaining and the related NSH approach.

NSH makes it possible to dynamically specify that certain network traffic is sent through a chain of one or more network services. The goal of NSH is to create a topology-independent way of specifying a service path. NSH also includes a number of mandatory, fixed-size context headers designed to capture network platform information. NSH even contains an optional variable length metadata field for additional extensibility and is designed to include all required information inside fixed-size fields.

Creation of yet another encapsulation protocol stands to add more confusion to the already crowded encapsulation protocol space. The extensibility of VXLAN-GPE and NSH promises to both reduce the number of encapsulations in the industry and accommodate future network encapsulation requirements. Geneve, VXLAN-GPE, and NSH are all recent protocol drafts proposed to the IETF. The three protocols provide similar approaches to achieve flexible protocol mappings. While Geneve uses variable length options, VXLAN-GPE and NSH use fixed size options. Cisco supports open standards and will continuously reevaluate support for future encapsulations.

Evolution of the EVPN Control Plane


The current implementation of the EVPN control plane is focused on delivering scalable data center Fabrics with mobility and segmentation. As EVPN control plane implementations become more complete, the EVPN control plane may address additional use cases such as Data Center Interconnect (DCI). The complete theoretical definition of the EVPN control plane is captured in a series of Internet drafts being worked on at the IETF. The general specification of EVPN accommodates use cases beyond the Data Center Fabric, including Layer 2 Data Center Interconnect.

In order to properly address the DCI requirements, the EVPN control plane implementation must be expanded to include the multi-homing functionality defined in the EVPN specification to deliver failure containment, loop protection, site-awareness, and optimized multicast replication.
Fundamental Concepts

Introduction

In the networking world, an overlay network is a virtual network running on top of a physical network infrastructure. The physical network provides an underlay function, offering the connectivity and services required to support the virtual network instances delivered in the overlay. The virtual network allows for an independent set of network services to be offered regardless of the underlay infrastructure, even though those services may be the same. As an example, it is possible to deliver Layer 2 connectivity services on top of a Layer 3 network infrastructure via an overlay network. A common example of this would be VPLS service offered over a carrier's MPLS infrastructure.

An overlay network typically provides transport of network traffic between tunnel endpoints on top of the underlay by encapsulating traffic at one tunnel endpoint and decapsulating it at the other. The tunnel endpoint may be delivered through a physical network device that performs tunnel encapsulation/decapsulation in hardware. It also may be virtual, with the tunnel endpoint process running in a hypervisor. A hardware tunnel endpoint provides greater performance by leveraging hardware-based forwarding, but has less flexibility for implementing new capabilities. In contrast, a software endpoint provides increased flexibility but at the cost of limited performance.

This chapter provides an overview of the concepts required to have a basic understanding of the technology and how it works.

What is VXLAN?

Virtual Extensible LAN (VXLAN) as defined in RFC 7348 is an overlay technology designed to provide Layer 2 and Layer 3 connectivity services over a generic IP network. IP networks provide increased scalability, balanced performance and predictable failure recovery. VXLAN achieves this by tunneling Layer 2 frames inside of IP packets. VXLAN requires only IP reachability between the VXLAN edge devices, provided by an IP routing protocol.

There are pros and cons to consider when selecting the underlay routing protocol and these are discussed in more detail in the Single-POD VXLAN Design Chapter.

The VXLAN standard defines the packet format illustrated by the following diagram:

Figure: VXLAN Packet Format



VXLAN uses an 8-byte header that consists of a 24-bit identifier (VNID) and multiple reserved bits. The VXLAN header, along with the original Ethernet frame, is placed in the UDP payload. The 24-bit VNID is used to identify Layer 2 segments and to maintain Layer 2 isolation between the segments. With 24 bits allocated for the VNID, VXLAN can support up to 16 million logical segments.
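The header layout described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the RFC 7348 field order (an 8-bit flags field with the "VNID valid" bit set, 24 reserved bits, the 24-bit VNID, and 8 reserved bits); the function and constant names are hypothetical and exist only for this example.

```python
import struct

# Flag bit that marks the VNID field as valid (RFC 7348 "I" flag).
VNID_VALID_FLAG = 0x08
# A 24-bit identifier gives 2**24 possible segments ("16 million").
MAX_SEGMENTS = 2 ** 24


def pack_vxlan_header(vnid: int) -> bytes:
    """Build the 8-byte VXLAN header for a given 24-bit VNID."""
    if not 0 <= vnid < MAX_SEGMENTS:
        raise ValueError("VNID must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits).
    # Word 2: VNID (24 bits) + reserved (8 bits).
    return struct.pack("!II", VNID_VALID_FLAG << 24, vnid << 8)


def unpack_vnid(header: bytes) -> int:
    """Extract the VNID from the first 8 bytes of a VXLAN payload."""
    flags_word, vnid_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VNID_VALID_FLAG:
        raise ValueError("VNID-valid flag is not set")
    return vnid_word >> 8
```

In a real packet these 8 bytes would be followed by the original Ethernet frame, and the whole block would form the UDP payload.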

The terminology used when describing the key components of a VXLAN Fabric includes:

• VTEP – Virtual Tunnel Endpoint: The hardware or software element at the edge of the network responsible for instantiating the VXLAN tunnel and performing VXLAN encapsulation and decapsulation
• VNI – Virtual Network Instance: a logical network instance providing Layer 2 or Layer 3 services and defining a Layer 2 broadcast domain
• VNID – Virtual Network Identifier: a 24-bit segment ID that allows the addressing of up to 16 million logical networks to be present in the same administrative domain
• Bridge Domain: A set of logical or physical ports that share the same flooding or broadcast characteristics

The VXLAN tunnel endpoint function can be performed by a hardware device or by a software entity such as a hypervisor. The main advantage of using a hardware-based tunnel endpoint is the enhanced performance offered through the capabilities of the switch ASICs.

Alternatively, a software-based VTEP removes the dependency on the hardware switches, albeit at the expense of performance. Additionally, VXLAN deployments could adopt hybrid approaches, where the VXLAN tunnels are established between hardware and software VTEPs. More information on this can be found in the Software Overlays chapter.

As discussed in the introduction, the use of VXLAN technology brings several benefits to Data Center networking which include:

• Multi-tenancy: VXLAN Fabrics inherently support multi-tenancy both at Layer 2 (separate Layer 2 VNIs represent logically isolated bridging domains) and Layer 3 (by defining different VRFs for each supported tenant)
• Mobility: The overlay capability offered by VXLAN provides Layer 2 extension service across the data center to provide flexible deployment and mobility of physical and virtual endpoints
• Increased Layer 2 segment scale: VLAN-based designs are limited to a maximum of 4,096 Layer 2 segments due to the use of a 12-bit VLAN ID. VXLAN introduces a 24-bit VNID that theoretically supports up to 16 million distinct segments
• Multi-path Layer 2 support: Traditional Layer 2 networks support one active path because Spanning Tree (STP) expects and enforces a loop-free topology by blocking redundant paths. A VXLAN Fabric leverages a Layer 3 underlay network for the use of multiple active paths

How Does VXLAN Work?

Data Plane
VXLAN requires an underlying transport network that performs data plane forwarding. This data plane forwarding is required to provide unicast communication between endpoints connected to the Fabric. The following diagram illustrates data plane forwarding in a VXLAN network.

Figure: VXLAN Overlay Network



At the same time, the underlay network can be used to deliver multi-destination traffic to endpoints connected to a common Layer 2 broadcast domain in the overlay network. Often this traffic is referred to as BUM, since it includes Broadcast, Unknown Unicast and Multicast traffic.

Two different approaches can be taken to allow transmission of BUM traffic across the VXLAN Fabric:

1. Leverage multicast technology in the underlay network (Protocol Independent Multicast or PIM), to make use of the native replication capabilities of the Fabric spines to deliver traffic to all the edge VTEP devices.

2. In scenarios where multicast cannot be deployed, it is possible to make use of the source-replication capabilities of the VTEP nodes that create multiple unicast copies of the BUM frames to be sent to each remote VTEP device. This approach is not as efficient as using multicast for BUM traffic replication.
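The efficiency difference between the two approaches can be made concrete by counting how many copies of a single BUM frame the source VTEP itself has to transmit. The following is a simplified sketch; the function name is hypothetical, and it deliberately ignores where in the underlay the multicast replication takes place.

```python
def copies_sent_by_source_vtep(vteps_in_segment: int, use_multicast: bool) -> int:
    """Return how many copies of one BUM frame the source VTEP transmits.

    vteps_in_segment counts every VTEP that participates in the Layer 2
    segment, including the source VTEP itself.
    """
    remote_vteps = max(vteps_in_segment - 1, 0)
    if remote_vteps == 0:
        return 0  # nobody else to deliver the frame to
    if use_multicast:
        # One copy is sent to the multicast group; the underlay replicates it.
        return 1
    # Source replication: one unicast copy per remote VTEP.
    return remote_vteps
```

With 40 VTEPs in a segment, for example, source replication has the sending VTEP emit 39 unicast copies of each BUM frame, whereas with underlay multicast it emits one.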

VXLAN doesn't change the semantics of Layer 2 or Layer 3 forwarding and allows the VTEP to perform bridging and routing functions while leveraging the VXLAN tunnel for data plane forwarding. As such, the VTEP offers a set of different gateway functions as outlined in the following diagram.

• Layer 2 Gateway: VXLAN to VLAN bridging maps a VNI segment to a VLAN to create a common bridge domain
• Layer 3 Gateway (VXLAN Router): VXLAN to VXLAN routing provides Layer 3 connectivity between two VNIs natively so no decapsulation function is required
• Layer 3 Gateway (VXLAN Router): VXLAN to VLAN routing provides Layer 3 connectivity between a VNI and a VLAN

Figure: VXLAN Gateway Functions


Control Plane
The VXLAN RFC has to date only concerned itself with the transport (data plane) of traffic, ensuring connectivity to all hosts in a VXLAN domain. The control plane, or method by which VXLAN reachability and learning occurs, was achieved through what is known as flood and learn behavior. Simply speaking, flood and learn is a data-driven methodology wherein a VTEP that doesn't know the location of a given destination MAC floods the frame onto the VXLAN's associated multicast group. Multicast is typically used in order to provide a more manageable approach to multi-destination traffic. Instead of learning the source interface associated with a frame's source MAC address, the receiving VTEP learns the encapsulating source IP address of the remote VTEP. Flood and learn methodology is concerned with both the discovery (between peers) of VTEPs as well as remote endpoint location learning.
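The flood and learn behavior described above can be sketched as a small table keyed by MAC address. This is an illustrative model only, not any vendor's implementation; the class name, method names, and the addresses in the usage note are hypothetical.

```python
class FloodAndLearnVtep:
    """Toy model of data-driven MAC learning on a single VNI."""

    def __init__(self, bum_group: str):
        # Multicast group associated with this VXLAN segment.
        self.bum_group = bum_group
        # Inner source MAC -> outer source IP of the remote VTEP.
        self.mac_table: dict[str, str] = {}

    def receive(self, inner_src_mac: str, outer_src_ip: str) -> None:
        """Learn from a decapsulated frame: the MAC lives behind that VTEP."""
        self.mac_table[inner_src_mac] = outer_src_ip

    def outer_destination(self, inner_dst_mac: str) -> str:
        """Pick the outer destination IP for a frame entering the overlay.

        Known MAC: unicast to the learned remote VTEP.
        Unknown MAC: flood to the segment's multicast group.
        """
        return self.mac_table.get(inner_dst_mac, self.bum_group)
```

For instance, a VTEP created with group `239.1.1.1` floods a frame for an unknown MAC to `239.1.1.1`; once a frame from that MAC has arrived from a remote VTEP at `10.0.0.2`, later frames for it are sent as unicast to `10.0.0.2`.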

While flood and learn methodology presents a reasonably low barrier to entry for network vendors to implement a VXLAN stack, the drawback to flood and learn is first and foremost scalability. The amount of additional multicast traffic introduced into an environment can be difficult to predict and as such has been a barrier to adoption for some enterprise customers.

In order to address the concerns of scalability, the concept of a control plane to manage MAC learning and VTEP peer discovery is desirable, and preferably one that could be based on existing protocols that are generally well understood. Multi-Protocol Border Gateway Protocol (MP-BGP) with Ethernet Virtual Private Network (EVPN) extensions has been proposed as the IETF standard control plane for VXLAN. Based on the existing MP-BGP standard, the MP-BGP EVPN control plane provides protocol-based VTEP peer discovery and endpoint reachability information distribution that allows more scalable VXLAN overlay network designs. The MP-BGP EVPN control plane introduces a set of features that reduces the amount of traffic flooding in the overlay network and enables optimal forwarding for both east-west and north-south traffic. Relevant to the data center use case, EVPN provides reachability information for both L2 and L3 endpoints. Extending this level of reachability, and adding the capability for ARP suppression, reduces the required amount of flooding in the network. One additional benefit of the EVPN control plane is that it provides VTEP peer discovery and authentication, mitigating the risk of rogue VTEPs in the VXLAN overlay network.

In order to understand MP-BGP EVPN functionality, it is helpful to have a background understanding of MP-BGP as it is commonly used in MPLS networks. A traditional MPLS network has a full mesh of BGP routers, or route reflectors for scaling, that exchange reachability and profile information for L3VPNs (or L2VPNs in the case of VPLS for example). The combination of route distinguishers (RD) and VPNv4 addresses ensures the ability to uniquely identify a target, and routes can be selectively learned using route target (RT) filtering.
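The roles of the route distinguisher and the route target can be illustrated with a short sketch. It is a simplified model rather than a BGP implementation: the class, the function, and the RD/RT values in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VpnRoute:
    rd: str                    # route distinguisher, e.g. "65000:1"
    prefix: str                # tenant prefix, possibly overlapping with others
    route_targets: frozenset   # route targets attached on export

    @property
    def vpn_prefix(self) -> str:
        """RD + prefix: unique even when tenants reuse the same prefix."""
        return f"{self.rd}:{self.prefix}"


def import_routes(routes, import_rts):
    """Keep only routes carrying at least one of the locally imported RTs."""
    wanted = set(import_rts)
    return [route for route in routes if route.route_targets & wanted]
```

Two tenants can both advertise `10.1.1.0/24`: with RDs `65000:1` and `65000:2` the resulting VPN prefixes differ, and a VRF that imports only route target `65000:100` learns only the routes exported with that target.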

In the EVPN control plane, there are technically three data plane options: Multi-Protocol Label Switching (MPLS, draft-ietf-l2vpn-evpn), Provider Backbone Bridging (PBB, draft-ietf-l2vpn-pbb-evpn), and Network Virtualization Overlay (NVO, draft-ietf-bess-evpn-overlay).

Figure: EVPN IETF Draft

For the purposes of this book, NVO will be assumed when discussing EVPN.

Networking in a VXLAN Fabric

In traditional Layer 2 access networks, the Layer 3 default gateway is most commonly placed at the aggregation layer. Generally, the pair of aggregation switches leverage a first-hop redundancy protocol such as HSRP, VRRP or GLBP to provide a redundant default gateway IP address. Depending on configuration and protocol, these may be configured for active/standby or active/active redundancy.

With the rise of virtualization in the data center, the physical design of the network and its logical representation are increasingly different. Virtualization encourages workload mobility and this introduces inefficiencies, given that default gateway placement was predicated upon the physical location of network resources. Traffic forwarding continues to function; however, the hair-pinning of traffic through a distant gateway makes forwarding suboptimal.

Distributed Anycast Gateway


The use of the MP-BGP EVPN control plane introduces Distributed Anycast Gateway functionality. In this model, the default gateway function is fully distributed across all leaf nodes within the VXLAN Fabric. Leveraging the Distributed Anycast Gateway function provides improved efficiency and higher cross-sectional bandwidth while eliminating the need to run a First Hop Redundancy Protocol (FHRP). Furthermore, routed traffic between workloads connected to the same leaf is locally forwarded without having to be sent to the spine layer. By decreasing hop count, the Distributed Anycast Gateway reduces network latency.

Integrated Routing and Bridging


EVPN VXLAN Fabrics introduce Integrated Routing and Bridging (IRB) functionality, which offers the capability of both Layer 2 and Layer 3 forwarding directly at the leaf switch. This is fundamental to the Distributed Anycast Gateway function which provides a distributed default gateway capability closest to the endpoints.

Asymmetric vs Symmetric Forwarding


The EVPN draft defines two different methods for routing traffic between VXLAN overlays. The first method is referred to as asymmetric IRB and the second is known as symmetric IRB.

With asymmetric IRB, the ingress VTEP is performing both routing and bridging, whereas the egress VTEP is only performing bridging. As a result, the return traffic will take a different VNI than the source traffic. This necessitates that the source and destination VNIs reside on both the ingress and egress VTEPs. This leads to a more complex configuration as all switches need to be configured for all possible VNIs. Perhaps a more pressing consideration is the scaling implications of all devices potentially needing to learn a considerably larger number of endpoints.

Figure: Asymmetric IRB



In symmetric IRB, both the ingress and egress VTEP provide both L2 and L3 forwarding. This results in predictable forwarding behavior. As a result, only the VNIs of locally-attached endpoints need to be defined in a VTEP (plus the transit L3 VNI), which in turn simplifies configuration and reduces scale requirements through optimized use of ARP and the MAC address table. This results in better scale in terms of the total number of VNIs a VXLAN Fabric can support.
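The configuration difference between the two IRB models can be expressed as the set of VNIs each VTEP has to carry. The sketch below is a simplified model of the rule stated in the text; the function name and the VNI numbers in the usage note are hypothetical.

```python
def vnis_required_on_vtep(local_l2_vnis, tenant_l2_vnis, l3_vni, symmetric):
    """Return the set of VNIs that must be defined on one VTEP.

    local_l2_vnis:  L2 VNIs that have endpoints attached to this VTEP
    tenant_l2_vnis: every L2 VNI of the tenant anywhere in the Fabric
    l3_vni:         transit L3 VNI used by symmetric IRB
    """
    if symmetric:
        # Only locally attached segments plus the transit L3 VNI.
        return set(local_l2_vnis) | {l3_vni}
    # Asymmetric IRB: the ingress VTEP routes into the destination VNI,
    # so every possible destination VNI must exist on every VTEP.
    return set(tenant_l2_vnis)
```

For a tenant with four L2 VNIs (30001 to 30004) and a leaf that only hosts endpoints in 30001, symmetric IRB needs two VNIs on that leaf (30001 and the L3 VNI), while asymmetric IRB needs all four L2 VNIs.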

Figure: Symmetric IRB

It is important to keep in mind that as both methods are defined in the standard, consideration must be given to device selection and the implications for interoperability. For example, Cisco supports only symmetric IRB on the Nexus platforms as it offers better scalability.
Software Overlays

Introduction

Server virtualization has transformed the way in which data centers are operated, and the vast majority of data centers today implement it to some degree. However, making the assumption that those data centers run exclusively virtualized workloads would be a mistake. Many organizations still make use of mainframes, for example. Moreover, new applications that do not require server virtualization are taking center stage, such as cloud-based software that makes use of Linux containers, or modern scale-out applications such as Big Data, that deliver operational benefits and scale without the need of a hypervisor.

Although VXLAN is a generic overlay concept that is commonly deployed in the network, it is sometimes associated with server virtualization and hypervisors. This chapter covers the advantages and disadvantages of implementing VXLAN on virtualized hosts, and how to realize the most benefit out of this technology, keeping in mind that one of the main reasons for interest in VXLAN is its openness, which avoids lock-in to a particular network vendor or hypervisor.

Host-Based Overlay

Server virtualization offers significant benefits including flexibility and agility in delivering compute services in the data center. Traditionally, networking to the hypervisor is provided via VLAN transport, and there is a new trend to adopt host-based VXLAN overlays to improve agility and automation of the network layer.

Figure: Host-Based Overlay

The host-based overlay typically runs between host VTEPs over an IP transport and offers ease of deployment and automation capability, empowering the server team to deliver virtual networking services without needing to involve the network team. This ability to automate networking directly through the Virtual Machine Manager (VMM) as a software only overlay often results in a sub-optimal network solution which does not take into account the broader aspects of operations, integration, and performance for the network as a whole.

In addition to the CPU impact introduced with host-based overlays, the network team has to invest extra effort in troubleshooting due to the lack of correlation between the overlay and the underlay networks. With regard to CPU impact, the performance of a software VTEP is dependent on CPU and memory available on the hypervisor. Some implementations run the VTEP function in kernel space, others in user space. Both options must deliver the necessary packet processing required for efficient application delivery. These solutions typically struggle to deliver line-rate throughput even with hardware assistance at the server NIC.

Additionally, host-based overlay network solutions are primarily focused on networking for virtual servers without consideration for physical workloads or other existing services inside or outside the data center. Connectivity to both physical servers and resources beyond the virtual network typically requires gateways, either in software or hardware, which must be integrated with the physical network.

In summary, when evaluating host-based overlay solutions, it is critical to consider the broader business and technical implications for the data center including licensing cost, compute overhead, performance penalty, additional gateway infrastructure requirements, and the impact to network operations.

An Alternative to Host-Based Overlays


A network-based overlay deploys network switches as VXLAN tunnel endpoints (VTEPs). Compared to host-based overlay solutions, network overlays deliver hardware-accelerated encapsulation provided by the ASICs at the core of a network switch. By offloading the VXLAN overlay functions to the network switch, a simple VLAN-based deployment can be used to connect physical or virtual workloads to the VXLAN Fabric. By removing the burden of data traffic forwarding and encapsulation from the hypervisor, CPU resources are freed. In other words, the hypervisor can allocate all available hardware resources to its key function, serving Virtual Machines (VMs) and applications.

With a VXLAN EVPN Fabric and the associated operation and management tools, it is possible to deliver the flexibility of automated network provisioning for virtual machines while overcoming some of the previously discussed limitations affecting the host-based overlay model. Deploying a network-based overlay does not adversely affect overall network performance, visibility, or the troubleshooting of network issues.

Figure: Network-Based Overlay

The VM Tracker (VMT) function on Nexus leaf switches provides visibility of the hypervisor hosts and the VMs connected to the VXLAN Fabric so the network can make decisions based on that information. For example, the VM Tracker auto-config feature enables automated provisioning of network resources to support virtual machines in a VMware vSphere environment. VM Tracker communicates with VMware vCenter Server to retrieve information relating to the virtual network configuration. The information includes:

• vSphere ESX host to VM mappings
• VM state
• Physical port attachments
• VDS port groups assignment to VM

The network configuration that is dynamically provisioned, based on the information above, includes the following attributes:

• VLAN provisioning
• VNI allocation
• L3 gateway
• VRF provisioning

Figure: vCenter Integration for Dynamic Network Configuration



For example, proper VLAN and VNI configuration is deployed on a specific leaf when the first VM associated with that VNI is locally connected, and removed when the last VM is migrated to a different server or powered off. With such functionality, the server administrator can achieve the agility required to quickly deploy VMs, and at the same time provision network resources in a dynamic manner.
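The "provision on first VM, remove on last VM" behavior amounts to reference counting per VNI on each leaf. The sketch below illustrates only that idea; it is not how VM Tracker is implemented, and the class and method names are hypothetical.

```python
from collections import Counter


class LeafAutoConfig:
    """Toy per-leaf tracker of how many local VMs use each VNI."""

    def __init__(self):
        self._vm_count = Counter()

    def vm_attached(self, vni: int) -> bool:
        """Record a VM attaching; return True if the VNI must be provisioned."""
        self._vm_count[vni] += 1
        return self._vm_count[vni] == 1  # first local VM on this VNI

    def vm_detached(self, vni: int) -> bool:
        """Record a VM leaving; return True if the VNI can be removed."""
        if self._vm_count[vni] == 0:
            raise ValueError(f"no local VM is attached to VNI {vni}")
        self._vm_count[vni] -= 1
        if self._vm_count[vni] == 0:
            del self._vm_count[vni]  # last local VM has gone
            return True
        return False

    def provisioned_vnis(self) -> set:
        """VNIs that currently need VLAN/VNI configuration on this leaf."""
        return set(self._vm_count)
```

A VM "leaving" here covers both cases named in the text: migration to a different server and power-off.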

This solution provides the performance of hardware-based encapsulation without having to upgrade the host physical NICs. This is especially relevant in the case of virtual network functions. For example, certain host-based overlays use a virtual gateway to interact with the rest of the world, as already mentioned in a previous section. The performance discussion is critical in this case, because that single virtual gateway might become a bottleneck for the whole virtual environment.

In conclusion, by leveraging network-based overlays it is possible to achieve line-rate throughput for both east-west and north-south traffic flows while eliminating the need for software gateways. At the same time, the network administrator gains visibility of the virtual infrastructure attached to the Fabric, assisting with the ongoing operations and troubleshooting of the environment. Finally, VM Tracker functionality ensures that the virtualization administrator gains network agility on top of the benefits that server virtualization already provides.

Hybrid Overlays with VXLAN EVPN


As previously discussed, pure host-based overlays bring little value to data centers, but there are situations where a hybrid approach might solve some challenges or use cases. Service Providers have very specific requirements regarding network management and operations including:
• Support for a mix of software and hardware VTEPs
• Integration with the hypervisor layer
• Support multi-vendor Fabrics
• Overlay and underlay are operated by different teams

Hybrid VXLAN overlays consist of both host-based software VTEPs and switch-based hardware VTEPs. A cohesive operational and management model is needed to integrate the two types of VTEP together. Cisco Virtual Topology System (VTS) is an example of such a solution.

Further details about the Cisco Virtual Topology System are provided in the Management And Operations chapter, but below is a brief summary of Cisco VTS architecture.

Figure: Hybrid Overlays

Cisco Virtual Topology System (VTS) provisions hardware and software VTEPs. The ability to integrate a VXLAN software-based VTEP allows the deployment of the VXLAN technology on top of legacy network hardware or to complement hardware-based VTEP deployments.

The Virtual Topology Controller (VTC) is the single point of management for hybrid overlays to configure, manage and operate a VXLAN Fabric with MP-BGP EVPN control plane. The management layer supports integration with hypervisors such as VMware vSphere or Openstack/KVM so that network constructs can be directly provisioned from the hypervisor User Interface. The northbound REST APIs enable integration with third party tools.

The control plane is represented by a virtualized IOS-XR router to provide integration with MP-BGP EVPN and advertise reachability information to the software VTEP itself over an API. The software VTEP named Virtual Topology Forwarder (VTF) provides VXLAN encapsulation capability in the hypervisor.

More details on Cisco Virtual Topology System architecture are available at https://www.cisco.com/go/vts
Single-POD VXLAN Design

Introduction

In classic hierarchical network designs, the access and aggregation layers together provide Layer 2 and Layer 3 functionality as a building block for data center connectivity. In smaller data center environments, this single building block would provide sufficient scale to meet the entire demands for connectivity and performance. As the environment scales to meet the increased demands of the larger data center, this building block is typically replicated with an additional core layer introduced to connect these together. These building blocks are commonly referred to as a Point of Delivery, or POD, and allow for consistent, modular scale as the environment grows.

Figure: Hierarchical Network Design

When designing a VXLAN Fabric, a single-POD also defines a single VXLAN Fabric based on a scalable spine-leaf architecture as shown in the diagram below.

Figure: VXLAN Fabric

A single VXLAN POD can scale to hundreds of switches and thousands of ports which will meet the demands of many enterprise data center environments; however, to meet more complex or larger scale requirements, the VXLAN POD may be replicated in the form of a multi-POD design. In a typical deployment with multiple data center locations, these VXLAN Fabrics, whether single or multi-POD-based, will be deployed together as a multi-site VXLAN design. Both the multi-POD and multi-site deployment types are described further in the multi-POD and multi-site Designs chapter. Additionally, the connectivity of Layer 2 and Layer 3 to the external network domain is covered in the External Connectivity for the VXLAN Fabric chapter.

This chapter explores the design considerations for building a single VXLAN POD comprising the underlay network foundation and the overlay network together with their associated data and control planes, as well as guidelines for endpoint connectivity to the Fabric.

Underlay

In building a VXLAN EVPN Fabric, it is essential to construct an appropriate underlay network as this will provide a scalable, available and functional foundation to support the overlay. This section includes important considerations for the underlay design.

Routed Interface Considerations

MTU

In order to improve the throughput and network performance, it is recommended to avoid fragmentation and reassembly on network devices performing VXLAN encapsulation and decapsulation. It is therefore required to increase the maximum transmission unit (MTU) in the transport network by at least 50 bytes (54 if an 802.1Q header is present in the encapsulated frame). If the overlay uses a 1500-byte MTU, the transport network needs to be configured to accommodate 1550-byte frames (1554 bytes if including the 802.1Q header) as a minimum. Jumbo frame support in the transport network is strongly recommended if the overlay applications use frame sizes larger than 1500 bytes.
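The 50-byte figure can be reproduced by adding up the headers that VXLAN places around the original payload. The breakdown below is a sketch that assumes an IPv4 underlay with no IP options; the constant and function names are hypothetical.

```python
# Bytes added in front of the original IP packet, as seen by the transport MTU.
INNER_ETHERNET_HEADER = 14  # original frame's MAC header, carried inside the tunnel
VXLAN_HEADER = 8
OUTER_UDP_HEADER = 8
OUTER_IPV4_HEADER = 20      # assumption: IPv4 underlay, no IP options
INNER_DOT1Q_TAG = 4         # only when the encapsulated frame keeps its 802.1Q tag


def vxlan_overhead(inner_dot1q: bool = False) -> int:
    """Total bytes the transport MTU must grow by to avoid fragmentation."""
    overhead = (INNER_ETHERNET_HEADER + VXLAN_HEADER
                + OUTER_UDP_HEADER + OUTER_IPV4_HEADER)
    if inner_dot1q:
        overhead += INNER_DOT1Q_TAG
    return overhead


def required_transport_mtu(overlay_mtu: int, inner_dot1q: bool = False) -> int:
    """Minimum transport MTU needed to carry a given overlay MTU."""
    return overlay_mtu + vxlan_overhead(inner_dot1q)
```

With a 1500-byte overlay MTU this yields 1550 bytes, or 1554 bytes when the encapsulated frame carries an 802.1Q tag, matching the values in the text.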

To ensure that VXLAN encapsulated packets can be successfully carried across the Fabric, the increased MTU must be configured on all the Layer 3 interfaces connecting the Fabric nodes.
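
As a minimal sketch of the MTU recommendation above, the fragment below raises the MTU on a routed Fabric link. The interface name Ethernet1/1 is a hypothetical example, and the jumbo value of 9216 is an assumption; the supported maximum depends on the platform and software release.

```
! Hypothetical leaf uplink toward a spine
interface Ethernet1/1
  no switchport
  ! A 1500-byte overlay frame plus 50 bytes of VXLAN overhead needs at least 1550;
  ! a jumbo value leaves headroom for larger overlay frames
  mtu 9216
  no shutdown
```

The same value would need to be applied on both ends of every routed link between Fabric nodes.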

Routed Interface Addressing


The connectivity between network devices in a VXLAN Fabric typically leverages routed point-to-point interfaces, which can simply be addressed with a /30 or even a /31 subnet mask. In a large data center Layer 3 underlay network, there will be many routed links, leading to high IP address consumption.

Following are the IP address requirements for a couple of different scenarios.

Figure: IP Address Requirement

In the example above, in a small network with 4 spine and 6 leaf switches, there will be a minimum of 24 point-to-point links, requiring a total of 68 addresses for the Fabric underlay. This number increases to 408 in a larger scale scenario of 4 spine and 40 leaf switches, because the number of links is the product of the number of spines and the number of leaves.

A recommended approach is to use IP unnumbered for the interface IP address configuration, requiring only a single IP address per device, regardless of the number of links deployed. As shown below, in the smaller scale example the total number of IP addresses consumed would be reduced to 16 for the entire Fabric underlay. When expanding the network, the IP address requirement will increase linearly with the number of devices.
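
The IP unnumbered approach described above can be sketched as follows. The loopback number, the address, and the interface name are hypothetical; on Cisco Nexus switches an Ethernet interface is assumed to need the point-to-point medium before it can borrow a loopback address.

```
! Hypothetical routing loopback whose address is shared by all fabric links
interface loopback1
  description Router-ID and source for unnumbered links
  ip address 10.0.0.11/32

! Hypothetical uplink toward a spine
interface Ethernet1/1
  no switchport
  ! Treat the Ethernet link as point-to-point
  medium p2p
  ! Borrow the loopback address instead of assigning a per-link subnet
  ip unnumbered loopback1
  no shutdown
```

Adding another uplink then only repeats the interface stanza, without consuming any additional address.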

Loopback Interface Addressing


As highlighted in the examples below, each leaf switch with a VTEP should have a minimum of two loopback interfaces. The first loopback is used as the Router-ID (RID) and for assigning an IP address to the unnumbered Layer 3 links. The second loopback represents the VTEP IP address used as source and destination for VXLAN encapsulated traffic.

Figure: IP Address Requirement with IP Unnumbered

Routing Protocol Considerations


There are numerous options for the underlay routing protocol; however, the choice is typically determined by which protocols are already in use and familiar to the network administrator. In making this decision, it is important to consider the protocol convergence characteristics, as these will determine the overall speed of convergence of the overlay network. Specifically, Open Shortest Path First (OSPF) and Intermediate System to Intermediate System (IS-IS) are two Interior Gateway Protocols (IGPs) that are particularly suitable for multi-stage spine-leaf Fabrics. As the spine-leaf design inherently provides multiple paths between leaf switches via the spine, SPF-based protocols will compute a topology consisting of multiple equal cost paths through the network and provide rapid convergence around failures.

While BGP also has merit as an underlay routing protocol, it is a Path Vector Protocol and primarily considers Autonomous Systems (AS) to calculate paths. Despite this, an experienced network engineer can manipulate BGP to achieve convergence outcomes comparable to SPF-based routing protocols by leveraging the many attributes and options available. The main perceived advantage of using BGP in the underlay is having only one routing protocol in use across the entire network (underlay + overlay). While this offers simplification, potential disadvantages exist due to the additional configuration required. Since the overlay predominantly uses a single path from VTEP to VTEP, it is assumed that the underlay provides multi-path forwarding. This is not the default forwarding behavior of BGP; therefore, specific attention is needed to achieve multi-pathing equivalent to what would be achieved when using SPF-based IGPs in the underlay.
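
For readers who do choose BGP in the underlay, the multi-path behavior mentioned above has to be enabled explicitly. The following is only a sketch: the AS number and the path count of 4 (matching a four-spine design) are assumptions, and the multipath-relax line is relevant only when equal-length paths traverse different neighbor AS numbers.

```
! Hypothetical leaf AS number
router bgp 65011
  ! Allow load sharing across paths with equal-length but different AS paths
  bestpath as-path multipath-relax
  address-family ipv4 unicast
    ! Install up to four equal cost eBGP paths
    maximum-paths 4
    ! Install up to four equal cost iBGP paths
    maximum-paths ibgp 4
```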

When selecting routing protocols for use in the underlay, it is imperative to consider how the overlay control plane protocol functions and should be configured. When the same protocol is used for the underlay and overlay, the separation of these two domains can become blurred. Therefore, when designing an overlay network, it is good practice to build the transport network independently, as has been done in MPLS. The deployment of an IGP in the underlay offers this separation of underlay and overlay control protocols. This provides a very lean routing domain for the transport network that consists of only loopback and point-to-point interfaces. At the same time, MAC and IP reachability for the overlay exists in a different protocol, namely MP-BGP EVPN.

OSPF Deployment Recommendation


OSPF is a link-state routing protocol commonly used in enterprise environments. The default OSPF interface type used for Ethernet interfaces is “Broadcast,” which inherently results in a Designated Router (DR) and/or Backup Designated Router (BDR) election, thus reducing routing update traffic. While this is fine in a Multi-Access network (such as a shared Ethernet segment), it is unnecessary in a point-to-point network.

In a point-to-point network, the “Broadcast” interface type of OSPF adds a DR/BDR election process and an additional Type-2 Link State Advertisement (LSA). This results in unnecessary overhead, which can be avoided by changing the interface type to “point-to-point”. In this way, the DR/BDR election process is avoided, reducing the amount of time needed to bring up the OSPF adjacency between the leaf and spine switches. In addition, with the point-to-point interface mode, the need for Type-2 LSAs is removed and only Type-1 LSAs are needed, since there is no Multi-Access (or Broadcast) segment present. As a result, the OSPF LSA database remains lean.
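
A minimal sketch of the point-to-point recommendation above is shown here. The process tag UNDERLAY, the router ID, the area, and the interface name are all hypothetical values chosen for illustration.

```
feature ospf

! Hypothetical OSPF process for the underlay
router ospf UNDERLAY
  router-id 10.0.0.11

! Hypothetical routed link between a leaf and a spine
interface Ethernet1/1
  no switchport
  ! Skip the DR/BDR election and avoid generating a Type-2 LSA for this link
  ip ospf network point-to-point
  ip router ospf UNDERLAY area 0.0.0.0
  no shutdown
```

The network type must match on both ends of the link for the adjacency to form.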

IS-IS Deployment Recommendation




Another standards-based IGP routing protocol is Intermediate System to Intermediate System (IS-IS). This link-state routing protocol is gaining popularity thanks to its fast convergence in large-scale environments, although it has primarily been deployed in service provider environments. IS-IS uses Connectionless Network Protocol (CLNP) for communication between peers and doesn't depend on IP. IS-IS does not run an SPF calculation on every link change; the SPF calculation only happens when there is a topology change, which helps with faster convergence and stability in the underlay. No significant tuning is required for IS-IS to achieve an efficient, fast converging underlay network.
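
As an illustration of how little configuration an IS-IS underlay needs, the sketch below enables IS-IS on one routed link. The process tag UNDERLAY, the NET value, the single-level design, and the interface name are assumptions made for the example.

```
feature isis

! Hypothetical IS-IS process for the underlay
router isis UNDERLAY
  ! Hypothetical Network Entity Title: area 49.0001, system ID 0000.0000.0011
  net 49.0001.0000.0000.0011.00
  ! Assumes a flat, single-level design
  is-type level-2

! Hypothetical routed link between a leaf and a spine
interface Ethernet1/1
  no switchport
  isis network point-to-point
  ip router isis UNDERLAY
  no shutdown
```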

IP Multicast Recommendation
IP multicast provides an efficient mechanism for the distribution of multi-destination traffic in the Fabric underlay.

To deploy IP multicast in the underlay, a Protocol Independent Multicast (PIM) routing protocol needs to be enabled and must be consistent across all the devices in the underlay network. The two common PIM modes are Sparse Mode (PIM-ASM) and Bidirectional (PIM-Bidir). Both require the deployment of Rendezvous Points (RPs).

Multicast Rendezvous Point (RP) Consideration


Several methods are available to achieve a highly available RP deployment, including, for example, the use of protocols such as Auto-RP and Bootstrap. However, to improve the convergence experience in the RP failure scenario, the recommendation is to deploy Anycast RP, which consists of using a common IP address on different devices to identify the RP. Simple static RP mapping configuration is then applied to each node in the Fabric to associate multicast groups to the RP, so that each source or receiver can utilize the RP that is closest from a topological point of view.

It is important to remember that the VTEP nodes represent the sources and destinations of the multicast traffic used to carry BUM traffic between endpoints connected to those devices.

Normally, the RPs would be deployed on the spine nodes, given the central position those devices have in the Fabric.

Figure: Multicast RP Placement

When deploying Anycast RP, it is critical to synchronize information between the different RPs deployed in the network, as sources and receivers may join different RPs depending on where they are connected in the network. Two mechanisms are supported on Cisco Nexus platforms to synchronize state information between RPs:

• Multicast Source Discovery Protocol (MSDP): this option has been around for a long time and is widely available across different switches and routers. MSDP sessions are established between RP devices to exchange information about the active sources for each given multicast group.
• PIM with Anycast RP: this option is currently supported only on Cisco Nexus platforms and leverages PIM as the control plane to synchronize state between RPs.
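
The PIM with Anycast RP option can be sketched as follows for one of two spine RPs. All addresses, the loopback number, the group range, and the interface name are hypothetical; the second spine would carry the same configuration, and the shared RP address is assumed to be advertised by the underlay IGP.

```
feature pim

! Shared Anycast RP address, configured identically on both spines
interface loopback254
  ip address 10.254.254.254/32
  ip pim sparse-mode

! Static RP mapping, applied on every node in the Fabric
ip pim rp-address 10.254.254.254 group-list 239.0.0.0/8

! Anycast RP set: one line per spine, using each spine's unique loopback address
ip pim anycast-rp 10.254.254.254 10.0.0.1
ip pim anycast-rp 10.254.254.254 10.0.0.2

! PIM must also be enabled on the routed fabric links
interface Ethernet1/1
  ip pim sparse-mode
```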

Ingress Replication
Ingress replication, also known as Head-End replication, may be used as an alternative to IP multicast to carry the BUM traffic inside the Fabric. One reason for using this alternate method is that IP multicast is not always an available option due to hardware and software constraints. IP multicast may also not be preferred due to its perceived complexity by the network operations team.

When deploying ingress replication, it is important to consider the overall scale of the Fabric and the amount of multi-destination traffic expected in the environment. This is because, for VXLAN EVPN ingress replication, the VXLAN VTEP uses a list of IP addresses of other VTEPs in the network to send BUM traffic as unicast traffic, creating multiple copies of the same traffic. It is worth noting that the deployment of the MP-BGP control plane enables the list of VTEPs connected to the same VXLAN Fabric to be built dynamically. These IP addresses are exchanged between VTEPs through the BGP EVPN control plane.

interface nve1
  no shutdown
  source-interface loopback0
  host-reachability protocol bgp
  member vni 30000
    ingress-replication protocol bgp
  member vni 30001
    ingress-replication protocol bgp

As shown in the configuration sample above, the ingress replication mode is configurable per L2VNI. It is not possible to mix multicast and ingress replication for the same L2VNI in the same VXLAN Fabric.

Overlay

After building a solid foundation for the VXLAN network with the underlay, the overlay concepts are equally important to provide the required functionality and flexibility.

VXLAN EVPN Control Plane


As an industry standard overlay technology, VXLAN has seen increasing adoption in the data center space. EVPN is the control plane for VXLAN and provides an efficient method for route learning and distribution in the VXLAN overlay network. The routing information includes Layer 2 MAC routes, Layer 3 host IP routes, and Layer 3 subnet IP routes. The EVPN control plane also introduces multi-tenancy support to the VXLAN overlay network, as well as VTEP peer discovery, security, and authentication mechanisms. This section is intended to provide a deeper understanding of the VXLAN EVPN control plane.

MP-BGP EVPN
EVPN uses MP-BGP as the routing protocol to distribute reachability information for the VXLAN overlay network, including endpoint MAC addresses, endpoint IP addresses, and subnet reachability information.

EVPN is another MP-BGP address family, leveraging constructs similar to those of the VPNv4 address family traditionally deployed in MPLS VPN architectures. Those constructs include VRFs, Route Distinguishers (RD), and Route Targets (RT). The distinguishing characteristic of the EVPN control plane when compared to VPNv4 is the capability of exchanging not only IP but also MAC address information.

Virtual Routing and Forwarding (VRF)




Virtual Routing and Forwarding (VRF) defines the Layer 3 routing domain for each tenant supported in the VXLAN Fabric. In VXLAN EVPN networks, each tenant VRF has a Layer 3 VNI used as a virtual backbone for routing within the VRF.

Route Distinguisher (RD)


The Route Distinguisher (RD) is the identifier of a VRF, since each VRF has its own unique RD in the network. When an EVPN advertisement is sent out to the peers, the RD of the VRF to which the route belongs is prepended to the original route to render it unique within the network. This allows different VRFs to use overlapping IP addresses, so that different tenants can have true autonomy for IP address management. The RD can be automatically defined to simplify configuration.

Route Target (RT)


The Route Target (RT) is an extended community attribute in EVPN route updates used to control route distribution in a multi-tenant network. EVPN VTEPs have an import RT setting and an export RT setting for each VRF and each L2VNI. When a VTEP advertises EVPN routes, it affixes its export RT to the route update. The routes will be received by other VTEPs in the network. These devices will compare the RT value carried with the route against their own local import RT setting. If the two values match, the route will be accepted and programmed in the routing table. Otherwise, the route will not be imported. The RT can be automatically defined to simplify configuration.
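
As an illustration of automatically defined RD and RT values for an L2VNI, the sketch below reuses VNI 30000 from the samples in this chapter. It assumes the EVPN control plane has already been enabled on the switch.

```
evpn
  ! Layer 2 VNI whose MAC/IP routes are advertised through EVPN
  vni 30000 l2
    ! Let the switch derive the Route Distinguisher
    rd auto
    ! Let the switch derive matching import and export Route Targets
    route-target import auto
    route-target export auto
```

With automatic values, every VTEP in the same BGP AS that defines the same VNI is expected to end up with matching import and export RTs, so no per-VNI values need to be coordinated by hand.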

EVPN Route Types


The EVPN control plane advertises different types of routing information:
• Type-2 - Endpoint reachability information, including MAC and IP addresses of the endpoints
• Type-3 - Multicast route advertisement, announcing the capability and intention to use Ingress Replication for specific VNIs
• Type-5 - IP prefix route used to advertise internal IP subnets and externally learned routes onto the VXLAN Fabric

The EVPN route update also includes the following information:

• VNID for the L2VNI and VNID for the L3VNI for the tenant VRF
• BGP next-hop IP address identifying the originating VTEP device
• Router MAC address of the originating VTEP device

Route Reflector Placement


As discussed in the previous chapter, iBGP is the most common routing protocol deployed for the EVPN control plane in VXLAN Fabrics. With iBGP, there is a requirement to have a full mesh between all of the iBGP speakers. To help scale and simplify the iBGP configuration, it is recommended to implement iBGP Route Reflectors (RR). It is recommended to place the iBGP route reflectors on the spines, as they are central to all of the leaf switches. In this case, two of the spines are configured as BGP route reflectors and all of the leaf switches are configured as BGP route reflector clients. The route reflectors reflect EVPN routes for the VTEP leaf switches.
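
The route reflector role on a spine can be sketched as follows. The AS number, the router ID, the neighbor address, and the loopback used as update source are hypothetical; one neighbor stanza would be repeated per leaf, and the EVPN address family is assumed to be already enabled on the switch.

```
! Hypothetical fabric-wide iBGP AS number
router bgp 65000
  router-id 10.0.0.1
  ! Hypothetical leaf, peered over its routing loopback
  neighbor 10.0.0.11
    remote-as 65000
    update-source loopback1
    address-family l2vpn evpn
      ! Route Targets are carried as extended communities
      send-community extended
      ! Reflect EVPN routes learned from this leaf to the other clients
      route-reflector-client
```

On the leaf side, the mirror-image configuration simply omits the route-reflector-client line and peers with each spine.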

Figure: iBGP Route Reflector Placement



Endpoint Detection and Tracking


A VTEP in MP-BGP EVPN detects attached endpoints via local learning. MAC addresses are learned in the data plane from incoming Ethernet frames, whereas IP addresses are learned via ARP or Gratuitous ARP (GARP) control plane packets sent by the endpoint. Alternatively, the learning can be achieved by using a control plane or through management plane integration between the VTEP and the local hosts.

Once a VTEP detects a local endpoint, it installs a Host Mobility Manager (HMM) route to track it. The VTEP also constructs an EVPN Type-2 route to advertise the learned MAC and IP address of the endpoint to the rest of the VTEPs in the same Fabric.

The EVPN Type-2 route has an embedded sequence number used for endpoint movement tracking. When an endpoint moves from one VTEP to another VTEP, the new VTEP will detect it as a newly attached local host. It will send a new EVPN Type-2 routing update with the reachability information for this endpoint. When doing so, it will increment the sequence number by one. When the rest of the VTEPs receive the new route with the higher sequence number, they will update their routing information for the endpoint using the new VTEP as the next hop.

Layer 2 Logical Isolation (Layer 2 VNIs)


The creation of VXLAN overlay networks provides the logical abstraction allowing endpoints connected to different leaf nodes, separated by multiple Layer 3 Fabric nodes, to function as if they were connected to the same Layer 2 segment. This logical Layer 2 segment is usually referred to as a Layer 2 Virtual Network Instance (L2VNI).

The VXLAN segments are independent of the underlying network topology; likewise, the underlying IP network between VTEPs is independent of the VXLAN overlay. The combination of locally defined VLANs and their mapping to associated L2VNIs allows the creation of Layer 2 logical segments that can be extended across the Fabric.

As with traditional VLAN deployments, communication between endpoints belonging to separate L2VNIs is possible only through a Layer 3 routing function.

The sample below shows the creation of VLAN-to-VNI mappings on a VTEP device, which is usually a leaf node.

vlan 100
  vn-segment 30000
vlan 101
  vn-segment 30001

Once the VLAN-to-VNI mappings have been defined, it is then required to associate the created L2VNIs to an NVE logical interface, as shown in the configuration sample below.

interface nve1
  no shutdown
  source-interface loopback0
  host-reachability protocol bgp
  member vni 30000
    suppress-arp
    mcast-group 239.239.239.100
  member vni 30001
    suppress-arp
    mcast-group 239.239.239.101

In the definition of the NVE logical interface, the loopback interface created as part of the underlay configuration is specified to be used for VXLAN encapsulation and decapsulation.

It is also required to associate the EVPN control plane to the VXLAN deployment, instead of the original flood and learn model. At the time of writing, this configuration has a global scope for a given VXLAN deployment; hence, it is not possible to mix the two modes of operation (control plane or flood and learn based) in the same Fabric.
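
As a sketch of what this association typically looks like on a Cisco Nexus VTEP, the fragment below enables the EVPN control plane globally and selects BGP for host reachability on the NVE interface. The exact set of features required is an assumption and may vary by platform and software release.

```
! Enable the EVPN control plane for the VXLAN overlay (global scope)
nv overlay evpn

! Features assumed to be needed for a BGP EVPN VTEP
feature bgp
feature nv overlay
feature vn-segment-vlan-based
feature interface-vlan

interface nve1
  ! Learn remote endpoints through BGP EVPN rather than flood and learn
  host-reachability protocol bgp
```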

When multicast is the deployment choice for handling the replication of BUM traffic, a specific multicast group is associated to each defined L2VNI. The assignment of multicast groups to the L2VNIs is quite flexible and the chosen configuration depends on the following considerations:

• Using a unique multicast group for each defined VNI would allow the most granular distribution of BUM traffic, which would be flooded only to the leaf nodes where that specific L2VNI is defined. On the other hand, this design choice would drastically increase the amount of multicast state in the Fabric leaf and spine devices.
• Using a common multicast group for all the defined L2VNIs would minimize the amount of multicast state in the core of the network, but would cause BUM traffic for a given L2VNI to be flooded to all the leaf nodes, even where that specific VNI is not present (the traffic would then be discarded by the leaf).

The generally recommended approach is a balance of the two options above: assign a common multicast group to all the L2VNIs defined for a given tenant (VRF), and use different multicast groups across tenants.
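
The per-tenant balance described above could look like the following sketch. The group addresses, the third VNI, and the assignment of VNIs to tenants are hypothetical.

```
interface nve1
  ! Tenant-1: both L2VNIs share one multicast group
  member vni 30000
    mcast-group 239.1.1.1
  member vni 30001
    mcast-group 239.1.1.1
  ! Hypothetical second tenant: a different group keeps its BUM traffic separate
  member vni 30100
    mcast-group 239.1.1.2
```

Multicast state in the underlay then grows with the number of tenants rather than with the number of L2VNIs.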

Finally, as part of the L2VNI configuration, it is possible to enable ARP suppression. This removes the need to flood ARP requests across the Fabric, which usually represent the large majority of L2 broadcast traffic. ARP suppression can be enabled since each leaf node learns about all the endpoints connected to the Fabric via the EVPN control plane. When receiving an ARP request originated by a locally connected endpoint trying to identify the MAC of a remotely connected endpoint, the leaf can perform a lookup in a local cache populated upon reception of EVPN updates. If the MAC/IP information for the remote endpoint is available, the leaf can then reply to the local endpoint with the ARP mapping information on behalf of the remote endpoint. If the MAC/IP information for the remote endpoint is not available, the ARP request is flooded across the Fabric by encapsulating the packet in a VXLAN frame destined to the multicast group associated to the L2VNI of the local endpoint. ARP suppression can be enabled or disabled on a per-L2VNI basis.

Because most endpoints send ARP requests to announce themselves to the network right after they come online, the local VTEP will immediately have the opportunity to learn their MAC and IP addresses and distribute this information to other VTEPs through the MP-BGP EVPN control plane. Therefore, most active IP hosts in VXLAN EVPN should be learned by the VTEPs either through local learning or control plane-based remote learning. As a result, ARP suppression reduces the network flooding caused by host ARP learning behavior.

Layer 3 Multi-Tenancy (VRFs and Layer 3 VNIs)

The logical Layer 2 segment created by mapping a locally significant VLAN to a globally significant L2VNI is normally associated with an IP subnet. When endpoints connected to the L2VNI need to communicate with endpoints belonging to different IP subnets, they send the traffic to their default gateway. Deploying VXLAN EVPN allows support for distributed default gateway functionality on each leaf node, a deployment model commonly referred to as Distributed Anycast Gateway. In a VXLAN deployment, the various Layer 2 segments defined by combining local VLANs and global VNIs can be associated to a VRF if they need to communicate.

Communication between local endpoints connected to different L2VNIs can occur via normal Layer 3 routing in the context of the VRF (i.e. no VXLAN encapsulation is required).

The deployment of Symmetric Integrated Routing and Bridging (IRB), already introduced in the Fundamental Concepts chapter, requires the introduction of a transit Layer 3 VNI (L3VNI) offering L3 segmentation services per tenant VRF. Each VRF instance is mapped to a unique L3VNI in the network. Different L2VNIs for the same tenant are usually associated to the same VRF. As a result, inter-VXLAN routing is performed through the L3VNI within a particular VRF instance.

The Symmetric IRB model assumes that the default gateway for all the L2VNIs is fully distributed to all the leaf nodes. At the time of this writing, the distributed gateway model is the only one supported with VXLAN EVPN and can be enabled by applying the configuration below on all the leaf nodes:

fabric forwarding anycast-gateway-mac 2020.2020.2020

vlan 100
  vn-segment 30000

interface Vlan100
  no shutdown
  vrf member Tenant-1
  ip address 192.168.100.1/24 tag 21921
  fabric forwarding mode anycast-gateway

The first command defines a common virtual MAC address to be used for the default gateway. The same value is used for all the IP subnets associated with the L2VNI segments, independently of the VRF they belong to (fabric-wide configuration). The “fabric forwarding mode anycast-gateway” command is used to enable the distributed default gateway functionality on the VTEP nodes. This command must be applied to all the SVIs for the VLANs that are mapped to L2VNIs.

The configuration shown above must be repeated for all the local VLANs mapped to L2VNIs if there is a requirement to route traffic to separate IP subnets. The SVI is associated to a specific VRF instance (“vrf member” command). The use of VRFs provides Layer 3 logical isolation, a concept often referred to as “multi-tenancy”.

The additional required configuration for each defined VRF is shown below.

vlan 2500
  name L3_Tenant1
  vn-segment 50000

vrf context Tenant-1
  vni 50000
  rd auto
  address-family ipv4 unicast
    route-target import auto
    route-target import auto evpn
    route-target export auto
    route-target export auto evpn

interface Vlan2500
  description L3_Tenant1
  no shutdown
  mtu 9216
  vrf member Tenant-1
  ip forward

interface nve1
  member vni 50000 associate-vrf

In MP-BGP EVPN, any VTEP in a VNI can be the Distributed Anycast Gateway for end hosts in an IP subnet by supporting the same virtual gateway IP address and virtual gateway MAC address. When using Distributed Anycast Gateway with EVPN, routed traffic from an endpoint is always processed by the closest leaf node. This capability enables optimal forwarding for northbound traffic from endpoints in the VXLAN overlay network. East-West traffic between endpoints connected to the same leaf is locally routed by that leaf. This is especially important for applications with rack awareness, like some Hadoop distributions. A Distributed Anycast Gateway also offers seamless host mobility in the VXLAN overlay network. The gateway IP and virtual MAC address are identically provisioned on all VTEPs within a VNI; therefore, when an end host moves from one VTEP to another VTEP, it doesn't need to send another ARP request to re-learn the gateway MAC address.

Figure: Distributed Anycast Gateway

Multicast in the Overlay


At the time of writing of this publication, there is no standard covering the implementation of IP multicast in the VXLAN overlay. This implies that if Layer 3 multicast services are required in the VXLAN overlay networks, the only possibility is connecting external multicast routers to the Fabric. Configuration of the external multicast routers is outside the scope of this publication.

Host Connectivity

When connecting endpoints (bare-metal servers, hypervisor hosts, network service nodes, etc.) to the network in a redundant fashion, two options are available:

• Using an Active/Standby attachment mode, where the endpoint leverages one or more active links to one leaf switch and one or more standby links to a second leaf switch. This ensures the endpoint can survive the failure of a single leaf switch and regain network connectivity simply by activating the standby links. This configuration does not require any specific functionality to be supported on the leaf, as normal Layer 2 learning and forwarding can be performed to deliver traffic to the locally connected endpoints.
• Using an Active/Active attachment mode, with static bundling of physical interfaces or dynamic bundling using Link Aggregation Control Protocol (LACP). This ensures that all available links are always active and used to send and receive traffic. This model requires that the leaf switches support Multi-Chassis Link Aggregation (MC-LAG) functionality to appear as a single logical entity to the locally connected endpoints. Cisco Nexus switches offer Virtual Port-Channel (vPC) to achieve this.

Figure: VTEPs and Server Attachment Models



In a VXLAN Fabric, there are some additional aspects to consider.

• When the endpoint connects in Active/Standby mode, each leaf switch is configured with an independent VTEP IP address. Traffic originated by the server will always be VXLAN encapsulated and decapsulated by the leaf switch connected to the active port. Remote VTEPs will always point to VTEP 1 or VTEP 2 (depending on which link is active) when remote endpoints need to send traffic to devices locally connected to those leaf switches.
• When the endpoint connects in vPC mode (Active/Active), the two VXLAN leaf switches are deployed as part of the same vPC domain and a common Anycast VTEP is defined. As a consequence, no matter which physical uplink is used by the local endpoint to send traffic into the network, remote VTEPs always associate the endpoint information to the source Anycast VTEP address. This is critical for consistent Layer 2 MAC learning of the endpoint within the Fabric, in order to avoid continuous flapping of information in the MAC tables of the remote VTEPs.

In the Cisco NX-OS implementation, the Anycast VTEP address is defined as a common secondary IP address associated to the VTEP loopback interface of both VXLAN leaf switches that are part of the same vPC domain.

interface loopback0
  description VTEP
  ip address 10.254.254.102/32
  ip address 10.254.254.1/32 secondary

It is worth noting that once a pair of VXLAN switches is configured as part of a vPC
domain, the Anycast VTEP is always used as the next hop for all EVPN advertisements
relating to directly connected endpoints. This also applies to local endpoints connected
in Active/Standby fashion. The consequence is that roughly half of the flows destined to
those devices may be delivered from the spines to the VTEP device connected to the standby
link of the endpoint (the spines have two equal-cost paths to reach the Anycast VTEP IP
address); that traffic then has to take an extra hop across the peer-link in order to be
delivered to the active interface of the endpoint.
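
The "roughly half" figure can be illustrated with a rough sketch. It assumes the spines pick one of the two vPC peers per flow with a uniform hash; SHA-256 merely stands in for the hardware hash, and the names `leaf-1` and `leaf-2` and the flow tuples are hypothetical. The sketch counts how many flows land on the peer that only has the standby link and therefore cross the peer-link.

```python
import hashlib

VPC_PEERS = ["leaf-1", "leaf-2"]  # hypothetical names; both advertise the same Anycast VTEP


def ecmp_pick(flow: tuple, paths: list) -> str:
    """Pick one equal-cost path per flow (SHA-256 stands in for the hardware hash)."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return paths[digest[0] % len(paths)]


def peer_link_share(flows: list, active_leaf: str) -> float:
    """Fraction of flows delivered to the peer that only has the standby link."""
    detoured = sum(1 for f in flows if ecmp_pick(f, VPC_PEERS) != active_leaf)
    return detoured / len(flows)


# 1000 hypothetical flows toward one Active/Standby endpoint whose active link is on leaf-1.
flows = [("10.1.1.10", "10.2.2.20", 6, 49152 + i, 443) for i in range(1000)]
share = peer_link_share(flows, active_leaf="leaf-1")
print(f"{share:.0%} of flows take the extra peer-link hop")
```

The exact percentage depends on the hash and the traffic mix; the point is only that, with two equal-cost paths, a uniform hash sends about one flow in two to the "wrong" peer.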

This suboptimal behavior can be avoided by grouping endpoints based on their type of
connectivity (Active/Standby vs. LACP) and connecting them to separate sets of leaf
switches.
External Connectivity for VXLAN Fabric

Introduction

In real-world data center deployments, the Fabric is never an isolated environment, and
connectivity with external networks is always required. The required connectivity
typically depends on the type of external network the VXLAN Fabric is connected to.
For example, when connecting to the campus, WAN, or the Internet, Layer 3 routing is
normally used. When extending Layer 2 outside of the VXLAN Fabric, additional connectivity
considerations are required.

In addition to external connectivity, the VXLAN Fabric will typically be deployed into an
existing data center environment, so interoperability with the existing network and the
ability to migrate workloads to the new Fabric are important considerations.

This chapter provides detail on both Layer 2 and Layer 3 external connectivity to the
VXLAN Fabric and on how to use those concepts to deploy a VXLAN Fabric into an existing
data center.

Layer 3 Connectivity

Multi-tenancy is one of the primary use cases for deploying a VXLAN BGP EVPN Fabric.
Different VRFs can be defined to segment different organizations, business units, mergers
and acquisitions, user groups, or applications, or simply to provide security segmentation
and policy enforcement.

In the context of VXLAN BGP EVPN, each instance (i.e. VRF/VLAN) is logically isolated,
but physically integrated into the overall Fabric as a shared infrastructure. When
extending Layer 3 connectivity outside the VXLAN Fabric, two different scenarios are
usually considered:

1 Extend the logical isolation between VRFs into the externally routed domain. This
scenario is typically deployed when connecting the VXLAN Fabric to the campus network or
to the WAN, as shown in the figure below.

Figure: Extending Layer 3 Multi-Tenancy to the External Layer 3 Domain



The border node represents the edge of the VXLAN Fabric and normally terminates the VXLAN
data plane encapsulation to provide Layer 3 hand-off functionality toward the edge router.
The border node role can be implemented on a leaf or spine switch. The edge router extends
multi-tenancy connectivity across the external network, leveraging one of the deployment
options discussed in the sections below. It is worth noting that this model fully supports
overlapping IP address space across different tenants, providing end-to-end logical
isolation.

2 Provide shared access to a common external service. This scenario allows different
tenants to have common access to shared resources such as the Internet.

Figure: Accessing External Layer 3 Shared Resources

The simple use case shown above does not allow overlapping IP address space across
different tenants, as it merges all the routing information into the “Default VRF” routing
table. As an extension to the previous example, access to shared resources may be provided
by front-ending each tenant with a security device. This provides an enforcement point for
security policy when a tenant needs to access external resources or to communicate with
other tenants, as shown in the figure below.

In this case, it is common to leverage the NAT (Network Address Translation) functionality
offered by a firewall for tenants that must support overlapping address space.
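
As a planning aid, the tenants that would collide in a shared routing table can be identified before deciding where NAT is needed. The following is a minimal sketch using Python's standard `ipaddress` module; the tenant names and prefixes are hypothetical.

```python
import ipaddress
from itertools import combinations

# Hypothetical per-tenant prefixes; Tenant-A and Tenant-B reuse 10.1.0.0/16.
tenant_routes = {
    "Tenant-A": ["10.1.0.0/16", "172.16.10.0/24"],
    "Tenant-B": ["10.1.0.0/16", "192.168.50.0/24"],
    "Tenant-C": ["10.9.0.0/16"],
}


def overlapping_tenants(routes: dict) -> set:
    """Return tenants whose prefixes overlap with a prefix of another tenant."""
    conflicted = set()
    for (t1, p1), (t2, p2) in combinations(routes.items(), 2):
        nets1 = [ipaddress.ip_network(p) for p in p1]
        nets2 = [ipaddress.ip_network(p) for p in p2]
        if any(a.overlaps(b) for a in nets1 for b in nets2):
            conflicted.update({t1, t2})
    return conflicted


needs_nat = overlapping_tenants(tenant_routes)
print(sorted(needs_nat))  # tenants that cannot share one routing table as-is
```

Tenants reported by the check would need NAT (or dedicated VRF extension) before their routes can be merged into a common table.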

Figure: Secure Access to External Layer 3 Shared Resources

The above diagrams illustrate how Layer 3 connectivity external to the VXLAN Fabric is
provided by a border node device. The first design decision to make is the placement of
the border node. Two main options are usually considered:

1 Border leaf: the border node is placed on a leaf device. This is a natural choice, as
the leaf nodes are deployed as VTEP devices capable of supporting the required control
plane and data plane functionalities. Deploying the VTEP capabilities only on the leaf
nodes keeps the configuration on the spine switches much simpler. The spine provides the
Fabric backplane functionality, routing VXLAN encapsulated traffic between the leaf nodes.
The border leaf only services north-south communication.

2 Border spine: the border node is placed on a spine device. This deployment option
provides the advantage of optimizing north-south communication with external resources.
At the same time, it introduces the requirement to deploy a spine device that is capable
of supporting VXLAN control and data plane functionality (VTEP). The border spine will
most likely also serve as BGP Route Reflector (RR) and Multicast Rendezvous Point (RP).
The border spine services north-south as well as east-west communication.

A good network design always provides resiliency and redundancy for key network elements.
The border node performs a key function, interconnecting the VXLAN Fabric to the external
network domain, so it is critical to ensure its resiliency. It is recommended to design
the Fabric with redundant border nodes and edge routers, each leveraging redundant
physical connections, as shown below.

Figure: Redundant Border Nodes and Connections to the Edge Routers

Regarding Layer 3 hand-off functionality, it is a fair assumption that the links between
the border nodes and the edge routers are routed interfaces. Depending on how Layer 3
communication is extended outside the VXLAN Fabric, those Layer 3 interfaces could be
dedicated to each tenant or shared across multiple tenants. The following sections provide
an overview of the different deployment options. All the scenarios depict a border leaf
deployment, but the same considerations apply in the border spine case.

VRF-Lite Hand-Off
The use of VRFs makes it possible to have multiple routing tables that are completely
independent and isolated. VRF-Lite is a common and well-known mechanism to extend the
tenant Layer 3 VRF information beyond the VXLAN Fabric.

The VRF-Lite approach dictates a two-box solution where the border node and the edge
router are physically independent devices. With VRF-Lite, connectivity for different
tenants from the VXLAN Fabric is extended externally on a hop-by-hop basis. The border
leaf participates in the VXLAN Fabric and has the full VTEP configuration to perform
VXLAN encapsulation and decapsulation, along with routing toward the edge routing device.

For this to happen, the following two requirements must be met:

• At the control plane level, the border node is responsible for exchanging per-tenant
routing information between the VXLAN Fabric and the external network. The border node
runs IPv4 or IPv6 unicast routing for each of the tenant VRFs with the external edge
routing device to learn the external routes and to advertise the Fabric subnet/host routes
to the external network. The border node also redistributes and advertises the external
routes through MP-BGP EVPN to the internal nodes on the Fabric.
• The routing protocol used to communicate with the edge router can be BGP or an IGP of
your choice. When using BGP to peer with external routers, MP-BGP EVPN automatically
imports the BGP routes learned from the VRF-Lite IPv4 or IPv6 unicast address family into
the L2VPN EVPN address family. This is a common option adopted in many real-world
deployments. With other routing protocols, redistribution of routes is required to ensure
routes are exchanged between the VXLAN Fabric and the external router.

When the border node learns the external routes from the edge router, it advertises the
prefixes inside the VXLAN Fabric domain as EVPN Type-5 routes. This information is
distributed to the other VTEP nodes. At the same time, the border node is configured to
send EVPN routes learned from the L2VPN EVPN address family to the IPv4 or IPv6 unicast
address family and advertise them to the external edge router.

The sample configuration below shows an example where eBGP over a sub-interface is used as
the routing protocol between the border node and the edge router.

vrf context Tenant-1
  vni 50000
  rd auto
  address-family ipv4 unicast
    route-target import auto evpn
    route-target export auto evpn
    route-target import auto
    route-target export auto

interface Ethernet1/10.100
  encapsulation dot1q 100
  vrf member Tenant-1
  ip address 192.168.5.254/30

router bgp 65500
  router-id 10.254.254.200
  neighbor 10.254.254.3
    remote-as 65500
    update-source loopback1
    address-family l2vpn evpn
      send-community both
  neighbor 192.168.1.1
    remote-as 65535
    address-family ipv4 unicast
      prefix-list filter-host-routes out
  vrf Tenant-1
    address-family ipv4 unicast
      advertise l2vpn evpn

ip prefix-list filter-host-routes seq 10 deny 0.0.0.0/0 eq 32
ip prefix-list filter-host-routes seq 20 permit 0.0.0.0/0 le 32

In this example, the “advertise l2vpn evpn” command under the VRF IPv4 address family
ensures that:

• All the EVPN Fabric internal IP prefixes are advertised from EVPN into the VRF
• All the external IP prefixes learned from the edge router are advertised from the VRF
into EVPN
• By default, all the Fabric internal prefixes, including host routes for the connected
endpoints, are advertised toward the edge router. If this is not the desired behavior, a
route policy can be applied to eliminate host routes, as the “filter-host-routes” prefix
list in the sample configuration does.
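
The effect of the two “filter-host-routes” entries can be checked with a small model of prefix-list matching. This is a sketch of the matching logic only (first match wins, implicit deny at the end), not of how NX-OS implements it; the sample routes are hypothetical.

```python
import ipaddress

# (action, prefix, min_len, max_len) mirroring the two sample entries:
#   seq 10 deny   0.0.0.0/0 eq 32  -> only /32 host routes
#   seq 20 permit 0.0.0.0/0 le 32  -> any prefix length from 0 to 32
FILTER_HOST_ROUTES = [
    ("deny", ipaddress.ip_network("0.0.0.0/0"), 32, 32),
    ("permit", ipaddress.ip_network("0.0.0.0/0"), 0, 32),
]


def permitted(route: str, prefix_list=FILTER_HOST_ROUTES) -> bool:
    """First matching entry wins; no match means implicit deny."""
    net = ipaddress.ip_network(route)
    for action, base, min_len, max_len in prefix_list:
        if net.subnet_of(base) and min_len <= net.prefixlen <= max_len:
            return action == "permit"
    return False


for r in ["10.1.1.0/24", "10.1.1.25/32", "0.0.0.0/0"]:
    print(r, "advertised" if permitted(r) else "filtered")
```

Subnet routes and the default route pass, while /32 host routes are dropped before they reach the edge router.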

A similar configuration, with the exception of the EVPN address family specific commands,
must then be applied on the edge router to ensure the BGP session can be established with
the border node.

At the data plane level, traffic for different tenants must be carried between the border
node and the edge router. This can be achieved by dedicating an interface (logical or
physical) to each VRF. The available options are:

• Physical Routed Ports: this implies using a dedicated physical interface for each
tenant.
• Sub-Interfaces: one logical sub-interface can be carved for each tenant to carry traffic
on the same physical connection.

As shown above, it is important to note that for each VRF, manual configuration is
required along the entire Layer 3 path. Since VRF-Lite needs to be configured on a
hop-by-hop basis, scalability becomes a concern for large numbers of tenants/VRFs; this is
where an MPLS hand-off has the advantage.
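
To give a feel for the per-VRF effort, the sketch below renders the sub-interface stanza from the earlier sample once per tenant. The additional tenant names, VLAN tags, addresses, and the hop count are hypothetical; only the command syntax is taken from the sample configuration.

```python
def vrf_lite_subinterface(parent: str, vlan: int, vrf: str, ip: str) -> str:
    """Render one sub-interface stanza in the style of the sample configuration."""
    return "\n".join([
        f"interface {parent}.{vlan}",
        f"  encapsulation dot1q {vlan}",
        f"  vrf member {vrf}",
        f"  ip address {ip}",
    ])


# Hypothetical tenants: (VRF name, dot1q tag, point-to-point address).
tenants = [
    ("Tenant-1", 100, "192.168.5.254/30"),
    ("Tenant-2", 101, "192.168.6.254/30"),
    ("Tenant-3", 102, "192.168.7.254/30"),
]

stanzas = [vrf_lite_subinterface("Ethernet1/10", vlan, vrf, ip) for vrf, vlan, ip in tenants]
print("\n\n".join(stanzas))

# Every Layer 3 hop on the path needs an equivalent stanza per VRF.
hops = 3  # hypothetical path length
print(f"{len(tenants) * hops} sub-interfaces to maintain for {len(tenants)} VRFs over {hops} hops")
```

The count grows with the product of VRFs and hops, which is why templating helps but does not remove the underlying scaling concern.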

MPLS Hand-Off
In many two-device deployments, the edge router can act as an MPLS Provider Edge (PE)
node. Alternatively, a single-device solution can be used to terminate MPLS and VXLAN
routing on the same device. This solution merges the border node and the MPLS PE router
functionalities into a single physical device, usually referred to as the Border PE node.
This scenario is depicted below.

Figure: Single Device Solution with Border PE Nodes

This section summarizes the steps for manually configuring a Border PE device deployed on
a Cisco NX-OS based platform, with reference to the simple network topology shown below.

Figure: Single Device Configuration Example

The sample configuration below shows a Border PE example.

vrf context Tenant-1
  vni 50000
  rd auto
  address-family ipv4 unicast
    route-target import auto evpn
    route-target export auto evpn
    route-target import auto
    route-target export auto
    route-target import 65535:1
    route-target export 65535:1

Note: The additional route-targets have to match the ones used in MPLS L3VPN
for each VRF.

interface Ethernet1/10
  ip address 192.168.5.254/30
  ip router ospf MPLS-CORE
  mpls ip

router bgp 65500
  router-id 10.254.254.200
  neighbor 10.254.254.3
    remote-as 65500
    update-source loopback1
    address-family l2vpn evpn
      import vpn unicast reoriginate
      send-community both
  neighbor 192.168.1.1
    remote-as 65535
    address-family vpnv4 unicast
      import l2vpn evpn reoriginate
  vrf Tenant-1
    address-family ipv4 unicast
      advertise l2vpn evpn

The Border PE re-originates IP prefixes from the VXLAN Fabric EVPN address family into the
MPLS VPNv4 address family and vice versa. The commands required to achieve this are
“import vpn unicast reoriginate” and “import l2vpn evpn reoriginate”, each applied in the
opposite address family. An eBGP peering is required between the Border PE and the MPLS
PEs. For the import and export to MPLS L3VPN, the appropriate route-targets have to be
chosen for each VRF.

LISP Hand-Off
In Active/Active data center deployments, workload mobility allows applications to move
between geographically dispersed locations. This brings the challenge of ingress route
optimization when the workloads change location. The Locator/Identifier Separation
Protocol (LISP) solves this challenge by routing client traffic to the location where the
resources currently reside. The routing information for LISP does not add any additional
prefixes to the underlay routing domain.

LISP, as defined in RFC 6830, is a routing architecture that enables a new paradigm for
IP addressing. IP addresses are scoped in two distinct namespaces: Endpoint Identifiers
(EIDs), which are assigned to end hosts, and Routing Locators (RLOCs), which are assigned
to networking devices. The LISP protocol provides all the messaging necessary to maintain
and access a mapping database in which EIDs are correlated to RLOCs. LISP uses a
map-and-encapsulate forwarding model in which traffic destined for an EID is encapsulated
and sent to the RLOC of the device through which the EID is connected, based on the
results of a lookup in the mapping database. Traffic is sent to a device's RLOC rather
than directly to the destination EID. This approach relieves the core network of the
responsibility of handling EID information. In this way, the LISP architecture augments
the current routed infrastructure to facilitate new functionality with minimal disruption
to the existing network infrastructure.

LISP is a directory of addresses and their locations, not a traditional routing protocol.
LISP uses a demand-based model where edge devices request location information as
required. This demand model contrasts with the push model used by routing protocols and
results in a reduced load on the device's hardware tables. LISP has other advantages,
noted below:

• Mobility: EID portability
• Scalability: On-demand routing
• Security: Tenant ID-based segmentation
• DCI: Ingress route optimization

LISP mappings can be classified to give VPN and tenant semantics to each prefix handled by
LISP. This classification is encoded in the LISP control plane as stipulated in the
standard definition of the protocol. The LISP data plane also supports segmentation of
traffic into multiple VPNs. LISP binds VRFs to instance IDs, and these IDs are then
included in the LISP header to provide data plane (traffic flow) separation for single or
multi-hop forwarding. The LISP multi-tenancy solution promises to exceed the scalability
of current segmentation solutions significantly because it uses on-demand routing and does
not require maintenance of traditional routing adjacencies.
Figure: Border Spine and LISP Hand-Off

Border Spine and LISP Hand-Off

This section focuses on a design where the spine device acts as a border node and supports
LISP hand-off. Since all of the spines are connected to the WAN edge routers, ECMP is
available from the spine devices to any given external site, allowing hosts connected to
the VXLAN Fabric to communicate with the external sites. Although the focus is on the
scenario where the spines act as border nodes, a similar design can be implemented on
border leaf nodes.

In this scenario, the spine device acts as a LISP xTR. A LISP xTR is a device that can act
as both a LISP Ingress Tunnel Router (ITR) and a LISP Egress Tunnel Router (ETR). With
LISP, regular IPv4/IPv6 host routes originating from the data center are not advertised,
which helps optimize the routing table.

LISP has a mapping database system that keeps track of routes learned from all spine
devices in the EVPN Fabric. LISP also tracks addresses at remote sites and adds them to
the mapping database. Routes learned from leaf VTEPs are added to the Routing Information
Base (RIB) at the xTR. LISP selects these routes from the RIB and adds them dynamically to
the mapping database as Locator-Identity mappings.

The spine advertises a default route to attract northbound traffic from the leaf VTEPs.
When a leaf VTEP receives a packet for which it does not have a specific route, it sends
the packet to the spine using the overlay. The spine decapsulates the packet, performs a
lookup in the LISP mapping database, applies LISP encapsulation, and forwards the packet
northbound across the WAN.

The spine devices continue to have L3VNIs configured, as they act as VTEPs for northbound
traffic coming from attached leaf devices. They also act as tunnel endpoints for
southbound traffic coming from the remote LISP xTR and destined to the hosts connected to
the leaf. The spine devices do not need to be configured with L2VNIs, which has the
advantage of allowing Layer 3 multi-pathing across the VXLAN Fabric.

North-South Traffic with VXLAN Host in the POD

In this scenario, packet forwarding involves two encapsulations:

1 LISP encapsulation between the external sites and the border spine

2 VXLAN encapsulation between the border spine and the leaf

The following steps describe host detection and packet forwarding:

1 The VXLAN Fabric can be connected across an IP cloud to external sites for north-south
traffic using LISP. The Border PE (provider edge) solution is used to connect the data
centers and external sites using LISP.

2 In the VXLAN Fabric, the host routes and MAC address information are distributed in the
MP-BGP EVPN control plane from the leaf nodes, which means that the Fabric itself performs
the host detection. The LISP site gateways use these host routes to trigger LISP
encapsulation and decapsulation.

3 When the LISP site gateway (the Border PE, which also runs MP-BGP EVPN in the Fabric)
detects a host based on the route received in BGP, it sends a map-register message to the
mapping system database to register the new IP address in its own data center.

4 When remote sites want to talk to the data center hosts, they send a query to the
mapping system requesting the location of the host. The mapping system replies with the
location of the LISP site gateway where the destination EID is located.

5 Communication is then established between the remote client and the data center host,
leveraging the LISP and VXLAN technologies as described earlier.
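
Steps 3 and 4 can be sketched as a register-then-query exchange. The class below is a toy stand-in for the mapping system, not a LISP protocol implementation; the method names, host address, and RLOC values are hypothetical.

```python
class MappingSystem:
    """Toy stand-in for the LISP mapping system (not a protocol implementation)."""

    def __init__(self):
        self._db = {}  # host EID -> RLOC of the registering site gateway

    def map_register(self, eid: str, rloc: str) -> None:
        # Step 3: the site gateway registers a host it learned via BGP EVPN.
        self._db[eid] = rloc

    def map_request(self, eid: str):
        # Step 4: a remote site asks where the host currently lives.
        return self._db.get(eid)


ms = MappingSystem()
ms.map_register("10.1.20.9", rloc="192.0.2.1")    # host detected in data center 1
first = ms.map_request("10.1.20.9")

ms.map_register("10.1.20.9", rloc="192.0.2.129")  # same host detected in data center 2
second = ms.map_request("10.1.20.9")
print(first, "->", second)
```

Because remote sites resolve the location on demand, a workload move only requires a new registration; the host keeps its IP address (EID).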

Layer 3 Connectivity Summary


External Layer 3 connectivity from a VXLAN Fabric can be achieved using three different
technologies:

• VRF-Lite provides an IP hand-off using sub-interfaces with IEEE 802.1Q tags to separate
the VRFs
• MPLS uses VPN labels to separate traffic on a per-VRF basis
• LISP uses an IP-in-IP encapsulation and instance IDs to segregate the VRFs

Layer 2 Connectivity

There are two major use cases for Layer 2 hand-off and connectivity. The first is for
migration scenarios, where the VXLAN Fabric needs to be connected to an existing non-VXLAN
network infrastructure. The second is the extension of Layer 2 broadcast domains between
separate VXLAN Fabrics, referred to as multi-site.

Layer 2 vPC Hand-Off


In this scenario, the VXLAN Fabric is connected to an external Layer 2 network via
Ethernet 802.1Q VLAN trunks. This external Layer 2 network will be referred to as a
Classical Ethernet (CE) POD. Layer 2 connectivity can be extended between the two
environments, taking the form of an L2VNI on the VXLAN Fabric and a traditional VLAN on
the CE POD.

A vPC border node pair on the VXLAN Fabric can be used as a redundant Layer 2 gateway for
the hand-off. In this case, the two environments can be connected via a vPC without
introducing loops into the extended Layer 2 networks.

Figure: Traffic Flow Between a VXLAN Fabric and a CE POD



In the illustration above:

• BL1 and BL2 provide Layer 2 border leaf functionality for the VXLAN Fabric
• The border leafs form a back-to-back vPC to redundantly connect with the CE POD
aggregation switches
• The assumption is that the aggregation layer switches support vPC

vPC Unicast Communication


With vPC, the pair of border leaf switches shares a single virtual VTEP IP address and
MAC address. This allows both devices to handle the forwarding and receipt of unicast
traffic.

When the VTEPs learn MAC reachability information for devices in the CE POD, they inject
this information into the EVPN Fabric control plane, associating the virtual VTEP IP
address with the MAC addresses of the endpoints connected to the CE POD. This ensures that
the other VTEPs within the VXLAN Fabric receive this information and program it in their
Layer 2 forwarding tables. Any leaf in the VXLAN Fabric can reach resources connected to
the CE POD by encapsulating traffic in VXLAN packets destined to the single virtual VTEP
next-hop address. This implies that traffic in the underlay Fabric network can be
load-balanced across Equal Cost Multipath (ECMP) paths. In the event of a border leaf node
failure, the impact is minimal given the redundancy built into the Fabric. MAC addresses
learned from the CE POD in the data plane are synchronized across the vPC peer link, so
both border leaf switches are capable of forwarding unicast Layer 2 traffic directed
toward them.

The border leaf switches are aware of the IP and MAC addresses of all the endpoints
connected to the VTEPs in the VXLAN Fabric, so traffic received from the CE POD can be
VXLAN encapsulated and forwarded inside the Fabric toward the destination VTEP.

vPC Multicast Communication


For BUM traffic handling, the border leaf simply floods the packet to the vPC, which
supports an Active/Standby model for multi-destination packet forwarding. Only one of the
vPC peers is selected as the designated forwarder on a per-group basis; it is responsible
for forwarding the BUM traffic, which avoids creating multiple copies of the same packets.

When a device in the VXLAN Fabric sends a multicast packet:

• Both vPC border nodes receive the multicast traffic encapsulated in VXLAN
• The designated forwarder decapsulates the packet and forwards it to the CE POD
• The non-designated forwarder drops the ingress packet

When a device in the CE POD sends a multicast packet:

• When the packet reaches the border nodes, one of the vPC peers is designated as the
forwarder for that VLAN. That vPC peer takes responsibility for encapsulating the
multicast packet into the VXLAN EVPN Fabric.

Loop Prevention
As Layer 2 is extended outside the VXLAN Fabric, it is important to remember that the
border node participates in the VXLAN Fabric from both a control plane and a data plane
perspective. VXLAN does not currently provide any integration with Spanning Tree Protocol
(STP), meaning VXLAN does not forward BPDUs across the Fabric. Therefore, establishing
redundant Layer 2 connections between the VXLAN Fabric and the external network may result
in the creation of a loop, as highlighted in the figure below.

Figure: Creation of a Layer 2 Loop

In order to have a multi-homed, loop-free topology, Cisco recommends using vPC for the
southbound connectivity of edge devices, as shown below.

Figure: Layer 2 Loop-Free Topology with vPC

Since the border nodes already participate in the Layer 3 hand-off, it is a natural choice
to leverage them to extend Layer 2 connectivity outside the Fabric. The recommendation is
to enable vPC on the border nodes and provision parallel interfaces between the border
nodes for the Layer 2 VLANs that need to be extended. It is important to note that the
border node does not have to be used for Layer 2 extension; it is possible to leverage
another pair of leaf switches for this function. It is not recommended to leverage the
border spine for Layer 2 connectivity between locations, as the spine devices should be
independent nodes.

Integration and Migration

Greenfield scenarios do not require much focus on integration with legacy technologies.
This section focuses on brownfield scenarios, using the information provided in previous
sections of this chapter.

Brownfield data centers typically integrate new network technologies, and VXLAN Fabrics
are no exception, using one of the following methodologies:

• Layer 2 POD expansion: the new VXLAN Fabric will be connected with the existing network
using Layer 2.
• Layer 3 POD addition: the new VXLAN Fabric will be connected with the existing network
using Layer 3.

The implementation of the new VXLAN Fabric as an additional Layer 3 POD in the DC does not
typically involve migrating workloads from the existing network to the new Fabric. The
next section therefore focuses on the first use case, deploying a VXLAN Fabric as a
Layer 2 extension of an existing network.

Expansion of an Existing POD with a VXLAN Fabric


This section provides an overview of migrating to a VXLAN Fabric from an existing network
that could have been built leveraging vPC, FabricPath, or traditional STP technologies.
The VXLAN Fabric is built as a new POD deployment, and the scope does not cover conversion
of the existing network devices into VXLAN nodes. The goal is to provide network
integration and a path for the migration of endpoints and services to the new VXLAN Fabric
with minimal service disruption. Once the integration is complete, the Layer 3 and L4-L7
services can optionally be migrated to the VXLAN Fabric.

Figure: Layer 2 and Layer 3 Interconnect to Assist Integration and Migration

Layer 2 Interconnect
Using the techniques described earlier in this chapter, the new VXLAN Fabric can be
interconnected with the existing network, leveraging vPC and loop prevention techniques
such as BPDU Guard, Root Guard, and storm control to deliver a redundant Layer 2 path
between the two environments.

At the VXLAN Fabric border nodes, the same VLAN IDs need to be used in order to map
Layer 2 segments from the existing network to L2VNIs and establish Layer 2 connectivity.
Virtual machines and endpoints can now be moved seamlessly from the existing network to
the new VXLAN Fabric with minimal impact. The endpoints in the VXLAN Fabric still have
Layer 2 connectivity with the endpoints that have not yet been migrated. The default
gateway for all Layer 2 segments still resides in the original network at this point.

Moving Endpoints to the VXLAN Fabric


Once Layer 2 connectivity between the legacy network and the new VXLAN Fabric is
operational, workloads can be migrated. Migrating physical servers will typically require
re-cabling and a service disruption for the server being migrated. Virtual machines, on
the other hand, can be moved using live migration without any noticeable network impact.

Moving L4-L7 Network Services to the VXLAN Fabric


L4-L7 services appliances are essentially endpoints, so they may be migrated in the same
way as the server workloads. It may be possible to migrate virtual L4-L7 appliances
without disruption, depending on their capabilities and configuration. In the case of
physical appliances, high availability features such as clustering can help to minimize
disruption and allow for migration to the new Fabric.

You can find additional details about how L4-L7 services can be connected to a VXLAN Fabric in the chapter Layer 4-Layer 7 Services.

Moving the Default Gateway


Once all endpoints have been migrated to the VXLAN Fabric, it would be suboptimal to still have the default gateway in the existing network. The next step in the migration is to move the default gateway to the new Fabric. The new Fabric must have an existing Layer 3 uplink into the network core to preserve connectivity to the existing routed network.

This migration is a four-step process:

1 Disable the default gateway in the existing network

2 Configure the gateway IP address as a Distributed Anycast Gateway in the new VXLAN Fabric. By using the MAC address of the original default gateway, the endpoints do not need to re-ARP for the new default gateway

3 Ensure that the subnet is advertised upstream to the Layer 3 network core

4 Remove the default gateway and routing configuration from the existing network

Note that although step 3 can be configured in advance, step 2 needs to be done sequentially after step 1. There will be a short outage from step 1 until the anycast gateway is fully functional in the VXLAN Fabric and all endpoints have learned the MAC address for the new gateway.

Relearning the gateway's MAC address is not required if the anycast gateway in the VXLAN Fabric can take over the same MAC address that the old default gateway had. One restriction is that, as described in previous chapters, there is a single MAC address for the whole Fabric shared by all Distributed Anycast Gateways. If the default gateways in the legacy network had multiple MAC addresses (for example, if multiple HSRP or VRRP groups were used), a migration where the MAC address of the default gateway stays the same will not be possible.
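
The single-MAC restriction can be checked before the migration. The following is a minimal sketch, assuming the legacy gateways use HSRP version 1, whose virtual MAC has the well-known form 0000.0c07.acXX with XX being the group number. The function names and the sample VLAN and group numbers are hypothetical and only serve the illustration.

```python
def hsrp_v1_virtual_mac(group: int) -> str:
    """Return the HSRP version 1 virtual MAC for a group number (0-255)."""
    if not 0 <= group <= 255:
        raise ValueError("HSRPv1 group must be between 0 and 255")
    return f"0000.0c07.ac{group:02x}"


def can_keep_gateway_mac(legacy_gateway_macs: list[str]) -> bool:
    """True if one fabric-wide anycast gateway MAC can stand in for
    every legacy gateway MAC, i.e. all legacy gateways share one MAC."""
    return len({mac.lower() for mac in legacy_gateway_macs}) == 1


# Hypothetical legacy network: every VLAN uses HSRP group 1.
same_group = [hsrp_v1_virtual_mac(1) for _vlan in (100, 101, 102)]
# Hypothetical legacy network: a different HSRP group per VLAN.
per_vlan_group = [hsrp_v1_virtual_mac(group) for group in (1, 2, 3)]

print(can_keep_gateway_mac(same_group))      # True
print(can_keep_gateway_mac(per_vlan_group))  # False
```

In the second case the endpoints have to relearn the gateway MAC address, so the short outage described above applies.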

In case there are still endpoints in the original network, they will have to use the anycast gateway in the VXLAN Fabric to communicate with other network segments. The ARP suppression mechanism may cause traffic blackholing so it should not be enabled until all endpoints have been migrated. That is the reason why, in order to reduce the complexity of the migration, the recommendation is to completely migrate all endpoints from the legacy network to the new VXLAN Fabric. Once that is done, the whole VXLAN configuration can be removed from the legacy network.

At this point, communication between subnets where the default gateway is still in the legacy network and subnets whose default gateway has already been migrated to the VXLAN Fabric is suboptimal and can potentially follow asymmetric paths. Therefore, it is recommended to complete the migration of all endpoints and segments from the old network to the new VXLAN Fabric in as short a period of time as possible.
Layer4-Layer7 Services

Introduction

This chapter provides an overview of Layer4-Layer7 services and their deployment models, with a focus on design and on deployment use cases.

A VXLAN Fabric provides Layer 2 and Layer 3 connectivity; however, additional services are required in the data center. These services are provided by dedicated appliances (physical or virtual), and require connectivity to the fabric. These dedicated functions are referred to as Layer4-Layer7 services.

Traditional hierarchical network designs connect Layer4-Layer7 services at the aggregation layer. Within a VXLAN Fabric, Layer4-Layer7 appliances can be connected to any leaf switch or connected to a dedicated leaf pair referred to as a "service leaf".

There are different connectivity options for the physical and virtual appliances. The following section discusses different options for connectivity for the Layer4-Layer7 services devices.

Layer4-Layer7 Device Types


Depending on the requirements, multiple Layer4-Layer7 services may be implemented to provide a complete network and service function stack. These functions include the following:
• Stateful Layer 4 firewalling: Many organizations implement network security on dedicated firewalls where complex firewall policies are enforced. The firewall policies permit or deny communication between different organizational or application tiers. There are many other functions firewalls can perform such as Network Address Translation (NAT).
• Application Firewalls: Most attack vectors today focus on the application. The attacks leverage standard TCP ports to exploit application vulnerabilities. Examples include SQL Code Injection or Cross-Site Scripting. Application-level firewalls can help prevent these types of modern day attacks.
• Intrusion Detection (IDS) / Intrusion Prevention (IPS): The solution detects attacks and prevents systems from being compromised. It also prevents a compromised system from originating suspicious network activity. Examples are network reconnaissance with ping sweeps and port scans.
• WAN Optimization: The goal of this service is to improve the user experience through techniques such as optimization of the TCP stack, compression, and content caching.
• Application Delivery Controllers (ADC): The ADC includes server load balancing, SSL offload and other application functionality. ADCs can be deployed by themselves or in tandem with other service nodes.

Some Layer4-Layer7 appliance vendors might integrate several of the above categories in a single product, such as FW and IPS. In addition, another commonly used term is a service chain, which describes multiple Layer4-Layer7 devices implemented in sequence, such as WAN optimization, FW and ADC.

Deployment Models
In addition to the functionality of the Layer4-Layer7 services, an important factor to consider is how to deploy the service appliances. The following section describes different deployment models for Layer4-Layer7 services.

Virtual vs Physical
Layer4-Layer7 services come in different form factors including physical and virtual appliances. There are certain considerations required for virtual appliances, including the following:
• With virtual appliances, there is typically a virtual switch between the physical leaf and the VM hosting the services appliance
• Virtual services have different NIC redundancy models; these functions are provided by the hypervisor

The decision whether to use virtual or physical appliances requires additional considerations, including that physical appliances are generally specialized hardware which offers better performance than generic x86 platforms, particularly with encryption services.

Transparent vs Routed
There are two deployment models with service appliances, transparent mode and routed mode. In transparent mode, the service appliance is deployed as a bump-in-the-wire and does not change any MAC information. With transparent mode, a fail safe mechanism needs to be implemented to prevent Layer 2 data plane loops.

Figure: Layer4-Layer7 Service in Transparent Mode

On the other hand, routed deployments are not prone to Layer 2 loops because they follow IP routing semantics. Layer4-Layer7 appliances inserted in routed mode can participate with dynamic routing protocols. The benefit of implementing a dynamic routing protocol is that it allows for Route Health Injection (RHI) that influences the ingress routing path to the services appliance.

Figure: Layer4-Layer7 Service in Routed Mode

One-arm vs Two-arm designs


Firewalls have two or more interfaces, including an internal interface and an external interface. An ADC can be connected in a two-arm or one-arm mode; one-arm mode implements a single logical or physical interface. In one-arm mode, the ADC typically implements Network Address Translation (NAT) to ensure that the return traffic is sent back to the original ADC appliance.

The following figure illustrates the one-arm design option.

Figure: Layer4-Layer7 service in one-arm mode

Physical Connectivity
Layer4-7 services have different connectivity and redundancy deployment models, as discussed below.
• No redundancy: one logical interface maps to one physical interface, resulting in a single network connection
• Redundancy at the NIC level (port-channel): one logical interface maps to multiple physical interfaces. These interfaces are configured as a single port-channel connected to a single leaf switch
• Redundancy at the NIC and switch level (vPC): one logical interface maps to multiple physical interfaces. These interfaces are configured as a single port-channel connected to two different leaf switches. The two different switches are implemented as a vPC pair.

Figure: Physical Connectivity Options

Redundancy Model
Different redundancy models will have an impact on how the network will behave in case of a Layer4-Layer7 appliance outage:
• No redundancy: This mode is sometimes used for non-critical environments, and is typically deployed in conjunction with virtual Layer4-Layer7 appliances that leverage High Availability features of the hypervisor.
• Active/Standby: Two Layer4-Layer7 appliances are deployed, and one of them handles all traffic. When the active device fails, the standby device will become active. The network converges away from the failed appliance while the previous standby node becomes active. With the active/standby model, traffic flows are deterministic and this simplifies the forwarding path through the network.
• Clustering (Active/Active): There are two different models of clustering, where all services appliances are serving the workload. While one model uses the approach of a local port-channel per services appliance, the second model represents the services cluster as a single port-channel.

Integration into the VXLAN Fabric


In most cases the Layer4-Layer7 appliance is seen by the fabric as an endpoint. This does not require any additional control plane interaction with the fabric. However, in some cases, the Layer4-Layer7 vendor has implemented VXLAN encapsulation support on the service appliance. This provides the flexibility to leverage VXLAN for data plane integration. In this scenario, the service appliance would act as a VTEP.

When considering integrating Layer4-Layer7 service appliances into a VXLAN Fabric, the implementation details need to align between the two. For example, if the VXLAN Fabric is running with a BGP EVPN control plane, the service appliance needs to support this deployment model as well. Within this book, use of a service appliance as a VTEP is not considered.

Use Cases

There are multiple options to deploy Layer 4-Layer 7 services in a VXLAN Fabric: physical single-arm, vPC-based, ADC deployment in active/standby mode, virtual active/active firewalls in routed mode, transparent virtual intrusion prevention systems, etc. The following sections focus on the most frequent use cases.

Firewall as Default Gateway


Using the firewall as the default gateway is one of the simplest use cases.

In this design, the VXLAN Fabric provides a Layer 2-only service. All communication that requires crossing the Layer 2 demarcation must be sent to the firewall to be routed.

For example:

vlan 1100
name WEB
vn-segment 30100
vlan 1101
name APPLICATION
vn-segment 30101
vlan 1102
name DATABASE
vn-segment 30102

The firewall will have a logical Layer 3 interface in each VNI that will serve as the default gateway for all endpoints. Routing between IP subnets, each represented by a VNI, has to flow through the firewall. The firewall becomes the Layer 3 gateway for all VNIs in the VXLAN Fabric.

Figure: Firewall as a Default Gateway with a Layer 2 VXLAN Fabric

For example, an ASA firewall with four physical ports grouped in two logical port-channels:

int po10.1100
vlan 1100
nameif WEB
security-level 100
ip address 192.168.110.1 255.255.255.0

int po10.1101
vlan 1101
nameif APPLICATION
security-level 100
ip address 192.168.111.1 255.255.255.0

int po10.1102
vlan 1102
nameif DATABASE
security-level 100
ip address 192.168.112.1 255.255.255.0

int po20
nameif OUTSIDE
security-level 50
ip address 192.168.100.255 255.255.255.0

The firewall becomes the single point for inter-subnet communication in the fabric; consequently, it is important to properly size the appliance for resiliency, performance, and scale reasons. When a failure occurs in an active/standby deployment, the newly active firewall will notify the network of the change, normally by sending GARP (gratuitous ARP) or RARP (reverse ARP) packets. These will trigger the re-learning of the MAC addresses on the ports connected to the formerly standby firewall.

Transparent Firewall Insertion


Another popular option for deploying firewalls is to transparently insert the firewall into the network, between the server's default gateway and the server itself. Some reasons to use transparent firewall insertion include:
• Ability to add firewall services without changing existing IP addressing of the servers
• Multicast streams can easily traverse the firewall
• Non-IP traffic can be forwarded via the firewall
• Protocols such as HSRP and VRRP can pass through the firewall
• Routing Protocols can establish adjacencies through the firewall

From a logical standpoint, the fabric is the default gateway for the servers. For example, the servers are deployed in the 192.168.100.0/24 subnet and the VXLAN Fabric anycast gateway, 192.168.100.1, is configured as the servers' default gateway.

The firewall needs to be inserted transparently into the datapath. Instead of the servers being deployed in the same VLAN/VNI as the default gateway, the servers are configured in a different VLAN/VNI. For example, the default gateway resides in VLAN 100 (unprotected), while the servers are placed in VLAN 1100 (protected). The firewall in transparent mode stitches both VLANs/VNIs together, meaning the firewall is in the datapath between VLAN 100 and VLAN 1100. Whenever a server needs to reach the default gateway, the traffic has to pass through the firewall.

Figure: Transparent Firewall

The firewall enforces the security policies applied for data passing between the protected and unprotected VLANs and maintains the appropriate forwarding between them. This design can be used to deploy a micro-segmentation service inside of a subnet for servers that might have been compromised. As an example, instead of readdressing the servers you can dynamically move them behind a firewall and isolate the infected hosts from the rest of the fabric.

Example:

vlan 100
name UnProtected-SVI
vn-segment 30000

vlan 1100
name Protected-VLAN
vn-segment 31000

interface Vlan100
no shutdown
vrf member Tenant-1
no ip redirects
ip address 192.168.100.1/24 tag 21921
fabric forwarding mode anycast-gateway

In this configuration, VLAN 100 (unprotected) is the outside interface and VLAN 1100 (protected) is the inside interface.

The firewall configuration to stitch VLAN 100 to VLAN 1100 would be as follows:

firewall transparent

int po10.100
vlan 100
nameif sviVLAN
bridge-group 1
security-level 0

int po10.1100
vlan 1100
nameif serverVLAN
bridge-group 1
security-level 100

Integrating Layer 3 Firewall - Multi-Tenancy


A common requirement is to provide security policy for inter-tenant traffic and for accessing shared-services in a dedicated VRF. As we have seen, the VXLAN Fabric provides multi-tenancy through MP-BGP and VRF technologies. Multi-tenant communication is routed throughout the VXLAN Fabric and tenant isolation is maintained.

A Layer 3 firewall involves separating different security zones using different subnets. The firewall routes traffic between subnets and applies the firewall rules. When integrating a Layer 3 firewall in a VXLAN EVPN Fabric using Distributed Anycast Gateway, each of these zones must correspond to a VRF on the fabric. The traffic within a VRF will be routed by the fabric and traffic between the VRFs will be routed by the firewall.

Figure: Layer 3 Firewall Traffic Flow

The example below shows a configuration snippet from VTEP A running OSPF with the firewall. SVIs are defined on the VTEP for both INSIDE-VRF and OUTSIDE-VRF, and the VTEP peers with the firewall in each of these VRFs to dynamically learn the routing information needed to go from one VRF to the other.

FIREWALL Configuration:
int po10.3001
vlan 3001
nameif OUTSIDE
security-level 50
ip address 10.30.1.2 255.255.255.252

int po10.3002
vlan 3002
nameif INSIDE
security-level 100
ip address 10.30.2.2 255.255.255.252

router ospf 1
network 10.30.1.0 255.255.255.0 area 0
network 10.30.2.0 255.255.255.0 area 0

VTEP A Configuration

interface VLAN 3001
description outside_vlan
vrf member OUTSIDE-VRF
ip address 10.30.1.1/30
ip router ospf 1 area 0

interface VLAN 3002
description inside_vlan
vrf member Tenant-1
ip address 10.30.2.1/30
ip router ospf 1 area 0

router bgp 65500
vrf OUTSIDE-VRF
address-family ipv4 unicast
advertise l2vpn evpn
redistribute ospf 1 route-map OSPF_OUT
vrf Tenant-1
address-family ipv4 unicast
advertise l2vpn evpn
redistribute ospf 1 route-map OSPF_TENANT1

Inspecting these routes on VTEP A

show ip route ospf-1 vrf OUTSIDE-VRF


IP Route Table for VRF "OUTSIDE-VRF"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

192.168.100.0/24, ubest/mbest: 1/0


*via 10.30.1.2 Vlan3001, [110/41], 1w5d, ospf-1, intra

The OSPF routes are advertised by the VTEP into the VXLAN Fabric. All other VTEPs will import these routes in each VRF, pointing to VTEP A as the next hop. The example below shows the routing table on VTEP 1. VTEP A's IP address 10.30.1.2 (OUTSIDE-VRF) is the next hop.

VTEP1# show ip route 192.168.100.0/24 vrf OUTSIDE-VRF


IP Route Table for VRF "OUTSIDE-VRF"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

192.168.100.0/24 ubest/mbest: 1/0


*via 10.30.1.2%default, [200/41], 1w1d, bgp-65500, internal, tag 65500
(evpn) segid: 55555 tunnelid: 0xa010112 encap: VXLAN

Traffic from VTEP 1 will be encapsulated towards VTEP A, decapsulated and sent to the firewall. The firewall enforces the policy and sends the traffic back to VTEP A on the INSIDE-VRF. VTEP A will encapsulate the traffic and send it to the destination VTEP 2 where traffic is decapsulated and sent to the endpoint.

Firewall Failover
When the active firewall fails and the standby firewall takes over, routes are withdrawn from services VTEP A. As the previous standby becomes active, routes are now advertised to the fabric through services VTEP B.

If it is not desirable to run a dynamic routing protocol on the firewall, static routes pointing to the firewall as next hop are needed. It is critical to ensure that only the VTEP serving the active firewall is advertising the static route.

The first way to accomplish this task is to track active firewall reachability by validating that it is locally learned via HMM (Host Mobility Manager). The second approach is to configure the static route on all the compute VTEPs instead of the services VTEPs. Both approaches ensure that only the route towards the service VTEP with the active firewall is used.

The approach using HMM tracking ensures that if the active firewall is connected to VTEP A, only VTEP A will have and advertise the static route. VTEP A will track how the static route's next hop (firewall IP) is learned. VTEP A advertises the static route through redistribution only if the next hop is learned as an HMM route (directly connected). If the active firewall fails and the standby takes over, VTEP A starts to learn the next hop IP through BGP and VTEP B starts to learn the firewall's IP address as next hop through HMM. VTEP A will then withdraw the tracked routes and VTEP B starts advertising its routes into the fabric.
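
The tracking decision can be modeled in a few lines. The sketch below is a simplified model of the logic described above, not an implementation of HMM; the class name and the "hmm"/"bgp" labels are assumptions chosen for the illustration.

```python
from dataclasses import dataclass


@dataclass
class ServiceVtep:
    name: str
    # How this VTEP currently learns the firewall IP (the static route's next hop):
    # "hmm" = locally attached host route, "bgp" = learned from the fabric.
    next_hop_learned_via: str

    def advertises_static_route(self) -> bool:
        """The tracked static route is only installed and redistributed
        while the next hop is a locally learned (HMM) route."""
        return self.next_hop_learned_via == "hmm"


def advertisers(vteps: list[ServiceVtep]) -> list[str]:
    """Names of the VTEPs that currently advertise the static route."""
    return [vtep.name for vtep in vteps if vtep.advertises_static_route()]


# Active firewall attached to VTEP A.
before_failover = [ServiceVtep("VTEP A", "hmm"), ServiceVtep("VTEP B", "bgp")]
# Standby firewall behind VTEP B has taken over.
after_failover = [ServiceVtep("VTEP A", "bgp"), ServiceVtep("VTEP B", "hmm")]

print(advertisers(before_failover))  # ['VTEP A']
print(advertisers(after_failover))   # ['VTEP B']
```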

For example:

VRF context Tenant-1
ip route 0.0.0.0/0 10.30.2.2 track 10

track 10 ip route 0.0.0.0/0 reachability hmm
vrf member Tenant-1

VTEPA# show track 10


Track 10
IP Route 0.0.0.0/0 Reachability
Reachability is UP

VTEPA# show ip route 0.0.0.0 vrf Tenant-1


IP Route Table for VRF "Tenant-1"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

0.0.0.0/0, ubest/mbest: 1/0


*via 10.30.2.2 [1/0], 00:00:08, static

Firewall Failure on VTEP A caused the track to go down, causing VTEP A to withdraw the static route

VTEPA# show track 10


Track 10
IP Route 0.0.0.0/0 Reachability
Reachability is DOWN

VTEPA# show ip route 0.0.0.0 vrf Tenant-1


IP Route Table for VRF "Tenant-1"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

Route not found



In the second approach, where the static route is configured on the compute-attached VTEPs (VTEP 1 and VTEP 2), no additional configuration is necessary, as recursive route lookup will ensure that the static route is only active if the next hop is reachable. Only the VTEP with the active firewall will advertise the firewall IP. This approach ensures that traffic will only be routed towards the VTEP with the active firewall.

VRF context Tenant-1


ip route 0.0.0.0/0 10.30.2.2

VTEP1# show ip route 0.0.0.0 vrf Tenant-1

IP Route Table for VRF "Tenant-1"


'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

0.0.0.0/0, ubest/mbest: 1/0


*via 10.30.2.2 [1/0], 00:00:08, static

VTEP1# show ip route 10.30.2.2/32 vrf Tenant-1

IP Route Table for VRF "Tenant-1"


'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

10.30.2.2/32 ubest/mbest: 1/0


*via 10.254.254.111%default, [200/41], 1w1d, bgp-65500, internal, tag
65500 (evpn) segid: 50000 tunnelid: 0xa010112 encap: VXLAN

Firewall Failure on VTEP A (10.254.254.111) caused the recursive lookup to change toward VTEP B (10.254.254.112)

VTEP1# show ip route 10.30.2.2/32 vrf Tenant-1

IP Route Table for VRF "Tenant-1"


'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

10.30.2.2/32 ubest/mbest: 1/0


*via 10.254.254.112%default, [200/41], 00:00:01, bgp-65500, internal, tag
65500 (evpn) segid: 50000 tunnelid: 0xa010112 encap: VXLAN
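
The recursive lookup shown in these outputs can be reproduced with a small model. This is a sketch under simplifying assumptions: the routing table is a plain dictionary, the helper names are hypothetical, and 203.0.113.10 is an arbitrary external destination. The firewall and VTEP addresses are the ones from the outputs above.

```python
import ipaddress


def lookup(table: dict[str, str], destination: str):
    """Longest-prefix match; returns the next hop or None if no route matches."""
    dest = ipaddress.ip_address(destination)
    matches = [p for p in table if dest in ipaddress.ip_network(p)]
    if not matches:
        return None
    best = max(matches, key=lambda p: ipaddress.ip_network(p).prefixlen)
    return table[best]


def resolve_vtep(table: dict[str, str], destination: str, vteps: set[str]):
    """Follow next hops recursively until a VTEP address is reached.
    Returns None when the next hop cannot be resolved to a VTEP."""
    hop = lookup(table, destination)
    seen = set()
    while hop is not None and hop not in vteps:
        if hop in seen:  # the next hop only resolves via itself: unreachable
            return None
        seen.add(hop)
        hop = lookup(table, hop)
    return hop


VTEPS = {"10.254.254.111", "10.254.254.112"}  # VTEP A and VTEP B

# Static default route plus the EVPN host route for the firewall IP.
table = {"0.0.0.0/0": "10.30.2.2", "10.30.2.2/32": "10.254.254.111"}
print(resolve_vtep(table, "203.0.113.10", VTEPS))  # 10.254.254.111

# After the failover, VTEP B advertises the firewall host route.
table["10.30.2.2/32"] = "10.254.254.112"
print(resolve_vtep(table, "203.0.113.10", VTEPS))  # 10.254.254.112
```

The static route itself never changes on the compute VTEPs; only the host route for the firewall IP moves, which is why no tracking configuration is needed in this approach.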

Integrating Application Delivery Controllers


ADC is another category of network service that many applications require. The introduction described the different deployment modes to bring an ADC into a network. This section focuses on the one-arm design with source NAT.

In the one-arm deployment model, the traffic flow is as follows:

• Client traffic enters the interface presenting the virtual IP address (VIP)
• ADC decides which real server to send the request to
• ADC then translates the destination address, which was previously the VIP, to the IP address of the real server
• The request towards the real server exits the same interface the client request came in on
• The source IP address is translated via source NAT
• The real server will see the ADC IP address as the source IP
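
The address rewriting in the steps above can be sketched as follows. This is a minimal model of destination NAT plus source NAT on a one-arm ADC; the `Packet` type, the function names and all IP addresses are hypothetical and chosen only for the illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


def one_arm_forward(pkt: Packet, vip: str, adc_ip: str, real_server: str) -> Packet:
    """Rewrite a client request on a one-arm ADC:
    destination NAT (VIP -> real server) and source NAT (client -> ADC)."""
    if pkt.dst != vip:
        raise ValueError("packet is not addressed to the VIP")
    return replace(pkt, src=adc_ip, dst=real_server)


def server_reply(request: Packet) -> Packet:
    """The real server answers to whatever source address it saw."""
    return Packet(src=request.dst, dst=request.src)


client_request = Packet(src="198.51.100.7", dst="10.40.0.100")
to_server = one_arm_forward(
    client_request, vip="10.40.0.100", adc_ip="10.40.0.10", real_server="10.50.0.21"
)
reply = server_reply(to_server)

print(to_server)  # Packet(src='10.40.0.10', dst='10.50.0.21')
print(reply)      # Packet(src='10.50.0.21', dst='10.40.0.10')
```

Because of the source NAT, the reply is addressed to the ADC rather than to the client, which keeps the return path through the same appliance.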

The ADC is connected to a service VTEP or a pair of service VTEPs with vPC. ADCs are commonly deployed as a High-Availability pair. Within the HA pair, the active ADC advertises the VIP to the service VTEP. This can be achieved by simple MAC/IP learning advertised in the VXLAN Fabric as an EVPN Type-2 route. Alternatively, the ADC can be implemented with a dynamic routing protocol and advertise the VIP as an EVPN Type-5 route.

Figure: ADC Traffic Flow

Traffic flow is as follows:

• Client traffic will be encapsulated by VTEP 1 towards services VTEP A
• VTEP A decapsulates and sends the traffic to the active ADC
• The ADC sends the traffic destined to the real server back to services VTEP A
• VTEP A encapsulates and sends the traffic to the destination VTEP 2
• Traffic gets decapsulated at VTEP 2 and sent to the real server
• The response back from the real server is sent back to the ADC, since the ADC performed source NAT

Application Delivery Controller Failover


When the active ADC fails and the standby one takes over, routes are withdrawn from services VTEP A. The newly active ADC advertises the VIP on service VTEP B, the same way the previously active ADC did on VTEP A. Service VTEP B is now responsible for requests towards the ADC VIP.

Integrating Service Chaining


Typically, organizations find that Layer 4-Layer 7 services are not inserted individually, but as a chain. For example, a certain application might require security services from a firewall, and load balancing services from an Application Delivery Controller. Tying these network services together is commonly referred to as a service chain.

Consideration needs to be given to placement of the devices so traffic does not take excessive hops across the fabric when going between the firewall and the load balancer. It is common to have multiple firewalls and ADCs connected to a dedicated pair of switches acting as service nodes. In this design, the appliances in the VXLAN Fabric are consolidated behind a service node pair. Traffic flow is as follows:

• Traffic will be VXLAN-encapsulated from the client VTEP 1 towards the services VTEP A.
• The service VTEP responsible for the active firewall decapsulates and sends the traffic to the active firewall.
• The firewall then sends the traffic towards the ADC's VIP address. This is done with the assumption that the firewall and the ADC are connected to the same service VTEP. If firewall and ADC are on different VTEPs, traffic will be VXLAN-encapsulated towards the service VTEP hosting the ADC.
• ADC then sends the traffic destined to the real server back to the services VTEP, which encapsulates and sends it to the destination VTEP 2.
• Traffic gets decapsulated at VTEP 2 and sent to the real server.
• The response back from the real server is sent back to the ADC as the ADC is using source NAT. With the usage of source NAT, the X-Forwarded-For HTTP header field is inserted to preserve client IP address visibility. Subsequently, the traffic will be inspected by the firewall on its way back to the client.

The diagram below shows a logical representation of a service chain.

Figure: Layer 4-Layer 7 Service Chain

The diagram below shows a physical representation of VXLAN Fabric with a dedicated service VTEP pair. Firewalls and ADCs are commonly connected to the services VTEPs. This can be achieved with or without vPC (vPC shown in diagram).

Figure: VXLAN Fabric with Service Leaf



To avoid additional encapsulations and decapsulations, affinity can be created between the active firewall and the active ADC, and they can be placed on the same services VTEPs.
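
The effect of this affinity can be counted with a short sketch. It assumes a simplified model in which every move between two different VTEPs costs one VXLAN encapsulation and decapsulation; the function name and the placements are hypothetical.

```python
def vxlan_traversals(attachment_vteps: list[str]) -> int:
    """Count how often traffic must be VXLAN-encapsulated along a service chain.

    attachment_vteps lists, in path order, the VTEP each element of the chain
    is attached to: client, firewall, ADC, real server.
    """
    return sum(
        1
        for current, following in zip(attachment_vteps, attachment_vteps[1:])
        if current != following
    )


# Firewall and ADC both attached to services VTEP A (affinity).
same_service_vtep = ["VTEP 1", "VTEP A", "VTEP A", "VTEP 2"]
# Active firewall on VTEP A, active ADC on VTEP B.
split_service_vteps = ["VTEP 1", "VTEP A", "VTEP B", "VTEP 2"]

print(vxlan_traversals(same_service_vtep))    # 2
print(vxlan_traversals(split_service_vteps))  # 3
```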
Multi-POD & Multi-Site Designs

Introduction

In an increasingly competitive, globally connected business environment, organizations are faced with enormous pressures to ensure continuous availability of critical business applications. With digital strategies driving innovative new business opportunities, these organizations are looking for IT infrastructures that offer the agility, performance and availability required to support these new application infrastructures.

When building the IT infrastructure to support these business critical environments, today's data center deployments require geographical diversity and scale, ensuring the ability to deliver rapid scale, high performance and "always on" availability.

As a consequence, data center networks are being built as scalable, highly available network fabrics which are distributed across multiple data centers, whether separated within or across a metro area, or across the globe.

This chapter presents different deployment options for the interconnection of VXLAN Fabrics, distinguishing between the multi-POD and multi-site approaches based on the specific needs for scalability and availability and the existing physical and operational constraints.

Fundamentals

A Point of Delivery (POD) is a network building block which can easily be replicated within a data center. The predictable and homogeneous characteristics of a POD provide self-containment and a pre-assigned scale and performance requirement (POD planning). The architecture of a POD should be modular to allow for it to be replicated and interconnected, keeping a homogeneous design.

Figure: 3 Tier Architecture

In classic hierarchical network design, the POD is formed by the Access and Aggregation Layer, where the Aggregation Layer provides the Layer 2 demarcation. Layer 2 traffic is terminated and routed across the Core to reach other PODs or external networks. With the demarcation at the Aggregation Layer, a Layer 2 VLAN or an IP Subnet is localized within a single POD; therefore, Layer 2 communication between PODs is not possible. As a consequence, host mobility across PODs is difficult to implement.

Figure: 2-Stage Clos Architecture

With the evolution of the hierarchical Core/Aggregation/Access design into spine-leaf topologies, the location of functions and interconnectivity within a POD shifts. Simply migrating the topology from a hierarchical design to spine-leaf does not bring about a change in functions; however, the addition of overlays introduces greater versatility. With Integrated Route and Bridging (IRB) and VXLAN, the leaf not only provides the default gateway but also a Layer 2 bridging service to other leaf switches. With this approach, it is possible to extend Layer 2 services beyond a single POD by using an overlay with end-to-end encapsulation. When structuring multiple PODs and enabling extended Layer 2 and Layer 3 services, use cases such as host mobility are now easier to implement.

Figure: Interconnecting Two Clos Networks

The interconnection within a multi-POD site can be achieved in various ways. Spines can be interconnected back to back, an additional super-spine layer can be introduced, or PODs can be interconnected at designated leaf switches.

When multiple physical locations are present, multi-site designs come into consideration. A site defines a set of PODs (multi-POD) which share the same domain constructs, providing the same set of Layer 2 and Layer 3 segments at a given physical location. As a result, within a given site the end-to-end overlay encapsulation starts at a leaf in one POD and can extend to a leaf in another POD.

Multi-POD designs can be stretched across physical locations; however, this is not recommended given the high availability requirements for geographically dispersed data centers. Further design aspects are covered throughout this chapter.

In a multi-site design, the most significant aspects to consider are how to connect the sites with each other at the control and data plane levels. When considering north-south connectivity, the first option in a multi-POD design is to consolidate all external connectivity into a single point of access. Alternatively, a single point of access can be defined for each individual POD to provide distributed ingress and egress forwarding.

The east-west communication has to solve the challenge of connecting sites to each other, allowing workload mobility, but at the same time isolating the sites so that they are independent from each other from a business continuity perspective. A fault in one site should not propagate to the other.

Why Deploy Multiple PODs?


The optimal way to efficiently scale a system is through modularity. Any monolithic architecture will only grow to a certain point, after which inefficiencies will appear. A data center is an example of a system that requires a flexible way to scale the network infrastructure. Frequently a data center build-out starts in a single room and later expands across multiple rooms.

Besides scale, physical facility and infrastructure layouts can be another motivation for multi-POD designs. Multi-POD designs fit very well in situations where a physical location is partitioned across multiple rooms with limited cabling, but maintaining end-to-end Layer 2 and Layer 3 connectivity is still required. Any service within one POD can be made available to any other POD within this multi-POD topology. As an example, consider a high availability (HA) cluster being deployed at a single physical location but spread across different rooms due to the site's local HA capabilities (different Power Distribution Units (PDU), Uninterruptible Power Supplies (UPS), etc.).

Why Deploy Multiple Sites?


Modern data center environments must meet the needs for high availability within the data center and across geographically-distributed data center infrastructure. This type of distributed architecture offers multiple benefits for highly available application delivery. Applications can be delivered in an active/active or active/standby deployment model and form the foundation for an effective business continuity or disaster recovery strategy.

There are many factors which determine the applicability and design of the multi-site data center environment, including physical constraints such as site location and requirements for geographical diversity. Other considerations include bandwidth and service availability for infrastructure such as dark fiber or wavelength service, and latency which may impact application performance. These factors determine the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) for application availability.

In contrast to a single site deployment, a networking solution for multiple sites must also address the need to maintain a level of separation. Any event, whether planned or unplanned, impacting one site should not spread to any other site as it would impact overall application availability.

When deploying a network infrastructure based on VXLAN EVPN, the consistent delivery of Layer 2, Layer 3 and IP multicast services must be maintained. Together, these allow for the delivery of distributed application architectures and geographically-dispersed clustered infrastructure to support highly available storage access and compute virtualization.

Design criteria to be considered for such deployments include:

• Physical Connectivity: In many cases, given the constraints outlined above, the availability of connectivity services may be limited. As an example, dark fiber or wavelength services availability may be limited or cost-prohibitive over large distances, whereas a routed Layer 3 or MPLS service may be readily available at an achievable price point. The design must take into consideration the need to allow for multiple connection types ranging from high bandwidth dark fiber through to bandwidth-constrained service provider-delivered Layer 3 services.
• Fault Isolation: When connecting multiple discrete network environments together, the risk of a failure event propagating between sites increases significantly unless controls are applied to restrict the control plane and data plane activity. Examples include selection and configuration of control plane protocols such as BGP, and the control or restriction of data plane activity such as ARP suppression/spoofing and storm control.

Based on these criteria, the multi-site solution must deliver the appropriate set of features and functionality required to meet the specific demands of a particular deployment.

In subsequent chapters the options for multi-POD and multi-site deployment are explored further, including back-to-back vPC, OTV, and PBB-EVPN for a comprehensive DCI solution in order to maintain control plane and data plane isolation and at the same time provide workload mobility.

Multi-POD Design

With a multi-POD design, multiple data center PODs are interconnected using redundant Layer 3 paths and run the same VXLAN EVPN control plane. Each POD can have its own Clos Fabric architecture with independent spine and leaf layers. Physical connectivity between PODs can be established by interconnecting them on either the leaf or spine layer. This section discusses different options in the multi-POD fabric design.

Placement of Inter-POD Connecting Points


The physical connections between PODs provide the Layer 3 path for the EVPN control and data planes, as a part of the underlay IP transport network. Therefore, the inter-POD links do not bear any special functional requirements for VXLAN EVPN. The minimum requirements for inter-POD links are Layer 3 IP unicast and multicast routing in the underlay. Placement of the inter-POD connecting points is flexible as it can be either on a leaf node or a spine node. Considerations around link speeds, optic types, and/or the cabling plan could be driving factors for the decision to interconnect PODs on leaf or spine nodes. The figure below illustrates the topology options.

Figure: Multi-POD Topology with Inter-Connections on Leaf Nodes or Spine Nodes



To achieve high availability for inter-POD connectivity, the recommendation is to leverage redundant inter-POD paths using two or more devices in each POD. Since the inter-POD connections only need to provide Layer 3 connectivity, the redundant interconnecting devices do not need to be in a vPC pair.

Scale Multi-POD with Multi-Stage Clos Architecture


When the number of PODs increases, a simple yet scalable multi-POD design becomes an important decision point. Following the n-stage data center fabric design principle, one design option to connect multiple data center PODs is to introduce a super-spine layer that interconnects the spine layer of each POD.

This essentially builds a multi-stage hierarchical fabric topology. MP-BGP EVPN runs between the Fabric nodes to distribute the VXLAN EVPN routes. This multi-stage fabric design with a super-spine layer simplifies the interconnection topology among PODs, making it easier to scale the number of PODs. It is the most efficient way of providing consistent forwarding hop counts for inter-POD traffic. If underlay multicast replication is used to transport the VXLAN Fabric BUM traffic, a multi-stage Clos Fabric design also helps reduce the number of Multicast Output Interfaces (OIF) required. Since most switch platforms support a limited number of multicast OIFs, the super-spine will allow the VXLAN Fabric to scale without exceeding the maximum number of OIFs supported on a single spine device.

Figure: Multi-Stage Fabric



Scalability
When designing for control plane scale for an inter-POD Fabric, platform OIFs, multicast groups, and VTEPs need to be considered in addition to host MAC and MAC/IP entries. It is important to consult the verified scalability guidelines for the hardware in use.

For example, in a simple multi-POD scenario, if the spine supports 256 OIFs, then subtract 2 OIFs for the uplinks towards the L3 core, leaving 254 OIFs for southbound connectivity to the leafs in the vPC domains. This would give 254 leafs, or 127 vPC domains, to connect southbound if each leaf in the vPC domain has a single link to each spine in the POD.
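
The arithmetic in this example can be written out as a short Python sketch. It is illustrative only: the function name and its parameters are assumptions made for this example, and real limits must come from the verified scalability guide of the platform.

```python
def southbound_capacity(spine_oifs, uplink_oifs, links_per_leaf=1):
    """Return (leafs, vpc_domains) that one spine can serve southbound.

    Assumes every leaf consumes `links_per_leaf` OIFs on the spine and
    that each vPC domain is made of exactly two leafs.
    """
    if uplink_oifs > spine_oifs:
        raise ValueError("uplinks exceed the OIFs supported by the spine")
    available = spine_oifs - uplink_oifs
    leafs = available // links_per_leaf
    return leafs, leafs // 2


# Values from the text: 256 OIFs on the spine, 2 reserved for L3 core uplinks.
leafs, vpc_domains = southbound_capacity(256, 2)
print(leafs, vpc_domains)  # 254 127
```

Changing `links_per_leaf` shows how quickly the budget shrinks when each leaf uses more than one link to the same spine.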

Looking closer at the above example, both vPC VTEP switches independently send the IP PIM register to the Rendezvous Point for the multicast group of the VXLAN VNI. Both source the register packets from the anycast VTEP address and each installs the corresponding (*, G) entry in its multicast routing table with the VTEP interface (NVE1) in the output interface (OIF) list.

In addition, consideration needs to be given to host MAC and IP scale per leaf. A leaf will learn all BGP routes across the multi-POD environment but will not program the hardware tables (Forwarding Information Base / Routing Information Base, FIB/RIB) unless the leaf needs to know about them. If the leaf knows about the VRF and is importing the route-targets, it will program the RIB for the MAC/IP routes. In addition, the leaf only programs the FIB with the MAC addresses of the VNIs of the VRFs it has locally defined.
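
The selective programming described above can be pictured as two filters applied to the BGP table. The following Python sketch is a simplified model, not the switch implementation: the `EvpnRoute` type, the single route-target per route, and all addresses and values are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvpnRoute:
    """Hypothetical, simplified EVPN MAC/IP route (one route-target only)."""
    mac: str
    ip: str
    l2vni: int
    route_target: str


def program_tables(bgp_routes, imported_rts, local_l2vnis):
    """Return (rib, fib_macs) for one leaf.

    RIB: routes whose route-target the leaf imports.
    FIB: MAC addresses of RIB routes whose L2 VNI is defined locally.
    """
    rib = [r for r in bgp_routes if r.route_target in imported_rts]
    fib_macs = [r.mac for r in rib if r.l2vni in local_l2vnis]
    return rib, fib_macs


# The BGP table holds every route learned across the multi-POD fabric.
routes = [
    EvpnRoute("0000.0000.000a", "10.1.1.10", 30001, "65000:30001"),
    EvpnRoute("0000.0000.000b", "10.1.2.10", 30002, "65000:30002"),
    EvpnRoute("0000.0000.000c", "10.9.9.10", 30099, "65000:30099"),
]
rib, fib_macs = program_tables(
    routes,
    imported_rts={"65000:30001", "65000:30002"},
    local_l2vnis={30001},
)
print(len(rib), fib_macs)  # 2 ['0000.0000.000a']
```

In this model the leaf keeps three routes in BGP, two in the RIB, and a single MAC entry in the FIB, which is why per-leaf hardware scale depends on what is configured locally rather than on the size of the whole fabric.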


Building the Overlay

Data Plane Operation

As in a single POD, the tunnels run between VTEP devices in a multi-POD fabric. In a multi-POD fabric the tunnel headend VTEPs can reside in different PODs if the traffic traverses the POD boundary, which results in an inter-POD tunnel. VXLAN encapsulation and decapsulation only take place on the ingress and egress VTEPs. The other devices along the forwarding path only need to route the encapsulated VXLAN packets. This provides very efficient end-to-end, single-tunnel overlay data plane processing.

Figure: Data Plane Operation

IP Gateway Localization
In networks without Distributed Anycast Gateway the default gateway is made redundant through the use of a First Hop Redundancy Protocol (FHRP). When a network segment spans across multiple physical locations, the same concept can force all traffic through a single VTEP. Alternatively, you can provide localization by having an active instance of the default gateway in each location. Using localization provides a more optimal forwarding path between subnets within the same location. If application workload mobility is required between locations, it is important to maintain the same default gateway IP and MAC address. With gateway localization, endpoints do not need to re-learn this information at the new location.

A multi-POD VXLAN Fabric with Distributed Anycast Gateway makes it easy to meet this requirement. Similar to a single POD, all the VTEPs serving the same IP subnet can use the Distributed Anycast Gateway.

Figure: Gateway Localization with Distributed Anycast Gateway

In the illustration above, the VTEP leafs in blue are Distributed Anycast Gateways for Layer 2 VNI "Blue" while the VTEP leafs in green are Distributed Anycast Gateways for Layer 2 VNI "Green".

Control Plane Operation


The EVPN MP-BGP control protocol runs throughout a multi-POD fabric the same way as it does within a single POD. The fabric nodes running EVPN exchange MP-BGP EVPN routes with one another. Each VTEP device detects its local endpoints and installs HMM routes for endpoint tracking. The HMM routes are automatically injected into the MP-BGP EVPN address-family and distributed to other EVPN nodes as EVPN type-2 routes. Upon receiving the EVPN routes, the rest of the VTEP devices will install the endpoint reachability information into their L2 RIB and L3 RIB tables. Further programming of the hardware forwarding tables, including the MAC-address table and host/LPM forwarding tables based on the RIB information, will happen if they possess the corresponding L2VNI and L3VNI information.

Placement of MP-BGP EVPN Peering


In a VXLAN Fabric the same EVPN control plane runs throughout the entire environment. Within a POD, EVPN sessions are formed between leaf and spine nodes. Between PODs, EVPN peering does not necessarily need to coincide with the physical connection topology. The following drawing depicts the available designs in which EVPN MP-BGP peering occurs between the connected leafs or between the spine nodes of different PODs.

Often, switch hardware platforms with more control plane capacity and higher bandwidth are chosen for the spine layer. Also, due to their centralized location in a POD, the spine nodes are often chosen as the control point for MP-BGP EVPN route distribution. For example, in an MP-iBGP Fabric, the spine nodes are often chosen to be the iBGP route reflectors. In this case, peering on the spine nodes between PODs can take advantage of the more scalable control plane and the complete set of EVPN routing information on the spine nodes.

MP-iBGP vs MP-eBGP
MP-BGP EVPN distributes the Layer 2 and Layer 3 reachability information for the VXLAN overlay network. It supports both iBGP and eBGP topologies, which provides the design flexibility to run MP-BGP in a multi-POD environment. It is not within the scope of this book to document all the possible combinations of iBGP and/or eBGP designs in a multi-POD Fabric. The common practice designs will be discussed to illustrate the design principles.

The figure below describes a common multi-POD design in which each POD runs MP-iBGP EVPN between leafs and spines whereas MP-eBGP EVPN is used to interconnect the PODs. The drawing does not indicate any physical topology for connecting multiple PODs together; rather, it depicts the peering topology. Conceptually, the Route Reflectors (RR) of different PODs are exchanging EVPN routes via MP-eBGP so that reachability information can be extended from one POD to another.

Figure: eBGP Peering Among PODs

With this design, BGP peering among multiple PODs is simple. EVPN routes can be distributed among PODs through MP-eBGP peering without the need for additional configuration. However, consideration needs to be given to how the attributes in an EVPN route are preserved when it is distributed within the Fabric, as eBGP default behavior may cause some of the attributes to be overwritten:

• By default, a router overwrites the next-hop in the route to itself when sending a route to its eBGP peers.
• If each AS generates EVPN route-targets (RT) automatically, the PODs may end up having different RTs for the same L2VNI or L3VNI, as the auto-RT function often uses the BGP AS number as one of the elements to derive EVPN RTs. Additional caution therefore needs to be applied when configuring the EVPN RT import and export policies, to ensure that routes within the same VNI have the same import/export RTs on VTEPs in different PODs so that route distribution is complete end-to-end.
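
The route-target mismatch described in the second bullet can be illustrated with a small Python sketch. The "ASN:VNI" derivation used here is an assumption made for the example (the text only states that the AS number is one of the elements), and the AS numbers, VNI, and function names are hypothetical.

```python
def auto_rt(asn, vni):
    """Derive a route-target as "ASN:VNI" (assumed format for illustration)."""
    return f"{asn}:{vni}"


def rts_compatible(export_rts, import_rts):
    """A route is imported when at least one exported RT is also imported."""
    return bool(set(export_rts) & set(import_rts))


VNI = 30001
pod1_rt = auto_rt(65001, VNI)  # POD 1 runs AS 65001 (example value)
pod2_rt = auto_rt(65002, VNI)  # POD 2 runs AS 65002 (example value)

# Auto-derived RTs differ, so POD 2 would not import POD 1's routes.
print(rts_compatible([pod1_rt], [pod2_rt]))  # False

# A manually configured, common RT restores end-to-end distribution.
common_rt = "65000:30001"
print(rts_compatible([pod1_rt, common_rt], [pod2_rt, common_rt]))  # True
```

The sketch only models the matching logic; how the common RT is configured or rewritten on a given platform is outside its scope.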

Another design is to use a single BGP AS across all PODs so that the multi-POD Fabric runs EVPN MP-iBGP.

Figure: iBGP Peering Among PODs

With this design, additional iBGP design principles need to be applied to ensure EVPN routes are distributed end-to-end through the BGP AS. As a loop prevention mechanism, iBGP has the rule that routes learned from one iBGP peer will not be advertised to another iBGP peer. That is the reason why iBGP route reflectors (RR) are needed to reflect routes between the peers. Because the RRs from different PODs in this multi-POD design are further interconnected within the same BGP AS, there is a need for another layer of RRs to pass MP-iBGP EVPN routes among the POD RRs. However, by design MP-iBGP preserves EVPN route attributes better than MP-eBGP:

• iBGP by design preserves the BGP next-hop. Therefore, when an EVPN route is distributed within an iBGP topology, the originating VTEP address will be preserved in the BGP next-hop.
• iBGP does not change the EVPN Route-Target (RT) value while distributing the routes.

• The auto-RT function will generate the same EVPN RT for the same VNI across different PODs. This ensures that VTEPs in different PODs will have consistent import and export RT values for the same VNI.

When comparing the two common EVPN MP-BGP designs, each of them offers simplicity in one aspect while introducing complexity in another. The following table summarizes the comparison.

                          MP-eBGP    MP-iBGP
BGP Peering               Simple     Complex
EVPN Route Distribution   Complex    Simple

If a multi-POD Fabric is deployed in a network that already has a BGP deployment, the decision on whether to use MP-iBGP or MP-eBGP peering will depend on the existing BGP deployment. It is worth noting that in the context of multi-POD design, the advantage of using different BGP ASs for better control plane segmentation is not significant, as the entire multi-POD fabric is under the same administrative scope and MP-BGP EVPN domain.

Building the Underlay

Cabling

Most existing cabling infrastructures were designed to handle 3-tier physical cable layouts, which may lend themselves to a multi-POD topology due to limited cabling capacity. In N+1 POD environments, a cabling infrastructure providing central core connectivity will address scale-out and facilitate adding more capacity. Another cabling option is multi-POD using dark fiber, MAN or DWDM.

Figure: Cabling Infrastructure

IP Multicast Replication vs Ingress Replication


The MP-BGP EVPN control plane is used to discover endpoints and exchange host information between leaf nodes, while IP Multicast or Ingress Replication is used for BUM traffic. There are two ways to replicate BUM traffic throughout the fabric: either one-to-many (IP multicast), or one-to-one repeated many times (ingress replication). ARP requests are one example of BUM traffic that needs to be replicated using a multicast group or ingress replication.

Ingress replication can have scale issues as the switch needs to replicate BUM packets as many times as there are VTEPs that own the VNI needing to see that traffic. As an example, with 50 VTEPs that own the same VNI and require BUM traffic, replication needs to be performed 50 times. Replicated BUM transmissions consume a lot of bandwidth in the network. In contrast, IP multicast across a multi-POD environment is a much more scalable solution to handle BUM traffic as the fabric natively provides the capabilities for the required replication. IP multicast reduces network load, improves performance, and increases scalability across multi-POD environments.
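
The difference in load on the ingress VTEP can be made concrete with a short Python sketch. It is a back-of-the-envelope model only: the function names and the 10 Mbps BUM rate are assumptions for illustration, and the count of 50 follows the example in the text.

```python
def bum_copies_sent_by_source(receiving_vteps, use_multicast):
    """Copies of one BUM packet that the ingress VTEP puts on the wire.

    Ingress replication: one unicast copy per VTEP that needs the traffic.
    IP multicast: a single copy; the underlay replicates it where paths fork.
    """
    if receiving_vteps < 0:
        raise ValueError("receiving_vteps must not be negative")
    if receiving_vteps == 0:
        return 0
    return 1 if use_multicast else receiving_vteps


def uplink_load_mbps(bum_rate_mbps, receiving_vteps, use_multicast):
    """BUM load the ingress VTEP places on its uplinks."""
    return bum_rate_mbps * bum_copies_sent_by_source(receiving_vteps, use_multicast)


# 50 VTEPs own the VNI and need the BUM traffic (value from the text).
print(bum_copies_sent_by_source(50, use_multicast=False))  # 50
print(bum_copies_sent_by_source(50, use_multicast=True))   # 1

# Hypothetical 10 Mbps of BUM traffic offered by local hosts.
print(uplink_load_mbps(10, 50, use_multicast=False))  # 500
print(uplink_load_mbps(10, 50, use_multicast=True))   # 10
```

With ingress replication the load grows linearly with the number of VTEPs in the VNI, which is the scaling concern raised above.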

When an anycast RP is configured, the restriction of having one active RP per multicast group is lifted; instead, redundant RPs are deployed for the same group range. The RP routers share a single unicast IP address between PODs. This method provides RP redundancy and load sharing within the domain. Sources from one RP are known to other RPs in other PODs using the Multicast Source Discovery Protocol (MSDP). Sources and receivers use the closest RP, as determined by the IGP. During an RP failure, sources and receivers seamlessly fail over to a new RP based on the underlay routing domain. In multi-POD environments, PIM-SM RP and RP redundancy should be positioned locally inside of each respective POD. Anycast RP clustering can be used for RP redundancy across PODs, but for better control of the multicast environment MSDP is the recommended solution to connect multiple PIM-SM domains.

Figure: MSDP for Inter Site Multicast



More details and configuration examples can be found at http://www.cisco.com/c/en/us/support/docs/ip/ip-multicast/115011-anycast-pim.html

Multi-POD Routing Design


Multi-POD IP Routing designs need to take into consideration the requirements for both underlay and overlay. The underlay will establish Layer 3 connectivity between VTEPs deployed across multiple PODs.

The underlay should only consist of VTEP reachability information for all networking devices in the fabric.

When interconnecting an EVPN multi-POD environment, it is important to maintain as much POD independence as possible. In very large multi-POD environments it may be beneficial to have multiple IGP areas to improve fault tolerance across the PODs. As an example, each POD can optionally be a Stub Area. As a Stub Area, each DC area knows its own topology and has a default route towards the border leaf, while the backbone area has a view of the full multi-POD Fabric. However, for most designs a single area across multiple PODs will suffice as simplicity outweighs complexity.

The underlay can be built with any routing protocol. BGP may not be the best choice as an underlay protocol as it is a path vector routing protocol and it does not take into account link speed or path cost, and in a multi-POD environment multiple paths with different link speeds might be used to interconnect the PODs. Driving simplicity in the routing design in the underlay will help to improve overall convergence in the overlay. Tuning IGP timers may help improve convergence time; however, there is no generic recommendation, and this must be qualified and validated for each deployment.

IGPs such as OSPF or IS-IS would be a better option for underlay routing. Please refer to the Single-POD design chapter for a more detailed discussion on routing protocols.

Service Integration
In a multi-POD design, it is a recommended practice to have all the services infrastructure such as firewalls or load balancers connected to a separate services node POD. This helps with scalability and high availability for services across a multi-POD design.

Design Options

When connecting data center sites based on VXLAN Fabrics, there are a number of design considerations which will determine the overall performance, availability, and scale of the environment.

These design considerations include the following aspects.

• Determining the Inter-Site Border Connection Points: The Fabric border provides an edge function to allow for external connectivity in and out of the Fabric and also provides an attachment point for the DCI services which deliver the required inter-site connectivity. Although the Fabric border for Layer 2/Layer 3 External Connectivity and the Fabric border for DCI services have similar characteristics, they may or may not be combined depending on factors detailed in the External Connectivity chapter.
• DCI Service Delivery: An appropriate selection of DCI service will be a primary factor in the multi-site design as each will have different properties as explained further in the External Connectivity chapter.

• L3 services including L3VPN, VRF Lite, LISP or VXLAN
• L2 services including Ethernet over Dark Fibre/DWDM, OTV, PBB-EVPN, MPLS EVPN, VPLS or VXLAN

Figure: Multi-Site DCI over L3 Service

Figure: Multi-Site DCI over L2 Service


Border Leaf Scalability


The border leaf provides the attachment point between multiple Fabrics, delivering Layer 3 routing and Layer 2 extension between sites. In order to perform host routing for the traffic traversing the DCI transport, it must also maintain a host routing table in hardware for connectivity within and across multiple sites.

The key factors which typically determine multi-site scale at Layer 2 and Layer 3 include:

• Virtual Network Identifiers (VNI) - Layer 2 and Layer 3
• MAC Addresses
• IP Host Routes (IPv4/IPv6)

Building the Multi-Site Inter-Connectivity

The purpose of creating multiple sites is to ensure that any impact in one availability zone will have minimal to zero impact on the other availability zones. An independent fabric is one that has its own control plane and data plane. Multi-site provides the ability to interconnect the independent fabrics using a DCI solution such as back-to-back vPC, OTV, EVPN, PBB-EVPN and Layer 3 connectivity with VRF-lite, MPLS or LISP.

A continuously available, active/active, flexible environment provides several benefits to the business:

• Increased uptime
• Disaster avoidance
• Easier maintenance
• Flexible workload placement
• Extremely low RTO

It is important to remember that host reachability information is contained within a single site and extended using a DCI technology. The Layer 3 diagrams below demonstrate independent control planes in each site and will highlight how to extend Layer 2 connectivity.

Figure: Layer 3 DCI for VXLAN Interconnect

To provide a true active/active architecture, Layer 4-Layer 7 services such as firewalls
and ADCs must also be integrated. Cisco Adaptive Security Appliance (ASA) provides
support for multi-site active/active firewall clustering with sites located hundreds of
kilometers/miles apart.

Layer 2 Reachability Across Sites


Disaster avoidance and workload mobility create a requirement to extend VLANs across
different VXLAN Fabrics. The discussion below will cover three DCI options: vPC-based,
OTV, and VXLAN.

The DCI solution should provide Layer 2 and Layer 3 extension, and ensure that a failure
in one data center will not be propagated to the other data center. To prevent this
from happening, the key technical requirement is the capability to control the broadcast,
unknown unicast and multicast flood at the data plane level while ensuring control
plane independence.

Layer 2 extension must be dual homed for redundancy while prohibiting end-to-end
Layer 2 loops, which would lead to traffic storms that overflow links and saturate switch
CPUs and virtual machine CPUs. This is why in Data Center Interconnect deployments,
one key complementary feature to Layer 2 extension is storm control.

Figure: Storm Control

VPC as a DCI Transport


Two VXLAN fabrics can be directly connected using back-to-back vPC. On each side,
one pair of border nodes uses a back-to-back vPC connection to extend Layer 2
connectivity across sites. This dual link vPC could use dark fiber or DWDM. vPC is
extremely simple to configure; however, there are certain limitations to using vPC as a
DCI, such as:
• Cannot interconnect more than two sites
• Lack of failure boundaries
• Site independence is not preserved

OTV as DCI Transport


OTV provides a proven and extremely simple way to interconnect multiple data centers.
OTV has been designed for the data center interconnect space, and is considered
the most mature and functionally rich solution to extend multi-point Layer 2 connectivity
over a generic IP network. In addition, it offers native functions that allow
strengthening the DCI connection and increasing the independence of the fabrics:
• Spanning Tree (STP) isolation
• Unknown Unicast traffic suppression
• ARP optimization
• Layer 2 broadcast policy control
• Fault isolation
• Site independence
• Failure boundary preservation

Figure: OTV as a DCI

VXLAN as DCI Transport


VXLAN allows Layer 2 extension not only inside a data center but also between
multiple sites. Today, VXLAN EVPN as a DCI offers some benefits but is not yet as mature
as OTV in terms of native functions.

New functionalities are being added to the VXLAN control plane which would make it a
very viable DCI solution in the future. This is further discussed in the introduction
chapter.

As depicted in the diagram below, logical back-to-back vPC connections are used between
the VXLAN border leaf nodes and the local pair of VXLAN DCI devices to interconnect
multiple sites.

Figure: VXLAN as a DCI

Layer 3 Reachability Across Sites


In addition to the Layer 2 solutions, the VXLAN fabric can also be connected across
Layer 3 boundaries using MPLS L3VPN, VRF-Lite and LISP.

VRF-lite Based Approach


VRF-lite uses a two-device approach to providing Layer 3 connectivity between multi-site
fabrics. In this approach, each tenant VRF in the fabric is extended using a subinterface
per tenant with Interior Gateway Protocol (IGP) or External Border Gateway
Protocol (eBGP) peering between the border node and the edge router at each site. The
External Connectivity chapter discusses in detail the procedure to connect a Layer 3
handoff to the VXLAN fabric. The same principles can be used to provide multi-site
interconnect using VRF-lite.
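
The per-tenant subinterface idea can be sketched in a few lines of Python. This is only an illustration, not a procedure from this book: the tenant names, VLAN tags and addresses are hypothetical, and the generated lines follow common NX-OS syntax that should be verified against the target platform before use.

```python
from dataclasses import dataclass
from ipaddress import IPv4Interface


@dataclass(frozen=True)
class TenantHandoff:
    """One tenant VRF extended over its own subinterface (all values hypothetical)."""
    vrf: str
    dot1q_tag: int
    address: IPv4Interface


def render_subinterfaces(parent: str, tenants: list[TenantHandoff]) -> str:
    """Return one subinterface stanza per tenant VRF on the given parent interface."""
    lines: list[str] = []
    for tenant in tenants:
        lines += [
            f"interface {parent}.{tenant.dot1q_tag}",
            f"  encapsulation dot1q {tenant.dot1q_tag}",
            f"  vrf member {tenant.vrf}",
            f"  ip address {tenant.address.with_prefixlen}",
            "  no shutdown",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    handoffs = [
        TenantHandoff("Tenant-A", 10, IPv4Interface("10.1.10.1/30")),
        TenantHandoff("Tenant-B", 20, IPv4Interface("10.1.20.1/30")),
    ]
    print(render_subinterfaces("Ethernet1/1", handoffs))
```

The routing protocol peering (IGP or eBGP) inside each VRF is deliberately left out of the sketch; it would be added per tenant in the same loop.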

MPLS-Based Approach
This approach uses a single device to interconnect multi-site VXLAN fabrics and
achieve segmentation using MPLS L3VPNs. The single device, called Border PE, can be
used to terminate MPLS and VXLAN routing on the same device. The External Connectivity
chapter provides additional details about using an MPLS handoff to the VXLAN fabric.
The same principles can be used to provide multi-site interconnectivity as well.

LISP-Based Approach
The third approach for interconnecting multi-site VXLAN fabrics is LISP. It offers the
same segmentation benefits as MPLS and can be used as an alternative solution. The
External Connectivity chapter provides additional details about using a LISP handoff to
the VXLAN fabric. The same principles can be used to provide multi-site interconnectivity
as well.
Operations & Management

Introduction

This final chapter is focused on providing guidance for operational aspects of building,
operating and maintaining a VXLAN Fabric. The latter half of the chapter covers APIs
and off-the-shelf and open source tools for automating the management of the fabric.

For the last 20 years, networks have been managed as independent elements leveraging
purpose-built protocols and interfaces, such as Simple Network Management Protocol
(SNMP), Command-Line Interface (CLI), and NetConf, to name a few. These protocols
have served network administrators well, and have mostly fulfilled their objectives for
Fault, Configuration, Accounting, Performance and Security management tasks (also
known as the FCAPS framework). However, to meet the new scale requirements, the
network has to be viewed and managed as a system to enable faster and more consistent
delivery of services.

Several years ago, the server industry, driven by scale requirements, went through the
same transition. Server teams were faced with the need to manage large pools of
resources that drove the need for more automated configuration management tools.
Today, server management teams leverage popular configuration management tools
such as Puppet, Chef or Ansible. These tools are changing organizational processes
which support agile development and DevOps initiatives.

It is questionable when this disruption will impact the network industry; however, the
configuration management tools listed above are now able to provide comparable ways
to manage network elements. Over time, IT organizations will evolve to this new way of
managing IT infrastructure; however, it is important that organizations have time to
complete this transition. IT systems require the ability to support the existing
management paradigms as well as these new models during a transition phase towards more
efficient processes.

VXLAN technology will benefit from solutions that can consistently deploy configurations
across multiple switches when creating new tenants or new networks. The management
and operations of VXLAN will depend on tools that can provide visibility and
diagnostic analysis of the underlying infrastructure.

Management tasks

VXLAN Fabrics are no different from other technologies, given that they require
foundational infrastructure that needs to be operationally managed in a simplified manner.

Multiple traditional frameworks exist to define what the operations of IT infrastructure
entail, such as IT Infrastructure Library (ITIL) or FCAPS. Some organizations have
started to incorporate IT operational practices from other areas of the industry such as
application development taken from DevOps (Development + Operations), as is covered
in the next section.

Agility is one of the main objectives that most data center leaders covet, and is part of
the overall operations process. Automating the data center enables provisioning network
resources in a reliable manner while maintaining configuration consistency to reduce
downtime. In place of the standard command line interface (CLI), network automation
can be used to simplify these commands, for consistency, standardization,
and reduction of human error. Network provisioning normally starts with script-level
automation and can progress to more advanced models of deployment.

Whatever tools or processes are used, whether automation or manual intervention, there
are some basic tasks that need to be performed. The network management lifecycle is
divided into three different phases:

• Day 0: install
• Day 1: configure/optimize
• Day N: upgrade and monitoring

Day 0
Traditionally, Day 0 activities have included installing the device into a rack, powering it
up, some basic bootstrap configuration, and optionally, updating the firmware. This is
how many organizations have dealt with Day 0 tasks until now. As a consequence, it is
not uncommon to see a network with multiple versions of software deployed and differing
standards of configuration. In order to reduce the inconsistencies in the network
when the equipment is deployed, automation of the initial deployment is a crucial first
step. This provides a solid foundation for successful network operation.

Day 1
After the base configuration and common software releases have been deployed across
the Fabric, the next step is to provision the overlay and device-specific configurations.
These configuration steps include items such as MP-BGP, Multicast, VNIs, VRFs, VLANs,
Anycast Gateway and core capabilities.

The VXLAN Fabric requires more configuration than was previously needed in traditional
designs. The burden of the additional VXLAN Fabric configuration requirements
can be eased by automation.

For this phase a few options exist to help automate configuration deployment. There
are tools such as Cisco Prime Data Center Network Manager (DCNM), Cisco Nexus Fabric
Manager (NFM), Python scripts or scripting languages that can configure the devices
directly or via API. Another option is configuration management tools (CMT) such as
Puppet, Chef, and Ansible that deliver configuration standardization. Instead of just
pushing configuration commands to the switches, a CMT checks the running configuration
and applies only the changes needed to reach the desired state. This allows the creation
of manifests, recipes, or playbooks with the desired end state of the specific elements in
the network. For example, the spine switches would have a very different configuration
than the leaf switches, but the leaf switch configurations would likely be very similar to
one another across the fabric.
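
The difference between pushing commands and converging on a desired end state can be illustrated with a minimal Python sketch. It is not how Puppet, Chef or Ansible are implemented; the VLAN-to-VNI dictionaries and the function names are hypothetical and stand in for whatever state a real tool would read from the switch.

```python
def plan_changes(running: dict[str, str], desired: dict[str, str]) -> dict[str, dict[str, str]]:
    """Compare running state with desired state and return only the needed changes."""
    return {
        "add": {k: v for k, v in desired.items() if k not in running},
        "update": {k: v for k, v in desired.items() if k in running and running[k] != v},
        "remove": {k: v for k, v in running.items() if k not in desired},
    }


def apply_changes(running: dict[str, str], plan: dict[str, dict[str, str]]) -> dict[str, str]:
    """Return the state that results from applying a plan (no device is contacted)."""
    result = dict(running)
    for key in plan["remove"]:
        result.pop(key)
    result.update(plan["add"])
    result.update(plan["update"])
    return result


if __name__ == "__main__":
    # Hypothetical VLAN -> VNI mappings on one leaf switch.
    running = {"vlan 10": "vn-segment 30010", "vlan 30": "vn-segment 30030"}
    desired = {"vlan 10": "vn-segment 30010", "vlan 20": "vn-segment 30020"}
    plan = plan_changes(running, desired)
    print(plan)
    # Running the tool again after convergence produces an empty plan.
    print(plan_changes(apply_changes(running, plan), desired))
```

Because the plan is derived from the difference between the two states, running it a second time changes nothing, which is the property that makes the same manifest safe to apply to every leaf switch.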

As a result of virtualization and cloud provisioning, another item to consider is VMM
integration. Whether or not the configuration of a switch should be dynamically modified
based on a trigger event is discussed at length in the Software Overlay chapter.

Day N
Once the network is configured, running, and optimized, changes and software upgrades
to the Fabric will be needed. CMT solutions can automate software upgrades
and configuration changes to multiple devices.

Another important Day-N task is configuration backups, revision control, and the ability
to roll back to a previous snapshot. This traditional configuration management can be
done with tools such as DCNM, NFM, the aforementioned CMT solutions, or with open
source tools such as RANCID.
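
As a small illustration of revision control for configuration snapshots, the sketch below compares two saved configurations with Python's standard difflib module. It does not reflect how DCNM, NFM or RANCID work internally, and the snapshot contents are made up for the example.

```python
import difflib


def config_diff(previous: str, current: str) -> list[str]:
    """Return a unified diff between two configuration snapshots (empty if identical)."""
    return list(
        difflib.unified_diff(
            previous.splitlines(),
            current.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical snapshots taken before and after a change.
    before = "vlan 10\n  vn-segment 30010"
    after = "vlan 10\n  vn-segment 30010\nvlan 20\n  vn-segment 30020"
    for line in config_diff(before, after):
        print(line)
```

Lines prefixed with "+" were added since the previous snapshot and lines prefixed with "-" were removed, which is the information needed to decide whether to roll back.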

Monitoring the network and reacting to events is a critical part of Day N operations.
Traditional network management tools used SNMP to monitor device parameters such
as interface utilization or available memory. With NX-OS programmability functions,
using new tools, such as Carbon/Graphite, Zenoss, or Splunk enables access to richer
information. Linux-based monitoring agents can be installed natively on the switch.
Examples such as OpenTSDB (http://opentsdb.net/) provide a collector agent which
sends information to a central repository for consolidation.

Visibility is another important Day-N function. Traditional visibility tools are still
available with a VXLAN-based solution including network TAPs, Switched Port Analyzer
(SPAN), Netflow and/or sFlow, where applicable. Nexus Data Broker (NDB) clients can be
leveraged to consolidate SPAN from leaf switches into a common switch aggregation point
to build scalable network TAPs and SPAN aggregation infrastructures.

VXLAN OAM (Operations, Administration and Management)


With the addition of the overlay, a level of indirection has been introduced, resulting in
a certain degree of abstraction from the underlying network. This abstraction became a
simple and efficient way of providing services without considering the intermediate
devices. When service degradation occurs, the effective physical component or path used
can become hidden. Identifying the path the application traffic takes from endpoint to
endpoint requires tremendous effort. The problem is exacerbated because VXLAN
changes the UDP source-port to achieve entropy. In addition, with ECMP the number of
paths increases significantly, magnifying the problem scope.

VXLAN OAM (Operations, Administration, and Management) and CFM (Connectivity
Fault Management) provide a simple and comprehensive solution for the problems
described earlier. Rather than trying to understand load-balancing hashing algorithms to
figure out which path has been used, a single probe could provide the respective
feedback, and acknowledge reachability of the expected destination. In the case where one
path is affected by performance degradation, the probe could determine this by
returning potential packet loss statistics. In the presence of an application or payload
profile, the probe will mimic application behavior, and the result will plot the effective
physical path used by the workloads in question.

An additional tool within VXLAN OAM is the “tissa”-based tracepath, following the
“draft-tissa-nvo3-oam-fm” IETF draft. This tool not only gets the exact path plotting
from an underlay perspective, but it also derives the specific VTEP where the destination
is actually attached. Furthermore, with additional input parameters it is possible to
identify the egress VTEP, the underlay path from ingress to egress VTEP including all
intermediate hops, as well as all involved interfaces. In addition, the load and error
counters for those interfaces can be provided as well.

The sample output below shows a "tissa"-based overlay pathtrace. The functionality
exposes the physical path (underlay) from leaf via spine to border, while the request was
initiated in the VXLAN overlay.

Path trace Request to peer ip 10.254.254.200 source ip 10.254.254.102
Sender handle: 38

Hop Code ReplyIP IngressI/f EgressI/f State
====================================================
1 !Reply from 10.254.254.101, Eth2/1 Eth4/5 UP / UP
Input Stats:
discards:0
errors:0
unknown:0
bandwidth:42949672970000000
Output Stats:
discards:0
errors:0
bandwidth:42949672970000000

2 !Reply from 10.254.254.200, Eth6/1 - UP / -
Input Stats:
discards:0
errors:0
unknown:0
bandwidth:42949672970000000
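
When such output is collected by a script, it usually has to be turned into structured data first. The following Python sketch parses text shaped like the sample above; the regular expressions are assumptions based only on that sample and may need adjusting for other software releases, which is exactly the version dependency that structured APIs avoid.

```python
import re
from dataclasses import dataclass, field

# Patterns derived from the sample output only (an assumption, not a specification).
HOP_RE = re.compile(r"^(\d+)\s+!Reply from\s+(\S+),\s+(\S+)\s+(\S+)\s+(\S+)\s*/\s*(\S+)$")
STAT_RE = re.compile(r"^(\w+):(\d+)$")

SAMPLE = """\
Path trace Request to peer ip 10.254.254.200 source ip 10.254.254.102
Sender handle: 38
Hop Code ReplyIP IngressI/f EgressI/f State
1 !Reply from 10.254.254.101, Eth2/1 Eth4/5 UP / UP
Input Stats:
discards:0
errors:0
Output Stats:
discards:0
errors:0
2 !Reply from 10.254.254.200, Eth6/1 - UP / -
Input Stats:
discards:0
errors:0
"""


@dataclass
class Hop:
    index: int
    reply_ip: str
    ingress: str
    egress: str
    state: tuple[str, str]
    stats: dict[str, dict[str, int]] = field(default_factory=dict)


def parse_pathtrace(text: str) -> list[Hop]:
    """Parse hop lines and the counters that follow each hop."""
    hops: list[Hop] = []
    section = None  # "input" or "output" once a stats header has been seen
    for raw in text.splitlines():
        line = raw.strip()
        hop = HOP_RE.match(line)
        if hop:
            hops.append(Hop(int(hop[1]), hop[2], hop[3], hop[4], (hop[5], hop[6])))
            section = None
        elif line in ("Input Stats:", "Output Stats:"):
            section = line.split()[0].lower()
        else:
            stat = STAT_RE.match(line)
            if stat and hops and section:
                hops[-1].stats.setdefault(section, {})[stat[1]] = int(stat[2])
    return hops


if __name__ == "__main__":
    for hop in parse_pathtrace(SAMPLE):
        print(hop.index, hop.reply_ip, hop.ingress, "->", hop.egress, hop.stats)
```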

VXLAN OAM is implemented within NX-OS platforms and can be executed using CLI or
with an API-driven approach using NX-API. It is possible to execute the various probes
across a programmatic interface and also retrieve the statistical information from the
VTEP in the same way. As an acknowledgment of the probe execution, a statistic identifier
is sent. The statistic identifier provides current and historic statistics from the
VTEP local OAM database in a programmatic or CLI driven way. Further enhancements
in VXLAN OAM include the introduction of periodic probes and respective notification
to more proactively manage the overlay network with its physical underlay. In order to
make the collected path information and statistics meaningful, VXLAN OAM is going to
integrate with VXLAN related management systems like DCNM or VTS.

Available Tools

There are multiple approaches to address the challenges described in the previous
section. These approaches can be classified into the following categories:

• Traditional (CLI, scripting)
• Off-the-shelf tools
• DevOps (Puppet, Chef, Ansible)

Traditional Tools

Command Line Interface

For VXLAN there are a number of new commands to help with configuring, monitoring
and troubleshooting the fabric. However, it needs to be noted that network operators
who rely on CLI to manage their fabric will be confronted with two issues:

1 VXLAN configuration is command-intensive. The creation of new tenants or segments
requires multiple lines of configuration, potentially across a large number of
devices.

2 VXLAN technology depends on the presence of a considerable number of underlying
protocols, making it more burdensome to deploy when compared to other
technologies like Spanning Tree or FabricPath.

Python Scripting
Python scripting has been used by network operators for years; however, with NX-OS
running on the switches, scripting can be taken to a whole new dimension. APIs and
Software Development Kits (SDKs) are available for NX-OS. An example of an SDK for
NX-OS is the nxtoolkit which is freely available for download:
https://github.com/datacenter/nxtoolkit.

An example Python script for VXLAN is located at https://github.com/erjosito/evpn_shell.
This script is essentially an external CLI that can be used to create, delete, and view
tenants, VNIs, and relevant configuration elements across all VTEPs in a VXLAN EVPN
Fabric. The script makes use of infrastructure variables such as management IP addresses
and credentials, and with a single command deploys all the required VXLAN EVPN
configuration to create a tenant or a network inside of a tenant.

For more scripting examples, please check GitHub (https://github.com/datacenter) or
the Cisco Developer Community for NXOS (https://opennxos.cisco.com).

Scripting with Other Programming Languages


Multiple scripting languages can make use of NX-OS APIs using HTTP, assuming they
can parse JSON or XML strings, even if no SDK is available for that specific language.

There are two APIs available in NXOS:

• NX-API: HTTP-based API over which CLI commands are sent to the device. The
outputs can be sent back in JSON or XML format.
• REST API: the RESTful API leverages an object model. Being completely object-based,
it makes development of SDKs possible. Python SDKs are available in Github (see
https://github.com/datacenter). Commands and configuration are sent using XML
or JSON format, and command outputs are returned similarly.

Languages such as XML and JSON are used to structure commands and outputs and
they eliminate the need to parse human-readable strings formatted in paragraphs and
tables. String parsing is commonly used in scripting but has version dependencies. That
puts a burden on lifecycle management for these automation scripts that has kept
many organizations from using them. The APIs available in NXOS are an improvement
over traditional scripting methods, and will improve the automation processes.

Off-the-Shelf Tools

Cisco Data Center Network Manager (DCNM)

Cisco DCNM is a general purpose Network Management Software (NMS) / Operational
Support Software (OSS) product targeted at NX-OS networking equipment. It supports
classical Spanning Tree deployments with or without Virtual Port Channels, FabricPath
and VXLAN.

In the context of a VXLAN-based solution, DCNM can be utilized for the following
purposes:

1 Providing the Fabric underlay configuration. DCNM has built-in Power
On Auto Provisioning (POAP) support to deliver zero-touch auto-provisioning of
the network devices that build the VXLAN Fabric.

2 Provisioning the VXLAN overlay configuration once the Fabric is up and running.

3 Monitoring the performance and utilization of the network
switches, as well as fault management and syslog aggregation.

4 Managing the software running on the switches and performing software upgrades
and downgrades.

This provisioning can be performed in a top-down (push) fashion, where DCNM tracks
deployment events and simply pushes the required CLI config for the access port onto
the switch.

Alternatively, a more dynamic mechanism is possible, where the leaf switches “pull” the
configuration from the LDAP database of DCNM based on a specific event, such as a
local attachment of an endpoint. A typical example of this more dynamic mechanism is
the support on the VXLAN leaf nodes of a functionality called Virtual Machine Tracker
Auto-Config (VM Tracker), which automatically provisions a specific tenant configuration.
The commands required for provisioning the tenant are stored in the form of a
configuration profile. A configuration profile is the set of commands required
for provisioning a particular tenant, with the required parameters written as
variables instead of actual values.
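
The concept of a configuration profile, a command set with variables in place of values, can be sketched with Python's standard string.Template. This is only an analogy: the profile text, variable names and values below are hypothetical and do not reproduce the DCNM profile syntax.

```python
from string import Template

# Hypothetical profile: commands with $variables instead of actual values.
NETWORK_PROFILE = Template(
    "vlan $vlan_id\n"
    "  vn-segment $vni\n"
    "interface Vlan$vlan_id\n"
    "  vrf member $vrf\n"
    "  ip address $gateway\n"
    "  no shutdown"
)


def render_profile(profile: Template, params: dict[str, str]) -> str:
    """Fill in every variable; a missing parameter raises KeyError."""
    return profile.substitute(params)


if __name__ == "__main__":
    print(render_profile(NETWORK_PROFILE, {
        "vlan_id": "10",
        "vni": "30010",
        "vrf": "Tenant-A",
        "gateway": "10.10.10.1/24",
    }))
```

Using substitute rather than safe_substitute is a deliberate choice in this sketch: an incomplete parameter set fails loudly instead of producing a partial configuration.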

Specific to VXLAN management, DCNM provides the following capabilities:

• DCNM provides integrated Power-On Auto Provisioning (POAP) to boot new
switches for a greenfield Fabric or add new switches to an existing VXLAN Fabric.
DCNM manages this POAP workflow so that an admin simply assigns a device to a
preconfigured template.
• In addition, the POAP configuration Diff/Sync feature lets the admin know if a
device’s configuration does not match its POAP template and then lets the user
resolve these differences.
• DCNM also presents topology views showing physical and overlay networks on the
same page, helping network admins quickly identify the extent of virtual overlay
networks on a Fabric.
• DCNM also presents smart topology views showing virtual port channels (vPCs) and
virtual device contexts. In topology view, DCNM shows VXLAN Tunnel endpoint
status as well as VXLAN search. DCNM shows VXLAN network identifier (VNI)
status and other VXLAN information on a per-switch basis.
• Built-in search allows admins to search by VM Name, VM IP Address, VM MAC
Address, VNI, or Switch ID.

More information on Cisco Data Center Network Manager can be found at:
http://www.cisco.com/go/dcnm.

Ignite
Day-0 tasks are extremely important in order to have a consistent Fabric. Ignite is a
simple hands-off approach to bootstrap a device with the appropriate code level and
initial device setup. To achieve that, Ignite leverages the POAP capabilities of Cisco
Nexus switches.

Ignite is an open-source tool that can be downloaded at no cost from Github:
https://github.com/datacenter/ignite.

In order to have a POAP environment that allows for the automation of deployment of
firmware and initial configuration, there are some external components that Ignite
requires:

• A DHCP server to bootstrap the interface and DNS information of switches that are
booting up.
• A TFTP server that contains the configuration script used to automate the software
image installation and configuration process.
• An Ubuntu server where Ignite will be installed, that contains the desired software
images and rules to dynamically build configuration files.

Cisco Nexus Fabric Manager


The Cisco Nexus Fabric Manager (NFM) is a management system designed to greatly
simplify and optimize the full lifecycle management of a switch fabric built with NX-OS
based platforms (at the time of writing of this book, NFM support is limited to the
Nexus 9000 family).

Cisco NFM has a fabric-wide focus and allows for the auto-provisioning and management
of the whole network. NFM provides point-and-click methods for performing fabric
management tasks such as adding, removing, and configuring network components
such as switchpools, switches, switch interfaces, VRFs, port channels and broadcast
domains.

Cisco NFM builds a VXLAN EVPN Fabric, but abstracts the complexity. It is still possible
to log into the switches and view the configuration that has been deployed by Cisco
NFM, troubleshoot with the CLI, or use any other standard monitoring solution to verify
the state of the network.

Cisco NFM covers various phases of the Fabric management lifecycle:

• Creation: NFM allows for a zero-touch boot up of the Fabric, performing some
Day-0 operations like cabling topology verification and automatic VXLAN underlay
provisioning
• Connection: NFM fully manages the entire VXLAN configuration, removing the
associated operational hurdles. This essentially implies that a user does not necessarily
need to know that VXLAN with MP-BGP EVPN is deployed as the key functionality
to enable endpoint communication
• Expansion: there are more day-N type of operations, such as zero-touch addition
of switches to the Fabric and auto-upgrade of existing fabric devices
• Fault Management: NFM offers a built-in fault management system
• Reporting: Cisco NFM communicates to the switches deployed in the fabric by
leveraging software agents embedded into the switches

More information regarding Cisco Nexus Fabric Manager is available at:
http://www.cisco.com/go/nexusfabricmanager.

Cisco Virtual Topology System (VTS)


Service providers have very specific requirements regarding data center network
management and operations:

1 Support for a mix of software and hardware VTEPs

2 Integration with the hypervisor layer

3 Support of a multivendor Fabric

4 Overlay and underlay operated by different teams

VTS is an add-on to a VXLAN Fabric consisting of the following elements:

• Virtual Topology Controller: this is a management platform that offers ways to
deploy tenants and networks over a GUI or a northbound RESTful API. It integrates
with VMware vCenter and with Openstack/KVM, so customers can manage the
overlay directly from the VMM. The Virtual Topology Controller will roll out the
required changes using southbound APIs such as NX-API or NetConf/YANG.
• IOS XRv: this is a virtual router instance that can take over required control plane
functionality in case of a deployment consisting exclusively of software VTEPs. This
component is responsible for distributing routes to hardware VTEPs over EVPN
BGP, and to software VTEPs using the RESTCONF API.
• Virtual Topology Forwarder (VTF): this is a software VTEP that can be installed in a
VMware vSphere host or an Openstack compute node. It is controlled by the Virtual
Topology Controller, offering L2 and L3 connectivity between VMs running in the
local or remote servers. VTF is a virtual machine running in user space, so it does
not need any modification to the vSphere code. VTF exploits performance
optimization technologies such as the open-source-licensed DPDK (http://dpdk.org/)
and Cisco Vector Packet Processing (VPP).

Cisco VTS supports flood and learn as well as MP-BGP EVPN control planes. It includes
functionality such as ARP suppression capabilities, symmetric IRB, VTEP authentication
and fast convergence upon network failures and endpoint mobility.

One important concept to understand is that Cisco Virtual Topology System does not
manage the underlay. It is assumed that the required underlay configuration is already
in place.

More information regarding Cisco Virtual Topology System is available at:
https://www.cisco.com/go/vts.

DevOps Tools
Con​ fig​
u​
ra​
tion Man​
age​ment Tools (CMT) are a new gen​ er​
a​
tion of in​
tent-based tools
that have gained great pop​ u​
lar​
ity, mainly in the Linux com​ mu​nity. They can be clas​
si​
-
fied into two cat​
e​
gories: Agent-based and agent​ less tools.
• In agent-based configuration management, changes are made centrally on a master node. The device agents periodically connect to the master for configuration information, then pull down and execute the changes. Only the changes that are needed are pulled.
• Agentless configuration management is push-based instead of pull-based. Configuration management scripts are run on the master, and the master connects to the managed devices and executes the task over an API.
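
The difference between the two models can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the class names (`Master`, `Agent`, `Device`), the `diff` helper, and the configuration keys are all hypothetical, and no real tool or network API is involved. It shows that in the pull model the agent initiates the exchange, in the push model the master does, and in both cases only the needed changes are applied.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """A managed device holding its running configuration as key/value pairs."""
    name: str
    running: dict = field(default_factory=dict)

    def apply(self, changes: dict) -> None:
        self.running.update(changes)


def diff(desired: dict, running: dict) -> dict:
    """Return only the settings whose desired value differs from the running one."""
    return {k: v for k, v in desired.items() if running.get(k) != v}


class Master:
    """Central node that stores the desired configuration for each device."""

    def __init__(self, desired: dict):
        self.desired = desired  # device name -> desired settings

    def serve(self, device_name: str) -> dict:
        """Pull model: answer an agent that asks for its configuration."""
        return self.desired.get(device_name, {})

    def push(self, devices: list) -> dict:
        """Push model: the master connects to each device and applies changes."""
        applied = {}
        for dev in devices:
            changes = diff(self.desired.get(dev.name, {}), dev.running)
            dev.apply(changes)
            applied[dev.name] = changes
        return applied


class Agent:
    """Pull model: runs on the device and checks in with the master."""

    def __init__(self, device: Device, master: Master):
        self.device = device
        self.master = master

    def check_in(self) -> dict:
        changes = diff(self.master.serve(self.device.name), self.device.running)
        self.device.apply(changes)
        return changes


# Hypothetical desired state for two leaf switches.
master = Master({
    "leaf1": {"nv_overlay": "enabled", "mtu": 9216},
    "leaf2": {"nv_overlay": "enabled", "mtu": 9216},
})
leaf1 = Device("leaf1", {"mtu": 1500})
leaf2 = Device("leaf2", {"mtu": 9216})

first_pull = Agent(leaf1, master).check_in()   # agent pulls two changes
second_pull = Agent(leaf1, master).check_in()  # already converged, nothing pulled
pushed = master.push([leaf2])                  # master pushes the one missing setting
```

Running the check-in a second time returns an empty set of changes, which mirrors the statement above that only the changes that are needed are pulled.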

Puppet and Chef are examples of agent-based configuration management tools. With these agent-based systems, the user leverages a custom declarative language to describe the configuration that needs to be applied to the remote systems. Both of these tools have similar functionality, which is continually evolving. Puppet recently released modules to configure, provision, and manage Cisco VXLAN-based fabrics plus several standard top-of-rack switch features.

Puppet uses modules that include descriptions of which features are supported, and manifests that are the actual descriptions of how the devices should be configured. Manifests can be static, can dynamically incorporate conditions, or can even use Ruby logic. Some conditions will depend on which system is being managed, and a wealth of that information is gathered by Puppet's companion tool "facter". The Puppet agent will pull the manifest from the Puppet server (Puppet Master) and implement it.
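
The idea of a static manifest versus one that incorporates conditions based on gathered facts can be illustrated with a short Python sketch. This is not Puppet syntax and does not use any Puppet API: the function names, the fact keys (`os`, `model`) and the configuration keys are assumptions made up for the example, and the facts are passed in as plain dictionaries rather than collected from a real system.

```python
def static_manifest(facts: dict) -> dict:
    """Desired state that is the same for every managed system."""
    return {"ntp_server": "192.0.2.1"}


def conditional_manifest(facts: dict) -> dict:
    """Desired state that depends on facts about the managed system."""
    desired = {"ntp_server": "192.0.2.1"}
    # Hypothetical condition: only network switches get overlay settings.
    if facts.get("os") == "nxos":
        desired["nv_overlay"] = "enabled"
        desired["mtu"] = 9216
    return desired


# Facts as a fact-gathering tool might report them (hypothetical values).
leaf_facts = {"hostname": "leaf1", "os": "nxos", "model": "N9K"}
server_facts = {"hostname": "srv1", "os": "linux", "model": "generic"}

leaf_state = conditional_manifest(leaf_facts)
server_state = conditional_manifest(server_facts)
```

Here the same conditional description yields a different desired state for the switch and for the server, while the static one would be identical for both.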

Some example manifests are available on GitHub under https://github.com/cisco/cisco-network-puppet-module.

The Chef architecture is very similar, but instead of manifests it uses "recipes", which document the expected state of the managed devices. Recipes can be grouped together in Cookbooks for easier management. As already described, Chef runs in a client/server architecture, but it has an additional standalone mode called "Chef solo".

As with Puppet, some examples of Chef recipes for Cisco NX-OS are available on GitHub under https://github.com/cisco/cisco-network-chef-cookbook.

Ansible is an example of an agentless configuration management system that manages nodes via SSH. It can execute scripts either locally on the managed node or on the local server, which then connects to the device via the Cisco NX-API. Ansible uses the concept of Modules, Tasks, Plays, and Playbooks to manage the configuration on the remote devices.
• Modules: units of work that Ansible ships out to remote machines. Some modules are pre-installed; custom modules can be manually installed as well
• Tasks: combination of modules with arguments and descriptive names
• Plays: mapping of hosts or groups to their tasks
• Playbooks: collection of Plays by which Ansible orchestrates, configures, administers, or deploys systems. Playbooks are written in YAML
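
The relationship between these four concepts can be modeled with a few Python data classes. This is a conceptual sketch only, not Ansible code and not playbook YAML: the class names, the `expand` helper, the inventory layout, and the module names `example_vlan` and `example_vni_map` are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """A module call with its arguments and a descriptive name."""
    name: str
    module: str  # name of the unit of work (placeholder, not a real module)
    args: dict = field(default_factory=dict)


@dataclass
class Play:
    """Maps a host or group to an ordered list of tasks."""
    hosts: str
    tasks: list


@dataclass
class Playbook:
    """An ordered collection of plays."""
    plays: list


def expand(playbook: Playbook, inventory: dict) -> list:
    """Walk the playbook task by task and list (host, task name, module)."""
    plan = []
    for play in playbook.plays:
        # A name not found in the inventory is treated as a single host.
        hosts = inventory.get(play.hosts, [play.hosts])
        for task in play.tasks:
            for host in hosts:
                plan.append((host, task.name, task.module))
    return plan


inventory = {"leafs": ["leaf1", "leaf2"]}
playbook = Playbook(plays=[
    Play(hosts="leafs", tasks=[
        Task("Ensure VLAN 100 exists", "example_vlan", {"vlan_id": 100}),
        Task("Map VLAN 100 to VNI 10100", "example_vni_map",
             {"vlan_id": 100, "vni": 10100}),
    ]),
])

plan = expand(playbook, inventory)
for host, task_name, module in plan:
    print(f"{host}: {task_name} ({module})")
```

In a real playbook the same nesting would be written in YAML: a list of plays, each naming its hosts and carrying a list of named tasks that call a module with arguments.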

Summary Table
The following table illustrates how the tools discussed above contribute to the Day 0, 1 or N operations of network fabrics:

Day0 Day1 DayN

CLI X X

Python X X

Cisco Data Center Network Manager X X X

Cisco Nexus Fabric Manager X X X

Ignite X

Cisco Virtual Topology System X X

Ansible X X

Puppet X X

Chef X X
Acronyms

ACI: Application Centric Infrastructure

ADC: Application Delivery Controllers

API: Application Programming Interface

ARP: Address Resolution Protocol

BD: Bridge Domain

BGP: Border Gateway Protocol

CLI: Command-Line Interface

DAG: Distributed Anycast Gateway

DCNM: Data Center Network Manager

ECMP: Equal Cost Multi-Path

ETR: Egress Tunnel Router

EVPN: Ethernet Virtual Private Network

FCAPS: Fault, Configuration, Accounting, Performance and Security

GENEVE: Generic Network Virtualization Encapsulation

GPE: Generic Protocol Extension

IDS: Intrusion Detection System

IEEE: Institute of Electrical and Electronics Engineers

IGP: Interior Gateway Protocol

IPS: Intrusion Prevention System

IRB: Integrated Routing and Bridging

ITR: Ingress Tunnel Router

LISP: Locator/ID Separation Protocol

LSA: Link State Advertisement

MP-BGP: Multi-Protocol BGP

MPLS: Multi-Protocol Label Switching

MSDP: Multicast Source Discovery Protocol

MTU: Maximum Transmission Unit

NAT: Network Address Translation

NDB: Nexus Data Broker

NFM: Nexus Fabric Manager

NLRI: Network Layer Reachability Information

NSH: Network Service Header

NVO: Network Virtualization Overlay

OAM: Operations, Administration and Management

OTV: Overlay Transport Virtualization

PBB: Provider Backbone Bridges

PIM: Protocol-Independent Multicast

POD: Point of Delivery

PVP: Path Vector Protocol

RD: Route Distinguisher

RP: Rendezvous Point

RR: Route Reflector

RT: Route Target

SDK: Software Development Kit

SDN: Software Defined Networking

SNMP: Simple Network Management Protocol

VMM: Virtual Machine Manager

VNI: Virtual Network Instance

VNID: VXLAN Network Identifier

vPC: Virtual Port-Channel

VRF: Virtual Routing and Forwarding

VTC: Virtual Topology Controller

VTEP: Virtual Tunnel Endpoint

VTF: Virtual Topology Forwarder

VTS: Virtual Topology System