EXECUTIVE SUMMARY
AOS delivers powerful, intent-driven automation of network services in vendor-agnostic environments
by delivering it as an easily consumable service in response to consumer-specified intent. Network
devices serve packets, AOS services serve application workloads.
According to numerous studies, 70-80% of outages are due to configuration change applied to a
living system, and not due to initial deployment. Initial, one-time deployment can be seen as sort of a
“hello world” application; the real complexities arise as the system evolves. With AOS, configuration,
telemetry, and expectations are derived from the single source of truth - the intent - in an idempotent
fashion, and as such there is no implementation difference between initial deployment and change
management.
Attempting to solve this problem is like embarking on a journey to solve a puzzle. Along the way you
realize that it is not linear, you continue discovering additional pieces, and never really know if
you’ve found all of the pieces. There may be encouraging signs early in the journey, even a neat
“hello world” application, but after spending years in this problem domain, we believe the correct
sentiment is: “If you’re not scared (of the journey), you don’t understand (the complexity).” Our
solution is to have a minimal (no piece can be removed without losing the complete story) yet
complete (there are no missing puzzle pieces) set of puzzle components. In this text, we will
highlight only some of the puzzle pieces; we also leave some of the pieces (from the complete set
of 17) out for compactness of presentation.
INTRODUCTION
The goal of this paper is to present the overall structure and operation of AOS in order to understand
what it can do for you out of the box, and how you can extend it. Let’s start with some definitions.
AOS is a distributed network operating system that delivers a set of system resources as a service,
following a reference design and subject to constraints based on user-specified intent - collectively,
the blueprint (Figure 1). AOS leverages a powerful distributed state management infrastructure to
achieve these goals.
The above statement is a minimal, yet complete specification of AOS architecture, and as such is
relatively dry. Let’s add some examples in the context of the data center to help internalize the above
statement.
INTENT
Intent is a declarative specification of the desired outcome (service), conveying the need for
cooperative behavior of the system infrastructure, without specifying imperative commands that
prescribe how to achieve it (the desired outcome). An example of intent is:
Provide connectivity to 1000 servers, using L2 and/or L3 access at the edge, with
oversubscription in the core of 1:1 (no oversubscription), with endpoints such as hosts,
VMs or containers grouped into isolation domains (including both traffic and address
space isolation). Have some endpoints reachable via the rest of the world and some
not, with policies associated with isolation domains governing both security and load
balancing, with connectivity to the rest of the world via at least n links to support the
external traffic and protect from possible failures.
It is worth noting that in AOS, a reference design is the key enabler for composing the features
into services. That composability is expressed during the service rendering process described further
down. There is no need for higher level domain specific languages that are typically used to manage
composability of low-level features.
Once the reference design is chosen, AOS generates a specification of a reference system (which we
denote as topology) to support it. The reference system contains instantiations for all roles defined
in the reference design to support specific intent. For example, it may contain four devices playing
the spine role, 24 devices playing the leaf role, and 1,000 devices playing the server role. In addition to
system roles, AOS also models relationships (links) between systems. These relationships may
model physical connectivity (cable) or logical connectivity. In addition, these relationships are
assigned a role. For example, the reference system will contain a corresponding number of links
playing spine_leaf, leaf_server, and leaf_router roles.
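To make the idea of a reference system concrete, here is a minimal Python sketch. It assumes a hypothetical two-tier design in which every leaf connects to every spine and each server is attached to a single leaf; the names `System`, `Link`, and `build_topology` are illustrative and are not AOS APIs. The counts follow the example in the text, and `leaf_router` links are left out for brevity.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    role: str  # "spine", "leaf", or "server"


@dataclass(frozen=True)
class Link:
    a: str
    b: str
    role: str  # "spine_leaf" or "leaf_server"


def build_topology(spines, leaves, servers):
    """Instantiate systems and role-tagged links for a hypothetical design."""
    systems = [System(f"spine{i}", "spine") for i in range(spines)]
    systems += [System(f"leaf{i}", "leaf") for i in range(leaves)]
    systems += [System(f"server{i}", "server") for i in range(servers)]
    # Assumption: full mesh between the spine and leaf tiers.
    links = [Link(f"spine{s}", f"leaf{lf}", "spine_leaf")
             for s in range(spines) for lf in range(leaves)]
    # Assumption: servers are single-homed and assigned round-robin.
    links += [Link(f"leaf{i % leaves}", f"server{i}", "leaf_server")
              for i in range(servers)]
    return systems, links


systems, links = build_topology(spines=4, leaves=24, servers=1000)
system_roles = Counter(s.role for s in systems)
link_roles = Counter(link.role for link in links)
```

Because every system and link carries a role, later steps (configuration, expectations) can be derived per role rather than per device.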
SERVICE
The Service concept is central to AOS - it’s the thread that brings it all together.
One approach is to say: we will configure devices according to a reference design and as a result
you will get the requested connectivity, reachability to the external world, etc. In AOS however, this
is just one service component, called “configuration deployment,” and it only guarantees that each
participating device has accepted the configuration.
In addition to configuration deployment, the AOS approach is to follow a standard, best software
development practice and define a set of tests or expectations that must be met in order for AOS
to declare that the service has actually been delivered. Similar to “test driven development” or
“behavior driven development” (TDD/BDD), we call this expectations driven service delivery - EDSD
(puzzle piece #2).
The Service object represents the availability of some functionality that can be consumed in a
predictable manner. Like a system, it can be composed of component services (as denoted by looped
arrow in the diagram above). Service and system objects are related via a relationship (service is
hosted on system, system hosts a service).
Reference design RD1 dictates that the overall service is composed of the following sub-services
(each of them scoped across the whole system):
1. Verify that cabling is as expected (these are the expectations associated with links in
the reference design)
2. Verify that interfaces are in correct operational state (for example, used interfaces
should be up, the rest should be down)
3. Verify that specific BGP sessions are established (these are the expectations associated
with BGP peers in the reference design)
4. Verify that configuration on the device corresponds to the expected one
5. Verify that routing table entries are as expected
6. Verify that endpoints of interest can ping each other
Figure 2.
To illustrate service composition, consider that, at the blueprint level, service is shown as a
composition of feature-specific service components (routing, cabling, etc.) which in turn aggregate
service components across the infrastructure (routing, cabling across different systems).
Figure 3.
Within AOS, a device telemetry agent collects statuses leveraging native protocols supported by
the device and compares them against expectations, generating an anomaly if there is a mismatch.
Another agent aggregates these statuses at the device level by reacting to changes generated
by the device telemetry agent. Yet another agent aggregates statuses across the devices, by reacting
to changes in the device level statuses. Instantiations of this basic publish/subscribe pattern
essentially comprise the AOS application.
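The chain of reacting agents described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `StateStore` is a hypothetical in-memory stand-in for the shared state repository, and the entity kinds (`check`, `device`, `system`) and status strings are invented for the example.

```python
from collections import defaultdict


class StateStore:
    """Hypothetical in-memory stand-in for a shared state repository."""

    def __init__(self):
        self.entities = {}                    # (kind, key) -> value
        self.subscribers = defaultdict(list)  # kind -> callbacks

    def subscribe(self, kind, callback):
        self.subscribers[kind].append(callback)

    def publish(self, kind, key, value):
        if self.entities.get((kind, key)) == value:
            return  # unchanged state triggers no reaction
        self.entities[(kind, key)] = value
        for callback in self.subscribers[kind]:
            callback(key, value)

    def values(self, kind, prefix=()):
        return [v for (k, key), v in self.entities.items()
                if k == kind and key[:len(prefix)] == prefix]


store = StateStore()


def device_aggregator(key, _status):
    # Reacts to per-check changes and publishes a device level status.
    device = key[0]
    ok = all(s == "ok" for s in store.values("check", prefix=(device,)))
    store.publish("device", (device,), "ok" if ok else "anomaly")


def system_aggregator(_key, _status):
    # Reacts to device level changes and publishes a system wide status.
    ok = all(s == "ok" for s in store.values("device"))
    store.publish("system", ("blueprint",), "ok" if ok else "anomaly")


store.subscribe("check", device_aggregator)
store.subscribe("device", system_aggregator)

# A telemetry agent publishes check results keyed by (device, check name).
store.publish("check", ("leaf1", "bgp"), "ok")
store.publish("check", ("leaf2", "bgp"), "ok")
store.publish("check", ("leaf2", "cabling"), "anomaly")
```

Each aggregator only reads the entities it subscribes to and writes one output entity, so more layers can be added without changing the existing ones.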
Figure 4.
With device agents running on the devices, collected data can be processed as close to the source as
possible, and the agents can decide to propagate only knowledge-enriched information across the
network, thus limiting the amount of telemetry traffic when that is of importance.
Note that the service expectations in the numbered list above can, and typically do, cover both
operator related expectations (1-5) as well as consumer related expectations (6). Operator related
expectations help in troubleshooting problems. Consumer related expectations help identify
the impact of problems on consumers. Therefore, instead of being limited to consumer related
expectations (6) and leaving it to the operator to debug the system in case of problems, AOS allows
an expert to insert his own knowledge and define diagnostics (1-5) to immediately alert to possible
problems. As the knowledge about system behavior improves, new expectations asserted from newly
acquired best practices can be inserted. An on-call engineer using AOS is armed with continuously
improving knowledge about system behavior. Note that AOS automates generation and execution
of these tests in the context of the specific (instantiated) reference system (puzzle piece #9). In other
words, there is no need for scripting and defining rules that execute these tests, or keeping them
in sync with the current topology, which is known to be a complex and fragile process. AOS achieves
this as it models systems and features leveraged with a specific reference design. Instances of these
models are represented as state entities and are described in the “Distributed State Management
Infrastructure” section.
Each test can be viewed as a pairing of a test result and an expectation, with alerts being generated
when the result doesn’t match the expectation (i.e. when an anomaly is detected). Anomalies contain
actionable data that helps the operator. For example, for test (1) above, System X expects to see
System Y as its neighbor across the link connected to Port Z. If that is not the case, an alert will be
raised and show the actual (wrong) neighbor, or the absence of one. Note that this test relies on certain
features of the devices involved (“spines and leafs must support LLDP,” for example). Before a device
is approved to play a role in the specific reference design (i.e. is included in the “hardware
compatibility list”), its capabilities (in terms of supported features) are verified.
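As an illustration of pairing a result with an expectation, the cabling test can be sketched as follows. This is a simplified sketch, not the AOS implementation: the classes `CablingExpectation` and `Anomaly`, the function `check_cabling`, and the shape of the `observed` mapping (neighbors as they might be learned via LLDP) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CablingExpectation:
    system: str
    port: str
    expected_neighbor: str


@dataclass(frozen=True)
class Anomaly:
    system: str
    port: str
    expected: str
    actual: Optional[str]  # None means no neighbor was seen on the port


def check_cabling(expectations, observed):
    """Return one anomaly per expectation that the observed state violates.

    `observed` maps (system, port) to the neighbor name reported on that
    port; a missing key means no neighbor was reported.
    """
    anomalies = []
    for exp in expectations:
        actual = observed.get((exp.system, exp.port))
        if actual != exp.expected_neighbor:
            anomalies.append(
                Anomaly(exp.system, exp.port, exp.expected_neighbor, actual))
    return anomalies


expectations = [
    CablingExpectation("leaf1", "eth1", "spine1"),
    CablingExpectation("leaf1", "eth2", "spine2"),
    CablingExpectation("leaf2", "eth1", "spine1"),
]
observed = {("leaf1", "eth1"): "spine1", ("leaf1", "eth2"): "spine3"}
anomalies = check_cabling(expectations, observed)
```

Here the anomaly for `leaf1`/`eth2` carries the wrong neighbor, while the one for `leaf2`/`eth1` records that no neighbor was seen, which mirrors the two failure cases described in the text.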
In contrast, in the absence of context provided by intent and reference design, extracting that
knowledge typically involves ((b) in figure below):
• “rich visualizations” that help the expert user extract knowledge from visual
clues, typically an indication that the data is very raw
• costly integrations that involve writing correlation rules, understanding the object models
of data sources, and mapping them to anecdotal knowledge of end-to-end intent
Essentially, this means that big data telemetry in the absence of actionable insights derived from
an intent is an invitation for an operational expenses explosion, as there will be a need to spend
significant resources to extract knowledge from the data.
However, AOS actually shines in its capability to gather large amounts of information when that is
required and/or necessary (puzzle piece #13), helped by the optimized binary transport between the
device agents and the AOS state repository, coupled with the ability to process and compress data as
close to the source as possible.
CONSTRAINTS
A constraint is a limitation on the possible changes that can occur on variables or parameters in a
system. Constraints allow a user to insert certain limitations that are applicable to his environment. For
example, he may want to only consider devices of a certain type (Vendor A or B) or capacity (switches
with 6x40G ports and 48x10G ports), or restrictions around IP address pools or VLAN IDs. Constraints
essentially assist in fine-tuning the reference design to a specific environment.
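A constraint of this kind can be pictured as a filter over candidate devices. The sketch below is hypothetical: `DeviceModel`, `Constraints`, `candidates`, and the catalog entries are invented for illustration and do not reflect an AOS data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceModel:
    name: str
    vendor: str
    ports_40g: int
    ports_10g: int


@dataclass(frozen=True)
class Constraints:
    allowed_vendors: frozenset
    min_ports_40g: int
    min_ports_10g: int


def candidates(catalog, constraints):
    """Keep only the device models that satisfy every constraint."""
    return [
        d for d in catalog
        if d.vendor in constraints.allowed_vendors
        and d.ports_40g >= constraints.min_ports_40g
        and d.ports_10g >= constraints.min_ports_10g
    ]


# Hypothetical catalog entries, for illustration only.
catalog = [
    DeviceModel("a-6x40-48x10", "Vendor A", 6, 48),
    DeviceModel("b-4x40-48x10", "Vendor B", 4, 48),
    DeviceModel("c-6x40-48x10", "Vendor C", 6, 48),
]
constraints = Constraints(frozenset({"Vendor A", "Vendor B"}), 6, 48)
allowed = [d.name for d in candidates(catalog, constraints)]
```

In this example only the Vendor A model passes: the Vendor B model has too few 40G ports and Vendor C is not on the allowed list.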
ABSTRACTIONS
With respect to the modeling of the system and services, the AOS approach has a few guiding
principles:
1. Every model is a wrong one, it is just that some models are more useful than others in
specific situations (puzzle piece #1).
2. There is a fundamental difference between configuration and the status interface in
terms of their inherent complexity and performance requirements; as a result, they
should be separated and governed by different models (puzzle piece #6).
In AOS, each reference design comes with a declarative specification of what the device should be
doing in the context of a blueprint to fulfill its role. The generation of these declarative device
specifications is executed by the AOS service rendering process, which is also available for inspection
and extension by the system administrator. Service rendering generates:
1. Declarative configuration
2. Expectations and telemetry specification
3. Alert definitions
Regarding guiding principle 2, it is worth noting that the configuration is the interface to a very
creative process of system design and must be flexible and non-opinionated. It is fundamentally
different from evaluating the health state of the system. As an analogy, while it is very complex to
design robots resembling humans or develop artificial intelligence, assessing the health of a human is
a much more structured and defined process.
From the declarative device level specification, the composition of device features into configuration
in AOS is done in code (Python) and is under full control (inspection/extension) of the system
administrator, following the current DevOps best practices. However, AOS generates these
declarative specifications, relieving the designer or system administrator from generating hundreds of
files, and maintaining their coherence as the system goes through changes. This change management
is supported by the reactive, publish-subscribe backend that updates the blueprint in response
to various state changes. Another aspect of service rendering is that it generates the definition of
tests and expectations to validate the service delivery. Here, the device receives a set of tests to
execute (from the ever-increasing catalogue of supported tests). Reactive logic detects anomalies
and generates alerts when a test result doesn’t match the expectation. This telemetry gathering is
supported by strongly typed interfaces as performance and predictability are of the essence here.
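The three outputs of service rendering can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the function name `render_device`, the input shape (a list of port and neighbor pairs taken from the blueprint), and the output schema are hypothetical and are not the AOS rendering code.

```python
def render_device(name, links):
    """Derive config, expectations, and alert definitions for one device.

    `links` is a list of (local_port, neighbor) pairs; the output schema
    is invented for this sketch.
    """
    # 1. Declarative configuration: what the device should look like.
    config = {
        "hostname": name,
        "interfaces": {
            port: {"enabled": True, "description": f"to {neighbor}"}
            for port, neighbor in links
        },
    }
    # 2. Expectations: what telemetry should report if the service works.
    expectations = [
        {"test": "cabling", "port": port, "expected": neighbor}
        for port, neighbor in links
    ] + [
        {"test": "interface_state", "port": port, "expected": "up"}
        for port, _ in links
    ]
    # 3. Alert definitions: one per expectation, raised on mismatch.
    alerts = [
        {"test": e["test"], "port": e["port"], "raise_if": "result != expected"}
        for e in expectations
    ]
    return {"config": config, "expectations": expectations, "alerts": alerts}


rendered = render_device("leaf1", [("eth1", "spine1"), ("eth2", "spine2")])
```

Since all three outputs come from the same input, a change to the links updates configuration, expectations, and alerts together, which is the coherence property the text describes.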
The communication fabric referenced above implies that there exists a logical communication channel
between agents in the system. Agents communicate via attribute-based interfaces (hence data centric
in the previous description) by publishing entities and subscribing to changes in entities. Data centric
also implies that data definition is part of the framework and is implemented by defining the entities,
as opposed to message based systems, for example. Note that a data centric publish-subscribe system
does not suffer from the problems that plague message based systems. Namely, sooner or
later within a message based system, the number of messages exceeds the capacity of the system to
store or consume them; dealing with this is hard as one has to replay the history of messages to get
to a consistent state. On the other hand, the data centric system is resilient to surges in state changes
as it is fundamentally dependent only on the last state (puzzle piece #4). This state captures the
http://www.apstra.com/docs/The_Distributed_Systems_Challenge_in_Data_Center_Automation.pdf
Hard problems (elasticity, fault tolerance) are solved once and on behalf of all agents. Typical
architecture then consists of a number of stateless agents that can be restarted in case of failure and
pick up from where they left off by simply re-reading the state they subscribe to from sysdb.
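The difference between keeping the last state and keeping a message history, and the way a stateless agent recovers, can be sketched as follows. `LastStateStore` and `StatelessAgent` are hypothetical names for this illustration; the real sysdb interfaces are not shown in this paper.

```python
class LastStateStore:
    """Keeps only the most recent value per entity (hypothetical sketch)."""

    def __init__(self):
        self._state = {}

    def publish(self, key, value):
        self._state[key] = value  # overwrite: history is not retained

    def snapshot(self):
        return dict(self._state)


class StatelessAgent:
    """Holds no durable state; its view is rebuilt from the store on start."""

    def __init__(self, store):
        self.view = store.snapshot()

    def down_interfaces(self):
        return sorted(k for k, v in self.view.items() if v == "down")


store = LastStateStore()
message_log = []  # what a message based design would have to retain

# A surge of state changes on a flapping interface.
for i in range(10_000):
    status = "up" if i % 2 == 0 else "down"
    store.publish("leaf1:eth1", status)
    message_log.append(("leaf1:eth1", status))
store.publish("leaf1:eth2", "up")
message_log.append(("leaf1:eth2", "up"))

agent = StatelessAgent(store)
restarted = StatelessAgent(store)  # simulates a restart after a failure
```

After the surge the store holds two entries while the message log holds 10,001, and the restarted agent reaches the same view as the original without replaying anything.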
Figure 6.
Application Agents (AA) are responsible for performing application domain specific transformations,
by subscribing to input entities and producing output entities.
Device Agents (DA) reside on (or are proxies for) a managed physical or virtual system such as a switch,
server, firewall, or load balancer and are used for writing configuration and gathering telemetry using
native (device specific) interfaces.
Both AOS built-in applications and custom defined applications follow the same pattern. Built-in
applications are for the users that want to leverage validated, best practice reference design
implementations and value predictability over extensibility. Custom applications, on the other hand,
are implemented by expert users who need to insert their expertise into AOS.
EXTENSIBILITY
AOS was built on the premise that users will want to extend it or even build applications from scratch.
Essentially, extensibility allows the user to control configuration generation, telemetry collection, and
alert generation. To achieve that goal, AOS exposes these knobs:
While the examples in this paper have focused on the data center use case, there is nothing data
center specific (or network specific for that matter) in the framework.
PROCESS OVERVIEW
While it is important to capture the state of managed infrastructure, it is equally important to capture
the state of the service lifecycle process.
1. Design phase - helps the user formulate a design template with a given set of constraints.
In this context, AOS essentially provides a “computer aided design” (CAD)
tool for designing network services.
• AOS allows for a “sliding scale of abstraction,” where the user can start with a
very high level specification (“number of servers”) and have the design process
guide him and calculate possible options, subject to specified constraints,
resulting in a blueprint. Alternatively, at the other end of the spectrum, AOS
allows explicit specification of the whole blueprint, including, for example,
custom cabling and IP addressing allocations. Everything that is available in the
UI is available via APIs.
As discussed previously, the real system complexities causing 70-80% of outages are due to changing
configuration rather than initial deployment. Because configuration, telemetry, and expectations
within AOS are derived from a single source of truth - the user-specified intent - in an idempotent
fashion, there is no difference between initial deployment and change management. “Rolling back”
changes is equivalent to re-loading intent from some desired point in the past.
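The idea that rollback is just re-rendering an earlier intent can be shown with a short sketch. It assumes rendering is a pure function of intent; `render`, `IntentStore`, and the intent fields (`leaves`, `vlans`) are hypothetical and chosen only for illustration.

```python
import copy


def render(intent):
    """Pure function of intent: the same input always yields the same output."""
    return {
        f"leaf{i}": {"role": "leaf", "vlans": sorted(intent["vlans"])}
        for i in range(intent["leaves"])
    }


class IntentStore:
    """Keeps every committed version of the intent (hypothetical sketch)."""

    def __init__(self):
        self._history = []

    def commit(self, intent):
        self._history.append(copy.deepcopy(intent))
        return len(self._history) - 1  # version number

    def load(self, version):
        return copy.deepcopy(self._history[version])


store = IntentStore()
v0 = store.commit({"leaves": 2, "vlans": [10, 20]})
v1 = store.commit({"leaves": 3, "vlans": [10, 20, 30]})

initial = render(store.load(v0))      # initial deployment
changed = render(store.load(v1))      # change management: same code path
rolled_back = render(store.load(v0))  # rollback: re-load an earlier intent
```

Initial deployment, change, and rollback all go through the same `render` call, so there is no separate rollback logic to write or test.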
AOS adopts DevOps practices by using agents to generate (in code and using templates)
configurations for all the systems in the infrastructure. AOS then builds on top of this DevOps
practice, eliminating manageability headaches, by generating these configurations from the single
source of truth (intent) and based on the role they are playing and the reference design. With this
approach, AOS takes data center control to the next level by introducing Expectations Driven Service
Delivery, which drives context rich telemetry and provides simple and knowledge rich alerts that
clearly identify service and customer impacts. These benefits can be consumed out of the box as well
as extended to serve specific consumer needs. In the context of a reference design, AOS is the only
control knob the consumer needs.