
E-BOOK

Kubernetes
Observability
Monitoring, troubleshooting and securing
Kubernetes with Sumo Logic
Table of contents

Introduction

Chapter 1
Understanding the Kubernetes monitoring landscape

Chapter 2
Challenges of monitoring and troubleshooting in Kubernetes environments

Chapter 3
Why traditional Kubernetes M&T solutions fail

Chapter 4
What to monitor

Chapter 5
Collecting Kubernetes data

Chapter 6
Sumo Logic: A unified DevSecOps platform for Kubernetes

Conclusion
Six reasons why you should choose Sumo Logic for Kubernetes monitoring

Appendix A
Kubernetes metrics
Introduction

Why is monitoring Kubernetes hard? What is observability in Kubernetes? How do I achieve it?

Legacy monitoring solutions weren't designed to monitor systems with thousands of components, all of which are constantly moving, being replaced, and scaling from moment to moment. Most monitoring tools were designed to keep track of primarily static environments where the only changes were externally driven and carefully planned. Systems were centralized, monitoring was centralized, and security was centralized. But long gone are the days when you could print out an architecture map of your infrastructure and applications and post it on the wall.

Organizations are shifting towards containers, serverless, and Kubernetes at an amazing rate. The primary driver for this swift adoption is the increased pressure on engineering teams to keep pace with the more dynamic needs of the business. Application developers need to accelerate the development and deployment of business logic, while platform engineers need to be able to accept changes from developers at a faster rate. Microservices are the key to staying competitive—and Kubernetes is the key to handling microservices. With near exponential growth, there is no longer any debate that Kubernetes is the dominant solution.
Industry watchers have loudly trumpeted the rapid adoption of Kubernetes across various deployment modes, including self-managed clusters; managed services such as Amazon EKS, Azure AKS, and Google GKE; and Kubernetes distributions such as Red Hat OpenShift and Docker Enterprise Edition. Just six months ago, a little over half (57%) were using Kubernetes in any of its forms. In our survey today, 86% are using Kubernetes.

[Figure: Kubernetes' market share dwarfs other orchestrators. Responses to "What do you use to orchestrate your containers?": Fall 2018, 57% Kubernetes vs. 43% non-Kubernetes; Spring 2019, 86% Kubernetes vs. 14% non-Kubernetes.]

StackRox, The State of Container and Kubernetes Security, Spring 2019

With containers and Kubernetes now commanding a greater footprint, companies need new tools to manage and operate these dynamic workloads. Site Reliability Engineers (SREs) and platform engineers need to be able to successfully operate the Kubernetes orchestration layer. Developers must understand the impact of that layer on their microservices, and security analysts need to have the visibility to secure the Kubernetes environment from external hackers. All parties need a way to ingest, observe, alert on, and understand the data streams exposed through Kubernetes.
Who is this book for?

This book focuses on the observability of Kubernetes environments. We will cover best practices for data collection, data filtering and enrichment, and monitoring. We will take a look at the challenges various stakeholders face when monitoring Kubernetes environments, and how to enable observability of Kubernetes with Sumo Logic.

Chapter 1:
Understanding the Kubernetes monitoring landscape

Goals of monitoring

First, what are we trying to accomplish by monitoring? There are many answers to this question, but often, the primary reason is to ensure reliability. Are things working as expected? If not, what is broken and why?

The platform team is interested in understanding if their stack — cloud infrastructure, CI/CD, Kubernetes, etc. — is operating as expected. The development team is concerned with the application running on top of the stack. Meanwhile, the security team is focused on quickly identifying threats and ensuring that no part of the system has been compromised. While each of these roles might come at monitoring from a different angle, the end goal is typically the same. In each case, reliability comes down to observing that there is a problem, determining the source of that problem, attempting a fix, and then validating that the fix worked.

Observability vs monitoring

Observability is not a new term. It has a long history stemming from engineering and control theory. It is defined as the ability to infer the state of a system based on its outputs. This is how we will use it here, as an attribute of our system. Is our Kubernetes environment observable or not? Monitoring is the action we take to observe that system.

Note: Observability has more recently been defined in reference to the three pillars of observability: logs, metrics, and tracing. While we agree that this definition is a useful starting place, we also recognize the lack of value in monitoring for monitoring's sake. If each pillar is implemented independently with a different tool, there are diminishing returns when it comes to the overhead of management and the contribution towards overall observability.

In Distributed Systems Observability, Cindy Sridharan argues that

"Logs, metrics, and traces are useful tools that help with testing, understanding, and debugging systems. However, it's important to note that plainly having logs, metrics, and traces does not result in observable systems."

In this text, we will discuss observability comprehensively as the ability to infer state.

Chapter 2:
Challenges of monitoring and troubleshooting in Kubernetes environments

Kubernetes is great but complex!

Whether to enable hybrid and multi-cloud, promote deeper specialization among development teams, enhance reliability, or simply stay ahead of the curve, organizations are reaping the varied benefits of this technology investment—but it comes at a cost. With each optimization, there are tradeoffs. With each layer of abstraction comes less visibility, resulting in more complexity when something goes wrong. As organizations race to adopt Kubernetes, unique challenges emerge that stretch the limits of existing monitoring solutions.

There are many more things to monitor

Instead of monitoring a static set of physical or virtual machines, containers are orders of magnitude more numerous with much shorter lifespans. Thousands of containers now live for mere minutes while serving millions of users across hundreds of services. In addition to the containers themselves, administrators must also monitor the Kubernetes system and its many components, ensuring they are all operating as expected. When trying to display the sheer volume of information pouring out of a containerized environment, most tools come up short.

[Figure: Volume of containers with lifespans under an hour, plotted against container age in minutes. The large volume of containers (generated by multiple customers) lasting less than 5 minutes indicates the potential for net new application architectures using containers for periods of time far less than the amount of time typically needed to activate a virtual machine. Source: New Relic Docker Beta Program Development Analysis.]

Everything is ephemeral

Everything in Kubernetes is, by design, ephemeral. Kubernetes achieves its elastic ability to scale and contract by taking control over how pods—and the containers within those pods—are deployed. A job needs to be done and Kubernetes schedules a pod. When the job is complete, the pod is destroyed just as freely. But zoom out and we notice that Kubernetes has made the nodes replaceable as well. A server dies and pods are rescheduled to available nodes. Zoom out yet again to the clusters and these too are just as easily replaced.

You have to zoom all the way out to the services to find a component with any staying power inside of Kubernetes. Services and deployments represent the core application. They still change, but much less than their underlying components. Most tools weren't designed to look at an environment from the perspective of these logical abstractions. But these logical abstractions are how Kubernetes organizes itself. Kubernetes has different hierarchies: service-, namespace-, deployment-, or node-centric views. Tools should have the flexibility to view Kubernetes through these various lenses.

Tools are distributed

Between logging tools, metrics tools, GitHub, and even SSH, engineers are constantly switching between a variety of tools to gain a complete picture of their system, i.e., observability. Walking through a typical alert investigation, we can quickly get a sense of this. An alert comes in and we immediately go check the logs to find out more about the specific problem. Running through a mental checklist of potential problems, we log into GitHub to see if any new code has been pushed. Did Kubernetes make any scheduling decisions? What are the upstream and downstream dependencies of the error I am seeing? And so on. Rarely are the answers to the puzzle nicely connected and in one place. But the more they are, the quicker we can resolve the issue.

[Figure: Kubernetes has various hierarchies, and Sumo Logic allows you to look at your data through these different lenses depending on the situation. Node view (Cluster > Node > Pod > Container) to observe infrastructure resources; Namespace view (Cluster > Namespace > Pod > Container) to compare dev, lab, or production environments; Deployment view (Cluster > Namespace > Deployment > Pod > Container) for visibility into deployment groupings; Service view (Cluster > Namespace > Service > Pod > Container) to monitor application services for user experience visibility.]


[Figure: Workflow and distributed toolset required for troubleshooting today. Get an application alert (New Relic); check application logs (Sumo Logic); check Kubernetes configuration (GitHub); check GitHub to see if new code was pushed; ask whether Kubernetes made any scheduling decisions (Prometheus); check for events that happened in Kubernetes (kubectl); think through the application mental model for upstream and downstream dependencies; check metrics to compare whether the problem exists in production and dev environments (Prometheus); check with the cloud provider to see if limits are being hit; check pod and node networking (SSH); check kernel metrics (lsof); check metrics at the node/server/VM level (Prometheus).]


Chapter 3:
Why traditional Kubernetes M&T solutions fail

Kubernetes has several key differences that push the limits of traditional application monitoring. Due to the distributed, ephemeral nature of Kubernetes, most existing solutions fail to give the visibility we might expect, resulting in longer resolution times. Looking at these potential pitfalls can help guide us as we take a fresh look at Kubernetes management and monitoring.

Infrastructure focused

Traditional monitoring solutions look at applications from a hardware- or server-centric perspective. This makes sense for legacy solutions where the underlying infrastructure would often stay the same for months or even years, but this is no longer the case.

Pods, nodes, even clusters can all be destroyed and rebuilt with ease. Effectively monitoring what is running in Kubernetes means monitoring at the application level, focusing on the Service and Deployment abstractions. Understanding what is happening from a service and deployment perspective is critical to understanding the overall health of your application, and by extension, the customer experience. Monitoring solutions should align with the way Kubernetes is organized, as opposed to trying to fit Kubernetes into our legacy modes.

Fragmented visibility

Most solutions only provide visibility into a piece of the Kubernetes environment. Admins are forced to navigate between tools for logs, metrics, events, and security threats to build a real-time picture of application health.

"I want a unified view... [of] control plane, pod health, and node health."
Jeremy Proffitt, SRE, Lending Tree

[Figure: Infrastructure-centric visibility vs. service-centric visibility for the Payment, User, and Maps services.]

Lack of correlation

Furthermore, not only the tools but the data are also fragmented. It is near impossible to connect the dots between metrics on a node and logs from a pod on that node.

This is because the metadata tagging of the data being collected is not consistent. A metric might be tagged with the pod and cluster it was collected from, while a log might be labeled using a different naming convention. The metadata enrichment process must be streamlined and centralized to gain consistent tagging, and therefore, correlation.

Security vulnerabilities

Unfortunately, security visibility is often a low priority for teams running Kubernetes, and existing toolsets rarely capture any sort of security events for Kubernetes. Due to the lack of end-to-end visibility into Kubernetes environments, the risk of undetected security threats is a real issue. Kubernetes also makes it challenging to identify vulnerabilities in images at runtime, enforce security policies, and detect and remediate threats.

That said, end users won't care about the difficulties involved when their data is compromised. It is essential to take a more DevSecOps-style approach in Kubernetes environments that incorporates security considerations into the CI/CD lifecycle and elevates security visibility to the same importance as operational visibility.

[Figure: In traditional solutions, log and event collection and enrichment (Fluentd feeding a logging backend) happens separately from metric collection and enrichment (Prometheus feeding a metrics backend), inhibiting the ability to correlate data during troubleshooting.]

Chapter 4:
What to monitor

Building observability in Kubernetes

Building observability in Kubernetes begins with gathering data. So what data is available to collect? Further, what data should we collect? What data is going to be most useful to us?

In Kubernetes, we can gather logs, metrics, and events throughout our cluster.
•• Metrics serve as the heartbeat of the cluster, keeping track of global health and alerting if there are problems.
•• Logs help determine the source of those problems.
•• Events give us further insight into decisions made inside the cluster, such as resource state changes, error messages, and notifications.

In the case of logs, metrics, and events, each of these data points is distributed throughout the Kubernetes environment, so our first job is going out and collecting them.

Kubernetes architecture overview

Kubernetes is made up of many components that talk to each other through the API server. These components expose metrics, store logs, and create events which we can collect. We can break the components down into four main parts:
1) The Control Plane - Master
2) Nodes
3) Pods
4) Containers

The Control Plane - Master Node

Kubernetes is an orchestration platform, and the control plane facilitates that orchestration through multiple components:
•• The API Server — The API server runs on the master node and acts as the front end for Kubernetes. All communication happens through the API server, as everything in Kubernetes is API based. The API server has the best high-level view of the system and will be integral to understanding the state of our cluster.
•• etcd — etcd is the distributed key-value store and the heart of the Kubernetes cluster. etcd serves as the Kubernetes backend and maintains the state of the system.
•• Scheduler — The scheduler is responsible for finding places for pods to run. The scheduler scans the API server for unscheduled pods and then determines if nodes have enough compute resources to run them.
•• Controller manager — The controller manager is responsible for making sure the Kubernetes state matches our desired state. It runs a series of reconciliation control loops, and if there is a mismatch, it takes action to fix it.

Nodes

Nodes make up the collective compute power of the Kubernetes cluster. This is where containers get deployed to run. Kubelet is the main component to monitor on the nodes. Kubelet runs on every node in the Kubernetes cluster, including the master. Kubelet is responsible for reporting on and monitoring the health of the node. Keeping a close eye on Kubelet ensures that the Control Plane can always communicate with the node that Kubelet is running on.

Pods

Pods are the lowest-level resource in the Kubernetes cluster. A pod is made up of one or more containers. Containers in a given pod will share the same namespace, and the same storage and resources.

Containers

Containers run inside pods. Containers run the application workloads as well as some Kubernetes components.

[Figure: Kubernetes cluster architecture: a master node running the API Server, Scheduler, Controller Manager, and etcd, and worker nodes each running Kubelet and Kube Proxy, with pods containing containers.]
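On clusters where these control plane components run as pods (managed services typically hide the master nodes), a quick way to see them, and the nodes their Kubelets report on, is with kubectl; nothing below is specific to Sumo Logic:

    # System and control plane components, where they are exposed as pods:
    kubectl get pods -n kube-system -o wide

    # Node status as reported by each node's Kubelet:
    kubectl get nodes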


Kubernetes metrics

Each of these components does its specific job while simultaneously exposing metrics. Here is an example of what each of these components exposes, and how we can use that data to understand cluster health. For a detailed list of component metrics, see Appendix A.

API server
As most communication happens through the API server, monitoring API server request latency can give you a quick insight into larger issues that might be impacting your cluster.
•• API server request latency
•• Requests per minute
•• Etcd requests

etcd
etcd uses the Raft protocol to elect a leader to manage coordination between the other members, as etcd is a distributed key-value store. While leader changes are normal, too many could be a sign of a problem.
•• Leader changes
•• Quorum
•• Disk space

Controller manager
Monitor the requests it is making to your cloud provider to ensure the controller manager can successfully orchestrate. Currently, these metrics are available for AWS, GCE, and OpenStack.
•• Cloud provider latency
•• Scheduling latency

Scheduler
Watching the request limits will ensure pods don't fail to run due to lack of resources.
•• Request limits
•• Quota limits
•• Anti-affinity policy
Note: Affinity allows you to control where pods run based on specific hardware requirements.

Kubelet
Keeping a close eye on Kubelet ensures that the Control Plane can always communicate with the node that Kubelet is running on.
•• Containers currently running
•• Current runtime operations
•• Operation latency

Nodes
Visibility into the standard host metrics of a node ensures you can monitor the health of each node in your cluster, avoiding any downtime as a result of an issue with a particular node.
•• CPU
•• Memory consumption
•• System load
•• Filesystem activity
•• Network activity


Containers
At a minimum, you need access to the resource consumption of containers. Kubelet accesses the container metrics from cAdvisor, a daemon that collects, aggregates, processes, and exports information about running containers.
•• Resource consumption
•• CPU
•• Memory
•• File system
•• Network usage

Kube-state-metrics
Kube-state-metrics is a Kubernetes add-on that provides insights into the state of Kubernetes. It watches the Kubernetes API and generates various metrics, so you know what is currently running. Metrics are generated for just about every Kubernetes resource, including pods, deployments, daemonsets, and nodes.
•• Pod status
•• Container resource limits and requests
•• Reason container is in waiting state
•• Node status
•• Deployment status
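All of the components listed above expose their metrics over HTTP, typically in the Prometheus text format covered in the next chapter. As a quick, read-only sanity check, the API server's own endpoint can be viewed through kubectl (this assumes your user is allowed to read the /metrics path):

    # Show the first few raw metrics exposed by the API server:
    kubectl get --raw /metrics | head -n 20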

Kubernetes logs

Logs are how we can answer why something is happening. They provide information regarding what the code is doing and the actions it is taking. Kubernetes delivers a wealth of logging for each of its components and the containerized workloads running in Kubernetes. Access to these logs ensures you have comprehensive visibility to monitor and troubleshoot your applications.

Container workloads
The containers running in Kubernetes emit logs, which then get stored on that node. Logs from these workloads provide information about the decisions the code is making and the actions it is taking.

Kubernetes components
Logs from these components give insights into the decisions made by Kubernetes.
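While a pod still exists, its container logs can be read directly through the API server; the pod and namespace names below are placeholders:

    # Current logs for a pod:
    kubectl logs <pod-name> -n <namespace>

    # Logs from the previous container instance, useful after a crash or restart:
    kubectl logs <pod-name> -n <namespace> --previous

Once the pod is gone, so are these logs, which is why the cluster-level logging approach described in the next chapter matters.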

Kubernetes events

Events in Kubernetes are a great resource to help understand resource state changes, error messages, and other notifications that are relayed throughout the cluster. A non-exhaustive listing of event types can be found in the source code for Kubelet. The event reason gives some minimal insight into when and where these occur.
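Events are ordinary API objects, so they can be listed and filtered like any other resource; the namespace and reason values below are only illustrative:

    # Recent events across the cluster, oldest first:
    kubectl get events --all-namespaces --sort-by='.lastTimestamp'

    # Only scheduling failures in one namespace (reason strings vary by component):
    kubectl get events -n my-namespace --field-selector reason=FailedScheduling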


Chapter 5:
Collecting Kubernetes data

Now that we understand what machine data is available to us, how do we get to this data? The good news is that Kubernetes makes most of this data readily available; you just need the right tool to gather and view it. The solution we will discuss here heavily utilizes open source tools for collection and data enrichment because of their deep integrations and overwhelming community support.

Open-source data collection

Perusing the CNCF website, we quickly discover a wealth of tools built around Kubernetes to enable not only monitoring but networking, storage, and security. There is even a Kubernetes-specific package manager, Helm, to make the deployment and management of these resources easy and consistent.

There are a couple of key benefits to taking advantage of open-source collectors.
1. They stay up to date. Each of these tools benefits from deep community support. As new versions of Kubernetes are released, the extensive use of each of these tools ensures they are quickly updated in turn.
2. They integrate with everything. Regardless of your unique stack, it is likely that there is support for what you might want to export data from. The importance of these integrations cannot be overstated, as they enable the flexibility needed to grow and evolve a Kubernetes deployment over time.

Metrics collection

Prometheus is the de facto tool of choice for metrics monitoring, endorsed by the CNCF. It has a huge following and extensive support for anything you might want to collect metrics data from. Prometheus works by pulling data from all of the components and jobs running in Kubernetes. Every component of Kubernetes exposes its metrics in a Prometheus format. The running processes behind those components serve up the metrics on an HTTP URL. For example, the Kubernetes API Server serves its metrics on https://$API_HOST:443/metrics.

Prometheus is particularly good at auto-discovering the jobs and services currently running in a Kubernetes cluster. As pods are added, removed, or restarted, the Kubernetes Service construct keeps track of what pods exist for a given service. This auto-discovery capability is one of the primary reasons for Prometheus' popularity, ensuring that all new and existing components are monitored.

[Figure: Prometheus pulls data from all of the components and jobs running in Kubernetes.]
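As a rough sketch of what that looks like in practice, a Prometheus scrape job can be pointed at the Kubernetes API for discovery; the job name and the opt-in annotation convention below are illustrative choices rather than anything Kubernetes mandates:

    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod                # discover every pod through the Kubernetes API
        relabel_configs:
          # Scrape only pods that opt in with a prometheus.io/scrape: "true" annotation
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: "true"
          # Carry Kubernetes metadata onto every scraped series
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod

As pods come and go, the discovered target list changes with them, so no static list of hosts ever has to be maintained.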


Log collection

Kubernetes does not define a single standard approach to log collection, but the most common method is called cluster-level logging. Cluster-level logging deploys a node-level logging agent to each node, which then funnels data to a separate backend for storage and analysis of logs. The primary benefit of this solution is that if a pod dies, the logs detailing what happened are retained. Implementing node-level logging without funneling data to a logging backend will not retain log data if pods are evicted or die. Cluster-level logging ensures that data is captured and retained. A common tool for implementing cluster-level logging is Fluentd — or Fluent Bit, a lightweight version of Fluentd — which acts as the node-level logging agent funneling data to a logging backend, like Sumo Logic.

Event collection

Events provide insight into decisions being made by the cluster and unexpected events that occur in Kubernetes. Events are stored in the API server on the master node and collected using the same method as log collection — via a node-level logging agent like Fluentd.

Setup using Helm

Finally, collectors for logs, metrics, events, and security can be easily deployed using Helm — an open source Kubernetes package manager. Helm can significantly simplify the setup process, reducing hundreds of lines of configuration to one. These collection plugins can be used on any Kubernetes cluster, whether one from a managed service like Amazon Elastic Kubernetes Service (EKS) or a cluster you are running entirely on your own.

[Figure: Cluster-level logging implementation with Fluent Bit deployed to all nodes for node-level logging and Fluentd acting as a centralized metadata enrichment pipeline, grabbing enrichment data from the API server.]
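In practice, "reducing hundreds of lines of configuration to one" comes down to a single Helm release. The commands below are only a sketch; the repository URL, chart name, and value keys should be taken from the current Sumo Logic collection documentation rather than from here:

    # Illustrative Helm 3 commands; chart, repo URL, and values are placeholders.
    helm repo add sumologic https://sumologic.github.io/sumologic-kubernetes-collection
    helm repo update
    helm install collection sumologic/sumologic \
      --namespace sumologic \
      --set sumologic.accessId=<SUMO_ACCESS_ID> \
      --set sumologic.accessKey=<SUMO_ACCESS_KEY> \
      --set sumologic.clusterName=my-cluster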


Chapter 6:
Sumo Logic: A unified DevSecOps platform for Kubernetes

With Sumo Logic, we can put all of these pieces together to build end-to-end observability in Kubernetes.
1. Setup and collection - The entire collection process can be set up with a single Helm chart. Fluent Bit, Fluentd, Prometheus, and Falco are deployed throughout the cluster in order to collect log, metric, event, and security data.
2. Enrichment - Once collected, the data flows into a centralized Fluentd pipeline for metadata enrichment. Data is enriched — tagged — with the details about where in the cluster it originated: the service, deployment, namespace, node, pod, container, and their labels.
3. Sumo Logic - Finally, the data is sent to Sumo Logic via HTTP for storage, access, and, most importantly, analytics.

Metadata enrichment

Unified metadata enrichment is critical to building context about the data in your cluster and the hierarchy of the components present. Standalone Prometheus or Fluentd deployments give some context about the data — node, container, and pod level information — but not valuable insight into the service, deployment, or namespace. Sumo Logic uses Fluentd as a centralized metadata pipeline to query the API server and gain rich context about the data getting passed into Sumo Logic.

Note
Labels — When you create objects in Kubernetes, you can assign custom key-value pairs to each of those objects, called labels. These labels can help you organize and track additional information about each object. For example, you might have a label that represents the application name, the environment the pod is running in, or perhaps what team owns this resource. These labels are entirely flexible and can be defined as you need. Our Fluentd plugin ensures those labels get captured along with the logs, giving you continuity between the resources you have created and the log files they are producing.
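For instance, a Deployment carrying the kinds of labels described in the note above might look like the following sketch; every name and label value here is hypothetical, and these are what the enrichment pipeline would attach, along with service, namespace, node, pod, and container details, to the data this workload produces:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: checkout                 # hypothetical application name
      namespace: prod
      labels:
        app: checkout
        environment: production
        team: payments               # example of an ownership label
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: checkout
      template:
        metadata:
          labels:
            app: checkout
            environment: production
            team: payments
        spec:
          containers:
            - name: checkout
              image: example.com/checkout:1.4.2   # placeholder image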

[Figure: Sumo Logic collection and enrichment for logs, metrics, events, and security data. Collection: logs via Fluent Bit, events via Fluentd, metrics via Prometheus, and security data via Falco. Enrichment: service, deployment, namespace, node, pod, and container metadata.]


By centralizing metadata enrichment, the Sumo Logic solution reduces the load on the Kubernetes API server and ensures consistent metadata tagging across logs, metrics, and events, without which it would be impossible to correlate data when troubleshooting. You can use this metadata when searching through your logs and your metrics, and use them together to have a unified experience when navigating your machine data. (Figure 1)

"I really like the service terminology, and it was something that was really hard to do in other platforms."
Lior Mechlovich, Platform SRE, Informatica

Ingestion into Sumo Logic

There is tremendous value in having this data come to a single place. With metrics serving as the smoke detector, and logs enabling us to drill down to the root cause, unifying these data sources around a common metadata language enables us to easily correlate these signals. We can pivot from the metrics data about a cluster, to the events data about a cluster, to the logs data about an application.

Metadata enables us to build a hierarchical view of a cluster. By connecting pods to their services or grouping nodes by cluster, it becomes easier to explore the Kubernetes stack. By tapping into the auto-discovery capabilities inherent in Prometheus, we can ensure that the hierarchy visualized in Sumo Logic is accurate and up to date. (Figure 2)

Tying together DevOps and SecOps

We can take this further by providing data about security-relevant events in the context of the Kubernetes mental model. Below we can see top security rules triggered in the cluster overview. Zoom in and we see this same data for the service or namespace, and so on. By displaying security information within the natural hierarchies of Kubernetes, we can enable a consistent view across DevOps and SecOps to build closer and more efficient DevSecOps cooperation. (Figure 3)

Figure 1. Namespace overview gives quick visibility into pods experiencing issues or, in this case, in a CrashLoopBackOff state.

Figure 2. Rich metadata enables Sumo Logic to automatically build out the explorer hierarchy of the components present in your cluster,
and keep the explorer up to date as pods are added and removed. 

Figure 3. Security visibility is available at the cluster level alongside log, metric, and event data.


Kubernetes security, application security, and network security

Zooming out, we can also take our Kubernetes security data and insert it into our high-level security dashboards. Combining infrastructure security, network security, full-stack security, and Kubernetes security gives us comprehensive visibility into the entire security story.

[Figure: Full-stack application security monitoring. Cloud infrastructure security monitoring spans several layers: Network & OS (operating system, firewall, network devices); Application services (AWS CloudFront, Akamai, Fastly); Application code (Java, Scala, .NET, Rails, Serverless/Lambda); Database and storage services (RDS, SQL, NoSQL, S3, Oracle); and Infrastructure, container, and orchestration (Docker, Kubernetes, AKS, EKS, GKE).]


Conclusion:
Six reasons why you should choose Sumo Logic for Kubernetes monitoring

In order to fully take advantage of the benefits of Kubernetes, creating a comprehensive monitoring system is crucial. Sumo Logic is uniquely positioned to provide end-to-end visibility of logs, metrics, and security posture in Kubernetes environments. By leveraging the Kubernetes ecosystem, we can strike a balance between simplicity and flexibility.

Unified visibility
Sumo Logic combines metrics, logs, events, and security to create a real-time view of the performance, uptime, and security of a Kubernetes platform.

Application-centric visibility
Sumo Logic allows admins to monitor and troubleshoot their environments using the mental model of Kubernetes and that of their custom application, rather than being forced through the lens of a server-based approach. View a Kubernetes environment through its different hierarchies: node, deployment, service, and namespace.

CNCF standards-based
Sumo Logic's solution leverages the de facto standards endorsed by the Cloud Native Computing Foundation (CNCF). Sumo Logic's solution utilizes the extensive ecosystem of integrations already created and maintained for monitoring Kubernetes.

Centralized metadata enrichment
Sumo Logic centralizes metadata enrichment, enabling consistent tagging across logs, metrics, events, and security data. Consistent tagging enables admins to correlate critical metrics data, to Kubernetes event data, to log data about their application.

Dynamic out-of-the-box dashboards
Sumo Logic auto-discovers the state of Kubernetes environments and provides admins with dynamic dashboards that update automatically based on incoming data. As the Kubernetes environment changes and new services, pods, and nodes are added, Sumo Logic's dashboards will adjust in real time without additional configuration.

End-to-end security visibility
Sumo Logic provides out-of-the-box security visibility in the context of the Kubernetes mental model. Because Sumo Logic is also a security platform, Kubernetes security data can be incorporated into comprehensive security dashboards for security visibility across the network, application, and Kubernetes cluster.


Appendix A: Kubernetes metrics

Common metrics

Kubernetes is written in GoLang and reveals some essential metrics about the GoLang runtime. These metrics are necessary to keep an eye on the state of what is happening in your GoLang processes. There are also critical metrics related to etcd. Multiple components interact with etcd, and keeping an eye on those interactions gives you insights into potential etcd issues. Below are some of the top GoLang stats and common etcd metrics to collect that are exposed by most Kubernetes components.

go_gc_duration_seconds (All): A summary of the GC invocation durations.

go_threads (All): Number of OS threads created.

go_goroutines (All): Number of goroutines that currently exist.

etcd_helper_cache_hit_count (API Server, Controller Manager): Counter of etcd helper cache hits.

etcd_helper_cache_miss_count (API Server, Controller Manager): Counter of etcd helper cache misses.

etcd_request_cache_add_latencies_summary (API Server, Controller Manager): Latency in microseconds of adding an object to the etcd cache.

etcd_request_cache_get_latencies_summary (API Server, Controller Manager): Latency in microseconds of getting an object from the etcd cache.

etcd_request_latencies_summary (API Server, Controller Manager): Etcd request latency summary in microseconds for each operation and object type.

Kubernetes control plane

The Kubernetes Control Plane is the engine that powers Kubernetes. It consists of multiple parts working together to orchestrate your containerized applications. Each piece serves a specific function and exposes its own set of metrics to monitor the health of that component. To effectively monitor the Control Plane, visibility into each component's health and state is critical.

API server

The API Server provides the front end for the Kubernetes cluster and is the central point with which all components interact. The following are the top metrics you need to have clear visibility into the state of the API Server.

apiserver_request_count: Count of apiserver requests broken out for each verb, API resource, client, and HTTP response contentType and code.

apiserver_request_latencies: Response latency distribution in microseconds for each verb, resource, and subresource.
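Because apiserver_request_count carries verb and HTTP response code labels, it lends itself to a simple error-rate alert. The Prometheus rule below is only a sketch, with an arbitrary threshold:

    groups:
      - name: kubernetes-apiserver
        rules:
          - alert: ApiServerErrorsDetected
            # Rate of server-side (5xx) responses per verb over the last 5 minutes
            expr: sum(rate(apiserver_request_count{code=~"5.."}[5m])) by (verb) > 1
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "API server returning 5xx responses for verb {{ $labels.verb }}"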


Etcd

Etcd is the backend for Kubernetes. It is a consistent and highly available key-value store where all Kubernetes cluster data resides. All the data representing the state of the Kubernetes cluster resides in Etcd. The following are some of the top metrics to watch in Etcd.

etcd_server_has_leader: 1 if a leader exists, 0 if not.

etcd_server_leader_changes_seen_total: Number of leader changes.

etcd_server_proposals_applied_total: Number of proposals that have been applied.

etcd_server_proposals_committed_total: Number of proposals that have been committed.

etcd_server_proposals_pending: Number of proposals that are pending.

etcd_server_proposals_failed_total: Number of proposals that have failed.

etcd_debugging_mvcc_db_total_size_in_bytes: Actual size of database usage after a history compaction.

etcd_disk_backend_commit_duration_seconds: Latency distributions of commit called by the backend.

etcd_disk_wal_fsync_duration_seconds: Latency distributions of fsync called by the WAL.

etcd_network_client_grpc_received_bytes_total: Total number of bytes received by gRPC clients.

etcd_network_client_grpc_sent_bytes_total: Total number of bytes sent by gRPC clients.

grpc_server_started_total: Total number of gRPCs started on the server.

grpc_server_handled_total: Total number of gRPCs handled on the server.
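The leader metric above is a natural candidate for an availability alert; the rule below is an illustrative sketch rather than a tuned recommendation:

    groups:
      - name: etcd
        rules:
          - alert: EtcdNoLeader
            # etcd_server_has_leader reports 0 when this member sees no leader
            expr: etcd_server_has_leader == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "etcd member {{ $labels.instance }} has no leader"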


Scheduler

Scheduler watches the Kubernetes API for newly created pods and determines which node should run those pods. It makes this decision based on the data it has available, including the collective resource availability as well as the resource requirements of the pod. Monitoring scheduling latency ensures you have visibility into any delays the Scheduler is facing.

scheduler_e2e_scheduling_latency_microseconds: The end-to-end scheduling latency, which is the sum of the scheduling algorithm latency and the binding latency.

Controller manager

Controller manager is a daemon which embeds all the various control loops that run to ensure the desired state of your cluster is met. It watches the API server and takes action depending on the current state versus the desired state. It's important to keep an eye on the requests it is making to your Cloud provider to ensure the controller manager can successfully orchestrate. Currently, these metrics are available for AWS, GCE, and OpenStack.

cloudprovider_*_api_request_duration_seconds: The latency of the cloud provider API call.

cloudprovider_*_api_request_errors: Cloud provider API request errors.


Kube-state-metrics

Kube-state-metrics is a Kubernetes add-on that provides insights into the state of Kubernetes. It watches the Kubernetes API and generates various metrics, so you know what is currently running. Metrics are generated for just about every Kubernetes resource, including pods, deployments, daemonsets, and nodes. Numerous metrics are available, capturing various information; below are some of the key ones.

kube_pod_status_phase: The current phase of the pod.

kube_pod_container_resource_limits_cpu_cores: Limit on CPU cores that can be used by the container.

kube_pod_container_resource_limits_memory_bytes: Limit on the amount of memory that can be used by the container.

kube_pod_container_resource_requests_cpu_cores: The number of requested cores by a container.

kube_pod_container_resource_requests_memory_bytes: The number of requested memory bytes by a container.

kube_pod_container_status_ready: Will be 1 if the container is ready, and 0 if it is in a not-ready state.

kube_pod_container_status_restarts_total: Total number of restarts of the container.

kube_pod_container_status_terminated_reason: The reason that the container is in a terminated state.

kube_pod_container_status_waiting: The reason that the container is in a waiting state.

kube_daemonset_status_desired_number_scheduled: The number of nodes that should be running the pod.

kube_daemonset_status_number_unavailable: The number of nodes that should be running the pod, but are not able to.

kube_deployment_spec_replicas: The number of desired pod replicas for the Deployment.

kube_deployment_status_replicas_unavailable: The number of unavailable replicas per Deployment.



kube_node_spec_unschedulable: Whether a node can schedule new pods or not.

kube_node_status_capacity_cpu_cores: The total CPU resources available on the node.

kube_node_status_capacity_memory_bytes: The total memory resources available on the node.

kube_node_status_capacity_pods: The number of pods the node can schedule.

kube_node_status_condition: The current status of the node.
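Because these state metrics carry namespace and pod labels, they combine naturally into workload-health alerts; for example, a sketch of a restart-loop alert with an arbitrary threshold:

    groups:
      - name: kube-state-metrics
        rules:
          - alert: ContainerRestartingFrequently
            # More than 5 container restarts observed over the last hour
            expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"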

Node components

The Nodes of a Kubernetes cluster are made up of multiple parts, and as such you have numerous pieces to monitor.

Kubelet

Keeping a close eye on Kubelet ensures that the Control Plane can always communicate with the node that Kubelet is running on. In addition to the common GoLang runtime metrics, Kubelet exposes some internals about its actions that are good to track.

kubelet_running_container_count: The number of containers that are currently running.

kubelet_runtime_operations: The cumulative number of runtime operations available by the different operation types.

kubelet_runtime_operations_latency_microseconds: The latency of each operation by type in microseconds.

Node metrics

Visibility into the standard host metrics of a node ensures you can monitor the health of each node in your cluster, avoiding any downtime as a result of an issue with a particular node. You need visibility into all aspects of the node, including CPU and memory consumption, system load, filesystem activity, and network activity.

Container metrics

Monitoring all of the Kubernetes metrics is just one piece of the puzzle. It is imperative that you also have visibility into the containerized applications that Kubernetes is orchestrating. At a minimum, you need access to the resource consumption of those containers. Kubelet accesses the container metrics from cAdvisor, a tool that can analyze resource usage of containers and make them available. These include the standard resource metrics like CPU, memory, file system, and network usage.

See business
differently

Toll-Free: 1.855.LOG.SUMO | Int’l: 1.650.810.8700


305 Main Street, Redwood City, CA 94603

www.sumologic.com

© Copyright 2019 Sumo Logic, Inc. All rights reserved. Sumo Logic, Elastic Log Processing, LogReduce, Push Analytics and Big Data
for Real-Time IT are trademarks of Sumo Logic, Inc. All other company and product names mentioned herein may be trademarks of
their respective owners. Updated 09/19
