MongoDB
Simple! No SOA!
Lyft ~2 years ago
[Diagram: Internet → AWS external ELB → PHP / Apache monolith; AWS internal ELBs (+haproxy/nsq) → Python services; data stores: MongoDB and DynamoDB.]
Agenda
Can we do better?
Agenda
This sounds great! But it turns out it's really, really hard.
What is Envoy?
Out of process architecture: Let's do a lot of really hard stuff in one place and
allow application developers to focus on business logic.
Modern C++11 code base: Fast and productive.
L3/L4 filter architecture: A TCP proxy at its core. Can be used for things other
than HTTP (e.g., MongoDB, redis, stunnel replacement, TCP rate limiter, etc.).
HTTP L7 filter architecture: Make it easy to plug in different functionality.
HTTP/2 first! (Including gRPC and a nifty gRPC HTTP/1.1 bridge).
Service discovery and active health checking.
Advanced load balancing: Retry, timeouts, circuit breaking, rate limiting,
shadowing, etc.
Best in class observability: stats, logging, and tracing.
Edge proxy: routing and TLS.
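As a concrete taste of how the routing, timeout, and retry features above fit together, here is a minimal illustrative sketch using Envoy's current v3 YAML schema (the schema, cluster name `local_service`, and addresses are my assumptions, not from the talk):

```yaml
static_resources:
  listeners:
  - address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            virtual_hosts:
            - name: backend
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route:
                  cluster: local_service
                  timeout: 2s            # per-request timeout
                  retry_policy:
                    retry_on: 5xx        # retry once on upstream 5xx
                    num_retries: 1
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: local_service
    type: STRICT_DNS
    connect_timeout: 0.25s
    load_assignment:
      cluster_name: local_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: local-service, port_value: 9000 }
```

The same filter-chain shape is what makes the L3/L4 and HTTP L7 filter architectures pluggable: swap or add filters without touching the application.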
Envoy service to service topology
[Diagram: each service instance talks REST / gRPC to its local Envoy; Envoys connect to one another over HTTP/2 and also reach external services and the discovery service.]
Envoy edge proxy topology
[Diagram: external clients reach a front Envoy edge proxy over the Internet; the edge proxy routes into the private infra of Region #1 and Region #2, each fronted by its own front Envoy.]
Lyft today
[Diagram: Python services (+Envoy) with a discovery service; stats / tracing emitted direct from Envoy.]
Service mesh! Awesome!
Eventually consistent service discovery
Fully consistent service discovery systems are very popular (ZK, etcd, consul,
etc.).
In practice they are hard to run at scale.
Service discovery is actually an eventually consistent problem. Let's
recognize that and design for it.
Envoy is designed from the get-go to treat service discovery as lossy.
Active health checking used in combination with service discovery to produce a
routable overlay.
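One way to picture that routable overlay: a host stays eligible for routing as long as it passes active health checks, and is removed only when discovery and health checking both agree it is gone. A toy Python sketch of that decision matrix (function names are mine, not Envoy's):

```python
def routable(in_discovery: bool, health_check_ok: bool) -> bool:
    """Decide whether to route to a host, treating discovery as lossy.

    Active health checking is the primary signal; discovery data is
    eventually consistent, so absence from discovery alone does not
    stop routing to a host that still answers health checks.
    """
    return health_check_ok

def evict(in_discovery: bool, health_check_ok: bool) -> bool:
    """Remove a host only when discovery AND health checks agree it is gone."""
    return (not in_discovery) and (not health_check_ok)

# Decision matrix:
#   discovered,   healthy   -> route
#   discovered,   unhealthy -> do not route (keep the host around)
#   undiscovered, healthy   -> still route: discovery may just be stale
#   undiscovered, unhealthy -> do not route; evict
assert routable(True, True) and routable(False, True)
assert not routable(True, False) and not routable(False, False)
assert evict(False, False) and not evict(False, True)
```

The payoff is that a flaky or lagging discovery service degrades gracefully instead of causing an outage.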
Throughput is important ultimately for cost and operational scaling, but for many
companies developer time is worth more than infra costs.
BUT: Latency and predictability are what matter. And in particular tail latency
(P99+).
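To make "tail latency (P99+)" concrete: a median hides exactly the slow requests that users feel. A small Python sketch with hypothetical latency numbers, using the nearest-rank percentile:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample >= p% of all samples."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# Ten hypothetical request latencies in ms; one hit a GC pause.
latencies = [11, 12, 12, 13, 13, 14, 14, 15, 16, 350]

p50 = percentile(latencies, 50)   # 13 ms: looks great
p99 = percentile(latencies, 99)   # 350 ms: what the unlucky request saw
```

One outlier leaves the median untouched while dominating the tail, which is why reasoning about P99+ (and keeping the proxy itself off that list of suspects) matters so much.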
We already deal with incredibly confusing deployments (virtual IaaS, multiple
languages and runtimes, languages that use GC, etc.). All of these niceties
improve productivity and reduce upfront dev costs, but they make debugging
really difficult.
What is leading to sporadic errors? The IaaS? The app? GC?
Ability to reason about overall performance and reliability is critical.
Performance matters for a service proxy
A service proxy provides invaluable benefits, but if the proxy itself has tail
latencies that are hard to reason about, most of the debugging benefits go out
the window and you are back to square one.
Anyone who is trying to sell you a service infra that does not consider the above
points is selling you a dream...
Agenda
Redis.
More outlier detection and ejection (success rate and latency).
LB subset support.
More rate limiting options / open source rate limit service / IP tagging.
Configuration schema and better error output.
Work with Google/community to add k8s support.
Authentication and authorization.
Envoy ecosystem?