Why We Need a Service Mesh in Cloud-Native Development

Nikhil Rana
6 min read · Aug 11, 2021

First of all, what really is "cloud-native"? In essence, it means developing applications as microservices, packaging them into containers, and deploying them to an elastic hyperscaler platform (most popularly Kubernetes) through agile DevOps processes.

Service Mesh — Microservice “Web” Communication & Monitoring Simplified

Preface

We’ve come across a lot of technical jargons when speaking of cloud native development. We started with mainframes & monoliths, moved up to microservices, then containerising them, and eventually running them in a Kubernetes cluster.

I’m now here to tell you that containers & Kubernetes will not solve all of your problems, and we also need a Service Mesh running in the cluster, especially for Enterprise applications. I’ve been working with many Enterprises during my tenure and have typically seen hundreds to thousands of microservices running in their portfolio

Enterprise Problems

Let’s take a look at some of the typical problems faced by an Enterprise -

  • Dynamic discovery of microservices/workloads/applications
  • Load balancing and health checks for each
  • Resiliency from failures
  • Traffic management (canary deployments, A/B testing, service-to-service routing rules, etc.)
  • Distributed tracing and monitoring
  • Enforcement of security policies (encrypting data in transit, configuring auth policies, etc.)

Old problems. What worked then?

Some of these concerns are addressed by the Kubernetes platform itself, which explains its popularity, but there's definitely more to the story. I'll take telemetry & observability as an example, but the same concepts apply to the other problems listed above. We'd typically set up a Prometheus server to capture the metrics & telemetry data. On top of that, we could have a Grafana dashboard to visualize the microservices' health & performance data.
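
As a quick sketch of what such a setup involves, below is a minimal Prometheus scrape configuration. The job name, service address, and port are illustrative assumptions; each team's service would need to expose a compatible /metrics endpoint for this to work.

```yaml
# prometheus.yml (a minimal sketch): names, ports, and intervals are illustrative
scrape_configs:
  - job_name: 'orders-service'            # one entry per microservice to monitor
    metrics_path: /metrics                # each service must expose this endpoint
    scrape_interval: 15s
    static_configs:
      - targets: ['orders-service:8080']  # hypothetical in-cluster address
```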

Let’s say our Enterprise has only 50 microservices and they must be monitored centrally. Now to have all the 50 application teams send metrics in a common format, we’ll generally need a service harness (a.k.a utility a.k.a library a.k.a template) to abstract such cross-cutting concerns & own this responsibility for all the applications.

Problems with the Old Approach

As you might have guessed, the harness comes with its own set of problems, which is where most Enterprises struggle -

  • Release Cycle — We have 51 different autonomous teams, each with their own release cadence. It gets hard to incorporate the latest version of the harness into each application/microservice
  • Upstream Dependencies — For legacy or support-only applications, for example, releases are infrequent and happen once a month at best. The harness version in use can get so old that it's no longer compatible with what Prometheus expects
  • Polyglot Environment — The applications will likely follow different tech stacks (Java, .Net, Python, Node) based on each team's comfort, which in turn requires harness support in each language. The implementations might not remain consistent across all of them

So, What's The "Fix" Then?

Well, we do need the service harness to abstract cross-cutting concerns away from the application, but we don't need the problems that come with it. A "managed" service harness would be ideal, and that is exactly what a Service Mesh provides through the Envoy Proxy (among a load of other features).

Envoy Proxy

To understand the Envoy proxy, we must first understand a simple forward proxy. A forward proxy is a server that acts as a link between a local network and a larger network (like the internet), mediating all inbound and outbound requests for the local network.

The Envoy proxy (a.k.a. sidecar, a.k.a. sidecar proxy, a.k.a. envoy) is a similar concept, with the difference that it runs alongside each application, handling all incoming & outgoing requests for the "attached" application/service (hence the name sidecar!).

It’s completely capable of acting as a service harness since all metrics are collected by the sidecar (any is highly configurable) while mediating requests for the attached service. All cross-cutting concerns such as security policies, routing rules, traffic management, auth, etc could be configured here

Envoy proxy runs alongside each Service

Service Mesh Architecture

Now that we know the secret sauce that makes a Service Mesh tick, let's back up and look at the overall architecture.

Service Mesh Architecture

The Control Plane holds the management services of a Service Mesh; these usually run as regular workloads on the Kubernetes data plane, in a separate namespace. For example, the most widely used (so far) Service Mesh, Istio, runs its control plane services in the "istio-system" namespace. Its tasks include -

  • Fetching the currently running services in the Kubernetes cluster through the API server (since the Service Mesh has to inject a sidecar proxy alongside each one of them)
  • A Configuration Manager to configure each sidecar proxy with the rules provided through Service Mesh manifest files (like which service can interact with which, who should see a client or server error while resiliency is being tested, or which version of a service should only get 15% of the traffic since it's in Beta — Traffic Splitting, as the first sketch after this list shows)
  • The Certificate Manager, since security is considered essential and nothing is more tried & tested than the TLS/SSL layer. The Mesh gives each service an identity and generates certificates for it to enable mTLS (mutual TLS), which essentially means encrypting data in transit for service-to-service communication (just like client-server communication over HTTPS using certificates from a CA); the second sketch after this list shows how this is enforced
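
As a concrete illustration of the traffic-splitting task above, here's a minimal sketch using Istio's DestinationRule and VirtualService resources. The "reviews" service and its version labels are borrowed from Istio's BookInfo sample purely as an assumption; adjust hosts, subsets, and weights to your own services.

```yaml
# Sketch: route 85% of traffic to the stable version and 15% to the Beta.
# Service name "reviews" and version labels are illustrative assumptions.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:
    - name: stable          # current production version
      labels:
        version: v1
    - name: beta            # new version under test
      labels:
        version: v2
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
    - route:
        - destination:
            host: reviews
            subset: stable
          weight: 85        # 85% of requests stay on the stable version
        - destination:
            host: reviews
            subset: beta
          weight: 15        # 15% go to the Beta version
```

And for the certificate/mTLS task, a single resource is enough in Istio to require mutually authenticated, encrypted traffic; applying it in the control plane namespace ("istio-system" here) makes it mesh-wide. Again, a sketch rather than a definitive setup:

```yaml
# Sketch: enforce strict mTLS for all service-to-service traffic in the mesh
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # mesh-wide when applied in the root namespace
spec:
  mtls:
    mode: STRICT            # only mutually-authenticated traffic is accepted
```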

The Data Plane is like the Kubernetes data plane where your workloads run, except that it excludes the reserved namespace running the control plane services.

The best part of the Service Mesh is that a service or application running in the cluster doesn't need to know the Service Mesh is there. The sidecar can be auto-injected (or configured to be, for specific namespaces and clusters) and simply takes over the application's communications, as the snippet below shows.
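
With Istio, for instance, a single label on a namespace opts all of its new Pods into automatic sidecar injection. A minimal sketch, with a made-up namespace name:

```yaml
# Sketch: every Pod created in this namespace gets an Envoy sidecar injected
apiVersion: v1
kind: Namespace
metadata:
  name: my-app              # hypothetical application namespace
  labels:
    istio-injection: enabled
```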

Observability is another important aspect. To help with monitoring, Service Meshes ship with easy integrations for widely adopted monitoring tools, including pre-built dashboards displaying popular metrics. Istio, for instance, provides pre-configured integrations for Kiali, Grafana, and Prometheus, among others.

Service Mesh Considerations

The Service Mesh is loaded with features but does come with its own negatives, like the additional container per service consuming extra compute and memory. Also, if your application makes external calls during startup, like fetching data from a database, it might fail intermittently depending on the availability of the sidecar proxy. That's because if the service tries to access the database before the sidecar proxy is up, the call will fail. But if you retry it manually while troubleshooting (yep, happened to me), it will work, since the Envoy proxy is up by then. Such race conditions need to be handled, and there are ways to do so (such as Scuttle, or the Istio option sketched below).
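
One Istio-native mitigation (available in recent Istio releases; verify against yours) is to ask the injector to hold the application container until the proxy is ready. A sketch, with a hypothetical Deployment:

```yaml
# Sketch: delay the app container until the Envoy sidecar reports ready
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-service                  # hypothetical service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: orders-service
  template:
    metadata:
      labels:
        app: orders-service
      annotations:
        proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
    spec:
      containers:
        - name: app
          image: example/orders-service:1.0   # illustrative image
```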

Whether or not to use a Service Mesh needs to be a conscious decision, and in most cases (especially with Enterprise applications) there are more pros than cons for the application. You might want to avoid it for short POCs or for clusters running just a couple of microservices.

To help with the decision, consider the image below as a guideline rather than an absolute rule, since it really does depend. For example, today you may have just a couple of microservices, but with the Organization's cloud-native adoption the expectation could be to scale up to hundreds of microservices over the years. In such cases it's better to adopt the Service Mesh sooner rather than later, so the rich feature set can be leveraged early.

When could you use the Service Mesh

References for more

Istio Service Mesh has really good documentation explaining the concepts around it with a pre-built sample application — The BookInfo App.

You only need a Kubernetes cluster (the one provided by Docker Desktop works too!) and can then work through the exercises to understand the concepts involved.
