“Alright, could you please tell us a few benefits and drawbacks of working with microservices?” – starts the usual tech interview question, and the traditional answer often follows: “They’re smaller; therefore, they’re easier to develop and to troubleshoot. There’s better scaling control on parts of the system. They can be written in different programming languages. They can be deployed individually.”
In this series of articles about microservices, we’ll go through the widespread notions of what it feels like to work with microservices. We’ll attempt to explain under what circumstances some – arguably a little blunt – misconceptions hold true. We’ll do this to speed you along your personal microservices hype curve, so you can better understand what dynamics and challenges to prepare for.
Make sure to subscribe to our channels to be informed about subsequent articles in the series.
Our previous blog post highlighted three essential guidelines to ensure that teams are able to work relatively independently on separate microservices. In this blog post, we’re calling out a few important things to consider when it comes to troubleshooting and automated actions.
Belief: “Microservices are smaller; therefore, they’re easier to troubleshoot.”
Reality: “Not really. The style and methods of troubleshooting are quite different from those of a monolith.”
You typically run a microservice in a Linux container, most likely a Docker container. A container internally runs its own operating system environment, even though it uses the host’s kernel. Usually, a container orchestration platform such as Kubernetes makes it easier to manage resources (containers, volumes, virtual networks, environment settings, CPU/memory limits, etc.). Such a platform runs on a set of machines, each with its own operating system. These machines are either virtual or physical; virtual machines bring their own virtualization infrastructure, and cloud providers have yet another layer of internal virtualization infrastructure underneath the resources they directly offer for use. Each level has its own networking infrastructure and the usual services, like DNS, DHCP, etc. On top of that, individual services may present or require certificates.
Now, you might think: yeah, a bunch of things, but these need to be there for a monolith as well. Yes, but there’s a significant difference, and to see it, let’s imagine you make a call from one functional module to another: e.g. from the “billing” module you want to retrieve user details, which the “users” module provides.
These two modules would be two classes in a monolith, and the invocation would be a plain method call. If you encountered a problem, you’d see an exception stack trace, and you’d fairly easily localize the root cause of the problem. The only thing you need to rely on is the reliability of plain, in-process method invocation.
Using microservices, the two modules would be separate services, and the invocation would mean executing an HTTP request, for example. (Let’s stick with that for the sake of simplicity.) Now, let’s think about executing an HTTP request; what that entails is:
- resolving the IP address from the target service’s host name
- establishing a TCP connection to the machine with that IP address and the specified port
- if the service presents a certificate, verifying it
- if the service requests a certificate, providing it
- once the connection is established, transferring the request bytes
- optionally, authentication
- waiting until the response comes back
- closing the connection
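To make these steps a bit more tangible, here’s a minimal sketch of what such a call could look like in Java, using the standard `java.net.http.HttpClient`. The “users” service host name, port, path and the token handling are made up for illustration; the point is that every step listed above hides behind these few lines:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class UserDetailsClient {

    // The connect timeout bounds name resolution plus the TCP (and TLS) handshake;
    // the request timeout below bounds how long we wait for the response.
    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();

    public String fetchUserDetails(String userId) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://users:8443/users/" + userId)) // made-up host, port and path
                .header("Authorization", "Bearer " + currentJwt())     // optional authentication
                .timeout(Duration.ofSeconds(3))
                .GET()
                .build();

        // send() transfers the request bytes and blocks until the response arrives or a timeout hits;
        // certificate verification happens implicitly while the HTTPS connection is established.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    private String currentJwt() {
        return "dummy-token"; // token retrieval is out of scope for this sketch
    }
}
```

Note how the two timeouts cover two different failure classes from the list: one for reaching the remote service at all, one for waiting on its answer.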
We’re skipping serialization aspects and generally won’t talk about the overhead, the network round-trip times, etc. Going with microservices naturally means compromising performance in favor of other benefits.
But we definitely want to talk about what can go wrong, so let’s see:
- resolving the IP address from the target service’s hostname
  - the DNS infrastructure may become temporarily unavailable,
  - the DNS records may become stale,
  - the resolver configuration of the container can get corrupted,
- establishing a TCP connection to the machine with that IP address and the specified port
  - the remote port may become unreachable (due to network changes or network saturation),
- if the service presents a certificate, verifying it
  - the target service may present a certificate that has expired,
  - the target service may present a certificate whose intermediate or root CA certificate has expired, or the invoking client does not whitelist it,
  - the target service may present a certificate that is invalid for the domain through which the service is accessed,
  - TLS versions or cipher suites may become unsupported (e.g. due to endpoint security configuration updates),
- if the service requests a certificate, providing it
  - the same expiration/validity issues as above,
  - the target service may require different subject field values (common name, organization, etc.) as a result of a config change or a certificate provisioning hiccup,
- once the connection is established, transferring the request bytes
  - the network connection may break,
  - the remote service may get killed and the connection may not be closed correctly,
- optionally, authentication
  - if this is done using a JWT (which is a best practice in many situations), the token may expire by the time the request gets to the target service,
- waiting until the response comes back
  - the same network issues as above,
  - the response time may be so high that the client times out,
- closing the connection
  - not closing connections properly on the client side may exhaust the connection pools on either side.
While many of the issues listed above would only arise due to configuration changes, it’s important to notice their compounded effect, and how much more remote module calls rely on the underlying infrastructure being stable compared to in-process calls within a monolith.
Now, these things can go wrong even if you just have a small number of microservices and a relatively low load. But imagine you have hundreds of services and thousands of users interacting with your applications within a second. That’s where it gets tricky.
To get back to the original answer, the style and methods of troubleshooting are different, especially in that you have to start with basic infrastructure-level checks to rule these causes out efficiently. For example, if you see errors in the client service’s logs saying it’s unable to invoke an endpoint of a remote service, it doesn’t necessarily mean that the remote service has issues; you first have to check basic things like name resolution, network connectivity, certificates, etc. Therefore, it’s essential to properly log the errors encountered and to know what preliminary checks to perform for different errors.
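To illustrate that last point, here’s a rough sketch that builds on the earlier client example and uses only standard JDK exception types (in practice the exact exception that surfaces can vary and may be wrapped, so you might also need to inspect the cause chain). Distinguishing the basic failure classes in the logs already tells whoever is on call which preliminary check to start with:

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.UnknownHostException;
import java.net.http.HttpTimeoutException;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.net.ssl.SSLHandshakeException;

public class UserDetailsClientWithDiagnostics {

    private static final Logger LOG = Logger.getLogger(UserDetailsClientWithDiagnostics.class.getName());

    private final UserDetailsClient delegate = new UserDetailsClient();

    public String fetchUserDetails(String userId) throws Exception {
        try {
            return delegate.fetchUserDetails(userId);
        } catch (UnknownHostException e) {
            // DNS problem: check the resolver config and the DNS records before blaming the remote service.
            LOG.log(Level.SEVERE, "Name resolution failed for the users service", e);
            throw e;
        } catch (SSLHandshakeException e) {
            // TLS problem: check certificate expiry, the trust chain and the supported cipher suites.
            LOG.log(Level.SEVERE, "TLS handshake with the users service failed", e);
            throw e;
        } catch (ConnectException e) {
            // Network problem: check routes, firewalls and whether the port is reachable at all.
            LOG.log(Level.SEVERE, "Could not open a TCP connection to the users service", e);
            throw e;
        } catch (HttpTimeoutException e) {
            // The connection worked but no timely answer came: the remote service may be overloaded.
            LOG.log(Level.SEVERE, "The users service did not respond in time", e);
            throw e;
        } catch (IOException e) {
            // Anything else, e.g. a connection reset while transferring the request or response bytes.
            LOG.log(Level.SEVERE, "I/O error while calling the users service", e);
            throw e;
        }
    }
}
```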
If you want to prepare your organization to excel at troubleshooting in such an environment, it’s essential to have a strong DevOps culture: one where application developers and operations people don’t communicate only over tickets, but can chat or jump on a call directly anytime.
The post-mortem about Slack’s outage in January 2021 is an excellent read; it describes how things went wrong after the network got saturated at a transit gateway, and what the causes and the impact were.
Let’s move on to another topic where misunderstandings are pretty common and where wrong configurations can weaken the stability of a system.
Belief: “Liveness and readiness probes will help to stabilize your services.”
Reality: “True, if you can define very well what readiness and liveness mean for each service.”
And that’s not always easy.
But first, a short definition of what these are. A service is “live” if it’s theoretically capable of serving requests, i.e. it’s not completely broken. A service is “ready” when it’s practically capable of serving requests, i.e. the incoming requests will be served as expected.
To give a concrete example, a Spring Boot application that uses Redis as a cache is in a “live” state if the connection pool to the cache is healthy. However, it isn’t “ready” if, for example, the number of jobs in an in-process queue exceeds a threshold and the response time is therefore known to be too high.
Spring Boot simplifies managing and exposing these states to Kubernetes; here’s a nice how-to on that, and of course the Kubernetes documentation is a must-read if you want to do this.
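As a rough sketch of the second half of the example above: with Spring Boot 2.3+ availability support, application code can flip the readiness state without ever touching liveness. The `JobQueue` type, the threshold and the schedule below are made up for illustration (and `@Scheduled` assumes scheduling is enabled in the application):

```java
import org.springframework.boot.availability.AvailabilityChangeEvent;
import org.springframework.boot.availability.ReadinessState;
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class QueueBackpressureReadiness {

    private static final int MAX_QUEUED_JOBS = 1_000; // made-up threshold for illustration

    private final ApplicationEventPublisher publisher;
    private final JobQueue jobQueue;

    public QueueBackpressureReadiness(ApplicationEventPublisher publisher, JobQueue jobQueue) {
        this.publisher = publisher;
        this.jobQueue = jobQueue;
    }

    // Periodically reevaluate readiness: refuse traffic while the backlog is too long,
    // accept it again once the queue has drained. Liveness is not affected at all.
    @Scheduled(fixedDelay = 5_000)
    public void updateReadiness() {
        ReadinessState state = jobQueue.size() > MAX_QUEUED_JOBS
                ? ReadinessState.REFUSING_TRAFFIC
                : ReadinessState.ACCEPTING_TRAFFIC;
        AvailabilityChangeEvent.publish(publisher, this, state);
    }

    /** Hypothetical in-process job queue; only its current size matters here. */
    public interface JobQueue {
        int size();
    }
}
```

On Kubernetes, Spring Boot exposes these states through the health groups behind `/actuator/health/liveness` and `/actuator/health/readiness`, which is typically what the probes would point at.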
If liveness probing fails, the container is usually killed and restarted. If readiness probing fails, the container is usually taken out of service (no traffic is routed to it), but it isn’t restarted; when it’s ready again, it’s put back into service. The number of retries, failure thresholds, timeouts, etc. can usually be configured in the orchestration platform to tolerate hiccups. For each service, the desired number of replicas effectively means the desired number of ready replicas: only ready replicas receive traffic, and during a rollout, a new container only counts as available once it becomes ready. (In Kubernetes, we don’t control replica counts for containers directly but for pods, i.e. sets of tightly coupled containers; for understanding what the probes mean, though, that’s a marginal difference.)
There’s a great article from Colin Breck about how to avoid making things even worse using liveness or readiness probes. But there’s another common mistake: setting a binary “application health” value as a control for liveness or readiness probes. Let’s suppose you add periodic health checks to your application, so you see very quickly if something’s not right. For example, you add code to your service that periodically checks the accessibility of its database and all remote services it uses, etc. If any of these health checks fail, you conclude that the application is not healthy and expose this state. This is very useful for monitoring purposes, but there’s a catch when you interpret service health as a binary value and start to use it in liveness or readiness probes: you’ll likely cause an outage in the service yourself, because even when some dependencies are failing, your service may still be able to serve certain types of requests. So a piece of good advice is to think in 3 states:
- green: the service is healthy,
- yellow: the service requires attention and/or it’s degrading, but is able to recover,
- red: the service is totally failing, there’s no chance of recovery,
… and classify the implications of failing health checks appropriately. A yellow state should trigger notifications or alerts, but shouldn’t make the orchestrator change the ready/live state of the service.
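As an illustration of the three-state idea in the same Spring Boot setting (the `DependencyChecker` interface and the custom `DEGRADED` status below are made up for this sketch), a health indicator can report the yellow state for monitoring while only a red result maps to DOWN:

```java
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.boot.actuate.health.Status;
import org.springframework.stereotype.Component;

@Component
public class DependencyHealthIndicator implements HealthIndicator {

    // Custom "yellow" status: worth alerting on, but not a reason to take the service out of rotation.
    public static final Status DEGRADED = new Status("DEGRADED");

    private final DependencyChecker checker;

    public DependencyHealthIndicator(DependencyChecker checker) {
        this.checker = checker;
    }

    @Override
    public Health health() {
        if (checker.allDependenciesBroken()) {
            return Health.down().withDetail("reason", "no dependency reachable").build();      // red
        }
        if (checker.someDependenciesBroken()) {
            return Health.status(DEGRADED).withDetail("reason", "partial degradation").build(); // yellow
        }
        return Health.up().build();                                                              // green
    }

    /** Hypothetical interface over whatever periodic checks the service already runs. */
    public interface DependencyChecker {
        boolean allDependenciesBroken();
        boolean someDependenciesBroken();
    }
}
```

Since custom indicators aren’t part of Spring Boot’s readiness health group unless you include them explicitly, the degraded (yellow) state can feed dashboards and alerts without ever flipping the readiness probe.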
So to sum it up:
- there has to be a well-defined set of conditions that can cause a service to be marked as not live or not ready,
- for each probe you set up, it’s a good idea to think through whether the same probing logic would also fail for newly started containers: if it would, then it probably shouldn’t control liveness/readiness.
To be continued… Follow us on Twitter or subscribe to our newsletter so you don’t miss further posts in this series and our other blog posts.