
Microservices: A Perilous Journey

Aug 10


A microservice architecture is well suited to handling the challenges of scale, whether stemming from a high workload, a complex codebase, or a large organization. However, microservices are notoriously difficult to get right, and many an organization has ended up in a serious jam after introducing them. This article explores some of the pitfalls that give microservices such a bad rap.


On a perilous journey to microservices

Network Uncertainty

Microservices communicate over a network rather than in process. A network partition can make a downstream microservice unavailable at any time and without notice. A physical failure is obvious, but errors in the routing tables, or a failure of service discovery (e.g. DNS) or of the load balancer, can just as effectively partition the network.


The risk profile of the upstream microservice is not the same as that of the downstream microservice. They may be running on different machines, restarted at different times, and fail for different reasons. Just because the upstream is running doesn't mean that the downstream is running and able to process its requests. This stands in contrast to in-process calls inside a monolith, where the caller and callee share the same fate.


To make things even more complicated, the upstream microservice is never certain of the health of the downstream microservice. To the upstream microservice, the downstream microservice is like Schrödinger's cat: it may be alive or it may be dead. Health checks are a common mitigation, but health checks are prone to network errors themselves, may not provide an instant indication, and are only eventually consistent. Retries are used to work around this uncertainty, but retries carry their own risks.
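A common way to retry responsibly is exponential backoff with jitter, so that many clients don't hammer a recovering downstream in lockstep. Here's a minimal sketch in Python; the function names and the simulated downstream are illustrative, not from any particular library:

```python
import random
import time

def call_with_retries(call, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Retry a flaky call with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            # Back off exponentially, with jitter to avoid synchronized retries.
            sleep(random.uniform(0, base_delay * 2 ** attempt))

# Simulate a downstream that fails twice, then recovers.
failures = iter([ConnectionError, ConnectionError, None])
def flaky_downstream():
    exc = next(failures)
    if exc:
        raise exc()
    return "ok"

result = call_with_retries(flaky_downstream, sleep=lambda s: None)
```

Note the injectable `sleep` — it keeps the sketch testable, and in real code you'd also want to retry only idempotent operations.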


Network Latency

A call over the network will always take longer than an in-process call. Inside a cloud region, a round trip commonly takes around 0.5ms. User requests that must touch many microservices before they can be satisfied may suffer from unacceptable latency, especially if the microservice chatter happens sequentially rather than concurrently. It is not uncommon to see seemingly simple requests take 1s or more.
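The sequential-vs-concurrent difference is easy to demonstrate. In this sketch, each downstream call is simulated with an `asyncio.sleep` standing in for a network round trip; the service names are made up for illustration:

```python
import asyncio
import time

async def call_service(name, latency=0.05):
    await asyncio.sleep(latency)  # stand-in for a 50ms network round trip
    return f"{name}: ok"

async def sequential(names):
    # One round trip after another: total latency is the sum.
    return [await call_service(n) for n in names]

async def concurrent(names):
    # Fan out all calls at once: total latency is roughly one round trip.
    return await asyncio.gather(*(call_service(n) for n in names))

names = ["auth", "catalog", "pricing", "inventory"]

start = time.perf_counter()
asyncio.run(sequential(names))
seq_time = time.perf_counter() - start

start = time.perf_counter()
results = asyncio.run(concurrent(names))
con_time = time.perf_counter() - start
```

With four downstream calls, the sequential version pays roughly four round trips while the concurrent one pays about one — which is why fan-out matters so much in chatty architectures.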


Enabling Systems

Running microservices is not as simple as launching an executable. They require a jumble of systems to enable them. At the very least there's a need for a discovery service and a load balancer. Often there is also Kubernetes. Sometimes there's a service mesh or an eBPF router as well. All of these enabling systems require expertise and careful configuration, which can be fragile. In addition, a production environment that differs significantly from the local development environment makes debugging production issues more challenging.


Failure Points

Because of its complexity and large number of enabling systems, a microservice architecture can fail for any one of many reasons. When failures happen, they are difficult to pinpoint and often require deep knowledge of the subsystems and their configuration.


Many Moving Parts

Like a fine clock, a microservice architecture has many moving parts that need to work in unison for it to function smoothly. Retries might lead to retry storms without client backoff logic. Multiplexed connections necessitate client-side load balancing or eBPF routing. Service discovery requires health checks. Chaos testing is important for validating the resiliency of the system. Observability tools are critical for troubleshooting in production.
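To make the retry-storm risk concrete, here's a minimal circuit-breaker sketch in Python (the class, thresholds, and errors are illustrative, not from any particular library): after a few consecutive failures the client stops hammering the downstream and fails fast until a cooldown elapses.

```python
import time

class CircuitBreaker:
    """Fail fast after repeated downstream failures, until a cooldown passes."""
    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one probe through
        try:
            result = fn()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # a success resets the count
        return result

# With a fake clock: two failures trip the breaker, the third call fails fast.
now = [0.0]
cb = CircuitBreaker(threshold=2, cooldown=10.0, clock=lambda: now[0])
def down():
    raise ConnectionError("downstream unreachable")

for _ in range(2):
    try:
        cb.call(down)
    except ConnectionError:
        pass
try:
    cb.call(down)
    opened = False
except RuntimeError:
    opened = True
```

The key property is that while the circuit is open, the failing downstream receives no traffic at all — which is exactly what breaks a retry storm.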


Most teams do not have the know-how to implement a microservice architecture to a T. Even if they do possess the skills, they are unlikely to get buy-in from leadership to spend their time on activities that show no immediate value to customers. And so they compromise and take on an increased risk of system failure.


One cannot simply do microservices

Hardware Footprint

Microservices run as separate executables, resulting in a larger overall memory footprint. This is particularly true for microservices that run on a heavy virtual machine or framework with many dependencies.


The large number of incoming and outgoing TCP connections also takes its toll on memory consumption.


Microservices are often hosted inside individual Docker containers orchestrated by Kubernetes, which adds its own hardware requirements to the tally.


Cloud Costs

The larger hardware footprint and the requirement to ship and store centralized monitoring data balloon the cost of running a microservice architecture in the cloud. Some might (jokingly?) argue that microservices are being promoted in order to drive more revenue to the cloud providers.


Cyclic Dependencies

If two microservices each depend on each other to boot up, it becomes impossible to restart the system from nothing. Microservice X will not start unless Y has already started, and vice versa.
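Boot-order cycles can be caught before deployment by topologically sorting the dependency graph. A small sketch in Python (a depth-first topological sort; the service names are made up):

```python
def boot_order(deps):
    """Return a start order honoring dependencies, or raise on a cycle."""
    order, state = [], {}  # state: 1 = in progress, 2 = done

    def visit(svc):
        if state.get(svc) == 1:
            # We've re-entered a service we're still resolving: a cycle.
            raise ValueError(f"cyclic boot dependency involving {svc!r}")
        if state.get(svc) == 2:
            return
        state[svc] = 1
        for dep in deps.get(svc, []):
            visit(dep)
        state[svc] = 2
        order.append(svc)  # dependencies are appended before dependents

    for svc in deps:
        visit(svc)
    return order

# A healthy graph boots dependencies first...
healthy = boot_order({"api": ["db"], "db": []})

# ...but X and Y depending on each other cannot be started from nothing.
try:
    boot_order({"X": ["Y"], "Y": ["X"]})
    cyclic = False
except ValueError:
    cyclic = True
```

Running this as a CI check against a declared dependency graph is one cheap way to keep the X-needs-Y-needs-X situation from ever reaching production.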


A similar cyclic dependency may exist also while the microservices are already running: X is calling endpoints on Y and Y is calling endpoints on X. Even if it does not cause an infinite loop, a cyclic dependency makes it difficult to reason about the system's properties.


Distributed Transactions

Often, an orchestrator microservice may need to update multiple objects in the database. If each object is owned by a different downstream microservice that's running in another process, it is not possible to encapsulate the entire operation in a single transaction.


If one of the downstream microservices then fails, the orchestrator microservice needs to orchestrate a rollback, with no guarantee that it will not fail itself.


The saga pattern can be used to address distributed transactions, but it is not always trivial to implement.
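The core idea of a saga is that every step has a compensating action, and on failure the completed steps are compensated in reverse order. Here's a minimal sketch in Python; the order-processing steps are invented for illustration:

```python
def run_saga(steps):
    """Execute (action, compensation) pairs; on failure, compensate in reverse."""
    done = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Roll back the steps that already succeeded, newest first.
            for comp in reversed(done):
                comp()  # best effort: compensations may themselves fail
            return False
        done.append(compensate)
    return True

log = []
def fail():
    raise ConnectionError("shipping service down")

steps = [
    (lambda: log.append("reserve inventory"), lambda: log.append("release inventory")),
    (lambda: log.append("charge card"),       lambda: log.append("refund card")),
    (fail,                                    lambda: log.append("cancel shipment")),
]
ok = run_saga(steps)
```

Note the hedge in the comment: compensations can fail too, which is why production sagas persist their progress and retry compensations — the part that makes real implementations non-trivial.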


Even when all goes well, the differing network latencies to the downstream microservices mean that the data objects do not get updated at the same time, which can result in inconsistencies.


Local Debugging

A distributed system runs on multiple processes, but during development typically only one is run in the IDE. In some cases, the other microservices run in a shared integration environment running in the cloud, making it practically impossible to debug across multiple microservices and significantly hampering developer productivity.


Troubleshooting

Identifying issues in production is just as challenging. An error logged in production will print the stack trace of only the process it is running in, without the stack trace of the upstream microservices that led to it. A distributed logging system can be used to coalesce individual logs but it typically involves shipping large amounts of data to a central storage at a rather high cost.
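One prerequisite for stitching those individual logs back together is a correlation ID that is minted at the edge and forwarded on every downstream call. A small sketch in Python (the services, header convention, and log shape are illustrative):

```python
import json
import uuid

def log_line(service, cid, message):
    # Structured JSON logs let a central store group lines by correlation ID.
    return json.dumps({"service": service, "correlation_id": cid, "msg": message})

def handle_request(correlation_id=None):
    """Reuse the caller's correlation ID, or mint one at the edge of the system."""
    cid = correlation_id or str(uuid.uuid4())
    records = [log_line("gateway", cid, "received order")]
    # A downstream call would forward the same ID, e.g. in an
    # X-Correlation-ID header; here we just log on its behalf.
    records.append(log_line("billing", cid, "charged card"))
    return records

lines = handle_request()
ids = {json.loads(l)["correlation_id"] for l in lines}
```

With every line carrying the same ID, the central log store can reconstruct the request's path across processes — the closest a distributed system gets to a stack trace.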


Unique IP:Port

Since a microservice is a web server, it must listen on a TCP port for incoming requests from other microservices. Different microservices, or even different replicas of the same microservice, need to run on different IPs or ports. Port mapping is often used to run multiple microservices on a single development machine, but it comes with its own set of idiosyncrasies.
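One trick that sidesteps manual port bookkeeping on a dev machine is binding to port 0 and letting the OS pick a free ephemeral port, which the service then registers with discovery. A minimal sketch in Python:

```python
import socket

def reserve_port():
    """Ask the OS for an unused TCP port by binding to port 0."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    port = s.getsockname()[1]  # the port the OS actually assigned
    return s, port

# Two replicas on one machine get distinct ports without any manual mapping.
a, port_a = reserve_port()
b, port_b = reserve_port()
a.close()
b.close()
```

The catch, of course, is that nothing else knows the port until the service announces it — which is precisely why a discovery service is on the list of enabling systems.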


CI/CD Automation

Trying to manage microservices without a full set of automation tools is a huge productivity drain if only for the sheer volume of microservices. It may be possible to manually build, test and deploy a single monolith, but it is impossible to do the same for 100 microservices. Fully automating the CI/CD pipeline takes upfront investment and expertise.


Inter-Service Chaos

Microservices break an application into small components that don't provide much opportunity for chaotic code to form. Even if significant chaos does form inside a microservice, refactoring it is an easier task than refactoring a module that is tightly integrated into a monolith. If one is not careful, however, the chaos simply shifts and re-forms at the interconnections between microservices.


Distributed Monolith

A distributed monolith is a distributed system whose microservices are tightly coupled. It is so named because it resembles a monolith in which each in-process call has been replaced with a remote network call. The term is often thrown around to demonstrate the futility of breaking a monolith into microservices.


Divergent Engineering Standards

Microservices are often built by different teams, maybe even in different time zones, each working on their own codebase using their own preferred coding standards, patterns and even tech stacks. This divergence of engineering standards introduces friction to the process of moving engineers between teams because of the steeper learning curve.


Abandoned Microservices

Once a microservice has been completed and stabilized, it may be left running for months or even years without change. Over time, the team that built it will move on to other microservices, and the original microservice might ultimately end up orphaned, with no owner at all.


Dependency Management

Microservices are built and deployed separately. If there's a security fix or update to a dependency, all dependent microservices have to be rebuilt and redeployed. Such a large rollout always has the potential to yield catastrophic results and bring down the entire system.


Conclusion

Microservices are no picnic and the journey to the promised land of infinite scalability is dotted with pitfalls. If you're not careful, you'll end up in a real jam and wish you never left the comfort of the monolith.


I should have built a monolith
