Fault tolerance patterns for remote calls

Łukasz Kyć
Published in Pragmatists · Jul 22, 2021 · 10 min read

by Michał Rosiński

We live in an imperfect world where failures are inevitable. The systems we depend on will fail sooner or later. We cannot do anything to prevent it, but we have the power to mitigate cascading failures. We just have to add some tools to our toolbox.

Back in the day, we used to write simple programs running on a single machine. These programs used reliable interfaces to communicate with core devices. As networks became more and more popular, people started to design systems of increasing complexity, consisting of many programs running on independent machines and communicating with each other. Undeniably, there is no way back. People began to treat remote communication as an implementation detail and to hide networking intricacies behind yet another abstraction layer. That is a very good design pattern, but programmers very often forget about the unreliability of this form of communication. The fallacy that every call succeeds leads to an error-prone implementation, and we all know how that might end up: waking up on a Saturday night because of pager duty notifications.

If you would like to sleep better, please read on. And remember: networks are unreliable!

Case studies

The patterns described in this article were applied in our system in response to incidents that had happened before. Fortunately, they solved these problems.

Case 1

We had an SQS message consumer that was fetching batches of up to 20 messages from the queue. The next batch was fetched only when all messages from the previous one had been processed. From time to time we observed a growing number of messages in the queue. It turned out that the consumer was stuck. The only way to recover was manual intervention: restarting the service.

What was the reason? The message processor was reading some data from the database, and occasionally some of these calls never ended. Why? We don’t know. Maybe it was a database problem, maybe a network failure. It doesn’t matter. What matters is that customers were affected and we had to figure out how to protect ourselves against unexpected behavior of the components our service depends on.

Case 2

We had an administrative job running in the background, living in the same process as the business logic exposed to our customers via a REST API. The purpose of the job was to delete data that was no longer needed. One day we deployed a new version of the service with some tiny changes made to the job. Just after that, our customers started to receive Internal Server Error responses. At first we thought it was unrelated to our little changes, because we had touched a different area of the service.

So what was the reason? It turned out that there was a technical relationship. Because of an oversight, the job’s tasks started to run in parallel, while the intention was to have them run sequentially. Unfortunately, the number of tasks running at the same time was far more than one hundred. Each of them was executing a long-running SQL statement. As a result, the shared connection pool was quickly drained. Sadly, once again, the customers were affected.

Case 3

We used a custom Feature Toggles Management Service (see the Feature Toggles concept), which allowed us to get the list of enabled features for a given tenant in our multi-tenant system. Unfortunately, the service wasn’t stable enough. It quite often responded with an Internal Server Error status, or didn’t respond at all! The logic of our service used feature toggles intensively, so users of our public API were heavily impacted whenever the Feature Toggles Management Service had a problem.

Now, it’s time to introduce patterns that help in such scenarios. Bear with me!

Patterns

Timeouts

It’s essential to understand that any resource pool can be exhausted, and it is our responsibility to prevent that. The most commonly used pools are connection pools and thread pools. Each resource leased from a pool has to be released eventually. Remember that pools are shared: the longer consumers hold a resource, the more likely it is that the pool will be drained and other consumers will be starved.

That’s why it’s crucial to release resources quickly. A very dangerous case is holding exclusive access to a resource while calling a remote service. Unfortunately, this case is very common, because connection pools are the standard for network communication. Moreover, if we use a blocking API, the thread leased from the pool is unavailable to others. If we have bad luck, the remote call may never end. If that happens, our code can’t wait forever. To solve this problem we need proper techniques. The easiest, yet very powerful, one is the operation timeout. The goal is to protect ourselves against a dependency that does not respond in a timely manner.

How does it work? Each operation that uses the network underneath should have an upper limit within which it has to complete. For example, when a simple database query usually takes 500 ms to finish, it’s reasonable to set a timeout of 5 seconds. If the operation doesn’t succeed within that time, it’s our responsibility to terminate it and throw a timeout exception. Ideally, the underlying connection should also be closed.
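
As a rough illustration, here is a minimal Java sketch of wrapping a blocking operation with a hard upper limit. The helper and the simulated query are made up for the example; in practice many clients expose timeouts directly, for example JDBC’s Statement.setQueryTimeout or an HTTP client’s connect and read timeouts.

```java
import java.util.concurrent.*;

public class TimeoutExample {

    // Single-threaded executor used only to bound the blocking call.
    private static final ExecutorService executor = Executors.newSingleThreadExecutor();

    // Wraps a blocking operation and enforces an upper time limit on it.
    static <T> T callWithTimeout(Callable<T> operation, long timeoutMillis) throws Exception {
        Future<T> future = executor.submit(operation);
        try {
            // Wait at most timeoutMillis; afterwards we give up.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Our responsibility: stop waiting instead of blocking forever.
            // Cancelling interrupts the worker thread; the underlying connection
            // still needs its own timeout or explicit closing.
            future.cancel(true);
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // A query that usually takes ~500 ms gets a 5 s upper limit.
        String result = callWithTimeout(() -> {
            Thread.sleep(200); // stand-in for a database query
            return "row";
        }, 5_000);
        System.out.println(result);
        executor.shutdown();
    }
}
```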

How do we know what the best timeout value is? Unfortunately, there is no clear-cut rule. Having historical response-time data helps a lot: based on it, it’s easy to calculate a baseline for a normally operating system. If we add some margin to the baseline, we get the timeout value.
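
One possible way to turn that rule of thumb into numbers, assuming a high percentile of historical response times is an acceptable baseline; the sample data and the 2x margin below are purely illustrative.

```java
import java.util.Arrays;

public class TimeoutBaseline {

    // Picks the given percentile from historical response times (in milliseconds)
    // and multiplies it by a safety margin to obtain a timeout value.
    static long timeoutFromHistory(long[] responseTimesMs, double percentile, double margin) {
        long[] sorted = responseTimesMs.clone();
        Arrays.sort(sorted);
        int index = (int) Math.ceil(percentile / 100.0 * sorted.length) - 1;
        long baseline = sorted[Math.max(index, 0)];
        return (long) (baseline * margin);
    }

    public static void main(String[] args) {
        long[] history = {420, 480, 500, 510, 530, 610, 700, 750, 800, 2100};
        // e.g. 99th percentile with a 2x margin
        System.out.println(timeoutFromHistory(history, 99.0, 2.0) + " ms");
    }
}
```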

If we had had timeouts in the database access layer, we would have slept better. The issue described in Case 1 has since been solved: never-ending database calls are now timed out, which unblocks the processing of subsequent messages. Timeouts also helped us mitigate the problem of never-ending calls to the Feature Toggles Management Service mentioned in Case 3. We stopped waiting for a response longer than necessary and started returning Internal Server Error instead. I know that doesn’t sound like a perfect solution, but bear in mind that it’s better to fail eventually than not to reply at all. Following that rule saves our precious resources and our customers’ time.

Retries

Some categories of remote call problems can be solved by retrying. It sounds like a very naive strategy, but it works excellently for transient errors. How do we know that an error is transient? We never know for sure, but we can use quite simple rules to decide whether a retry makes sense. For example, if an HTTP server responds with status 400 (Bad Request), we know that it’s a client-side problem and sending the same request again is pointless. On the other hand, statuses 500 (Internal Server Error) and 503 (Service Unavailable) are perfect candidates for one more try. Of course, not all network communication uses the HTTP protocol. From the socket-level perspective, I think it’s worth retrying whenever we encounter network-level issues. A timeout exception (see the Timeouts pattern) is also a great candidate for a retry: if we didn’t receive a response in a reasonable time, maybe the communication link has been broken, and trying again will establish a new connection over which the operation may succeed.

Another protocol-agnostic technique to detect transient errors is the Leaky Bucket design pattern. To make a long story short, imagine a bucket with a hole at the bottom. When we pour water in, it leaks through the hole, but if the hole is small and we pour fast enough, the bucket never gets empty. Now let’s apply this idea to computer science. When remote calls succeed, it’s like pouring water into the bucket; when they fail, it’s like water leaking through the hole. As long as there is some water in the bucket, the errors are considered transient. If the bucket becomes empty, because errors are too close together or there are too many of them, it’s a sign of non-transient errors and there is no reason to retry. We can control the threshold for detecting non-transient errors by tuning either the size of the bucket or the size of the hole: by increasing the size of the bucket or decreasing the diameter of the hole, we increase the tolerance for errors. The greater the tolerance, the later errors are classified as non-transient.
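
Here is a minimal sketch of the bucket as described above, with successes pouring water in, failures draining it, and an empty bucket signalling non-transient errors; the class and parameter names are illustrative.

```java
// Leaky-bucket sketch following the analogy above:
// successes pour water into the bucket, failures drain it,
// and an empty bucket classifies further errors as non-transient.
public class LeakyBucket {

    private final int capacity;     // size of the bucket
    private final int leakPerError; // "diameter of the hole"
    private int level;              // current amount of water

    public LeakyBucket(int capacity, int leakPerError) {
        this.capacity = capacity;
        this.leakPerError = leakPerError;
        this.level = capacity; // start full: be optimistic at first
    }

    public synchronized void recordSuccess() {
        level = Math.min(capacity, level + 1);
    }

    public synchronized void recordFailure() {
        level = Math.max(0, level - leakPerError);
    }

    // While there is still water in the bucket, errors are treated as transient.
    public synchronized boolean errorsLookTransient() {
        return level > 0;
    }
}
```

A retry loop could consult errorsLookTransient() before scheduling another attempt; increasing capacity or decreasing leakPerError increases the tolerance for errors, exactly as the analogy suggests.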

You may ask how often and how quickly we should retry. I will disappoint you again: there is no clear-cut rule. Fast retries are very likely to fail again. Ideally, we should wait some time without blocking any resources, such as threads. Use queues, if applicable, to schedule the next attempt. Unfortunately, this does not play well in the context of a synchronous call, because it significantly delays the response the upstream service is waiting for.

Ideally, retries should follow an exponential backoff policy: for example, a second attempt after 1 s, then after 2 s, then after 4 s, and so on. The aim of this algorithm is to complete the operation quickly if possible, while still giving the dependency time to recover from a long-lasting outage. It’s also quite important to add a random factor (jitter) to the delay when scheduling the next attempt, so the schedule looks more like 1 s, 2.1 s, 3.8 s, 8.05 s. If everyone used the same schedule without a random factor and started retrying at the same time, the calls would interfere with each other and obstruct the recovery of the broken component.
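
A simple sketch of exponential backoff with jitter might look like this. The parameters are illustrative, and note that it blocks the calling thread while waiting, which is exactly the trade-off discussed above; libraries such as Resilience4j or Spring Retry offer the same policy without hand-rolling it.

```java
import java.util.Random;
import java.util.concurrent.Callable;

public class RetryWithBackoff {

    private static final Random random = new Random();

    // Retries the operation with exponential backoff plus a random jitter.
    // maxAttempts and baseDelayMillis are illustrative parameters.
    static <T> T retry(Callable<T> operation, int maxAttempts, long baseDelayMillis) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up after the last attempt
                }
                // 1 s, 2 s, 4 s, ... plus up to 20% random jitter to avoid synchronized retries
                long delay = baseDelayMillis * (1L << (attempt - 1));
                long jitter = (long) (random.nextDouble() * 0.2 * delay);
                Thread.sleep(delay + jitter);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        String response = retry(() -> {
            // stand-in for a remote call that fails transiently twice, then succeeds
            if (++calls[0] < 3) throw new RuntimeException("503 Service Unavailable");
            return "OK";
        }, 5, 1_000);
        System.out.println(response);
    }
}
```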

The other question is how many times we should try. It depends on the case and on the likelihood that the external service will eventually reply correctly.

How did this pattern help us? In Case 3, we managed to decrease the error rate of API calls perceived by our customers. How? In a very simple way: before our API responds with an error, when there is a problem with the Feature Toggles Management Service, we try to fetch the toggles for the tenant once again. The second call to the downstream service succeeds in the majority of cases, thanks to the load balancer, which routes the request to a different, and most likely healthy, instance of the service.

Circuit Breakers

A circuit breaker is a switch that protects an electrical circuit when connected devices draw too much power. It fails first and opens the circuit to prevent the house from burning down. In software engineering we can use a similar mechanism to cut off components that are not operating correctly and give them time to recover. A circuit breaker can be implemented as a client-side proxy intercepting every call to an external service.

A circuit breaker is a state machine with three states:

  • Closed
  • Open
  • Half-open

The closed state means that everything works normally and nothing extraordinary is happening. Each call through the circuit breaker is forwarded to its destination. If a call fails, the circuit breaker remembers it. Once the number of errors exceeds a threshold, the circuit breaker opens the circuit. This means that subsequent calls finish fast and don’t reach the destination. We have to involve system stakeholders to decide what finishing fast means. The most commonly used options are:

  • Fail fast
  • Return the last good response from the cache
  • Call a secondary service when the primary is not available

A circuit breaker has to close again eventually. A commonly used strategy is to switch into the half-open state after a suitable amount of time. In this state, some calls are forwarded to the destination. If they succeed, the circuit breaker goes into the closed state; otherwise it goes back into the open state.
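
Below is a minimal, intentionally non-thread-safe sketch of that state machine. It uses a simple consecutive-failure counter instead of a fault-density measure, and the thresholds and names are illustrative; in production you would more likely reach for a library such as Resilience4j or Hystrix.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Callable;

// Sketch of the closed / open / half-open state machine described above.
public class CircuitBreaker {

    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration openDuration;

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    public CircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    public <T> T call(Callable<T> operation) throws Exception {
        if (state == State.OPEN) {
            if (Duration.between(openedAt, Instant.now()).compareTo(openDuration) >= 0) {
                state = State.HALF_OPEN; // let a trial call through
            } else {
                throw new IllegalStateException("Circuit is open, failing fast");
            }
        }
        try {
            T result = operation.call();
            onSuccess();
            return result;
        } catch (Exception e) {
            onFailure();
            throw e;
        }
    }

    private void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = Instant.now();
        }
    }
}
```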

We may also wonder when exactly the circuit breaker should trip from the closed into the open state. Ideally, it should measure fault density. One strategy is to use the aforementioned Leaky Bucket design pattern.

Circuit breakers and timeouts play very well together. If we get a timeout exception, it means we have already waited too long and it doesn’t make sense to wait any longer. And if we suspect that subsequent requests will also fail, it’s pointless to keep waiting and leasing resources from pools (like threads and connections) for that long. It’s better to give up and fail immediately.

Referring to the case studies, the introduction of circuit breakers mitigated the problem of the unstable Feature Toggles Management Service mentioned in Case 3. The solution was to add a proxy wrapping the client of the unreliable service; the proxy was a circuit breaker. After a successful call to the downstream service, it stored the result in an internal cache. After an unsuccessful call, or when the circuit breaker was open (for example because of recurrent timeouts), it immediately returned the value previously stored in the cache. Our goal was to stop the propagation of errors and to decrease response times when the downstream service has problems.
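
A sketch of such a proxy could look like the following. It reuses the CircuitBreaker sketch from above, and the FeatureTogglesClient interface and its method are hypothetical stand-ins for the real client.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical client interface for the feature toggles service.
interface FeatureTogglesClient {
    List<String> enabledFeatures(String tenantId) throws Exception;
}

// Circuit-breaking proxy that falls back to the last good response per tenant.
class CachingFeatureTogglesProxy implements FeatureTogglesClient {

    private final FeatureTogglesClient delegate;
    private final CircuitBreaker breaker = new CircuitBreaker(5, Duration.ofSeconds(30));
    private final Map<String, List<String>> lastGoodResponse = new ConcurrentHashMap<>();

    CachingFeatureTogglesProxy(FeatureTogglesClient delegate) {
        this.delegate = delegate;
    }

    @Override
    public List<String> enabledFeatures(String tenantId) throws Exception {
        try {
            List<String> toggles = breaker.call(() -> delegate.enabledFeatures(tenantId));
            lastGoodResponse.put(tenantId, toggles); // remember the last successful result
            return toggles;
        } catch (Exception e) {
            List<String> cached = lastGoodResponse.get(tenantId);
            if (cached != null) {
                return cached; // stale but better than an error
            }
            throw e; // nothing cached yet: propagate the failure
        }
    }
}
```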

Bulkheads

Bulkhead is a term that derives from ship construction. It refers to the partitions that divide a hull into watertight compartments, which can contain water in case of a leak and prevent the whole hull from flooding. Using the same technique in computer programs, we can protect the whole system from failure by sacrificing only a single piece.

Examples:

  • Four servers running a service. Even when one instance suddenly stops, the others can handle the traffic.
  • Running a service across regions or data centers. Even when one region or data center becomes unavailable, the service works in others.
  • Having separate connection pools for different types of work. Even when a less critical administrative job saturates its own pool, it has no impact on critical customer requests (see the sketch below). If we had had dedicated connection pools in our system, I wouldn’t have had to mention Case 2.
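
A minimal sketch of that last bulkhead, assuming thread pools as the partitioned resource; the pool sizes are illustrative, and the same idea applies to database connection pools.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Two bounded pools, so the administrative job can never starve
// the pool that serves customer requests.
public class Bulkheads {

    // Pool dedicated to customer-facing requests.
    private static final ExecutorService customerPool = Executors.newFixedThreadPool(20);

    // Small, separate pool for background administrative work.
    private static final ExecutorService adminPool = Executors.newFixedThreadPool(2);

    public static void main(String[] args) {
        // Even if the admin job submits hundreds of long-running tasks...
        for (int i = 0; i < 200; i++) {
            adminPool.submit((Callable<Void>) () -> {
                Thread.sleep(10_000); // stand-in for a long-running SQL statement
                return null;
            });
        }
        // ...customer requests still get threads from their own pool.
        customerPool.submit(() -> System.out.println("customer request handled"));

        customerPool.shutdown();
        adminPool.shutdownNow(); // interrupt and discard the pending admin work so the demo exits
    }
}
```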

Summary

We were too optimistic and unintentionally ignored the fact that networks are unreliable. Now we know that we were wrong, and it was a valuable lesson for us. Unfortunately, there was a cost we had to incur. It’s always better to learn from the mistakes of others, and after reading this article, you have the chance to learn from ours.

My wish is to encourage you to use these patterns wisely. Every time you consider applying a pattern, think about the pros and cons. Everything depends on the case. That’s why I don’t have ready-made recipes for you, but you can use the case studies to find analogies and get inspired. It’s crucial to stay open-minded, whatever you do!
