Resilient programming

In the 1980s Sun Microsystems’ slogan was “The Network is the Computer” (Cloudflare now owns the rights to the trademark), and, as programmers, we now understand how true that is. Whether you’re a client-side developer interacting with upstream REST services, or a server-side developer making calls to a database or to any other upstream or downstream service that is external to this piece of code, you are most likely dealing with a network.

Networks are notoriously unreliable. Hardware degrades or breaks completely, human error causes routing misconfigurations, and malicious humans overwhelm networks with traffic. Combine that with the fact that the service being communicated with can be in any possible state, from up and responsive, to up but struggling, to completely unresponsive, and everything in between.

Thus, a programmer absolutely has to be aware of these problems, expect them to occur, and be ready to deal with them appropriately. Nobody wants their system to be dead unless it absolutely has to be!

There’s no way to deal with every possible permutation of every possible set of states directly, but there are general rules that will allow systems to keep on trucking regardless of most of the possible states.

There are three key strategies to learn. These strategies can be, and often are, combined to deal with network problems.

Fallback

The fallback strategy is simple: if an issue is detected while talking to an external service, a response is sent that can partially satisfy the request. For example, the fallback might be to send known stale data, which is the same trade-off that eventually consistent systems make. The response may not be the most recent data, but it’s good enough.
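As a rough illustration, the stale-data fallback might be sketched as below. The `fetch_with_fallback` function, the in-memory `_last_good` cache and the choice of exceptions treated as "an issue" are all assumptions made for this example, not part of any particular library.

```python
from typing import Callable, Dict, Optional

# Hypothetical in-memory store of the last good response for each key.
_last_good: Dict[str, str] = {}


def fetch_with_fallback(key: str, fetch: Callable[[str], str]) -> Optional[str]:
    """Return fresh data when possible, otherwise the last known good value."""
    try:
        value = fetch(key)
    except (ConnectionError, TimeoutError):
        # The external call failed: serve stale data (None if we never had any).
        return _last_good.get(key)
    # The call worked: remember the value so it can be served later if needed.
    _last_good[key] = value
    return value
```

A real system would probably also track how old the stale value is, so that the caller can decide whether "good enough" still applies.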

Retry

The retry strategy is to simply retry. That is, an issue is detected, so the code makes the request again.
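A minimal sketch of a retry helper follows. The name `retry`, the default of three attempts and the doubling delay between attempts are assumptions for illustration; the delay is an addition intended to avoid hammering a service that is already struggling.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(call: Callable[[], T], attempts: int = 3, base_delay: float = 0.1) -> T:
    """Call `call` up to `attempts` times, doubling the wait between tries."""
    for attempt in range(attempts):
        try:
            return call()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            time.sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

Only retry requests that are safe to repeat; blindly retrying a request that changes state on the other side can apply the change twice.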

Timeout

Another simple strategy: give the external system a deadline by which to respond, and if it doesn’t respond in time, stop waiting and return.
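One way to sketch a deadline in Python is to run the call in a worker thread and stop waiting after a set time. The helper `call_with_deadline` and the stand-in `slow_service` are hypothetical names for this example. Many client libraries accept a timeout argument directly, and that is usually the better option when it exists.

```python
import concurrent.futures
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_deadline(call: Callable[[], T], seconds: float) -> Optional[T]:
    """Run `call` in a worker thread; return None if it misses the deadline."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(call)
    try:
        return future.result(timeout=seconds)
    except concurrent.futures.TimeoutError:
        return None
    finally:
        # Do not block on the abandoned worker; it finishes in the background.
        executor.shutdown(wait=False)


def slow_service() -> str:
    """Stand-in for an external call that takes too long."""
    time.sleep(0.5)
    return "late"
```

Note that this only stops the caller from waiting. The abandoned call keeps running in its thread until it finishes on its own.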

There’s a fourth strategy called a Circuit Breaker: something that sits between the calling code and the external system and monitors the performance of the external system. Should the external system become slow or unresponsive, the circuit breaker will “open” (trip) and prevent all requests to the external system. A circuit breaker deployed this way is itself an external system to the calling code, and should be treated as such.
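To show the state logic only, here is a simplified in-process sketch rather than a separate component sitting on the network. It uses the usual electrical terminology, where an open circuit blocks calls and a closed circuit lets them through. The class name, the failure threshold, the cool-down period and the injectable `clock` are all assumptions made for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is refusing calls."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down has elapsed: let one trial call through.
        try:
            result = func()
        except (ConnectionError, TimeoutError):
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result
```

This version counts failures only. Treating slow responses as failures too would mean pairing it with a timeout.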

Armed with the above knowledge, developers can combine the strategies to provide the best solution for their use case.

For example, a caller could set a deadline so that a call to the external service times out, which triggers a fallback to another external service. Another combination might be that a caller keeps retrying until a timeout occurs. There is no limit to the combinations, only the need to balance the effects of each strategy - there’s little point in retrying a service if the reason for the error state is that the external service has too many requests to deal with!
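One such combination, retrying until an overall deadline passes and then falling back, might look like the sketch below. The function name, the `pause` between attempts and the assumption that each individual `primary` call fails fast (or carries its own timeout) are illustrative choices, not a prescription.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_until_deadline(primary: Callable[[], T], fallback: Callable[[], T],
                         deadline_seconds: float, pause: float = 0.1) -> T:
    """Retry `primary` until the deadline passes, then use `fallback`."""
    deadline = time.monotonic() + deadline_seconds
    while True:
        try:
            return primary()
        except (ConnectionError, TimeoutError):
            # Stop retrying if another pause would take us past the deadline.
            if time.monotonic() + pause >= deadline:
                break
            time.sleep(pause)
    return fallback()
```

The `pause` is what keeps this from flooding a service that is failing precisely because it is overloaded.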

Summary

Writing resilient code is a must, and isn’t that hard to do. A programmer doesn’t necessarily need to know why the external service is unable to respond, but the programmer does need to know what they want to do when it can’t, using combinations of very simple strategies.
