Inter-service communication is problematic: you can never be sure whether the other service received your message, errored, or failed to respond in some way. When networks are involved the problem is exacerbated.
When a system is being designed, it is impossible to eliminate all the possible issues that could arise. With cloud utilisation it’s even harder: a different company is in charge of the data centre, the hardware and the networks.
A better option is to design systems to expect and deal with failure. With the mindset of expecting failures, the next step is to define a strategy for how a component should behave when a failure occurs.
The scenario is this: one component, or service, tries to communicate with another and, for reasons unknown, that communication fails. The next step the first component takes determines how it deals with that failure.
Retry
When the component detects that a failure has occurred, it can retry the same call that it made previously. This assumes that whatever was previously preventing communication from happening has now been fixed, or mitigated in some manner (e.g. a new version of the called component now exists, a DNS round robin means that a different instance is called, or the called component has healed or is dealing with fewer requests).
Unfortunately, the new communication could add to the problems that the other component is facing (i.e. more work to do).
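As a rough sketch in Go (the callService function below is a hypothetical stand-in for the real inter-service call), a retry helper simply re-issues the call a fixed number of times, backing off between attempts so that the retries themselves don’t pile more work onto a struggling component:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errUnavailable = errors.New("service unavailable")

// callService is a hypothetical stand-in for the real call; in practice
// this would be an HTTP or gRPC request to the other component.
func callService() (string, error) {
	return "", errUnavailable
}

// retry attempts the call up to maxAttempts times, waiting between attempts
// (and doubling the wait) so the struggling service isn't flooded with work.
func retry(maxAttempts int, wait time.Duration) (string, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		result, err := callService()
		if err == nil {
			return result, nil
		}
		lastErr = err
		time.Sleep(wait)
		wait *= 2 // back off so each retry adds less pressure
	}
	return "", fmt.Errorf("all %d attempts failed: %w", maxAttempts, lastErr)
}

func main() {
	if _, err := retry(3, 100*time.Millisecond); err != nil {
		fmt.Println(err)
	}
}
```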
Timeout
When a component tries to communicate with another component, a certain amount of time is given to wait for an appropriate response. If one isn’t forthcoming, an error is generated.
How much time is allowed is dependent upon the expectations of the system; for example, a timeout of a couple of hundred milliseconds might be appropriate when asking a datastore to respond to an SQL query, but a timeout of several hours might be appropriate if transferring a large file (or processing a non-trivial dataset).
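In Go, a context deadline is one way to express such a limit. The sketch below gives a made-up internal endpoint a couple of hundred milliseconds to answer before giving up; the URL is purely illustrative:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// queryWithTimeout gives the downstream service a fixed window to respond;
// if it doesn't respond in time, the request is cancelled and an error is
// returned instead of waiting indefinitely.
func queryWithTimeout(url string, timeout time.Duration) (*http.Response, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	return http.DefaultClient.Do(req)
}

func main() {
	// Hypothetical endpoint; a large file transfer would warrant a far
	// larger timeout than this.
	resp, err := queryWithTimeout("http://reports.internal/summary", 200*time.Millisecond)
	if err != nil {
		fmt.Println("request failed or timed out:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```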
Fallback
If it’s determined that communication to another component isn’t feasible, for whatever reason, a fallback position is chosen.
A fallback position is dependent upon the context, but is often a default value that the system can live with.
For example, if a user cannot log in, then the fallback position might be that they are given guest credentials that allow them to perform some tasks, but not others.
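A sketch of that login example in Go, assuming a hypothetical login function that calls the real authentication service: when the call fails, the caller is handed guest credentials rather than an error.

```go
package main

import (
	"errors"
	"fmt"
)

type Credentials struct {
	User  string
	Roles []string
}

// guestCredentials is the default the system can live with when the real
// login service can't be reached.
var guestCredentials = Credentials{User: "guest", Roles: []string{"read-only"}}

var errAuthUnavailable = errors.New("auth service unreachable")

// login is a hypothetical stand-in for the real authentication call.
func login(user, password string) (Credentials, error) {
	return Credentials{}, errAuthUnavailable
}

// loginWithFallback falls back to guest credentials rather than failing the
// whole request when the authentication service can't be contacted.
func loginWithFallback(user, password string) Credentials {
	creds, err := login(user, password)
	if err != nil {
		return guestCredentials
	}
	return creds
}

func main() {
	creds := loginWithFallback("alice", "secret")
	fmt.Printf("logged in as %s with roles %v\n", creds.User, creds.Roles)
}
```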
Circuit Breaker
A slightly more complicated approach is the circuit breaker. A circuit breaker receives the sent message and passes it on to the destination service, noting how long that service takes to respond meaningfully to the message.
If the response breaks some set of rules (e.g. it takes too long, or exceeds some threshold for the number of errors over the past n seconds), the circuit breaker will change the way that it behaves: instead of passing all of the messages it receives directly to the destination component, it passes none at all.
This is more helpful than it sounds: the destination component now has an opportunity to heal, catch up on tasks it is behind on, or be replaced by a new instance.
The calling component will be sent error messages indicating that the circuit breaker has been tripped. That fast failure allows the system to move on.
After some fixed period of time the circuit breaker will change behaviour again. It enters a third state, where some small number of messages are allowed through to the destination component. In effect the circuit breaker is ‘testing’ to see if the destination can be used fully again.
Eventually, should the destination prove to be ready, willing and able, the circuit breaker returns to its original state, and acts as a transparent proxy for the messages once more.
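The three states described above are commonly called closed (pass everything through), open (fail fast) and half-open (let a trial message through). A minimal, single-goroutine sketch in Go follows; a production breaker would also need locking for concurrent callers and richer metrics, such as error rates over a time window.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type state int

const (
	closed   state = iota // pass every message through to the destination
	open                  // fail fast, giving the destination time to recover
	halfOpen              // allow a trial message through to test the destination
)

var errBreakerOpen = errors.New("circuit breaker open: failing fast")

// Breaker wraps calls to another component and trips after too many failures.
type Breaker struct {
	state       state
	failures    int
	maxFailures int           // consecutive failures that trip the breaker
	cooldown    time.Duration // how long to stay open before testing again
	openedAt    time.Time
}

func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{state: closed, maxFailures: maxFailures, cooldown: cooldown}
}

// Call forwards the work to the destination unless the breaker is open.
func (b *Breaker) Call(work func() error) error {
	if b.state == open {
		if time.Since(b.openedAt) < b.cooldown {
			return errBreakerOpen // fail fast so the caller can move on
		}
		b.state = halfOpen // cooldown is over: allow a trial message through
	}

	if err := work(); err != nil {
		b.failures++
		if b.state == halfOpen || b.failures >= b.maxFailures {
			b.state = open // trip (or re-trip) the breaker
			b.openedAt = time.Now()
		}
		return err
	}

	// Success: the destination looks healthy again, so resume normal service.
	b.failures = 0
	b.state = closed
	return nil
}

func main() {
	breaker := NewBreaker(3, 2*time.Second)
	flaky := func() error { return errors.New("destination unavailable") }

	for i := 0; i < 5; i++ {
		fmt.Println(breaker.Call(flaky))
	}
}
```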
Conclusion
Although these approaches can be applied in a discrete manner, more often than not they are applied in combination, especially with the circuit breaker pattern: your service often needs to recognise how to behave depending on what state the circuit breaker is in.
There’s no reason not to combine approaches; for example, when communication times out, a fallback is returned.
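As a small sketch of that combination, again in Go with made-up names: the call is given a deadline, and if it doesn’t respond in time the caller settles for a default value.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fetchGreeting is a hypothetical stand-in for a slow downstream call; it
// honours the context so that a timeout can interrupt it.
func fetchGreeting(ctx context.Context) (string, error) {
	select {
	case <-time.After(500 * time.Millisecond): // pretend the service is slow
		return "Hello from the greeting service", nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// greetingOrFallback applies a timeout and, if the call fails for any reason,
// returns a default the system can live with.
func greetingOrFallback(timeout time.Duration) string {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	greeting, err := fetchGreeting(ctx)
	if err != nil {
		// Timed out (or failed in some other way): fall back to a default.
		return "Hello (default greeting)"
	}
	return greeting
}

func main() {
	fmt.Println(greetingOrFallback(100 * time.Millisecond))
}
```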
In all cases, expecting failure and having a strategy to deal with it when it arrives ensures that your system isn’t catastrophically destroyed by that failure.