A Dead Letter Queue (DLQ) is an architectural pattern for handling messages or events that a system cannot process after multiple attempts.
Why is it named a Dead Letter Queue (DLQ)?
The name comes from the postal service, which occasionally encounters mail that is undeliverable, most often because the address is incorrect or illegible.
What should happen to that mail? It is put into a separate bin where it can be dealt with later.
Why are DLQs used?
Standard resilience patterns, like retries, work well for transient failures - the quick network blips or short-term service degradations that resolve themselves. However, when a failure persists past your retry limits or SLA, it has to be treated as permanent, and you need a robust fallback strategy. This is where the DLQ comes in: it captures the message for later action while allowing you to surface a clean error to the user.
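The retry-then-DLQ flow described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `TransientError`, `DeadLetterQueue`, and `process_with_dlq` are hypothetical names, the DLQ is an in-memory list standing in for a real queue or table, and the retry limit and backoff values are assumptions.

```python
import time
from dataclasses import dataclass, field


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a timeout)."""


@dataclass
class DeadLetterQueue:
    """In-memory stand-in for a real DLQ (a broker queue, a table, etc.)."""
    entries: list = field(default_factory=list)

    def put(self, message, error, attempts):
        # Keep enough context to diagnose and replay the message later.
        self.entries.append({
            "message": message,
            "error": repr(error),
            "attempts": attempts,
            "failed_at": time.time(),
        })


def process_with_dlq(message, handler, dlq, max_attempts=3, base_delay=0.0):
    """Return True if handled, False if the message was dead-lettered."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except TransientError as exc:
            if attempt == max_attempts:
                # Retry budget exhausted: treat the failure as permanent.
                dlq.put(message, exc, attempt)
                return False
            # Exponential backoff before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
        except Exception as exc:
            # Assumed policy: anything not marked transient is not retried.
            dlq.put(message, exc, attempt)
            return False
    return False
```

A caller that gets `False` back can return a clean error to the user, knowing the original message is preserved in the DLQ rather than lost.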
In the past, these failures were often just logged. The problem is that logs are easily buried in a flood of other data; if no one is manually auditing them, a failure can go unnoticed indefinitely. Moving these events to a DLQ ensures they aren’t just recorded, but stored in a dedicated location for resolution.
While it’s tempting to make every failure a pageable event, context matters. Nobody wants to be woken up at 2:00 AM because of a user’s malformed input. A DLQ allows for a more nuanced approach to on-call rotations:
Internal Failures: If your own upstream service fails, it may warrant a page.
Third-Party Failures: If an external vendor returns a 500 error or a “payment required” status, your support team often can’t fix it immediately.
Instead of triggering unnecessary alarms, the better strategy is to route these “unprocessable” messages to the DLQ. This provides a central “to-do list” that your team can review and manage at the start of the next business day.
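One way to picture the routing decision above is a small function that always stores the failed message and pages only for internal failures. This is a sketch under assumptions: `FailureSource` and `route_failure` are hypothetical names, plain lists stand in for the DLQ and the paging system, and sending internal failures to the DLQ as well as paging is a design choice made here, not something the pattern requires.

```python
from enum import Enum


class FailureSource(Enum):
    INTERNAL = "internal"        # one of your own services failed
    THIRD_PARTY = "third_party"  # an external vendor returned an error


def route_failure(source, message, dlq, pages):
    """Store the message in the DLQ; page only for internal failures."""
    # Assumption: every unprocessable message is kept so it can be replayed.
    dlq.append(message)
    if source is FailureSource.INTERNAL:
        # Lists stand in for a real paging integration in this sketch.
        pages.append(f"internal failure while processing {message!r}")
        return "paged"
    # Third-party failures wait in the DLQ for the next business day.
    return "queued"
```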
Common mistakes
Never checking the DLQ - A DLQ captures the fact that events have become unprocessable, but that is of little use if nobody reviews it. It is easy to “fire and forget”: messages are passed to the DLQ but never handled.
Keeping failed messages forever - There should be a retention policy that removes events after a set amount of time.
Overusing the DLQ - The DLQ is for messages that cannot be processed; it is not a replacement for log messages that detail poor use of the service.
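The first two mistakes can be guarded against with a retention purge and a simple depth check. The sketch below assumes each DLQ entry is a dict with a timezone-aware `failed_at` timestamp; `purge_expired`, `needs_attention`, and the 14-day window in the usage example are illustrative choices, not recommendations.

```python
from datetime import datetime, timedelta, timezone


def purge_expired(entries, max_age, now=None):
    """Return only the entries that are still within the retention window."""
    now = now or datetime.now(timezone.utc)
    return [e for e in entries if now - e["failed_at"] <= max_age]


def needs_attention(entries, max_depth=0):
    """True when the DLQ holds more messages than the agreed threshold."""
    return len(entries) > max_depth


# Usage: one entry is older than the assumed 14-day retention window.
now = datetime(2024, 1, 15, tzinfo=timezone.utc)
entries = [
    {"id": 1, "failed_at": now - timedelta(days=20)},
    {"id": 2, "failed_at": now - timedelta(days=2)},
]
kept = purge_expired(entries, timedelta(days=14), now=now)
```

Many managed queues expose retention as a configuration setting, in which case no purge code is needed. A check like `needs_attention` can feed a dashboard or a daily report so the DLQ does not go unreviewed.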