Applications running in production, especially critical ones, are expected to run continuously without interruption, even during unplanned outages or disasters. Organizations rely on business continuity and mitigation systems, which usually consist of hosting the services in multiple data centers located in different geographies.

The same applies to cloud applications and databases. Although Azure replicates your SQL databases to at least two other data centers in different geographic locations, the same does not happen for the Azure Service Bus Relay.

After an outage, the data center may come back after a short or an extended delay; the cause might be a technical snag, a power failure, or a loss of internet connection. A disaster, whether man-made or natural (an earthquake or a fire, for example), is different: there is a good chance that it leads to permanent loss of data and services.

It is always advisable to have a backup plan in place, and this holds true for Service Bus as well. The backup plan incurs additional operational expenditure, but it is a must unless the service is not that important.

Service Bus Relay endpoints must be geo-replicated so that a service remains reachable during a Service Bus outage. For geo-replication, the service must create two or more relay endpoints that satisfy three conditions: the endpoints are in different namespaces, the namespaces reside in data centers in different geographical locations, and the endpoints have different names.

For example, you can expose the same service through a primary endpoint and a secondary endpoint in two different geographies. The two endpoints are in different namespaces and have different names.

In such a scenario, the service listens on both endpoints, and the client can invoke the service through either of them.
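The conditions on the relay endpoints can be checked mechanically before deployment. The following is a minimal sketch in Python, not an Azure SDK call: the RelayEndpoint type, its fields, and the geo_replication_problems function are hypothetical names introduced here for illustration, and the namespace, region, and endpoint values in the usage note are placeholders.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class RelayEndpoint:
    """Hypothetical description of one relay endpoint (not an Azure SDK type)."""
    namespace: str  # Service Bus namespace that hosts the endpoint
    region: str     # data center / geography of that namespace
    name: str       # endpoint name


def geo_replication_problems(endpoints):
    """Return a list of violations; an empty list means the set is acceptable."""
    problems = []
    if len(endpoints) < 2:
        problems.append("at least two relay endpoints are required")
    # Every pair must differ in namespace, region, and endpoint name.
    for a, b in combinations(endpoints, 2):
        if a.namespace == b.namespace:
            problems.append(f"{a.name} and {b.name} share namespace {a.namespace}")
        if a.region == b.region:
            problems.append(f"{a.name} and {b.name} share region {a.region}")
        if a.name == b.name:
            problems.append(f"two endpoints are both named {a.name}")
    return problems
```

Calling geo_replication_problems with two endpoints such as RelayEndpoint("ns-a", "region-1", "svc-primary") and RelayEndpoint("ns-b", "region-2", "svc-secondary") returns an empty list, while two endpoints in the same namespace or region produce one message per violated condition.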

How calls to the endpoints work

One of the relays is randomly chosen as the default (primary) endpoint, with the other acting as the backup (secondary) endpoint. The call from the client application goes to the primary endpoint. If the operation fails with an error code indicating that the relay is unavailable, the application creates a new channel to the backup endpoint and resends the same request.

The two endpoints then switch roles: the old primary endpoint becomes the secondary because it is unavailable, and the old secondary becomes the primary. However, if there is an outage in both data centers hosting the relay endpoints, the roles are not switched and the request fails.
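The client-side behavior described in this section can be sketched as follows. This is an illustrative Python model under stated assumptions, not the Service Bus client API: FailoverClient and RelayUnavailableError are hypothetical names, and each sender is assumed to be a callable that opens a new channel to one relay endpoint, sends the request, and raises RelayUnavailableError when that relay cannot be reached.

```python
import random


class RelayUnavailableError(Exception):
    """Hypothetical stand-in for the 'relay unavailable' error code."""


class FailoverClient:
    """Sends to the primary endpoint and falls back to the secondary."""

    def __init__(self, send_a, send_b, rng=random):
        # Each sender is assumed to open a new channel to one endpoint.
        senders = [send_a, send_b]
        rng.shuffle(senders)  # randomly pick which endpoint starts as primary
        self.primary, self.secondary = senders

    def send(self, request):
        try:
            return self.primary(request)
        except RelayUnavailableError:
            pass  # primary relay is down; try the backup below
        # If this call also raises, the error propagates to the caller
        # and the roles stay as they were.
        result = self.secondary(request)
        # The backup answered, so the two endpoints switch roles.
        self.primary, self.secondary = self.secondary, self.primary
        return result
```

After a successful failover, later calls go straight to the endpoint that answered, so the client does not pay for a failed attempt on every request while the original primary is down.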